Script what you don't automate (so you can automate it later)

Amelia Aronsohn

My primary function by time spent here at DNSimple is something along the lines of "Automationeer". I rarely do a task. Instead I figure out a way to automate a task so that it can be done consistently. In a business context, there are not a lot of things that are only done once, yet we often look at things as a "one off" if they come up irregularly.

To avoid this, I have a "done and improved" approach to doing work. The idea is to not only do a procedure, but also improve it each time, even if by a single step. Iterating over things means I am not over-investing in pushing automation to the max with every task I take on, but I am still working towards automating myself out of a job. During any new task, I write down everything as I go along, then document it as an Operating Procedure. Then, each time I use an Operating Procedure to do something, I update it. Touching it every time keeps it up to date and makes sure I take the time to improve it.

Let's jump in with an example! I have a database on Heroku that has tables that require pruning. It's a small, hobby-sized database that, after four or five years, has hit a point where it needs a little pruning to stay in its plan.

Ok. Sooo uhh. Step one: I have to use the Heroku tool to log into the database. I do that… mmm somehow. Ok. Now I need to check the schema and get some approximate row counts. I know I have a gnarly query against the Postgres system tables to find this information somewhere. Once I find it, I run it and discover most of my rows are in two tables. Well, I need to see how much I can remove with a reasonable retention… I will run a few SELECT count(*) commands, and off the top of my head pick 90 days to retain, which will remove a few hundred thousand records… ok great, two DELETE FROM table WHERE created_at < '2019-01-05'; statements and I am good to go! That cost about 30 minutes, and I will never worry about the task again… until I have to. If I walked away now, the next person to do this would have to walk through this entire process again. Even more likely, when it comes up again it will be delegated to me, because I have the institutional knowledge.
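The "gnarly query" itself is not shown here, so the following is only a stand-in: a minimal sketch that reads approximate row counts from Postgres's pg_stat_user_tables statistics view. The function name row_count_sql and the file name row_counts.sql are made up for illustration, and the counts are estimates, which is enough to see which tables dominate.

```shell
# Print a query listing approximate row counts per table, largest first.
# n_live_tup is an estimate maintained by Postgres, not an exact count.
row_count_sql() {
  cat <<'SQL'
SELECT relname AS table_name, n_live_tup AS approx_rows
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
SQL
}

# Usage (needs the Heroku CLI and a logged-in user):
#   row_count_sql > row_counts.sql
#   heroku pg:psql --app dnmessaging --file=row_counts.sql
```

Writing the query to a file first means it can live next to the other maintenance scripts instead of in someone's shell history.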

However, while I did that, I took some notes. So I cleaned those up, and it ended up looking like this:

  1. brew install heroku/brew/heroku to install the Heroku CLI
  2. heroku login as my DNOps user
  3. heroku pg:psql --app dnmessaging
  4. Events & notifications are the normal problem tables
  5. SELECT count(*) FROM notifications WHERE processed = 't' AND created_at < '2018-11-05'; to make sure we are deleting enough rows.
  6. SELECT count(*) FROM events WHERE created_at < '2018-11-05'; to make sure we are deleting enough rows.
  7. DELETE FROM notifications WHERE processed = 't' AND created_at < '2018-11-05';
  8. DELETE FROM events WHERE created_at < '2018-11-05';
  9. Re-check all row counts
  10. Drink tea like a boss
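One obvious improvement point in a list like this is the date typed by hand in every query. As a rough sketch, the cutoff could be derived from the retention period instead. The helper names cutoff_date and count_sql are hypothetical, and the date call assumes either GNU date or the BSD date that ships with macOS.

```shell
# Print the date N days ago as YYYY-MM-DD.
# Tries the GNU date syntax first, then falls back to the BSD/macOS form.
cutoff_date() {
  days="$1"
  date -d "${days} days ago" +%Y-%m-%d 2>/dev/null ||
    date -v-"${days}"d +%Y-%m-%d
}

# Print a count query for one table and a retention period in days.
count_sql() {
  printf "SELECT count(*) FROM %s WHERE created_at < '%s';\n" \
    "$1" "$(cutoff_date "$2")"
}

# Usage:
#   count_sql events 90
```

The generated statement can be pasted into the psql session from step 3, so the cutoff always matches the retention period that was chosen.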

When you read a set of instructions like this, it's easier to see the points I could automate or make more efficient. When I was processing how to complete the task, it was harder to see where my repetition or improvement points were. Now I've saved this under our Standard Operating Procedures list for the dnmessaging app. I'm done, and I've improved the procedure over its prior state of "undocumented and undone". I could improve it further now, but I have other, more pressing things to do.

The next time this problem came up, some undetermined time later, I pulled up the SOPs for dnmessaging and saw I had already left myself instructions on how to do it. Skimming those instructions, I saw things I could improve. So in the spirit of Kaizen, it was time to iterate. It looked like I had already figured out most of the work, and with some SQL magic I could make those queries reusable and therefore scriptable.

DELETE FROM notifications WHERE needs_processing = 'f' AND created_at < (now() - interval '90 days');
DELETE FROM events WHERE created_at < (now() - interval '90 days');

We saved this as dnmessaging_prune.sql in a git repo, shared with the team, that stores tools, scripts, and a few verifying commands. My new set of documentation is:

  1. cd .../dnsimple_maintence_scripts/
  2. brew install heroku/brew/heroku to install app
  3. heroku login as my DNOps user
  4. heroku pg:psql --app dnmessaging --file=dnmessaging_prune.sql
  5. Verify the number of rows deleted is sufficient
  6. Drink tea like a boss
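A possible next iteration would be to fold the Heroku steps into a single wrapper. This is only a sketch, not a script from our repo: the app and file names come from the steps above, while the prune and run functions and the DRY_RUN switch are assumptions added for illustration.

```shell
# App and SQL file from the steps above; both can be overridden.
APP="${APP:-dnmessaging}"
SQL_FILE="${SQL_FILE:-dnmessaging_prune.sql}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Check for the Heroku CLI, then run the prune file against the app.
prune() {
  if [ "${DRY_RUN:-0}" != "1" ] && ! command -v heroku >/dev/null 2>&1; then
    echo "heroku CLI not found; run: brew install heroku/brew/heroku" >&2
    return 1
  fi
  run heroku pg:psql --app "$APP" --file="$SQL_FILE"
}

# Usage:
#   DRY_RUN=1 prune   # show the command without touching the database
#   prune             # run it for real
```

The dry run gives the next person a way to see exactly what will happen before anything is deleted. Logging in and verifying the deleted row counts stay manual in this sketch.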

It ended up taking the same 30 minutes as the first time I did this task. But the next time we get alerts, anyone with credentials has a very small amount of work to do to take care of it, and they will likely be able to improve it even further. If the issue starts happening more frequently, we can turn this into a job in Heroku, possibly with some alerting & reporting so that we are aware if growth is speeding up. For now, I'm willing to call this task done and improved here!
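If it does come to alerting, the check itself can be tiny. The sketch below assumes the caller already has a current row count and knows the plan's row limit. The function name under_threshold, the 80% default, and the numbers in the usage example are all made up for illustration.

```shell
# Succeed while rows stay below PERCENT of LIMIT, fail once that level is
# reached. Integer arithmetic only, so it works in any POSIX shell.
under_threshold() {
  rows="$1"
  limit="$2"
  percent="${3:-80}"
  [ $((rows * 100)) -lt $((limit * percent)) ]
}

# Usage (hypothetical numbers):
#   under_threshold 7000 10000 || echo "dnmessaging is close to its row limit"
```

A scheduled job could run a check like this after each prune and only make noise when the database is filling up faster than the retention window clears it.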

This was a basic example, but we have hundreds of them at DNSimple. We store verification and maintenance SOPs in our wiki, and maintenance code & files in a central repo. We put most of our commands into a Thorfile where possible, and use a bin/ directory for bigger shell scripts. Almost all of our maintenance commands are consolidated in a standard manner and can be broken down into a single command with prompts. We try to work most of these scripts into pipelines or job runners. Our application repositories are the same. Right now you can take a brand new machine, install Homebrew, check out a repo, run ./bin/setup, and have a fully working and complete Rails + Sidekiq app running on your laptop for development! If you have a large company that onboards a lot of developers, having a script, Vagrant, or Docker setup that instantly provisions a development environment is indispensable. But on small teams, where dev hours are far more scarce, we can also fall prey to losing hours of dev time to laptop setups.
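For a sense of scale, a bin/setup does not need to be elaborate to be useful. The following is a hypothetical sketch and not our actual script: it assumes a Rails app with a Brewfile and a Gemfile in the repo, and the need and setup function names are invented for the example.

```shell
# Fail with a readable message when a required tool is missing.
need() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing dependency: $1" >&2
    return 1
  }
}

# Install system packages, then gems, then create the development database.
# Each step only runs if the previous one succeeded.
setup() {
  need brew && need bundle || return 1
  brew bundle &&
    bundle install &&
    bin/rails db:setup
}

# Usage, from the root of the repo:
#   setup
```

Even a version this small beats a wiki page of manual steps, and it gives the next person a concrete place to add the step that bit them.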

Think about all the tasks you do in a week; if you spent an extra bit of time here and there to hit "done and improved" instead of just "done", how much time would you save? How much could you delegate down the chain until the task was handled solely by computers? If you want to get your Domain Management closer to "done and improved", I recommend checking us out. With our API and webhooks, there are a lot of workflow improvements to be found!

Amelia Aronsohn

Kaizen junkie, list enthusiast, automation obsessor, unrepentant otaku, constantly impressed by how amazing technology is.