Building and operating distributed systems is difficult. At DNSimple, we operate systems across five data centers as well as a myriad of systems running outside of our primary network. Our software is written in multiple languages, including Ruby, Go and Erlang. There are many challenges to running systems this way, but there are also advantages. In some cases speed to market is more important than specific low-level performance, while in other cases we need to get the highest level of performance possible out of a system.
In this post I will describe what we consider when building and operating our systems and the tools we use to address each of those facets.
The Pieces of the Puzzle
If you operate any long-running service, you need to monitor each service and its dependent services. At minimum you need to know if a service is no longer responding to requests, but you may also want to monitor other things about the service, such as exceeding certain limits set for level-of-service.
Any time something goes wrong you'll need to have a way to see what happened at and around the time the incident started. Perhaps even further back if a system failure occurs due to gradual degradation. It is critical that you aggregate your logs in a place where all developers can observe their current output and search back through history.
Services may perform poorly or return unexpected responses. To understand how your systems behave you'll need to be able to look at their behavior over time. For this, metrics gathering and analysis tools are necessary.
As the number of distributed services you operate grows, you'll need to document the services in a high-level fashion. Most documentation tools for software only document the software API and beyond that it's common to create a high-level README for the software itself, but in a complex distributed system it is the interactions between these systems that is so critical.
Each system connected to the Internet provides one more point of entry for people with malicious intentions. At minimum it is important to consider the appropriate level of encryption, authentication and access control for inter-service communication as well as interfaces for administrative control.
Any service that has a persistent data store requires regular backups. Having backups in place is only part of the process, you need to verify that the data is actually recoverable, which means having some sort of plan for regularly testing backup recovery.
Sometimes service state goes awry and the service just needs to be stopped and started with fresh state. In a distributed envrionment you need to have mechanisms in place to easily control each of the services you have deployed. Additionally, this type of service control often occurs under duress, so it is important to have control tools documented and to minimize the number of steps required to execute a control sequence.
You've got to get your services out to all of your systems, and for that you'll need solid deploy tools. In a heterogenous environment the deployment mechanisms will often vary significantly, thus it is important that deployment processes are developed and documented as part of any service development.
How We Do It
Now that I have outlined the pieces of the puzzle, here is information specific to how we implement the pieces of the puzzle today.
At DNSimple we monitor services both internally from the node as well as externally for services that outward facing. We use a variety of tools to do this, including:
These tools are then connected with our chat system HipChat and our notification system PagerDuty. All alerts show up in our main HipChat room, and certain alerts trigger an incident in PagerDuty, which then notifies whoever is on call. This combination of traditional paging and notification plus in-chat notification has the benefit of making any sort of incident visible to the entire team, not just whomever is on call.
We rely on two services for logging: Papertrail and LogEntries. Both systems have their strengths and weaknesses. We use Papertrail as the short-term aggregator of all of our system and application logs. LogEntries on the other hand is only used for our primary Rails application, and the data is kept indefinitely.
For our Rails application metrics we currently rely on NewRelic. For everything else we send data to Librato. In some cases our applications send data directly to Librato, often as part of a scheduled job. In other cases, our services publish metrics via an internal API and we have a small service that queries the metrics API, filters the data a bit, and then sends it off to Librato. One of the benefits of publishing metrics via an API is that it lets different services consume those metrics and do different things with them, as opposed to relying upon only a single service for the metrics.
We document almost everything in Github wikis. Wikis take quite a bit of work to maintain as they require constant pruning and shaping, but having everything in the same location where we build our projects and track issues around them is quite nice.
We attempt to minimize the number of permanent data stores to limit the amount of backups we track. Backups are automatically created and stored in Amazon S3 on a regular schedule. We also run multiple Postgres followers, which are available in case our master database fails.
Services are currently controlled with runit, which is often monitered on the node by monit. To control services anyone with SSH access may shell in and control from the node directly. We are now experimenting with a new tool developed in-house that provides remote control functionality, however it is still only deployed in a limited fashion. This tool provides an HTTP interface for executing specific scripts on nodes, allowing it to be controlled from a central location - in our case, our chat room.
All systems are configured and deployed using chef. We run an internal chef server and many of our machines have their chef client running on regular intervals. Some machines only run chef client on demand, and in these cases we currently trigger a chef client run manually from our chat room.
As long as we're in business we will be operating systems. We continue to look for new ways to improve our operational environment. Here are some of the things we're looking at for the future:
The first three items in this list are all designed to make it easier to deploy entire systems. Docker helps build portable containers which are designed to be used in both local development and in production systems. Deis and Flynn both provide platforms for deploying and operating services in a quick fashion. Consul is a system for service registration and discovery as well as for manging distributed configuration information.
Our goal is to create reliable systems that are still easy to maintain and enhance. With the tools above we have a start, but we continue to look for ways to improve and advance our entire operations.