<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DNSimple</title>
	<atom:link href="http://blog.dnsimple.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.dnsimple.com</link>
	<description></description>
	<lastBuildDate>Wed, 22 May 2013 17:05:09 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Incident report: DNS changes blocked (May 20)</title>
		<link>http://blog.dnsimple.com/incident-report-dns-changes-blocked-may-20/</link>
		<comments>http://blog.dnsimple.com/incident-report-dns-changes-blocked-may-20/#comments</comments>
		<pubDate>Wed, 22 May 2013 17:05:09 +0000</pubDate>
		<dc:creator>Darrin Eden</dc:creator>
				<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231987</guid>
		<description><![CDATA[On May 20th, beginning at about 11:30am Pacific time (18:30 UTC), we experienced a partial service outage. A component failure blocked our customers from updating their DNS records. We apologize for any trouble this caused DNSimple customers. Highlights Detected time: 11:40 PDT Resolved time: 23:43 PDT Severity: All customers wanting to change DNS records. Pros: Our alerting...]]></description>
				<content:encoded><![CDATA[<p>On May 20th, beginning at about 11:30am Pacific time (18:30 UTC), we experienced a partial service outage. A component failure blocked our customers from updating their DNS records. We apologize for any trouble this caused DNSimple customers.</p>
<p><strong>Highlights</strong></p>
<ul>
<li>Detected time: 11:40 PDT</li>
<li>Resolved time: 23:43 PDT</li>
<li>Severity: All customers wanting to change DNS records.</li>
</ul>
<p><strong>Pros: Our alerting system mostly worked.</strong></p>
<p><strong>Cons: Our redirector service was out of sync and we were not alerted directly.</strong></p>
<p><strong>Overview</strong></p>
<p>Long story short: we have a single point of failure around a MySQL database. Our name servers rely on MySQL to replicate changes. At first it appeared that maintenance side effects had caused a MySQL database to fail slowly. In actuality, the physical server hosting the write version of this database was failing slowly. These failures surfaced as high-latency responses and eventually resulted in connection timeouts. When I attempted to reboot the physical machine, it wasn&#8217;t booting normally. Next I migrated the container to our Tokyo data center. I restored the container from a backup database and then synced all the unicast name servers. Unfortunately, I forgot that our redirection service also relies on working MySQL replication. It wasn&#8217;t repaired until after a DNSimple customer reported the issue.</p>
<p><strong>Remediation</strong></p>
<p>We will add additional alerts around our redirector service. We will migrate MySQL to a high availability configuration. Customers may also want to consider migrating to DNSimple&#8217;s <a title="Anycast Signup" href="https://dnsimple.com/anycast">anycast network currently in testing</a>. It does not have <em>this</em> single point of failure.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/incident-report-dns-changes-blocked-may-20/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Application maintenance window (May 15)</title>
		<link>http://blog.dnsimple.com/application-maintenance-window-may-15/</link>
		<comments>http://blog.dnsimple.com/application-maintenance-window-may-15/#comments</comments>
		<pubDate>Mon, 13 May 2013 14:07:34 +0000</pubDate>
		<dc:creator>Darrin Eden</dc:creator>
				<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231981</guid>
		<description><![CDATA[We have scheduled a one hour maintenance window beginning at 6pm Pacific time on Wednesday, May 15 (Thursday, May 16, 2013 at 0100 UTC). This will be used to update network routers at each of DNSimple&#8217;s data centers to increase overall availability. We expect the DNSimple application and redirection service to each be unavailable for less...]]></description>
				<content:encoded><![CDATA[<p>We have scheduled a one-hour maintenance window beginning at 6pm Pacific time on Wednesday, May 15 (Thursday, May 16, 2013 at 0100 UTC). This will be used to update network routers at each of DNSimple&#8217;s data centers to increase overall availability. We expect the DNSimple application and redirection service to each be unavailable for less than five minutes. Name service will not be affected. We apologize for any inconvenience this may cause our customers.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/application-maintenance-window-may-15/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing Advanced Editor v1 end of life</title>
		<link>http://blog.dnsimple.com/announcing-advanced-editor-v1-end-of-life/</link>
		<comments>http://blog.dnsimple.com/announcing-advanced-editor-v1-end-of-life/#comments</comments>
		<pubDate>Fri, 10 May 2013 13:56:58 +0000</pubDate>
		<dc:creator>Simone Carletti</dc:creator>
				<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231976</guid>
		<description><![CDATA[Today we're officially deprecating our advanced editor v1.]]></description>
				<content:encoded><![CDATA[<p>Almost 2 years ago we <a href="http://blog.dnsimple.com/advanced-editor-v-apis/">announced the second version</a> of our <a href="http://support.dnsimple.com/articles/advanced-editor">advanced DNS editor</a>.</p>
<p>At the very beginning, we gave users the option to switch back and forth between versions, to assist in transitioning to the new editor. Later, the new version became the default for all new accounts. Even today there are still a small number of accounts using the old editor.</p>
<p>Today we&#8217;re officially deprecating our advanced editor v1. Customers with the legacy version will be upgraded to the advanced editor v2 on Monday, May 20th, 2013. The support and code for the old editor will be permanently removed on Monday, June 3rd, 2013.</p>
<p>If you&#8217;re using the old editor and you want to upgrade today, please log in and switch to the V2 editor or <a href="http://support.dnsimple.com/contact">contact us</a>. We encourage everyone still using the old editor to upgrade as soon as possible.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/announcing-advanced-editor-v1-end-of-life/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Heroku Service &#8211; Specify Your App Name</title>
		<link>http://blog.dnsimple.com/heroku-service-specify-your-app-name/</link>
		<comments>http://blog.dnsimple.com/heroku-service-specify-your-app-name/#comments</comments>
		<pubDate>Sat, 04 May 2013 17:26:37 +0000</pubDate>
		<dc:creator>Anthony Eden</dc:creator>
				<category><![CDATA[Features]]></category>
		<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231971</guid>
		<description><![CDATA[Today we deployed a change to how the Heroku one-click service works. Heroku recently released their European region, and with that change they deprecated the &#8220;proxy&#8221; subdomain for Heroku apps. To make DNSimple&#8217;s one-click service compatible with this change, we now require your Heroku app name, as it appears in your herokuapp.com subdomain. When you...]]></description>
				<content:encoded><![CDATA[<p>Today we deployed a change to how the Heroku one-click service works. Heroku recently released their European region, and with that change they deprecated the &#8220;proxy&#8221; subdomain for Heroku apps. To make DNSimple&#8217;s one-click service compatible with this change, we now require your Heroku app name, as it appears in your herokuapp.com subdomain. When you add the Heroku service, just enter the app name portion of your herokuapp.com domain and you should get an ALIAS for your apex domain and a CNAME for &#8220;www&#8221;. You can always add or change these later.</p>
<p>If you plan on using SSL with Heroku, then you&#8217;ll need to change the name that the apex record (&#8220;&#8221;) and &#8220;www&#8221; point to, replacing it with the host name that Heroku provides when you set up your SSL certificate.</p>
<p>Finally, make sure to add the custom domain in your Heroku app for both yourdomain.com and www.yourdomain.com. If you have any questions, feel free to email us: <a href="mailto:support@dnsimple.com">support@dnsimple.com</a>.</p>
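<p>As a rough illustration of the two records the one-click service creates, here is a minimal Go sketch. The <code>Record</code> type and <code>herokuRecords</code> function are hypothetical names made up for this example, and the sketch assumes both records point at <code>&lt;app&gt;.herokuapp.com</code>; it is not DNSimple&#8217;s actual implementation or API.</p>

```go
package main

import "fmt"

// Record is a hypothetical, simplified DNS record used only for illustration.
type Record struct {
	Type    string
	Name    string // "" stands for the apex of the domain
	Content string
}

// herokuRecords returns the ALIAS and CNAME records for a Heroku app name.
// Assumption: both records point at <app>.herokuapp.com.
func herokuRecords(app string) []Record {
	target := app + ".herokuapp.com"
	return []Record{
		{Type: "ALIAS", Name: "", Content: target},
		{Type: "CNAME", Name: "www", Content: target},
	}
}

func main() {
	for _, r := range herokuRecords("myapp") {
		fmt.Printf("%s %q -> %s\n", r.Type, r.Name, r.Content)
	}
}
```

<p>If you use SSL, the <code>Content</code> of both records would instead be the host name Heroku gives you for the certificate endpoint.</p>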
]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/heroku-service-specify-your-app-name/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Application maintenance window (April 21st)</title>
		<link>http://blog.dnsimple.com/application-maintenance-window-april-21st/</link>
		<comments>http://blog.dnsimple.com/application-maintenance-window-april-21st/#comments</comments>
		<pubDate>Thu, 18 Apr 2013 16:10:24 +0000</pubDate>
		<dc:creator>Darrin Eden</dc:creator>
				<category><![CDATA[Updates]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231963</guid>
		<description><![CDATA[In order to update several low level components of our software stack we have scheduled a maintenance window beginning Sunday, April 21 at 0400 UTC (Saturday, April 20th at 9PM Pacific time). It will remain open for two hours. We expect the DNSimple application and API to be unavailable for less than twenty minutes. Name service will not...]]></description>
				<content:encoded><![CDATA[<p>In order to update several low-level components of our software stack we have scheduled a maintenance window beginning Sunday, April 21 at 0400 UTC (Saturday, April 20th at 9PM Pacific time). It will remain open for two hours. We expect the DNSimple application and API to be unavailable for less than twenty minutes. Name service will not be affected. We apologize for any inconvenience this may cause our customers. Updates will be posted to our <a title="Twitter feed" href="https://twitter.com/dnsimple">Twitter feed</a>.</p>
<p>Update 1: We will also take the redirector offline for 5 minutes to update low-level components on that system as well. If you are using URL forwarding, it will be offline while the system is unavailable. We apologize for any inconvenience.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/application-maintenance-window-april-21st/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Sporadic ALIAS record resolution failure</title>
		<link>http://blog.dnsimple.com/sporadic-alias-record-resolution-failure/</link>
		<comments>http://blog.dnsimple.com/sporadic-alias-record-resolution-failure/#comments</comments>
		<pubDate>Wed, 10 Apr 2013 18:54:05 +0000</pubDate>
		<dc:creator>Darrin Eden</dc:creator>
				<category><![CDATA[Post-mortem]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231957</guid>
		<description><![CDATA[On Saturday, April 6th, around 7:49 AM UTC NS2 (name server 2) stopped responding to queries. At 10:32 AM UTC, NS3 stopped responding to queries as well. During the outage NS1 and NS4 picked up the additional traffic. This abnormal load caused a portion of ALIAS resolutions to fail &#8212; directly affecting DNSimple customers. We...]]></description>
				<content:encoded><![CDATA[<p>On Saturday, April 6th, around 7:49 AM UTC, NS2 (name server 2) stopped responding to queries. At 10:32 AM UTC, NS3 stopped responding to queries as well. During the outage NS1 and NS4 picked up the additional traffic. This abnormal load caused a portion of ALIAS resolutions to fail, directly affecting DNSimple customers.</p>
<p>We received notification that there were issues resolving queries immediately after the first resolver began failing. However, the person on call did not understand how severe the issue was, so it was not escalated. As this was the middle of the night in North America, the next engineer to observe the problem wasn&#8217;t online until several hours later.</p>
<p>Complex systems fail in interesting ways, generally in the form of several interdependent pieces failing simultaneously. This is not unexpected. What is expected is that, as a team, we have the responsibility, knowledge, and practice to respond quickly and recover in a calm and determined fashion. Unfortunately this incident demonstrated our team wasn&#8217;t communicating around alerts effectively. I apologize for failing to focus on this most important aspect.</p>
<p><strong>Why did it happen?</strong></p>
<p>Preconditions:</p>
<ol>
<li>NS3 moved to our new physical infrastructure recently in response to a hosting provider&#8217;s network change. These hosts have many CPU cores.</li>
<li>NS2 moved to a host with more CPU cores as a side effect of a system upgrade.</li>
<li>We have an open issue with our ALIAS resolution software where, at a relatively low frequency, it blocks indefinitely.</li>
<li>We use a monitoring program (monit) to detect when this condition is nearing and restart that part of the software preemptively. While less than ideal, it will suffice until we permanently solve this issue.</li>
<li>The monitoring software detects this condition as a function of CPU utilization.</li>
</ol>
<p>When we ran our software on hosts with many cores it no longer detected the case we were targeting. I noticed this and made an incorrect assumption about how monit behaved. I made a change to the command logic that I hoped would improve the situation. Unfortunately it silently failed to take any action instead.</p>
<p>When NS2 blocked, it added pressure to the remaining hosts. When NS3 blocked, pressure increased again. NS1 and NS4 continued to respond normally to most queries, but the additional processing requirements of ALIAS records resulted in a portion of those queries being dropped.</p>
<p><strong>How did we respond and recover?</strong></p>
<p>Once Anthony came online and discovered the issue, he began looking into its source. He discovered processes on both NS2 and NS3 using 100% of the CPU, which was blocking the name servers from responding to further requests. He restarted the processes on NS2 and NS3, which allowed the name servers to begin resolving queries again. Once NS2 and NS3 were back online there were no additional failures reported for NS1 and NS4.</p>
<p><strong>How can we prevent similar unexpected issues from occurring again?</strong></p>
<p>I eventually realized monit calculates CPU as a multiple of the number of cores. We now automatically configure monit based on the number of cores for a given host. I also recalled the particulars of how monit executes a command and corrected the incomplete logic appropriately. I have a high degree of confidence in the current monit configuration and behavior appears to have stabilized.</p>
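<p>To illustrate why a fixed CPU threshold stops firing on hosts with many cores, here is a minimal Go sketch. It assumes the monitor reports a process&#8217;s CPU usage as a share of the host&#8217;s total capacity, so one fully busy thread on an 8-core host shows up as roughly 12.5%. The function name and the 90% figure are hypothetical and are not taken from our monit configuration.</p>

```go
package main

import (
	"fmt"
	"runtime"
)

// scaledThreshold converts a threshold expressed as a percentage of ONE core
// into a percentage of the whole host.
// Assumption: the monitor reports process CPU usage as a share of the total
// capacity across all cores.
func scaledThreshold(singleCorePct float64, cores int) float64 {
	if cores < 1 {
		cores = 1
	}
	return singleCorePct / float64(cores)
}

func main() {
	cores := runtime.NumCPU()
	// 90 is an illustrative single-core threshold, not a recommended value.
	fmt.Printf("cores=%d threshold=%.2f%%\n", cores, scaledThreshold(90, cores))
}
```

<p>A configuration management run could compute a value like this per host and write it into the monitoring rule, so the check keeps matching a single stuck thread regardless of core count.</p>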
<p>As a team we have clarified our on-call escalation policy:</p>
<ol>
<li>If an alert is raised &#8211; always respond until it is resolved.</li>
<li>If an alert doesn&#8217;t have a solution linked directly and the fix is not immediately apparent &#8211; escalate the issue to a larger team. Make sure the solution is linked after the fact.</li>
</ol>
<p>Finally, we are actively working on delivering re-architected DNS software that is markedly more resilient. That project is approaching production quality and we hope to write more about it in the near future.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/sporadic-alias-record-resolution-failure/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Advancing DNS Into the 21st Century</title>
		<link>http://blog.dnsimple.com/advancing-dns-into-the-21st-century/</link>
		<comments>http://blog.dnsimple.com/advancing-dns-into-the-21st-century/#comments</comments>
		<pubDate>Mon, 01 Apr 2013 14:55:28 +0000</pubDate>
		<dc:creator>Anthony Eden</dc:creator>
				<category><![CDATA[Updates]]></category>
		<category><![CDATA[advanced technology]]></category>
		<category><![CDATA[shifting paradigms]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231950</guid>
		<description><![CDATA[Modern web applications benefit from fast DNS responses. DNS response times may still add anywhere from 10s to 100s of milliseconds to your application&#8217;s overall latency. DNS providers are constantly working to reduce lookup times using technology such as caching and the use of BGP protocols. Even web browser makers like Google have joined in...]]></description>
				<content:encoded><![CDATA[<p>Modern web applications benefit from fast DNS responses. DNS response times may still add anywhere from 10s to 100s of milliseconds to your application&#8217;s overall latency. DNS providers are constantly working to reduce lookup times using technology such as caching and the use of BGP protocols. Even web browser makers like Google have joined in with fast public resolvers and even improvements in web browsers. You may have already heard of this term: pre-fetching. Well, we&#8217;ve gone one step further&#8230;</p>
<p><strong>Introducing Precog DNS</strong></p>
<p><img class="size-medium wp-image-10286231951" alt="precog_1" src="http://blog.dnsimple.com/wp-content/uploads/2013/04/precog_1-300x151.jpg" width="300" height="151" /></p>
<p>This new patent-pending technology revolutionizes DNS by providing answers to your questions before you even ask them. That&#8217;s right, our precog-powered DNS servers can now provide DNS responses before you ask them! How is this possible? It&#8217;s all thanks to pre-cognition. This breakthrough technology means our servers look up your DNS questions prior to receiving them and then send them to you in parallel to your request. This results in negative response times &#8211; our answers come before your questions. This means that there is not only no latency, you actually get better response times for all of your other services!</p>
<p><strong>Let me Have It!</strong></p>
<p>We&#8217;re working on finalizing the service for public release. We&#8217;ll let you know as soon as it&#8217;s ready. In fact, we already knew that you wanted the service, so there&#8217;s no need to even tell us who you are, we&#8217;ll be in touch shortly.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/advancing-dns-into-the-21st-century/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ns3 IP Address is Changing</title>
		<link>http://blog.dnsimple.com/ns3-ip-address-is-changing/</link>
		<comments>http://blog.dnsimple.com/ns3-ip-address-is-changing/#comments</comments>
		<pubDate>Tue, 19 Mar 2013 16:29:55 +0000</pubDate>
		<dc:creator>Anthony Eden</dc:creator>
				<category><![CDATA[Updates]]></category>
		<category><![CDATA[ip address]]></category>
		<category><![CDATA[ip address change]]></category>
		<category><![CDATA[ns3]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231944</guid>
		<description><![CDATA[Due to a combination of factors, including availability of new hardware and changes from one of our server providers, we will be moving ns3 from its current IP address to a new IP address. The new IP address is:   50.31.225.68 We will be changing our records to reflect this new IP address on 7...]]></description>
				<content:encoded><![CDATA[<p>Due to a combination of factors, including availability of new hardware and changes from one of our server providers, we will be moving ns3 from its current IP address to a new IP address. The new IP address is:</p>
<p style="text-align: left;">  50.31.225.68</p>
<p>We will be changing our records to reflect this new IP address on 7 April 2013. We will continue operating the old IP address until 20 April 2013, at which point the old IP address will stop responding to DNS queries.</p>
<ul>
<li>If you are using our vanity name server feature and your domain is not registered through us then you will need to change your ns3 IP address sometime between 8 April and 19 April to ensure that customers are not attempting to resolve DNS queries against a server which is no longer active.</li>
<li>If you are using vanity name servers but registered with us then we&#8217;ll make the change automatically for you.</li>
<li>If you are using ns1.dnsimple.com &#8211; ns4.dnsimple.com then you should not have to make any changes.</li>
</ul>
<p>Feel free to submit any questions to <a title="Contact DNSimple Support" href="mailto:support@dnsimple.com">support@dnsimple.com</a> &#8211; we&#8217;ll be happy to help.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/ns3-ip-address-is-changing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A Golang Redirection Service</title>
		<link>http://blog.dnsimple.com/a-golang-redirection-service/</link>
		<comments>http://blog.dnsimple.com/a-golang-redirection-service/#comments</comments>
		<pubDate>Wed, 13 Mar 2013 16:26:03 +0000</pubDate>
		<dc:creator>Anthony Eden</dc:creator>
				<category><![CDATA[Learning]]></category>
		<category><![CDATA[clojure]]></category>
		<category><![CDATA[erlang]]></category>
		<category><![CDATA[golang]]></category>
		<category><![CDATA[ruby]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231939</guid>
		<description><![CDATA[One of the services that we provide at DNSimple is URL redirection. You enter a special record in your DNSimple DNS for a domain and when we receive a DNS query for an A name of that record we return the IP address of our redirection service. If the next request is an HTTP request...]]></description>
				<content:encoded><![CDATA[<div>
<p>One of the services that we provide at DNSimple is URL redirection. You enter a special record in your DNSimple DNS for a domain, and when we receive a DNS query for the A record of that name we return the IP address of our redirection service. If the next request is an HTTP request, it will go through our redirection service, which looks up the URL to redirect to, does a bit of URL manipulation to ensure that the path and query string are retained during the redirect, and then returns a 302 response to the client.</p>
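<p>The flow above can be sketched in a few lines of Go. This is a minimal illustration under stated assumptions, not our production redirector: the <code>targets</code> map stands in for the real lookup, the names <code>redirectURL</code> and <code>handler</code> are made up for this example, and any query string on the configured target is simply replaced by the one from the request.</p>

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// targets is a hypothetical in-memory lookup table (host -> redirect target).
var targets = map[string]string{
	"example.com": "https://www.example.org/base",
}

// redirectURL appends the request path and query string to the target URL.
func redirectURL(target, reqPath, rawQuery string) (string, error) {
	u, err := url.Parse(target)
	if err != nil {
		return "", err
	}
	u.Path = strings.TrimSuffix(u.Path, "/") + reqPath
	u.RawQuery = rawQuery // replaces any query on the configured target
	return u.String(), nil
}

// handler looks up the requested host and answers with a 302.
func handler(w http.ResponseWriter, r *http.Request) {
	host := r.Host
	if h, _, err := net.SplitHostPort(r.Host); err == nil {
		host = h // strip an explicit port, if present
	}
	target, ok := targets[host]
	if !ok {
		http.NotFound(w, r)
		return
	}
	dest, err := redirectURL(target, r.URL.Path, r.URL.RawQuery)
	if err != nil {
		http.Error(w, "bad redirect target", http.StatusInternalServerError)
		return
	}
	http.Redirect(w, r, dest, http.StatusFound)
}

func main() {
	// Exercise the handler in-process instead of opening a listening socket.
	rec := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "http://example.com/a/b?x=1", nil)
	handler(rec, req)
	fmt.Println(rec.Code, rec.Header().Get("Location"))
}
```

<p>Keeping the URL manipulation in its own function makes the path and query handling easy to check without any network traffic.</p>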
<p><strong>From Ruby to Clojure</strong></p>
<p>From the beginning of DNSimple we were using a simple Ruby rack application on unicorn to provide the redirection. For several years it has worked very well, but as redirection traffic has grown so has the resource usage of the service. During the last few months we decided it was time to move from Ruby to something else. Initially I rewrote the application in Clojure, which was not too difficult; however, we waited to deploy the service due to other tasks which were higher priority. Recently we attempted to deploy the new Clojure application, but ran into some snags along the way. Most of the issues were likely due to our own deficiencies, but they stopped our progress nonetheless. More importantly, after re-evaluating the code that I had previously written for the Clojure application, I decided that the resulting service code was more difficult to understand than the original Ruby application. This was likely due to our relative inexperience with Clojure when compared to Ruby, but regardless, I could not ignore this fact. Ultimately I decided that I didn&#8217;t really want to put this Clojure service in production. Once it was in production we&#8217;d want to see it through to completion and live with it for a while.</p>
<p><strong>From Clojure to Go</strong></p>
<p>At this point I had been itching to try my hand at Go. Given that this redirection service was a well understood problem it seemed like a good first project to get Go into production. After about 3 hours of work I had a working Go version of the redirector. Translating from the Ruby version was not too difficult in this case because the Ruby version was already fairly procedural. The Go code turned out to be similar in size to the Ruby code, coming in at 184 lines compared to Ruby&#8217;s 119. This is likely due to the simple and singular purpose of the app (which is something to think about in general).</p>
<p><strong>Deploying</strong></p>
<p>With this new redirector in hand and working locally, Darrin and I agreed to take it to production quickly.</p>
</div>
<div>Here&#8217;s what Darrin has to say about deploying and operating the application:</div>
<blockquote>
<div id="file-go-production-LC1">I&#8217;ve found operating Go processes in production to be relatively painless. We use our own flavor of continuous delivery. Opscode&#8217;s <a href="http://www.opscode.com/chef/">Chef</a> deploys our redirector service. When Chef runs, it syncs the source repository from GitHub and compiles a Go binary if there&#8217;s a code change. For process supervision we rely on the <a href="https://github.com/opscode-cookbooks/runit/blob/master/CHANGELOG.md#v100">runit service resource</a>. We use <a href="http://ddollar.github.com/foreman/">Foreman</a> to generate a runit configuration from a Procfile and set several environment variables. Standard output/error from the Go process is logged to a rotating file by <a href="http://smarden.org/runit/">runit</a> and ingested into <a href="https://papertrailapp.com/">Papertrail</a> via <a href="https://github.com/papertrail/remote_syslog">remote_syslog</a>. We also use <a href="http://mmonit.com/monit/">monit</a> to monitor things like HTTP responses and confirm resource utilization is within bounds. Should monit detect trouble, it&#8217;ll restart the process and alert an operator.</div>
</blockquote>
<div>
<p>Bottom line: deployment was a breeze, and we had the application running in production shortly after deciding to do so.</p>
<p><strong>Challenges</strong></p>
<p>To be clear, our first foray into Golang was not without issues. When we first deployed we found that the application memory usage was growing steadily and after a few hours the application would die. Naturally this was pretty disappointing, but given it is the first time we&#8217;ve put Go into production, not too surprising. Initially I had used panic() in a couple of places, which I removed. The application was still dying, so after digging a bit I found that I had used log.Fatal without fully understanding what it does. I assumed it was a log level, but it actually logs the message and then calls os.Exit(1). Whoops.</p>
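<p>For readers new to Go, here is a small sketch of the safer pattern: log the error and hand it back to the caller instead of terminating the process. The standard library&#8217;s <code>log.Fatal</code> family prints the message and then calls <code>os.Exit(1)</code>, so it only belongs where exiting is really intended. The <code>lookup</code> and <code>resolve</code> functions are hypothetical stand-ins, not code from our redirector.</p>

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

var errNotFound = errors.New("no redirect configured")

// lookup is a hypothetical stand-in for the real redirect lookup.
func lookup(host string) (string, error) {
	if host == "example.com" {
		return "https://www.example.org", nil
	}
	return "", errNotFound
}

// resolve logs a failed lookup and reports the failure to the caller.
// Calling log.Fatalf here instead would print the message and then exit the
// whole process on the first bad request.
func resolve(host string) (string, bool) {
	target, err := lookup(host)
	if err != nil {
		log.Printf("lookup failed for %q: %v", host, err)
		return "", false
	}
	return target, true
}

func main() {
	for _, host := range []string{"example.com", "unknown.test"} {
		target, ok := resolve(host)
		fmt.Println(host, target, ok)
	}
}
```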
<p>After fixing these basic issues, the application was still dying. The problem looked to be around opening syslog connections. After removing the syslog code we were still seeing issues. At this point we had narrowed the issue down to too many open file handles. A bit more digging and I found that the issue was around the default timeout values set by the Go HTTP server. The default settings allowed clients to use HTTP keep-alive to hold their connections open indefinitely, essentially resulting in an ever-growing collection of open file handles. Setting the read and write timeouts to a sufficiently low value stopped the problem completely. Since tracking down these issues, our Go code has been rock solid. It&#8217;s been running steadily in production with no issues for the last month. Both memory and CPU usage are at least an order of magnitude lower than the Ruby application.</p>
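<p>A minimal sketch of the timeout fix, using the standard <code>net/http</code> server fields <code>ReadTimeout</code> and <code>WriteTimeout</code>. The 10-second values and the <code>newServer</code> helper are illustrative assumptions; they are not the settings we run in production.</p>

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Illustrative values only; pick timeouts that suit your own traffic.
const (
	readTimeout  = 10 * time.Second
	writeTimeout = 10 * time.Second
)

// newServer builds an http.Server with explicit timeouts. The zero values
// mean "no timeout", which lets idle keep-alive connections pile up.
func newServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:         addr,
		Handler:      h,
		ReadTimeout:  readTimeout,
		WriteTimeout: writeTimeout,
	}
}

func main() {
	srv := newServer(":8080", http.NotFoundHandler())
	fmt.Println("read timeout:", srv.ReadTimeout, "write timeout:", srv.WriteTimeout)
	// Call srv.ListenAndServe() to start serving; omitted so the sketch exits.
}
```

<p>In current Go releases the server also has an <code>IdleTimeout</code> field for keep-alive connections, which falls back to <code>ReadTimeout</code> when left at zero.</p>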
<p><strong>The Bottom Line</strong></p>
<p>Does this mean we&#8217;re switching everything over to Go? Not by a long shot. At DNSimple we use Ruby, Python, Erlang and Go for different purposes. We&#8217;ll likely continue doing this since it&#8217;s both fun and lets us have a wide range of tools at our disposal. I will definitely say that the resource usage of Go in comparison with something like Erlang or Ruby does make it attractive as a tool for certain classes of applications, and that you should give it a try if you haven&#8217;t already. We&#8217;ve already implemented another internal tool in Go as well and will likely use it for more services internally where memory and CPU usage are critical.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/a-golang-redirection-service/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Partial DNS outage</title>
		<link>http://blog.dnsimple.com/partial-dns-outage/</link>
		<comments>http://blog.dnsimple.com/partial-dns-outage/#comments</comments>
		<pubDate>Fri, 08 Mar 2013 23:59:17 +0000</pubDate>
		<dc:creator>Darrin Eden</dc:creator>
				<category><![CDATA[Post-mortem]]></category>

		<guid isPermaLink="false">http://blog.dnsimple.com/?p=10286231909</guid>
		<description><![CDATA[On Friday, March 8th we had a partial DNS outage. Available and well performing DNS is a critical utility for the businesses our customers are running. We take this responsibility seriously. We continue to investigate the reason behind this outage and are working hard to prevent it from reoccurring. I would like to apologize to any of our customers...]]></description>
				<content:encoded><![CDATA[<!-- Start Shareaholic LikeButtonSetTop Automatic --><!-- End Shareaholic LikeButtonSetTop Automatic --><p>On Friday, March 8th we had a partial DNS outage. Available and well-performing DNS is a critical utility for the businesses our customers are running. We take this responsibility seriously. We continue to investigate the reason behind this outage and are working hard to prevent it from reoccurring. I would like to apologize to any of our customers who were affected by this event.</p>
<p><strong>What we experienced</strong></p>
<p>A few minutes past noon Pacific time (20:00 UTC) we began reading Tweets from several customers indicating there might be a problem. Throughout the event we received no alerts from the various third-party monitoring tools we rely on, nor did we receive any alerts from our internal monitoring systems. Since we expect our alerting system to light up if there&#8217;s a DNS issue, we considered it a surprising start to our investigation. We began by taking a high-level view of the system from a network perspective to detect anomalies. The first tool we used was Boundary. In the following graph it seems Google started sending us a higher-than-normal number of questions.</p>
<p><a href="http://blog.dnsimple.com/wp-content/uploads/2013/03/Screen-Shot-2013-03-08-at-3.17.49-PM.png"><img class="alignnone size-medium wp-image-10286231910" alt="Screen Shot 2013-03-08 at 3.17.49 PM" src="http://blog.dnsimple.com/wp-content/uploads/2013/03/Screen-Shot-2013-03-08-at-3.17.49-PM-300x166.png" width="300" height="166" /></a></p>
<p>&nbsp;</p>
<p>We took a deeper look at the traffic through packet captures. Google was almost exclusively asking for TXT records supporting <a title="DMARC" href="http://en.wikipedia.org/wiki/DMARC">DMARC</a> from each of the domains we host, several times over. At this point I&#8217;m speculating that Google activated a new system that may have overloaded our resolvers. The overall traffic pattern seems to be subsiding, but I&#8217;m not certain as to why, or even whether this was the cause of the event.</p>
<p><strong>What some customers experienced</strong></p>
<p>Listening to our customers&#8217; reports, we noticed a trend pointing to a geographically isolated event. Several customers mentioned resolution seemed normal on the East Coast while the West Coast was still having trouble. A majority of reports also indicated that the problem affected New Relic clients more than users of other services. Many reports were triggered by New Relic alerting that it was unable to resolve DNS for the customer&#8217;s application. For what it&#8217;s worth, the DNSimple web application uses New Relic and our own name resolution service. We weren&#8217;t alerted by New Relic that there was a problem with name resolution.</p>
<p><strong>Correlation does not imply causation</strong></p>
<ol>
<li>Our clients on one service, in one geographic area, contributed a majority of the outage reports.</li>
<li>Our distributed, third-party monitoring tools were unable to detect an outage.</li>
<li>We received an anomalous rise in traffic from Google at roughly the same time.</li>
<li>Our name servers are globally distributed and largely isolated, but are unicast-based at the moment.</li>
</ol>
<p>From these observations I would expect either all customers to be affected and the alerting system to detect it, or the traffic pattern to remain normal while some other factor affected a region or service. These practically simultaneous events may be related, but at this point I don&#8217;t understand how one would cause the other.</p>
<p><strong>What we plan to do about it</strong></p>
<p>I&#8217;m always disappointed when customers notice a problem before our monitoring system does. We will be working on another layer of internal monitoring tools with greater geographic diversity to detect issues like this one. We will also investigate additional alerts that are tied more closely to pattern changes in network traffic. Finally, we are investing heavily in software and hardware that will dramatically increase the capacity and capability of our DNS service.</p>
<p><strong>Summary</strong></p>
<p>I am very sorry we weren&#8217;t able to detect and respond to this issue before it affected DNSimple customers. While we continue to research the reasons behind this outage, please know we are working hard to live up to the trust you&#8217;ve put in us to help operate an important part of the internet and your business.</p>
<p><strong>Update</strong></p>
<p>After corresponding with a New Relic engineer, it seems Google DNS may indeed have had a spot of trouble today. Because Google&#8217;s DNS is <a href="http://en.wikipedia.org/wiki/Anycast">anycast</a>-based, I believe this lends credence to a geographically concentrated outage.</p>
<p><strong>Update 2</strong></p>
<p>Our working theory is that a spam botnet is involved in these events. Over the last several days we&#8217;ve seen a few query spikes (see image below) from Google and one from Yahoo, seemingly related to SPF and DMARC (i.e. spam prevention) questions. An important fact I realized following the initial event is that our query rate limiting system was part of the problem. Rate limiting is one defense we employ against a sudden and sustained traffic spike from a given IP address. We automatically throttle inbound questions in an attempt to protect our DNS software from being overwhelmed. If those email networks were being flooded with messages, such that they generated an exceptionally high number of queries, that likely triggered our throttling defense. If the volume was significant enough, many otherwise valid, &#8220;normal&#8221; queries from Google would time out unanswered. Those timeouts would then apparently surface in Google&#8217;s Public DNS and finally on services like New Relic that rely on that DNS service.</p>
<p>As of 1PM Pacific time on March 11th we&#8217;ve whitelisted the range of IP addresses we&#8217;ve seen from Google and Yahoo. We continue to monitor the situation and react as we learn more.</p>
<p><a href="http://blog.dnsimple.com/wp-content/uploads/2013/03/Screen-Shot-2013-03-11-at-1.51.49-PM.png"><img class="alignnone size-medium wp-image-10286231937" alt="Screen Shot 2013-03-11 at 1.51.49 PM" src="http://blog.dnsimple.com/wp-content/uploads/2013/03/Screen-Shot-2013-03-11-at-1.51.49-PM-300x188.png" width="300" height="188" /></a></p>
<div class="shr-publisher-10286231909"></div>]]></content:encoded>
			<wfw:commentRss>http://blog.dnsimple.com/partial-dns-outage/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
