Life is so ironic. 5 days ago, while going through the videos from GopherCon 2016, I watched an interesting talk from Francesc Campoy titled Understanding nil. I could never imagine that, a few days later, I would have been inspired to write a blog post about nil.

The reasons I'm here writing about nils is because a few days ago we discovered a critical security issue in our code, introduced a few hours before by a refactoring. After investigation, it turned out that the reason we introduced that bug was because we didn't properly consider nil as a value.

It is worthwhile to mention that the DNSimple app alone (this is just one component of our infrastructure) has over 93% test coverage. As you may imagine, it's quite a stunning result for a 6 year old web app composed of 3734 files and 146070 lines of code. In addition to our test coverage, we have a consistent peer review of each pull-request, and we use pull-requests for any change to our application. We care about code quality, maintenance, and indeed we care about security as well.

Still, this bug slipped through all of our checks and ended up in production. I want to share with you what I think we could have handled differently and the lesson we learned from it.

It can't be nil

It all started in 2011 with a simple assumption: this field can't be nil. The field is a string database field designed to store a unique token that is used to determine certain user permissions. We created the following migration:

create_table "membership_invitations", force: :cascade do |t|
  # ...
  t.string   "token",      limit: 255
end

add_index "membership_invitations", ["token"], name: "index_membership_invitations_on_token", using: :btree

and we validated the presence and uniqueness of the field.

validates :token, presence: true, uniqueness: true

If I would have written the code today it would probably look very different. But this is back in 2011, and in any case I'll return to possible issues later on in this post.

Then we had code like the following to fetch an invitation, if present, and attach it to a user.

if invitation = MembershipInvitation.find_active_invitation(session[:membership_token])
  invitation.accept(user)
end

And the finder method is defined as:

class MembershipInvitation < ActiveRecord::Base
  def self.find_active_invitation(token)
    where(token: token).first
  end
end

So far so good. The token can't be nil, because we have the validation and because we set it when the invitation is created. If we look at our business logic, it works because if the membership_token is empty the method will return a nil invitation that won't be accepted by the user.

It's all good. Isn't it?

Of course not. There are a number of improvements that could, and probably should, be made to the code above. Some of them are blatantly obvious, others are more subtle.

With the information available to us today, it's easier to look at the code above and see the errors almost immediately. If you were to look back on your own code from 2011, I'm sure you'd find a few instances that make you stop and question your past decisions. Likely the same will be true for today's code five years from now. The best we can do is write our code in such a way that allows for adapting to future best practices, and being mindful to update legacy code once it begins to show its age.

Do you want to know what the issues are here? Keep on reading.

NULL is a value

One thing I learned in all these years of web development, and Ruby on Rails development, is that Ruby on Rails developers have been notoriously bad at database design. For years we have been so dependent on ActiveRecord, that we ignored the most common database design best practices. Sometimes, we even consider them "not the Rails way", and I still very frequently read questions on StackOverflow where users are scared to write a SQL fragment in a where or join condition, because they feel "it's not the Rails way".

To be clear, I am not necessarily blaming Ruby on Rails here. I blame the laziness of the developers (myself included).

Database architects are very well aware that NULL and empty strings are not the same. Likewise, we should treat nil as a value, and effectively plan for it.

What could go wrong, you may be asking. Well, let me explain what happened.

The refactoring

A few weeks ago, after not changing those pieces of code much in 5 years, and while working on our new API v2, we refactored this model. As part of the refactoring, we merged this model into another one that contained the memberships that were accepted. We used the token, along with other flags, to determine if the record was an invitation or not.

As part of the refactoring, several methods were changed, but the finder method and the invitation if-condition above were left unchanged. What happened is easy to spot now that the code is reduced to a few lines.

In the new model, the nil could be an acceptable value, thefore in case the token in the session was empty, the find_active_invitation effectively returned the first record with a nil value, instead of nil.

We had tests. Tests were covering, among other things, the possibility that the token was blank "". Unfortunately, in this particular code we treated nil and empty string as the same thing, even though they are not (as a sidenote: in Go the zero value for a String is an empty string, not nil). Therefore, the bug was not immediately spotted by our tests.

nil is a value

The main lesson we learned from this issue is that nil is a value and we have to be more conscious of that with the cost of sometimes being verbose, explicit, or having code that doesn't completely reflect "the Ruby way" or "the Rails way" (but we already went down this path long ago, introducing extra layers and other changes to facilitate long term maintenance).

So let's take a look at actions we are considering to introduce in the future to properly handle null as a value.

Explicit argument validation

If you don't expect a nil value then fail fast and complain when nil is given.

def self.find_active_invitation(token)
  raise ArgumentError, "Token is blank" if token.blank?
  where(token: token).first
end

I'm pretty sure several Rubyists would be sticking up their noses, labelling this change as Java-oriented and complaining that Ruby is not a typed language. While it may be true, the context where we operate deserves an extra level of security. After all, we are human beings and we write fallible code.

We have a similar safe-check in other critical parts of our application, for example in the billing section where we prevent customers from being billed for empty or negative values, and it worked very well for us.

We also decided to validate other kinds of expectations. Notably, it's quite common to see code like this (just as an example, this is not taken from our codebase):

def self.search_for(string)
  where("email = ?", string)
end

Delegating the string conversion to ActiveRecord, or any underlying mechanism, may lead to potential casting issues. For example, if you expect a String-like value and you pass nil, using nil.to_s will return an empty string. Likewise, to_s will always return a string representation of an object:

nil.to_s
# => ""
%w( foo ).to_s
# => "[\"foo\"]"

User.where('email = ?', Object.new).to_sql
 => "SELECT \"users\".* FROM \"users\" WHERE (email = '--- !ruby/object {}\n')"

This may not be what you want. In a method like the one before, where I expect a string value, we may want to explicitly check whether the argument represents a string using the to_str method instead of to_s.

nil.to_str
NoMethodError: undefined method `to_str' for nil:NilClass
%w( foo ).to_str
NoMethodError: undefined method `to_str' for ["foo"]:Array
"foo".to_str
 => "foo"

We can still satisfy the duck-typing Ruby approach by checking if the input responds to to_str, instead of relyting on a type checking.

def self.search_for(string)
  string.respond_to?(:to_str) or
    raise(ArgumentError, "Expected string `#{string.class}` to quack like a string")
  
  where("email = ?", string.to_str)
end

nil and empty string tests

When writing tests, remember that an empty string and nil are two different values. In Ruby on Rails we often forget this difference as we are used to call blank? or present? that will happily return the same result in both cases.

nil.blank?
# => true
"".blank?
# => true
"  ".blank?
# => true

Whilst this is handy, it can also be very dangerous. When writing tests you should always handle both cases.

it "raises ArgumentError if the input is blank" do
  expect { described_class.find_active_invitation(nil) }.to raise_error(ArgumentError, "Token is blank")
  expect { described_class.find_active_invitation("") }.to raise_error(ArgumentError, "Token is blank")
end

Database validation

One possible mistake that we made years ago was to not add the proper database flags to that table. Specifically, each field that can't be null must be flagged as not-nullable in the database schema.

To define a non-null field, you can pass the null: false option in a migration:

class AddAutoRenewToCertificates < ActiveRecord::Migration
  def change
    add_column :certificates, :auto_renew, :boolean, null: false, default: false
  end
end

A not-null constraint would probably have rung a bell during the code review.

We have been adopting this good design rule consistently in the last few years, and I'm pretty sure most of you did the same. It is also worth revisiting old models and properly migrate them to non-nullable when necessary. In this case, the change_column_null is very useful:

class DeleteDigestFromDomains < ActiveRecord::Migration
  def change
    remove_column :domains, :digest, :string
    remove_column :arpa_domains, :digest, :string
    change_column_null :zones, :digest, false
  end
end

You may be wondering what about the present validation in the model? It is still worth it to add the validation in the model, especially since it will prevent the record from being saved and it will add a friendly error message to the object that you can show back to the user.

It's important to note that there are still a number of possibilities to bypass that validation and store a nil value to the database, unless you tell the database to reject nil values for such field. For example, if you use update_column instead of update_attributes or update_attribute, the update will skip the validations (allowing you to store a nil value for that field).

Database-level constraints are also necessary if you have other code that may make changes to the same database (e.g. a micro-service that talks to the same database), as the writes will likely bypass the validations in your Ruby on Rails application.

Leverage database constraints

I mentioned at the beginning of my post that there were several improvements available to the code we wrote back in 2011. Some of them have been discussed in the previous section but I still want to highlight some database-related improvements as, in my experience, Ruby on Rails developers tend to be reluctanct to delegate some database-tasks to the database.

We've already seen how to prevent null values to be stored to a database column. Another important validation that must always be performed at least at database level is the uniqueness validation. In fact, the ActiveRecord uniqueness validation still has a chance of collision.

The only way to guarantee that a token is effectively unique is to add an unique index, instead of a standard index:

def change
  add_index :membership_invitations, :token, unique: true
end

As for the presence, it doesn't hurt to add both the ActiveRecord validation and the database constraint. ActiveRecord validation will work in most of the cases and it will return a nice error, but in the unlikely event of collision the database will return an error that will prevent writing inconsistent data.

Likewise, database foreign key constraints are a valuable safety net. Let's say our :membership_invitation belongs to a domain_id. Using a foreign key constraint will prevent both accidental nil values and inconsistent values.

This is a recipe for a good, not-null foreign key field:

def change
  # add a not-null field
  add_column :membership_invitations, :domain_id, :integer, null: false
  # add an index to improve querying, no need to be unique
  add_index :membership_invitations, :domain_id
  # add a foreign_key for consistency
  add_foreign_key :membership_invitations, :domains, on_delete: :cascade
end

Legacy code is the most vulnerable code

Before you conclude that all of DNSimple's code looks like that, it is fair to say that almost all of our code follows the best practices described in this article.

The issue here affected a portion of our legacy code. The DNSimple app is a 6 year old code base, and we still have some bits and pieces that haven't been updated in a while. The membership code is one of them: it was introduced in 2011, and it has been touched very few times since then.

Legacy code is traditionally more vulnerable than modern code.

Updating legacy code is a big effort, which also translates to a cost both in terms of time and money (actually, time costs money). In most cases, it's not feasible to stop an entire business and rewrite all the entire legacy code each time we develop/extract/codify a new best practice or convention. In some cases there are also additional complexities to factor in, such as running migrations on tables with millions of records that are constantly updated, such as our primary DNS records table.

However, when security is involved, it's critical to plan legacy code reviews. Over the last several years we've spent an incredible number of hours in refactoring legacy code to improve maintenance and stability, and some of these refactorings led to new architectural decisions like the ones I mentioned in my talk this year at RailsConf. We learned from this mistake, and we'll plan more targeted legacy code reviews in the upcoming weeks.

Conclusion

There are several lessons we learned from this story. Some of the lessons learned are not even mentioned in this article.

We learned that being explicit is preferrable, even if sometimes we pay the cost of expressiveness. We learned that legacy code is more vulnerable code, and reviewing legacy code with more frequency or before a refactoring is important.

We also learned that nil is not irrelevant, null it's a value and we should keep it in mind.