Engineering

Tales of a Chef Workflow: Data Bags

Aaron Kalin on May 18, 2017

One of the many features of Chef is the data bag. Simply put, it lets you store a blob of JSON-based data on a Chef server that is shared across your Chef environments. If you have organization-level data that must be shared rather than unique to each environment, this is an easy system for storing and retrieving it. For this article, my example is the list of our network blocks at DNSimple. We have quite a bit of address space given the amount of hardware we have deployed, and we share this data across various cookbooks so they know which systems are on our network. This comes in handy when, for example, we want firewall rules that allow only traffic from within our own networks.

As I mentioned earlier, a data bag is basically a bucket into which you put blobs of JSON data, each known as a data bag item. For example, we could have a 'dnsimple' data bag with a 'networks' item that is a block of JSON data like this:

{
  "id": "networks",
  "digitalocean": [
    "1.2.3.4/32",
    "5.6.7.0/24"
  ]
}

In this example, the data bag item holds an array of network blocks from DigitalOcean. In a cookbook, we can access this array like so:

networks = data_bag_item('dnsimple', 'networks')
digital_ocean_networks = networks['digitalocean']

At this point, we can loop over the array, pick out an entry from it, and so on. Why wouldn't you store this in an attribute? That is also an option, but remember that unless you actively delete an attribute after it's used, Chef will automatically save that attribute in the node data on the Chef server. This may seem trivial, but if you manage a lot of systems in Chef, it adds to the size of every node data update, and multiplied over many systems, storing this data in attributes can become costly.
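To make the looping step concrete, here is a minimal sketch in plain Ruby. It assumes the data bag item can be read like a hash, as in the snippet above; the `allow_rules_for` helper and the rule text it produces are hypothetical and only illustrate the idea of turning network blocks into firewall rules.

```ruby
# Stand-in for data_bag_item('dnsimple', 'networks'). In a real recipe the
# item comes from the Chef server and is read with the same hash-style keys.
networks = {
  'id' => 'networks',
  'digitalocean' => ['1.2.3.4/32', '5.6.7.0/24']
}

# Hypothetical helper: turn each CIDR block into an illustrative allow rule.
# The rule format is made up and not tied to any particular firewall tool.
def allow_rules_for(cidr_blocks)
  cidr_blocks.map { |cidr| "allow from #{cidr}" }
end

rules = allow_rules_for(networks['digitalocean'])
rules.each { |rule| puts rule }
```

In an actual cookbook, the block passed to `map` (or an `each` loop) would more likely declare a firewall resource per network instead of building strings.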

Another benefit of using data bags is that they are indexed by the Chef server. If you need to look up data in a data bag for use in a knife command script, this is possible:

admins = search(:users, 'groups:sysadmin')
puts 'Sysadmin accounts:'
admins.each do |admin|
  puts admin['username']
end

This searches the users data bag for items whose 'groups' key contains the entry sysadmin, then prints the matching usernames to the console. You can write much more complex scripts, but this should illustrate how easy and valuable indexing can be with data bags.

An alternative

One of the drawbacks to using data bags is that they are not versioned at all. Consider a data bag a blob of shared, globally accessible data, regardless of Chef environment. While this is handy, and the data is indexed, it's not the best choice if you have cookbooks that change behavior based on data. You can track the JSON file for a data bag in version control, and you can live edit the data on the Chef server itself if you like, but you must remember to save it back to version control if you do. This drawback is what led us to an interesting solution: we encoded the data bag item shown above into a shared library method.

We have a shared library cookbook that provides a bunch of internal helper methods we use in a variety of cookbooks, making it an excellent candidate to house our networks data bag item. We simply created a new method inside the library helpers file we already had, which ended up looking like this:

# In a cookbook's libraries/helpers.rb file
module DNSimple
  module Helpers
    def dnsimple_networks
      networks = {
        'digitalocean' => [
          '1.2.3.4/32', # Internal server
          '5.6.7.8/32'  # External server
        ]
      }
      networks.values.flatten.sort
    end
  end
end

class Chef::Recipe; include DNSimple::Helpers; end

The above code makes the dnsimple_networks method available in our recipes. You may notice that the last line of the method returns only the values and throws away the keys, but this is by design. We use the keys to organize the list and keep it readable. This is another advantage of storing this in Ruby versus JSON: commenting and data structure are much easier to control, especially if you keep the data returned by the method consistent. This gives us great flexibility and, very importantly, version control over this living data structure. If you have data that changes fairly frequently, data bags may still be a better answer than what we did. For us, the network information in that data structure stays fairly static, so we can push it through our typical release workflow.

Another drawback to this alternative approach is that this data is no longer indexed by Chef. We could restore that by saving the method's output to a node attribute, which exposes the information in Chef search, and we could even filter which data is exposed there.

# In a recipe file
node.default['dnsimple']['networks'] = dnsimple_networks

The code above can live in a default recipe, which exposes the method's data as Chef node attributes for easy searching.
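The idea of filtering which data is exposed could look something like the sketch below. It is written in plain Ruby so it runs outside Chef. The `networks_by_provider` and `networks_for` methods and the 'example_provider' entry are hypothetical names made up for illustration, not part of our helper or of Chef.

```ruby
# Hypothetical variant of the helper that keeps the provider keys,
# so callers can choose which groups of networks to expose.
def networks_by_provider
  {
    'digitalocean' => ['5.6.7.8/32', '1.2.3.4/32'],
    'example_provider' => ['10.0.0.0/8'] # made-up entry for illustration
  }
end

# Return a sorted, flat list of networks for the requested providers only.
# Unknown provider names are ignored rather than raising an error.
def networks_for(*providers)
  networks_by_provider.values_at(*providers).compact.flatten.sort
end

exposed = networks_for('digitalocean')

# In a recipe, the filtered list could then be saved to the node, e.g.:
#   node.default['dnsimple']['networks'] = exposed
```

Only the networks selected this way would end up in the node data, which also keeps the saved attribute smaller.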

Conclusion

Data bags aren't always the best fit for storing arbitrary data in Chef. I'm also not the first to write about this particular technique. Data bags are one of many useful utilities in Chef, and since they can be used locally without a Chef server as of Chef 12, they are even more useful than before. However, be wary that they are not versioned, which can make them dangerous for data-driven cookbooks. You may find that a library method like ours is a solid option that gives you more visible control over the data and allows for things you normally cannot do with a data bag. For example, we have since modified the method above to accept additional networks as a parameter for local testing purposes. The sky is the limit with this feature, which can be both a good and a bad thing.
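As an illustration of the parameterized helper mentioned above, here is a minimal sketch. The `additional` parameter name, the `uniq` call, and the sample test network are assumptions for illustration; the actual implementation may differ.

```ruby
module DNSimple
  module Helpers
    # `additional` is a hypothetical parameter: extra CIDR blocks to merge
    # into the built-in list, e.g. a local test network.
    def dnsimple_networks(additional = [])
      networks = {
        'digitalocean' => [
          '1.2.3.4/32', # Internal server
          '5.6.7.8/32'  # External server
        ]
      }
      # Array() accepts a single string or an array; uniq drops duplicates.
      (networks.values.flatten + Array(additional)).uniq.sort
    end
  end
end

# Outside of Chef, mix the helper into a plain object to try it out.
helper = Object.new.extend(DNSimple::Helpers)
default_list = helper.dnsimple_networks
test_list = helper.dnsimple_networks(['192.168.33.0/24'])
```

Because the parameter defaults to an empty array, existing recipes that call dnsimple_networks with no arguments keep working unchanged.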


