May 13, 2014
The Supermarket project not only replaces the existing Opscode Community Site codebase, but also switches its database from MySQL to PostgreSQL. As such, one of our pre-launch tasks is to devise a data migration script that we can run to import all relevant Community Site data into Supermarket. We’ll also use this script to keep Supermarket up to date with the Community Site during the soft launch period, when the two will run in parallel.
This post details a refactoring in the data migration codebase, which was motivated by wanting a more granular way to diagnose records whose migrations raised an exception. I might also contend that this refactoring is motivated by valuing ease and joy at work. Regardless, the gist is that we take what you might call a Method Object and have it instead implement Enumerable for great good.
At a high level, our data migration looks like this:
records_which_should_be_imported.each do |record|
  importer = SomeImporter.new(record)
  ActiveRecord::Base.transaction do
    begin
      importer.call
    rescue
      raise ActiveRecord::Rollback
    end
  end
end
Where SomeImporter.new(record) attempts to establish the state required to perform an import, and importer.call uses that state to build and save Supermarket’s object graph. We wrap importer.call in an ActiveRecord transaction so that we don’t leave the database in a messy state if the import fails.
It’s worth noting that these importer objects grew out of a similar desire to debug with ease, but are not in practice all that easy to use. If we have a record which causes importer to raise an exception, we need to do some gymnastics to get at the invalid object:
error = nil
begin
  SomeImporter.new(record).call
rescue => e
  error = e
end
error.record # and only if the error is an ActiveRecord error
Given the relatively poor state of the data on the existing Community Site, it quickly becomes desirable to cut to the proverbial chase. In particular, we care about three things when importing a record fails:
Have we imported all of the data upon which this record depends?
For example, we want to skip a cookbook record if we haven’t already imported that cookbook’s owner.
Which records will we attempt to save in Supermarket, if any?
Some importers, such as the Category importer, just save one record in Supermarket. Others, such as the Cookbook importer, save at least two records. Those two importers only create new records in Supermarket; others update existing records.
What happens when we try to save each record in Supermarket?
In the case of Cookbooks failing to import, for example, it’s important to know whether the Cookbook data is invalid or whether its associated CookbookVersion records are invalid.
If we shift our approach so that importer implements Enumerable to iterate over the records it deems need to be saved without saving them, we can answer all of these questions with ease. So, where an importer’s call method may have once looked like this:
def call
  something = Something.new
  something.save!
end
We rename it to each and write it like this:
def each
  yield ::Something.new
end
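For a slightly fuller picture, here is a rough sketch of what such an importer might look like, written with plain Ruby stand-ins so that it runs without Rails. CookbookImporter, FakeRecord, and the shape of the legacy row hash are all hypothetical names made up for illustration; they are not Supermarket’s actual classes.

```ruby
# Stand-in for an ActiveRecord model: it can be built without being saved.
# (Hypothetical; a real importer would build Supermarket models instead.)
FakeRecord = Struct.new(:kind, :name) do
  def valid?
    !name.to_s.empty?
  end
end

# Hypothetical importer: builds the records it wants saved, but never saves them.
class CookbookImporter
  include Enumerable

  def initialize(row)
    @row = row
  end

  # Yields each unsaved record; returns an Enumerator when no block is given.
  def each
    return enum_for(:each) unless block_given?

    # Assumption: a row whose owner is missing yields nothing at all.
    return if @row[:owner].nil?

    yield FakeRecord.new(:cookbook, @row[:name])
    @row.fetch(:versions, []).each do |version|
      yield FakeRecord.new(:cookbook_version, version)
    end
  end
end
```

Because each only yields, the caller decides what to do with the records: save them, validate them, or just look at them.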
Our import now looks like this:
records_which_should_be_imported.each do |record|
  importer = SomeImporter.new(record)
  ActiveRecord::Base.transaction do
    begin
      importer.each(&:save!)
    rescue
      raise ActiveRecord::Rollback
    end
  end
end
From the console, it’s easier to see what we’re trying to import, and once we include Enumerable in each importer class, we’ve got a lot more flexibility with regard to how we can debug failing imports:
importer = SomeImporter.new(record)
first_invalid_record = importer.find { |r| !r.valid? }
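To make that concrete, here is a small sketch of how the three questions from earlier might be answered with nothing but Enumerable methods. FakeModel and ListImporter are hypothetical stand-ins so the example runs outside of Rails, and treating "yields nothing" as "a dependency was missing" is my assumption about how an importer would signal a skip.

```ruby
# Stand-in for an unsaved model with validation errors (hypothetical).
FakeModel = Struct.new(:kind, :name) do
  def errors
    name.to_s.empty? ? ["name can't be blank"] : []
  end

  def valid?
    errors.empty?
  end
end

# Hypothetical importer that yields a fixed list of unsaved records.
class ListImporter
  include Enumerable

  def initialize(records)
    @records = records
  end

  def each(&block)
    @records.each(&block)
  end
end

importer = ListImporter.new([
  FakeModel.new(:cookbook, "nginx"),
  FakeModel.new(:cookbook_version, "")
])

# 1. Were the dependencies present? Here, an importer that yields nothing
#    is assumed to have skipped the row.
skipped = importer.none?

# 2. Which records would we attempt to save?
kinds = importer.map(&:kind)

# 3. What would go wrong for each record?
problems = importer.reject(&:valid?).map { |r| [r.kind, r.errors] }
```

None of these calls touch the database, which is the point: we can poke at an importer as much as we like before anything is saved.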
Using Enumerable#to_a, we can refactor the migration to only open a transaction if there are records to import.
records_which_should_be_imported.each do |record|
  importer = SomeImporter.new(record)
  new_records = importer.to_a
  if new_records.any?
    ActiveRecord::Base.transaction do
      begin
        new_records.each(&:save!)
      rescue
        raise ActiveRecord::Rollback
      end
    end
  end
end
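If you want to convince yourself that the guard really does skip empty transactions, a toy version is enough. In this sketch FakeDatabase is a hypothetical stand-in for ActiveRecord::Base.transaction that only counts how often it is called, and each inner array plays the role of importer.to_a for one legacy record.

```ruby
# Hypothetical stand-in for ActiveRecord::Base.transaction that counts calls.
class FakeDatabase
  attr_reader :transactions_opened

  def initialize
    @transactions_opened = 0
  end

  def transaction
    @transactions_opened += 1
    yield
  end
end

# Each inner array stands in for importer.to_a; two of the four are empty.
record_batches = [[:cookbook, :cookbook_version], [], [:category], []]

db = FakeDatabase.new
saved = []

record_batches.each do |new_records|
  next if new_records.empty? # nothing to import, so no transaction at all

  db.transaction do
    new_records.each { |r| saved << r }
  end
end
```

Four legacy records go in, but only the two that produced something to save ever open a transaction.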
My (admittedly informal) benchmarks indicate that this change results in a 10% speedup. It seems reasonable to attribute it to performing fewer queries inside of each transaction, and to nearly eliminating empty transactions altogether.
Enumerable is one of my favorite features of the Ruby core library, and I’m pleased with how it lends itself here to make a messy task like data migration relatively clean.
This post is a flapjack, which means I originally wrote it for the internal FullStack blog and have republished it here. Any mysterious, unexplained context was probably obvious to the team at the time.