Move Rows from MySQL to Hadoop with Pipefish

Have you ever wanted to send your MySQL query results directly to a file in HDFS? Well, now you can with Pipefish.

Why does this exist?

Our Analytics team needs daily table snapshots from our production databases. Not such a big deal, except that the analysts use tools like Impala and Hive to crunch their numbers. That means we need to ship these rows from MySQL into the Hadoop filesystem (HDFS).

First pass

My first solution was to simply script the export and import using the mysql and hive CLI tools. That worked out to three distinct steps:

  1. Read from MySQL, write to a tmp file.
  2. Read from the tmp file, write to HDFS.
  3. Remove the tmp file.

This was OK for a few small tables, but both the size (>50 GB) and the number (dozens) of tables grew. Two problems quickly cropped up:

  • Writing so much tmp data often filled up the filesystem.
  • Intermediate storage in tmp files slowed the process significantly.

Second pass

Decomposing the tables into smaller, bite-size chunks offered a way to smooth out the tmp file problem, but it brought other problems:

  • The process needed a monotonically increasing primary key with no gaps. Why?
    • A primary key for fast selection of a chunk of rows.
    • No (or few) gaps so that each chunk was about the same size.
  • Some place to track what chunk we were on.
  • No point-in-time consistency. The script can't lock the source table indefinitely so we trade off a little accuracy.
  • Disk intermediation still not solved.

Third pass

What next? Could I invent a way to disintermediate the tmp file? Could I do it in a way that was fun and interesting? Yes, because:

  • Both Hive and Impala read delimited text files.
  • I like writing in C. This would be the fun part.
  • I knew MySQL had well-documented client libraries in C, but what about Hadoop? Wasn't it all Java?
  • After a bit of research, I found a C API for HDFS: libhdfs.

Less than 200 lines of C later, I had my utility:

  $ pf --defaults_file=~/passwd --hdfs_path="/user/hive/warehouse/my.db/table.txt" --overwrite --sql="select * from my.table"

The passwd file is formatted as a MySQL config file:

  $ cat ~/passwd
  host     = my_production_server
  user     = copy_user
  password = showmesomethinggood
  database = production

The --overwrite flag does what you might expect. Without the flag, the utility attempts to append, so a file must already exist.

The --sql parameter takes a valid SQL query. It need not be simply select * from... Use this flexibility to transform data as you like.

The usage in summary:

  $ pf --help
  Using JAVA_HOME=/usr/lib/jvm/java-6-openjdk-amd64
   --db='db name'
   --sql='sql statement'
  user, host, db, and password settings override those specified in defaults_file.

The JNI and libhdfs

The HDFS C library uses JNI, so there must be a working JAVA_HOME. The pf command is actually a bash script wrapper that:

  1. Tries to find JAVA_HOME,
  2. Builds up a very long Java CLASSPATH,
  3. Executes the pipefish binary.

To do

  1. Handle HDFS write failures more gracefully.
  2. Implement some sense of rollback on failure.
  3. Optionally partition tables.
  4. Option for different delimiters.
  5. Winnow the CLASSPATH.

So why pipefish? I enjoy husbanding marine creatures, and I like fish-based names for software. The pipefish just seemed appropriate.

- Joseph (Beau) Rothrock


Testing Rack-based APIs with Cucumber and RSpec

Yesterday, I mentioned that we’d recently released lookout-rack-test. Today I’m going to talk about how to use it for testing Rack-based JSON APIs. One of the things we like to use this gem for is writing Cucumber scenarios as our API specification, making sure that everybody agrees on what an API looks like before we start implementing.

Specifying API behavior with Cucumber

Before using Lookout::Rack::Test::Cucumber, you need to tell it where to find your app:

  Lookout::Rack::Test.app = TestApp
  require 'lookout/rack/test/cucumber'

This sets us up to hit your app’s routes, because we’ve included them in our World object.

The heart of Lookout::Rack::Test::Cucumber is its request step definitions - you can GET/PUT/POST/DELETE a route, of course:

  Scenario: I should be able to GET a route, and check the response status
    When I GET "/"
    Then the response should be 200

but we’ve also defined steps that let you do a request with a JSON body or a table of parameters:

  Scenario: POST with JSON body
    When I POST to "/json" with the JSON:
        "some_key" : "some_value",
        "another_key" : "another_value"
    Then the response should be 201

  Scenario: POST with params
    When I POST to "/" with:
      | Name       | Value      |
      | this_param | this_value |
      | that_param | that_value |
    Then the response should be 201

Since we’re talking about testing APIs, it’s likely the response body will be JSON, so we have a step for that as well:

  Scenario: Response body containing JSON
    When I GET "/json"
    Then the response should be 200
    And I should receive the JSON:
      { "key" : 1 }

And for redirects:

  Scenario: Redirect response
    When I GET "/redirect"
    Then I should be redirected to "/"

Finally, and perhaps most importantly, we have a module method template_vars that points to the instance variable of the same name. We use template_vars to fill Liquid templates in params, URIs, and JSON. As such, we can do things like have a method that factories up some data, and refer to that data in our tests. In the following test, the step I have a user might factory up a user and put its id and state values into template_vars['user_id'] and template_vars['user_state'], respectively:

  Scenario: Liquid templates
    Given I have a user
    When I GET "/users/{{user_id}}"
    Then the response should be 200
    And I should receive the JSON:
        "id" : {{user_id}},
        "state" : "{{user_state}}"

You can also combine this with capture groups in your step definitions. For instance, suppose that we want our /users/:user_id route not to return users with the archived state.

  When /^I have a user with the (.*) state$/ do |state|
    user = FactoryGirl.create :user, :state => state
    template_vars['user_id'] = user.id
    template_vars['user_state'] = user.state
  end

would allow multiple tests relating to the user’s state:

  Scenario: New user
    Given I have a user with the new state
    When I GET "/users/{{user_id}}"
    Then the response should be 200
    And I should receive the JSON:
        "id" : {{user_id}},
        "state" : "{{user_state}}"

  Scenario: Archived
    Given I have a user with the archived state
    When I GET "/users/{{user_id}}"
    Then the response should be 404

Deeper API testing with RSpec

Like with Cucumber, the first thing you want to do is set up the application and models:

  require 'lookout/rack/test'
  require 'lookout/rack/test/rspec'
  include Lookout::Rack::Test::RSpec

  setup_routes TestApp
  setup_models Models

setup_models assumes that the Models object passed to it has a .setup and .unsetup method, which will be run around your test examples marked :speed => :slow:

  describe 'TestModels', :speed => :slow do
    context 'A model with an id' do
      subject { }

      it { should respond_to :id }
    end
  end

These examples will have factories loaded for them, and will setup and teardown the models for each example.

setup_routes simply requires that it be passed a class that defines a Rack application, something it can call .new on. Having done that, you can use the usual Rack::Test::Methods in tests of :type => :route:

  describe '/api/public/v1/my_route', :type => :route do
    describe 'GET' do
      subject(:response) { get "/api/public/v1/my_route" }

      its(:status) { should be 200 }
      its(:body) { should_not be_empty }

      it 'should do something else' do
        expect(Model.count).not_to be 0
      end
    end
  end

Tagging different sets of examples with :speed => :slow or :type => :route makes it much easier to have some RSpec examples which run very fast, and others which perform slower, more integration-level testing.

Hopefully you find lookout-rack-test useful in testing and documenting your own Rack-based JSON APIs!

- Ian Smith


Lookout::Rack::Test and Lookout::Rack::Utils

At Lookout, we build a lot of Rack applications - some on Rails, some on Sinatra. One of the nice things about Sinatra is that it is much simpler and lighter-weight than Rails. There’s more you have to build/include for yourself, yes, but it cuts down on the magic happening just out of sight and gives you more control. This of course has its downsides; sometimes you do want all of the things Rails gives you.

We recently released two gems containing some of the helper code we’ve written for our Sinatra apps: lookout-rack-utils and lookout-rack-test.

lookout-rack-utils consists of various utilities having to do with logging (to a file, or to logstash via our rack-requestash gem), Graphite tracking, i18n, and request handling.

lookout-rack-test contains code useful for testing Rack apps with RSpec and Cucumber, examples of which can be found in the repo’s spec/ and features/ directories.

Both gems are under the MIT License; forks and pull requests are welcome.

Share and enjoy!

- Ian Smith
