Handling Paginated Resources in Ruby

The thing with paginated data is we can't get it all at once.

Let's say we're using the Trello API. A number of Trello endpoints return paginated data sets, such as boards, lists, cards, and actions (like comments, copies, moves, etc.).

If we're querying for Trello cards marked as completed each month since last January, for example, we may need to request several pages of "cards".

In most cases, Trello will provide a default limit, typically 50, on the number of resources returned in a single request. But what if you need more than that? In this post, we'll examine a few ways to collect paginated results in Ruby.

Trello World

The Trello developer docs provide a quickstart in JavaScript. Here's the unofficial Ruby version.

While logged into your Trello account (you'll need one first), retrieve your app key. We won't need the "secret" for this article.

Next, you'll generate an app token. Paste the following URL into your browser with your app key substituted for the placeholder.

https://trello.com/1/authorize?expiration=never&scope=read,write,account&response_type=token&name=Trello%20World&key=YOUR_KEY

Now that we have an app key and token, we can make authenticated requests to the Trello API. As a quick test, paste the following URL, with your own key and token as parameters, into your web browser (or use curl) to read your member data.

https://api.trello.com/1/members/me?key=YOUR_KEY&token=YOUR_TOKEN

You should see a JSON response with attributes like your Trello id, username, bio, etc.

Script mode

Now let's fetch some paginated data in Ruby. For the following examples, we'll be using Ruby 2.2.

To make HTTP requests, we'll also use the http.rb gem, but feel free to substitute your HTTP client of choice. Install the gem yourself with gem install http or add it to your Gemfile:

# Gemfile

gem "http"

To make things easier for the remainder, add the key and token as environment variables in your shell. For Mac/Linux users, something like this will work:

# command line
export TRELLO_APP_KEY=your-key
export TRELLO_APP_TOKEN=your-token

Now, let's run the Ruby version of our Trello World test.

# trello_eager.rb
require "http"

def app_key
  ENV.fetch("TRELLO_APP_KEY")
end

def app_token
  ENV.fetch("TRELLO_APP_TOKEN")
end

url = "https://api.trello.com/1/members/me?key=#{app_key}&token=#{app_token}"
puts HTTP.get(url).parse

If it worked correctly, you should see the same result we saw in your browser earlier.
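A quick aside on why the helpers use ENV.fetch rather than ENV[]: fetch fails loudly when a variable is missing instead of quietly returning nil. The sketch below uses a made-up variable name (TRELLO_DEMO_UNSET_VARIABLE) that we assume is not set in your shell.

```ruby
# Hypothetical variable name, used only for this illustration.
missing = "TRELLO_DEMO_UNSET_VARIABLE"
ENV.delete(missing) # make sure it really is unset

# ENV[] returns nil for an unset variable, which would silently end up in the URL.
with_brackets = ENV[missing]

# ENV.fetch accepts a default value as a second argument...
with_default = ENV.fetch(missing, "fallback")

# ...and raises KeyError when no default is given.
error = begin
  ENV.fetch(missing)
rescue KeyError => e
  e
end

puts with_brackets.inspect
puts with_default
puts error.class
```

Failing early here is friendlier than sending a request with an empty key and puzzling over the API's error response.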

Let's extract a few helpers to build the URL. We'll use Addressable::URI, which is available as a dependency of the http.rb gem as of version 1.0.0.pre1. Otherwise, install it with gem install addressable or add gem "addressable" to your Gemfile:

require "http"
require "addressable/uri"

def app_key
  ENV.fetch("TRELLO_APP_KEY")
end

def app_token
  ENV.fetch("TRELLO_APP_TOKEN")
end

def trello_url(path, params = {})
  auth_params = { key: app_key, token: app_token }

  Addressable::URI.new({
    scheme: "https",
    host: "api.trello.com",
    path: File.join("1", path),
    query_values: auth_params.merge(params)
  })
end

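# Accept query params so callers (like paginated_get below) can pass
# :before, :limit, and filters through to the URL.
def get(path, params = {})
  HTTP.get(trello_url(path, params)).parse
end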
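To see what a helper like trello_url produces without installing anything, here is a rough equivalent built on Ruby's standard uri library instead of Addressable. The method name build_trello_url and the demo credentials are made up for illustration; this is a sketch of the same idea, not the code used in the rest of the post.

```ruby
require "uri"

# Hypothetical stand-in for trello_url, using only the standard library.
# The auth hash holds placeholder credentials, not real ones.
def build_trello_url(path, params = {}, auth = { key: "demo-key", token: "demo-token" })
  query = URI.encode_www_form(auth.merge(params))

  URI::HTTPS.build(
    host: "api.trello.com",
    path: "/" + File.join("1", path), # the API version prefixes every path
    query: query
  )
end

url = build_trello_url("members/me/actions", { filter: "commentCard", limit: 50 })
puts url
```

The auth params come first and any extra params are merged in after them, which mirrors the auth_params.merge(params) call above.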

Let's Paginate

Now we'll add an alternative method to #get that can handle pagination.

MAX = 1000

def paginated_get(path, options = {})
  params  = options.dup
  before  = nil
  max     = params.delete(:max) { MAX }
  limit   = params.delete(:limit) { 50 }
  results = []

  loop do
    data = get(path, { before: before, limit: limit }.merge(params))

    results += data

    break if (data.empty? || results.length >= max)

    before = data.last["id"]
  end

  results
end

Given a path and hash of parameter options, we'll build up an array of results by fetching the endpoint and requesting the next set of 50 before the last id of the previous set. Once either the max is reached or no more results are returned from the API, we'll exit the loop.
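To watch the loop work without hitting the network, we can swap the API for an in-memory list. In this sketch, fake_get and collect_pages are hypothetical names, and the fake data assumes records come back newest first with :before meaning "older than this id".

```ruby
# Hypothetical stand-in for the API: returns up to `limit` records
# whose id is lower than `before` (or the newest records when before is nil).
def fake_get(records, before: nil, limit: 50)
  pool = before ? records.select { |r| r["id"] < before } : records
  pool.first(limit)
end

# Same loop shape as paginated_get, but it also counts the requests made.
def collect_pages(records, max: 1000, limit: 50)
  results  = []
  before   = nil
  requests = 0

  loop do
    data = fake_get(records, before: before, limit: limit)
    requests += 1
    results += data

    break if data.empty? || results.length >= max

    before = data.last["id"]
  end

  [results, requests]
end

# 120 fake records with ids 120 down to 1.
records = (1..120).map { |n| { "id" => n } }.reverse

results, requests = collect_pages(records)
puts "#{results.length} records in #{requests} requests"

capped, capped_requests = collect_pages(records, max: 100)
puts "#{capped.length} records in #{capped_requests} requests"
```

Notice that the uncapped run needs one extra request that comes back empty before it knows it has reached the end.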

As a starting point, this works nicely. We can simply use paginated_get to collect up to 1000 results for a given resource without the caller caring about pages. Here's how we can grab all the comments we've added to Trello cards:

def comments(params = {})
  # Merge caller-supplied params (e.g. :limit or :max) instead of ignoring them.
  paginated_get("members/me/actions", { filter: "commentCard" }.merge(params))
end

comments
#=> [{"id"=>"abcd", "idMemberCreator"=>"wxyz", "data"=> {...} ...}, ...]

The main problem with this approach is that it forces the results to be eagerly loaded. Unless a smaller max is specified in the method call, we could be waiting on up to 20 requests of 50 comments each (1000 comments in total) before any results are returned.

Stop, enumerate, and listen

The next step is to refactor our paginated_get method to take advantage of Ruby's Enumerator. I previously described Enumerator and showed how it can be used to generate infinite sequences in Ruby, including Pascal's Triangle.

The main advantage of using Enumerator will be to give callers flexibility to work with the results including filtering, searching, and lazy enumeration.

# trello_enumerator.rb

def paginated_get(path, options = {})
  Enumerator.new do |y|
    params  = options.dup
    before  = nil
    total   = 0
    limit   = params.delete(:limit) { 50 }

    loop do
      data = get(path, { before: before, limit: limit }.merge(params))
      total += data.length

      data.each do |element|
        y.yield element
      end

      break if (data.empty? || total >= MAX)

      before = data.last["id"]
    end
  end
end

We've got a few similarities with our first implementation. We still loop over repeated requests for successive pages until either the max is reached or no data is returned from the API. There are a few big differences though.

First, you'll notice we've wrapped our expression in Enumerator which will serve as the return value for #paginated_get.

Using an enumerator may look strange but it offers a huge advantage over our first iteration. Enumerators allow callers to interact with data as it is generated. Conceptually, the enumerator represents the algorithm for retrieving or generating data in Enumerable form.

Enumerator includes the Enumerable module, which means we can call familiar methods like #map, #select, #take, and so on.

Instead of building up an internal array of results, enumerators provide a mechanism for yielding each element even though a block may not be given to the method (how mind blowing is that?).
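Here's a tiny sketch of that idea, separate from the Trello code. The log array is only there so we can see when the enumerator's block actually runs.

```ruby
log = []

# The block describes how values are produced, but nothing runs
# until a caller asks for an element.
numbers = Enumerator.new do |y|
  log << "started"
  y.yield 1
  log << "between"
  y.yield 2
end

log_before = log.dup     # nothing has executed yet
first      = numbers.next # runs the block up to the first y.yield
log_after  = log.dup
second     = numbers.next # resumes the block where it paused

# Enumerable methods work too; this runs the block again from the top.
tens = numbers.map { |n| n * 10 }

puts log_before.inspect
puts log_after.inspect
puts [first, second].inspect
puts tens.inspect
```

The block pauses at each y.yield and only resumes when the caller wants another value, which is what lets us stop short of fetching every page.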

Now we can use enumerator chains to do something like the following, where we request comment data lazily, transform the API hash to comment text, and select the first two addressed to a colleague.

comments.lazy.
  map { |a| a["data"]["text"] }.
  select { |t| t.start_with?("@personIWorkWith") }.
  take(2).force

We may not need to load all 1000 results because the enumerator chain is evaluated for each item as it is yielded. This technique provides the caller with a great deal of flexibility. Eager loading can be delayed or avoided altogether, which is a potential performance gain.
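To make the savings concrete, here's a sketch that counts "requests" against a fake, in-memory source of 20 pages. The pages array and the counter are assumptions standing in for real HTTP calls.

```ruby
# Hypothetical source: 20 pages of 50 strings, "comment 0" through "comment 999".
pages = Array.new(20) { |page| Array.new(50) { |i| "comment #{page * 50 + i}" } }

request_count = 0
comments = Enumerator.new do |y|
  pages.each do |page|
    request_count += 1 # stands in for one HTTP request
    page.each { |text| y.yield text }
  end
end

# Lazy chain: stops pulling values as soon as three matches are found.
matches = comments.lazy.select { |t| t.end_with?("7") }.first(3)
lazy_requests = request_count

# Eager chain: select walks every page before first(3) is applied.
comments.select { |t| t.end_with?("7") }.first(3)
eager_requests = request_count - lazy_requests

puts matches.inspect
puts "lazy: #{lazy_requests} request(s), eager: #{eager_requests} request(s)"
```

All three matches live on the first page, so the lazy chain gets away with a single request while the eager one makes all twenty.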

Here are the magic lines from #paginated_get:

data.each do |element|
  y.yield element
end

The y.yield is not the keyword yield, but an invocation of the #yield method of Enumerator::Yielder, an object the enumerator uses internally to pass values through to the first block used in the enumerator chain. For a more detailed look at how enumerators work under the hood, read more about how Ruby works hard so you can be lazy.
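A small side-by-side sketch of the two kinds of yield. The method and variable names here are made up for the example.

```ruby
# Keyword yield: hands a value to the block given to this method.
def each_letter
  yield :a
  yield :b
end

from_keyword = []
each_letter { |letter| from_keyword << letter }

# Method yield: a regular method call on the Enumerator::Yielder object.
yielder_class = nil
letters = Enumerator.new do |yielder|
  yielder_class = yielder.class
  yielder.yield :a
  yielder << :b << :c # << also passes a value along and can be chained
end

from_yielder = letters.to_a

puts from_keyword.inspect
puts from_yielder.inspect
puts yielder_class
```

Because the yielder is just an object, it can be passed around or called from inside nested blocks, which is exactly what data.each does in #paginated_get.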

A cursor-y example

Let's do one more iteration on our #paginated_get refactoring. Up to this point, we've been using a "functional" approach; we've just been using a bunch of methods defined in the outermost lexical scope.

First, we'll extract a Client responsible for sending requests to the Trello API and parsing the responses as JSON.

# trello_client.rb

require "http"
require "addressable/uri"

class Client
  def initialize(opts = {})
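    # Block form: only read ENV when the option is not passed in,
    # so explicit credentials work even if the variables are unset.
    @app_key   = opts.fetch(:app_key) { ENV.fetch("TRELLO_APP_KEY") }
    @app_token = opts.fetch(:app_token) { ENV.fetch("TRELLO_APP_TOKEN") }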
  end

  def get(path, params = {})
    HTTP.get(trello_url(path, params)).parse
  end

  private

  def trello_url(path, params = {})
    auth_params = { key: @app_key, token: @app_token }

    Addressable::URI.new({
      scheme: "https",
      host: "api.trello.com",
      path: File.join("1", path),
      query_values: auth_params.merge(params)
    })
  end
end

Next, we'll provide a class to represent the paginated collection of results to replace our implementation of #paginated_get.

The Twitter API uses cursors to navigate through pages, a concept similar to "next" and "previous" links on websites. Although Trello doesn't provide explicit cursors in their API, we can still wrap the paginated results in an enumerable class to get similar behavior.

# trello_cursor.rb
require_relative "./trello_client"

class Cursor
  def initialize(path, params = {})
    @path       = path
    @params     = params

    @collection = []
    @before     = params.fetch(:before, nil)
    @limit      = params.fetch(:limit, 50)
  end
end

The Cursor will be initialized with a path and params, like our paginated_get. We'll also maintain an internal @collection array to cache elements as they are returned from Trello.

class Cursor
  private

  def client
    @client ||= Client.new
  end

  def fetch_next_page
    response              = client.get(@path, @params.merge(before: @before, limit: @limit))
    @last_response_empty  = response.empty?
    @collection           += response
    @before               = response.last["id"] unless last?
  end

  MAX = 1000

  def last?
    @last_response_empty || @collection.size >= MAX
  end
end

We'll introduce a dependency on the Client to interface with Trello through the private client method. We'll use our client to fetch the next page, append the latest results to our cached @collection, and update the @before id used to request the following page. Now for the key public method:

class Cursor
  include Enumerable

  def each(start = 0)
    return to_enum(:each, start) unless block_given?

    Array(@collection[start..-1]).each do |element|
      yield(element)
    end

    unless last?
      start = [@collection.size, start].max

      fetch_next_page

      each(start, &Proc.new)
    end
  end
end

We've chosen to have our Cursor expose the Enumerable API by including the Enumerable module and implementing #each. This will give cursor instances enumerable behavior so we can simply replace our paginated_get definition to return a new Cursor.

def paginated_get(path, params = {})
  Cursor.new(path, params)
end

def comments(params = {})
  # Merge caller-supplied params instead of ignoring them.
  paginated_get("members/me/actions", { filter: "commentCard" }.merge(params))
end

Let's break down Cursor#each a bit further. The first line allows us to retain the Enumerator behavior from before.

return to_enum(:each, start) unless block_given?

It invokes Kernel#to_enum when no block is given to an each method call. In this case, the method returns an Enumerator that packages the behavior of #each for an enumerator chain similar to before:

puts comments.each.lazy.
  map { |axn| axn["data"]["text"] }.
  select { |txt| txt.start_with?("@mgerrior") }.
  take(2).force

For more info on using #to_enum, check out Arkency's Stop including Enumerable, return Enumerator instead.
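Here's the to_enum guard in isolation, on a toy class that has nothing to do with Trello. Countdown is a made-up example.

```ruby
class Countdown
  include Enumerable

  def initialize(from)
    @from = from
  end

  def each
    # Without a block, hand back an Enumerator that wraps this same method.
    return to_enum(:each) unless block_given?

    @from.downto(1) { |n| yield n }
  end
end

countdown = Countdown.new(5)

enum       = countdown.each # no block given, so we get an Enumerator
first      = enum.next
doubled    = countdown.map { |n| n * 2 }
first_even = countdown.each.lazy.select(&:even?).first(1)

puts enum.class
puts first
puts doubled.inspect
puts first_even.inspect
```

With the guard in place, the same #each serves both styles: pass it a block to iterate right away, or leave the block off and chain from the returned enumerator.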

We also need to yield each element in the @collection to pass elements to callers of #each:

Array(@collection[start..-1]).each do |element|
  yield(element)
end

We iterate from the start of the collection to the end with Array(@collection[start..-1]).each... but wait! When we start iterating, the @collection is empty:

def initialize
  # ...
  @collection = []
end

Wat?

The key comes in the lines that follow in #each:

unless last?
  start = [@collection.size, start].max

  fetch_next_page

  each(start, &Proc.new)
end

Unless we've encountered the last page, we fetch the next page, which appends the latest results to the collection and we recursively invoke #each with a starting point. This means #each will be invoked again with new results until no new data is encountered. Sweet!
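To trace the recursion without the network, here's a stripped-down sketch of the same shape. PagedList is a hypothetical stand-in for Cursor that pulls its pages from an in-memory array and counts fetches; it forwards the block with an explicit &block parameter.

```ruby
class PagedList
  include Enumerable

  attr_reader :fetches

  def initialize(pages)
    @pages      = pages.dup # pages still waiting to be "fetched"
    @collection = []        # cache of everything fetched so far
    @fetches    = 0
    @exhausted  = false
  end

  def each(start = 0, &block)
    return to_enum(:each, start) unless block

    # Yield whatever is cached from the starting index onward.
    Array(@collection[start..-1]).each(&block)

    unless @exhausted
      start = [@collection.size, start].max
      fetch_next_page
      each(start, &block) # recurse to yield the newly cached elements
    end
  end

  private

  def fetch_next_page
    page = @pages.shift || [] # an empty page marks the end
    @fetches   += 1
    @exhausted  = page.empty?
    @collection += page
  end
end

list  = PagedList.new([[1, 2], [3, 4], [5]])
all   = list.to_a     # walks every page, plus one empty fetch at the end
again = list.first(2) # served from the cache, no new fetches

fresh   = PagedList.new([[1, 2], [3, 4], [5]])
partial = fresh.first(3) # stops after the second page

puts all.inspect
puts "full walk: #{list.fetches} fetches"
puts "first(3): #{fresh.fetches} fetches"
```

Taking only the first three elements stops after the second page, and a second pass over a fully loaded list is answered entirely from the cache.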

A neat trick is how we forward the block given to #each. When we call Proc.new without explicitly passing a block or proc object, it will instantiate with the block given to its surrounding method if there is one. The behavior is similar to the following:

def each(start = 0, &block)
  # ...
  each(start, &block)
  # ...
end

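The main benefit is that, by omitting &block from the method signature, we don't needlessly create a Proc object when the block is never forwarded. Note that calling Proc.new without a block was deprecated in Ruby 2.7 and removed in Ruby 3.0, so on newer Rubies the explicit &block form is required. For more on this, read up on Passing Blocks in Ruby without &block.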
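For comparison, a minimal sketch of the explicit form on a toy recursive method. The names countdown and block_class are made up for the example.

```ruby
# The & in the signature captures the caller's block as a Proc object,
# and the & at the call site turns it back into a block for the next call.
def countdown(n, &block)
  return if n.zero?

  block.call(n)
  countdown(n - 1, &block)
end

# Shows what the captured parameter holds with and without a block.
def block_class(&block)
  block.class
end

seen = []
countdown(3) { |n| seen << n }

with_block    = block_class { :anything }
without_block = block_class

puts seen.inspect
puts with_block
puts without_block
```

When no block is given, the parameter is simply nil, so a plain `unless block` check can stand in for block_given?.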

"Recursive each" is a powerful technique for providing a seamless, enumerable interface to paginated or cursored results. I first encountered this approach in sferik's Twitter gem, a great resource for those considering writing an API wrapper in Ruby.

On your own

Give it a shot! Pick out an API you like to use and play with techniques for modeling its collection resources. This is a great way to get more experience with Ruby's Enumerable. Consider one of these approaches when you need to traverse paginated or partitioned subsets of data in an external or internal API.

Think less about pages and more about data.

Changelog

2016-01-28

Updated the examples to use the :before parameter instead of :page for requests for successive "pages"
Posted the full source of the examples above on GitHub

Credits

Icons via the Noun Project:

Arrows by Zlatko Najdenovski

rossta.net