Scrolling Elasticsearch using Node.js and Promises

Elasticsearch is a great way to store lots of documents that need to be quickly searched and retrieved. In addition to a broad query API, Elasticsearch also provides scrolling functionality that lets you query the server and incrementally download the results. This can be really useful for processing any large result set, e.g. for reindexing.

Scrolling can be awkward to use, however, because you first have to set up the scroll and then repeatedly call the server to fetch the results. To simplify this, I have implemented a package called ElasticScroll using Node.js with the Q promises library. In this post I will describe how ElasticScroll works and how to use it.

Note: I am going to use CoffeeScript in this post for reasons described here.

Overview

The two libraries that ElasticScroll uses are Q and Q-IO. Q is a Promises/A+ library (read more about promises here) and Q-IO wraps the Node.js IO interfaces in promises.

First, ElasticScroll must import the libraries:

    Q = require 'q'
    qhttp = require 'q-io/http'

Then define the class that will contain the functions and variables:

    class ElasticScroll
      constructor: (@url, @query, @process_fn) ->

To initialise ElasticScroll you must give it a url that defines where the Elasticsearch server is, a query to send, and a function process_fn that will process each document.

The main scrolling function scroll is defined as:

      scroll: ->
        Q.fcall( => @set_scroll_id(@query))
          .then( => @get_next_set())
          .then( (hits) => @process_hits(hits))
          .then( (hits) => @continue_scroll(hits))

scroll returns a promise for the results, with the chain started by Q.fcall. It first sets up the scroll_id, then gets the first set of results, processes the hits, and then continues to scroll. Each of these steps is described below.

Note: the => (fat-arrow) syntax in CoffeeScript defines a function whose this (@) is bound to the object where the function is defined; read more here.
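JavaScript arrow functions behave in a similar way, which may help if the CoffeeScript syntax is unfamiliar. The sketch below uses a made-up `Counter` class (not part of ElasticScroll) to contrast an arrow function, which keeps the surrounding `this`, with a regular function, whose `this` depends on how it is called.

```javascript
class Counter {
  constructor() {
    this.count = 0;
  }

  // Arrow function: `this` stays bound to the Counter instance,
  // much like CoffeeScript's fat arrow =>.
  boundIncrementer() {
    return () => {
      this.count += 1;
      return this.count;
    };
  }

  // Regular function: `this` is decided at call time,
  // much like CoffeeScript's thin arrow ->.
  unboundThis() {
    return function () {
      return this;
    };
  }
}

const counter = new Counter();
const increment = counter.boundIncrementer();
increment();
increment();
console.log(counter.count); // 2

// Called on its own, the regular function has no `this` to refer to.
console.log(counter.unboundThis()()); // undefined
```

This is why the callbacks in scroll that touch instance state use =>: they run later, called by the promise library, and still need to reach the ElasticScroll instance.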

Getting the Scroll ID

The first stage of scrolling Elasticsearch is to send the query, which returns a scroll id. This id is used to access the cached results in Elasticsearch.

      set_scroll_id: ->
        request = {
          method: "POST"
          body: [JSON.stringify(@query)]
          url: "#{@url}/_search?search_type=scan&scroll=10m"
        }

        qhttp.request(request)
          .then((response) -> response.body.read())
          .then((resp) -> JSON.parse(resp.toString()))
          .then((json) => @scroll_id = json._scroll_id)

In the set_scroll_id function, the request is first defined to post the query to Elasticsearch. Then Q-IO's http module is used to send the request, read the body, parse the response, and assign the returned scroll id to the instance variable scroll_id.
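The two pure parts of this step, building the request and pulling the id out of the response, can be sketched in plain JavaScript without touching the network. The helper names `buildScrollRequest` and `parseScrollId` are invented for this illustration, and the response body is a hand-written stand-in rather than real server output.

```javascript
// Build the initial scroll request, mirroring the fields used in set_scroll_id.
function buildScrollRequest(baseUrl, query) {
  return {
    method: 'POST',
    body: [JSON.stringify(query)],
    url: `${baseUrl}/_search?search_type=scan&scroll=10m`,
  };
}

// Pull the scroll id out of a raw response body (a Buffer or a string).
function parseScrollId(rawBody) {
  return JSON.parse(rawBody.toString())._scroll_id;
}

const query = { query: { query_string: { query: 'some query string here' } } };
const request = buildScrollRequest('http://localhost:9200', query);

// Hand-written stand-in for the body Elasticsearch would send back.
const scrollId = parseScrollId(Buffer.from('{"_scroll_id":"abc123"}'));

console.log(request.url); // http://localhost:9200/_search?search_type=scan&scroll=10m
console.log(scrollId);    // abc123
```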

Getting the Results

Getting the results involves sending the obtained scroll id to Elasticsearch, then parsing the results.

      get_next_set: () ->
        request = {
          method: "GET"
          url: "#{@url}/_search/scroll/#{@scroll_id}?scroll=10m"
        }

        qhttp.request(request)
          .then((response) -> response.body.read())
          .then((resp) -> JSON.parse(resp.toString()))
          .then((json) -> json.hits.hits)
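To make the `json.hits.hits` step concrete, here is a small JavaScript sketch that parses a response body and returns the array of hits. The helpers `buildNextRequest` and `parseHits` are hypothetical names, and the sample body is hand-written for illustration; real responses carry more fields.

```javascript
// Build the follow-up request, using the same URL shape as get_next_set.
function buildNextRequest(baseUrl, scrollId) {
  return {
    method: 'GET',
    url: `${baseUrl}/_search/scroll/${scrollId}?scroll=10m`,
  };
}

// Parse a raw response body and return the array of hits.
function parseHits(rawBody) {
  return JSON.parse(rawBody.toString()).hits.hits;
}

// Hand-written sample body (an assumption, not captured server output).
const sampleBody = Buffer.from(JSON.stringify({
  _scroll_id: 'abc123',
  hits: {
    total: 2,
    hits: [
      { _id: '1', _source: { title: 'first' } },
      { _id: '2', _source: { title: 'second' } },
    ],
  },
}));

const request = buildNextRequest('http://localhost:9200', 'abc123');
const hits = parseHits(sampleBody);

console.log(request.url); // http://localhost:9200/_search/scroll/abc123?scroll=10m
console.log(hits.length); // 2
```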

Processing the Results

Processing the results just uses CoffeeScript list comprehensions and the function passed into the constructor.

      process_hits: (hits) ->
        (@process_fn(hit) for hit in hits)
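The list comprehension corresponds to `Array.prototype.map` in JavaScript. The sketch below assumes a made-up processing function that extracts each document's `_source`; any function of one hit would do.

```javascript
// Hypothetical processing function: keep only each document's _source.
const processFn = (hit) => hit._source;

// Equivalent of `(@process_fn(hit) for hit in hits)`:
// one result per hit, in the same order.
function processHits(hits, fn) {
  return hits.map((hit) => fn(hit));
}

const hits = [
  { _id: 'a', _source: { title: 'first' } },
  { _id: 'b', _source: { title: 'second' } },
];

const results = processHits(hits, processFn);
console.log(results.length); // 2
```

Note that the value passed down the promise chain is the array of processed results, not the original hits. It has the same length as the hits, which is all the next step needs in order to decide whether to stop.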

Continuing to Scroll

To decide whether to stop scrolling, continue_scroll returns immediately if there were no hits. Otherwise, it returns a promise that gets the next set of results, processes them, then recursively calls itself to see if it should continue scrolling.

      continue_scroll: (hits) ->
        return if hits.length == 0

        @get_next_set()
          .then( (hits) => @process_hits(hits))
          .then( (hits) => @continue_scroll(hits))

Using ElasticScroll

To use ElasticScroll you have to install the module with:

    npm install elasticscroll

Then you have to import it, define a query, a processing function, and a url, initialise ElasticScroll with them, and then call scroll.

    ElasticScroll = require 'elasticscroll'

    query = {
      "query": {
        "query_string": {
          "query": "some query string here"
        }
      }
    }

    print_to_console = (hit) -> console.log(hit)

    es = new ElasticScroll("http://localhost:9200", query, print_to_console)

    es.scroll().fail(console.log)

Conclusion

I really like working with Elasticsearch and think that Node.js is an excellent platform to build tools that interact with it. Additionally, I really enjoyed learning more about Q promises and Q-IO, because they made writing this reasonably complex function much more enjoyable.
