Fun With Haxl (Part 1)
October 20, 2015

This is a blog-post version of a talk I recently gave at the Haskell eXchange 2015. The video of the talk is here, but there were a lot of questions during the talk which aren't very audible on the video, so hopefully this post will be useful to folks who weren't at the event.
If you want to play with the examples yourself, the code is available on github, and to run the examples you'll need to cabal install haxl sqlite first, or the stack equivalent.
What is Haxl?
Haxl is a library that was developed for solving a very specific problem at Facebook: we wanted to write purely functional code, including data-fetching operations, and have the data-fetches automatically batched and performed concurrently as far as possible. This is exactly what Haxl does, and it has been running in production at Facebook as part of the anti-abuse infrastructure for nearly a year now.
Although it was designed for this specific purpose, we can put Haxl to use for a wide range of tasks where implicit concurrency is needed: not just data-fetching, but other remote data operations (including writes), and it works perfectly well for batching and overlapping local I/O operations too. In this blog post (series) I’ll start by reflecting on how to use Haxl for what it was intended for, and then move on to give examples of some of the other things we can use Haxl for. In the final example, I’ll use Haxl to implement a parallel build system.
Example: accessing data for a blog
Let’s suppose you’re writing a blog (an old-fashioned one with
dynamically-generated pages!) and you want to store the content and
metadata for the blog in a database. I’ve made an example database
called blog.sqlite, and we can poke around to see what’s in it:
$ sqlite3 blog.sqlite
SQLite version 3.8.2 2013-12-06 14:53:30
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .tables
postcontent postinfo postviews
sqlite> .schema postinfo
CREATE TABLE postinfo(postid int, postdate timestamp, posttopic text);
sqlite> .schema postcontent
CREATE TABLE postcontent(postid int, content text);
sqlite> select * from postinfo;
1|2014-11-20 10:00:00|topic1
2|2014-11-20 10:01:00|topic2
3|2014-11-20 10:02:00|topic3
...
sqlite> select * from postcontent;
1|example content 1
2|example content 2
3|example content 3
...
There are a couple of tables that we’re interested in: postinfo,
which contains the metadata, and postcontent, which contains the
content. Both are indexed by postid, an integer key for each post.
Now, let’s make a little Haskell API for accessing the blog data. I’ll do this twice: first by calling an SQL library directly, and then using Haxl, to compare the two.
The code for the direct implementation is in BlogDB.hs, using
the simple sqlite package for accessing the sqlite DB (there are
other more elaborate and type-safe abstractions for accessing
databases, but that is orthogonal to the issues we’re interested in
here, so I’m using sqlite to keep things simple).
In our simple API, there’s a monad, Blog, in which we can access the
blog data, a function run for executing a Blog computation, and
two operations, getPostIds and getPostContent for making specific
queries in the Blog monad. To summarise:
type Blog a -- a monad
run :: Blog a -> IO a
type PostId = Int
type PostContent = String
getPostIds :: Blog [PostId]
getPostContent :: PostId -> Blog PostContent

The implementation of the API will print out the queries it is making, so that we can see what's happening when we call these functions. Let's use this API to query our example DB:
GHCi, version 7.11.20150924: http://www.haskell.org/ghc/ :? for help
[1 of 1] Compiling BlogDB ( BlogDB.hs, interpreted )
Ok, modules loaded: BlogDB.
*BlogDB> run getPostIds
select postid from postinfo;
[1,2,3,4,5,6,7,8,9,10,11,12]
*BlogDB> run $ getPostIds >>= mapM getPostContent
select postid from postinfo;
select content from postcontent where postid = 1;
select content from postcontent where postid = 2;
select content from postcontent where postid = 3;
select content from postcontent where postid = 4;
select content from postcontent where postid = 5;
select content from postcontent where postid = 6;
select content from postcontent where postid = 7;
select content from postcontent where postid = 8;
select content from postcontent where postid = 9;
select content from postcontent where postid = 10;
select content from postcontent where postid = 11;
select content from postcontent where postid = 12;
["example content 1","example content 2","example content 3","example content 4","example content 5","example content 6","example content 7","example content 8","example content 9","example content 10","example content 11","example content 12"]
*BlogDB>
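To make the shape of this API concrete, here is a toy, self-contained version of it, with an in-memory Map standing in for the sqlite database. This Blog monad is a sketch for illustration only; the real implementation in BlogDB.hs talks to the database:

```haskell
import qualified Data.Map as Map

type PostId      = Int
type PostContent = String

-- A toy stand-in for the Blog monad: the real BlogDB.hs wraps
-- sqlite access, here we just thread a Map through IO.
newtype Blog a = Blog { unBlog :: Map.Map PostId PostContent -> IO a }

instance Functor Blog where
  fmap f (Blog g) = Blog (fmap f . g)

instance Applicative Blog where
  pure x            = Blog (\_ -> pure x)
  Blog f <*> Blog x = Blog (\db -> f db <*> x db)

instance Monad Blog where
  Blog x >>= k = Blog (\db -> x db >>= \a -> unBlog (k a) db)

-- run supplies the "database": three example posts.
run :: Blog a -> IO a
run b = unBlog b (Map.fromList [ (i, "example content " ++ show i)
                               | i <- [1..3] ])

getPostIds :: Blog [PostId]
getPostIds = Blog (pure . Map.keys)

getPostContent :: PostId -> Blog PostContent
getPostContent pid = Blog (pure . Map.findWithDefault "" pid)
```

The same `run $ getPostIds >>= mapM getPostContent` pipeline from the GHCi session works against this toy version, minus the query logging.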
The problem: batching queries
Now, the issue with this API is that every call to getPostContent
results in a separate select query. The mapM call in the above
example gave rise to one select query to fetch the contents of each
post separately.
Ideally, rather than
select content from postcontent where postid = 1;
select content from postcontent where postid = 2;
select content from postcontent where postid = 3;
what we would like to see is something like
select content from postcontent where postid in (1,2,3);
This kind of batching is particularly important when the database is remote, or large, or both.
One way to solve the problem is to add a new API for this query, e.g.:
multiGetPostContents :: [PostId] -> Blog [PostContent]

But there are several problems with this:
- Clients have to remember to call it, rather than using mapM.
- If we're fetching post content in multiple parts of our code, we would have to arrange to do the fetching in one place and plumb the results to the places that need the data, which might involve restructuring our code in an unnatural way, purely for efficiency reasons.
- From a taste perspective, multiGetPostContents duplicates the functionality of mapM getPostContent, which is ugly.
This is the problem that Haxl was designed to solve. We’ll look at how to implement this API on top of Haxl in the next couple of sections, but just to demonstrate the effect, let’s try it out first:
Prelude> :l HaxlBlog
[1 of 2] Compiling BlogDataSource ( BlogDataSource.hs, interpreted )
[2 of 2] Compiling HaxlBlog ( HaxlBlog.hs, interpreted )
Ok, modules loaded: HaxlBlog, BlogDataSource.
*HaxlBlog> run $ getPostIds >>= mapM getPostContent
select postid from postinfo;
select postid,content from postcontent where postid in (12,11,10,9,8,7,6,5,4,3,2,1)
["example content 1","example content 2","example content 3","example content 4","example content 5","example content 6","example content 7","example content 8","example content 9","example content 10","example content 11","example content 12"]
*HaxlBlog>
Even though we used the standard mapM function to perform multiple
getPostContent calls, they were batched together and executed as a
single select query.
Introduction to Haxl
You can find the full documentation for Haxl here, but in this section I’ll walk through the most important parts, and then we’ll implement our own data source for the blog database.
Haxl is a Monad:
newtype GenHaxl u a
instance Functor (GenHaxl u)
instance Applicative (GenHaxl u)
instance Monad (GenHaxl u)

It is generalised over a type variable u, which can be used to pass
around some user-defined data throughout a Haxl computation. For
example, in our application at Facebook we instantiate u with the
data passed in with the request that we’re processing.
Essentially there is a Reader monad built into Haxl. (This might
not be the cleanest design, but it is the way it is.) Throughout the
following we’re not going to be using the u parameter, and I’ll
often instantiate it with (), like this:
type Haxl a = GenHaxl () a

The most important operation in Haxl is dataFetch:
dataFetch :: (DataSource u r, Request r a) => r a -> GenHaxl u a

This is how a user of Haxl fetches some data from a data source (in our example, from the blog database). The Haxl library is designed so that you can use multiple user-defined data sources simultaneously.
The argument of type r a is a request, where r is the request type
constructor, and a is the type of the result we’re expecting. The
r type is defined by the data source you’re using, which should also
supply appropriate instances of DataSource and Request. For
example, the request type for our blog looks like this:
data BlogRequest a where
  FetchPosts :: BlogRequest [PostId]
  FetchPostContent :: PostId -> BlogRequest PostContent

Note that we're using a GADT, because we have two different requests which each produce a result of a different type.
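As a self-contained illustration of why a GADT fits here, we can write a mock interpreter (mockFetch is a made-up name for this sketch, not part of the data source) whose result type follows the constructor it matches:

```haskell
{-# LANGUAGE GADTs #-}

type PostId      = Int
type PostContent = String

-- Each constructor fixes the result type of its request.
data BlogRequest a where
  FetchPosts       :: BlogRequest [PostId]
  FetchPostContent :: PostId -> BlogRequest PostContent

-- Matching on a GADT constructor refines the result type, so a
-- single interpreter can return differently-typed results.
mockFetch :: BlogRequest a -> a
mockFetch FetchPosts             = [1, 2, 3]
mockFetch (FetchPostContent pid) = "example content " ++ show pid
```

With a plain (non-GADT) data type, every constructor would have to share one result type, and we couldn't give mockFetch this signature.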
Next, our request type needs to satisfy the Request constraint.
Request is defined like this:
type Request req a =
  ( Eq (req a)
  , Hashable (req a)
  , Typeable (req a)
  , Show (req a)
  , Show a
  )

That is, it is a synonym for a handful of type class constraints that are all straightforward boilerplate. (Defining constraint synonyms like this requires the ConstraintKinds extension, and it's a handy trick to know.)
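In isolation, the trick looks like this (Pretty and describe are made-up names for this sketch, not from Haxl):

```haskell
{-# LANGUAGE ConstraintKinds #-}

-- A constraint synonym bundles several class constraints under one
-- name, so signatures don't have to repeat the whole list.
type Pretty a = (Show a, Eq a, Ord a)

-- The synonym is used like any ordinary constraint.
describe :: Pretty a => a -> a -> String
describe x y
  | x == y    = "equal: " ++ show x
  | x < y     = show x ++ " < " ++ show y
  | otherwise = show x ++ " > " ++ show y
```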
The other constraint we need to satisfy is DataSource, which is
defined like this:
class (DataSourceName req, StateKey req, Show1 req)
  => DataSource u req where
  fetch
    :: State req
    -> Flags
    -> u
    -> [BlockedFetch req]
    -> PerformFetch

DataSource has a single method, fetch, which is used by Haxl to
execute requests for this data source. The key point is that fetch
is passed a list of BlockedFetch values, each of which contains
a single request. The BlockedFetch type is defined like this:
data BlockedFetch r = forall a. BlockedFetch (r a) (ResultVar a)

That is, it contains a request of type r a, and a ResultVar a
which is a container to store the result in. The fetch
implementation can store the result using one of these two functions:
putSuccess :: ResultVar a -> a -> IO ()
putFailure :: (Exception e) => ResultVar a -> e -> IO ()

Because fetch is passed a list of BlockedFetch, it can collect
together requests and satisfy them using a single query to the
database, or perform them concurrently, or use whatever methods are
available for performing multiple requests simultaneously.
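The essence of this batching can be sketched without the rest of Haxl. Here an IORef (Maybe a) stands in for ResultVar, and a pure lookup stands in for the database; batchedFetch is an illustrative name, not Haxl's API:

```haskell
{-# LANGUAGE GADTs #-}
import Data.IORef

type PostId      = Int
type PostContent = String

data BlogRequest a where
  FetchPostContent :: PostId -> BlogRequest PostContent

-- Simplified stand-in for Haxl's BlockedFetch/ResultVar pair.
data BlockedFetch r where
  BlockedFetch :: r a -> IORef (Maybe a) -> BlockedFetch r

-- One batched "query" answers every blocked request at once,
-- instead of issuing a separate lookup per request.
batchedFetch :: [BlockedFetch BlogRequest] -> IO ()
batchedFetch blocked = do
  let ids  = [ pid | BlockedFetch (FetchPostContent pid) _ <- blocked ]
      -- imagine: select postid,content from postcontent where postid in (...)
      rows = [ (i, "example content " ++ show i) | i <- ids ]
  mapM_ (fill rows) blocked
  where
    fill :: [(PostId, PostContent)] -> BlockedFetch BlogRequest -> IO ()
    fill rows (BlockedFetch (FetchPostContent pid) var) =
      writeIORef var (lookup pid rows)
```

The point is that the single rows computation corresponds to one select query covering every blocked request, and the results are then distributed back to the individual result variables.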
The fetch method returns PerformFetch, which is defined like this:
data PerformFetch
  = SyncFetch (IO ())
  | AsyncFetch (IO () -> IO ())

For our purposes here, we'll only use SyncFetch, which should contain an
IO action whose job it is to fill in all the results in the
BlockedFetches before it returns. The alternative AsyncFetch can
be used to overlap requests from multiple data sources.
Lastly, let’s talk about state. Most data sources will need some
state; in the case of our blog database we need to keep track of the
handle to the database so that we don’t have to open a fresh one each
time we make some queries. In Haxl, data source state is represented
using an associated data type called State, which is defined by the
StateKey class:
class Typeable f => StateKey (f :: * -> *) where
  data State f

So every data source with request type req defines a state of type
State req, which can of course be empty if the data source doesn’t
need any state. Our blog data source defines it like this:
instance StateKey BlogRequest where
  data State BlogRequest = BlogDataState SQLiteHandle

The State req for a data source is passed to fetch each time it is
called.
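The associated-data-type pattern can be seen in isolation with a simplified StateKey. A String stands in for SQLiteHandle here, and the real class also has a Typeable superclass, which this sketch omits:

```haskell
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE KindSignatures #-}

-- A simplified version of Haxl's StateKey class: each request type
-- declares its own state representation via an associated data type.
class StateKey (f :: * -> *) where
  data State f

-- Stub request type, just to carry an instance.
data BlogRequest a = BlogRequest

-- A String stands in for the real SQLiteHandle.
instance StateKey BlogRequest where
  data State BlogRequest = BlogDataState String

describeState :: State BlogRequest -> String
describeState (BlogDataState h) = "connected to " ++ h
```

Because State is an associated data family, each data source gets its own private state type, yet the framework can store them all uniformly (which is what StateStore does below).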
The full implementation of our example data source is in BlogDataSource.hs.
How do we run some Haxl?
There’s a runHaxl function:
runHaxl :: Env u -> GenHaxl u a -> IO a

runHaxl needs something of type Env u. This is the "environment" that
a Haxl computation runs in, and contains various things needed by the
framework. It also contains the data source state, and to build an
Env we need to supply the initial state. Here’s how to get an Env:
initEnv :: StateStore -> u -> IO (Env u)

The StateStore contains the states for all the data sources we're
using. It is constructed with these two functions:
stateEmpty :: StateStore
stateSet :: StateKey f => State f -> StateStore -> StateStore

To see how to put these together, take a look at HaxlBlog.hs.
Trying it out
We saw a small example of our Haxl data source working earlier, but just to round off this first part of the series and whet your appetite for the next part, here are a couple more examples.
Haxl batches things together when we use the Applicative operators:
*HaxlBlog> run $ (,) <$> getPostContent 1 <*> getPostContent 2
select postid,content from postcontent where postid in (2,1)
("example content 1","example content 2")
Even if we have multiple mapM calls, they get batched together:
*HaxlBlog> run $ (,) <$> mapM getPostContent [1..3] <*> mapM getPostContent [4..6]
select postid,content from postcontent where postid in (6,5,4,3,2,1)
(["example content 1","example content 2","example content 3"],["example content 4","example content 5","example content 6"])
In Part 2 we’ll talk more about batching, and introduce the upcoming
ApplicativeDo extension which will allow Haxl to automatically
parallelize sequential-looking do-expressions.