Jaql: A JSON Query Language
Kevin Beyer, Vuk Ercegovac, Jun Rao, Eugene Shekita
{kbeyer,vercego,junrao,shekita}@us.ibm.com
Introduction
JavaScript Object Notation or JSON
has become a popular data format for many Web-based applications
because of its simplicity and modeling flexibility.
Wikipedia includes a nice
summary
of JSON's advantages over other data formats like XML.
In contrast to XML, which was originally designed as a markup language,
JSON was actually designed for data.
Moreover, JSON is a language-independent format, with bindings
in a variety of programming languages.
In short, JSON makes it easy to model a wide spectrum of
data, ranging from homogenous flat data to
heterogeneous nested data, and it can do this
in a language-independent format.
We believe that these characteristics make JSON
an ideal data format for many Hadoop applications
and databases in general.
In this document, we introduce Jaql, a query language for JSON.
Although Jaql has been designed specifically for JSON,
we have tried to borrow some of the best features of
SQL,
XQuery, and
PigLatin.
We begin with an example of JSON data, then go on to describe
the key features of Jaql and show how it can be used to process
JSON data in parallel using Hadoop's map/reduce framework.
Only a high-level description of Jaql is presented here.
A more detailed technical description is forthcoming.
Jaql is still an early draft specification,
so beware that it is likely to change over the next few months.
A JSON Example
Let's start off with an example of books and their reviews in JSON format:
[
{publisher: 'Scholastic',
author: 'J. K. Rowling',
title: 'Deathly Hallows',
year: 2007},
{publisher: 'Scholastic',
author: 'J. K. Rowling',
title: 'Chamber of Secrets',
year: 1999,
reviews: [
{rating: 10, user: 'joe', review: 'The best ...'},
{rating: 6, user: 'mary', review: 'Average ...'}]},
{publisher: 'Scholastic',
author: 'J. K. Rowling',
title: 'Sorcerers Stone',
year: 1998},
{publisher: 'Scholastic',
author: 'R. L. Stine',
title: 'Monster Blood IV',
year: 1997,
reviews: [
{rating: 8, user: 'rob', review: 'High on my list...'},
{rating: 2, user: 'mike', review: 'Not worth the paper ...',
discussion:
[{user: 'ben', text: 'This is too harsh...'},
{user: 'jill', text: 'I agree ...'}]}]},
{publisher: 'Grosset',
author: 'Carolyn Keene',
title: 'The Secret of Kane',
year: 1930}
]
This example shows an array of JSON objects.
Arrays are delimited by brackets '[]' and objects are delimited by braces '{}'.
Objects contain name:value pairs or members, where the
value can be an atomic type or a nested value.
In contrast to XML, the type of an atomic value is always known in JSON.
Here, each top-level object represents a book and its reviews.
The 'reviews' for a book object is an array and each entry
in the array corresponds to a review.
Each review consists of a 'rating', its 'user', and the text of the review.
Each review can also contain a 'discussion', which
itself can be discussed, forming a discussion thread.
It is important to point out that the Jaql
data model is actually a superset of JSON.
The only difference is that atomic types like date have been added to Jaql to
make certain database operations more efficient.
Also, for readability, the names in name:value pairs
do not always need to be quoted.
When data needs to be exported in standard JSON format,
Jaql can convert non-standard atomic types to strings
and add quotes to names.
For the remainder of this document, we will use the Jaql data
model and JSON interchangeably, although strictly speaking
the Jaql data model is a superset of JSON.
Reading and Writing JSON Data
Using Jaql, JSON data can be stored and retrieved
from a variety of data sources including ordinary files.
Jaql queries can take a collection as input and generate
a new collection as output, where a
collection corresponds to a JSON array.
An example using an ordinary file to store a collection
is as follows:
// Example 1. Write to a file named 'hey.dat'.
localWrite('hey.dat', [{text: 'Hello World'}]);
// Read it back...
localRead('hey.dat');
Here, a single object with the 'Hello World' string is
being written to a file called 'hey.dat' in the current directory.
It is also possible to read and write JSON data to
Hadoop HDFS files and
HBase tables.
We have made integration with Hadoop a priority by having all data types implement the
WritableComparable interface.
By integrating Jaql with HDFS and HBase, we are able to store JSON data
and process it in parallel using Hadoop's map/reduce framework.
Examples using Hadoop are provided below.
Our second example writes to an HDFS
SequenceFile.
A SequenceFile is a collection of key-value pairs.
Jaql's hadoopWrite() function only writes data into the value field, leaving the key field empty.
In example 2, the input data is represented as a literal, but in general the input
can be an expression that is the result of a Jaql query.
// Example 2. Write to a Hadoop SequenceFile named: 'orders.dat'.
hadoopWrite('orders.dat', [
{order: 1, cust: 'c1', items: [
{item: 1, qty: 2},
{item: 3, qty: 6},
{item: 5, qty: 10}]},
{order: 2, cust: 'c2', items: [
{item: 2, qty: 1},
{item: 5, qty: 2},
{item: 7, qty: 3}]},
{order: 3, cust: 'c1', items: [
{item: 1, qty: 2},
{item: 7, qty: 14},
{item: 5, qty: 10}]}
]);
// Read it back...
hadoopRead('orders.dat');
Our third example writes to an HBase table.
An HBase table is a collection of records, with each record containing a
primary key and a set of column name-value pairs.
Column names in HBase are of type
Text,
while column values are simply byte arrays.
HBase uses a two-part naming scheme for columns of the form 'column family:column'.
In Jaql, we use a sharp '#' instead of a colon ':' to separate column families from columns,
since the colon is already used as a separator in JSON.
If a column family is not specified, a special 'default' column family is used, as in the
following example.
// Example 3. Write to an HBase table named 'webcrawl'.
hbaseWrite('webcrawl', [
{key: "www.cnn.com", page:'...', rank: 0.9,
inlinks:[
{link: 'www.news.com', anchor: 'newsite'},
{link: 'www.jscript.com', anchor: 'look here'}]},
{key: "www.json.org", page:'...', rank: 0.8}
]);
// Read it back...
hbaseRead('webcrawl');
In hbaseWrite(), all objects in the input are written to a
single HBase table.
Each top-level JSON object is stored as an Hbase record with the specified key,
and each name:value pair in the object is stored as a separate column-value pair in
the record.
Values are serialized as a byte array.
Note that only the outermost name:value pairs in top-level objects are
stored as separate columns in Hbase.
Nested arrays and objects are serialized within these.
In example 3, two HBase records are written.
Each record is used to store the content, rank, and in-links (if any) of a web page.
The Jaql Query Language
Jaql is a functional query language that provides users with a simple,
declarative syntax to do things like filter, join, and group JSON data.
Jaql also allows user-defined functions to be written and used in expressions.
Let's begin with some examples using the book data presented earlier:
Our first two queries illustrate selection, projection, and filtering.
// Write the books collection from data above.
// hadoopWrite('books', [ {publisher... ] )
// Query 1. Return the publisher and title of each book.
for $b in hadoopRead('books')
return {$b.publisher, $b.title};
// result...
[
{publisher: 'Scholastic', title: 'Deathly Hallows'},
{publisher: 'Scholastic', title: 'Chamber of Secrets'},
{publisher: 'Scholastic', title: 'Sorcerers Stone'},
{publisher: 'Scholastic', title: 'Monster Blood IV'},
{publisher: 'Grosset', title: 'The Secret of Kane'}
];
// Query 2. Find the authors and titles of books that have received
// a review.
for $b in hadoopRead('books')
where exists($b.reviews)
return {$b.author, $b.title};
// result...
[
{author: 'J. K. Rowling', title: 'Chamber of Secrets'},
{author: 'R. L. Stine', title: 'Monster Blood IV'}
];
Query 1 uses a for expression to loop over the books collection.
Each book is bound to the variable $b and the return
clause constructs an object that is placed in the output array.
In Query 2, the where clause causes
only books with reviews to be output.
// Query 3. Project the title from each book using the short-hand
// projection notation.
hadoopRead('books').[title];
// Query 3-alt. Or using equivalent the long-hand notation.
for $b in hadoopRead('books')
return $b.title;
// result...
[ "Deathly Hallows",
"Chamber of Secrets",
"Sorcerers Stone",
"Monster Blood IV",
"The Secret of Kane"];
// Query 4. Find authors, titles, and reviews of books where a review
// prompted a discussion by the user 'ben'.
for $b in hadoopRead('books')
where exists(for $r in $b.reviews
where 'ben' in $r.discussion.[user]
return true)
return {$b.author, $b.title, $b.reviews};
// result...
[
{author: 'R. L. Stine',
title: 'Monster Blood IV',
reviews: [
{rating: 8, user: 'rob', review: 'High on my list...'},
{rating: 2, user: 'mike', review: 'Not worth the paper ...',
discussion: [
{user: 'ben', text: 'This is too harsh...'},
{user: 'jill', text: 'I agree ...'}]}]}
];
Query 3 shows the short-hand notation for projecting from an array of objects, as well as
the equivalent long-hand notation.
Query 4 is a more complicated variation of Query 2.
Now the predicate is based on values that are nested three
levels deep (books, reviews, discussion).
A nested for-where-return tests whether book $b contains
a review that prompted a discussion by the user 'ben'.
// Query 5. Find the authors and titles of books that had an
// average review rating over 5.
for $b in hadoopRead('books')
where avg($b.reviews.[rating]) > 5
return {$b.author, $b.title};
// result...
[
{author: 'J. K. Rowling', title: 'Chamber of Secrets'}
];
Query 5 filters books by their average rating.
For each book $b, the syntax $b.reviews.[rating] creates an array of review
ratings for each book $b. The average rating for book $b is then computed and tested
to see if it is greater than 5.
// Query 6. Show how many books each publisher has published.
group $b in hadoopRead('books') by $p = $b.publisher into $pubs
return {publisher: $p, num: count($pubs)};
// result...
[
{publisher: 'Scholastic', num: 4},
{publisher: 'Grosset', num: 1}
];
Query 6 illustrates grouping and aggregation.
The group expression partitions an input collection into groups.
Books are partitioned into groups by the grouping value $p,
which is set to $b.publisher.
The $pubs variable is bound to an array associated with each
group and used to count how many books each publisher has published.
// Query 7. Find the publisher who published the most books.
let
// group books by publisher and compute their book count
$g = group $b in hadoopRead('books') by $p = $b.publisher into $pubs
return {publisher: $p, num: count($pubs)},
// sort publishers by descending book count
$sorted = sort $i in $g by ($i.num desc)
// return the top publisher
return $sorted[0];
// result...
{publisher: 'Scholastic', num: 4};
Query 7 illustrates grouping and sorting.
It shows how group and sort can be used to
find the publisher who published the most books.
The let expression is used to assign expressions
to variables and make it easier to write the query.
Like in PigLatin, the group expression can also be used to co-group multiple collections.
To illustrate co-group, we define the collections X and Y:
write('X',
[
{a:1, b:1},
{a:1, b:2},
{a:2, b:3},
{a:2, b:4}
] );
write('Y',
[
{c:2, d:1},
{c:2, d:2},
{c:3, d:3},
{c:3, d:4}
] );
Query 8 co-groups X and Y.
Co-group is similar to a group over the union
of the input collections, except that two arrays
are generated for each group, one for each input collection.
// Query 8. Co-group X and Y.
group $x in hadoopRead('X') by $g = $x.a into $xgroup,
$y in hadoopRead('Y') by $g = $y.c into $ygroup
return {g: $g, b: $xgroup.[b], d: $ygroup.[d]};
// result...
[
{g: 1, b: [1,2], d: []},
{g: 2, b: [3,4], d: [1,2]},
{g: 3, b: [], d: [3,4]}
];
In Query 8, both X and Y need to be grouped on the same value, namely $g.
The syntax $xgroup.[b] and $ygroup.[d] in the return clause
projects the 'b' and 'd' values in the arrays created for each X and Y
group, respectively.
Joins can be expressed using co-group, but the syntax can get a little messy.
Since joins are common, special syntax has been introduced for them:
// Query 9. Join X and Y.
join $x in hadoopRead('X') on $x.a,
$y in hadoopRead('Y') on $y.c
return {$x.a, $x.b, $y.c, $y.d};
// result...
[
{a: 2, b: 3, c: 2, d: 1},
{a: 2, b: 3, c: 2, d: 2},
{a: 2, b: 4, c: 2, d: 1},
{a: 2, b: 4, c: 2, d: 2},
];
Query 9 shows an inner join on 'a' and 'c' in X and Y, respectively.
Although it has not be shown, left-, right-, and full-outer joins
can also be specified using modifiers.
Extending Jaql
Jaql has been designed to be extensible.
In this section, we briefly describe some interesting functions
that we have added to Jaql.
The first example is mapReduce(), which takes a Jaql
description of the map and reduce functions as input,
and uses it to run a map/reduce job in Hadoop.
// Write to an HDFS file called 'sample'.
hadoopWrite('sample.dat', [
{x: 0, text: 'zero'},
{x: 1, text: 'one'},
{x: 0, text: 'two'},
{x: 1, text: 'three'},
{x: 0, text: 'four'},
{x: 1, text: 'five'},
{x: 0, text: 'six'},
{x: 1, text: 'seven'},
{x: 0, text: 'eight'}
]);
// Run a map/reduce job that counts the number objects
// for each 'x' value.
mapReduce(
{ input: {type: 'hdfs', location: 'sample.dat'},
output: {type: 'hdfs', location: 'results.dat'},
map: fn($i) { [ [$i.x, 1] ] },
reduce: fn($x, $v) { [ {x: $x, num: count($v)} ] }
});
// Read the results...
hadoopRead('results.dat');
// result...
[
{x: '0', num: 5},
{x: '1', num: 4}
];
This example groups the input on 'x' and counts the
number of objects in each group.
The map function must specify how to extract a key-value pair,
and the reduce function must specify how to aggregate the
values for a given key.
Here, the key value is set to $i.x and count($v) is
used to count the values $v associated with each key.
Note that both the map and reduce functions need to output an array
because each input is allowed to produce multiple outputs.
We have found mapReduce() to come in handy for many map/reduce jobs.
In general, the body of the map and reduce functions can
be Jaql expressions, so a lot of expressive power is available
to programmers who want to work with JSON data.
By working in Jaql, programmers are freed from all the little
details required to actually set up and run a map/reduce job.
The mapReduce() function touches on a interesting feature of Jaql,
namely that Jaql is a second-order language.
This allows function definitions to be assigned to variables
-- even stored in the database --
and later evaluated, as shown in the following example.
// Define a function that returns the most recent book
// written by a given author.
let $mostRecent =
fn($author) {
let
$authorsBooks =
for $b in hadoopRead('books')
where $b.author = $author
return {title: $b.title, year: $b.year},
$sorted = sort $b in $authorsBooks by ($b.year desc)
return $sorted[0].title
}
// Invoke the function.
return $mostRecent('J. K. Rowling');
// result...
'Deathly Hallows';
Finally, to work with external data sources,
we recently added a jaqlGet() function that
can retrieve JSON data from a URL.
Below are two examples of jaqlGet() that get data from
Freebase and
Yahoo! Traffic.
// Get albums recorded by "The Police" using Freebase.
let
$artist = "The Police",
$freebase =
httpGet('http://www.freebase.com/api/service/mqlread',
{ queries:
serialize(
{ myquery:
{ query:
[{ type: "/music/artist",
name: $artist,
album: []
}]
}
}
) })[0]
return
unnest $freebase.myquery.result.[album];
// result...
[ "Outlandos d\'Amour",
"Reggatta de Blanc",
"Zenyatta Mondatta",
"Ghost in the Machine",
"Synchronicity",
"Every Breath You Take: The Singles",
"Greatest Hits",
"Message in a Box: The Complete Recordings (disc 1)",
"Message in a Box: The Complete Recordings (disc 2)",
"Message in a Box: The Complete Recordings (disc 3)",
"Message in a Box: The Complete Recordings (disc 4)",
"Live! (disc 1: Orpheum WBCN/Boston Broadcast)",
"Live! (disc 2: Atlanta/Synchronicity Concert)",
"Every Breath You Take: The Classics",
"Their Greatest Hits",
"Can\'t Stand Losing You",
"Roxanne \'97 (Puff Daddy remix)",
"Roxanne \'97"];
// Get traffic incidents from Yahoo!.
let
$trafficData =
httpGet('http://local.yahooapis.com/MapsService/V1/trafficData',
{ appid: "YahooDemo",
street: "701 First Street",
city: "Sunnyvale",
state: "CA",
output: "json"
})[0]
return
for $i in $trafficData.ResultSet.Result
return $i.Title;
// result...
[ "Road construction, on US-101 NB at FAIROAKS AVE TONBTO NB MATHILDA",
"Road construction, on CA-85 SB at MOFFETT BLVD",
"Road construction, on CA-237 EB at MATHILDA AVE TOEBTO FAIR OAKS AVE",
"Road construction, on CA-237 WB at CROSSMAN AVE",
"Road construction, on I-880 at GATEWAY BLVD"];
Implementation Status
An initial prototype of Jaql for Hadoop is available for downloading.
Our prototype has been tested to work in stand-alone mode with
Hadoop's mini-cluster, or on Hadoop's distributed cluster.
Currently, only simple group expressions
are automatically rewritten to use the mapReduce() function when
their input is from HDFS or HBase. We will soon have more expressions, including
for and join, translated to mapReduce().
This enables Hadoop users to write simple Jaql queries over JSON data
and run them in parallel as map/reduce jobs. Of course, mapReduce() can be directly used
to take advantage of the cluster.
There is no shortage of interesting problems left to solve in Jaql,
and we welcome help on it.
These include:
-
Develop better Jaql complier technology so more Jaql queries
can be automatically translated into map/reduce jobs when appropriate.
-
Enabling bindings to other programming languages besides Java
so Jaql clients and functions can be written in, say, Python or JavaScript.
-
Investigating how Facebook's
Thrift
could be leveraged by Jaql.
-
Handling more data sources than just ordinary files, HDFS, and HBase.
-
Performance testing and performance improvements.
-
Extending the notion of job tracking, load balancing, and restarting to Jaql queries.