Jaql: A JSON Query Language

Kevin Beyer, Vuk Ercegovac, Jun Rao, Eugene Shekita
{kbeyer,vercego,junrao,shekita}@us.ibm.com

Introduction

JavaScript Object Notation or JSON has become a popular data format for many Web-based applications because of its simplicity and modeling flexibility. Wikipedia includes a nice summary of JSON's advantages over other data formats like XML. In contrast to XML, which was originally designed as a markup language, JSON was actually designed for data. Moreover, JSON is a language-independent format, with bindings in a variety of programming languages. In short, JSON makes it easy to model a wide spectrum of data, ranging from homogenous flat data to heterogeneous nested data, and it can do this in a language-independent format. We believe that these characteristics make JSON an ideal data format for many Hadoop applications and databases in general.

In this document, we introduce Jaql, a query language for JSON. Although Jaql has been designed specifically for JSON, we have tried to borrow some of the best features of SQL, XQuery, and PigLatin. We begin with an example of JSON data, then go on to describe the key features of Jaql and show how it can be used to process JSON data in parallel using Hadoop's map/reduce framework. Only a high-level description of Jaql is presented here. A more detailed technical description is forthcoming. Jaql is still an early draft specification, so beware that it is likely to change over the next few months.

A JSON Example

Let's start off with an example of books and their reviews in JSON format:

  [
    {publisher: 'Scholastic',
     author: 'J. K. Rowling',
     title: 'Deathly Hallows',
     year: 2007},

    {publisher: 'Scholastic',
     author: 'J. K. Rowling',
     title: 'Chamber of Secrets',
     year: 1999, 
     reviews: [
       {rating: 10, user: 'joe', review: 'The best ...'},
       {rating: 6, user: 'mary', review: 'Average ...'}]},

    {publisher: 'Scholastic',
     author: 'J. K. Rowling',
     title: 'Sorcerers Stone',
     year: 1998},

    {publisher: 'Scholastic',
     author: 'R. L. Stine',
     title: 'Monster Blood IV',
     year: 1997, 
     reviews: [
       {rating: 8, user: 'rob', review: 'High on my list...'}, 
       {rating: 2, user: 'mike', review: 'Not worth the paper ...', 
        discussion:
          [{user: 'ben', text: 'This is too harsh...'}, 
           {user: 'jill', text: 'I agree ...'}]}]},

    {publisher: 'Grosset',
     author: 'Carolyn Keene',
     title: 'The Secret of Kane',
     year: 1930}
  ]

This example shows an array of JSON objects. Arrays are delimited by brackets '[]' and objects are delimited by braces '{}'. Objects contain name:value pairs or members, where the value can be an atomic type or a nested value. In contrast to XML, the type of an atomic value is always known in JSON. Here, each top-level object represents a book and its reviews. The 'reviews' for a book object is an array and each entry in the array corresponds to a review. Each review consists of a 'rating', its 'user', and the text of the review. Each review can also contain a 'discussion', which itself can be discussed, forming a discussion thread.

It is important to point out that the Jaql data model is actually a superset of JSON. The only difference is that atomic types like date have been added to Jaql to make certain database operations more efficient. Also, for readability, the names in name:value pairs do not always need to be quoted. When data needs to be exported in standard JSON format, Jaql can convert non-standard atomic types to strings and add quotes to names. For the remainder of this document, we will use the Jaql data model and JSON interchangeably, although strictly speaking the Jaql data model is a superset of JSON.

Reading and Writing JSON Data

Using Jaql, JSON data can be stored and retrieved from a variety of data sources including ordinary files. Jaql queries can take a collection as input and generate a new collection as output, where a collection corresponds to a JSON array. An example using an ordinary file to store a collection is as follows:

  // Example 1. Write to a file named 'hey.dat'.
  localWrite('hey.dat', [{text: 'Hello World'}]);
	
  // Read it back...
  localRead('hey.dat');

Here, a single object with the 'Hello World' string is being written to a file called 'hey.dat' in the current directory. It is also possible to read and write JSON data to Hadoop HDFS files and HBase tables. We have made integration with Hadoop a priority by having all data types implement the WritableComparable interface. By integrating Jaql with HDFS and HBase, we are able to store JSON data and process it in parallel using Hadoop's map/reduce framework.

Examples using Hadoop are provided below. Our second example writes to an HDFS SequenceFile. A SequenceFile is a collection of key-value pairs. Jaql's hadoopWrite() function only writes data into the value field, leaving the key field empty. In example 2, the input data is represented as a literal, but in general the input can be an expression that is the result of a Jaql query.

  // Example 2. Write to a Hadoop SequenceFile named: 'orders.dat'.
  hadoopWrite('orders.dat', [
    {order: 1, cust: 'c1', items: [ 
      {item: 1, qty: 2},
      {item: 3, qty: 6},
      {item: 5, qty: 10}]},
    {order: 2, cust: 'c2', items: [
      {item: 2, qty: 1},
      {item: 5, qty: 2},
      {item: 7, qty: 3}]},
    {order: 3, cust: 'c1', items: [
      {item: 1, qty: 2},
      {item: 7, qty: 14},
      {item: 5, qty: 10}]} 
  ]);
	
  // Read it back...
  hadoopRead('orders.dat');

Our third example writes to an HBase table. An HBase table is a collection of records, with each record containing a primary key and a set of column name-value pairs. Column names in HBase are of type Text, while column values are simply byte arrays. HBase uses a two-part naming scheme for columns of the form 'column family:column'. In Jaql, we use a sharp '#' instead of a colon ':' to separate column families from columns, since the colon is already used as a separator in JSON. If a column family is not specified, a special 'default' column family is used, as in the following example.

  // Example 3. Write to an HBase table named 'webcrawl'.
  hbaseWrite('webcrawl', [
    {key: "www.cnn.com", page:'...', rank: 0.9,
     inlinks:[
       {link: 'www.news.com', anchor: 'newsite'},
       {link: 'www.jscript.com', anchor: 'look here'}]},
    {key: "www.json.org", page:'...', rank: 0.8}
  ]);
	
  // Read it back...
  hbaseRead('webcrawl');

In hbaseWrite(), all objects in the input are written to a single HBase table. Each top-level JSON object is stored as an Hbase record with the specified key, and each name:value pair in the object is stored as a separate column-value pair in the record. Values are serialized as a byte array. Note that only the outermost name:value pairs in top-level objects are stored as separate columns in Hbase. Nested arrays and objects are serialized within these. In example 3, two HBase records are written. Each record is used to store the content, rank, and in-links (if any) of a web page.

The Jaql Query Language

Jaql is a functional query language that provides users with a simple, declarative syntax to do things like filter, join, and group JSON data. Jaql also allows user-defined functions to be written and used in expressions. Let's begin with some examples using the book data presented earlier: Our first two queries illustrate selection, projection, and filtering.

	
  // Write the books collection from data above.
  // hadoopWrite('books', [ {publisher... ] )
  
  // Query 1. Return the publisher and title of each book.
  for $b in hadoopRead('books')
  return {$b.publisher, $b.title};
	
  // result...
  [  
    {publisher: 'Scholastic', title: 'Deathly Hallows'},
    {publisher: 'Scholastic', title: 'Chamber of Secrets'},
    {publisher: 'Scholastic', title: 'Sorcerers Stone'},
    {publisher: 'Scholastic', title: 'Monster Blood IV'},
    {publisher: 'Grosset',    title: 'The Secret of Kane'}
  ];
  
  // Query 2. Find the authors and titles of books that have received
  // a review.
  for $b in hadoopRead('books')
  where exists($b.reviews)
  return {$b.author, $b.title};
    
  // result...
  [  
    {author: 'J. K. Rowling', title: 'Chamber of Secrets'},
    {author: 'R. L. Stine',   title: 'Monster Blood IV'}
  ];

Query 1 uses a for expression to loop over the books collection. Each book is bound to the variable $b and the return clause constructs an object that is placed in the output array. In Query 2, the where clause causes only books with reviews to be output.

		   	
  // Query 3. Project the title from each book using the short-hand
  // projection notation. 
  hadoopRead('books').[title];
  
  // Query 3-alt. Or using equivalent the long-hand notation.
  for $b in hadoopRead('books')
  return $b.title;

  // result...
  [ "Deathly Hallows",
    "Chamber of Secrets",
    "Sorcerers Stone",
    "Monster Blood IV",
    "The Secret of Kane"];

  // Query 4. Find authors, titles, and reviews of books where a review
  // prompted a discussion by the user 'ben'.
  for $b in hadoopRead('books')
  where exists(for $r in $b.reviews
               where 'ben' in $r.discussion.[user]
               return true)
  return {$b.author, $b.title, $b.reviews};
	
  // result...
  [  
    {author: 'R. L. Stine',
     title: 'Monster Blood IV',
     reviews: [
       {rating: 8, user: 'rob', review: 'High on my list...'},
       {rating: 2, user: 'mike', review: 'Not worth the paper ...',
        discussion: [
          {user: 'ben', text: 'This is too harsh...'},
          {user: 'jill', text: 'I agree ...'}]}]}
  ];

Query 3 shows the short-hand notation for projecting from an array of objects, as well as the equivalent long-hand notation.

Query 4 is a more complicated variation of Query 2. Now the predicate is based on values that are nested three levels deep (books, reviews, discussion). A nested for-where-return tests whether book $b contains a review that prompted a discussion by the user 'ben'.

  	 		
  // Query 5. Find the authors and titles of books that had an
  // average review rating over 5.
  for $b in hadoopRead('books')
  where avg($b.reviews.[rating]) > 5
  return {$b.author, $b.title};
		 	
  // result...
  [  
    {author: 'J. K. Rowling', title: 'Chamber of Secrets'}
  ];

Query 5 filters books by their average rating. For each book $b, the syntax $b.reviews.[rating] creates an array of review ratings for each book $b. The average rating for book $b is then computed and tested to see if it is greater than 5.

  // Query 6. Show how many books each publisher has published.
  group $b in hadoopRead('books') by $p = $b.publisher into $pubs
  return {publisher: $p, num: count($pubs)};
	
  // result...
  [  
    {publisher: 'Scholastic', num: 4},
    {publisher: 'Grosset', num: 1}
  ];

Query 6 illustrates grouping and aggregation. The group expression partitions an input collection into groups. Books are partitioned into groups by the grouping value $p, which is set to $b.publisher. The $pubs variable is bound to an array associated with each group and used to count how many books each publisher has published.

  // Query 7. Find the publisher who published the most books.
  let 
    // group books by publisher and compute their book count
    $g = group $b in hadoopRead('books') by $p = $b.publisher into $pubs
         return {publisher: $p, num: count($pubs)},  
    
    // sort publishers by descending book count
    $sorted = sort $i in $g by ($i.num desc)

  // return the top publisher
  return $sorted[0];
	 
  // result...
  {publisher: 'Scholastic', num: 4};

Query 7 illustrates grouping and sorting. It shows how group and sort can be used to find the publisher who published the most books. The let expression is used to assign expressions to variables and make it easier to write the query. Like in PigLatin, the group expression can also be used to co-group multiple collections. To illustrate co-group, we define the collections X and Y:

  write('X',
    [
      {a:1, b:1}, 
      {a:1, b:2}, 
      {a:2, b:3}, 
      {a:2, b:4}
    ] );

  write('Y',
    [
      {c:2, d:1}, 
      {c:2, d:2}, 
      {c:3, d:3}, 
      {c:3, d:4}
    ] );

Query 8 co-groups X and Y. Co-group is similar to a group over the union of the input collections, except that two arrays are generated for each group, one for each input collection.

  // Query 8. Co-group X and Y.
  group $x in hadoopRead('X') by $g = $x.a into $xgroup,
        $y in hadoopRead('Y') by $g = $y.c into $ygroup
  return {g: $g, b: $xgroup.[b], d: $ygroup.[d]};

  // result...
  [ 
    {g: 1, b: [1,2], d: []},
    {g: 2, b: [3,4], d: [1,2]},
    {g: 3, b: [],    d: [3,4]}
  ];

In Query 8, both X and Y need to be grouped on the same value, namely $g. The syntax $xgroup.[b] and $ygroup.[d] in the return clause projects the 'b' and 'd' values in the arrays created for each X and Y group, respectively.

Joins can be expressed using co-group, but the syntax can get a little messy. Since joins are common, special syntax has been introduced for them:

  // Query 9. Join X and Y.
  join $x in hadoopRead('X') on $x.a,
       $y in hadoopRead('Y') on $y.c
  return {$x.a, $x.b, $y.c, $y.d};
  
  // result...
  [ 
    {a: 2, b: 3, c: 2, d: 1},
    {a: 2, b: 3, c: 2, d: 2},
    {a: 2, b: 4, c: 2, d: 1},
    {a: 2, b: 4, c: 2, d: 2},
  ];

Query 9 shows an inner join on 'a' and 'c' in X and Y, respectively. Although it has not be shown, left-, right-, and full-outer joins can also be specified using modifiers.

Extending Jaql

Jaql has been designed to be extensible. In this section, we briefly describe some interesting functions that we have added to Jaql. The first example is mapReduce(), which takes a Jaql description of the map and reduce functions as input, and uses it to run a map/reduce job in Hadoop.

  // Write to an HDFS file called 'sample'.
  hadoopWrite('sample.dat', [
    {x: 0, text: 'zero'},
    {x: 1, text: 'one'},
    {x: 0, text: 'two'},
    {x: 1, text: 'three'},
    {x: 0, text: 'four'},
    {x: 1, text: 'five'},
    {x: 0, text: 'six'},
    {x: 1, text: 'seven'},
    {x: 0, text: 'eight'}
  ]);
	
  // Run a map/reduce job that counts the number objects
  // for each 'x' value.
  mapReduce( 
    { input:  {type: 'hdfs', location: 'sample.dat'}, 
      output: {type: 'hdfs', location: 'results.dat'}, 
      map:    fn($i) { [ [$i.x, 1] ] }, 
      reduce: fn($x, $v) { [ {x: $x, num: count($v)} ] }
    });
	
  // Read the results...
  hadoopRead('results.dat');
	
  // result...
  [
    {x: '0', num: 5},
    {x: '1', num: 4}
  ];

This example groups the input on 'x' and counts the number of objects in each group. The map function must specify how to extract a key-value pair, and the reduce function must specify how to aggregate the values for a given key. Here, the key value is set to $i.x and count($v) is used to count the values $v associated with each key. Note that both the map and reduce functions need to output an array because each input is allowed to produce multiple outputs.

We have found mapReduce() to come in handy for many map/reduce jobs. In general, the body of the map and reduce functions can be Jaql expressions, so a lot of expressive power is available to programmers who want to work with JSON data. By working in Jaql, programmers are freed from all the little details required to actually set up and run a map/reduce job.

The mapReduce() function touches on a interesting feature of Jaql, namely that Jaql is a second-order language. This allows function definitions to be assigned to variables -- even stored in the database -- and later evaluated, as shown in the following example.

 
  // Define a function that returns the most recent book
  // written by a given author.
  let $mostRecent = 
    fn($author) {
      let 
        $authorsBooks =
           for $b in hadoopRead('books')
           where $b.author = $author
           return {title: $b.title, year: $b.year},
         $sorted = sort $b in $authorsBooks by ($b.year desc)
        
      return $sorted[0].title
    }
    
		// Invoke the function.
  return $mostRecent('J. K. Rowling');

  // result...
  'Deathly Hallows';

Finally, to work with external data sources, we recently added a jaqlGet() function that can retrieve JSON data from a URL. Below are two examples of jaqlGet() that get data from Freebase and Yahoo! Traffic.

  // Get albums recorded by "The Police" using Freebase.
  let
    $artist = "The Police",
    $freebase = 
      httpGet('http://www.freebase.com/api/service/mqlread', 
        { queries: 
            serialize(
              { myquery: 
                { query:
                  [{ type: "/music/artist",
                     name: $artist,
                     album: []
                   }] 
                }
              }
            ) })[0]
  return
    unnest $freebase.myquery.result.[album];
      
  // result...
  [ "Outlandos d\'Amour",
    "Reggatta de Blanc",
    "Zenyatta Mondatta",
    "Ghost in the Machine",
    "Synchronicity",
    "Every Breath You Take: The Singles",
    "Greatest Hits",
    "Message in a Box: The Complete Recordings (disc 1)",
    "Message in a Box: The Complete Recordings (disc 2)",
    "Message in a Box: The Complete Recordings (disc 3)",
    "Message in a Box: The Complete Recordings (disc 4)",
    "Live! (disc 1: Orpheum WBCN/Boston Broadcast)",
    "Live! (disc 2: Atlanta/Synchronicity Concert)",
    "Every Breath You Take: The Classics",
    "Their Greatest Hits",
    "Can\'t Stand Losing You",
    "Roxanne \'97 (Puff Daddy remix)",
    "Roxanne \'97"];
  
  // Get traffic incidents from Yahoo!.
  let
    $trafficData = 
      httpGet('http://local.yahooapis.com/MapsService/V1/trafficData',
        { appid:  "YahooDemo",
          street: "701 First Street",
          city:   "Sunnyvale",
          state:  "CA",
          output: "json"
        })[0]
  return  
    for $i in $trafficData.ResultSet.Result
    return $i.Title;
  
  // result...
  [ "Road construction, on US-101 NB at FAIROAKS AVE TONBTO NB MATHILDA",
    "Road construction, on CA-85 SB at MOFFETT BLVD",
    "Road construction, on CA-237 EB at MATHILDA AVE TOEBTO FAIR OAKS AVE",
    "Road construction, on CA-237 WB at CROSSMAN AVE",
    "Road construction, on I-880 at GATEWAY BLVD"];

Implementation Status

An initial prototype of Jaql for Hadoop is available for downloading. Our prototype has been tested to work in stand-alone mode with Hadoop's mini-cluster, or on Hadoop's distributed cluster. Currently, only simple group expressions are automatically rewritten to use the mapReduce() function when their input is from HDFS or HBase. We will soon have more expressions, including for and join, translated to mapReduce(). This enables Hadoop users to write simple Jaql queries over JSON data and run them in parallel as map/reduce jobs. Of course, mapReduce() can be directly used to take advantage of the cluster.

There is no shortage of interesting problems left to solve in Jaql, and we welcome help on it. These include: