CloudMdsQL Compiler

Home | Technical Overview | Use Case Example | Publications | Participants | Download


Use Case Example

To illustrate the details of CloudMdsQL query processing, we assume a setup with three databases – a relational database, a MongoDB database, and Sparksee (a graph database with Python API). To be able to embed sub-queries against these data stores, we assume the query engine interfaces wrappers for each data store, as follows. The wrapper for the relational database accepts SQL subqueries. The wrapper for MongoDB accepts SQL subqueries but translates them to native MongoDB query invocations. The wrapper for Sparksee accepts as raw text the Python code that needs to be executed against the graph database using its Python client API in the environment of a Python interpreter embedded within the wrapper.

Data Organization

The data in the three databases are organized as follows:
  • DB1 is a relational database storing information about scientists in the table named Scientists as follows:
Name      Affiliation  Country
RicardoUPMSpain
MartinCWINetherlands
PatrickINRIAFrance
.........
  • DB2 is a MongoDB database containing the following collections of publications and reviews:

Publications(
{id:1, title:'Snapshot Isolation', author:'Ricardo', date:'2012-11-10'},
{id:5, title:'Principles of DDBS', author:'Patrick', date:'2011-02-18'},
{id:8, title:'Fuzzy DBs', author:'Boyan', date:'2012-06-29'},
{id:9, title:'Graph DBs', author:'Larri', date:'2013-01-06'}
)

Reviews (
{pub_id:1, reviewer: 'Martin', date: '2012-11-18', review: '…text…'},
{pub_id:5, reviewer: 'Rui', date: '2013-02-28', review: '…text…'},
{pub_id:5, reviewer: 'Ricardo', date: '2013-02-24', review: '…text…'},
{pub_id:8, reviewer: 'Rui', date: '2012-12-02', review: '…text…'},
{pub_id:9, reviewer: 'Patrick', date: '2013-01-19', review: '…text…'}
)

  • DB3 is a graph database (Sparksee) representing a social network with nodes representing persons and 'friend-of' links between them:

Query Example

The following CloudMdsQL query involves all the three databases and aims to discover 'conflicts of interest in publications from Inria reviewed in 2013' (a conflict of interest about a publication is assumed to exist if the author and reviewer are friends or friends-of-friends in the social network). The subquery against DB3 uses the Sparksee Python API and user-defined functions and in particular, a function FindShortestPathByName defined over a graph object, which seeks the shortest path between two nodes by performing breadth-first search, referring the nodes by their 'name' attributes and limited to a maximal length of the sought path.

scientists( name string, affiliation string )@DB1 = (
   SELECT name, affiliation
   FROM scientists
)
pubs_revs( id int, title string, author string, reviewer string,
   review_date timestamp )@DB2 =
(
   SELECT p.id, p.title, p.author, r.reviewer, r.date
   FROM publications p, reviews r
   WHERE p.id = r.pub_id
)
friendships( person1 string, person2 string, level int
   JOINED ON person1, person2 REFERENCING OUTER AS Outer
   WITHPARAMS maxlevel int )@DB3 =
{*
   for (p1, p2) in CloudMdsQL.Outer:
     sp = graph.FindShortestPathByName( p1, p2, $maxlevel )
     if sp.exists():
       yield (p1, p2, sp.get_cost())
*}
friendship_levels( level int, friendship string
   WITHPARAMS maxlevel int )@python =
{*
   for i in range(0, $maxlevel):
     yield (i + 1, 'friend' + '-of-friend' * i)
*}

SELECT pr.id, pr.title, pr.author, pr.reviewer, l.friendship
FROM scientists s, pubs_revs pr,
     friendships(2) f, friendship_levels(2) l
WHERE s.name = pr.author
   AND pr.author = f.person1 AND pr.reviewer = f.person2
   AND f.level = l.level
   AND pr.review_date BETWEEN '2013-01-01' AND '2013-12-31'
   AND s.affiliation = 'INRIA'

This query contains two SQL subqueries – one against a relational database and the other against a document database. The parameterized native named table expression friendships against the graph database defines a relation that represents the level of friendship between two persons (expressed by the length of the shortest path between them). The parameter maxlevel indicates a maximal value for the length of the sought path; the expression is invoked with actual value of the parameter maxlevel=2, meaning that only relationships of type direct-friend or friend-of-friend will be found. The parameterized Python named table expression friendship_levels generates synthetic data containing textual representations of friendship levels between 1 and maxlevel. The main select statement specifies the join operations to integrate data retrieved from the three data stores.


This research has been partially funded by the European Commission under the FP7 programme project #611068