CloudMdsQL Compiler

Home | Technical Overview | Use Case Example | Publications | Participants | Download

Language Concepts

An important concept in CloudMdsQL is the notion of “table expression”, which is used to represent a nested subquery and usually addresses a particular data store. Each subquery can then be expressed either in SQL, or in the data store's native query language. For example, the following simple CloudMdsQL query (Query1) contains two subqueries, defined by the named table expressions T1 and T2, and addressed respectively against the data stores DB1 (an SQL database) and DB2 (a MongoDB database):

-- Query1
T1(x int, y int)@DB1 = ( SELECT x, y FROM A )
T2(x int, z string)@DB2 = {*
  db.B.find( {$lt: {x, 10}}, {x:1, z:1, _id:0} )

SELECT T1.x, T2.z
WHERE T1.x = T2.x AND T1.y <= 3

The purpose of this query is to perform relational algebra operations (expressed in the main SELECT statement) on two datasets retrieved from a relational and a document database. The two subqueries are sent independently for execution against their data stores in order the retrieved relations to be joined by the common query engine. The SQL table expression T1 is defined by an SQL subquery, while T2 is a native expression using a MongoDB query that retrieves from the document collection B the attributes x and z of those documents for which x < 10. The subquery of expression T1 is subject to rewriting by pushing into it the filter condition y <= 3, specified in the main SELECT statement, thus reducing the amount of the retrieved data by increasing the subquery selectivity. The so retrieved datasets are then converted to relations following their corresponding signatures, so that the main CloudMdsQL SELECT statement can be processed with semantic correctness.

Query Compilation

At the query decomposition step, the query compiler transforms a CloudMdsQL query to a preliminary query execution plan (QEP). Meanwhile, the compiler identifies a forest of sub-trees within the query plan, each of which is associated to a certain data store. Each of these sub-plans is meant to be delivered to the corresponding wrapper, which has to translate it to a native query and execute it against its data store. The rest of the query execution plan is the part that will be handled by the query engine. Therefore, two main subsets of the global execution plan are outlined:
  • a forest of sub-trees that will be executed locally by each data store and
  • a common query plan that will be executed by the query engine with leaf nodes consuming the relations returned by each wrapper as result of sub-plan execution.
At query decomposition step, the boundary between the two subsets is preliminary and may be modified during the query optimization step by pushing selections from the common plan to sub-plans to improve the overall execution efficiency.

Query Optimization

The optimizer may use cost models provided by the wrappers, or user-defined cost functions in order to perform cost-based query optimization. In addition, the optimizer implements a rule-based optimization strategy, by mostly performing selection pushdowns and bind joins, as described below.
  • Selection Pushdowns
Pushing a selection operation inside a subquery is always attempted by the optimizer, because it delegates the selection directly to the data store, which allows to early reduce the size of data processed and retrieved from the data stores. This will finally result in rewriting the originally written subquery. For example, Query1 will be rewritten to the following equivalent query:

-- Query1 rewritten
T1(x int, y int)@DB1 = ( SELECT x, y FROM A WHERE y <= 3 )
T2(x int, z string)@DB2 = {*
  db.B.find( {$lt: {x, 10}}, {x:1, z:1, _id:0} )

SELECT T1.x, T2.z
WHERE T1.x = T2.x

However, this will only happen if the data store is capable of handling the selection condition. In order to verify the executability of sub-plans against data stores, the query engine must be aware of the capabilities of the each data store, to perform operations supported by the common data model. These capabilities are exported by each wrapper.
  • Bind Joins
Bind join is an efficient method for performing semi-joins across heterogeneous data stores that uses subquery rewriting to push the join conditions. The approach to perform a bind join is the following: the left-hand side relation is retrieved, during which the tuples are stored in an intermediate storage and the distinct values of the join attribute(s) are kept in a list of values, which will be passed as a filter to the right-hand side subquery. For example, let us consider the following CloudMdsQL query:

A(id int, x int)@DB1 = (SELECT, a.x FROM a)
B(id int, y int)@DB2 = (SELECT, b.y FROM b)
SELECT a.x, b.y FROM b JOIN a ON =

Let us assume that the query planner has decided to use the bind join method and that the join condition will be bound to the right-hand side of the join operation. First, the relation B is retrieved from the corresponding data store using its query mechanism. Then, the distinct values of are used as a filter condition in the query that retrieves the relation A from its data store. Assuming that the distinct values of are b1 … bn, the query to retrieve the right-hand side relation of the bind join uses the following SQL approach (or its equivalent according to the data store’s query language):

SELECT, a.x FROM a WHERE IN (b1, …, bn)

This research has been partially funded by the European Commission under the FP7 programme project #611068