Cube Dataset (Deprecated)
Overview
A Cube dataset is an implementation of an OLAP Cube that is pre-packaged with CDAP. Cube datasets store multidimensional facts and provide a querying interface for retrieving the data. Additionally, Cube datasets allow you to explore the data stored in the Cube.
Storing Data
A Cube dataset stores multidimensional CubeFacts that contain dimension values, measurements, and an associated timestamp:
public class CubeFact {
public CubeFact(long timestamp) {...}
public CubeFact addDimensionValue(String name, String value) {...}
public CubeFact addMeasurement(String name, MeasureType type, long value) {...}
// ...
}
Where:
timestamp is an epoch in seconds
Dimension values are key-value pairs of a dimension name and value
Measurements have a name, type, and value
An example of a fact could be data collected by an OS monitoring tool from a server at a specific time:
timestamp=1429929000 (equivalent to Sat, 25 Apr 2015 02:30:00 GMT)
dimensions
  rackId="rack1"
  serverId="server0002"
measurements
  cpu.used(gauge)=60
  disk.reads(counter)=23244
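The fact above can be sketched in code. The nested CubeFact and MeasureType types below are illustrative stand-ins that mirror the signatures shown earlier so that the example compiles on its own; they are not the CDAP classes, and the enum constant names GAUGE and COUNTER are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative stand-ins only: these nested types mirror the signatures
// shown above so the sketch compiles on its own. They are NOT the CDAP classes.
class CubeFactExample {
  enum MeasureType { GAUGE, COUNTER } // constant names are assumed

  static class CubeFact {
    final long timestamp; // epoch seconds
    final Map<String, String> dimensions = new LinkedHashMap<>();
    final Map<String, Long> measurements = new LinkedHashMap<>();
    final Map<String, MeasureType> measureTypes = new LinkedHashMap<>();

    CubeFact(long timestamp) {
      this.timestamp = timestamp;
    }

    CubeFact addDimensionValue(String name, String value) {
      dimensions.put(name, value);
      return this; // returning this allows the calls to be chained
    }

    CubeFact addMeasurement(String name, MeasureType type, long value) {
      measurements.put(name, value);
      measureTypes.put(name, type);
      return this;
    }
  }

  // Builds the server-monitoring fact described above.
  static CubeFact serverFact() {
    return new CubeFact(1429929000L)
        .addDimensionValue("rackId", "rack1")
        .addDimensionValue("serverId", "server0002")
        .addMeasurement("cpu.used", MeasureType.GAUGE, 60)
        .addMeasurement("disk.reads", MeasureType.COUNTER, 23244);
  }

  public static void main(String[] args) {
    CubeFact fact = serverFact();
    System.out.println(fact.timestamp + " " + fact.dimensions + " " + fact.measurements);
  }
}
```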
Currently, two types of measurements are supported: gauge and counter. A gauge measurement is for an absolute metric (it overwrites) while a counter measurement is for an incremental metric.
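The difference between the two measurement types can be illustrated with a minimal sketch. It assumes that writing a gauge replaces the stored value and writing a counter adds to it, as described above; the MeasureSemantics class and its in-memory map are hypothetical and only model a single aggregation cell, not how CDAP stores data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of one aggregation cell; not part of the CDAP API.
class MeasureSemantics {
  enum MeasureType { GAUGE, COUNTER }

  private final Map<String, Long> values = new HashMap<>();

  void apply(String name, MeasureType type, long value) {
    if (type == MeasureType.GAUGE) {
      values.put(name, value);              // gauge: the new value overwrites the old one
    } else {
      values.merge(name, value, Long::sum); // counter: the new value is added to the total
    }
  }

  long get(String name) {
    return values.getOrDefault(name, 0L);
  }

  public static void main(String[] args) {
    MeasureSemantics cell = new MeasureSemantics();
    cell.apply("cpu.used", MeasureType.GAUGE, 60);
    cell.apply("cpu.used", MeasureType.GAUGE, 45);
    cell.apply("disk.reads", MeasureType.COUNTER, 100);
    cell.apply("disk.reads", MeasureType.COUNTER, 50);
    System.out.println(cell.get("cpu.used"));   // prints 45
    System.out.println(cell.get("disk.reads")); // prints 150
  }
}
```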
Writing Data
The Cube Dataset API provides methods to write either a single fact or multiple facts at once:
public interface Cube extends Dataset, BatchWritable<Object, CubeFact> {
void add(CubeFact fact);
void add(Collection<? extends CubeFact> facts);
// ...
}
Cube Configuration
A Cube dataset allows for querying a pre-aggregated view of the data. That view needs to be configured before any data is written to the Cube. Currently, a view is configured with a list of dimensions and a list of required dimensions using the Dataset Properties.
A Cube can have multiple views configured. They can be altered by updating the dataset properties using the Dataset Microservices.
Here's an example of configuring a pre-aggregated view via the dataset properties:
On the bottom are two Cube dataset properties that correspond to a logical view (aggregation) that can be defined with the SQL-like statement on the top.
In this example, the view is configured with two dimensions: rack and server. Values for both are required: the data of a CubeFact is aggregated in this view only if a CubeFact has non-null values for both dimensions.
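As a rough sketch of the view described above, the following builds two dataset properties and applies the required-dimensions rule to a fact's dimension values. The property key pattern (dataset.cube.aggregation.<name>.dimensions and .requiredDimensions) and the view name agg1 are assumptions made for illustration, so check them against your CDAP version; the qualifies helper is hypothetical and only models the rule stated in the text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; not part of the CDAP API.
class ViewConfigSketch {
  // ASSUMPTION: the key pattern and the view name "agg1" are illustrative.
  static Map<String, String> viewProperties() {
    Map<String, String> props = new LinkedHashMap<>();
    props.put("dataset.cube.aggregation.agg1.dimensions", "rack,server");
    props.put("dataset.cube.aggregation.agg1.requiredDimensions", "rack,server");
    return props;
  }

  // A fact is aggregated in a view only if every required dimension is non-null.
  static boolean qualifies(Map<String, String> factDimensions, List<String> required) {
    for (String dimension : required) {
      if (factDimensions.get(dimension) == null) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Map<String, String> props = viewProperties();
    List<String> required = Arrays.asList(
        props.get("dataset.cube.aggregation.agg1.requiredDimensions").split(","));

    Map<String, String> complete = new HashMap<>();
    complete.put("rack", "rack1");
    complete.put("server", "server0002");

    Map<String, String> partial = new HashMap<>();
    partial.put("rack", "rack1"); // no server value

    System.out.println(qualifies(complete, required)); // prints true
    System.out.println(qualifies(partial, required));  // prints false
  }
}
```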
In addition to configuring aggregation views, a Cube can be configured to aggregate for multiple time resolutions based on the dataset.cube.resolutions property, which takes a comma-separated list of resolution values in seconds, such as 1,60,3600 (corresponding to 1 second, 1 minute, and 1 hour resolutions):
dataset.cube.resolutions=1,60,3600
By default, if no dataset.cube.resolutions property is provided, a resolution of 1 second is used.
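To make the effect of multiple resolutions concrete, here is a small sketch that parses a resolutions value and computes the time bucket a timestamp would fall into at each resolution. Aligning a timestamp down to the start of its bucket is an assumption about how aggregation by resolution typically works, not a statement of CDAP internals; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the CDAP API.
class ResolutionSketch {
  // Parses a value such as "1,60,3600" into resolutions in seconds.
  static List<Integer> parseResolutions(String property) {
    List<Integer> resolutions = new ArrayList<>();
    for (String part : property.split(",")) {
      resolutions.add(Integer.parseInt(part.trim()));
    }
    return resolutions;
  }

  // ASSUMPTION: a timestamp is rounded down to the start of its bucket.
  static long bucketStart(long timestampSeconds, int resolutionSeconds) {
    return timestampSeconds - (timestampSeconds % resolutionSeconds);
  }

  public static void main(String[] args) {
    long timestamp = 1429929029L; // 29 seconds after the example fact above
    for (int resolution : parseResolutions("1,60,3600")) {
      System.out.println(resolution + "s -> " + bucketStart(timestamp, resolution));
    }
  }
}
```

With these inputs, the 1-second bucket is the timestamp itself, the 1-minute bucket starts at 1429929000, and the 1-hour bucket starts at 1429927200.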
Querying Data
Querying is the most useful feature of a Cube dataset. You can slice, dice, and drill down into the data of the Cube. Queries are performed through the query methods of the Cube API.
To understand the CubeQuery interface, let's look at an example:
On the right is an example of how to build a Java CubeQuery corresponding to the SQL-like statement shown on the left.
In this example, we query two measurements, cpu.used and disk.reads, and use the max and sum functions to perform aggregation if needed. The query is performed on the rack+server aggregated view at 1-minute resolution. The data is selected for those records that have a rack dimension value of rack1 and that fall within the given time range. The data is grouped by server values, and each resulting time series is limited to 100 data points.
The result of the query is a collection of TimeSeries. Each time series corresponds to a specific measurement and a combination of values for the dimensions specified in the groupBy part.
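The semantics of such a query can be sketched over in-memory data. The code below is not the CubeQuery API: the Point class, the query method, and the sample data are hypothetical, and the choice to keep the earliest data points when applying the limit is an assumption. It only shows the steps described above: select by rack and time range, bucket by resolution, group by server, aggregate with a function, and limit each series.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical in-memory model of the query semantics; not the CDAP API.
class CubeQuerySemantics {
  static class Point {
    final long ts; final String rack; final String server;
    final String measure; final long value;
    Point(long ts, String rack, String server, String measure, long value) {
      this.ts = ts; this.rack = rack; this.server = server;
      this.measure = measure; this.value = value;
    }
  }

  // Returns one series per server: bucket start (epoch seconds) -> aggregated value.
  static Map<String, TreeMap<Long, Long>> query(
      List<Point> data, String measure, BinaryOperator<Long> aggregation, String rack,
      long startTs, long endTs, int resolutionSecs, int limit) {
    Map<String, TreeMap<Long, Long>> series = new TreeMap<>();
    for (Point p : data) {
      boolean selected = p.measure.equals(measure) && p.rack.equals(rack)
          && p.ts >= startTs && p.ts <= endTs;
      if (!selected) {
        continue;
      }
      long bucket = p.ts - (p.ts % resolutionSecs);      // align to the resolution
      series.computeIfAbsent(p.server, k -> new TreeMap<>()) // group by server
          .merge(bucket, p.value, aggregation);          // aggregate within a bucket
    }
    for (TreeMap<Long, Long> points : series.values()) {
      while (points.size() > limit) {
        points.pollLastEntry(); // ASSUMPTION: keep the earliest 'limit' points
      }
    }
    return series;
  }

  static List<Point> sampleData() {
    long base = 1429929000L;
    List<Point> data = new ArrayList<>();
    data.add(new Point(base + 5, "rack1", "server0001", "cpu.used", 60));
    data.add(new Point(base + 30, "rack1", "server0001", "cpu.used", 80));
    data.add(new Point(base + 65, "rack1", "server0001", "cpu.used", 40));
    data.add(new Point(base + 10, "rack1", "server0002", "disk.reads", 100));
    data.add(new Point(base + 20, "rack1", "server0002", "disk.reads", 50));
    data.add(new Point(base + 10, "rack2", "server0003", "cpu.used", 99));
    return data;
  }

  public static void main(String[] args) {
    long start = 1429929000L;
    long end = start + 120;
    System.out.println(query(sampleData(), "cpu.used", Long::max, "rack1", start, end, 60, 100));
    System.out.println(query(sampleData(), "disk.reads", Long::sum, "rack1", start, end, 60, 100));
  }
}
```

In this sample, the two cpu.used readings in the first minute collapse to their maximum, the two disk.reads readings collapse to their sum, and the rack2 reading is excluded by the rack filter.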
Exploring Data
Often, in order to construct a useful query, you first have to explore and discover what data is available in the Cube. For that, the Cube provides exploration APIs to search for available dimension values and measurements in a specific selection of the Cube data.
The findDimensionValues method finds all dimension values that the data selection defined by CubeExploreQuery has, in addition to those specified in the CubeExploreQuery itself. Each returned value can be added to the original CubeExploreQuery to further drill down into the Cube data.
The findMeasureNames method finds all measurements that exist in the data selection specified within a CubeExploreQuery.
CubeExploreQuery is performed across all aggregation views and allows you to configure the time range, the resolution, the dimension values to filter by, and a limit on the number of returned results:
This query defines the data selection as 1-minute resolution aggregations that have a rack dimension with the value rack1 and that fall within the specified time range. It limits the number of results to 100.
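The behavior of the two exploration methods can be modeled with a short in-memory sketch. It is not the CDAP implementation: the Fact class and the static methods below are hypothetical stand-ins that only borrow the method names from the text, and the time range, resolution, and result limit are left out for brevity. A dimension filter such as rack=rack1 defines the data selection; the sketch then reports the other dimension values and the measurement names found in that selection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model of the exploration semantics; not the CDAP API.
class CubeExploreSemantics {
  static class Fact {
    final Map<String, String> dimensions = new LinkedHashMap<>();
    final Set<String> measureNames = new TreeSet<>();
    Fact(String rack, String server, String... measures) {
      dimensions.put("rack", rack);
      dimensions.put("server", server);
      measureNames.addAll(Arrays.asList(measures));
    }
  }

  // A fact belongs to the selection if it carries every filtered dimension value.
  static boolean matches(Fact fact, Map<String, String> filter) {
    for (Map.Entry<String, String> e : filter.entrySet()) {
      if (!e.getValue().equals(fact.dimensions.get(e.getKey()))) {
        return false;
      }
    }
    return true;
  }

  // Dimension values in the selection, excluding dimensions already in the filter.
  static Set<String> findDimensionValues(List<Fact> facts, Map<String, String> filter) {
    Set<String> found = new TreeSet<>();
    for (Fact fact : facts) {
      if (!matches(fact, filter)) {
        continue;
      }
      for (Map.Entry<String, String> e : fact.dimensions.entrySet()) {
        if (!filter.containsKey(e.getKey())) {
          found.add(e.getKey() + "=" + e.getValue());
        }
      }
    }
    return found;
  }

  // Measurement names that exist in the selection.
  static Set<String> findMeasureNames(List<Fact> facts, Map<String, String> filter) {
    Set<String> found = new TreeSet<>();
    for (Fact fact : facts) {
      if (matches(fact, filter)) {
        found.addAll(fact.measureNames);
      }
    }
    return found;
  }

  static List<Fact> sampleFacts() {
    List<Fact> facts = new ArrayList<>();
    facts.add(new Fact("rack1", "server0001", "cpu.used"));
    facts.add(new Fact("rack1", "server0002", "disk.reads"));
    facts.add(new Fact("rack2", "server0003", "net.in"));
    return facts;
  }

  public static void main(String[] args) {
    Map<String, String> filter = Collections.singletonMap("rack", "rack1");
    System.out.println(findDimensionValues(sampleFacts(), filter));
    System.out.println(findMeasureNames(sampleFacts(), filter));
  }
}
```

Adding one of the returned values (for example server=server0001) to the filter narrows the selection further, which is the drill-down step described above.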
AbstractCubeHttpHandler
CDAP comes with an AbstractCubeHttpHandler that can be used to quickly add a Service to your application that provides Microservices on top of your Cube dataset. It is an abstract class with a single method to be implemented by its subclass, which returns the Cube dataset to query:
Here’s an example of an application with a Cube dataset and an HTTP Service that provides RESTful access to it:
Example of the query in JSON format:
Example of the response in JSON format (pretty-printed to fit):
Examples of Using Cube Dataset
An example of using a Cube Dataset is included in the How To article Data Analysis with OLAP Cube.
Created in 2020 by Google Inc.