Types of Datasets (Deprecated)

A dataset abstraction is defined by a Java class that implements the DatasetDefinition interface. The implementation of a dataset typically relies on one or more underlying (embedded) datasets. For example, the IndexedTable dataset can be implemented by two underlying Table datasets—one holding the data and one holding the index.

We distinguish three categories of datasets: core, system, and custom datasets:

  • The core datasets of the CDAP are Table and FileSet. Their implementations may use internal CDAP classes hidden from developers.

  • A system dataset is bundled with the CDAP and is built around one or more underlying core or system datasets to implement a specific data pattern.

  • A custom dataset is implemented by you and can have arbitrary code and methods. It is typically built around one or more Tables, FileSets, or other datasets to implement a specific data pattern.

Each dataset is associated with exactly one dataset implementation to manipulate it. Every dataset has a unique name and metadata that defines its behavior. For example, every IndexedTable has a name and indexes a particular column of its primary table: the name of that column is a metadata property of each dataset of this type.
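
As a rough illustration of the embedding idea described above, the following self-contained sketch models an indexed table with two in-memory maps, one standing in for the data Table and one for the index Table. It does not use the CDAP API: the class IndexedTableSketch, its method names, the use of String keys and values, and the whole-row replacement on write are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the CDAP IndexedTable: two plain maps stand in
// for the two underlying Table datasets (data and index).
class IndexedTableSketch {
    // Stand-in for the data Table: row key -> (column name -> value).
    private final Map<String, Map<String, String>> data = new HashMap<>();
    // Stand-in for the index Table: indexed value -> row keys that hold it.
    private final Map<String, Set<String>> index = new HashMap<>();
    // Name of the indexed column, playing the role of the metadata property.
    private final String indexColumn;

    IndexedTableSketch(String indexColumn) {
        this.indexColumn = indexColumn;
    }

    // Replaces the whole row (a simplification) and keeps the index in sync.
    void put(String rowKey, Map<String, String> columns) {
        Map<String, String> previous = data.put(rowKey, new HashMap<>(columns));
        if (previous != null) {
            // Remove the stale index entry left by the previous version of the row.
            String oldValue = previous.get(indexColumn);
            Set<String> oldKeys = (oldValue == null) ? null : index.get(oldValue);
            if (oldKeys != null) {
                oldKeys.remove(rowKey);
                if (oldKeys.isEmpty()) {
                    index.remove(oldValue);
                }
            }
        }
        String newValue = columns.get(indexColumn);
        if (newValue != null) {
            index.computeIfAbsent(newValue, v -> new TreeSet<>()).add(rowKey);
        }
    }

    // Reads a row by its key; returns an empty map if the row does not exist.
    Map<String, String> get(String rowKey) {
        Map<String, String> row = data.get(rowKey);
        return row == null
            ? Collections.<String, String>emptyMap()
            : Collections.unmodifiableMap(row);
    }

    // Looks up the keys of all rows whose indexed column has the given value.
    Set<String> readByIndex(String value) {
        Set<String> keys = index.get(value);
        return keys == null
            ? Collections.<String>emptySet()
            : Collections.unmodifiableSet(keys);
    }

    public static void main(String[] args) {
        IndexedTableSketch users = new IndexedTableSketch("city");
        Map<String, String> row = new HashMap<>();
        row.put("name", "Ada");
        row.put("city", "London");
        users.put("user1", row);
        System.out.println(users.readByIndex("London"));
    }
}
```

The indexed column is passed to the constructor to mirror the point that the column name is a property of each dataset instance rather than of the implementation class.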

Core Datasets

Tables and FileSets are the core datasets, and all other datasets are built using combinations of Tables and FileSets.

While these Tables have rows and columns similar to relational database tables, there are key differences:

  • Tables have no fixed schema. Unlike relational database tables where every row has the same schema, every row of a Table can have a different set of columns.

  • Because the set of columns is not known ahead of time, the columns of a row do not have a rich type. All column values are byte arrays and it is up to the application to convert them to and from rich types. The column names and the row key are also byte arrays.

  • When reading from a Table, one need not know the names of the columns: the read operation returns a map from column name to column value. It is, however, possible to specify exactly which columns to read.

  • Tables are organized in a way that the columns of a row can be read and written independently of other columns, and columns are ordered in byte-lexicographic order. They are also known as Ordered Columnar Tables.
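
The properties listed above can be sketched with a small in-memory model. This is not the CDAP Table API: the class TableSketch, its methods, and the conversion helpers are hypothetical names chosen for illustration. The sketch assumes unsigned byte-by-byte comparison as the meaning of byte-lexicographic order, and it leaves conversion between byte arrays and rich types to the caller.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of a schema-free, ordered columnar table.
class TableSketch {
    // Byte-lexicographic order, treating each byte as an unsigned value.
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    };

    // Row key -> (column name -> value); keys, names, and values are byte arrays.
    private final NavigableMap<byte[], NavigableMap<byte[], byte[]>> rows =
        new TreeMap<>(BYTE_ORDER);

    // Writes one column of one row, independently of the row's other columns.
    void put(byte[] row, byte[] column, byte[] value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>(BYTE_ORDER)).put(column, value);
    }

    // Reads all columns of a row without knowing their names in advance.
    NavigableMap<byte[], byte[]> read(byte[] row) {
        NavigableMap<byte[], byte[]> result = new TreeMap<>(BYTE_ORDER);
        NavigableMap<byte[], byte[]> stored = rows.get(row);
        if (stored != null) {
            result.putAll(stored);
        }
        return result;
    }

    // Reads only the named columns of a row; absent columns are skipped.
    NavigableMap<byte[], byte[]> readColumns(byte[] row, byte[]... columns) {
        NavigableMap<byte[], byte[]> all = read(row);
        NavigableMap<byte[], byte[]> selected = new TreeMap<>(BYTE_ORDER);
        for (byte[] column : columns) {
            byte[] value = all.get(column);
            if (value != null) {
                selected.put(column, value);
            }
        }
        return selected;
    }

    // Application-side conversions between rich types and byte arrays.
    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    static byte[] bytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    static String asString(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    static long asLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    public static void main(String[] args) {
        TableSketch table = new TableSketch();
        // Two rows with different sets of columns: there is no fixed schema.
        table.put(bytes("r1"), bytes("name"), bytes("Ada"));
        table.put(bytes("r1"), bytes("age"), bytes(36L));
        table.put(bytes("r2"), bytes("visits"), bytes(7L));
        for (byte[] column : table.read(bytes("r1")).keySet()) {
            System.out.println(asString(column));
        }
    }
}
```

Using a sorted map with an explicit comparator keeps the columns of each row in byte order, which is what allows range-style reads over column names in an ordered columnar layout.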

A FileSet represents a collection of files in the file system that share common attributes, such as format and schema, while abstracting away the underlying file system interfaces.

  • Every file in a FileSet is in a location relative to the FileSet's base directory.

  • Knowing a file's relative path, any program can obtain a Location for that file through a method of the FileSet dataset. It can then interact directly with the file's Location; for example, to write data to the Location, or to read data from it.

  • A FileSet can be used as the input or output to MapReduce. The MapReduce program need not specify the input and output format to use, or configuration for these—the FileSet dataset provides this information to the MapReduce runtime system.

  • PartitionedFileSet is an abstraction over FileSets that associates metadata (partitioning keys) with each file. A file can then be addressed through its metadata, so programs need not be aware of actual file paths.

  • TimePartitionedFileSet is a further variation that uses a timestamp as the partitioning key. Though it is not required that data in each partition be organized by time, each partition is assigned a logical time.

    This is in contrast to a Timeseries Table dataset, where time is the primary means by which data is organized, and both the data model and the schema that represents the data are optimized for querying and aggregating over time ranges.

    Time-partitioned FileSets are typically written in batch, into large files, every N minutes or hours, while a Timeseries Table is typically written in real time, one data point at a time.

  • CubeDataset is a multidimensional dataset, optimized for data warehousing and OLAP (Online Analytical Processing) applications. A Cube dataset stores multidimensional facts, provides a querying interface for retrieving data, and allows exploration of the stored data.

    See Cube Dataset for details on configuring a Cube dataset, writing to and reading from it, and querying and exploring the data in a cube.
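
The FileSet points above (locations relative to a base directory, and addressing files by a logical time instead of by path) can be sketched as follows. This is a minimal model built on java.nio.file, not the CDAP FileSet or TimePartitionedFileSet API: the class TimePartitionedFileSetSketch and its methods are hypothetical, and mapping each partition to a single relative path is a simplifying assumption.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical sketch of a file set whose partitions are keyed by logical time.
class TimePartitionedFileSetSketch {
    private final Path baseDir;
    // Partition metadata: logical time (milliseconds) -> path relative to baseDir.
    private final NavigableMap<Long, String> partitions = new TreeMap<>();

    TimePartitionedFileSetSketch(String baseDir) {
        this.baseDir = Paths.get(baseDir).normalize();
    }

    // Resolves a relative path against the base directory, rejecting any
    // path that would end up outside of it.
    Path getLocation(String relativePath) {
        Path resolved = baseDir.resolve(relativePath).normalize();
        if (!resolved.startsWith(baseDir)) {
            throw new IllegalArgumentException("Path escapes base directory: " + relativePath);
        }
        return resolved;
    }

    // Registers a partition under a logical time; the time is metadata and
    // does not have to match when the data was actually produced.
    void addPartition(long logicalTimeMillis, String relativePath) {
        getLocation(relativePath); // validates the path before registering it
        partitions.put(logicalTimeMillis, relativePath);
    }

    // Returns the locations of all partitions in [startInclusive, endExclusive),
    // so callers select files by time rather than by path.
    List<Path> getLocations(long startInclusive, long endExclusive) {
        List<Path> result = new ArrayList<>();
        for (String relativePath
                : partitions.subMap(startInclusive, true, endExclusive, false).values()) {
            result.add(getLocation(relativePath));
        }
        return result;
    }

    public static void main(String[] args) {
        TimePartitionedFileSetSketch files = new TimePartitionedFileSetSketch("/data/events");
        files.addPartition(1000L, "t1000/part-0");
        files.addPartition(2000L, "t2000/part-0");
        System.out.println(files.getLocations(0L, 1500L));
    }
}
```

Keeping the partition metadata in a sorted map makes a time-range lookup a single range query, which fits the batch-oriented access pattern described for time-partitioned FileSets.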

     

Created in 2020 by Google Inc.