Datasets (Deprecated)

Note: Datasets are deprecated and will be removed in CDAP 7.0.0.

Datasets store and retrieve data; they are your means of reading from and writing to CDAP’s storage. Instead of forcing you to manipulate data with low-level APIs, datasets provide higher-level abstractions and generic, reusable implementations of common data patterns.

The core datasets of CDAP are Tables and FileSets:

  • Unlike tables in relational database systems, CDAP Tables are not organized into rows with a fixed schema. They are optimized for efficient storage of semi-structured data, data with unknown or variable schema, or sparse data.

  • CDAP FileSets provide an abstraction over the raw file system, and associate properties such as the format or the schema with the files they contain. In addition, partitioned file sets allow addressing files by their partition metadata, removing the need for applications to be aware of actual file system locations.

Other datasets are built on top of tables and file sets. A dataset can implement specific semantics around a core dataset, such as a key/value Table or a counter Table. A dataset can also combine multiple datasets to create a complex data pattern. For example, an indexed Table can be implemented by using one Table for the data and a second Table for the index of that data.
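The indexed Table pattern described above can be sketched in plain Java. This is a minimal illustration, not CDAP's implementation: in-memory maps stand in for the two Tables, and the class name IndexedTableSketch and its methods (put, get, findRowKeys) are hypothetical names chosen for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustration only: in-memory maps stand in for CDAP Tables.
// IndexedTableSketch and its methods are hypothetical, not part of the CDAP API.
final class IndexedTableSketch {
    // "Data table": row key -> (column name -> value). Rows need not share columns.
    private final Map<String, Map<String, String>> data = new TreeMap<>();
    // "Index table": value of the indexed column -> row keys that hold that value.
    private final Map<String, Set<String>> index = new HashMap<>();
    private final String indexedColumn;

    IndexedTableSketch(String indexedColumn) {
        this.indexedColumn = indexedColumn;
    }

    // Writes one column of a row and keeps the index consistent with the data.
    void put(String rowKey, String column, String value) {
        Map<String, String> row = data.computeIfAbsent(rowKey, k -> new TreeMap<>());
        String previous = row.put(column, value);
        if (!column.equals(indexedColumn)) {
            return; // only the indexed column affects the index table
        }
        if (previous != null) {
            // Drop the stale index entry left by the overwritten value.
            Set<String> stale = index.get(previous);
            if (stale != null) {
                stale.remove(rowKey);
                if (stale.isEmpty()) {
                    index.remove(previous);
                }
            }
        }
        index.computeIfAbsent(value, k -> new TreeSet<>()).add(rowKey);
    }

    // Reads a row by key; an unknown key yields an empty row.
    Map<String, String> get(String rowKey) {
        Map<String, String> row = data.get(rowKey);
        return row == null ? Collections.emptyMap() : Collections.unmodifiableMap(row);
    }

    // Looks up row keys by the value of the indexed column, using only the index.
    Set<String> findRowKeys(String indexedValue) {
        Set<String> keys = index.get(indexedValue);
        return keys == null ? Collections.emptySet() : Collections.unmodifiableSet(keys);
    }
}
```

Every write to the indexed column updates both maps, which is the cost of the pattern; in exchange, a lookup by value never has to scan the data. A real dataset would also need both writes to succeed or fail together, which this sketch does not address.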

CDAP includes a number of useful datasets, known as system datasets, such as key/value tables, indexed tables, and time series. You can implement your own data patterns as custom datasets, on top of any combination of core and system datasets.

Impersonation

Impersonation allows users to run programs and access datasets, streams, and other resources as a pre-configured user (a principal). CDAP supports configuring impersonation at the namespace level and at the application level; application-level configuration takes precedence over namespace-level configuration.

If impersonation is enabled and you don't specify a principal for an application or dataset, then the namespace owner's principal is used. If there is no namespace owner, or you are using the default namespace, then the default principal is used (as set by the properties cdap.master.kerberos.keytab and cdap.master.kerberos.principal in cdap-site.xml).
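The fallback order described above can be written as a small decision function. This is a sketch of the rule as stated here, assuming impersonation is enabled; PrincipalResolver and its parameter names are hypothetical and do not correspond to CDAP classes, and the principals in the usage note are made-up examples.

```java
// Illustration only: PrincipalResolver is a hypothetical helper, not a CDAP class.
// It encodes the fallback order described in the text, assuming impersonation is enabled.
final class PrincipalResolver {
    private PrincipalResolver() {
    }

    /**
     * @param entityPrincipal         principal set on the application or dataset, or null if none
     * @param namespaceOwnerPrincipal principal of the namespace owner, or null if none
     * @param isDefaultNamespace      true when the entity lives in the default namespace
     * @param defaultPrincipal        principal taken from cdap.master.kerberos.principal
     * @return the principal the entity is impersonated as
     */
    static String resolve(String entityPrincipal,
                          String namespaceOwnerPrincipal,
                          boolean isDefaultNamespace,
                          String defaultPrincipal) {
        // 1. An explicit application or dataset principal has the highest precedence.
        if (entityPrincipal != null) {
            return entityPrincipal;
        }
        // 2. Otherwise use the namespace owner, except in the default namespace.
        if (!isDefaultNamespace && namespaceOwnerPrincipal != null) {
            return namespaceOwnerPrincipal;
        }
        // 3. Otherwise fall back to the default principal from cdap-site.xml.
        return defaultPrincipal;
    }
}
```

For example, resolve(null, "ns@EXAMPLE.COM", false, "cdap@EXAMPLE.COM") returns "ns@EXAMPLE.COM", while the same call with isDefaultNamespace set to true returns "cdap@EXAMPLE.COM".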

Created in 2020 by Google Inc.