Data Exploration (Deprecated)

Warning: This topic is no longer supported.

This section covers how you can explore data in CDAP through the use of ad-hoc SQL-like queries. Queries can be run over certain types of datasets. We refer to this as CDAP Explore, or Explore for short.

Enabling exploration for a dataset results in the creation of a SQL table in the Explore system. The name of this table is, by default, the same as the name of the dataset, prefixed with dataset_. For example, after creating a Table named results, it can be explored with the SQL query:

SELECT * FROM dataset_results LIMIT 5

Note that the table is only explorable if it has a schema.
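The default naming rule described above can be sketched in a few lines of Java. This is an illustration only: ExploreNames and its methods are hypothetical helpers, not part of the CDAP API, and the sketch assumes the default name is simply the dataset_ prefix followed by the dataset name, as stated above.

```java
// Sketch only: ExploreNames is a hypothetical helper, not a CDAP class.
final class ExploreNames {
    // Default prefix applied to dataset names, as described in the text.
    static final String DEFAULT_PREFIX = "dataset_";

    private ExploreNames() { }

    // Returns the default Explore table name for a dataset.
    static String defaultTableName(String datasetName) {
        if (datasetName == null || datasetName.isEmpty()) {
            throw new IllegalArgumentException("datasetName must not be empty");
        }
        return DEFAULT_PREFIX + datasetName;
    }

    // Builds a simple preview query over an Explore table.
    static String previewQuery(String tableName, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        return "SELECT * FROM " + tableName + " LIMIT " + limit;
    }
}
```

For a dataset named results, previewQuery(defaultTableName("results"), 5) produces the query shown earlier.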

The name of the Explore table can be configured by setting the dataset property explore.table.name when creating the dataset. It is recommended to use a dataset properties builder:

// Create the "results" partitioned file set, configure it to work with MapReduce and with Explore
createDataset("results", PartitionedFileSet.class, PartitionedFileSetProperties.builder()
  ...
  .setEnableExploreOnCreate(true)
  .setExploreTableName("results")
  .setExploreFormat("csv")
  .setExploreSchema("`date` STRING, winner STRING, loser STRING, winnerpoints INT, loserpoints INT")
  .build());

This dataset can be queried with the configured table name; that is, without the dataset_ prefix:

SELECT * FROM results LIMIT 5

Similarly, you can configure the Explore database name by setting the dataset property explore.database.name (or calling the setExploreDatabaseName() method of the dataset properties builder).
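If you set the raw properties rather than using the builder, the two naming settings are plain string key/value pairs. The following is a minimal sketch under stated assumptions: ExploreProps is a hypothetical helper (not a CDAP class), the table-name key is the one given above, and the database-name key is assumed to be explore.database.name. The dataset properties builder remains the recommended approach.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: ExploreProps is a hypothetical helper, not a CDAP class.
final class ExploreProps {
    // Property key named in the text above.
    static final String TABLE_NAME_KEY = "explore.table.name";
    // Assumed key for the Explore database name.
    static final String DATABASE_NAME_KEY = "explore.database.name";

    private ExploreProps() { }

    // Collects the naming properties; a null argument leaves that default in place.
    static Map<String, String> naming(String databaseName, String tableName) {
        Map<String, String> props = new LinkedHashMap<>();
        if (databaseName != null) {
            props.put(DATABASE_NAME_KEY, databaseName);
        }
        if (tableName != null) {
            props.put(TABLE_NAME_KEY, tableName);
        }
        return props;
    }
}
```

Calling naming(null, "results") yields a map containing only the table-name entry, which corresponds to the setExploreTableName("results") call in the builder example.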

Note that if you are running a secure cluster, additional configuration is required.

Exploration of data in CDAP is governed by a combination of enabling the CDAP Explore Service and creating datasets that are explorable. The CDAP Explore Service is enabled by a setting in the CDAP configuration file (cdap-site.xml).

Datasets that were created before the Explore Service was enabled can be enabled for exploration by using the Query Microservices.

You can use the same Query Microservices to disable exploration of a specific dataset. The dataset will still be accessible programmatically; it just won't respond to queries through the Microservices or be available for exploration using the CDAP UI.
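As a rough sketch of what such a call might target, the Java below only assembles the request URI and does not contact a server. The path layout (/v3/namespaces/.../data/explore/datasets/.../enable or /disable), the use of an HTTP POST, and the base URL are assumptions not stated on this page; check them against the Query Microservices reference before relying on them. ExploreToggle is a hypothetical helper, not a CDAP class.

```java
import java.net.URI;

// Sketch only: ExploreToggle is a hypothetical helper, not a CDAP class.
final class ExploreToggle {
    private ExploreToggle() { }

    // Builds the URI that a POST request would target to enable or disable
    // exploration of one dataset. The path layout is an assumption.
    static URI uri(String baseUrl, String namespace, String dataset, boolean enable) {
        String action = enable ? "enable" : "disable";
        return URI.create(baseUrl
            + "/v3/namespaces/" + namespace
            + "/data/explore/datasets/" + dataset
            + "/" + action);
    }
}
```

For example, uri("http://localhost:11015", "default", "results", false) gives the address to which a disable request for the results dataset would be sent, where the host and port are placeholders for your CDAP router.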

  • Fileset Exploration: Describes how you can make a FileSet, PartitionedFileSet, or TimePartitionedFileSet explorable.

For more information on data exploration, see Third-Party Integration.

 

Created in 2020 by Google Inc.