Using Datasets in Programs (Deprecated)

There are two ways to use a dataset in a program:

  • Static instantiation

  • Dynamic instantiation

Static Instantiation

You can instruct the CDAP runtime system to inject the dataset into a class member with the @UseDataSet annotation:

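    class MyFlowlet extends AbstractFlowlet {
      // The runtime injects the "myCounters" dataset into this field.
      @UseDataSet("myCounters")
      private KeyValueTable counters;
      ...
      void process(String key) {
        // Increment the counter stored under this key by one.
        counters.increment(key.getBytes(), 1L);
      }
    }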

When starting the program, the runtime system reads the dataset specification from the metadata store and injects an instance of the dataset class into the application. This dataset will participate in every transaction that is executed by the program. If the program is multi-threaded (for example, an HTTP service handler), CDAP will make sure that every thread has its own instance of the dataset.
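To make the idea of annotation-driven injection concrete, here is a minimal, self-contained sketch. It is a toy model only, not how CDAP implements injection: the ToyUseDataSet annotation, ToyTable, ToyProgram, and ToyInjector are all hypothetical names invented for this illustration, and the "dataset" is just an object that remembers its name.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

// Hypothetical stand-in for @UseDataSet; not the CDAP annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface ToyUseDataSet {
  String value();
}

// Hypothetical stand-in for a dataset class.
class ToyTable {
  final String name;

  ToyTable(String name) {
    this.name = name;
  }
}

// A program that declares which dataset it wants, but never creates it.
class ToyProgram {
  @ToyUseDataSet("myCounters")
  private ToyTable counters;

  String injectedName() {
    return counters == null ? null : counters.name;
  }
}

// Toy "runtime": assigns a new ToyTable to every annotated field.
class ToyInjector {
  static void inject(Object program) {
    for (Field field : program.getClass().getDeclaredFields()) {
      ToyUseDataSet use = field.getAnnotation(ToyUseDataSet.class);
      if (use == null) {
        continue; // field is not annotated, leave it alone
      }
      try {
        field.setAccessible(true);
        field.set(program, new ToyTable(use.value()));
      } catch (IllegalAccessException e) {
        throw new IllegalStateException("Cannot inject " + field.getName(), e);
      }
    }
  }
}
```

In this sketch, the field is null until ToyInjector.inject(program) runs, after which it holds a table named after the annotation value.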

Dynamic Instantiation

If you don't know the name of the dataset at compile time (and hence you cannot use static instantiation), or if you want to use a dataset only for a short time, you can dynamically request an instance of the dataset through the program context:

    class MyFlowlet extends AbstractFlowlet {
      ...
      void process(String key) {
        // Request the dataset from the program context at runtime.
        KeyValueTable counters = getContext().getDataset("myCounters");
        counters.increment(key.getBytes(), 1L);
      }
    }

This dataset is instantiated at runtime, in this case every time the process() method is invoked. To reduce the overhead of repeatedly instantiating the same dataset, the CDAP runtime system caches dynamic datasets internally. By default, the cached instance of a dataset does not expire, and it participates in all transactions initiated by the program.
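The effect of the cache can be sketched with a small, self-contained model. This is an illustration under simplifying assumptions, not the CDAP implementation: CachedDataset and CachingContext are hypothetical names, and the cache key is only the dataset name.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a dataset instance.
class CachedDataset {
  final String name;

  CachedDataset(String name) {
    this.name = name;
  }
}

// Toy program context that instantiates each dataset once and then reuses it.
class CachingContext {
  private final Map<String, CachedDataset> cache = new HashMap<>();
  private int instantiations = 0;

  CachedDataset getDataset(String name) {
    // Only the first request for a name creates a new instance.
    return cache.computeIfAbsent(name, n -> {
      instantiations++;
      return new CachedDataset(n);
    });
  }

  int instantiationCount() {
    return instantiations;
  }
}
```

Calling getDataset("myCounters") repeatedly on the same CachingContext returns the same object, so the instantiation cost is paid only once.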

For convenience, if you know the dataset name at the time the program starts, you can store a reference to the dataset in a member variable at that time (similar to static datasets, but assigned explicitly by you):

    class MyFlowlet extends AbstractFlowlet {
      private KeyValueTable counters;

      @Override
      public void initialize(FlowletContext context) throws Exception {
        super.initialize(context);
        // Obtain the dataset once, when the program starts.
        counters = context.getDataset("myCounters");
      }

      void process(String key) {
        counters.increment(key.getBytes(), 1L);
      }
    }

Unlike static datasets, dynamic datasets let you release the resources held by their Java classes once you have finished using them. To do so, call the discardDataset() method of the program context, which marks the dataset to be closed and removed from all transactions. If a transaction is currently in progress, however, this does not happen until the current transaction is complete, because the discarded dataset may have performed data operations and therefore still needs to participate in the commit (or rollback) of the transaction.

Discarding a dataset is useful:

  • To ensure that the dataset is closed and its resources are released as soon as a program no longer needs the dataset.

  • To refresh a dataset after its properties have been updated. Without discarding the dataset, the program would keep the dataset in its cache and never pick up a change in the dataset's properties. Discarding the dataset ensures that it is removed from the cache after the current transaction. Therefore, the next time this dataset is obtained using getDataset() in a subsequent transaction, it is guaranteed to return a fresh instance of the dataset, hence picking up any properties that have changed since the program started.

After a dataset is discarded, it remains in the cache for the duration of the current transaction. If you call getDataset() again for the same dataset and arguments before the transaction is complete, the discard is reversed. It is therefore good practice to discard a dataset at the end of a transaction.

Discarding a dataset releases it as soon as possible: immediately if no transaction is in progress, or as soon as the current transaction finishes otherwise.
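The discard rules described above (immediate release outside a transaction, deferred release inside one, and reversal by a repeated getDataset() call) can be modeled in a few lines. The following is a toy sketch, not CDAP code: DiscardableDataset, DiscardingContext, startTransaction(), and finishTransaction() are hypothetical names, and datasets are keyed by name only.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a dataset instance.
class DiscardableDataset {
  final String name;
  boolean closed = false;

  DiscardableDataset(String name) {
    this.name = name;
  }
}

// Toy context: a dataset cache plus a set of names marked for discard.
class DiscardingContext {
  private final Map<String, DiscardableDataset> cache = new HashMap<>();
  private final Set<String> pendingDiscard = new HashSet<>();
  private boolean inTransaction = false;

  void startTransaction() {
    inTransaction = true;
  }

  DiscardableDataset getDataset(String name) {
    // Requesting the dataset again cancels a pending discard.
    pendingDiscard.remove(name);
    return cache.computeIfAbsent(name, DiscardableDataset::new);
  }

  void discardDataset(DiscardableDataset dataset) {
    if (inTransaction) {
      // Deferred: the dataset may still take part in commit or rollback.
      pendingDiscard.add(dataset.name);
    } else {
      // No transaction in progress: release immediately.
      evict(dataset.name);
    }
  }

  void finishTransaction() {
    inTransaction = false;
    for (String name : pendingDiscard) {
      evict(name);
    }
    pendingDiscard.clear();
  }

  private void evict(String name) {
    DiscardableDataset removed = cache.remove(name);
    if (removed != null) {
      removed.closed = true;
    }
  }
}
```

In this model, a dataset discarded inside a transaction stays open until finishTransaction() is called, and a getDataset() call made in between keeps it in the cache, which is why discarding at the end of a transaction is the safer habit.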

As with static datasets, if a program is multi-threaded, CDAP ensures that every thread has its own instance of each dynamic dataset. Consequently, to discard a dataset from the cache, every thread that uses it must call discardDataset() individually.

Multi-threading and Dataset Access

As mentioned above, under both static and dynamic instantiation, if a program is multi-threaded, CDAP ensures that every thread has its own instance of a dataset. This is because datasets are not thread-safe: they cannot be shared across threads, so each thread must operate on its own instance.

As a consequence, multiple threads accessing the same dataset each hold a different instance of it.

Because transactions are not thread-safe either, the dataset context of a transaction, as well as any datasets obtained through it, must not be shared across threads.
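The per-thread behavior can be illustrated with a ThreadLocal cache. This is a minimal sketch under stated assumptions, not the CDAP implementation: PerThreadDataset, PerThreadContext, and PerThreadDemo are hypothetical names chosen for this example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a dataset instance.
class PerThreadDataset {
  final String name;

  PerThreadDataset(String name) {
    this.name = name;
  }
}

// Toy context in which every thread has its own private dataset cache.
class PerThreadContext {
  private final ThreadLocal<Map<String, PerThreadDataset>> caches =
      ThreadLocal.withInitial(HashMap::new);

  PerThreadDataset getDataset(String name) {
    return caches.get().computeIfAbsent(name, PerThreadDataset::new);
  }

  void discardDataset(PerThreadDataset dataset) {
    // Only the calling thread's cache is affected.
    caches.get().remove(dataset.name);
  }
}

class PerThreadDemo {
  // Requests the dataset on a separate thread and returns that thread's instance.
  static PerThreadDataset fetchOnNewThread(PerThreadContext ctx, String name) {
    PerThreadDataset[] result = new PerThreadDataset[1];
    Thread worker = new Thread(() -> result[0] = ctx.getDataset(name));
    worker.start();
    try {
      worker.join(); // join() makes the worker's write to result visible here
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return result[0];
  }
}
```

Here, the calling thread and the worker thread both ask for "myCounters" but receive different objects, and a discard on one thread leaves the other thread's instance untouched.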

Cross-namespace Dataset Access

The dataset usage methods described above allow a program to access datasets in its own namespace. Dynamic dataset instantiation additionally allows a program to access datasets in a namespace other than the one in which it is running. Typically, this is needed when datasets are large enough to warrant sharing across namespaces, as opposed to every namespace having its own copy. To use a dataset from a different namespace, pass the namespace as an additional parameter to getDataset().
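As a rough illustration of namespace-qualified lookup, the sketch below uses a simplified stand-in for the program context. It assumes that the namespace is passed as the first argument and the dataset name as the second; NamespacedContext, ToyKeyValueTable, InMemoryNamespacedContext, and the namespace name "sharedNs" are hypothetical and exist only for this example. Consult the CDAP API reference for the exact signature.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a key-value dataset.
class ToyKeyValueTable {
  private final Map<String, Long> values = new HashMap<>();

  // Adds delta to the value stored under key and returns the new value.
  long increment(String key, long delta) {
    return values.merge(key, delta, Long::sum);
  }
}

// Simplified, hypothetical stand-in for the program context.
interface NamespacedContext {
  // Looks up the dataset in the program's own namespace.
  ToyKeyValueTable getDataset(String name);

  // Looks up the dataset in an explicitly named namespace (assumed argument order).
  ToyKeyValueTable getDataset(String namespace, String name);
}

// In-memory toy implementation that keys datasets by namespace and name.
class InMemoryNamespacedContext implements NamespacedContext {
  private final String ownNamespace;
  private final Map<String, ToyKeyValueTable> tables = new HashMap<>();

  InMemoryNamespacedContext(String ownNamespace) {
    this.ownNamespace = ownNamespace;
  }

  @Override
  public ToyKeyValueTable getDataset(String name) {
    return getDataset(ownNamespace, name);
  }

  @Override
  public ToyKeyValueTable getDataset(String namespace, String name) {
    return tables.computeIfAbsent(namespace + ":" + name, k -> new ToyKeyValueTable());
  }
}
```

With a context created as new InMemoryNamespacedContext("default"), getDataset("myCounters") and getDataset("sharedNs", "myCounters") resolve to two independent tables, while getDataset("default", "myCounters") resolves to the same table as the one-argument call.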

Using this API, users can both read and write to a dataset in a different namespace.

Cross-namespace access is not supported with static dataset instantiation, since it would require users to know the namespace when developing the application.

Note: On clusters with authorization enabled, refer to authorization policy pushdown for additional instructions.

 

Created in 2020 by Google Inc.