Data Cacher Plugin

Data Cacher is designed for pipelines that use Spark as the execution engine. If the pipeline uses MapReduce as the execution engine, Data Cacher has no effect.

Plugin version: 2.10.1

The Data Cacher plugin was introduced in 6.1.4 and 6.2.1. It caches any record that is passed through it. This is useful when you have auto-caching disabled at the Engine config level and you want to cache records at certain points in a pipeline. Spark caching prevents unnecessary recomputation of previous stages, which is particularly helpful when you have pipelines with branches.

You can add the Data Cacher plugin before any branch in a pipeline. This helps you control which parts of the data flow get cached. For example, you might want to cache data at the beginning of a pipeline before you branch it. This ensure that only data before the branch gets cached and the data downstream from the Data Cacher is not cached:

Although you can use the Data Cacher if auto-caching is enabled to ensure a certain branch is cached, it might result in decreased performance.

It is highly recommended that you disable Spark auto-caching at the Engine Config level before using the Data Cacher plugin in pipelines.

Configuration

Property

Macro Enabled?

Description

Storage Level

Yes

Required. This determines the method in which the data will be cached.

Note: The default value, MEMORY_AND_DISK, is recommended for the best reliability and consistency.

The allowed values are:

DISK_ONLY: Save data to disk.
DISK_ONLY_2: Save data to disk, replicated 2 times.
MEMORY_ONLY*: Save data to memory in raw java objects.
MEMORY_ONLY_2*: Save data to memory in raw java objects, replicated 2 times.
MEMORY_ONLY_SER*: Save data to memory in serialized objects.
MEMORY_ONLY_SER_2*: Save data to memory in serialized objects, replicated 2 times.
MEMORY_AND_DISK: Save data to memory in raw java objects, if memory is not large enough, then spill to disk.
MEMORY_AND_DISK_2: Save data to memory in raw java objects, if memory is not large enough, then spill to disk, replicated 2 times.
MEMORY_AND_DISK_SER: Save data to memory in serialized objects. If memory is not large enough, then spill to disk.
MEMORY_AND_DISK_SER_2: Save data to memory in serialized objects, if memory is not large enough then spill to disk, replicated 2 times.

Note: Caching data in memory only can cause OutOfMemory errors if the data is large.

Default is MEMORY_AND_DISK.

Output Schema

No

Required. The output schema for the data.