Data Cacher is designed for pipelines that use Spark as the execution engine. If the pipeline uses MapReduce as the execution engine, Data Cacher has no effect.
Plugin version: 2.10.1
The Data Cacher plugin was introduced in 6.1.4 and 6.2.1. It caches any record that passes through it. This is useful when auto-caching is disabled at the Engine Config level and you want to cache records at certain points in a pipeline. Spark caching prevents unnecessary recomputation of previous stages, which is particularly helpful in pipelines with branches.
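To see why caching before a branch matters, consider this plain-Python analogy (not the plugin's actual code): a lazily computed "stage" is re-run once per downstream branch unless its output is stored, which mirrors how Spark recomputes un-cached lineage for each branch of a pipeline.

```python
# Counts how many times the upstream "stage" actually runs.
call_count = 0

def expensive_stage():
    """Simulates an upstream transformation that is costly to compute."""
    global call_count
    call_count += 1
    return [x * 2 for x in range(5)]

# Without caching: each branch triggers its own recomputation.
branch_a = sum(expensive_stage())   # first run
branch_b = max(expensive_stage())   # second run
print(call_count)  # 2 -- the upstream stage ran once per branch

# With caching (what Data Cacher provides at a chosen point):
cached = expensive_stage()          # computed once and stored
branch_a = sum(cached)
branch_b = max(cached)
print(call_count)  # 3 -- only one additional run, shared by both branches
```

In the same way, placing a Data Cacher stage before the branch point means the work upstream of it is materialized once instead of once per branch.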
You can add the Data Cacher plugin before any branch in a pipeline, which lets you control which parts of the data flow get cached. For example, you might cache data at the beginning of a pipeline before you branch it. This ensures that only data before the branch is cached, and data downstream from the Data Cacher is not.
Although you can use the Data Cacher while auto-caching is enabled to ensure that a certain branch is cached, doing so might decrease performance.
It is highly recommended that you disable Spark auto-caching at the Engine Config level before using the Data Cacher plugin in pipelines.
Configuration
Property | Macro Enabled? | Description
---|---|---
Storage Level | Yes | Required. Determines how the data is cached. The allowed values are the Spark storage levels: `DISK_ONLY`, `MEMORY_ONLY`, `MEMORY_ONLY_SER`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_SER`, and `OFF_HEAP`. Note: Caching data in memory only can cause out-of-memory errors if the dataset does not fit in executor memory. Default is `DISK_ONLY`.
Output Schema | No | Required. The output schema for the data.
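For reference, a Data Cacher stage in an exported pipeline configuration might look like the following sketch. The exact JSON layout, plugin type, and property names (for example, `storageLevel` and `schema`) can vary by CDAP version, so treat this as an illustration rather than a canonical schema.

```json
{
  "name": "DataCacher",
  "plugin": {
    "name": "DataCacher",
    "type": "sparkcompute",
    "properties": {
      "storageLevel": "DISK_ONLY",
      "schema": "{\"type\":\"record\",\"name\":\"output\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}"
    }
  }
}
```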