Data Cacher Plugin
Data Cacher is designed for pipelines that use Spark as the execution engine. If the pipeline uses MapReduce as the execution engine, Data Cacher has no effect.
Plugin version: 2.11.0
The Data Cacher plugin was introduced in 6.1.4 and 6.2.1. It caches any record that is passed through it. This is useful when you have auto-caching disabled at the Engine config level and you want to cache records at certain points in a pipeline. Spark caching prevents unnecessary recomputation of previous stages, which is particularly helpful when you have pipelines with branches.
You can add the Data Cacher plugin before any branch in a pipeline. This lets you control which parts of the data flow are cached. For example, you might cache data at the beginning of a pipeline, before it branches. This ensures that only the data before the branch is cached, while data downstream from the Data Cacher is not.
Although you can use the Data Cacher while auto-caching is enabled to ensure that a specific branch is cached, doing so might decrease performance.
It is highly recommended that you disable Spark auto-caching at the Engine Config level before using the Data Cacher plugin in pipelines.
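The effect of placing a cacher before a branch can be illustrated without Spark itself. The sketch below is plain Python and all names in it are illustrative, not part of the Data Cacher plugin or Spark APIs: two branches consume the same upstream stage, and a memoizing wrapper standing in for the Data Cacher ensures the expensive stage is computed only once instead of once per branch.

```python
# Minimal sketch (plain Python, no Spark dependency) of why caching before
# a branch avoids recomputation. All names here are illustrative.

call_count = {"upstream": 0}

def upstream_stage():
    """An expensive stage that both branches depend on."""
    call_count["upstream"] += 1
    return [1, 2, 3, 4]

def cached(stage):
    """Mimic a cacher: compute once, then serve the cached records."""
    result = None
    def wrapper():
        nonlocal result
        if result is None:
            result = stage()
        return result
    return wrapper

# Without caching, each branch recomputes the upstream stage.
branch_a = sum(upstream_stage())
branch_b = max(upstream_stage())
assert call_count["upstream"] == 2

# With a cacher placed before the branch, the stage runs only once.
call_count["upstream"] = 0
source = cached(upstream_stage)
branch_a = sum(source())
branch_b = max(source())
assert call_count["upstream"] == 1
```

In a real pipeline, Spark tracks lineage lazily and re-executes upstream transformations for each downstream action unless the intermediate data is persisted, which is the recomputation the Data Cacher prevents.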
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Storage Level | Yes | Required. Determines how the cached data is stored. The allowed values are the Spark storage levels, such as MEMORY_ONLY, MEMORY_AND_DISK, and DISK_ONLY. Note: Caching data in memory only can cause out-of-memory errors. Default is DISK_ONLY. |
Output Schema | No | Required. The output schema for the data. |
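As a rough illustration of how these properties might appear in an exported pipeline definition, here is a hypothetical stage configuration. The plugin type, property keys, and schema string are assumptions based on typical CDAP stage configs, not confirmed by this document:

```json
{
  "name": "DataCacher",
  "plugin": {
    "name": "DataCacher",
    "type": "sparkcompute",
    "properties": {
      "storageLevel": "DISK_ONLY",
      "schema": "{\"type\":\"record\",\"name\":\"etlSchemaBody\",\"fields\":[{\"name\":\"id\",\"type\":\"long\"}]}"
    }
  }
}
```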
Created in 2020 by Google Inc.