...
This release provides performance and scalability improvements that increase developer productivity and optimize pipeline runtime performance. The release includes scaled-up previews that support up to 50 concurrent runs, capabilities to handle large and complex schemas in Pipeline Studio, an enhanced log viewer, and other critical improvements and fixes. Some of the highlights are:
Features
Added support to create autoscaling Dataproc clusters.
Added schema support feature in the UI to edit precision and scale.
Improved memory performance in pipelines by using a disk-only auto-caching strategy.
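To illustrate what the precision and scale attributes in the Features list mean for a decimal field, here is a minimal Python sketch. It assumes the usual convention that precision is the total number of digits and scale is the number of digits after the decimal point. The helper name `fits` is hypothetical and is not part of CDAP.

```python
from decimal import Decimal


def fits(value: Decimal, precision: int, scale: int) -> bool:
    """Hypothetical helper: can `value`, as written, be stored in a
    decimal(precision, scale) field without rounding?"""
    _sign, digits, exponent = value.as_tuple()
    if not isinstance(exponent, int):
        # NaN and Infinity report a non-integer exponent.
        return False
    # Digits to the right of the decimal point.
    frac_digits = max(-exponent, 0)
    if frac_digits > scale:
        return False
    # Digits to the left of the decimal point (zero needs none).
    int_digits = 0 if value.is_zero() else max(len(digits) + exponent, 0)
    return int_digits <= precision - scale


print(fits(Decimal("123.45"), precision=5, scale=2))
print(fits(Decimal("1234.5"), precision=5, scale=2))
```

The check counts digits exactly as the value is written, so trailing zeros such as in `1.50` count toward the scale.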
Performance and Scalability Improvements
Added support for up to 50 users running previews at the same time.
Added support for large and deeply nested schemas (more than 5K fields with 20+ levels of nesting).
Added the ability to optimize the performance of some pipelines with a new, experimental setting, spark.cdap.pipeline.consolidate.stages.
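These release notes state that the experimental setting is turned on through a preference or a runtime argument. The Python sketch below shows one way the runtime-argument form might look when starting a batch pipeline over the CDAP REST API. The host, port, namespace, and pipeline name are placeholders, and the endpoint path and the `DataPipelineWorkflow` program name are assumptions about a typical deployment that should be checked against your CDAP version. The request is only built, not sent.

```python
import json
from urllib.request import Request

SETTING = "spark.cdap.pipeline.consolidate.stages"


def build_start_request(base_url, namespace, pipeline, consolidate=True):
    """Build (but do not send) a POST that starts a batch pipeline with the
    experimental setting passed as a runtime argument.

    The URL layout below is an assumption, not taken from the release notes.
    """
    url = (
        f"{base_url}/v3/namespaces/{namespace}/apps/{pipeline}"
        "/workflows/DataPipelineWorkflow/start"
    )
    # Runtime arguments are sent as a flat JSON map of string keys and values.
    runtime_args = {SETTING: "true" if consolidate else "false"}
    return Request(
        url,
        data=json.dumps(runtime_args).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder values for illustration only.
req = build_start_request("http://localhost:11015", "default", "my_pipeline")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing the request to `urllib.request.urlopen` would submit it. Because the setting is experimental, trying it on a single run before saving it as a preference is the more cautious option.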
...
CDAP-16980 - Added a new Log Viewer feature that lets users see the most recent logs.
CDAP-16836 - Added new options in CDAP CLI to take URI instead of host and port combination.
CDAP-16690 - Added revamped Preview tab with new Record view for large schemas.
Performance and Scalability Improvements
PLUGIN-282 - Added new Data Cacher plugin to allow users to manually cache data at certain points in a pipeline.
PLUGIN-174 - Enabled macros for the hostname, port, and database name properties in database-specific plugins.
CDAP-17179 - Added new properties, Filesystem properties and Output File Prefix, for Google Cloud Storage Sink.
CDAP-17130 - Added Joiner distribution support to MapReduce and streaming pipelines.
CDAP-17123 - Made the records.updated metric available for the Google Cloud Storage Batch Sink plugin.
CDAP-17095 - Added Distribution to AutoJoiner API to increase performance for skewed joins.
CDAP-17078 - Added an experimental setting to consolidate multiple pipeline branches into single operations in Spark pipelines. This can improve pipeline performance by avoiding recomputation. To turn it on, set a preference or runtime argument for spark.cdap.pipeline.consolidate.stages to true.
CDAP-17077 - Changed the auto-caching strategy in Spark pipelines to default to disk-only caching instead of memory caching, because out-of-memory failures were common. Also changed the caching strategy to cache only at places that prevent sources from being recomputed, instead of the more aggressive caching done previously.
CDAP-16712 - Improved the scalability of the preview system when running in a Kubernetes environment by separating preview runs into their own individual pods. The Preview manager pod is now only responsible for handling the preview REST API.
CDAP-16697 - Created Best Practices guide for Spark engine tuning.
CDAP-16682 - When the backend is slow to respond to requests from the UI, the UI now shows a snackbar saying there's a delay.
CDAP-16668 - Added support for creating autoscaling Dataproc clusters.
CDAP-16850 - Introduced a new schema editor for plugins in pipelines. In addition to supporting large schemas (>5k fields), the schema editor supports editing attributes for decimal types (precision and scale).
CDAP-17015 - Updated Preview to show the number of preview runs pending before the current run (if there are any runs pending). The number of pending runs is shown under the timer in the UI.
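The CDAP-17095 entry above mentions distribution for skewed joins. One common technique for this is key salting: each row on the skewed side gets a random salt in a fixed range, and each row on the smaller side is replicated once per salt value, so a single hot key spreads across several join partitions. The Python sketch below illustrates that general idea only. Whether the AutoJoiner API does exactly this is an assumption, and the function names and the `factor` parameter are hypothetical.

```python
import random
from collections import defaultdict


def hash_join(left, right):
    """Plain inner join of two lists of (key, value) pairs."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(k, lv, rv) for k, lv in left for rv in index.get(k, [])]


def salted_join(skewed, small, factor, seed=0):
    """Inner join that spreads each key of `skewed` over `factor` sub-keys."""
    rng = random.Random(seed)
    # Skewed side: attach one random salt per row.
    salted = [((k, rng.randrange(factor)), v) for k, v in skewed]
    # Small side: replicate every row once per possible salt value.
    exploded = [((k, s), v) for k, v in small for s in range(factor)]
    joined = hash_join(salted, exploded)
    # Drop the salt so the output matches an ordinary join.
    return [(k, lv, rv) for (k, _salt), lv, rv in joined]


skewed = [("hot", i) for i in range(6)] + [("cold", 99)]
small = [("hot", "H"), ("cold", "C")]
assert sorted(salted_join(skewed, small, factor=3)) == sorted(hash_join(skewed, small))
```

The trade-off is that the smaller side grows by the chosen factor, so this approach tends to pay off only when one input is heavily skewed and the other is comparatively small.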
...