...

This release provides performance and scalability improvements that increase developer productivity and optimize pipeline runtime performance. The release includes scaled-up previews that support up to 50 concurrent runs, capabilities to handle large and complex schemas in Pipeline Studio, an enhanced log viewer, and other critical improvements and fixes. Some of the highlights are:

Features

  • Added support for creating autoscaling Dataproc clusters.

  • Added schema support in the UI for editing the precision and scale of decimal types.

  • Improved memory performance in pipelines by using a disk-only auto-caching strategy.

Performance and Scalability Improvements

  • Added support for 50 users running previews at the same time.

  • Added support for large and deeply nested schemas (>5K fields with 20+ levels of nesting).

  • Added the ability to optimize the performance of some pipelines with a new, experimental setting, spark.cdap.pipeline.consolidate.stages.
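The experimental setting is described in these notes as a preference or runtime argument set to true. As a minimal sketch of what that looks like from a client, the helpers below add the key to a set of runtime arguments and build a start request. The names `with_stage_consolidation` and `start_request` are hypothetical, and the base URL, the pipeline name, and the `DataPipelineWorkflow` route are assumptions about a typical batch pipeline deployment, not details taken from these notes. Nothing is sent over the network.

```python
import json

# Setting name as given in the release notes.
CONSOLIDATE_KEY = "spark.cdap.pipeline.consolidate.stages"


def with_stage_consolidation(runtime_args=None, enabled=True):
    """Return a copy of runtime_args with the consolidation setting added."""
    args = dict(runtime_args or {})
    # Runtime arguments are plain string key/value pairs.
    args[CONSOLIDATE_KEY] = "true" if enabled else "false"
    return args


def start_request(base_url, namespace, pipeline, runtime_args):
    """Build (url, json_body) for starting a batch pipeline run.

    Assumption: the route follows the CDAP program lifecycle REST pattern
    and the pipeline runs as the workflow named DataPipelineWorkflow.
    """
    path = (
        f"/v3/namespaces/{namespace}/apps/{pipeline}"
        "/workflows/DataPipelineWorkflow/start"
    )
    return base_url.rstrip("/") + path, json.dumps(runtime_args)


# Hypothetical host, namespace, and pipeline name, for illustration only.
args = with_stage_consolidation({"input.path": "gs://example-bucket/in"})
url, body = start_request("http://localhost:11015", "default", "my_pipeline", args)
print(url)
print(body)
```

Because the setting is experimental, enabling it per run through runtime arguments keeps it easy to turn off again; a preference would apply it to every run at that scope.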

...

  • CDAP-16980 - Added a new Log Viewer feature that lets users see the most recent logs.

  • CDAP-16836 - Added new options in the CDAP CLI to accept a URI instead of a host and port combination.

  • CDAP-16690 - Added a revamped Preview tab with a new Record view for large schemas.

Performance and Scalability Improvements

  • PLUGIN-282 - Added a new Data Cacher plugin that lets users manually cache data at chosen points in a pipeline.

  • PLUGIN-174 - Enabled macros for the hostname, port, and database name in database-specific plugins.

  • CDAP-17179 - Added two new properties, Filesystem Properties and Output File Prefix, to the Google Cloud Storage Sink.

  • CDAP-17130 - Added Joiner distribution support to MapReduce and streaming pipelines.

  • CDAP-17123 - Made the records.updated metric available for the Google Cloud Storage Batch Sink plugin.

  • CDAP-17095 - Added Distribution to the AutoJoiner API to improve the performance of skewed joins.

  • CDAP-17078 - Added an experimental setting that consolidates multiple pipeline branches into single operations in Spark pipelines, which can improve performance by avoiding recomputation. To turn it on, set a preference or runtime argument for spark.cdap.pipeline.consolidate.stages to true.

  • CDAP-17077 - Changed the auto-caching strategy in Spark pipelines to default to disk-only caching instead of memory caching, because out-of-memory failures were common. Also made caching less aggressive: data is now cached only at points where caching prevents sources from being recomputed.

  • CDAP-16712 - Improved the scalability of the preview system in Kubernetes environments by running each preview in its own pod. The preview manager pod is now responsible only for handling the preview REST API.

  • CDAP-16697 - Created a Best Practices guide for Spark engine tuning.

  • CDAP-16682 - When the backend is slow to respond to requests from the UI, the UI now shows a snackbar message about the delay.

  • CDAP-16668 - Added support for creating autoscaling Dataproc clusters.

  • CDAP-16850 - Introduced a new schema editor for plugins in pipelines. In addition to supporting large schemas (>5K fields), the schema editor can edit the precision and scale attributes of decimal types.

  • CDAP-17015 - Updated Preview to show the number of preview runs pending ahead of the current run, if there are any. The number of pending runs is shown under the timer in the UI.
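The Distribution entries above (CDAP-17095, CDAP-17130) target skewed joins, where a few hot keys send most rows to a single partition. These notes do not describe how Distribution works internally. A common general technique for this problem is key salting, and the sketch below illustrates that technique only, as an assumption about the kind of approach involved rather than CDAP's implementation. The names `salted_join`, `plain_join`, and `factor` are hypothetical.

```python
import random
from collections import defaultdict


def plain_join(skewed, small):
    """Reference inner join on key: every row for a key lands in one group."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    return [(k, v, other) for k, v in skewed for other in lookup.get(k, [])]


def salted_join(skewed, small, factor, seed=0):
    """Inner join that spreads each key over `factor` salted sub-keys.

    The small side is replicated once per salt value, and each row on the
    skewed side picks one salt at random, so a hot key is split across up
    to `factor` groups instead of one.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for key, value in small:
        for salt in range(factor):
            buckets[(key, salt)].append(value)

    joined = []
    for key, value in skewed:
        salt = rng.randrange(factor)
        for other in buckets.get((key, salt), []):
            joined.append((key, value, other))
    return joined


# A hot key "a" dominates the skewed side.
skewed_rows = [("a", i) for i in range(8)] + [("b", 100)]
small_rows = [("a", "x"), ("b", "y")]

# Salting changes how rows are grouped, not the join result.
assert sorted(salted_join(skewed_rows, small_rows, factor=4)) == sorted(
    plain_join(skewed_rows, small_rows)
)
```

The trade-off is that the small side is duplicated `factor` times, so salting pays off only when one side is heavily skewed and the other is small enough to replicate.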

...