CDAP Release 6.1.3

Important: CDAP 6.1.3 is deprecated.

Summary

This release introduces performance improvements as well as a few minor bug fixes. Some of the highlights are:

Performance improvements

  • Improve the performance of joiner plugins to better handle data skewness with capped memory.

  • Improve the performance of aggregator by using new reducer APIs.

  • Improve the program startup performance by using async operations.

  • Improve the preview performance by limiting records in partitions.

Pipeline Upgradability

  • Support upgrading all pipelines in a namespace via REST API to use latest available artifacts.

  • Upgrade to use Google Cloud Dataproc API v1beta2 to allow endpoint configuration.

Improved Error Messages, Preview Enhancement and Custom Google Cloud Dataproc Image Support

  • Improve error messages for program execution.

  • Minor bug fixes and enhancements for preview and pipeline execution.

  • Allow user to select a Custom Google Cloud Dataproc Image by specifying image URI.

Performance Improvements

  • CDAP-16709 - Improved the joiner and aggregator plugins so that required memory is capped at around 4 GB per executor instead of growing with the skew of the join. Joins can now also be performed in memory if one side is small, and the user can choose the behavior on null keys.

  • CDAP-16656 - Added support for rendering large schemas (>1000 fields) in the pipelines UI. By default, complex schemas are collapsed and fields in record types are lazy-loaded.

  • CDAP-16521 - Improved program startup performance by using a thread pool to start programs instead of starting them from a single thread.

  • CDAP-16673 - Added payload compression support to messaging service.

  • CDAP-16724 - Fixed a bug in Wrangler that caused it to run out of memory when sampling a Google Cloud Storage (GCS) object that has a large number of rows.

  • CDAP-16725 - Fixed a bug where concurrent preview runs were failing because SparkConf for the new preview runs was getting populated with the configurations from the previously started in-progress preview run.

  • CDAP-17000 - Changed default value of spark.network.timeout to 10 minutes to make pipeline execution more stable for shuffle heavy pipelines.
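
The in-memory join and null-key option described in CDAP-16709 can be pictured with a minimal sketch. This is not the joiner plugin's implementation; the function name in_memory_join and the join_on_null_keys flag are hypothetical names chosen for illustration. The idea is that only the small side is loaded into a hash table, so memory use depends on the small input rather than on how skewed the keys of the large input are.

```python
from collections import defaultdict


def in_memory_join(large, small, key, join_on_null_keys=False):
    """Inner-join two lists of dict records on `key` (illustrative only)."""
    # Load only the small side into a hash table keyed by the join key.
    table = defaultdict(list)
    for row in small:
        table[row.get(key)].append(row)

    joined = []
    # Stream the large side and probe the table one record at a time.
    for row in large:
        k = row.get(key)
        if k is None and not join_on_null_keys:
            continue  # null keys are dropped unless the caller opts in
        for match in table.get(k, []):
            joined.append({**match, **row})
    return joined
```

With join_on_null_keys left at False, records whose key is null on either side never appear in the output; setting it to True treats null as an ordinary key value.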

Pipeline Upgradability

  • CDAP-16835 - Added support for upgrading applications via REST API. Example usage is to upgrade all pipelines in a namespace to use the latest available artifacts.

  • CDAP-16918 - Introduced a new REST API for getting all application details across all namespaces.

  • CDAP-16328 - Added labeling of Google Cloud Dataproc clusters configured as Remote Hadoop Provisioners.

  • CDAP-16606 - Limited preview to reading 100 records across all input partitions.

  • CDAP-16621 - Removed modal showing pipeline JSON when users export pipelines. Instead, the pipeline gets downloaded when users click "export pipeline" without the extra confirmation step.

  • CDAP-16815 - Added a records.updated metric to the Google BigQuery sink. This metric gives the total of all inserts, updates, and upserts into the sink.

  • CDAP-16929 - Added the ability to select a Custom Google Cloud Dataproc Image. The complete URI for the custom image should be specified.
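
As a rough illustration of the namespace-wide upgrade in CDAP-16835, the sketch below only builds the HTTP request and does not send it. The endpoint path, the base URL with port 11015, and the helper name build_upgrade_request are assumptions made for this example; check the Lifecycle HTTP RESTful API reference for the exact path and request body before relying on it.

```python
import urllib.request


def build_upgrade_request(base_url, namespace):
    """Build (but do not send) a POST request that asks CDAP to upgrade
    every application in `namespace`. The path is an assumption."""
    url = f"{base_url.rstrip('/')}/v3/namespaces/{namespace}/upgrade"
    return urllib.request.Request(url, method="POST")


# Assumed local router address; replace with the address of your instance.
req = build_upgrade_request("http://localhost:11015", "default")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would send it; that step is left out here so the sketch has no side effects.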

User Interface Fixes

  • CDAP-16493 - Fixed a bug that caused long field names to overflow in the Joiner plugin.

  • CDAP-16645 - Fixed a bug that resets preferences of an app/pipeline every 10 seconds.

  • CDAP-16663 - Fixed a bug where UI incorrectly showed "No schema available" when the previous stages' output schema is a macro.

  • CDAP-16751 - Fixed a bug where the UI overwrote the scale property of a decimal schema field if it was 0.

  • CDAP-16788 - Removed references to the detailed view of an application. When navigating from the control center, the UI now shows only the overview for custom applications and the pipeline detailed view for pipelines.

  • CDAP-16801 - Fixed an issue where runtime arguments would lose focus after typing certain properties.

  • CDAP-16845 - Fixed a bug that started running preview for pipelines with post-run actions even if the user chose the option not to run preview.

  • CDAP-16891 - Unsupported pipelines in drafts would be upgraded when users open them.

  • CDAP-16940 - The UI now waits for 5 minutes of inactivity in the browser before stopping all the polling logic. This prevents polling from being stopped for resources that might take more than 30 seconds to respond (the current timeout is 30 seconds).

  • CDAP-16959 - Fixed an issue with runtime arguments re-rendering and losing focus when containing macros in preview.

  • CDAP-16972 - Fixed an issue where preview config would open when trying to stop a preview.

  • CDAP-16975 - When a plugin is added from the side panel in Pipeline Studio, the UI now adds the latest of the available versions of that plugin.

  • CDAP-16976 - The UI resets the default version of plugins for specific users during upgrade to enable users to choose the latest version for Pipeline Studio.

  • CDAP-16993 - Fixed a bug in preview for fields that have non-string types such as bytes.

  • CDAP-16315 - Disabled showing system logs by default when viewing logs for a pipeline.

  • CDAP-16240 - Fixed a bug to show master and worker memory in Google Cloud Dataproc compute profiles in GB.

Plugin Fixes

  • CDAP-16055 - Fixed a bug where the failure error message emitted by the Spark driver was not being collected.

  • CDAP-16222 - Fixed the package references in the dynamic Spark plugin to use io.cdap instead of co.cask.

  • CDAP-16453 - Fixed a bug with LimitingInputFormat that made DBSource plugin fail in preview mode.

  • CDAP-16465 - Fixed a bug where Wrangler database connections could show more tables than those in the configured database.

  • CDAP-16870 - Fixed PySpark support to work with Spark 2.1.3+.

  • CDAP-16760 - Fixed a bug where memory, CPU, and engine config properties were not being set for sparkprogram plugins.

  • CDAP-15775 - Fixed a bug that disallowed writing to an empty Google BigQuery table without any data or schema.

  • CDAP-16633 - Added option to generate scoped GoogleCredentials with BQ and Drive scope for all BQ requests.

  • CDAP-16655 - Fixed a bug in File Source to allow it to read files on Google Cloud Storage.

  • CDAP-16664 - Fixed a bug that resulted in failure to update/upsert to BQ in a different project.

  • CDAP-16809 - Added support for compressed files with header copying for text-file-based sources.

  • CDAP-16879 - If the 'truncate table' and 'update schema' options are set together, only WRITE_TRUNCATE is applied to the BQ job.

  • CDAP-16880 - Removed schema validation from BQ sink when 'truncate table' option is set.

Metadata Fixes

  • CDAP-16367 - Fixed a bug where field lineage is incorrect when a source is directly connected to a sink.

API Fixes

  • CDAP-16211 - Unified JSON structure used by REST endpoints for getting pipeline configuration and deploying pipelines.

  • CDAP-16614 - Fixed the fetch run records API to honor the limit query parameter correctly.
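
To show where the limit query parameter from CDAP-16614 goes, here is a small sketch that builds a run records URL. The helper name run_records_url, the base URL, and the application name my_pipeline are hypothetical; the path layout and the workflows/DataPipelineWorkflow program used in the usage line are assumptions that should be confirmed against the Lifecycle HTTP RESTful API reference.

```python
from urllib.parse import quote, urlencode


def run_records_url(base_url, namespace, app, program_type, program, limit):
    """Return a URL that requests at most `limit` run records of a program."""
    path = (
        f"/v3/namespaces/{quote(namespace)}/apps/{quote(app)}"
        f"/{program_type}/{quote(program)}/runs"
    )
    # The limit is sent as a query parameter, not as part of the path.
    return f"{base_url.rstrip('/')}{path}?{urlencode({'limit': limit})}"


url = run_records_url(
    "http://localhost:11015", "default", "my_pipeline",
    "workflows", "DataPipelineWorkflow", 5,
)
print(url)
```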

Error Message Fixes

  • CDAP-16507 - Fixed the error message that the delimited format generates when the number of fields in the data does not match the number of fields in the schema.

  • CDAP-16950 - Includes all ERROR level logs logged under the application logging context.
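
The situation behind CDAP-16507 can be sketched as follows. This is not the delimited format's actual code or message text; parse_delimited and the wording of the error are made up for illustration. The sketch shows the kind of check involved: when the number of values in a line differs from the number of schema fields, the error reports both counts instead of failing obscurely.

```python
def parse_delimited(line, field_names, delimiter=","):
    """Split a line and map its values onto the schema's field names."""
    values = line.split(delimiter)
    if len(values) != len(field_names):
        # Report both counts so the mismatch is easy to diagnose.
        raise ValueError(
            f"Found {len(values)} fields in the data but the schema has "
            f"{len(field_names)} ({', '.join(field_names)}). Check that the "
            "schema and the delimiter match the data."
        )
    return dict(zip(field_names, values))
```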

Platform Fixes

  • CDAP-16593 - Restricted the maximum number of network tags for a Google Cloud Dataproc VM to 64.

  • CDAP-16736 - Fixed record schema comparison to include record name.

  • CDAP-16816 - Fixed schedule properties to overwrite preferences set on the application.
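
The effect of CDAP-16736 can be illustrated with a toy model of a record schema. The RecordSchema class below is hypothetical and much simpler than CDAP's schema type; it only demonstrates that two records with identical fields but different names compare as unequal once the record name takes part in the comparison.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RecordSchema:
    name: str
    fields: Tuple[Tuple[str, str], ...]  # (field name, type name) pairs


purchase = RecordSchema("purchase", (("id", "long"), ("price", "double")))
refund = RecordSchema("refund", (("id", "long"), ("price", "double")))

# The generated __eq__ compares name and fields, so these differ
# even though their field lists are identical.
print(purchase == refund)
print(purchase.fields == refund.fields)
```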

Created in 2020 by Google Inc.