CDAP Release 6.2.0

Important: CDAP 6.2.0 is deprecated.

Summary

This release introduces a number of new features, improvements, and bug fixes for CDAP. Some of the main highlights of the release are:

Replication

  • A CDAP application that you can use to replicate data with low latency and in real time from transactional and operational databases into analytical data warehouses.

Google Cloud Dataproc Runtime Improvement

  • The Google Cloud Dataproc runtime now uses native Dataproc APIs for job submission instead of SSH.

Pipeline Studio Improvements

  • Added the ability to perform bulk operations (copy, delete) in the Pipeline Studio. Also added a right-click context menu for the Pipeline Studio.

New Features

  • CDAP-16385 - Added JDBC plugin selector widget.

  • CDAP-16339 - Introduced a new REST endpoint for fetching scheduled time for multiple programs.

  • CDAP-16243 - Added the capability to start system applications with application-specific configuration during startup.

  • CDAP-16223 - Added Replication feature.

  • CDAP-16210 - Added support for connecting to multiple hubs through the market.base.urls property in cdap-site.xml.

  • CDAP-16130 - Added the ability to right-click on the Pipeline Studio canvas to add a Wrangler source. This allows you to add multiple Wrangler sources (source + Wrangler transform) in the same pipeline without losing context.

  • CDAP-16107 - Added support for Spark 2.4.

  • CDAP-15941 - Added date picker widget to allow users to specify a single date or date range in a plugin.

  • CDAP-15633 - Added support to launch a job using Google Cloud Dataproc APIs.

  • CDAP-9014 - Added the ability to select multiple plugins and connections in Pipeline Studio and copy or delete them in bulk.

Improvements

  • CDAP-16633 - Added option to generate scoped GoogleCredentials with Google BigQuery and Google Drive scope for all Google BigQuery requests.

  • CDAP-16572 - Added macro support for Format field in Google Cloud Storage plugin.

  • CDAP-16525 - Added an option for Database source to replace characters in the field names.

  • CDAP-16809 - Added support for copying the header on compressed files.

  • CDAP-16656 - Added support for rendering large schemas (>1000 fields) in the Pipeline UI by collapsing complex schemas and lazy-loading fields in record types.

  • CDAP-16616 - Made the View Raw Logs and Download Logs buttons always enabled on the log viewer page.

  • CDAP-16593 - Restricted the maximum number of network tags for a Dataproc VM to 64.

  • CDAP-16586 - Changed the behavior for selecting multiple nodes in Studio: the user now holds the [shift] key and clicks on the plugins (instead of holding [ctrl] and clicking).

  • CDAP-16521 - Improved program startup performance by using a thread pool to start a program instead of starting from a single thread.

  • CDAP-16517 - Added an option to skip the header in files in delimited, csv, tsv, and text formats.

  • CDAP-16509 - Reduced the memory footprint of StructuredRecord, which improves overall memory consumption for pipeline execution.

  • CDAP-16351 - Added an API that returns the names of input stages.

  • CDAP-16330 - Replaced config.getProperties with config.getRawProperties to make sure validation happens on raw value before macros are evaluated.

  • CDAP-16324 - Added macro support for Analytics plugins.

  • CDAP-16308 - Reduced preview startup time by 60%. Also added a limit on the maximum number of concurrent preview runs (10 by default).

  • CDAP-16249 - Added the ability to show dropped field operations on the field level lineage page.

  • CDAP-16248 - For field level lineage, added the ability for users to view all fields in a cause or impact dataset (not just the related fields).

  • CDAP-16211 - Unified JSON structure used by REST endpoints for fetching pipeline configuration and deploying pipelines.

  • CDAP-15894 - Added the ability for users to navigate to a non-target dataset by selecting the header of the dataset in field level lineage.

  • CDAP-15579 - Added the ability for SparkCompute and SparkSink to record field level lineage.

  • CDAP-15061 - Added a page-level error when the user navigates to an invalid pipeline via the URL.

  • CDAP-13643 - Added support for recording field level lineage in streaming pipelines.
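As a rough illustration of the idea behind CDAP-16521 (starting programs from a thread pool rather than a single thread), here is a minimal Python sketch. It is not CDAP code: start_program, start_all, and the max_workers value are hypothetical names and settings chosen for this example, and the sleep only stands in for real startup work.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def start_program(name):
    """Hypothetical stand-in for starting one program."""
    time.sleep(0.05)  # simulated startup work (assumption for illustration)
    return f"{name}: started"


def start_all(names, max_workers=4):
    """Start every program through a shared thread pool.

    With a single thread the startups would run one after another;
    with a pool, up to max_workers of them overlap.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input names.
        return list(pool.map(start_program, names))
```

Calling start_all(["pipeline-a", "pipeline-b"]) returns one status string per name, in input order, while the individual startups run concurrently.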

Bug Fixes

  • CDAP-16816 - Fixed schedule properties to overwrite preferences set on the application instead of the other way around. This most visibly fixed a bug where the compute profile set on a pipeline schedule or trigger would get overwritten by the profile for the pipeline.

  • CDAP-16751 - Fixed a bug where the UI overwrote the scale and precision properties in a schema with the decimal logical type if the value was 0.

  • CDAP-16736 - Fixed record schema comparison to include record name.

  • CDAP-16725 - Fixed a bug where concurrent preview runs were failing because SparkConf for the new preview runs was getting populated with the configurations from the previously started in-progress preview run.

  • CDAP-16724 - Fixed a bug in Wrangler that would cause it to run out of memory when sampling a Google Cloud Storage object that has a large number of rows.

  • CDAP-16664 - Fixed a bug that resulted in failure to update/upsert to Google BigQuery in a different project.

  • CDAP-16663 - Fixed a bug where the UI incorrectly showed "No schema available" when the output of the previous stage is a macro.

  • CDAP-16655 - Fixed a bug in File source that prevented reading files from Google Cloud Storage.

  • CDAP-16614 - Fixed the fetch run records API to honor the limit query parameter correctly.

  • CDAP-16581 - Fixed a bug that prevented users from using the parse-as-json directive in Wrangler.

  • CDAP-16538 - Fixed a bug in the PluginProperties class where the internal map was modifiable.

  • CDAP-16526 - Fixed Google BigQuery sink to properly allow certain types as clustering fields.

  • CDAP-16501 - Fixed a bug that prevented pipeline stage metrics from updating correctly in the UI.

  • CDAP-16471 - Fixed a bug that would leave zombie processes when using the Remote Hadoop Provisioner.

  • CDAP-16465 - Fixed a bug where Wrangler database connections could show more tables than those in the configured database.

  • CDAP-16453 - Fixed a bug with LimitingInputFormat that made Database source plugin fail in preview mode.

  • CDAP-16425 - Fixed macro support for output schema in Google BigQuery source plugin.

  • CDAP-16309 - Fixed a race condition that could cause failures when running a Spark program.

  • CDAP-16240 - Fixed a bug so that master and worker memory in Google Cloud Dataproc compute profiles is shown in GB.

  • CDAP-16055 - Fixed a bug where the failure message emitted by the Spark driver was not being collected.

  • CDAP-16002 - Fixed a bug that caused errors when Wrangler's parse-as-csv with header was used when reading multiple small files.

  • CDAP-15775 - Fixed a bug that disallowed writing to an empty Google BigQuery table without any data or schema.

  • CDAP-15649 - Fixed a bug that would cause the Google BigQuery sink to fail the pipeline run if there was no data to write.

  • CDAP-14850 - Fixed a bug in the custom date range picker that prevented users from setting a custom date range that is not in the current year.

  • CDAP-14190 - Fixed a bug where users could not delete the entire column name in Wrangler.
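The kind of problem described in CDAP-16538 (an internal map that callers could modify) can be illustrated with a minimal Python sketch. This is not the CDAP PluginProperties class: PluginPropertiesSketch and get_properties are hypothetical names used only here, and the sketch shows one common remedy, a defensive copy wrapped in a read-only view, without claiming it is the fix CDAP applied.

```python
from types import MappingProxyType


class PluginPropertiesSketch:
    """Hypothetical illustration only; not CDAP's PluginProperties."""

    def __init__(self, properties):
        # Copy the caller's dict so later changes to it do not leak in,
        # then wrap the copy in a read-only view.
        self._properties = MappingProxyType(dict(properties))

    def get_properties(self):
        # Callers receive the read-only view, never the underlying dict.
        return self._properties
```

With this shape, an attempt such as props.get_properties()["key"] = "value" raises a TypeError instead of silently changing the object's internal state.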

Created in 2020 by Google Inc.