
Release Date: August 31, 2021

New Features

Connections

CDAP-17870: Added global connections for sources in Wrangler and data pipelines. For more information, see Managing Connections. Also added new connection endpoints to the Pipeline Microservices.
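
As a rough illustration of the new endpoints, the following sketch lists the connections in a namespace using Python's requests library. The router address and the exact endpoint path are assumptions for illustration only; see the Pipeline Microservices reference for the authoritative paths.

    # Hedged sketch: list the connections in a namespace through the Pipeline
    # Microservices. The router address and the endpoint path below are
    # assumptions for illustration, not the documented API.
    import requests

    CDAP_ROUTER = "http://localhost:11015"   # assumed CDAP router address
    NAMESPACE = "default"

    # Assumed path shape for the new connection endpoints.
    url = (f"{CDAP_ROUTER}/v3/namespaces/system/apps/pipeline/services/studio"
           f"/methods/v1/contexts/{NAMESPACE}/connections")

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    for connection in response.json():
        print(connection.get("name"), connection.get("connectionType"))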

CDAP-17924: Redesigned the Namespace Admin page.

Dataproc

CDAP-17999: Added support for labels in the Dataproc provisioner.

CDAP-17862: Added Shielded VMs as configuration settings for the Dataproc provisioner. For more information, see Google Dataproc.

CDAP-18004: Added support for running worker pods using different Kubernetes service accounts.

CDAP-18126: Added support for reusing Dataproc clusters across data pipeline runs. Reusing a cluster from a previous run speeds up pipeline startup by skipping the Provisioning step.

Namespaces

CDAP-17731: Added support to show the current namespace name in the footer.

CDAP-17877, CDAP-17876: Added Connections and Drivers to the Namespace Admin page for centralized management of all connections and drivers. For more information, see JDBC Drivers and Managing Connections.

Spark 3

CDAP-17693: Added Spark 3 support for Standalone CDAP, CDAP Sandbox, and Previewing data.

CDAP-17930: Changed the default Dataproc image version to 2.0 for new and upgraded pipelines. For more information, see “Upgrade Notes for Spark 3” below.

Transformation Pushdown

CDAP-17863: Added support for Transformation pushdown into BigQuery for Joiner transformations. For more information, see Using Transformation pushdown.

Improvements

CDAP-17730: Added authorization checks for preferences, logging, compute profiles, and metadata endpoints.

CDAP-17915: Added support to search for tables based on schema name when you select tables for a Replication job.

CDAP-17946: Improved error messages on the Pipeline List page.

CDAP-17973: Improved Wrangler error messages.

CDAP-18024: Added support for running CDAP as a non-root user.

CDAP-18039: Added additional trace logging in the authorization flow for debugging.

CDAP-18146: Pods created by CDAP now inherit their ImagePullPolicy from the pod which created them.

CDAP-18194: Added support for BIGNUMERIC data type for BigQuery target in replication.

PLUGIN-764: Added support for Datetime data type for SQL Server batch source plugins.

PLUGIN-645: Added support for Datetime data type for Replication jobs. 

Behavior Changes

CDAP-18114: MySQL, Oracle, PostgreSQL, and SQL Server batch sources, sinks, actions, and pipeline alerts are now installed by default as system plugins. Previously, these plugins were available in the Hub as user plugins.

CDAP-17898: When you use a connection in Wrangler and create a data pipeline, CDAP now creates a pipeline with the source plugin and then Wrangler transformation. In previous releases, CDAP created the pipeline with just the Wrangler transformation. You had to manually add the source plugin to the pipeline and configure it.

Bug Fixes

CDAP-17895: Fixed an issue in Replication that caused jobs to fail if more than 1000 tables were selected for replication.

CDAP-17919: Fixed an issue that caused replication jobs to hang when there were a large number of Delete or DDL events.

CDAP-17939: Improved the Messaging Service cleanup strategy so that it uses far fewer resources and does not run out of memory.

CDAP-17942: Fixed an issue that caused plugin validation to fail when a macro is used within a macro function. For example: ${logicalStartTime(${date_format})}
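
For context, a nested macro resolves from the inside out: the inner ${date_format} macro is substituted from runtime arguments or preferences, and the resulting value is then passed to the logicalStartTime macro function. The Python sketch below only illustrates that ordering; it is not CDAP's macro evaluator, and the date_format argument is hypothetical.

    # Illustration only: shows inside-out resolution of a nested macro.
    # This is not CDAP's macro evaluator; "date_format" is a hypothetical runtime argument.
    from datetime import datetime, timezone

    runtime_args = {"date_format": "yyyy-MM-dd"}

    def resolve(expression: str) -> str:
        # Step 1: substitute the inner macro ${date_format} from runtime arguments.
        inner = expression.replace("${date_format}", runtime_args["date_format"])
        # Step 2: evaluate the macro function. Here logicalStartTime is faked with
        # the current UTC time and a simplified Java-to-Python format mapping.
        prefix, suffix = "${logicalStartTime(", ")}"
        if inner.startswith(prefix) and inner.endswith(suffix):
            fmt = inner[len(prefix):-len(suffix)]
            py_fmt = fmt.replace("yyyy", "%Y").replace("MM", "%m").replace("dd", "%d")
            return datetime.now(timezone.utc).strftime(py_fmt)
        return inner

    print(resolve("${logicalStartTime(${date_format})}"))  # for example, 2021-08-31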

CDAP-17943: Fixed an issue that caused pipelines with aggregations and Decimal fields to fail with an exception.

CDAP-17959: Fixed an issue that caused Wrangler to ignore all columns other than the given column when parsing Excel files.

CDAP-17965: For CDAP instances running on Kubernetes, fixed an issue that prevented new previews from being scheduled after the preview manager had been stopped 10 times.

CDAP-17995: Fixed Wrangler so that pipelines fail upon error. In Wrangler 6.2 and later, a backward-incompatible change caused pipelines with errors to be marked as completed instead of failing.

CDAP-18002: Fixed an issue in Replication that caused jobs to fail when restarted during snapshotting.

CDAP-18012, CDAP-18003, CDAP-17853: Improved resilience of TMS.  

CDAP-18060: Fixed an issue in CDAP Sandbox that caused Get Schema to fail when the source includes the Format field.

CDAP-18131: Fixed an issue where replication to BigQuery failed because the source table had column names that are reserved keywords in BigQuery.

PLUGIN-178: Fixed an issue that occurred when writing non-null values to a nullable field in BigQuery.

PLUGIN-635: Fixed an issue in the BigQuery plugins to correctly delete temporary GCS storage buckets.

PLUGIN-654: Fixed an issue that caused pipelines to fail when the Pub/Sub source Subscription field was a macro.

PLUGIN-655: Fixed an issue in the BigQuery sink that caused failures when the input schema was not provided.

PLUGIN-669: Fixed an issue so that the Join Condition Type is now displayed in the Joiner plugin for pipelines upgraded from versions earlier than 6.4.0.

PLUGIN-678: Fixed an issue in the BigQuery sink that caused pipelines to fail or give incorrect results. 

PLUGIN-697: Fixed an issue that caused File Source Plugin validation to fail when there was a macro in the Format field.

Upgrade Notes for Spark 3

In CDAP 6.5.0, Spark 3 is the new default engine used for Preview and for running pipelines on Dataproc. Also, Spark 1 support has been removed from CDAP.

After an instance is upgraded to version 6.5.0, any new or upgraded pipeline that uses a Dataproc profile without an explicit image version will use the latest Dataproc 2.0 image, which has Spark 3.1 bundled.

Any pipeline that has not been upgraded will still use the original Dataproc 1.3 image, which has Spark 2.3 bundled.

What does this mean for pipeline developers and operations?

Spark 3.1 includes many improvements across different areas. See the release notes for Spark 3.0 and Spark 3.1. The main changes that affect backward compatibility are:

  • Python 2 support has been removed; any PySpark code must be Python 3 compatible (see the sketch after this list).

  • Spark 3.1 uses Scala 2.12, which is binary incompatible with Scala 2.11. Most code is source compatible, so recompile your Scala code with Scala 2.12 if you run into any issues.
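
As a quick check of the first point above, here is a minimal sketch of a Python-3-compatible PySpark job. The input path and column names are hypothetical; the only point is that Python-2-only syntax no longer runs under Spark 3.1.

    # Minimal sketch of a Python-3-compatible PySpark job. The input path and
    # column names are hypothetical. Under Spark 3.1, the driver and executors
    # run Python 3, so Python-2-only syntax (print statements, "except Exc, e:",
    # implicit relative imports) fails.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("python3-compat-check").getOrCreate()

    df = spark.read.json("gs://my-bucket/events/")  # hypothetical input
    counts = df.groupBy("service").agg(F.count("*").alias("events"))
    print("distinct services:", counts.count())  # print() is a function in Python 3

    spark.stop()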

What does this mean for plugin developers?

If you use any Scala code, make sure it is binary compatible with the Scala version of the execution environment: 2.12 for Spark 3 and 2.11 for Spark 2.

The easiest way to do this is to reference the appropriate spark2_2.11 or spark3_2.12 version of the CDAP artifact; for example, see https://github.com/cdapio/hydrator-plugins/pull/1364. Note that you must explicitly choose a version, because the unversioned artifacts that previously used Spark 1 are no longer available.

If you have any dependencies on Scala-specific artifacts (for example, Kafka), update those as well.

The new Hadoop version used in dependencies is 2.6.0 instead of 2.3.0.

What should you do if you run into problems?

Spark 2 is still fully supported by CDAP. If you use Dataproc, enter image version “1.3” in your provisioning profile and your pipelines will run on exactly the same image that CDAP 6.4 uses.

It’s highly recommended that you resolve any problems you find and migrate to the Spark 3 execution environment, as it brings a number of enhancements, including significant performance improvements.

Known Issues

Database connections

Although you can create connections for the Database, MySQL, Oracle, PostgreSQL, and SQL Server sources, the plugin properties do not include a Use Connection option, which means you cannot reference a connection from a database source plugin. However, from the Properties page of a database source plugin, you can select a connection and have CDAP populate the plugin properties with the connection's properties.

To use the properties set in these connections in the corresponding batch source plugin, follow these steps:

  1. In Pipeline Studio, add the source plugin to the canvas.

  2. Click Properties.

  3. Click Browse Database.
    The Browse Database page appears with the available connections listed in the left panel.

  4. Click the connection you want to use.

  5. Locate the table you want to add to the source plugin and click it.
    The source properties now include all of the properties from the connection.
