CDAP Release 6.4.0

Release Date: March 24, 2021

New Features

Datetime Data Type

PLUGIN-615, PLUGIN-614: Added Datetime data type support to the following plugins:

  • BigQuery batch source

  • BigQuery sink

  • BigQuery Multi Table sink

  • Bigtable batch source

  • Bigtable sink

  • Datastore batch source

  • Datastore sink

  • GCS File batch source

  • GCS File sink

  • GCS Multi File sink

  • Spanner batch source

  • Spanner sink

  • File source

  • File sink

  • Wrangler

  • Amazon S3 batch source

  • Amazon S3 sink

  • Database source

Also, the BigQuery Datetime type can be mapped directly to the CDAP Datetime data type.

CDAP-17684, CDAP-17636: Added support for Datetime data type in Wrangler. You can now select Parse > Datetime to transform columns of strings to datetime values and Format > Datetime to change the date and time pattern of a column of datetime values.

Added new Wrangler directives that you can use in Power Mode to work with datetime values: Parse as Datetime, Current Datetime, Datetime to Timestamp, Format Datetime, and Timestamp to Datetime.

CDAP-17620: Added support for the Datetime logical data type in CDAP schema
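As a rough illustration of what a Datetime logical type looks like in a schema, the sketch below builds a small record schema and checks values against it. It assumes that the logical type is expressed as a string annotated with "logicalType": "datetime" and that values are ISO-8601 strings without a time zone; the record name, field names, and helper functions are hypothetical and are not part of CDAP.

```python
from datetime import datetime

# Assumed shape of a record schema with one datetime field. The
# {"type": "string", "logicalType": "datetime"} form is an assumption.
SCHEMA = {
    "type": "record",
    "name": "event",  # hypothetical record name
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "created", "type": {"type": "string", "logicalType": "datetime"}},
    ],
}


def datetime_fields(schema):
    """Return the names of fields annotated with the datetime logical type."""
    return [
        field["name"]
        for field in schema["fields"]
        if isinstance(field["type"], dict)
        and field["type"].get("logicalType") == "datetime"
    ]


def is_valid_datetime(value):
    """Accept ISO-8601 strings that parse and carry no time zone offset."""
    try:
        parsed = datetime.fromisoformat(value)
    except ValueError:
        return False
    return parsed.tzinfo is None
```

A value such as "2021-03-24T10:15:30" passes this check, while a value with an offset such as "+00:00" is rejected because the sketch treats datetime as zone-less.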

Dataproc

CDAP-17622: Added machine type, cluster properties, and idle TTL as configuration settings for the Dataproc provisioner. For more information, see Google Dataproc.

Security

CDAP-17709: Added support for PROXY authentication mode to the Node.js proxy. CDAP UI now supports both MANAGED and PROXY modes of authentication. For more information, see Configuring Proxy Authentication Mode.

Pipeline Studio

CDAP-17549: Added support for data pipeline comments. For more information, see Adding comments to a data pipeline.

Plugin OAuth Support

CDAP-17611: Updated Salesforce plugins to integrate with the new OAuth macro function

CDAP-17610: Implemented a new macro function for OAuth token exchange

CDAP-17609: Implemented new HTTP endpoints for OAuth management

Replication

CDAP-17674: Added a runtime argument, retain.staging.table, that retains the BigQuery staging table to help debug issues

CDAP-17595: Added upgrade support for replication jobs

CDAP-17471: Added the ability to duplicate, export, and import replication jobs

CDAP-17337: Added a property to configure the dataset name in the BigQuery replication target. By default, the dataset name is the same as the Replication source database name. For more information, see Google BigQuery Target.

CDAP-16755: Added the ability to set the runtime argument event.queue.capacity, which specifies the capacity of the event queue in bytes for Replication jobs. If the target plugin consumes events more slowly than the source plugin emits them, events accumulate in the queue and occupy memory. This argument caps how much memory the event queue can use.
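The idea of a queue whose capacity is measured in bytes rather than in number of events can be sketched as follows. This is a conceptual toy, not the Replication implementation; the class and method names are hypothetical, and the capacity value plays the role that event.queue.capacity plays for a Replication job.

```python
from collections import deque


class ByteBoundedQueue:
    """Toy FIFO queue whose total payload size may not exceed capacity_bytes."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._events = deque()

    def offer(self, event: bytes) -> bool:
        """Enqueue the event if it fits; return False so the producer can back off."""
        if self.used_bytes + len(event) > self.capacity_bytes:
            return False
        self._events.append(event)
        self.used_bytes += len(event)
        return True

    def poll(self):
        """Dequeue the oldest event and release its bytes, or return None if empty."""
        if not self._events:
            return None
        event = self._events.popleft()
        self.used_bytes -= len(event)
        return event
```

When the consumer is slower than the producer, offer starts returning False once the byte budget is used up, which is what keeps memory use bounded.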

Kubernetes

CDAP-17618: Replaced Zookeeper for K8S CDAP setup with K8S secrets. For more information, see Prepare the secret token for authentication service.

CDAP-17466: Added Authentication functionality for CDAP on Kubernetes setup. For more information, see Installation on Kubernetes.

Joiner Analytics Plugin

CDAP-17607: Added advanced join conditions to the Joiner plugin. This allows users to specify an arbitrary SQL condition to join on. These types of joins are typically much more costly to perform than basic joins on equality. For more information, see Join Condition Type.
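The cost difference between the two kinds of join can be illustrated with a small sketch. It is not how the Joiner plugin or the underlying engine executes joins; it only shows that an equality join can look up matches by key, while an arbitrary condition generally has to be tested against every pair of rows. All function names and sample data are hypothetical.

```python
def hash_join(left, right, key):
    """Equality join: index one side by key, then probe it.

    Work grows roughly with len(left) + len(right).
    """
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    return [
        (left_row, right_row)
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]


def condition_join(left, right, condition):
    """Arbitrary-condition join: every pair of rows must be tested.

    Work grows with len(left) * len(right).
    """
    return [
        (left_row, right_row)
        for left_row in left
        for right_row in right
        if condition(left_row, right_row)
    ]


orders = [{"id": 1, "amount": 50}, {"id": 2, "amount": 500}]
tiers = [{"id": 1, "min": 0, "max": 100}, {"id": 2, "min": 100, "max": 1000}]

by_id = hash_join(orders, tiers, "id")
by_range = condition_join(
    orders, tiers, lambda o, t: t["min"] <= o["amount"] < t["max"]
)
```

With two inputs of a million rows each, the first approach touches each row a handful of times, while the second evaluates the condition on the order of a trillion times, which is why advanced conditions should be used only when an equality join cannot express the logic.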

New System Plugins for Data Pipelines

PLUGIN-558: Added a new post-action plugin, GCS Done File Marker. This post-action plugin marks the end of a pipeline run by creating and storing an empty DONE (or SUCCESS) file in the given GCS bucket upon pipeline completion, success, or failure, so that you can use it to orchestrate downstream or dependent processes.
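A downstream process might wait for such a marker before it starts. The sketch below polls for a marker file with a timeout. To stay self-contained it checks a local path, which stands in for an object in a GCS bucket; the function name, the marker path, and the timing parameters are all hypothetical.

```python
import time
from pathlib import Path


def wait_for_marker(marker: Path, timeout_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Poll until the marker file exists or the timeout elapses.

    Returns True if the marker was seen, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if marker.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Hypothetical usage: start the dependent job only once the marker appears.
# if wait_for_marker(Path("/data/output/DONE"), timeout_s=3600, poll_s=30):
#     run_downstream_job()
```

Against a real bucket, the existence check would be replaced with a call to a GCS client; the polling and timeout logic stays the same.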

Improvements

PLUGIN-601: Added a metric for bytes read from database source, which appears in the Spark UI.

PLUGIN-571: Added support to filter tables in the Multiple Database Tables Batch Source

PLUGIN-570: Improved error handling for Multiple Database Batch Sources and the BigQuery Multi Table sink so that pipelines can continue if one or more tables fail

CDAP-17724: Renamed replication pipelines to jobs

CDAP-17721: Added support for Kerberos login in K8s environment

CDAP-17675: Renamed Delete button to Remove in Replication Assessment report 

CDAP-17670: Improved plugin initialization performance

CDAP-17650: Added tag with parent artifact detail to Dataproc cluster created by CDAP

CDAP-17645: Set a timeout on the SSH connection so that the pipeline run fails when the cluster becomes unreachable

CDAP-17642: Added namespace count to Dataplane metrics

CDAP-17621: Added the Customer-Managed Encryption Key (CMEK) configuration property for the replication BigQuery target. For more information, see Google BigQuery Replication Target.

CDAP-17613: Improved Replication Assessment page to highlight SQL Server tables with Schema issues in red

CDAP-17603: Added ability to jump to any step when modifying the Replication draft

CDAP-17601: Improved performance by loading data directly into the target table during replication snapshot process

CDAP-17597: Added poll metrics in Overview and Monitoring in Replication detail view

CDAP-17582: Added ability to pass additional properties for Debezium and JDBC drivers for replication sources

CDAP-17482: Added ability to start Replication app from a last known checkpoint.

CDAP-17474: Added support for configuring the Elasticsearch TLS connection to trust all certificates. For more information, see Elasticsearch.

CDAP-17414: Improved Replication Table selection user experience

CDAP-17289: Improved reliability of Pub/Sub Source plugin

CDAP-17248: Added File Encoding property (ISO-8859, Windows and EBCDIC) to Amazon S3, File and GCS File Reader batch source plugins

CDAP-17114: Removed the record view in pipeline preview for the Joiner node because it was misleading

CDAP-16548: Renamed the Staging Bucket Location property to Location in the BigQuery Target properties page. For more information, see Google BigQuery Target.

CDAP-16623: Removed multiple ways to collapse/expand the Connection menu

CDAP-16008: Added support for running pipelines on Hadoop cluster with Kerberos enabled.

CDAP-15552: Fixed Wrangler to highlight new column generated by a directive

Behavior Changes

CDAP-16180: Resolved macro to preferences during pipeline validation

In CDAP 6.4.0, when you validate a plugin, macros get resolved with system and namespace preferences. In previous releases, to validate a plugin's configuration, you had to change the pipeline to remove the macros.
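The resolution step can be pictured with the following sketch, which substitutes ${name} macros in a property value from two preference maps. It assumes that namespace preferences override system preferences with the same key and that unresolved macros are left untouched; it does not handle macro functions or nested macros. The function name and sample values are hypothetical.

```python
import re

# Matches simple ${name} macros only; macro functions are out of scope here.
MACRO = re.compile(r"\$\{([^${}]+)\}")


def resolve_macros(value, system_prefs, namespace_prefs):
    """Replace ${name} macros using preferences.

    Assumption: a namespace preference wins over a system preference
    with the same key. Unknown macros are returned unchanged.
    """
    merged = {**system_prefs, **namespace_prefs}

    def substitute(match):
        return merged.get(match.group(1), match.group(0))

    return MACRO.sub(substitute, value)


system_prefs = {"project": "sys-project", "bucket": "sys-bucket"}
namespace_prefs = {"bucket": "team-bucket"}

path = resolve_macros("gs://${bucket}/${project}/in", system_prefs, namespace_prefs)
```

Here path becomes "gs://team-bucket/sys-project/in", so a validator can check the concrete value instead of the raw macro.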

PLUGIN-470: Removed Multi sink runtime argument requirements, allowing users to add simple transformations in multi-source/multi-sink pipelines.

In version 6.4.0, CDAP determines the schema dynamically at runtime instead of requiring arguments to be set. In previous releases, multi-sink plugins required the pipeline to set a runtime argument for each table, containing that table's schema.

Bug Fixes

PLUGIN-610: Fixed Bigtable Batch Source plugin. Previously, all pipelines that include the Bigtable source failed.

PLUGIN-606: FTP batch source now works with empty File System Properties. See “Deprecations” below.

PLUGIN-545: Added support for strings in Min/Max aggregate functions (used in both Group By and Pivot plugins)

PLUGIN-539: Fixed Salesforce plugin to correctly parse the schema as Avro schema to make sure all the field names are accepted by Avro

PLUGIN-517: Fixed data pipeline with BigQuery sink that failed with INVALID_ARGUMENT exception if the range specified was a macro 

PLUGIN-222: Fixed Kinesis Spark Streaming source, which had a class conflict, so users can now run pipelines with this source.

CDAP-17746: Fixed an issue in field validation logic in pipelines with BigQuery sink that caused a NullPointerException

CDAP-17744: Fixed Schema editor to show UI validations

CDAP-17737: Fixed Conditions plugins to work with Spark 3

CDAP-17732: Fixed the Wrangler Generate UUID directive to correctly generate a universally unique identifier (UUID) of the record

CDAP-17718: Fixed advanced joins to recognize auto broadcast setting

CDAP-17717: Fixed upgraded CDAP instances to include the arrow to the Error Collector

CDAP-17713: Fixed Pipeline Studio UI to send null instead of string for blank plugin properties

CDAP-17703: Fixed Pipeline Studio to use current namespace when it fetches data pipeline drafts

CDAP-17691: Fixed SecureStore API to support SYSTEM namespace

CDAP-17683: Fixed million indicator on Replication Monitoring page

CDAP-17680: Fixed Replication statistics to display on the dashboard for SQL Server

CDAP-17678: Fixed an issue where clicking the Delete button on Replication Assessment page resulted in an error for the replication job

CDAP-17653: Removed the use of the authorization token when generating the session token in the Node.js proxy.

CDAP-17641: Schema name is now shown when selecting tables to replicate

CDAP-17635: Fixed Replication to correctly insert rows that were previously deleted by a replication job

CDAP-17630: Data pipelines running on a Spark 3-enabled Dataproc cluster no longer fail with a class not found exception

CDAP-17617: Fixed Replication Overview page to display the label of the table status when you hover over the table status

CDAP-17598: Added ability to hover over metrics in the Pipeline Summary page

CDAP-17591: Fixed Wrangler completion percentage

CDAP-17584: Fixed Replication with a SQL Server source to generate rows correctly in BigQuery target table if snapshot failed and restarted

CDAP-17570: Fixed an issue where SQL Server replication job stopped processing data when the connection was reset by the SQL Server

CDAP-17568: Fixed the Replication wizard to close without error when you click the X icon to exit

CDAP-17495: Fixed an error in Replication wizard Step 3 "Select tables, columns and events to replicate" where selecting no columns for a table caused the wizard to fetch all columns in a table

CDAP-17491: Using a macro for a password in a replication job no longer results in an error

CDAP-17483: Fixed logical type display for data pipeline preview runs

CDAP-17476: Fixed Dashboard API to return programs running but started before the startTime

CDAP-17450: Fixed Replication job (when deployed) to show advanced configurations in UI

CDAP-17347: Fixed data pipeline with Python Evaluator transformation to run without stack trace errors

CDAP-17331: Suppressed verbose info logs from Debezium in Replication jobs

CDAP-17189: Added loading indicator while fetching logs in Log Viewer

CDAP-17028: Fixed Pipeline preview so logical start time function doesn’t display as a macro

CDAP-16804: Fixed fields with a list drop down menu in the Replication wizard to default to “Select one”

CDAP-16726: Added message in Replication Assessment when there are tables that CDAP cannot access

CDAP-16609: Fixed the error message shown when an invalid expression is added in Wrangler

CDAP-16316: Fixed RENAME directive in Wrangler so it’s case sensitive

CDAP-16233: Fixed Pipeline Operations UI to stop showing the loading icon indefinitely when it gets an error from the backend

CDAP-15979: Fixed Wrangler to no longer generate invalid reference names

CDAP-15509: Fixed Wrangler to display logical types instead of Java types

CDAP-15465: Fixed pipelines from Wrangler to no longer generate incorrect for xml files

CDAP-13907: Fixed an issue where adding a connection in Wrangler hard-coded the name of the JDBC driver

CDAP-13281: Batch data pipelines with Spark 2.2 engine and HDFS sinks no longer fail with a delegation token error

Known Issues

BigQuery Sinks

PLUGIN-678: Data pipelines that include BigQuery sinks version 0.17.0 fail or give incorrect results. This is fixed in BigQuery sink version 0.17.1, which is available for download in the Hub. 

Workaround:

In the Hub, download Google Cloud Platform version 0.17.1. For each pipeline, replace BigQuery sink plugins version 0.17.0 with BigQuery sink plugins version 0.17.1. If a pipeline has a BigQuery sink and other Google Cloud Platform plugins, such as a BigQuery source, you must update all Google Cloud Platform plugins to version 0.17.1. Google Cloud Platform plugins in the same pipeline must be the same version.
To quickly update each plugin, export all pipelines that use BigQuery sinks. You can use the Pipeline Studio to export pipelines in Draft and Deploy states. You can also use the Lifecycle Microservices to export pipelines in Deploy state in batch. Then import them back into Pipeline Studio. Pipeline Studio prompts you to update the plugins to version 0.17.1. Because CDAP exports pipelines to Draft state, you’ll need to deploy each pipeline after you import it.
Also, set version 0.17.1 as the default for all Google Cloud Platform plugins. For more information, see Working with multiple versions of the same plugin.
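If you have many exported pipeline files, you could also rewrite the plugin version in the JSON before importing. The sketch below is an offline alternative to the Pipeline Studio prompt, not the documented workaround. It assumes that each stage in the exported JSON records its plugin artifact under config.stages[].plugin.artifact and that the Google Cloud Platform plugins use the artifact name google-cloud; verify both against one of your own exports before relying on it. The function name and sample pipeline are hypothetical.

```python
import copy


def bump_artifact_version(pipeline, artifact_name, old_version, new_version):
    """Return a copy of the pipeline with matching plugin artifacts re-versioned.

    Also returns the names of the stages that were changed.
    Assumes the layout config -> stages -> plugin -> artifact.
    """
    updated = copy.deepcopy(pipeline)
    changed = []
    for stage in updated.get("config", {}).get("stages", []):
        artifact = stage.get("plugin", {}).get("artifact", {})
        if artifact.get("name") == artifact_name and artifact.get("version") == old_version:
            artifact["version"] = new_version
            changed.append(stage.get("name"))
    return updated, changed


# Hypothetical exported pipeline with one Google Cloud stage and one other stage.
pipeline = {
    "name": "example",
    "config": {
        "stages": [
            {"name": "sink", "plugin": {"name": "BigQueryTable", "artifact": {
                "name": "google-cloud", "version": "0.17.0", "scope": "SYSTEM"}}},
            {"name": "agg", "plugin": {"name": "GroupByAggregate", "artifact": {
                "name": "core-plugins", "version": "2.6.0", "scope": "SYSTEM"}}},
        ]
    },
}

updated, changed = bump_artifact_version(pipeline, "google-cloud", "0.17.0", "0.17.1")
```

Because every stage that uses the named artifact is rewritten together, the result keeps all Google Cloud Platform plugins in a pipeline on the same version.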

Joiner Analytics

PLUGIN-669: Joiner plugin version 2.6.0 does not show join conditions

The following issue occurs in the Joiner plugin version 2.6.0, which lets you toggle between basic and advanced join conditions. After you upgrade CDAP to 6.4.0 or import a pipeline from a previous version, and you open the Joiner properties page, the basic join condition for the configured pipeline does not appear. This issue doesn't affect how the pipeline runs; the join condition still exists.

Workaround:

To resolve this issue:

  1. Click System Admin > Configuration > Make HTTP Calls.

  2. In the HTTP calls executor fields, enter:

    PUT namespaces/system/artifacts/core-plugins/versions/2.6.0/properties/widgets.Joiner-batchjoiner?scope=SYSTEM

  3. Paste the following JSON content in the Body field:

    [JSON body not recoverable from the source text]


  4. Click Send.

If your Pipeline page is open in another window, you might need to refresh the page to see the join conditions.
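The same call could be prepared outside the UI. The sketch below only constructs the PUT request for the path given in step 2 and does not send it. The base URL (host, port, and the v3 prefix) is an assumption about a local installation, the function name is hypothetical, and the widget JSON from step 3 must be supplied by you.

```python
import urllib.request

# Assumed base URL of a local CDAP router; adjust for your installation.
BASE_URL = "http://localhost:11015/v3/"

# Path exactly as given in step 2 of the workaround.
PATH = (
    "namespaces/system/artifacts/core-plugins/versions/2.6.0/"
    "properties/widgets.Joiner-batchjoiner?scope=SYSTEM"
)


def build_widget_request(widget_json: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the PUT request that carries the widget JSON."""
    return urllib.request.Request(
        url=base_url + PATH,
        data=widget_json.encode("utf-8"),
        method="PUT",
    )


# To send it, pass the request to urllib.request.urlopen(request).
# A secured instance would also need an authorization header.
```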

Replication

CDAP-17720: When you run a Replication job, if a source table has a column name that does not conform to BigQuery naming conventions, the job fails with an error similar to the following:

com.google.cloud.bigquery.BigQueryException: Invalid field name "SYS_NC00012$". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.

Note: In BigQuery, column names must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long. 

Workaround: Remove columns from the Replication job that do not conform to the BigQuery naming conventions.
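To find the offending columns before you run a job, the naming rule quoted above can be written as a small check. This sketch assumes that "letters" means ASCII letters; the function names are hypothetical.

```python
import re

# Letters, numbers, and underscores; starts with a letter or underscore;
# at most 128 characters. ASCII-only letters are an assumption.
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,127}")


def is_valid_bigquery_column(name: str) -> bool:
    """Return True if the name satisfies the rule quoted in the error message."""
    return VALID_NAME.fullmatch(name) is not None


def nonconforming_columns(columns):
    """Return the column names that would need to be removed from the job."""
    return [name for name in columns if not is_valid_bigquery_column(name)]
```

For example, nonconforming_columns(["id", "SYS_NC00012$", "_ok"]) returns ["SYS_NC00012$"], the column from the error message above.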

Third-Party Limitations

CDAP-17897: Due to a limitation in Microsoft SQL Server CDC, if your replication source table has a newly added column, it is not automatically added to CDC tables. You must manually add it to the underlying CDC table.

Solution:

  1. Disable the CDC instance:

    EXEC sp_cdc_disable_table
    @source_schema = N'dbo',
    @source_name = N'myTable',
    @capture_instance = 'dbo_myTable'
    GO

  2. Enable the CDC instance again:

    EXEC sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name = N'myTable',
    @role_name = NULL,
    @capture_instance = 'dbo_myTable'
    GO

  3. Create a new replication job.

For more information, see Handling Changes to the Source Table.

Deprecations

FTP Batch Source (System Plugin for Data Pipelines)

The FTP Batch Source plugin installed with CDAP is deprecated and will be removed in CDAP 7.0.0. This deprecation includes all versions of the FTP Batch Source prior to version 3.0.0. The supported version of the FTP Batch Source is version 3.0.0, which is available for download in the Hub.

FTP Batch Source version 3.0.0 is completely backward compatible, except that it uses a different artifact. This was done to ensure that updates to the plugin can be delivered out of band of CDAP releases, through the Hub.

It’s recommended that you use the FTP Batch Source plugin version 3.0.0 or later in your data pipelines.