CDAP Release 6.7.0

Release Date: June 7, 2022

New Features

General

Added support for mounting arbitrary volumes to CDAP system services in the CDAP operator.

Added support for monitoring CDAP pipelines using an external tool.

Performance and Scalability

CDAP-19016: Increased pipeline run scalability.

CDAP-18837: CDAP now uses system worker pods to enable horizontal scaling of pipeline launching. For more information, see System Workers.

Plugins

Google Dataplex Batch Source and Google Dataplex Sink system plugins are available in Preview. 

Transformation Pushdown

Transformation Pushdown for joins is generally available (GA).

In Transformation Pushdown, Group By aggregation and Deduplicate aggregation are available in Preview.

CDAP-18437: Transformation Pushdown supports the BigQuery Storage Read API to improve performance when extracting data from BigQuery.

PLUGIN-1001: Added support for connections to Transformation Pushdown.

Wrangler

  • Added support for parsing files before loading data into the Wrangler workspace, which means the recipe does not include parse directives. Now, when you create a pipeline from Wrangler, the source has the correct Format property.

  • Added support for importing the schema before loading data into the Wrangler workspace for formats where schema inference is not possible, such as JSON and some Avro files.

  • Added the following directives (illustrated in the sketch after this list):

    • CREATE-RECORD. Creates a record column with nested values by copying values from source columns into a destination column.

    • FLATTEN-RECORD. Splits a record column with nested values into multiple columns.

    • Note: In releases before CDAP 6.7.0, due to a bug, the FLATTEN directive worked with both an array and a record column with nested values. Starting in CDAP 6.7.0, the FLATTEN directive only works with arrays.
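To make the two new directives concrete, here is a minimal sketch of their row-level effect, written in plain Python rather than Wrangler recipe syntax. The column names and the naming of the flattened output columns are assumptions for illustration only, not the directives' documented output.

    # Illustrative Python sketch (not Wrangler recipe syntax) of the row-level
    # effect of the new directives; column names are assumptions for the example.
    row = {"first_name": "Ada", "last_name": "Lovelace"}

    # CREATE-RECORD: copy values from source columns into a destination record column.
    row["name"] = {"first_name": row["first_name"], "last_name": row["last_name"]}

    # FLATTEN-RECORD: split a record column with nested values into multiple columns.
    # The output column naming shown here is assumed for illustration.
    row.update({f"name_{key}": value for key, value in row["name"].items()})
    # row now also contains top-level "name_first_name" and "name_last_name" columns.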

Enhancements

PLUGIN-1245: In the Joiner transformation, renamed the Distribution Skewed Input Stage property to Skewed Input Stage. Changed UI label only.

PLUGIN-1118: In Google Cloud Storage File Reader batch source and Amazon S3 batch source plugins, added the Enable Quoted Values property, which lets you treat content between quotes as a value.

PLUGIN-1107: In the Google Cloud Data Loss Prevention (DLP) Decrypt Transformation and Google Cloud Data Loss Prevention (DLP) Redact Transformation, added the Resource Location property, which lets you specify the resource location for the DLP Service. For more information, see https://cloud.google.com/dlp/docs/specifying-location.

PLUGIN-1004, CDAP-18386: Improved connection management to allow users to edit connections. Removed option to view connections.

PLUGIN-984: Added support for connections to the following plugins:

PLUGIN-968: Added support for connections in the following sinks:

PLUGIN-965: In the GCS Done File Marker post-action plugin, added the Location property, which lets you use buckets and customer-managed encryption keys in locations outside the US.

PLUGIN-926, PLUGIN-939: In the BigQuery Execution Action plugin and the BigQuery Argument Setter action plugin, added support for the Dataset Project ID property, which is the Project ID of the dataset that stores the query results. It's required if the dataset is in a different project than the BigQuery job.

PLUGIN-731: In BigQuery sinks, added support for the BigNumeric data type.

PLUGIN-670: In the BigQuery Table Batch Source, added the ability to query any temporary table in any project when you set the Enable querying views property to Yes. Previously, you could only query views.

PLUGIN-650: In Google Data Loss Prevention plugins, added support for templates from other projects.

CDAP-18982: Added a new pipeline state for when you manually stop a pipeline run: Stopping.

CDAP-18778: In the BigQuery Execute action plugin, added the ability to include the Google Drive scope for the service account so it can read from external tables created from Drive.

CDAP-18713: Added support for setting up workload identity in separate Kubernetes namespaces.

CDAP-18655: Improved generic Database source plugin to correctly read decimal data.

CDAP-18556: Improved Google Cloud Platform plugins to validate the Encryption Key Name property. 

CDAP-18456: In replication configurations, added the ability to enable soft deletes for a BigQuery target.

CDAP-18405: Improved connection management to allow users to browse partial hierarchies like BigQuery datasets and Dataplex zones.

CDAP-18318: Permission checks are now required for updating/viewing system service information.

CDAP-16035: In Wrangler, added support for nested arrays, such as the BigQuery STRUCT data type.

In the Amazon S3 connection and Amazon S3 batch source plugins, added the Session Token property.

In the Google Cloud Storage File Reader batch source plugin, added the Allow Empty Input property.

In the Joiner transformation, added the Input with Larger Data Skew property.

In the Google Cloud Storage File Reader batch source plugin, Amazon S3 batch source plugin, and File batch source plugin, changed the Skip Header property name to Use First Row as Header.

Behavior Changes

CDAP-18990: In the Pipeline Studio, if you click Stop on a running pipeline and the pipeline does not stop within 6 hours, the run is forcefully terminated.

CDAP-18918: In the Deduplicate Analytics plugin, limited the Filter Operation property to one record. If this property is not set, one random record is chosen from the group of duplicate records.

PLUGIN-795: BigQuery sinks now support nullable arrays. A NULL array is converted to an empty array at insertion time.

Wrangler no longer infers all values in CSV files as strings. Instead, it maps each column to a corresponding data type.

Bug Fixes

PLUGIN-1210: Fixed an issue in the Group By transformation where Longest String and Shortest String aggregators returned an empty string "", even when all records contained null values in the specified field. The Group By transformation now returns null for empty input.

PLUGIN-1183: Fixed an issue in the Group By transformation that caused the Concat and Concat Distinct aggregate functions to produce incorrect results in some cases.

PLUGIN-1177: Fixed an issue in the Group By transformation that caused the Variance, Variance If, and Standard Deviation aggregate functions to produce incorrect results in some cases.

PLUGIN-1126: In the Oracle and MySQL Batch Source plugins, fixed an issue so that all timestamps from the database, specifically those older than the Gregorian cutover date (October 15, 1582), are treated in Gregorian calendar format.

PLUGIN-1074: Improved the generic Database source plugin to correctly read data when the data type is NUMBER, scale is set, and the data contains integer values.

PLUGIN-1024: Fixed an issue in the Router transformation that resulted in an error when the Default handling property was set to Skip.

PLUGIN-1022: Fixed an issue that caused pipelines that include a Conditional plugin and run on MapReduce to fail.

PLUGIN-972: Fixed an issue in sources (such as File and Cloud Storage) that resulted in an error if you clicked Get Schema when the source file contained delimiters used in regular expressions, such as "|" or ".". You no longer need to escape delimiters for sources.

PLUGIN-733: Fixed an issue where Google Cloud Datastore sources read a maximum of 300 records. Datastore sources now read all records.

PLUGIN-704: Fixed an issue in BigQuery sinks where the output table was not partitioned correctly under the following circumstances:

  • The output table doesn’t exist.

  • Partitioning type is set to Time.

  • Operation is set to Upsert.

PLUGIN-694: Fixed an issue that caused pipelines with BigQuery sinks that have input schemas with nested array fields to fail.

PLUGIN-682: Fixed an access-level issue that caused pipelines with Elasticsearch and MongoDB plugins to fail. 

CDAP-18994: Fixed issues that caused failures when reading maps and named enums from Avro files.

CDAP-18992: Fixed an issue in the Replication BigQuery target plugin where French characters in the source table were transferred incorrectly, getting replaced with '?'.

CDAP-18974: Fixed an issue in the MySQL replication source where the timestamp data type was mapped to string. Now, timestamp correctly maps to timestamp.

CDAP-18878: Fixed an issue where the Amazon S3 source and sink failed on Spark 3 when using the s3n scheme in the Path property.

CDAP-18806: Fixed an issue in the GCS connection so it can read files with spaces in the name.

CDAP-18900: Fixed an issue where FileSecureStoreService did not properly store keys in case-sensitive namespaces.

CDAP-18860: Removed whitespace trimming from the runtime arguments UI. Whitespace can now properly be set as an argument value.

CDAP-18786: Fixed an issue in plugin templates where the Lock change option did not work.

CDAP-18692: Fixed an issue that caused Null Pointer Exceptions when handling arrays of records in BigQuery.

CDAP-18396: Fixed an issue where connection names allowed special characters. Now, connection names can only include letters, numbers, hyphens, and underscores.

CDAP-18009: Fixed an issue where the CDAP Pipeline Studio UI automatically checked the Null box if a schema record had the array data type.

CDAP-17955: Replication assessment warnings no longer block draft job deployment.

Deprecations

MapReduce Compute Engine

CDAP-18913: The MapReduce compute engine is deprecated and will be removed in a future release. Recommended: Use Spark as the compute engine for data pipelines.

Spark Compute Engine running on Scala 2.11

CDAP-19063: Spark running on Scala 2.11 is no longer supported. CDAP supports Spark 2.4+ running on Scala 2.12 only.

CDAP-19016: Spark-specific metrics are no longer served through the CDAP metrics API.

Wrangler

CDAP-18897: Deprecated the Set first row as header option for the parse-as-csv Wrangler directive. Parsing should be configured at the connection or source layer, not at the transformation layer. For more information, see Parsing Files in Wrangler.

Plugins

The following plugins are deprecated and will be removed in a future release:

  • Avro Snapshot Dataset batch source

  • Parquet Snapshot Dataset batch source

  • CDAP Table Dataset batch source

  • Avro Time Partitioned Dataset batch source

  • Parquet Time Partitioned Dataset batch source

  • Key Value Dataset batch source

  • Key Value Dataset sink

  • Avro Snapshot Dataset sink

  • Parquet Snapshot Dataset sink

  • Snapshot text sink

  • CDAP Table Dataset sink

  • Avro Time Partitioned Dataset sink

  • ORC Time Partitioned Dataset sink

  • Parquet Time Partitioned Dataset sink

  • MD5/SHA Field Dataset transformation

  • Value Mapper transformation

  • ML Predictor analytics

Amazon S3 Batch Source and Sink

CDAP-18878: s3n is deprecated as a scheme in the Path property in Amazon S3 Batch Source and Amazon S3 Sink plugins. If the Path property includes s3n, it is converted to s3a during runtime. 

Known Issues

PostgreSQL batch source and sink plugins

PLUGIN-1126: Any timestamps older than the Gregorian cutover date (October 15, 1582) will not be represented correctly in the pipeline.

SQL Server Replication Source

CDAP-19354: The default setting for the snapshot transaction isolation level (snapshot.isolation.mode) is repeatable_read, which locks the source table until the initial snapshot completes. If the initial snapshot takes a long time, this can block other queries. 

If this transaction isolation level doesn't work for you or is not enabled on the SQL Server instance, follow these steps:

  1. Configure SQL Server with one of the following transaction isolation levels:

  • In most cases, set snapshot.isolation.mode to snapshot.

  • If schema modification will not happen during the initial snapshot, set snapshot.isolation.mode to read_committed.

For more information, see Enable the snapshot transaction isolation level in SQL Server 2005 Analysis Services.

  2. After SQL Server is configured, pass a Debezium argument to the Replication job. To pass a Debezium argument to a Replication job in CDAP, specify a runtime argument prefixed with source.connector. For example, set the Key to source.connector.snapshot.isolation.mode and the Value to snapshot, as shown in the sketch below.

For more information about setting a Debezium property, see Pass a Debezium argument to a Replication job.
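To make the shape of this runtime argument concrete, here is a minimal sketch in Python of the key-value pair described in step 2. It only illustrates the source.connector prefix convention; it is not a CDAP API call, and in practice you enter the pair as the Key and Value of a runtime argument on the Replication job.

    # Illustration only: the runtime argument that forwards a Debezium property.
    # The source.connector prefix marks the rest of the key
    # ("snapshot.isolation.mode") as a Debezium connector property, per step 2 above.
    runtime_arguments = {
        "source.connector.snapshot.isolation.mode": "snapshot",
    }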

Dataproc

CDAP version 6.7.0 does not support Dataproc version 1.3. For more information, see the compatible versions of Dataproc.

Replication

CDAP-19622: Upgrading replication jobs is broken. We recommend that you do not upgrade replication jobs to 6.7.0.
