CDAP Release 6.8.0

Release Date: December 1, 2022

New Features

The Dataplex Batch Source and Dataplex Sink plugins are generally available (GA).

CDAP-19592: For Oracle (by Datastream) replication sources, added a purge policy for the GCS (Google Cloud Storage) bucket that the plugin creates for Datastream to write its output to.

CDAP-19584: Added support for monitoring CDAP pipelines using an external tool.

CDAP-18450: Added support for AND triggers. Now, you can create OR and AND triggers. Previously, all triggers were OR triggers.

PLUGIN-871: Added support for BigQuery batch source pushdown. See below:

Transformation Pushdown (BigQuery Source Pushdown)

For CDAP release 6.8, we are introducing BigQuery source pushdown, which improves the performance of Transformation Pushdown when the input source is BigQuery.

Faster reads from BigQuery Sources

We have implemented the ability to execute compatible Transformation Pushdown operations directly in BigQuery when a BigQuery source is used.

Less Data Movement

Prior to CDAP 6.8, when the source was BigQuery and Transformation Pushdown was enabled, the BigQuery source stages had to be read into Spark before being pushed back down into the BigQuery execution dataset for execution. This involved data movement from BQ (source) -> GCS -> Spark -> GCS -> BQ (execution dataset). With this improvement, the data movement is reduced to BQ (source) -> BQ (execution dataset).

Partitioning and Clustering properties are preserved

Before CDAP 6.8, the source table’s partitioning and clustering properties were lost because of the data movement through GCS. With source pushdown, these properties are preserved in the execution dataset and can be used to optimize subsequent operations such as joins.

Requirements

Ensure the following requirements are met when using Transformation Pushdown with a BigQuery source:

  1. The Service Account configured for BigQuery Transformation Pushdown has permissions to read tables in the BigQuery Source’s dataset.

  2. The Datasets used for BigQuery Source and BigQuery Transformation Pushdown must be stored in the same location.

For the pipeline shown in the accompanying figure, the whole operation is executed in BigQuery.

Behavior Changes

CDAP-19667: Reference name is no longer required for the following plugins:

  • BigQuery Source

  • BigQuery Sink

  • Dataplex Source

  • Dataplex Sink

  • Spanner Sink

  • GCS Sink

For these plugins, unique identifiers in lineage are generated from their configuration properties. For example, project ID + dataset + table is used as the unique identifier for BigQuery. This identifier appears on the lineage diagram.

Enhancements

CDAP-19678: Added the ability to specify Kubernetes affinity for CDAP services in the CDAP custom resource.

CDAP-19605: Added the ability to see logs from the Twill application master in pipeline logs.

CDAP-19591: In the Datastream replication source, added the GCS Bucket Location property, which specifies the location of the GCS bucket that Datastream writes its output to.

CDAP-19590: In the Datastream replication source, added the list of Datastream regions to the Region property. You no longer need to manually enter the Datastream region.

CDAP-19589: For replication jobs with an Oracle (by Datastream) source, ensured data consistency when multiple CDC events are generated at the same timestamp, by ordering events reliably.     

CDAP-19568: Significantly improved the time it takes to start a pipeline (after provisioning).

CDAP-19555, CDAP-19554: Made the following improvements and changes for streaming pipelines with a single Kafka Consumer Streaming source and no Windower plugins:

  • The Kafka Consumer Streaming source has native support for at-least-once processing, so data is guaranteed to be processed at least once.

CDAP-19501: For Replication jobs, improved performance for Review Assessment.

CDAP-19475: Modified /app endpoints (GET and POST) in AppLifecycleHttpHandler to include the following information in the response:

  "change": {

      "author": "joe",

      "creationTimeMillis": 1668540944833,

      "latest": true

}

The new information is included in the responses of these GET and POST /app endpoints.
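For illustration only, the sketch below shows where the change block might appear in the detail returned by a GET request for an application. The namespace (default), application name (myPipeline), and the other response fields shown are assumptions, and most response fields are omitted; only the change block itself is from this release.

  GET /v3/namespaces/default/apps/myPipeline

  {
    "name": "myPipeline",
    "appVersion": "-SNAPSHOT",
    "change": {
      "author": "joe",
      "creationTimeMillis": 1668540944833,
      "latest": true
    }
  }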

CDAP-19365: Changed the Datastream replication source to identify each row by the primary key of the table. Previously, the plugin identified each row by the ROWID.

CDAP-19328: Splitter Transformation-based plugins now have access to the prepareRun and onRunFinish methods.

CDAP-18430: The Lineage page has a new look-and-feel.

PLUGIN-1378: In the Dataplex Sink plugin, added a new property, Update Dataplex Metadata, which adds support for updating metadata in Dataplex for newly generated data.

PLUGIN-1374: Improved performance for batch pipelines with MySQL sinks.

PLUGIN-1333: Improved Kafka Producer Sink performance. 

PLUGIN-664: In the Google Cloud Storage Delete Action plugin, added support for bulk deletion of files and folders. You can now use the asterisk (*) wildcard character to represent any character; see the example below.
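As a hypothetical example (the bucket and folder names below are not from this release), wildcard paths of the following form can be used to delete matching objects in bulk:

  gs://my-bucket/staging/2022-11-*
  gs://my-bucket/tmp/*/part-*.csv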

PLUGIN-641: In Wrangler, added the Average arithmetic function, which calculates the average of the selected columns.

In Wrangler, numeric functions now support three or more columns.

Bug Fixes

CDAP-20002: Removed the CDAP Tour from the Welcome page.

CDAP-19939: Fixed an issue in the BigQuery target replication plugin that caused replication jobs to fail when replicating datetime columns from sources that are more precise than a microsecond, for example, the datetime2 data type in SQL Server.

CDAP-19970: Google Cloud Data Loss Prevention plugins (version 1.4.0) are available in the CDAP Hub version 6.8.0 with the following changes:

  • For the Google Cloud Data Loss Prevention (DLP) PII Filter Transformation, fixed an issue where pipelines failed because the DLP client was not initialized. 

  • For all of the Google Cloud Data Loss Prevention (DLP) transformations, added relevant exception details when validation of the DLP Inspection template fails, rather than throwing a generic IllegalArgumentException.

CDAP-19630: For custom Dataproc compute profiles, fixed an issue where the wrong GCS bucket was used to stage data. Now, CDAP uses the GCS bucket specified in the custom compute profile.

CDAP-19599: Fixed an issue in the BigQuery Replication Target plugin that caused replication jobs to fail when the BigQuery target table already existed. The new version of the plugin will automatically be used in new replication jobs. Due to CDAP-19622, if you want to use the new plugin version in existing jobs, recreate each replication job.

CDAP-19486: In the Wrangler transformation, fixed an issue where the pipeline didn’t fail when the Error Handling property was set to Fail Pipeline. This happened when an error was returned but no exception was thrown and there were 0 records in the output, for example, when one of the directives (such as parse-as-simple-date) failed because the input data was not in the correct format. This fix is behind a feature flag and is not enabled by default. If the feature flag is enabled, existing pipelines might fail if there are data issues, because the default Error Handling property is Fail Pipeline.

CDAP-19481: Fixed an issue that caused Replication Assessment to hang when the Oracle (by Datastream) GCS Bucket property was empty or had an invalid bucket name. Now, CDAP returns a 400 error code during assessment when the property is empty or has an invalid bucket name.

CDAP-19455: Added user error tags to Dataproc errors returned during cluster creation and job submission. Added the ability to set a troubleshooting documentation URL in the CDAP site configuration for Dataproc API errors.

CDAP-19442: Fixed an issue that caused Replication jobs to fail when the source column name didn’t comply with BigQuery naming conventions. Now, if a source column name doesn’t comply with BigQuery naming conventions, CDAP replaces invalid characters with an underscore, prepends an underscore if the first character is a number, and truncates the name if it exceeds the maximum length. 

CDAP-19266: In the File batch source, fixed an issue where Get Schema appeared only when Format was set to delimited. Now, Get Schema appears for all formats. 

CDAP-18846: Fixed an issue with the output schema when connecting a Splitter transformation to a Joiner transformation.

CDAP-18302: Fixed an issue where Compute Profile creation failed without showing an error message in the CDAP UI. Now, CDAP shows an error message when a Compute Profile is missing required properties.

CDAP-17619: Fixed an issue that caused imports in the CDAP UI to fail for pipelines exported through the Pipeline Microservices.

CDAP-13130: Fixed an issue where you couldn’t keep an earlier version of a plugin when you exported a pipeline and then imported it into the same version of CDAP, even though the earlier version of the plugin was deployed in CDAP. Now, when you import a pipeline that uses an earlier version of a plugin, if that version exists in the CDAP instance or namespace, you can choose to keep the earlier version or upgrade to the current version. For example, if you export a pipeline with a BigQuery source (version 0.20.0) and then import it into CDAP, you can choose to keep version 0.20.0 or upgrade to version 0.21.0.

PLUGIN-1433: In the Oracle Batch Source, when the source data included fields with the Numeric data type (undefined precision and scale), CDAP set the precision to 38 and the scale to 0. If any values in the field had a scale other than 0, CDAP truncated those values, which could result in data loss. If the scale for a field was overridden in the plugin output schema, the pipeline failed.

Now, if an Oracle source has Numeric data type fields with undefined precision and scale, you must manually set the scale for these fields in the plugin output schema. When you run the pipeline, it will not fail, and the new scale will be used for those fields. However, values whose scale is greater than the scale defined in the plugin might be truncated. CDAP writes warning messages in the pipeline log indicating the presence of Numbers with undefined precision and scale. For more information about setting precision and scale in a plugin, see Changing the precision and scale for decimal fields in the output schema.
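As a rough sketch of what setting the scale looks like, the snippet below shows a single nullable decimal field in a plugin output schema, using the Avro-style decimal logical type that CDAP schemas use. The field name AMOUNT and the scale of 2 are assumptions for the example, not values from this release.

  {
    "name": "AMOUNT",
    "type": [
      { "type": "bytes", "logicalType": "decimal", "precision": 38, "scale": 2 },
      "null"
    ]
  }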

PLUGIN-1325: In Wrangler, fixed an issue that caused the Wrangler UI to hang when a BigQuery table name contained characters besides alphanumeric characters and underscores (such as a dash). Now, Wrangler successfully imports BigQuery tables that comply with BigQuery naming conventions.

PLUGIN-826: In the HTTP batch source plugin, fixed an issue where validation failed when the URL property contained a macro and Pagination Type was set to Increment an index.

Security Fixes

The following vulnerabilities were found in open source libraries:

  • Arbitrary Code Execution

  • Deserialization of Untrusted Data

  • SQL Injection 

  • Information Exposure

  • Hash Collision

  • Remote Code Execution (RCE)

To address these vulnerabilities, the following libraries have security fixes:

  • commons-collections:commons-collections (Deserialization of Untrusted Data). Upgraded to apply security fixes.

  • commons-fileupload:commons-fileupload (Arbitrary Code Execution). Upgraded to apply security fixes.

  • ch.qos.logback:logback-core (Arbitrary Code Execution). Upgraded to apply security fixes.

  • org.apache.hive:hive-jdbc (SQL Injection). Excluded the org.apache.hive:hive-jdbc dependency.

  • org.bouncycastle:bcprov-jdk16 (Hash Collision).

  • com.fasterxml.jackson.core:jackson-databind (Deserialization of Untrusted Data). Upgraded to apply security fixes.

Deprecations

CDAP-19559: For streaming pipelines, the Pipeline configuration properties Checkpointing and Checkpoint directory are deprecated. Setting these properties will no longer have any effect. 

CDAP will decide automatically whether checkpointing or CDAP internal state tracking is enabled. To disable at-least-once processing in streaming pipelines, set the runtime argument cdap.streaming.atleastonce.enabled to false. Both Spark checkpointing and state tracking are disabled when this argument is set to false.
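For example, the argument can be supplied in the pipeline's runtime arguments (for instance, in the Runtime Arguments dialog or in the JSON map passed when starting the program). This is a minimal sketch, with any other runtime arguments omitted:

  {
    "cdap.streaming.atleastonce.enabled": "false"
  }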

Known Issues

CDAP-20431: In CDAP versions 6.8.0 and 6.8.1, there's a known issue that might cause pipelines running on GCP Dataproc to fail with the following error: Unsupported program type: Spark. The first time a pipeline that only contains actions runs on a newly created or upgraded instance, it succeeds. However, subsequent pipeline runs that include sources or sinks might fail with this error. For updated settings, see Troubleshooting.

Replication

CDAP-19622: Upgrade for MySQL and SQL Server replication jobs is broken. Upgrading MySQL and SQL Server replication jobs from CDAP 6.7.1 and 6.7.2 to CDAP 6.8.0 isn’t recommended.

CDAP-20013: Upgrade for Oracle by Datastream replication jobs is broken. Upgrading Oracle by Datastream replication jobs from CDAP 6.6.0, 6.7.1, and 6.7.2 to CDAP 6.8.0 isn’t recommended.

Secure Macros

CDAP-20271: Pipelines fail when they use a connection that includes a secure macro and the secure macro has JSON as the value (for example, the Service Account property).

Workarounds

Use one of the following workarounds:

(1) For existing, running pipelines, create a new secure key for the connection and escape all the quotes in the secure macro JSON, then edit the connection to use the new secure key JSON (see the sketch below). Note: If you use this approach, browsing and sampling in Wrangler, and other places that directly use the secure macro, will start to fail.
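As an illustration of the escaping, a service account key stored as a secure key might change from the original JSON to a quote-escaped string like the one below. The key fields shown are a truncated, hypothetical subset of a real service account key; the actual value must contain the full key.

  Original value:
  {"type": "service_account", "project_id": "my-project", "private_key_id": "..."}

  Escaped value for the new secure key:
  {\"type\": \"service_account\", \"project_id\": \"my-project\", \"private_key_id\": \"...\"}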

(2) Do not use the connection with the secure macro when you run the pipeline. 

To remove the connection, follow these steps:

  1. Duplicate the deployed pipeline.

  2. In the Pipeline Studio, for each plugin that uses the connection, turn off the connection and then edit the Service Account property to include the secure macro. 
