CDAP-18437: Transformation Pushdown supports the BigQuery Storage Read API to improve performance when extracting data from BigQuery.
PLUGIN-1001: Added support for connections to Transformation Pushdown.
Added support to parse files before loading data into the Wrangler workspace. This means the recipe does not include parse directives. Now, when you create a pipeline from Wrangler, the source has the correct Format property.
Added support to allow users to import the schema for formats such as JSON and some AVRO files where schema inference is not possible before loading data into the Wrangler workspace.
Added the following directives:
CREATE-RECORD. Creates a record column with nested values by copying values from source columns into a destination column.
FLATTEN-RECORD. Splits a record column with nested values into multiple columns.
Note: In releases before CDAP 6.7.0, due to a bug, the FLATTEN directive worked with both an array and a record column with nested values. Starting in CDAP 6.7.0, the FLATTEN directive only works with arrays.
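To clarify what the two directives above do, here is a minimal Python sketch of their semantics, using plain dicts to stand in for Wrangler rows. The function names, column-naming scheme, and row layout are assumptions for illustration only, not Wrangler syntax:

```python
# Illustrative sketch of CREATE-RECORD / FLATTEN-RECORD semantics.

def create_record(row, dest, source_cols):
    """Copy values from source columns into a nested record column."""
    row = dict(row)
    row[dest] = {col: row.pop(col) for col in source_cols if col in row}
    return row

def flatten_record(row, col):
    """Split a record column with nested values into top-level columns."""
    row = dict(row)
    nested = row.pop(col, {})
    for key, value in nested.items():
        row[f"{col}_{key}"] = value  # assumed naming scheme
    return row

row = {"id": 1, "street": "Main St", "city": "Springfield"}
nested = create_record(row, "address", ["street", "city"])
# → {"id": 1, "address": {"street": "Main St", "city": "Springfield"}}
flat = flatten_record(nested, "address")
# → {"id": 1, "address_street": "Main St", "address_city": "Springfield"}
```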
PLUGIN-670: In the BigQuery Table Batch Source, added the ability to query any temporary table in any project when you set the Enable querying views property to Yes. Previously, you could only query views.
PLUGIN-650: In Google Data Loss Prevention plugins, added support for templates from other projects.
CDAP-18982: Added a new pipeline state for when you manually stop a pipeline run: Stopping.
CDAP-18990: In the Pipeline Studio, if you click Stop on a running pipeline and the pipeline does not stop within 6 hours, the pipeline is forcefully terminated.
CDAP-18918: In the Deduplicate Analytics plugin, limited the Filter Operation property to one record. If this property is not set, one random record is chosen from the group of duplicate records.
PLUGIN-795: The BigQuery sink supports Nullable Arrays. A NULL array is converted to an empty array at insertion time.
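The NULL-array handling can be sketched as a normalization step applied to each record before insertion. This is an illustration of the described behavior, not the plugin's actual code; the function name and record layout are assumptions:

```python
def normalize_arrays(record, array_fields):
    """Replace NULL values in array-typed fields with empty arrays,
    mirroring the BigQuery sink behavior described above (illustrative)."""
    return {key: ([] if key in array_fields and value is None else value)
            for key, value in record.items()}

normalize_arrays({"id": 7, "tags": None}, {"tags"})
# → {"id": 7, "tags": []}
```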
Wrangler no longer infers all values in CSV files as strings. Instead, it maps each column to a corresponding data type.
PLUGIN-1210: Fixed an issue in the Group By transformation where Longest String and Shortest String aggregators returned an empty string "", even when all records contained null values in the specified field. The Group By transformation now returns null for empty input.
PLUGIN-1183: Fixed an issue in the Group By transformation that caused the Concat and Concat Distinct aggregate functions to produce incorrect results in some cases.
PLUGIN-1177: Fixed an issue in the Group By transformation that caused the Variance, Variance If, and Standard Deviation aggregate functions to produce incorrect results in some cases.
PLUGIN-1126: In the Oracle and MySQL Batch Source plugins, fixed an issue so that all timestamps from the database, specifically those older than the Gregorian cutover date (October 15, 1582), are treated in Gregorian calendar format.
PLUGIN-1074: Improved the generic Database source plugin to correctly read data when the data type is NUMBER, scale is set, and the data contains integer values.
PLUGIN-1024: Fixed an issue in the Router transformation that resulted in an error when the Default handling property was set to Skip.
PLUGIN-1022: Fixed an issue that caused pipelines that contain a Conditional plugin and run on MapReduce to fail.
PLUGIN-972: Fixed an issue in sources (such as File and Cloud Storage) that resulted in an error if you clicked Get Schema when the source file contained delimiters used in regular expressions, such as "|" or ".". You no longer need to escape delimiters for sources.
PLUGIN-733: Fixed an issue where Google Cloud Datastore sources read a maximum of 300 records. Datastore sources now read all records.
PLUGIN-704: Fixed an issue in BigQuery sinks where the output table was not partitioned correctly under the following circumstances:
The output table doesn’t exist.
Partitioning type is set to Time.
Operation is set to Upsert.
PLUGIN-694: Fixed an issue that caused pipelines with BigQuery sinks that have input schemas with nested array fields to fail.
PLUGIN-682: Fixed an access-level issue that caused pipelines with Elasticsearch and MongoDB plugins to fail.
CDAP-18994: Fixed issues that caused failures when reading maps and named enums from Avro files.
CDAP-18992: Fixed an issue in the Replication BigQuery target plugin where French characters in the source table were transferred incorrectly, getting replaced with '?'.
CDAP-18974: Fixed an issue in the MySQL replication source where timestamp mapped to string. Now, timestamp correctly maps to timestamp.
CDAP-18878: Fixed an issue where Amazon S3 source and sink failed on Spark3 when using s3n as the Path property.
CDAP-18806: Fixed an issue in the GCS connection so it can read files with spaces in the name.
CDAP-18900: Fixed an issue where FileSecureStoreService did not properly store keys in case-sensitive namespaces.
CDAP-18860: Removed whitespace trimming from the runtime arguments UI. Whitespace can now properly be set as an argument value.
CDAP-18786: Fixed an issue in plugin templates where the Lock change option did not work.
CDAP-18692: Fixed an issue that caused Null Pointer Exceptions when dealing with Array of Records in BigQuery.
CDAP-18396: Fixed an issue where connection names allowed special characters. Now, connection names can only include letters, numbers, hyphens, and underscores.
CDAP-18009: Fixed an issue where the CDAP Pipeline Studio UI automatically checks the Null box if a schema record has the array data type.
CDAP-17955: Replication assessment warnings no longer block draft job deployment.
MapReduce Compute Engine
CDAP-18913: The MapReduce compute engine is deprecated and will be removed in a future release. Recommended: Use Spark as the compute engine for data pipelines.
Spark Compute Engine running on Scala 2.11
CDAP-19063: Spark running on Scala 2.11 is no longer supported. CDAP supports Spark 2.4+ running on Scala 2.12 only.
CDAP-19016: Spark-specific metrics are no longer served through the CDAP metrics API.
CDAP-18897: Deprecated the Set first row as header option for the parse-as-csv Wrangler directive. Parsing should be configured at the connection or source layer, not at the transformation layer. For more information, see Parsing Files in Wrangler.
The following plugins are deprecated and will be removed in a future release:
Avro Snapshot Dataset batch source
Parquet Snapshot Dataset batch source
CDAP Table Dataset batch source
Avro Time Partitioned Dataset batch source
Parquet Time Partitioned Dataset batch source
Key Value Dataset batch source
Key Value Dataset sink
Avro Snapshot Dataset sink
Parquet Snapshot Dataset sink
Snapshot text sink
CDAP Table Dataset sink
Avro Time Partitioned Dataset sink
ORC Time Partitioned Dataset sink
Parquet Time Partitioned Dataset sink
MD5/SHA Field Dataset transformation
Value Mapper transformation
ML Predictor analytics
Amazon S3 Batch Source and Sink
CDAP-18878: s3n is deprecated as a scheme in the Path property in Amazon S3 Batch Source and Amazon S3 Sink plugins. If the Path property includes s3n, it is converted to s3a during runtime.
PostgreSQL batch source and sink plugins
PLUGIN-1126: Any timestamps older than the Gregorian cutover date (October 15, 1582) will not be represented correctly in the pipeline.
SQL Server Replication Source
CDAP-19354: The default setting for the snapshot transaction isolation level (snapshot.isolation.mode) is repeatable_read, which locks the source table until the initial snapshot completes. If the initial snapshot takes a long time, this can block other queries.
If snapshot transaction isolation doesn't work or is not enabled on the SQL Server instance, follow these steps:
1. Configure SQL Server with one of the following transaction isolation levels:
In most cases, set snapshot.isolation.mode to snapshot.
If schema modification will not happen during the initial snapshot, set snapshot.isolation.mode to read_committed.
2. After SQL Server is configured, pass a Debezium argument to the Replication job. To pass a Debezium argument to a Replication job in CDAP, specify a runtime argument prefixed with source.connector. For example, set the Key to source.connector.snapshot.isolation.mode and the Value to snapshot.
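The runtime argument above can also be supplied programmatically through the CDAP program-start REST endpoint, which accepts runtime arguments as a JSON body. The sketch below assumes placeholder values for the host, namespace, app name, and program path; only the source.connector prefix comes from the text above:

```python
# Sketch: pass a Debezium property to a CDAP Replication job as a
# runtime argument via the CDAP REST API. Host, namespace, app, and
# program names below are placeholders, not values from this release.
import json
from urllib import request

def start_url(host, namespace, app, program_type, program):
    """Build the CDAP program-start endpoint URL."""
    return (f"http://{host}/v3/namespaces/{namespace}/apps/{app}"
            f"/{program_type}/{program}/start")

# Prefix the Debezium property name with source.connector.
runtime_args = {"source.connector.snapshot.isolation.mode": "snapshot"}

def start_program(host, namespace, app, program_type, program, args):
    """POST the runtime arguments as JSON to start the program."""
    req = request.Request(
        start_url(host, namespace, app, program_type, program),
        data=json.dumps(args).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return request.urlopen(req)

# Example call (placeholder names, not executed here):
# start_program("cdap-host:11015", "default", "my-replication-job",
#               "workers", "DeltaWorker", runtime_args)
```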