...
This plugin groups the records by the specified key and sort them based on the start date field and compute the end date based on that. The end date is computed as the start date of the next value minus 1 second. If there is no next value, 9999-12-31 is used.
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Key | Yes | Required. The name of the key field. The records will be grouped based on the key. This field must be a boolean, int, long, float, double, bytes or string type. |
Start Date Field | Yes | Required. The name of the start date field. The grouped records are sorted based on this field. This must be a timestamp. |
End Date Field | Yes | Required. The name of the end date field. The sorted results are iterated to compute the value of this field based on the start date. |
End Date Offset | Yes | Optional. The offset to compute the end date. The end date will be computed as the next start date minus this offset. The format is expected to be a number followed by an ‘us’, ‘ms’, ‘s’, ‘m’, ‘h’, or ‘d’ specifying the time unit, with ‘us’ for microseconds, ‘ms’ for milliseconds, ‘s’ for seconds, ‘m’ for minutes, ‘h’ for hours, and ‘d’ for days. |
Fill in Null | Yes | Required. Fill in null fields from most recent previous records with same id. See “Fill in Null Example” below. Default is False. |
Deduplicate | Yes | Required. Deduplicate consecutive records that have no changes. This can be used with the blacklist to exclude some fields from comparison. The blacklist typically includes the start time field. See “Deduplicate Example” below. Default is False. |
Black List | Yes | Optional. Blacklist for fields to ignore to compare when deduplicating the record. |
Preserve Target | Yes | Optional. Preserve all the records from the scd2 target table when deduplicating. |
Target Field | Yes | Optional. The name of the field that tells whether the record is coming from the scd2 target table. This field must be boolean. For example, suppose the target field is set to ‘is_target’ and the following records are processed: |
Number of Partitions | Yes | Optional. Number of partitions to use when grouping the data. This number determines the level of parallelism for the job. A reasonable starting point is to divide your cluster memory by the pipeline executor memory and set that as the number of partitions. If not specified. If not specified, 200 is used as default. |
Fill in Null Example
For example, suppose three records with same key are processed:
id | start_date | name | |
---|---|---|---|
1 | 0 | John | bcd@abc.com |
1 | 10 | null | abc@abc.com |
1 | 100 | null | null |
After filling in null, the resulting records will be:
id | start_date | name | |
---|---|---|---|
1 | 0 | John | bcd@abc.com |
1 | 10 | John | abc@abc.com |
1 | 100 | John | abc@abc.com |
Deduplicate Example
For example, suppose the following records are processed with start_date as blacklist:
id | start_date | name | |
---|---|---|---|
1 | 0 | John | bcd@abc.com |
1 | 10 | John | bcd@abc.com |
1 | 100 | Sam | abc@abc.com |
After deduplicating, the resulting records will be:
id | start_date | name | |
---|---|---|---|
1 | 0 | John | bcd@abc.com |
1 | 100 | Sam | abc@abc.com |
Example
For example, Suppose the plugin is configured to use the ‘id’ field as the key, and receives the following input records:
id | start_date | name |
---|---|---|
1 | 0 | 10000000 |
2 | 10 | 20000000 |
1 | 100000000 | 500000000 |
2 | 21000000 | 10000000 |
1 | 10000000 | null |
2 | 15000000 | null |
The records are grouped by id, then sorted by start_date. The end dates are updated to be N - 1, where N is the start date timestamp of the next record.
id | start_date | name |
---|---|---|
1 | 0 | 9999999 |
1 | 10000000 | 99999999 |
1 | 100000000 | 253402214400000000 |
2 | 10 | 14999999 |
2 | 15000000 | 20999999 |
2 | 21 | 253402214400000000 |