Page Comparison

...

This plugin groups the records by the specified key and sort them based on the start date field and compute the end date based on that. The end date is computed as the start date of the next value minus 1 second. If there is no next value, 9999-12-31 is used.

Configuration

Property	Macro Enabled?	Description
Key	Yes	Required. The name of the key field. The records will be grouped based on the key. This field must be a boolean, int, long, float, double, bytes or string type.
Start Date Field	Yes	Required. The name of the start date field. The grouped records are sorted based on this field. This must be a timestamp.
End Date Field	Yes	Required. The name of the end date field. The sorted results are iterated to compute the value of this field based on the start date.
End Date Offset	Yes	Optional. The offset to compute the end date. The end date will be computed as the next start date minus this offset. The format is expected to be a number followed by an ‘us’, ‘ms’, ‘s’, ‘m’, ‘h’, or ‘d’ specifying the time unit, with ‘us’ for microseconds, ‘ms’ for milliseconds, ‘s’ for seconds, ‘m’ for minutes, ‘h’ for hours, and ‘d’ for days.
Fill in Null	Yes	Required. Fill in null fields from most recent previous records with same id. See “Fill in Null Example” below. Default is False.
Deduplicate	Yes	Required. Deduplicate consecutive records that have no changes. This can be used with the blacklist to exclude some fields from comparison. The blacklist typically includes the start time field. See “Deduplicate Example” below. Default is False.
Black List	Yes	Optional. Blacklist for fields to ignore to compare when deduplicating the record.
Preserve Target	Yes	Optional. Preserve all the records from the scd2 target table when deduplicating.
Target Field	Yes	Optional. The name of the field that tells whether the record is coming from the scd2 target table. This field must be boolean. For example, suppose the target field is set to ‘is_target’ and the following records are processed:
Number of Partitions	Yes	Optional. Number of partitions to use when grouping the data. This number determines the level of parallelism for the job. A reasonable starting point is to divide your cluster memory by the pipeline executor memory and set that as the number of partitions. If not specified. If not specified, 200 is used as default.

Fill in Null Example

For example, suppose three records with same key are processed:

id	start_date	name	email
1	0	John	bcd@abc.com
1	10	null	abc@abc.com
1	100	null	null

After filling in null, the resulting records will be:

id	start_date	name	email
1	0	John	bcd@abc.com
1	10	John	abc@abc.com
1	100	John	abc@abc.com

Deduplicate Example

For example, suppose the following records are processed with start_date as blacklist:

id	start_date	name	email
1	0	John	bcd@abc.com
1	10	John	bcd@abc.com
1	100	Sam	abc@abc.com

After deduplicating, the resulting records will be:

id	start_date	name	email
1	0	John	bcd@abc.com
1	100	Sam	abc@abc.com

Example

For example, Suppose the plugin is configured to use the ‘id’ field as the key, and receives the following input records:

id	start_date	name
1	0	10000000
2	10	20000000
1	100000000	500000000
2	21000000	10000000
1	10000000	null
2	15000000	null

The records are grouped by id, then sorted by start_date. The end dates are updated to be N - 1, where N is the start date timestamp of the next record.

id	start_date	name
1	0	9999999
1	10000000	99999999
1	100000000	253402214400000000
2	10	14999999
2	15000000	20999999
2	21	253402214400000000

Versions Compared

Old Version 3

New Version Current

Key

Configuration

Fill in Null Example

Deduplicate Example

Example