Plugin version: 2.11.0

De-duplicates records either using one or more fields or by using the record as a whole. Additionally, it supports logically choosing a record out of the duplicate records based on a filter field. Supported logical functions are any, max and min. min and max must be operated only on numeric filter fields.

The any logical function will match any of the records where the value for the specified field is non-null. If not found, a record where the value of this field is null will be returned.

The aggregator is used when you want to filter out duplicates in the input in a predictable way.

The first and last logical functions are deprecated and will be removed in later versions of the platform.

BigQuery ELT Transformation Pushdown

Starting in CDAP 6.7.0, Deduplicate stages are eligible to execute in BigQuery when BigQuery ELT Transformation Pushdown is enabled in a pipeline. Deduplicate stages will be executed in BigQuery when a preceding stage has already been executed in BigQuery (such as a Join operation or another aggregation stage). The following filter operations are supported when deduplicating records: any, max, and min.

Configuration

Property	Macro Enabled?	Description
Unique Fields	Yes	Optional. A comma-separated list of fields on which to perform the deduplication. If none given, each record will be considered as a whole for deduplication. For example, if the input contain records with fields `fname`, `lname`, `item`, and `cost`, and we want to deduplicate the records by name, then this property should be set to `fname,lname`.
Filter Operation	Yes	Optional. An optional property that can be set to predictably choose one or more records from the set of records that needs to be deduplicated. This property takes in a field name and the logical operation that needs to be performed on that field on the set of records. The syntax is `field:function`. For example, if we want to choose the record with maximum cost for the records with schema `fname`, `lname`, `item`, `cost`, then this field should be set as `cost:max`. Supported functions are `any`, `max` and `min`. `min` and `max` must be operated only on numeric filter fields. The `first` and `last` logical functions are deprecated and will be removed in the future versions of the platform. Note: Only one pair of field and function is allowed. If this property is not set, one random record will be chosen from the group of ‘duplicate’ records.
Number of Partitions	Yes	Optional. Number of partitions to use when grouping unique fields. If not specified, the execution framework will decide on the number to use.
Output Schema	No	Required. The output schema for the data.

Example

This example deduplicates records by their fname and lname fields. Then, it chooses one record out of the duplicates based on the cost field. Since the function specified is max, the record with the maximum value in the cost field is chosen out of each set of duplicate records.

    {
        "name": "Deduplicate",
        "type": "batchaggregator",
        "properties": {
            "uniqueFields": "fname,lname",
            "filterOperation": "cost:max"
        }
    }

For example, suppose the aggregator receives input records where each record represents a purchase:

fname	lname	cost	zipcode
bob	smith	50.23	12345
bob	jones	30.64	23456
alice	smith	1.50	34567
bob	smith	0.50	45678
alice	smith	30.21	56789
alice	jones	500.93	67890

Output records will contain one record for each fname,lname combination that has the maximum cost:

fname	lname	cost	zipcode
bob	smith	50.23	12345
bob	jones	30.64	23456
alice	smith	30.21	56789
alice	jones	500.93	67890