Plugin version: 2.11.0
De-duplicates records either using one or more fields or by using the record as a whole. Additionally, it supports logically choosing a record out of the duplicate records based on a filter field. Supported logical functions are any
, max
and min
. min
and max
must be operated only on numeric filter fields.
The any
logical function will match any of the records where the value for the specified field is non-null. If not found, a record where the value of this field is null will be returned.
The aggregator is used when you want to filter out duplicates in the input in a predictable way.
The |
Starting in CDAP 6.7.0, Deduplicate stages are eligible to execute in BigQuery when BigQuery ELT Transformation Pushdown is enabled in a pipeline. Deduplicate stages will be executed in BigQuery when a preceding stage has already been executed in BigQuery (such as a Join operation or another aggregation stage). The following filter operations are supported when deduplicating records: any
, max
, and min
.
Property | Macro Enabled? | Description |
---|---|---|
Unique Fields | Yes | Optional. A comma-separated list of fields on which to perform the deduplication. If none given, each record will be considered as a whole for deduplication. For example, if the input contain records with fields |
Filter Operation | Yes | Optional. An optional property that can be set to predictably choose one or more records from the set of records that needs to be deduplicated. This property takes in a field name and the logical operation that needs to be performed on that field on the set of records. The syntax is Note: Only one pair of field and function is allowed. If this property is not set, one random record will be chosen from the group of ‘duplicate’ records. |
Number of Partitions | Yes | Optional. Number of partitions to use when grouping unique fields. If not specified, the execution framework will decide on the number to use. |
Output Schema | No | Required. The output schema for the data. |
This example deduplicates records by their fname
and lname
fields. Then, it chooses one record out of the duplicates based on the cost
field. Since the function specified is max
, the record with the maximum value in the cost
field is chosen out of each set of duplicate records.
{ "name": "Deduplicate", "type": "batchaggregator", "properties": { "uniqueFields": "fname,lname", "filterOperation": "cost:max" } } |
For example, suppose the aggregator receives input records where each record represents a purchase:
fname | lname | cost | zipcode |
---|---|---|---|
bob | smith | 50.23 | 12345 |
bob | jones | 30.64 | 23456 |
alice | smith | 1.50 | 34567 |
bob | smith | 0.50 | 45678 |
alice | smith | 30.21 | 56789 |
alice | jones | 500.93 | 67890 |
Output records will contain one record for each fname,lname
combination that has the maximum cost
:
fname | lname | cost | zipcode |
---|---|---|---|
bob | smith | 50.23 | 12345 |
bob | jones | 30.64 | 23456 |
alice | smith | 30.21 | 56789 |
alice | jones | 500.93 | 67890 |