Deduplicate Analytics
Plugin version: 2.11.0
De-duplicates records either using one or more fields or by using the record as a whole. Additionally, it supports logically choosing a record out of the duplicate records based on a filter field. Supported logical functions are any
, max
and min
. min
and max
must be operated only on numeric filter fields.
The any
logical function will match any of the records where the value for the specified field is non-null. If not found, a record where the value of this field is null will be returned.
The aggregator is used when you want to filter out duplicates in the input in a predictable way.
The first
and last
logical functions are deprecated and will be removed in later versions of the platform.
BigQuery ELT Transformation Pushdown
Starting in CDAP 6.7.0, Deduplicate stages are eligible to execute in BigQuery when BigQuery ELT Transformation Pushdown is enabled in a pipeline. Deduplicate stages will be executed in BigQuery when a preceding stage has already been executed in BigQuery (such as a Join operation or another aggregation stage). The following filter operations are supported when deduplicating records: any
, max
, and min
.
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Unique Fields | Yes | Optional. A comma-separated list of fields on which to perform the deduplication. If none given, each record will be considered as a whole for deduplication. For example, if the input contain records with fields |
Filter Operation | Yes | Optional. An optional property that can be set to predictably choose one or more records from the set of records that needs to be deduplicated. This property takes in a field name and the logical operation that needs to be performed on that field on the set of records. The syntax is Note: Only one pair of field and function is allowed. If this property is not set, one random record will be chosen from the group of ‘duplicate’ records. |
Number of Partitions | Yes | Optional. Number of partitions to use when grouping unique fields. If not specified, the execution framework will decide on the number to use. |
Output Schema | No | Required. The output schema for the data. |
Example
This example deduplicates records by their fname
and lname
fields. Then, it chooses one record out of the duplicates based on the cost
field. Since the function specified is max
, the record with the maximum value in the cost
field is chosen out of each set of duplicate records.
{
"name": "Deduplicate",
"type": "batchaggregator",
"properties": {
"uniqueFields": "fname,lname",
"filterOperation": "cost:max"
}
}
For example, suppose the aggregator receives input records where each record represents a purchase:
fname | lname | cost | zipcode |
---|---|---|---|
bob | smith | 50.23 | 12345 |
bob | jones | 30.64 | 23456 |
alice | smith | 1.50 | 34567 |
bob | smith | 0.50 | 45678 |
alice | smith | 30.21 | 56789 |
alice | jones | 500.93 | 67890 |
Output records will contain one record for each fname,lname
combination that has the maximum cost
:
fname | lname | cost | zipcode |
---|---|---|---|
bob | smith | 50.23 | 12345 |
bob | jones | 30.64 | 23456 |
alice | smith | 30.21 | 56789 |
alice | jones | 500.93 | 67890 |
Created in 2020 by Google Inc.