Group By Analytics
Plugin version: 2.11.0
Groups by one or more fields, then performs one or more aggregate functions on each group. Supported aggregate functions are Average, Count, Count Distinct, First, Last, Max, Min, Sum, Collect List, Collect Set, Standard Deviation, Variance, and Longest String.
The transform is used when you want to calculate some basic aggregations in your data similar to what you could do with a group-by query in SQL.
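As a rough analogy, the group-by-then-aggregate behavior can be sketched in plain Python (illustrative only; the records and field names below are made up, and this is not how the plugin is implemented):

```python
from collections import defaultdict

# Example input records: each dict is one record, as in a GROUP BY query
# such as: SELECT user, item, SUM(price), COUNT(*) ... GROUP BY user, item
records = [
    {"user": "bob", "item": "donut", "price": 0.80},
    {"user": "bob", "item": "donut", "price": 1.50},
    {"user": "alice", "item": "tea", "price": 1.99},
]

# Step 1: group records by the group-by fields (user, item).
groups = defaultdict(list)
for r in records:
    groups[(r["user"], r["item"])].append(r["price"])

# Step 2: apply aggregate functions (here Sum and Count) to each group.
results = {
    key: {"totalSpent": sum(prices), "numPurchased": len(prices)}
    for key, prices in groups.items()
}
```

Each output record contains the group-by field values plus one field per aggregate.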
BigQuery ELT Transformation Pushdown
Group By stages are now eligible to execute in BigQuery when BigQuery ELT Transformation Pushdown is enabled in a pipeline. Group By stages will be executed in BigQuery when a preceding stage has already been executed in BigQuery (such as a Join operation or another aggregation stage). The following aggregation operations are supported in BigQuery: Average, Collect List (null values are removed from the output array), Collect Set (null values are removed from the output array), Concat, Concat Distinct, Count, Count Distinct, Count Nulls, Logical And, Logical Or, Max, Min, Standard Deviation, Sum, and Variance. If a Group By stage contains any aggregation operation that is not supported in BigQuery, the stage will be executed in Spark.
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Group by fields | Yes | Required. Comma-separated list of fields to group by. Records with the same value for all of these fields are grouped together. Records output by this aggregator will contain all of the group-by fields and the aggregate fields. |
Aggregates | Yes | Required. Aggregates to compute on each group of records, using the supported aggregate functions listed above. Each aggregate must specify the field it is applied to, as well as the name of its output field. Aggregates are specified using the syntax name:function(field), with multiple aggregates separated by commas. |
Number of Partitions | Yes | Optional. Number of partitions to use when grouping fields. If not specified, the execution framework will decide on the number to use. |
Output Schema | Yes | Required. The output schema for the data. |
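A minimal sketch of how the name:function(field) syntax breaks down into its parts (this is an illustration, not the plugin's actual parser, and it does not handle conditional forms such as the :condition(...) suffix shown in the example below):

```python
import re

def parse_aggregate(spec: str) -> dict:
    """Split one aggregate spec like 'totalSpent:sum(price)' into parts."""
    m = re.fullmatch(r"(\w+):(\w+)\(([^)]*)\)", spec)
    if m is None:
        raise ValueError(f"invalid aggregate spec: {spec}")
    name, func, field = m.groups()
    return {"name": name, "function": func, "field": field}

# Multiple aggregates are comma-separated in the property value.
specs = "totalSpent:sum(price),numPurchased:count(*)"
parsed = [parse_aggregate(s) for s in specs.split(",")]
```

Here totalSpent and numPurchased become output field names, sum and count are the aggregate functions, and price (or * for count) is the input field.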
Example
This example groups records by their user and item fields. It then calculates three aggregates for each group: the first is a sum on price, the second counts the number of records in the group, and the third calculates the average price of each item, ignoring prices lower than 0.50.
{
    "name": "GroupByAggregate",
    "type": "batchaggregator",
    "properties": {
        "groupByFields": "user,item",
        "aggregates": "totalSpent:sum(price),numPurchased:count(*),avgItemPrice:avgIf(price):condition(price>=0.50)"
    }
}
For example, suppose the aggregator receives input records where each record represents a purchase:
user | item | price |
---|---|---|
bob | donut | 0.80 |
bob | coffee | 2.05 |
bob | coffee | 0.35 |
bob | donut | 1.50 |
bob | donut | 0.50 |
bob | donut | 0.45 |
bob | coffee | 3.50 |
alice | tea | 1.99 |
alice | cookie | 0.50 |
alice | cookie | 0.80 |
alice | tea | 1.50 |
alice | tea | 0.30 |
Output records will contain all group fields in addition to a field for each aggregate:
user | item | totalSpent | numPurchased | avgItemPrice |
---|---|---|---|---|
bob | donut | 3.25 | 4 | 0.933 |
bob | coffee | 5.90 | 3 | 2.775 |
alice | tea | 3.79 | 3 | 1.745 |
alice | cookie | 1.30 | 2 | 0.65 |
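The numbers in the table above can be reproduced with a short Python sketch (illustrative only; the plugin itself runs in Spark or BigQuery, not like this):

```python
from collections import defaultdict

# The input purchases from the example table: (user, item, price).
purchases = [
    ("bob", "donut", 0.80), ("bob", "coffee", 2.05), ("bob", "coffee", 0.35),
    ("bob", "donut", 1.50), ("bob", "donut", 0.50), ("bob", "donut", 0.45),
    ("bob", "coffee", 3.50), ("alice", "tea", 1.99), ("alice", "cookie", 0.50),
    ("alice", "cookie", 0.80), ("alice", "tea", 1.50), ("alice", "tea", 0.30),
]

# Group prices by (user, item).
groups = defaultdict(list)
for user, item, price in purchases:
    groups[(user, item)].append(price)

# Compute the three aggregates per group; avgItemPrice only considers
# prices satisfying the avgIf condition price >= 0.50.
output = {}
for key, prices in groups.items():
    eligible = [p for p in prices if p >= 0.50]
    output[key] = {
        "totalSpent": round(sum(prices), 2),
        "numPurchased": len(prices),
        "avgItemPrice": round(sum(eligible) / len(eligible), 3),
    }
```

For instance, bob's donuts cost 0.80 + 1.50 + 0.50 + 0.45 = 3.25 in total over 4 purchases, and the conditional average over the three prices at or above 0.50 is 2.80 / 3 ≈ 0.933, matching the first output row.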
Created in 2020 by Google Inc.