Group By Analytics

Plugin version: 2.11.0

Groups by one or more fields, then performs one or more aggregate functions on each group. Supports Average, Count, First, Last, Max, Min,Sum,Collect List,Collect Set, Standard Deviation, Variance, Count Distinct, Longest String,collectSet as aggregate functions.

The transform is used when you want to calculate some basic aggregations in your data similar to what you could do with a group-by query in SQL.

BigQuery ELT Transformation Pushdown

Group By stages are now eligible to execute in BigQuery when BigQuery ELT Transformation Pushdown is enabled in a pipeline. Group By stages will be executed in BigQuery when a preceding stage has already been executed in BigQuery (such as a Join operation or another aggregation stage). The following aggregation operations are supported in BigQuery: Average, Collect List (Null values are removed from the output array), Collect Set (Null values are removed from the output array), Concat, Concat Distinct, Count, Count Distinct, Count Nulls, Logical And, Logical Or, Max, Min, Standard Deviation, Sum, and Variance. If a Group By stage contains any aggregation operation that is not supported in BigQuery, the stage will be executed in Spark.

Configuration

Property

Macro Enabled?

Description

Property

Macro Enabled?

Description

Group by fields

Yes

Required. Comma-separated list of fields to group by. Records with the same value for all these fields will be grouped together. Records output by this aggregator will contain all the group by fields and aggregate fields. For example, if grouping by the user field and calculating an aggregate numActions:count(*), output records will have a user field and a numActions field.

Aggregates

Yes

Required. Aggregates to compute on each group of records. Supported aggregate functions are avg, count, count(*), first, last, max, min,sum,collectList, collectSet, countDistinct, longestString, shortestString, countNulls, concat, variance concatDistinct, stdDev,logicalAnd, logicalOr, sumOfSquares, correctedSumOfSquares, avgIf, countIf, maxIf, minIf, sumIf, collectListIf, collectSetIf, countDistinctIf, longestStringIf, shortestStringIf, concatIf, varianceIf, anyIf, concatDistinctIf, stdDevIf logicalAndIf, logicalOrIf, sumOfSquaresIf, correctedSumOfSquaresIf.

A function must specify the field it should be applied on, as well as the name it should be called. Aggregates are specified using the syntax name:function(field)[, other aggregates]. For example, avgPrice:avg(price),cheapest:min(price),countPricesHigherThan:countIf(price):condition(price>500) will calculate three aggregates. The first will create a field called avgPrice that is the average of all price fields in the group. The second will create a field called cheapest that contains the minimum price fields in the group. The second will create a field called cheapest that contains the minimum price field in the group. The count function differs from count() in that it contains non-null values of a specific field, while count() will count all records regardless of value.

Number of Partitions

Yes

Optional. Number of partitions to use when grouping fields. If not specified, the execution framework will decide on the number to use.

Output Schema

Yes

Required. The output schema for the data.

Example

This example groups records by their user and item fields. It then calculates three aggregates for each group. The first is a sum on price, the second counts the number of records in the group and third one calculates the average price of each item not taking into consideration prices lower than 0.50.

{ "name": "GroupByAggregate", "type": "batchaggregator", "properties": { "groupByFields": "user,item", "aggregates": "totalSpent:sum(price),numPurchased:count(*),avgItemPrice:avgIf(price):condition(price>=0.50)" } }

For example, suppose the aggregator receives input records where each record represents a purchase:

user

item

price

user

item

price

bob

donut

0.80

bob

coffee

2.05

bob

coffee

0.35

bob

donut

1.50

bob

donut

0.50

bob

donut

0.45

bob

coffee

3.50

alice

tea

1.99

alice

cookie

0.50

alice

cookie

0.80

alice

tea

1.50

alice

tea

0.30

Output records will contain all group fields in addition to a field for each aggregate:

user

item

totalSpent

numPurchased

avgItemPrice

user

item

totalSpent

numPurchased

avgItemPrice

bob

donut

3.25

4

0.933

bob

coffee

5.90

3

2.775

alice

tea

3.79

3

1.745

alice

cookie

1.30

2

0.65

Created in 2020 by Google Inc.