Group By Analytics
Plugin version: 2.11.0
Groups by one or more fields, then performs one or more aggregate functions on each group. Supported aggregate functions are Average, Count, Count Distinct, First, Last, Max, Min, Sum, Collect List, Collect Set, Standard Deviation, Variance, and Longest String.
The transform is used when you want to calculate some basic aggregations in your data similar to what you could do with a group-by query in SQL.
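As a rough analogy, the group-by-then-aggregate behavior can be sketched in plain Python (illustrative only; the records and field names below are made up, and this is not how the plugin is implemented):

```python
from collections import defaultdict

# Example input records: each dict is one record, as in a GROUP BY query
# such as: SELECT user, item, SUM(price), COUNT(*) ... GROUP BY user, item
records = [
    {"user": "bob", "item": "donut", "price": 0.80},
    {"user": "bob", "item": "donut", "price": 1.50},
    {"user": "alice", "item": "tea", "price": 1.99},
]

# Step 1: group records by the group-by fields (user, item).
groups = defaultdict(list)
for r in records:
    groups[(r["user"], r["item"])].append(r["price"])

# Step 2: apply aggregate functions (here Sum and Count) to each group.
results = {
    key: {"totalSpent": sum(prices), "numPurchased": len(prices)}
    for key, prices in groups.items()
}
```

Each output record contains the group-by field values plus one field per aggregate.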
BigQuery ELT Transformation Pushdown
Group By stages are now eligible to execute in BigQuery when BigQuery ELT Transformation Pushdown is enabled in a pipeline. Group By stages will be executed in BigQuery when a preceding stage has already been executed in BigQuery (such as a Join operation or another aggregation stage). The following aggregation operations are supported in BigQuery: Average, Collect List (null values are removed from the output array), Collect Set (null values are removed from the output array), Concat, Concat Distinct, Count, Count Distinct, Count Nulls, Logical And, Logical Or, Max, Min, Standard Deviation, Sum, and Variance. If a Group By stage contains any aggregation operation that is not supported in BigQuery, the stage will be executed in Spark.
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Group by fields | Yes | Required. Comma-separated list of fields to group by. Records with the same value for all of these fields are grouped together. Records output by this aggregator will contain all of the group-by fields and the aggregate fields. |
Aggregates | Yes | Required. Aggregates to compute on each group of records, using the supported aggregate functions listed above. Each aggregate must specify the field it is applied to, as well as the name of its output field. Aggregates are specified using the syntax name:function(field), with multiple aggregates separated by commas. |
Number of Partitions | Yes | Optional. Number of partitions to use when grouping fields. If not specified, the execution framework will decide on the number to use. |
Output Schema | Yes | Required. The output schema for the data. |
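A minimal sketch of how the name:function(field) syntax breaks down into its parts (this is an illustration, not the plugin's actual parser, and it does not handle conditional forms such as the :condition(...) suffix shown in the example below):

```python
import re

def parse_aggregate(spec: str) -> dict:
    """Split one aggregate spec like 'totalSpent:sum(price)' into parts."""
    m = re.fullmatch(r"(\w+):(\w+)\(([^)]*)\)", spec)
    if m is None:
        raise ValueError(f"invalid aggregate spec: {spec}")
    name, func, field = m.groups()
    return {"name": name, "function": func, "field": field}

# Multiple aggregates are comma-separated in the property value.
specs = "totalSpent:sum(price),numPurchased:count(*)"
parsed = [parse_aggregate(s) for s in specs.split(",")]
```

Here totalSpent and numPurchased become output field names, sum and count are the aggregate functions, and price (or * for count) is the input field.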
Example
This example groups records by their user and item fields. It then calculates three aggregates for each group: the first is a sum on price, the second counts the number of records in the group, and the third calculates the average price of each item, ignoring prices lower than 0.50.
{
    "name": "GroupByAggregate",
    "type": "batchaggregator",
    "properties": {
        "groupByFields": "user,item",
        "aggregates": "totalSpent:sum(price),numPurchased:count(*),avgItemPrice:avgIf(price):condition(price>=0.50)"
    }
}
For example, suppose the aggregator receives input records where each record represents a purchase:
user | item | price |
---|---|---|
bob | donut | 0.80 |
bob | coffee | 2.05 |
bob | coffee | 0.35 |
bob | donut | 1.50 |
bob | donut | 0.50 |
bob | donut | 0.45 |
bob | coffee | 3.50 |
alice | tea | 1.99 |
alice | cookie | 0.50 |
alice | cookie | 0.80 |
alice | tea | 1.50 |
alice | tea | 0.30 |
Output records will contain all group fields in addition to a field for each aggregate:
user | item | totalSpent | numPurchased | avgItemPrice |
---|---|---|---|---|
bob | donut | 3.25 | 4 | 0.933 |
bob | coffee | 5.90 | 3 | 2.775 |
alice | tea | 3.79 | 3 | 1.745 |
alice | cookie | 1.30 | 2 | 0.65 |
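The numbers in the table above can be reproduced with a short Python sketch (illustrative only; the plugin itself runs in Spark or BigQuery, not like this):

```python
from collections import defaultdict

# The input purchases from the example table: (user, item, price).
purchases = [
    ("bob", "donut", 0.80), ("bob", "coffee", 2.05), ("bob", "coffee", 0.35),
    ("bob", "donut", 1.50), ("bob", "donut", 0.50), ("bob", "donut", 0.45),
    ("bob", "coffee", 3.50), ("alice", "tea", 1.99), ("alice", "cookie", 0.50),
    ("alice", "cookie", 0.80), ("alice", "tea", 1.50), ("alice", "tea", 0.30),
]

# Group prices by (user, item).
groups = defaultdict(list)
for user, item, price in purchases:
    groups[(user, item)].append(price)

# Compute the three aggregates per group; avgItemPrice only considers
# prices satisfying the avgIf condition price >= 0.50.
output = {}
for key, prices in groups.items():
    eligible = [p for p in prices if p >= 0.50]
    output[key] = {
        "totalSpent": round(sum(prices), 2),
        "numPurchased": len(prices),
        "avgItemPrice": round(sum(eligible) / len(eligible), 3),
    }
```

For instance, bob's donuts cost 0.80 + 1.50 + 0.50 + 0.45 = 3.25 in total over 4 purchases, and the conditional average over the three prices at or above 0.50 is 2.80 / 3 ≈ 0.933, matching the first output row.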
Created in 2020 by Google Inc.