...

Each entry below lists the property name, whether the property is macro enabled, the release in which it was introduced (where applicable), and a description.

Input Schema
Macro Enabled: No

Derived from the output schema of the previous stages. Cannot be changed.

Fields
Macro Enabled: Yes

Required. List of fields from each input that should be included in the output. Output field names must be unique. If the same field name exists in more than one input, each field must be aliased (renamed) to a unique output name.
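
As an illustration of the aliasing rule, here is a minimal PySpark sketch (not the plugin's own API; all DataFrame and column names are invented for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fields-alias-sketch").getOrCreate()

# Two hypothetical inputs that both expose an "id" field.
customers = spark.createDataFrame([(1, "Alice")], ["id", "name"])
orders = spark.createDataFrame([(10, 1)], ["id", "customer_id"])

joined = customers.join(orders, customers["id"] == orders["customer_id"])

# "id" exists in both inputs, so each occurrence must be aliased
# (renamed) to a unique output name, as the Fields property requires.
output = joined.select(
    customers["id"].alias("customer_id_out"),
    orders["id"].alias("order_id_out"),
    customers["name"],
)
output.show()
```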

Join Type
Macro Enabled: Yes

Optional. Type of join to perform. A join between two required inputs is an inner join. A join between a required input and an optional input is a left outer join. A join between two optional inputs is an outer join.

A join of more than two inputs is logically equivalent to performing inner joins over all the required inputs, followed by left outer joins on the optional inputs.
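
The following PySpark sketch illustrates that equivalence for three hypothetical inputs, where a and b are required and c is optional (all names and keys are invented for the example):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-type-sketch").getOrCreate()

# Hypothetical inputs: "a" and "b" are required, "c" is optional.
a = spark.createDataFrame([(1, "a1")], ["k", "a_val"])
b = spark.createDataFrame([(1, "b1")], ["k", "b_val"])
c = spark.createDataFrame([(2, "c1")], ["k", "c_val"])

# Inner joins over all required inputs first...
required = a.join(b, on="k", how="inner")

# ...followed by left outer joins on each optional input.
result = required.join(c, on="k", how="left")
result.show()
```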

Join Condition Type
Macro Enabled: Yes
Release Introduced: 6.4.0 / 2.6.0

Optional. Type of join condition to use. A condition can either be 'Basic' or 'Advanced'. Advanced join conditions cannot be used in streaming pipelines or with the MapReduce engine. Advanced join conditions can only be used when joining two inputs.

Join Condition
Macro Enabled: Yes

Required. When the condition type is 'Basic', the condition specifies the list of keys on which to perform the join, and the join is performed based on equality of those keys. When the condition type is 'Advanced', the condition can be any SQL expression supported by the engine. Note that advanced join conditions can be many times more expensive than basic joins performed on equality. Advanced outer joins must load one of the inputs into memory. Advanced inner joins do not need to load an input into memory; however, without an in-memory input, the engine will be forced to calculate a very expensive cartesian product.
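
As a hedged illustration of the difference between the two condition types (plain Spark, not the plugin's own configuration syntax; all table and column names are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("join-condition-sketch").getOrCreate()

users = spark.createDataFrame([(1, "2020-01-01")], ["user_id", "signup_date"])
events = spark.createDataFrame([(1, "2020-06-01")], ["user_id", "event_date"])

# Basic: equality on the listed join keys.
basic = users.join(events, on="user_id", how="inner")

# Advanced: an arbitrary SQL expression as the join condition.
# Non-equality predicates like the date comparison below are what can
# make advanced joins far more expensive than equality joins.
u = users.alias("u")
e = events.alias("e")
advanced = u.join(
    e,
    expr("u.user_id = e.user_id AND e.event_date > u.signup_date"),
    "inner",
)
advanced.show()
```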

Get Schema
Macro Enabled: No

Optional. If you drop fields or add aliases to field names, Get Schema refreshes the output schema. Also used to validate the join condition.

Inputs to Load in Memory
Macro Enabled: Yes
Release Introduced: 6.2.1

Optional. Spark only. Hint to the underlying execution engine that the specified input data should be loaded into memory to perform an in-memory join. This is ignored by the MapReduce engine and passed on to the Spark engine. An in-memory join performs well when one side of the join is small (for example, under 1 GB). Be sure to set Spark executor and driver memory to a value large enough to load all of these datasets into memory. This is most commonly used when a large input is being joined to a small input, and it will lead to much better performance in such scenarios. A general rule of thumb is to set executor and driver memory to five times the dataset size.
For more information, see Inputs to Load in Memory below.
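
In plain Spark terms, this kind of in-memory join corresponds to a broadcast join hint. A minimal sketch, assuming a small lookup DataFrame (names are invented for the example):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

large = spark.createDataFrame([(1, 100.0)], ["country_id", "amount"])
small = spark.createDataFrame([(1, "US")], ["country_id", "country_name"])

# Hint that "small" should be loaded into memory on every executor,
# avoiding a shuffle of the large input.
result = large.join(broadcast(small), on="country_id", how="inner")
result.show()
```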

Join on Null Keys
Macro Enabled: Yes
Release Introduced: 6.2.1

Optional. For more information, see Join on Null Keys below.
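
For context, and assuming from the property name that this controls whether records with null join keys match each other, the distinction maps to null-safe equality in plain Spark: a normal equality join drops rows whose keys are null, while a null-safe comparison treats two nulls as equal. A small sketch (all names invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("null-key-join-sketch").getOrCreate()

left = spark.createDataFrame([(None, "l1")], ["k", "left_val"])
right = spark.createDataFrame([(None, "r1")], ["k", "right_val"])

# Standard equality: null keys never match, so this join yields no rows.
plain = left.join(right, left["k"] == right["k"])

# Null-safe equality: two null keys compare as equal and are joined.
null_safe = left.join(right, left["k"].eqNullSafe(right["k"]))
null_safe.show()
```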

Number of Partitions
Macro Enabled: Yes
Release Introduced: 6.2.1

Optional. Number of partitions to use when performing the join. This value is ignored if an input is loaded into memory or if an advanced join is being performed. When an input is loaded into memory, the number of partitions will be equal to the number of partitions used for the input that is not loaded into memory. If no value is given, or if an advanced inner join is being performed, the number of partitions will be determined by the value of 'spark.sql.shuffle.partitions' in the engine config, which defaults to 200. For more information, see Number of Partitions below.
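
The fallback mentioned above is Spark's standard shuffle-partition setting, which can be inspected or overridden as in this sketch:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitions-sketch").getOrCreate()

# The number of shuffle partitions used for joins (200 unless overridden).
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Overriding it, analogous to giving the plugin an explicit
# Number of Partitions value.
spark.conf.set("spark.sql.shuffle.partitions", "400")
```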

Skewed Input Stage
Macro Enabled: Yes
Release Introduced: 6.7.0 / 2.9.0

Optional. Name of the skewed input stage. The skewed input stage is the one that contains many rows that join to the same row in the non-skewed stage. For example, if stage A has 1,000,000 rows that join on the same row in stage B, then stage A is the skewed input stage.

For more information about data skew, see the Data Skew section of this documentation.

Distribution Enabled
Macro Enabled: Yes
Release Introduced: 6.2.2

Optional. Enabling distribution will increase the level of parallelism when joining skewed data. A skewed join happens when a significant percentage of input records have the same key. Distribution is possible when the following conditions are met:
1. There are exactly two input stages.
2. Inputs to Load in Memory (also known as Broadcast in Spark) is not enabled for either stage.
3. The skewed input stage is marked as required.

Distribution requires two parameters:

  1. Distribution Size: This controls the size of the salt that will be generated for distribution. The Number of Partitions property should be greater than or equal to this number for optimal results. A larger value will lead to more parallelism, but it will also grow the size of the non-skewed dataset by this factor. See the sketch after this list.

  2. Skewed Input Stage: Listed above.

For more information about distribution and data skew, see the Data Skew section of this documentation.
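
As a hedged sketch of the salting idea behind distribution (an illustration of the general technique, not the plugin's internal implementation; all names and the salt size are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, col, concat_ws, explode, floor, lit, rand

spark = SparkSession.builder.appName("distribution-sketch").getOrCreate()

SALT_SIZE = 4  # plays the role of the Distribution Size property

# Hypothetical inputs: many rows in "skewed" share the join key 1.
skewed = spark.createDataFrame([(1, "s")] * 8, ["k", "skewed_val"])
other = spark.createDataFrame([(1, "n")], ["k", "other_val"])

# Skewed side: append a random salt in [0, SALT_SIZE) to the join key.
salted_skewed = skewed.withColumn(
    "salted_k",
    concat_ws("_", col("k").cast("string"),
              floor(rand() * SALT_SIZE).cast("string")),
)

# Non-skewed side: replicate every row once per salt value, which grows
# this dataset by a factor of SALT_SIZE (the cost noted above).
salts = array(*[lit(i) for i in range(SALT_SIZE)])
replicated = other.withColumn("salt", explode(salts)).withColumn(
    "salted_k",
    concat_ws("_", col("k").cast("string"), col("salt").cast("string")),
)

# Joining on the salted key spreads the hot key across up to SALT_SIZE
# partitions instead of concentrating it in one.
result = salted_skewed.join(replicated, on="salted_k", how="inner")
result.show()
```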

Distribution Size
Macro Enabled: Yes
Release Introduced: 6.2.2

Optional. This controls the size of the composite key (salt) that will be generated for distribution. The Number of Partitions property should be greater than or equal to this number for optimal results. A larger value will lead to more parallelism, but it will also grow the size of the non-skewed dataset by this factor.

Distribution Skewed Input Stage
Macro Enabled: Yes
Release Introduced: 6.2.2

Optional. Name of the skewed input stage. The skewed input stage is the one that contains many rows that join to the same row in the non-skewed stage. For example, if stage A has 10 rows that join on the same row in stage B, then stage A is the skewed input stage. For more information about distribution and data skew, see the Data Skew section of this documentation.

Input with Larger Data Skew
Macro Enabled: Yes
Release Introduced: 6.7.0 / 2.9.0

Optional. Defines the most skewed stage for a join operation. You can select the most skewed input stage when executing join operations in BigQuery ELT Transformation Pushdown. This might prevent a RESOURCE EXCEEDED error when executing join operations with skewed datasets.

Output Schema
Macro Enabled: Yes

Required. The output schema should not be manually edited. If you make changes to the input fields, such as adding an alias to a duplicate field name, always use the Get Schema button to refresh the output schema. The output schema will have the same order of fields as the input schemas. Field names in the output schema must be unique.

...