Exploding fields
You can split data in one column to separate important information into individual fields.
Delimited text
You might have important data stored in one column, separated by a delimiter. To perform transformations on this data, you need to split the values into separate rows. In Wrangler, you can split fields in a column into separate rows based on a delimiter. You can split values based on the following delimiters:
Comma
Tab
Pipe
Whitespace
Custom separator
Note: If you select Custom separator, use regular expression (regex) to define the separator. Regular expressions allows the use of complex search patterns when splitting the data in the column. It supports standard Java regular expression constructs.
If the value does not have the specified separator, then no additional rows are generated.
To split values based on a delimiter, follow these steps:
From the drop-down menu, click Explode > Delimited text.
Select the delimiter to use to split the fields.
Wrangler splits the fields based on the selected delimiter and adds the split-to-row
directive to the recipe. When you run the data pipeline, the transformation will be applied to all values in the column.
For example, if you have a column of distinct strings separated by commas, such as:
ID | Fruit |
---|---|
1 | orange,lime,lemon |
2 | banana,mango,guava |
3 | avocado,peach,plum |
You can split the values into separate rows. Wrangler deletes the original column, creates a new column with one row for each value. Wrangler makes copies of all values in each of the other columns in the dataset:
ID | Fruit_1 |
---|---|
1 | orange |
1 | lime |
1 | lemon |
2 | banana |
2 | mango |
2 | guava |
3 | avocado |
3 | peach |
3 | plum |
Arrays (by flattening)
The FLATTEN directive separates the elements in a repeated field into individual records.
The FLATTEN directive is useful for the flexible exploration of repeated data.
To maintain the association between each flattened value and the other fields in the record, the FLATTEN directive copies all of the other columns into each new record.
An array is a data structure that contains a list of values or elements
[“element1”, “element2”, “element3”, “elementN”]
Created in 2020 by Google Inc.