Extracting fields
You might have important data stored in one column, separated by a delimiter. To perform transformations on this data, you need to split the values into separate columns. In Wrangler, you can extract values from a column and create a new column for the extracted values. You can extract values based on:
Patterns
Delimiters
Positions
Using patterns
You can extract values from fields in String columns by using the following common patterns:
Credit cards
Date
Date time
Email
URLs from HTML anchors
IPv4 address
ISBN codes
Mac address
N digits number
SSN
Start/End pattern
Time
To extract values based on a pattern, follow these steps:
From the drop-down menu, click Extract fields > Using patterns.
Select a pattern to extract the fields.
Optionally, click Show pattern to see the regex for that pattern.
Wrangler extracts the fields based on the selected pattern and adds the extract-regex-groups
directive to the recipe. When you run the data pipeline, the transformation will be applied to all values in the column.
For example, you have a column of emails that are preceded by a number. You only want to include the email in the dataset. You can use the email extract pattern to create a new column that only includes the email addresses.
Emails |
---|
1 joe@gmail.com |
2 elaine@gmail.com |
3 sue@gmail.com |
You can extract the values into a new column. Wrangler retains the original column:
Emails | Emails_1_1 |
---|---|
1 joe@gmail.com | joe@gmail.com |
2 elaine@gmail.com | elaine@gmail.com |
3 sue@gmail.com | sue@gmail.com |
Using delimiters
You can split a column into two or more columns based on a delimiter. You can extract values based on the following delimiters:
Comma
Tab
Pipe
Whitespace
Custom separator
Note: If you select Custom separator, regular expression (regex) to define the separator. Regular expressions allows the use of complex search patterns when splitting the data in the column. It supports standard Java regular expression constructs.
If the value does not have the specified separator, then no additional rows are generated.
To extract values based on a delimiter:
From the drop-down menu, click Extract fields > Using delimiters.
Select the delimiter to use to extract the fields.
Wrangler extracts the fields based on the selected delimiter and adds the split-to-columns
directive to the recipe. When you run the data pipeline, the transformation will be applied to all values in the column.
For example, if you have a column of distinct strings separated by commas, such as:
ID | Fruit |
---|---|
1 | orange,lime,lemon |
2 | banana,mango,guava |
3 | avocado,peach,plum |
You can split the values into three separate columns. Wrangler retains the original column.
ID | Fruit | Fruit_1 | Fruit_2 | Fruit_3 |
---|---|---|---|---|
1 | orange,lime,lemon | orange | lime | lemon |
2 | banana,mango,guava | banana | mango | guava |
3 | avocado,peach,plum | avocado | peach | plum |
Using positions
You can extract part of a string based on the position in the string.
To extract fields based on positions from a column:
From the drop-down menu, click Extract fields > Using positions.
The column appears with a blue background, signifying you are in Extract mode.Highlight the portion of one value you want to extract for all values in the column.
The Extract dialog box appears showing you the positions you selected to extract:Â
Enter a name for the new column.
Wrangler extracts the fields based on the selected pattern and adds the
cut-character
directive to the recipe. When you run the data pipeline, the transformation will be applied to all values in the column.
Â
Created in 2020 by Google Inc.