Extracting fields

You might have important data stored in one column, separated by a delimiter. To perform transformations on this data, you need to split the values into separate columns. In Wrangler, you can extract values from a column and create a new column for the extracted values. You can extract values based on:

  • Patterns

  • Delimiters

  • Positions

Using patterns

You can extract values from fields in String columns by using the following common patterns:

  • Credit cards

  • Date

  • Date time

  • Email

  • URLs from HTML anchors

  • IPv4 address

  • ISBN codes

  • Mac address

  • N digits number

  • SSN

  • Start/End pattern

  • Time

To extract values based on a pattern, follow these steps:

  1. From the drop-down menu, click Extract fields > Using patterns.

  2. Select a pattern to extract the fields.

  3. Optionally, click Show pattern to see the regex for that pattern.

Wrangler extracts the fields based on the selected pattern and adds the extract-regex-groups directive to the recipe. When you run the data pipeline, the transformation will be applied to all values in the column.

For example, you have a column of emails that are preceded by a number. You only want to include the email in the dataset. You can use the email extract pattern to create a new column that only includes the email addresses.

Emails

Emails

1 joe@gmail.com

2 elaine@gmail.com

3 sue@gmail.com

You can extract the values into a new column. Wrangler retains the original column:

Emails

Emails_1_1

Emails

Emails_1_1

1 joe@gmail.com

joe@gmail.com

2 elaine@gmail.com

elaine@gmail.com

3 sue@gmail.com

sue@gmail.com

Using delimiters

You can split a column into two or more columns based on a delimiter. You can extract values based on the following delimiters:

  • Comma

  • Tab

  • Pipe

  • Whitespace

  • Custom separator

Note: If you select Custom separator, regular expression (regex) to define the separator. Regular expressions allows the use of complex search patterns when splitting the data in the column. It supports standard Java regular expression constructs.

If the value does not have the specified separator, then no additional rows are generated.

To extract values based on a delimiter:

  1. From the drop-down menu, click Extract fields > Using delimiters.

  2. Select the delimiter to use to extract the fields.

Wrangler extracts the fields based on the selected delimiter and adds the split-to-columns directive to the recipe. When you run the data pipeline, the transformation will be applied to all values in the column.

For example, if you have a column of distinct strings separated by commas, such as:

ID

Fruit

ID

Fruit

1

orange,lime,lemon

2

banana,mango,guava

3

avocado,peach,plum

You can split the values into three separate columns. Wrangler retains the original column.

ID

Fruit

Fruit_1

Fruit_2

Fruit_3

ID

Fruit

Fruit_1

Fruit_2

Fruit_3

1

orange,lime,lemon

orange

lime

lemon

2

banana,mango,guava

banana

mango

guava

3

avocado,peach,plum

avocado

peach

plum

Using positions

You can extract part of a string based on the position in the string.

To extract fields based on positions from a column:

  1. From the drop-down menu, click Extract fields > Using positions.
    The column appears with a blue background, signifying you are in Extract mode.

  2. Highlight the portion of one value you want to extract for all values in the column.
    The Extract dialog box appears showing you the positions you selected to extract:

     

  3. Enter a name for the new column.

    Wrangler extracts the fields based on the selected pattern and adds the cut-character directive to the recipe. When you run the data pipeline, the transformation will be applied to all values in the column.

 

Created in 2020 by Google Inc.