Parsing Files in Wrangler

Starting in CDAP 6.7.0, you can parse a file before loading it into the Wrangler workspace.

When you parse a file before wrangling:

  • Wrangler infers data types and maps each column to the inferred data type, just like file source plugins do in Pipeline Studio. 

  • You can import the schema for file formats, such as JSON, where schema inference is not possible.

  • The recipe doesn’t include the parse directive, which reduces transformation logic during pipeline runs.

  • When you create a pipeline from Wrangler, the source plugin includes all of the same parsing properties and values that you set in Wrangler.

To parse a file before loading it into Wrangler, you must use a file connection (File, GCS, or Amazon S3).

For a list of supported file formats, see File Connection.

To parse a file in Wrangler, follow these steps:

  1. Create a file connection (File, GCS, or Amazon S3). For more information about connections, see Working with Connections in Wrangler.

  2. Click the connection name and locate the file you want to wrangle.

  3. Parse the file. Depending on the file format, set the following options:
    Format: Format of the data to read. The format must be one of ‘avro’, ‘blob’, ‘csv’, ‘delimited’, ‘json’, ‘parquet’, ‘text’, or ‘tsv’. If the format is ‘blob’, the schema must contain a field named ‘body’ of type ‘bytes’. If the format is ‘text’, the schema must contain a field named ‘body’ of type ‘string’.
    Delimiter: Delimiter to use when the format is ‘delimited’.
    Enable Quoted Values: Whether to treat content between quotes as a single value. This option applies only when the format is ‘csv’, ‘tsv’, or ‘delimited’. For example, if this is set to true, the line 1, "a, b, c" produces two fields: the first has the value 1 and the second has the value a, b, c. The quote characters are trimmed. A newline cannot appear within quotes. Quotes must also be properly closed, as in "a, b, c". An unclosed quote, such as "a,b,c, causes an error.
    Use First Row as Header: Whether to use the first line of each file as the column headers. Supported formats are ‘text’, ‘csv’, ‘tsv’, and ‘delimited’.
    File Encoding: File encoding for the source file. Default is UTF-8.

  4. (Optional) To import a schema or override the inferred schema for the file, click Import Schema. You must import the schema for formats where schema inference is not possible, such as JSON and some Avro files. Note: The schema must be in the Avro format.

  5. Click Confirm. The parsed file appears in the Wrangler workspace.
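
To make the Enable Quoted Values behavior from step 3 concrete, the sketch below reproduces it with Python's standard csv module. It is only an approximation for illustration: Wrangler uses its own parser, and the helper name parse_line and its enable_quoted_values parameter are hypothetical, not part of CDAP.

```python
import csv
import io


def parse_line(line: str, enable_quoted_values: bool) -> list:
    """Split one comma-delimited line, optionally honoring double quotes.

    Hypothetical helper that imitates the documented option with Python's
    csv module; it is not the parser that Wrangler itself uses.
    """
    if enable_quoted_values:
        # skipinitialspace lets a quote that follows ", " open a quoted field.
        # strict=True makes an unclosed quote raise csv.Error.
        reader = csv.reader(
            io.StringIO(line),
            delimiter=",",
            quotechar='"',
            skipinitialspace=True,
            strict=True,
        )
    else:
        # QUOTE_NONE treats quote characters as ordinary data.
        reader = csv.reader(io.StringIO(line), delimiter=",", quoting=csv.QUOTE_NONE)
    return next(reader)


line = '1, "a, b, c"'
print(parse_line(line, True))   # ['1', 'a, b, c']
print(parse_line(line, False))  # ['1', ' "a', ' b', ' c"']

try:
    parse_line('"a,b,c', True)
except csv.Error as err:
    print("unclosed quote rejected:", err)
```

With quoting enabled, the quoted text stays together as one field and the quote characters are dropped. With quoting disabled, every comma splits the line.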

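Steps 3 and 4 require an Avro schema with a field named ‘body’ for the ‘text’ and ‘blob’ formats. The sketch below builds a minimal schema of that shape and prints it as JSON, which you could save to a file and import. The function name body_schema and the record name fileRecord are assumptions made for illustration. A real schema may contain additional fields.

```python
import json


def body_schema(fmt: str) -> dict:
    """Build a minimal Avro record schema with the required 'body' field.

    Hypothetical helper: 'text' needs 'body' of type string and 'blob'
    needs 'body' of type bytes, as described in the Format option.
    """
    body_types = {"text": "string", "blob": "bytes"}
    if fmt not in body_types:
        raise ValueError(f"no fixed 'body' field is documented for format {fmt!r}")
    return {
        "type": "record",
        "name": "fileRecord",  # assumed record name, chosen for this example
        "fields": [{"name": "body", "type": body_types[fmt]}],
    }


# Print the schema as JSON text, ready to save to a file.
print(json.dumps(body_schema("text"), indent=2))
```

Calling body_schema("blob") instead yields the same record with ‘body’ typed as bytes.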

Created in 2020 by Google Inc.