  • As of release 6.2.x, parsing CSV with header extraction from the file does not work on large files or on multiple smaller files.

  • Avoid using automatic header detection with the parse-as-csv directive (for example, parse-as-csv :col '\t' true). When a large file is distributed across multiple partitions, the header line, which is the first line of the CSV, is present only in the first partition. This results in either a failure or lost records.

  • If you must use the parse-as-csv directive with header detection, make sure each file is smaller than 128 MB (the smallest data block size).

  • Recommendation: parse the CSV in Wrangler while skipping the header. This entails the following steps:

    • Add a filter condition to skip header

      • filter-row-if-true offset == 0

    • Drop offset

      • drop offset

    • Parse as CSV without header detection

      • parse-as-csv :COLUMN 'SEPARATOR' false

    • Rename the parsed entities in Wrangler (using set-headers)

      • set-headers :COL1,:COL2,…,:COLN
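
The effect of the steps above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions, not how Wrangler is implemented: the function names, the sample data, and the way lines are split into partitions are all hypothetical, and the offset value stands in for the byte offset that the source attaches to each line.

```python
import csv

# Hypothetical tab-separated input; the first line is the header.
DATA = b"id\tname\n1\tann\n2\tbob\n3\tcy\n4\tdee\n"

def read_records(data):
    """Yield (offset, line) pairs, where offset is the byte position of the line."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

def split_into_partitions(records, count):
    """Split the records into `count` contiguous chunks (a stand-in for data blocks)."""
    records = list(records)
    size = -(-len(records) // count)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def parse_with_header_detection(partition, sep):
    """Treat the first line of every partition as the header (the approach to avoid)."""
    lines = [line for _, line in partition]
    return list(csv.reader(lines[1:], delimiter=sep))

def parse_skipping_header(partition, sep, headers):
    """Mirror the recommended steps: filter offset == 0, drop offset, parse, set headers."""
    body = [line for offset, line in partition if offset != 0]
    return [dict(zip(headers, row)) for row in csv.reader(body, delimiter=sep)]

partitions = split_into_partitions(read_records(DATA), 2)

# Header detection per partition: the second partition loses its first data row.
naive_rows = [row for p in partitions for row in parse_with_header_detection(p, "\t")]

# Offset-based filtering: only the real header (offset 0) is removed.
safe_rows = [row for p in partitions
             for row in parse_skipping_header(p, "\t", ["id", "name"])]

print(len(naive_rows), len(safe_rows))
```

In this toy input, per-partition header detection returns 3 rows because the first line of the second partition is consumed as a header, while the offset-based filter returns all 4 data rows.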