This page describes various options available in Wrangler to deal with complex scenarios of processing CSV / TSV or any delimited files.
Before you begin
Some of the instructions here require you to use power-mode in Wrangler. Power-mode in Wrangler exposes the full capabilities of Wrangler directives. Power-mode is available at the bottom of the Wrangler screen.
Non-Printable and CTRL-M Characters
Avoid using automatic header detection with parse-as-csv
directive(parse-as-csv :col ‘\t’ false
). On large files that are distributed across multiple partitions, the header line which is the first line of the CSV file is not present. This will either result in failure or records will be lost.
Sometimes a file looks fine, but it could contain non-printable ASCII characters that usually don’t belong in CSV files. It can be hard to track these down. Use the
find-and-replace
directive.`find-and-replace :col 's/\000-\007\013-\037//g'
.To remove CTRL-M from the end of each line, again use the
find-and-replace
directive.find-and-replace :col 's/\r$//g'
if you want to remove CTRL-M at end of the line. Make sure you apply this directive before applyingparse-as-csv
.
Incorrect Width
Filtering records that do not have a specified number of columns in a record can be achieved with
send-to-error
orfilter-row
. For example,send-to-error exp : { this.width() < 4 }
will send all records that have less than 4 (0..3) to error.filter-row exp: { this.width() < 4 } true
will filter records that are less than 4 from the main dataset.To find rows that have issues, you can use
filter-row
with a minor change as follows:filter-row exp: { this.width() < 4 } false
. This will show you only rows that are problematic.
Trailing Commas
Saving an Excel file as a CSV file can create extra commas at the end of each row. Trailing commas can result when columns are deleted or column headers are removed.
school_id, student_id, last_name, first_name,,, 14J456,33445566,Potter,Harry,,, 14J456,44333433,Weasley,Ron,,, 14J456,65765566,Granger,Hermione,,, 14J456,13233121,Diggory,Cedric,,, 14J456,98786868,Weasley,George,,, 14J456,78977876,Weasley,Fred,,,
Take the following steps to resolve the issue using the Wrangler interface:
Open the file in Wrangler.
Parse CSV.
Use the directive
keep
to keep the columns that you need, for example,keep school_id, student_id, last_name, first_name
.
In Wrangler, type in the directive:
keep school_id, student_id, last_name, first_name
Here’s the full recipe:
parse-as-csv body , false filter-rows-on regex-match body_1 ^school_id$ drop body set columns school_id, student_id, last_name, first_name keep school_id,student_id,last_name,first_name