This page describes various options available in Wrangler for dealing with complex scenarios when processing CSV, TSV, or other delimited files.
Before you begin
Some of the instructions here require power-mode in Wrangler, which exposes the full capabilities of Wrangler directives. Power-mode is available at the bottom of the Wrangler screen.
Note: Avoid using automatic header detection with the parse-as-csv directive; instead, pass false as the header argument, for example parse-as-csv :col '\t' false. On large files that are distributed across multiple partitions, the header line (the first line of the CSV file) is not present in every partition, which either results in a failure or in lost records.
Non-Printable and CTRL-M Characters
Sometimes a file looks fine, but it can contain non-printable ASCII characters that usually don't belong in CSV files. These can be hard to track down. To remove them, use the find-and-replace directive:
find-and-replace :col 's/[\000-\007\013-\037]//g'
To remove CTRL-M (carriage return) from the end of each line, again use the find-and-replace directive:
find-and-replace :col 's/\r$//g'
Make sure you apply this directive before applying parse-as-csv.
Removing the Header Row
It is not recommended to use the auto-detect header option, because it is error prone. That said, how do you go about removing the header from the data rows that need to be processed, and, more importantly, how do you set the column names? Let's take the following CSV file as input:
school_id, student_id, last_name, first_name
14J456,33445566,Potter,Harry
14J456,44333433,Weasley,Ron
14J456,65765566,Granger,Hermione
14J456,13233121,Diggory,Cedric
14J456,98786868,Weasley,George
14J456,78977876,Weasley,Fred
To parse the CSV file without the header, use the following directive:
parse-as-csv :body ',' false
The false at the end specifies that automatic header detection is disabled. The columns will be named body_1, body_2, ..., body_n. The column names are based on the input column name, in this case body.
Note: The parsed data rows also include the header row, which needs to be removed before processing.
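For example, applying the directive above to the first data row of the sample file produces something like the following (a sketch of the intermediate state): the original body column is retained alongside the parsed columns, which is why the full recipe later includes drop body.
body    14J456,33445566,Potter,Harry
body_1  14J456
body_2  33445566
body_3  Potter
body_4  Harry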
Now that we have all rows of the CSV file parsed, we can use the filter-rows-on directive to filter out and remove the header row. For this, you have to identify a column name in the header that will never occur in the data. In this example, we use school_id:
filter-rows-on regex-match body_1 ^school_id$
The directive specifies that we are filtering on a regex-match against the column body_1 (the name of the column that was automatically generated in the previous step).
Once this directive is applied, the header row is removed. Now you can set the actual names of the columns using the set columns directive as follows:
set columns school_id,student_id,last_name,first_name
Note: You don't need to rename all the columns. Rename only the columns you are interested in. The column names are applied in order to the columns in the data rows; if the data contains more columns than the names you set, the extra columns are left unchanged.
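For example (based on the sample data above), the following would rename only the first two parsed columns and leave body_3 and body_4 with their generated names:
set columns school_id,student_id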
Here’s the full recipe:
parse-as-csv :body ',' false
filter-rows-on regex-match body_1 ^school_id$
drop body
set columns school_id,student_id,last_name,first_name
Incorrect Width
Filtering out records that do not have a specified number of columns can be achieved with the send-to-error or filter-row directives. For example:
send-to-error exp:{ this.width() < 4 }
sends all records that have fewer than 4 columns (0 to 3) to error.
filter-row exp:{ this.width() < 4 } true
removes records that have fewer than 4 columns from the main dataset.
To find the rows that have issues, use filter-row with a minor change:
filter-row exp:{ this.width() < 4 } false
This shows you only the rows that are problematic.
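For the school data used above, which should have exactly four columns per record after parsing, a slightly stricter variant of the same check could route every malformed record to error (a sketch; != also catches rows with extra fields, not just missing ones):
send-to-error exp:{ this.width() != 4 }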
Trailing Commas
Saving an Excel file as a CSV file can create extra commas at the end of each row. Trailing commas can result when columns are deleted or column headers are removed.
school_id, student_id, last_name, first_name,,,
14J456,33445566,Potter,Harry,,,
14J456,44333433,Weasley,Ron,,,
14J456,65765566,Granger,Hermione,,,
14J456,13233121,Diggory,Cedric,,,
14J456,98786868,Weasley,George,,,
14J456,78977876,Weasley,Fred,,,
Take the following steps to resolve the issue using the Wrangler interface:
Open the file in Wrangler.
Parse CSV.
Use the keep directive to keep only the columns that you need.
In Wrangler, type in the directive:
keep school_id,student_id,last_name,first_name
Here’s the full recipe:
parse-as-csv :body ',' false
filter-rows-on regex-match body_1 ^school_id$
drop body
set columns school_id,student_id,last_name,first_name
keep school_id,student_id,last_name,first_name
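As an alternative sketch, assuming the three trailing empty fields are parsed into columns named body_5, body_6, and body_7 (only the first four columns are renamed by set columns), you could drop the leftover columns by name instead of listing the columns to keep:
drop body_5,body_6,body_7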
Extra Comma in Field Values
Note: The this.width() function was introduced in Wrangler version 4.2.
Your data might sometimes include a comma within a field value (for example, if a student’s last name is “Potter, Jr.”). The comma-separated values file will treat the comma as a field separator unless the entire field value is enclosed in double quotation marks.
school_id, student_id, last_name, first_name
14J456,33445566,Potter,Jr,Harry
14J456,44333433,Weasley,Ron
The four commas in the first data row split it into five fields instead of four. So you have to merge fields 3 and 4, but only when the record has five fields. The following recipe fixes the extra column issue:
parse-as-csv :body ',' false
filter-rows-on regex-match body_1 ^school_id$
drop body
set columns school_id,student_id,last_name,first_name,body_5
set-column :last_name exp:{ this.width() == 5 ? (last_name + ',' + first_name) : last_name }
set-column :first_name exp:{ this.width() == 5 ? body_5 : first_name }
drop body_5
Line 1 parses the record as CSV with no header.
Line 2 filters the header row out of the data rows.
Line 3 drops the body column, since we don't keep the original data.
Line 4 sets the column names, including a temporary body_5 column for the possible extra field.
Line 5 merges last_name and the suffix that was wrongly parsed into first_name (for example, "Jr") into a single last_name column when the record has five columns.
Line 6 copies the value from body_5 into first_name when the record has five columns.
Line 7 drops the temporary body_5 column.
The result of the recipe is as follows:
school_id, student_id, last_name, first_name
14J456,33445566,"Potter,Jr",Harry
14J456,44333433,Weasley,Ron
Leading Zeros
Numerical values (in fields like ‘school_id’, ‘student_id’, and ‘teacher_id’) often begin with “0”, but Excel removes these leading zeros, which changes the value listed for that field. For example, student_id “009844” becomes “9844”.
In Wrangler, when the file is parsed, leading zeros are maintained until you either remove them explicitly or convert the column to an int, float, or double type.
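If you do want to remove the leading zeros, a minimal sketch using the student_id column from the sample data might look like the following (adjust the column name to your data, and check directive availability in your Wrangler version): set-type converts the column to a number, dropping the zeros, while find-and-replace strips them but keeps the column as a string.
set-type :student_id int
find-and-replace :student_id 's/^0+//g'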
UTF-8 Encoded Files
At times, you might receive CSV files that are UTF-8 encoded. When a file is exported from Excel, you can configure it to be exported as UTF-8. To handle such files correctly, Wrangler provides the set-charset directive to set the right encoding. This directive has to be applied before any other operations, as follows:
set-charset :body 'utf-8'
Here’s the full recipe:
set-charset :body 'utf-8'
parse-as-csv :body ',' false
filter-rows-on regex-match body_1 ^school_id$
drop body
set columns school_id,student_id,last_name,first_name