Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Filtering records that do not have a specified number of columns in a record can be achieved with send-to-error or filter-row. For example, send-to-error exp : { this.width() < 4 } will send all records that have less than 4 (0..3) to error. filter-row exp: { this.width() < 4 } true will filter records that are less than 4 from the main dataset.

  • To find rows that have issues, you can use filter-row with a minor change as follows: filter-row exp: { this.width() < 4 } false. This will show you only rows that are problematic.

Trailing Commas

Saving an Excel file as a CSV file can create extra commas at the end of each row. Trailing commas can result when columns are deleted or column headers are removed.

Code Block
school_id, student_id, last_name, first_name,,,
14J456,33445566,Potter,Harry,,,
14J456,44333433,Weasley,Ron,,,
14J456,65765566,Granger,Hermione,,,
14J456,13233121,Diggory,Cedric,,,
14J456,98786868,Weasley,George,,,
14J456,78977876,Weasley,Fred,,,

Take the following steps to resolve the issue using the Wrangler interface:

  • Open the file in Wrangler.

  • Parse CSV.

  • Use the directive keep to keep the columns that you need, for example, keep school_id, student_id, last_name, first_name.

In Wrangler, type in the directive:

  • keep school_id, student_id, last_name, first_name

Here’s the full recipe:

Code Block
parse-as-csv body , false
filter-rows-on regex-match body_1 ^school_id$
drop body
set columns school_id, student_id, last_name, first_name
keep school_id,student_id,last_name,first_name

Extra Comma in field values

Info

this.width() function was introduced in the 4.2 version of Wrangler.

Your data might sometimes include a comma within a field value (for example, if a student’s last name is “Potter, Jr.”). The comma-separated values file will treat the comma as a field separator unless the entire field value is enclosed in double quotation marks.

Code Block
school_id, student_id, last_name, first_name
14J456,33445566,Potter,Jr,Harry
14J456,44333433,Weasley,Ron

The three commas in the data row (1) separate that into five fields instead of 4. So you would have to merge field (3) and (4) only when the record size is five. Following is the recipe for fixing extra column issues:

Code Block
parse-as-csv :body ',' false
filter-rows-on regex-match body_1 ^school_id$
drop body
set columns school_id, student_id, last_name, first_name, body_5
set-column :last_name exp:{ this.width() == 5 ? (last_name + ',' + first_name) : last_name}
set-column :first_name exp:{ this.width() == 5 ? body_5 : first_name}
drop body_5
  • Line 1 parses the record as CSV with no headers.

  • Line 2 filters the header from the data rows.

  • Line 3 drops the body column since we don’t store the original data.

  • Line 5 If the number of columns is 5, then it merges salute (in wrong place first name) and last name into a single column last name.

  • Line 6 copies the data value from body_5 into first_name when column width is 5.

The result of the recipe is as follows:

Code Block
school_id, student_id, last_name, first_name
14J456,33445566,"Potter,Jr",Harry
14J456,44333433,Weasley,Ron

UTF-8 encoded files

At times, you might receive CSV files that are UTF-8 encoded. When a file is exported from Excel, you can configure it to be exported as UTF-8. In order to better handle such files, Wrangler provides a set-character directive to set the right encoding. This directive has to be applied before any other operations are performed as follows.

  • set-charset :body 'utf-8'

Here’s the full recipe:

Code Block
set-charset :body 'utf-8'
parse-as-csv :body ',' false
filter-rows-on regex-match body_1 ^school_id$
drop body
set columns school_id,student_id,first_name,last_name