Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

...

  • Sometimes a file looks fine, but it could contain non-printable ASCII characters that usually don’t belong in CSV files. It can be hard to track these down. Use the find-and-replace directive. `find-and-replace :col 's/\000-\007\013-\037//g'.

  • To remove CTRL-M from the end of each line, again use the find-and-replace directive. find-and-replace :col 's/\r$//g' if you want to remove CTRL-M at end of the line. Make sure you apply this directive before applying parse-as-csv.

Removing the Header Row

Note

Important: This section is deprecated. See Parsing CSV Files in Wrangler.

It is not recommended to use the auto-detect header option because it is error prone. That said, how do you go about removing the header from the data rows that need to be processed and more importantly how you set the header? Let’s take the following CSV file as input:

Code Block
school_id, student_id, last_name, first_name
14J456,33445566,Potter,Harry
14J456,44333433,Weasley,Ron
14J456,65765566,Granger,Hermione
14J456,13233121,Diggory,Cedric
14J456,98786868,Weasley,George
14J456,78977876,Weasley,Fred

To parse the CSV file without the header, use the following directive:

  • parse-as-csv :body ',' false

false at the end specifies that no auto-header detection is enabled. The columns would be named as body_1, body_2, … body_n. The name of the columns are based on the input column name, in this case it’s body.

Note: Data rows will also include the header, which need to be removed from processing.

Now that we have all rows of the CSV file parsed, we can use the filter-rows-on directive to filter and remove the header row. For this, you will have to identify a column name in the header that will never occur in the data. In this example, we will use school_id as the column name:

  • filter-rows-on regex-match body_1 ^school_id$

The directive specifies that we are filtering on regex-match on the column body_1 (this is the name of the column that was automatically generated from the previous step).

Once this directive is issued, the header row is removed. Now, you can go ahead and set the actual names of the column using the set columns directive as follows:

  • set columns school_id,student_id,first_name,last_name

Note: You don’t need to rename all the columns. Rename only the columns you are interested in. The column names are serially applied to the actual columns in the data rows. If there are more columns than what you have set, they will not be changed.

Here’s the full recipe:

Code Block
parse-as-csv :body ',' false
filter-rows-on regex-match body_1 ^school_id$
drop body
set columns school_id,student_id,first_name,last_name

Incorrect Width

  • Filtering records that do not have a specified number of columns in a record can be achieved with send-to-error or filter-row. For example, send-to-error exp : { this.width() < 4 } will send all records that have less than 4 (0..3) to error. filter-row exp: { this.width() < 4 } true will filter records that are less than 4 from the main dataset.

  • To find rows that have issues, you can use filter-row with a minor change as follows: filter-row exp: { this.width() < 4 } false. This will show you only rows that are problematic.

...

Code Block
school_id, student_id, last_name, first_name
14J456,33445566,"Potter,Jr",Harry
14J456,44333433,Weasley,Ron

Leading Zeros

Numerical values (in fields like ‘school_id’, ‘student_id’, and ‘teacher_id’) often begin with “0”, but Excel removes these leading zeros, which changes the value listed for that field. For example, student_id “009844” becomes “9844”.

...

UTF-8 encoded files

At times, you might receive CSV files that are UTF-8 encoded. When a file is exported from Excel, you can configure it to be exported as UTF-8. In order to better handle such files, Wrangler provides a set-character directive to set the right encoding. This directive has to be applied before any other operations are performed as follows.

...