Asterisk(*) as Delimiter in split-to-columns works in Data Prep, but fails in Pipeline

Description

Files delimited with an asterisk parse correctly in Data Prep, but throw an error in pipeline preview or in a published pipeline. Either no error is generated, or the following appears in the error table: Dangling meta character '*' near index 0\x0D\x0A*\x0D\x0A^

find-and-replace body s/\"//g
split-to-columns body *
drop body
set columns NAIC,NPLAN,PRODUCT,CLAIM,LINE,VERSION,IGROUP,ESSN,CONTRACT,SEQNO,MEMSSN,REL,SEX,DOB,PATCITY,PATST,PATZIP,PDATE,ADMDAT,ADMHR,ADMTYPE,ADMSR,DISHR,PTDIS,PRV,PRVTAXID,NPRV,PRVTYPE,PRVFNAME,PRVMNAME,PRVLNAME,PRVSUFFIX,PRVSPEC,PRVCITY,PRVST,PRVZIP,BILLTYPE,SVCSITE,STATUS,ADMDX,ECODE,DX1,DX2,DX3,DX4,DX5,DX6,DX7,DX8,DX9,DX10,DX11,DX12,DX13,REV,CPT,MOD1,MOD2,OP,FDATE,LDATE,QTY,CHG,TPAY,PREPAID,COPAY,COINS,DED,PATACCT,DISDAT,PRVCTRY,DRG,DRGVER,APC,APCVER,NDC,PRVBILL,NPRVBILL,PRVLNAMEBILL,SUBSLNAME,SUBSFNAME,SUBSMI,MEMSLNAME,MEMSFNAME,MEMSMI,RECTYPE
filter-row-if-matched NAIC NAIC
generate-uuid rowkey

https://github.com/hydrator/wrangler/blob/51003e4e23b4895383042f154a76c21218fbd37d/core/src/main/java/co/cask/wrangler/steps/column/SplitToColumns.java

It seems SplitToColumns.java does not evaluate special characters, and escaping via \* is not working. Should the escape be \\*?
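
For reference, the failure can be reproduced outside Wrangler. The sketch below assumes the step ultimately hands the delimiter to Java's regex engine (here through String.split); the class and method names are hypothetical and are not taken from SplitToColumns.java.

```java
import java.util.regex.PatternSyntaxException;

public class DelimiterEscapeDemo {
    // Hypothetical helper: splits a value the way a regex-based step would.
    static String[] splitOn(String value, String regex) {
        return value.split(regex);
    }

    public static void main(String[] args) {
        String row = "NAIC*NPLAN*PRODUCT";
        try {
            // A bare "*" is a quantifier with nothing to repeat, so compilation fails.
            splitOn(row, "*");
        } catch (PatternSyntaxException e) {
            System.out.println("Unescaped delimiter failed: " + e.getDescription());
        }
        // The Java literal "\\*" is the two-character regex \* (a literal asterisk).
        String[] cols = splitOn(row, "\\*");
        System.out.println("Escaped delimiter produced " + cols.length + " columns");
    }
}
```

Note that the Java source literal needs two backslashes to produce the single backslash the regex engine sees; a directive typed by a user may pass through further layers of unescaping before it reaches this point.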

The cdap-ui code performs substitution, and it does apply a double escape for known substitutions: https://github.com/caskdata/cdap/blob/f9eaadc1ce4b01de9dbcfa1331912985218b71da/cdap-ui/app/cdap/components/DataPrep/Directives/ExtractFields/UsingDelimiterModal/index.js

Plain Java example demonstrating the same behavior, first with the unescaped metacharacter delimiter and then with the escaped one.

C:\Users\ted\Desktop\test>java RegexTestHarness

Enter your regex: *
Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*
^
at java.util.regex.Pattern.error(Unknown Source)
at java.util.regex.Pattern.sequence(Unknown Source)
at java.util.regex.Pattern.expr(Unknown Source)
at java.util.regex.Pattern.compile(Unknown Source)
at java.util.regex.Pattern.<init>(Unknown Source)
at java.util.regex.Pattern.compile(Unknown Source)
at RegexTestHarness.main(RegexTestHarness.java:16)

C:\Users\ted\Desktop\test>java RegexTestHarness

Enter your regex: \*
Enter input string to search: "who uses an asterisk as a delimiter"*"crazzy"
I found the text "*" starting at index 37 and ending at index 38.

Enter your regex:
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
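
The source of RegexTestHarness is not attached here. A non-interactive equivalent might look like the sketch below, assuming the harness compiles the entered text with Pattern.compile and reports the first Matcher.find() hit; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexHarnessSketch {
    // Returns {start, end} of the first match, or null when nothing matches.
    static int[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        String input = "\"who uses an asterisk as a delimiter\"*\"crazzy\"";
        // "\\*" in Java source is the regex \* (a literal asterisk).
        int[] span = firstMatch("\\*", input);
        System.out.println("I found the text \"*\" starting at index "
            + span[0] + " and ending at index " + span[1] + ".");
    }
}
```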

The workaround is to pass a double backslash ("\\*") in the directive within the Wrangler plugin.

This code also appears to fix the issue:
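public SplitToColumns(int lineno, String detail, String column, String regex) {
  super(lineno, detail);
  this.column = column;

  // If the delimiter is exactly one of these regex metacharacters,
  // prefix it with a backslash so it is matched literally.
  if (regex.matches("[]\\*\\?\\^\\\\\\$]")) {
    regex = "\\" + regex;
  }

  this.regex = regex;
}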
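
The constructor patch above only covers a delimiter that is a single character from a fixed list. One possible alternative, offered as a sketch rather than a tested change to Wrangler, is to quote the whole delimiter with java.util.regex.Pattern.quote. The helper below is hypothetical, and it assumes the delimiter should always be treated as literal text, which would change behavior for anyone who passes a real regex on purpose.

```java
import java.util.regex.Pattern;

public class LiteralDelimiter {
    // Hypothetical helper: treats the entire delimiter as literal text.
    static String[] splitLiteral(String value, String delimiter) {
        // Pattern.quote wraps the text in \Q...\E so no character is special.
        return value.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        System.out.println(splitLiteral("NAIC*NPLAN*PRODUCT", "*").length);
        System.out.println(splitLiteral("a**b", "**").length);
        System.out.println(splitLiteral("a.b", ".").length);
    }
}
```

Quoting handles multi-character delimiters and metacharacters missing from the list (such as "." or "|"), at the cost of no longer accepting regular expressions as delimiters.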

Release Notes

None

Attachments

2
  • 25 May 2017, 02:10 AM
  • 23 May 2017, 04:35 PM

Created May 23, 2017 at 4:35 PM
Updated September 5, 2019 at 12:10 AM