Asterisk(*) as Delimiter in split-to-columns works in Data Prep, but fails in Pipeline
Description
Files delimited with an asterisk parse correctly in Data Prep, but fail in pipeline preview or in a published pipeline. Either no error is generated, or the following appears in the error table: Dangling meta character '*' near index 0\x0D\x0A*\x0D\x0A^
Recipe used:
find-and-replace body s/\"//g
split-to-columns body *
drop body
set columns NAIC,NPLAN,PRODUCT,CLAIM,LINE,VERSION,IGROUP,ESSN,CONTRACT,SEQNO,MEMSSN,REL,SEX,DOB,PATCITY,PATST,PATZIP,PDATE,ADMDAT,ADMHR,ADMTYPE,ADMSR,DISHR,PTDIS,PRV,PRVTAXID,NPRV,PRVTYPE,PRVFNAME,PRVMNAME,PRVLNAME,PRVSUFFIX,PRVSPEC,PRVCITY,PRVST,PRVZIP,BILLTYPE,SVCSITE,STATUS,ADMDX,ECODE,DX1,DX2,DX3,DX4,DX5,DX6,DX7,DX8,DX9,DX10,DX11,DX12,DX13,REV,CPT,MOD1,MOD2,OP,FDATE,LDATE,QTY,CHG,TPAY,PREPAID,COPAY,COINS,DED,PATACCT,DISDAT,PRVCTRY,DRG,DRGVER,APC,APCVER,NDC,PRVBILL,NPRVBILL,PRVLNAMEBILL,SUBSLNAME,SUBSFNAME,SUBSMI,MEMSLNAME,MEMSFNAME,MEMSMI,RECTYPE
filter-row-if-matched NAIC NAIC
generate-uuid rowkey
Plain Java (POJO) example demonstrating the same behavior, first with the unescaped metacharacter delimiter and then with the escaped one.
C:\Users\ted\Desktop\test>java RegexTestHarness
Enter your regex: *
Exception in thread "main" java.util.regex.PatternSyntaxException: Dangling meta character '*' near index 0
*
^
at java.util.regex.Pattern.error(Unknown Source)
at java.util.regex.Pattern.sequence(Unknown Source)
at java.util.regex.Pattern.expr(Unknown Source)
at java.util.regex.Pattern.compile(Unknown Source)
at java.util.regex.Pattern.<init>(Unknown Source)
at java.util.regex.Pattern.compile(Unknown Source)
at RegexTestHarness.main(RegexTestHarness.java:16)
C:\Users\ted\Desktop\test>java RegexTestHarness
Enter your regex: \*
Enter input string to search: "who uses an asterisk as a delimiter"*"crazzy"
I found the text "*" starting at index 37 and ending at index 38.
Enter your regex:
Enter input string to search:
I found the text "" starting at index 0 and ending at index 0.
The workaround is to pass the asterisk escaped with a double backslash ("\\*") in the directive within the Wrangler plugin.
This code also appears to fix the issue:
public SplitToColumns(int lineno, String detail, String column, String regex) {
  super(lineno, detail);
  this.column = column;
  // Escape a delimiter that is a single regex metacharacter so it is matched literally.
  if (regex.matches("[]\\*\\?\\^\\\\\\$]")) {
    regex = "\\" + regex;
  }
  this.regex = regex;
}
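As a hedged alternative to the character-class check above, a delimiter that is meant literally could be quoted with `java.util.regex.Pattern.quote` before splitting. The class and method names below (`LiteralDelimiterSketch`, `splitLiteral`) are hypothetical and are not part of Wrangler. Quoting every delimiter would also disable delimiters that are intentionally regular expressions, so this is only a sketch of the idea, not a proposed patch.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LiteralDelimiterSketch {
  // Hypothetical helper: splits a line on the delimiter taken literally.
  public static String[] splitLiteral(String line, String delimiter) {
    // Pattern.quote wraps the text in \Q...\E, so metacharacters such as '*'
    // lose their special meaning. The -1 limit keeps trailing empty fields.
    return line.split(Pattern.quote(delimiter), -1);
  }

  public static void main(String[] args) {
    // Prints [NAIC, NPLAN, PRODUCT]
    System.out.println(Arrays.toString(splitLiteral("NAIC*NPLAN*PRODUCT", "*")));
  }
}
```

With this approach the user would type a plain `*` in the directive and no escaping would be needed, at the cost of losing regex delimiters.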
Release Notes
None
Attachments
2
25 May 2017, 02:10 AM
23 May 2017, 04:35 PM
https://github.com/hydrator/wrangler/blob/51003e4e23b4895383042f154a76c21218fbd37d/core/src/main/java/co/cask/wrangler/steps/column/SplitToColumns.java
SplitToColumns.java does not appear to evaluate regex special characters, and escaping the delimiter as \* does not work. Should the escape be \\* instead?
The cdap-ui code performs its own substitution and does double-escape the delimiters it knows about: https://github.com/caskdata/cdap/blob/f9eaadc1ce4b01de9dbcfa1331912985218b71da/cdap-ui/app/cdap/components/DataPrep/Directives/ExtractFields/UsingDelimiterModal/index.js
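The escaping question raised above can be checked outside Wrangler with plain `java.util.regex`. The sketch below uses a hypothetical class name (`RegexDelimiterCheck`) and only standard JDK calls. It shows that a bare `*` is rejected by `Pattern.compile`, while the regex `\*` (written `"\\*"` in Java source) compiles and splits on the literal asterisk. How many backslashes must be typed in a directive depends on how the directive text is parsed before it reaches the regex engine, which this sketch does not model.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexDelimiterCheck {
  // Returns true if the given text compiles as a Java regular expression.
  public static boolean compiles(String regex) {
    try {
      Pattern.compile(regex);
      return true;
    } catch (PatternSyntaxException e) {
      // For "*" the message starts with: Dangling meta character '*' near index 0
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(compiles("*"));              // false
    System.out.println(compiles("\\*"));            // true
    System.out.println("a*b".split("\\*").length);  // 2
  }
}
```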