Goal
This source plugin allows users to read and process mainframe files whose layout is defined by a COBOL copybook. This is a basic first implementation.
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature
Use-case
The plugin reads flat files or datasets generated on an IBM z/OS mainframe according to a fixed-length COBOL copybook. It also works for files produced on AS/400 systems. If a customer has flat files on HDFS that can be parsed with a simple COBOL copybook, applying the copybook makes it possible to read each file and its fields.
Conditions
Supports only fixed-length binary format that matches the copybook
Binary data is converted to Base64-encoded form
The first implementation will not handle complex nested structures in the COBOL copybook
It will also not handle REDEFINES or iterators in the structure
Supports compressed files - Native Compressed Codec
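As a rough illustration of the fixed-length condition, the sketch below slices a byte buffer into records of a known length and rejects input whose size is not a whole multiple of that length. The class and method names are hypothetical and not part of the plugin; the record length of 27 used in the usage note comes from the example copybook on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the plugin: splits a buffer into fixed-length records.
final class FixedLengthRecords {

  static List<byte[]> split(byte[] data, int recordLength) {
    if (recordLength <= 0) {
      throw new IllegalArgumentException("recordLength must be positive");
    }
    if (data.length % recordLength != 0) {
      // A trailing partial record means the data does not match the copybook layout.
      throw new IllegalArgumentException(
          "Data length " + data.length + " is not a multiple of " + recordLength);
    }
    List<byte[]> records = new ArrayList<>();
    for (int offset = 0; offset < data.length; offset += recordLength) {
      records.add(Arrays.copyOfRange(data, offset, offset + recordLength));
    }
    return records;
  }
}
```

For example, `FixedLengthRecords.split(bytes, 27)` on a 54-byte buffer yields two 27-byte records, while a 28-byte buffer is rejected.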
Options
The user should be able to paste the COBOL copybook, or provide a file that gets loaded, into the copybook text section
The user should be able to choose which fields go into the output schema by specifying those fields.
References
Input Format implementation : here
Design
- Assumptions:
- The .cbl file contains the schema as a COBOL data structure
- Both the data file and the .cbl file reside on HDFS
- For each "AbstractFieldValue" read from the data file, if the type is binary, the data will be encoded to Base64 format.
Integer.parseInt(new String(Base64.decodeBase64(Base64.encodeBase64(value.toString().getBytes()))));
or
Base64.decodeInteger(Base64.encodeInteger(value.asBigInteger()));
Which form is used depends on the field data type (int or BigInteger).
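The Base64 step can also be illustrated with the JDK's own `java.util.Base64`, swapped in here for the `Base64` helper calls above (which look like Apache Commons Codec). This is a sketch under that assumption, not the plugin's code: the class and method names are hypothetical, and it only shows that encoding raw field bytes to text and decoding them again is lossless.

```java
import java.util.Base64;

// Hypothetical helper, not part of the plugin: Base64 round trip for binary field values.
final class BinaryFieldCodec {

  // Encodes raw field bytes as a Base64 string suitable for a text field.
  static String encode(byte[] raw) {
    return Base64.getEncoder().encodeToString(raw);
  }

  // Decodes a Base64 string back to the original bytes.
  static byte[] decode(String text) {
    return Base64.getDecoder().decode(text);
  }
}
```

A consumer of the output records could call `BinaryFieldCodec.decode(...)` on the encoded value to recover the original bytes of a binary field.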
Examples
Properties :
copybookContents : Contents of the COBOL copybook file which will contain the data structure
binaryFilePath : Complete path of the .bin file to be read. This must be a fixed-length binary file that matches the copybook.
drop : Comma-separated list of fields to drop. For example: 'field1,field2,field3'.
maxSplitSize : Maximum split-size for each mapper in the MapReduce Job. Defaults to 128MB.
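To make the `drop` property concrete, the sketch below parses a comma-separated value and filters a list of field names with it. The class name, method names, and the in-memory field list are assumptions for illustration; the plugin's actual schema handling may differ.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of the plugin: applies a "drop" list to copybook field names.
final class DropFields {

  // Parses 'field1,field2,field3' into a set, ignoring blanks and surrounding spaces.
  static Set<String> parseDrop(String drop) {
    Set<String> names = new HashSet<>();
    if (drop == null) {
      return names;
    }
    for (String name : drop.split(",")) {
      String trimmed = name.trim();
      if (!trimmed.isEmpty()) {
        names.add(trimmed);
      }
    }
    return names;
  }

  // Returns the copybook fields that remain after removing the dropped ones, in order.
  static List<String> outputFields(List<String> copybookFields, String drop) {
    Set<String> dropped = parseDrop(drop);
    List<String> kept = new ArrayList<>();
    for (String field : copybookFields) {
      if (!dropped.contains(field)) {
        kept.add(field);
      }
    }
    return kept;
  }
}
```

With the six elementary fields of the DTAR020 copybook and `drop` set to `DTAR020-DATE`, `outputFields` returns the other five field names in their original order.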
Example :
This example reads data from the local binary file "file:///home/cdap/DTAR020_FB.bin" and parses it using the schema given in the "COBOL Copybook" text area.
It drops the field "DTAR020-DATE" and generates structured records from the remaining fields of that schema.
{
"name": "CopybookReader",
"plugin": {
"name": "CopybookReader",
"type": "batchsource",
"properties": {
"drop" : "DTAR020-DATE",
"referenceName": "Copybook",
"copybookContents":
"000100* \n
000200* DTAR020 IS THE OUTPUT FROM DTAB020 FROM THE IML \n
000300* CENTRAL REPORTING SYSTEM \n
000400* \n
000500* CREATED BY BRUCE ARTHUR 19/12/90 \n
000600* \n
000700* RECORD LENGTH IS 27. \n
000800* \n
000900 03 DTAR020-KCODE-STORE-KEY. \n
001000 05 DTAR020-KEYCODE-NO PIC X(08). \n
001100 05 DTAR020-STORE-NO PIC S9(03) COMP-3. \n
001200 03 DTAR020-DATE PIC S9(07) COMP-3. \n
001300 03 DTAR020-DEPT-NO PIC S9(03) COMP-3. \n
001400 03 DTAR020-QTY-SOLD PIC S9(9) COMP-3. \n
001500 03 DTAR020-SALE-PRICE PIC S9(9)V99 COMP-3. ",
"binaryFilePath": "file:///home/cdap/DTAR020_FB.bin",
"maxSplitSize": "5"
}
}
}
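The copybook comment in this example states that the record length is 27. The sketch below checks that figure and shows how a COMP-3 value could be decoded, assuming the standard packed-decimal layout: two digits per byte with a trailing sign nibble (0xD or 0xB for negative), and one byte per character for PIC X. The class and method names are hypothetical and are not taken from the plugin.

```java
// Hypothetical helper, not part of the plugin: COMP-3 (packed decimal) sizing and decoding.
final class PackedDecimal {

  // Bytes needed for n digits plus one sign nibble, packed two nibbles per byte.
  static int packedBytes(int digits) {
    return digits / 2 + 1;
  }

  // Record length of the DTAR020 layout, field by field.
  static int dtar020RecordLength() {
    return 8                 // DTAR020-KEYCODE-NO  PIC X(08)
        + packedBytes(3)     // DTAR020-STORE-NO    PIC S9(03) COMP-3
        + packedBytes(7)     // DTAR020-DATE        PIC S9(07) COMP-3
        + packedBytes(3)     // DTAR020-DEPT-NO     PIC S9(03) COMP-3
        + packedBytes(9)     // DTAR020-QTY-SOLD    PIC S9(9) COMP-3
        + packedBytes(11);   // DTAR020-SALE-PRICE  PIC S9(9)V99 COMP-3 (V takes no storage)
  }

  // Decodes packed bytes to an unscaled integer; the caller applies any implied decimals.
  static long unpack(byte[] packed) {
    long value = 0;
    for (int i = 0; i < packed.length; i++) {
      int high = (packed[i] >> 4) & 0x0F;
      int low = packed[i] & 0x0F;
      value = value * 10 + high;
      if (i < packed.length - 1) {
        value = value * 10 + low;
      } else if (low == 0x0D || low == 0x0B) {
        value = -value; // the last nibble carries the sign
      }
    }
    return value;
  }
}
```

Under these assumptions the field sizes are 8 + 2 + 4 + 2 + 5 + 6 bytes, which matches the stated record length of 27.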