...
- An option box will let the user specify whether files should be reprocessed. The user will also specify a memory table that keeps track of all processed files. If the user chooses not to reprocess files, the memory table will be checked for each input file: if it does not contain the file name, the file will be processed and the memory table updated; otherwise processing of that file will be skipped.
- All the Excel files must have the same sheet name or number to be processed; otherwise a runtime exception will be thrown and processing will be terminated.
- If an output schema is not provided, all of the columns will be processed. Otherwise, the output schema will be used to extract the columns from the Excel files. (Please refer to Question C for more details.)
- Emission of the top N rows is not guaranteed, although a total row limit can be applied to each sheet. In the MapReduce phase the input may be split, distributing rows across multiple mappers or reducers, which may not return the rows in sequence. However, because an Excel file should not exceed 128 MB (the default block size for map jobs), the file will normally not be split, so the user may still get the expected output.
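
The reprocess check described in the first bullet can be sketched roughly as follows. This is only an illustration: the class name `ReprocessFilter` and the method `shouldProcess` are hypothetical, and a plain `Set<String>` stands in for the memory table. The sketch also assumes that a file is recorded in the memory table when reprocessing is enabled, which the description above does not state.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the reprocess decision; not the plugin's actual code. */
class ReprocessFilter {
    // Stand-in for the user-specified memory table (assumption: keyed by file name).
    private final Set<String> memoryTable;
    // Value of the "reprocess" option chosen by the user.
    private final boolean reprocess;

    ReprocessFilter(Set<String> memoryTable, boolean reprocess) {
        this.memoryTable = memoryTable;
        this.reprocess = reprocess;
    }

    /**
     * Returns true if the file should be processed. When it is, the file name
     * is recorded in the memory table so a later run can skip it.
     */
    boolean shouldProcess(String fileName) {
        if (!reprocess && memoryTable.contains(fileName)) {
            return false; // already processed and the user asked not to reprocess
        }
        // Assumption: the table is also updated when reprocess is true.
        memoryTable.add(fileName);
        return true;
    }
}
```

With `reprocess` set to `false`, the first call for a given file name returns `true` and any later call for the same name returns `false`.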
...
- Along with the error record, the row number, the sheet name or number, and the Excel file name will be written to the error dataset.
- The RecordReader implementation for ExcelInputReader will return a whole row; conditions such as extraction of certain columns will be implemented in the source plugin class.
- If the user wants to process all the columns, then an output schema will be required to generate the records. Only those fields that are present in the output schema input will be emitted by the plugin.
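
The split between the reader (which returns a whole row) and the source plugin (which keeps only the output schema fields) might look like the sketch below. The class `RowProjector`, the method `project`, and the representation of a row as a `Map<String, String>` keyed by column name are all assumptions made for illustration; the real plugin's record type is not described here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of column extraction in the source plugin. */
class RowProjector {
    /**
     * Keeps only the fields named in the output schema, in schema order.
     * fullRow is the whole row as returned by the reader (assumed shape).
     */
    static Map<String, String> project(Map<String, String> fullRow, List<String> outputFields) {
        Map<String, String> record = new LinkedHashMap<>();
        for (String field : outputFields) {
            // Fields missing from the row are simply not emitted in this sketch.
            if (fullRow.containsKey(field)) {
                record.put(field, fullRow.get(field));
            }
        }
        return record;
    }
}
```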
Input JSON format:
```json
{
  "name": "ExcelInputReader",
  "type": "batchsource",
  "properties": {
    "filesPath": "file:///hadoop/hdfs/xyz.xls",
    "filesPattern": "*",
    "memoryTableName": "memory-table",
    "reprocess": "false",
    "sheetName": "memory-table",
    "sheetNo": "2",
    "columnList": "A,B,C",
    "skipFirstColumn": "false",
    "terminateIfEmptyRow": "false",
    "rowsLimit": "2000",
    "outputSchema": "column1:dataType1,column2:dataType2",
    "ifErrorRecord": "dataset",
    "errorDatasetName": "error-dataset"
  }
}
```
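
The `outputSchema` property in the sample configuration is a comma-separated list of `name:type` pairs. A minimal way to parse that string is sketched below. The class `OutputSchemaParser` and its `parse` method are hypothetical, and treating an empty value as "no schema" and rejecting malformed pairs are assumptions rather than documented plugin behavior.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical parser for the "column1:dataType1,column2:dataType2" format. */
class OutputSchemaParser {
    /** Returns field name to type name, preserving the order given in the property. */
    static Map<String, String> parse(String outputSchema) {
        Map<String, String> fields = new LinkedHashMap<>();
        if (outputSchema == null || outputSchema.trim().isEmpty()) {
            return fields; // assumption: empty means no output schema was provided
        }
        for (String pair : outputSchema.split(",")) {
            String[] parts = pair.split(":", 2);
            if (parts.length != 2 || parts[0].trim().isEmpty() || parts[1].trim().isEmpty()) {
                throw new IllegalArgumentException("Invalid schema entry: " + pair);
            }
            fields.put(parts[0].trim(), parts[1].trim());
        }
        return fields;
    }
}
```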
...