Excel Input Reader

Introduction 

The Excel Input Reader provides users the ability to read data from one or more Excel files. The Input Reader supports the following types of Excel files:

  • Microsoft Excel 97(-2007) file format (.xls)
  • Microsoft Excel XML (2007+) file format (.xlsx)

Use-case

An enterprise ETL developer should be able to read Excel files that are uploaded to HDFS. The following are the use cases that the plugin should support.

  • User should be able to process Excel files stored on HDFS
  • User should have the ability to specify a path and regex for selecting the files to be processed
  • User should have the ability to specify a memory table that keeps track of the files processed, and the ability to specify whether already-processed files should be processed again or not
  • User will provide the Sheet name or the Sheet number to be processed
  • User will also specify whether the first row should be skipped or not
  • User should be able to specify the list of columns to be extracted by column name
  • User should be able to process all the columns
  • User should be able to see in the output records, the Sheet name and Excel file name
  • User should be able to terminate processing if there is an empty row in the Excel sheet
  • User should be able to limit the number of rows to be read
  • User should be able to specify the output schema, and type conversions should be handled automatically where possible
    • User should be able to specify how error records should be handled, by specifying one of:
      • Ignoring the record
      • Stopping the processing
      • Writing the record to the error dataset
  • User should be able to see the row number when the error dataset is written.

Design

  • There will be an option box for the user to specify whether files should be reprocessed or not. The user will specify a memory table that keeps track of all processed files. If the user chooses not to reprocess files, the memory table will be consulted for each input file: if the memory table does not contain the file name, the file will be processed and the memory table updated; otherwise the file will be skipped (see the sketch after this list).

  • All the Excel files must contain the sheet name or sheet number to be processed; otherwise a runtime exception will be thrown and processing will be terminated.

  • Top-N row emission will not be guaranteed, although a total row limit on each sheet can be applied. In the MapReduce phase the input may be split up, distributing rows across multiple mappers or reducers that do not return rows in sequence. However, since the size of an Excel file should not exceed 128 MB (the default block size for map jobs), the file will usually be read by a single mapper and the user may still get the expected output.
  • User will be provided an option to check true if termination is required when an empty row is encountered, and false otherwise.
  • A key-value drop-down pair will be provided to the user for the output schema definition, where the key is the column name and the value is selected from a drop-down list of CDAP-supported data types.

          The user will get a drop-down to select the standard action to be taken if a type conversion cannot be handled (an error record). The standard actions will be:
                Ignoring the record
                Terminating the process
                Writing the error record to the error dataset

           This will happen at the record level in the transform stage.

  • Along with the error record, the row number, sheet name or number, and Excel file name will be written to the error dataset.
  • The RecordReader implementation for ExcelInputReader will return a whole row; conditions such as extracting only certain columns will be implemented in the source plugin class.
  • If the user wants to process all the columns, then the output schema will be required to generate the records. Only those fields present in the output schema input will be emitted by the plugin.
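
Below is a minimal sketch of the reprocess check described in the first bullet above. The memory table is stood in for by a simple in-memory tracker; ProcessedFileTracker and its method names are hypothetical, and the real plugin would back this lookup with the user-configured table.

import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the memory table; the actual plugin would use
// the user-configured table keyed by file name.
class ProcessedFileTracker {
  private final Set<String> processed = new HashSet<>();

  boolean isProcessed(String fileName) {
    return processed.contains(fileName);
  }

  void markProcessed(String fileName) {
    processed.add(fileName);
  }
}

class ReprocessCheck {
  /**
   * Returns true if the file should be processed in this run. When
   * reprocessing is disabled, files already present in the tracker are
   * skipped; new files are processed and recorded.
   */
  static boolean shouldProcess(String fileName, boolean reprocess, ProcessedFileTracker tracker) {
    if (reprocess) {
      return true; // the user explicitly asked to reprocess everything
    }
    if (tracker.isProcessed(fileName)) {
      return false; // the memory table says this file was already processed
    }
    tracker.markProcessed(fileName);
    return true;
  }
}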

Input JSON format:

{
  "name": "ExcelInputReader",
  "type": "batchsource",
  "properties": {
        "filePath": "file:///home/cdap",
        "filePattern": ".*",
        "memoryTableName": "inventory-memory-table",
        "reprocess": "false",
        "sheetName": "Sheet1",
        "sheetNo": "-1",
        "columnList": "A,B",
        "columnMapping": "B:name,C:age"
        "skipFirstRow": "false",
        "terminateIfEmptyRow": "false",
        "rowsLimit": "2000" ,
        "outputSchema": "A:string",
        "ifErrorRecord" : "Ignore error and continue",
        "errorDatasetName": ""
   }
}
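
As an illustration of how the outputSchema property above might be interpreted, here is a sketch that parses the "column:type" pairs and coerces raw cell strings accordingly. OutputSchemaParser and its methods are hypothetical helper names, not plugin API; per assumption 4 below, columns absent from the schema default to string.

import java.util.HashMap;
import java.util.Map;

class OutputSchemaParser {
  // Parses "A:string,B:int" into a column-name-to-type map.
  static Map<String, String> parse(String outputSchema) {
    Map<String, String> types = new HashMap<>();
    for (String pair : outputSchema.split(",")) {
      String[] parts = pair.split(":");
      types.put(parts[0].trim(), parts[1].trim().toLowerCase());
    }
    return types;
  }

  // Coerces a raw cell string to the configured type; a thrown
  // NumberFormatException is what the configured error action would handle.
  static Object coerce(String raw, String type) {
    switch (type) {
      case "int":     return Integer.parseInt(raw);
      case "long":    return Long.parseLong(raw);
      case "double":  return Double.parseDouble(raw);
      case "boolean": return Boolean.parseBoolean(raw);
      case "string":
      default:        return raw; // string is the default output type
    }
  }
}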

We are analyzing the Apache POI spreadsheet library (https://poi.apache.org/spreadsheet/index.html) to implement the plug-in.
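
For reference, here is a minimal sketch of reading a sheet row by row with POI's usermodel API, assuming a recent POI version where Workbook is auto-closeable; the file name is illustrative and HDFS input streaming is omitted.

import java.io.File;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

class PoiReadSketch {
  public static void main(String[] args) throws Exception {
    // WorkbookFactory transparently handles both .xls (HSSF) and .xlsx (XSSF)
    try (Workbook workbook = WorkbookFactory.create(new File("inventory.xlsx"))) {
      Sheet sheet = workbook.getSheet("Sheet1"); // or workbook.getSheetAt(0)
      DataFormatter formatter = new DataFormatter();
      for (Row row : sheet) {
        StringBuilder record = new StringBuilder();
        for (Cell cell : row) {
          // formatCellValue renders the cell as the string Excel displays
          record.append(formatter.formatCellValue(cell)).append('\t');
        }
        System.out.println(row.getRowNum() + ": " + record);
      }
    }
  }
}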

More details will be added based on findings.

Questions and/or Clarifications

 1. Clarifications:

     Requirement: User is able to specify what should happen when there is an error in processing.

     Understanding:

     The user will have an input field to provide the action to be taken in an error scenario. Each action will be mapped to a constant:

      

      Action to take             | Mapped Constant/Input
      ---------------------------|----------------------
      Ignore error and continue  | Ignore
      Exit on error              | Exit
      Write to error dataset     | WriteToDataset

   - Based on the input from the user, the appropriate action will take place in the plug-in (in the catch block, when an exception occurs)

   - We will maintain a count of retries (when we are supposed to continue by ignoring errors)

      Error records could throw exceptions such as:

  • IllegalCharacterException
  • Type conversion error
  • NULL or EMPTY value for a non-nullable column

This will happen at the record level in the transform stage; a sketch of the dispatch follows.
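
Here is a sketch of how the mapped constants could drive that catch block, assuming a simple enum for the three actions. The names follow the table above; ErrorRecordHandler is a hypothetical helper, and the error-dataset write is stubbed.

// Hypothetical sketch of dispatching on the configured error action.
enum ErrorAction { IGNORE, EXIT, WRITE_TO_DATASET }

class ErrorRecordHandler {
  private final ErrorAction action;

  ErrorRecordHandler(String configured) {
    // "Ignore" -> IGNORE, "Exit" -> EXIT, anything else -> WRITE_TO_DATASET
    this.action = "Ignore".equals(configured) ? ErrorAction.IGNORE
                : "Exit".equals(configured) ? ErrorAction.EXIT
                : ErrorAction.WRITE_TO_DATASET;
  }

  // Invoked from the catch block when a record fails processing.
  void handle(String fileName, String sheetName, long rowNumber, Exception cause) {
    switch (action) {
      case IGNORE:
        break; // skip the record and continue with the next row
      case EXIT:
        throw new RuntimeException(
            "Terminating: bad record at row " + rowNumber + " of " + fileName, cause);
      case WRITE_TO_DATASET:
        // stub: the real plugin writes the row number, sheet name and
        // file name alongside the record to the error dataset
        System.err.printf("error-dataset <- %s/%s row %d: %s%n",
            fileName, sheetName, rowNumber, cause.getMessage());
        break;
    }
  }
}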

  A. Is this understanding correct? 

  B. If the above understanding is correct, then we can use a text-dropdown widget and allow the user to specify the action to be taken if there is an error while processing.

  C. Top-N row emission will not be guaranteed, although a total row limit on each sheet can be applied, since in the MapReduce phase the input may be split up, distributing rows across multiple mappers or reducers that do not return rows in sequence.

 

2. Questions: 

    A. What are the memory table and error dataset? Can we get a reference for them if they are different from a normal Table?

    B. The requirement states: "User should be able to see in the output records, the Sheet name and Excel file name".

        Should the sheet name and Excel file name appear as output columns corresponding to each record in the emitted datasets?

    C. Can we just use the output schema definition (key-value pairs) as the input from the user for the list of columns to be extracted, instead of having a special input for the requirement "User should be able to specify the list of columns to be extracted by column name"? Having a list-of-columns input alongside the output schema input doesn't add any value and increases the complexity of the plugin.

        Suppose a user provides three columns in the "list of columns" input field but only two columns in the "output schema" input. The inputs then contradict each other and require unnecessary validation on the plugin side, whereas the user can always mention those three fields in the "output schema" input field.

    D. How will the output schema fields be mapped to the Excel column names?

        a. Will the field names in the output schema be the same as the Excel column names? If not, a separate excelColumn-OutputField mapping input will be provided to the user as an optional field. If the user provides a mapping for an Excel column, the mapped value will be treated as the output field name; otherwise the original Excel column name will be used in the output record (see the sketch after these questions).

        b. If the field names the user specifies in the output schema differ from the Excel column names or mapped values, a validation error will be thrown.

    E. Should we also provide the user the ability to skip the first row in case the sheet has headers?
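
To illustrate option (a) in question D, here is a sketch of resolving output field names from an optional columnMapping value such as "B:name,C:age". ColumnNameResolver is a hypothetical helper name; unmapped Excel columns keep their original names.

import java.util.HashMap;
import java.util.Map;

class ColumnNameResolver {
  private final Map<String, String> mapping = new HashMap<>();

  ColumnNameResolver(String columnMapping) {
    if (columnMapping != null && !columnMapping.isEmpty()) {
      for (String pair : columnMapping.split(",")) {
        String[] parts = pair.split(":"); // e.g. "B:name"
        mapping.put(parts[0].trim(), parts[1].trim());
      }
    }
  }

  // Falls back to the original Excel column name when no mapping exists.
  String outputFieldName(String excelColumn) {
    return mapping.getOrDefault(excelColumn, excelColumn);
  }
}

With the mapping above, column B is emitted as "name" while an unmapped column A keeps the name "A".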

 

Assumptions

1. All the Excel files/sheets specified should have the required output columns; otherwise the file will be considered an error Excel file.

2. After process termination because of errors, already-processed data will be passed to the Sink.

3. All the Excel files must contain the sheet name or sheet number to be processed; otherwise a runtime exception will be thrown and processing will be terminated.

4. If the user does not want to process all the columns, then only the columns present in the column list input will be included in the output records. String will be treated as the default output data type for all columns that are not present in the output schema input.




Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature
