
Goal 

The Excel Input Reader gives users the ability to read data from one or more Excel files. The Input Reader supports the following types of Excel files:

  • Microsoft Excel 97(-2007) file format
  • Microsoft Excel XML (2007+) file format 

Checklist

  • User stories documented 
  • User stories reviewed 
  • Design documented 
  • Design reviewed 
  • Feature merged 
  • Examples and guides 
  • Integration tests 
  • Documentation for feature 
  • Short video demonstrating the feature

Use-case

An enterprise ETL developer should be able to read Excel files that are uploaded to HDFS. The following are the use cases that the plugin should support:

  • User should be able to process Excel files stored on HDFS
  • User should have the ability to specify a path and regex for selecting the files to be processed
  • User should have the ability to specify a memory table that keeps track of the files processed, and to specify whether already-processed files should be processed again
  • User will provide the sheet name or the sheet number to be processed
  • User will also specify whether the first column should be skipped or not
  • User should be able to specify the list of columns to be extracted by column name
  • User should be able to process all the columns
  • User should be able to see, in the output records, the sheet name and Excel file name
  • User should be able to terminate processing if there is an empty row in the Excel sheet
  • User should be able to limit the number of rows to be read
  • User should be able to specify the output schema, and type conversions should be handled automatically where possible
    • User should be able to specify how an error record should be handled, by choosing one of:
      • Ignoring the record
      • Stopping the processing
      • Writing the record to an error dataset
  • User should be able to see the row number when the error dataset is written.

Design

  • There will be an option box for the user to specify whether files should be reprocessed or not. The user will specify a memory table that keeps track of all the processed files. If the user specifies not to reprocess files, the memory table is consulted for each input file: if the memory table does not contain the file name, the file is processed and the memory table is updated; otherwise, processing of that file is skipped.
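
    A minimal sketch of this check, assuming the memory table is a CDAP KeyValueTable (the method and argument names here are illustrative, not the final plugin API):

// Sketch only: decide whether a file should be processed, given the user's
// "reprocess" flag and a KeyValueTable acting as the memory table.
// import co.cask.cdap.api.dataset.lib.KeyValueTable;
private boolean shouldProcess(KeyValueTable memoryTable, String fileName, boolean reprocess) {
  if (reprocess) {
    return true;                  // user chose to reprocess everything
  }
  if (memoryTable.read(fileName) != null) {
    return false;                 // already processed earlier: skip
  }
  memoryTable.write(fileName, Long.toString(System.currentTimeMillis()));
  return true;                    // first time seen: process and record it
}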

  • All the Excel files must contain the sheet name or number to be processed; otherwise a runtime exception will be thrown and processing will be terminated.

  • If an output schema is not provided, then all of the columns will be processed. Otherwise, the output schema will be used to extract the columns from the Excel files. (Please refer to Question C for more details.)
  • Emission of the top N rows cannot be guaranteed, although a total row limit on each sheet can be applied. In the MapReduce phase the input may be split, distributing rows across multiple mappers or reducers, which may not return rows in order.
  • The user will be provided a true/false option: true if termination is required when an empty row is encountered, false otherwise.
  • A key-value drop-down pair will be provided to the user for output schema definition, where the key is the column name and the value is an option from a drop-down list of CDAP-supported data types.

          The user will get a drop-down to select the action to be taken if a type conversion cannot be handled (error record). The standard actions will be:
                Ignoring the record
                Terminating the process
                Writing the error record to the error dataset.

  • Along with the error record, the row number, sheet name or number, and Excel file name will be written to the error dataset.
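
    A sketch of how such an error record could be assembled with CDAP's StructuredRecord API (the schema and field names here are placeholders, not the final layout):

// Sketch only: attach provenance fields to an error record before writing
// it to the error dataset. Field names are illustrative placeholders.
// import co.cask.cdap.api.data.format.StructuredRecord;
// import co.cask.cdap.api.data.schema.Schema;
Schema errorSchema = Schema.recordOf("excelError",
  Schema.Field.of("rowNum", Schema.of(Schema.Type.LONG)),
  Schema.Field.of("sheetName", Schema.of(Schema.Type.STRING)),
  Schema.Field.of("fileName", Schema.of(Schema.Type.STRING)),
  Schema.Field.of("record", Schema.of(Schema.Type.STRING)));

StructuredRecord errorRecord = StructuredRecord.builder(errorSchema)
  .set("rowNum", rowNum)
  .set("sheetName", sheetName)
  .set("fileName", fileName)
  .set("record", rawRowText)
  .build();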

Input Json format:

{
  "name": "ExcelInputReader",
  "type": "batchsource",
  "properties": {
    "filesPath": "file:///hadoop/hdfs/xyz.xls",
    "memoryTableName": "memory-table",
    "reprocess": "false",
    "sheetName": "Sheet1",
    "sheetNo": "2",
    "skipFirstColumn": "false",
    "terminateIfEmptyRow": "false",
    "rowsLimit": "2000",
    "outputSchema": "column1:dataType1,column2:dataType2",
    "ifErrorRecord": "dataset",
    "errorDatasetName": "error-dataset"
  }
}
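
The outputSchema value above is a comma-separated list of column:type pairs; a minimal parsing sketch follows (parseOutputSchema is an illustrative helper, not part of the plugin API):

// Sketch only: split "column1:dataType1,column2:dataType2" into an
// order-preserving map of column name to type name. No type validation here.
// import java.util.LinkedHashMap; import java.util.Map;
private Map<String, String> parseOutputSchema(String outputSchema) {
  Map<String, String> columns = new LinkedHashMap<>();
  for (String pair : outputSchema.split(",")) {
    String[] parts = pair.trim().split(":", 2);
    columns.put(parts[0].trim(), parts[1].trim());
  }
  return columns;
}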

We are analyzing the Apache POI spreadsheet library (https://poi.apache.org/spreadsheet/index.html) to implement the plugin.
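
For reference, a minimal POI read loop over one sheet (WorkbookFactory auto-detects the .xls and .xlsx formats; fs.open(path) stands in for however the plugin obtains an HDFS input stream):

// Sketch only: iterate the rows and cells of a named sheet with Apache POI.
// import java.io.InputStream;
// import org.apache.poi.ss.usermodel.*;
try (InputStream in = fs.open(path)) {            // e.g. an HDFS input stream
  Workbook workbook = WorkbookFactory.create(in); // HSSF (.xls) or XSSF (.xlsx)
  Sheet sheet = workbook.getSheet(sheetName);     // or workbook.getSheetAt(sheetNo)
  for (Row row : sheet) {
    for (Cell cell : row) {
      // convert the cell value according to the output schema here
    }
  }
}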

More details will be added based on findings.

Example

 

Questions/Clarifications

 1. Clarifications:

     Requirement: User is able to specify what should happen when there is an error in processing.

     Understanding:

     The user will have an input field to provide the action to be taken in an error scenario. Each action will be mapped to a constant:

      

Action to take            | Mapped Constant/Input
--------------------------|----------------------
Ignore error and continue | Ignore
Exit on error             | Exit
Write to error dataset    | WriteToDataset

   - Based on the input from the user, the appropriate action will take place in the plugin (in the catch block, when an exception occurs); see the sketch after the list of exceptions below.

   - We will maintain the number of retries (when we are supposed to continue by ignoring errors).

      Processing an error record could raise exceptions such as:

  • IllegalCharacterException
  • Type conversion error
  • NULL or EMPTY value for a non-nullable column
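
  A sketch of the dispatch inside the catch block, assuming an ErrorAction enum that mirrors the constants in the table above (processRow and writeToErrorDataset are illustrative placeholders):

// Sketch only: act on the user's configured constant when a row fails.
enum ErrorAction { IGNORE, EXIT, WRITE_TO_DATASET }

try {
  processRow(row);                          // placeholder for row processing
} catch (Exception e) {
  switch (errorAction) {
    case IGNORE:
      break;                                // skip this record and continue
    case EXIT:
      throw new RuntimeException("Terminating on error record", e);
    case WRITE_TO_DATASET:
      writeToErrorDataset(row, e);          // placeholder for dataset write
      break;
  }
}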

  A. Is this understanding correct? 

  B. If the above understanding is correct, then we can use a text-dropdown widget and allow the user to specify the action to be taken if there is an error while processing.

  C. Emission of the top N rows cannot be guaranteed, although a total row limit on each sheet can be applied. In the MapReduce phase the input may be split, distributing rows across multiple mappers or reducers, which may not return rows in order.

 

2. Questions

    A. What are the memory table and the error dataset? Can we get references for them if they are different from a normal Table?

    B. Requirement states that - "User should be able to see in the output records, the Sheet name and Excel file name" 

        Should there be a sheet name and an Excel file name corresponding to each record, as output columns in the emitted datasets?

    C. Can we just use the output schema definition (key-value pairs) as the user's input for the list of columns to be extracted, instead of having a separate input for the requirement "User should be able to specify the list of columns to be extracted by column name"? Having a list-of-columns input alongside the output schema input doesn't add any value and increases the complexity of the plugin.

        Suppose a user provides three columns in the "list of columns" input field but adds only two columns in the "output schema" input. The inputs then contradict each other and unnecessarily require validation on the plugin side, whereas the user can always mention all three fields in the "output schema" input field.

 

Assumptions

1. All the Excel files/sheets specified should have the required output columns; otherwise the file will be treated as an error Excel file.

2. After process termination because of errors, already-processed data will still be passed to the Sink.

3. All the Excel files must contain the sheet name or number to be processed; otherwise a runtime exception will be thrown and processing will be terminated.


