Goal 

The Excel Input Reader lets the user read data from one or more Excel files. The Input Reader supports the following types of Excel files:

  • Microsoft Excel 97(-2007) file format
  • Microsoft Excel XML (2007+) file format 

Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature

Use-case

An Enterprise ETL Developer is able to read Excel files that are uploaded to HDFS. The plugin should support the following use-cases:

  • User should be able to process Excel files stored on HDFS
  • User should have the ability to specify a path and regex for selecting the files to be processed
  • User should have the ability to specify a memory table that keeps track of the files already processed, and to specify whether already processed files should be processed again
  • User will provide the Sheet name or the Sheet number to be processed
  • User will also specify whether the first column should be skipped
  • User should be able to specify the list of columns to be extracted by column name
  • User should be able to process all the columns
  • User should be able to see in the output records, the Sheet name and Excel file name
  • User should be able to terminate processing if there is an empty row in the Excel Sheet
  • User should be able to limit the number of rows to be read
  • User should be able to specify the output schema, and type conversions should be handled automatically where possible
  • User should be able to specify how an error record is handled, by choosing one of
    • Ignoring the record
    • Stopping the processing
    • Writing the record to the error dataset
  • User should be able to see the Row Number when a record is written to the error dataset
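
The row-level use-cases above (row limit, stopping at an empty row, type conversion, and error handling) can be sketched roughly as follows. This is only an illustrative sketch in plain Python, not the plugin implementation: the names `OnError` and `read_rows`, the representation of a row as a dict, and the schema as a mapping from column name to a converter function are all assumptions made for the example. Skipping empty rows when termination is not requested is also an assumption.

```python
from enum import Enum
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple


class OnError(Enum):
    """Hypothetical names for the three error-handling choices."""
    IGNORE = "ignore"            # drop the bad record and continue
    STOP = "stop"                # stop the processing
    WRITE_ERROR = "write_error"  # write the record to the error dataset


def read_rows(
    rows: Iterable[Dict[str, Any]],
    schema: Dict[str, Callable[[Any], Any]],
    on_error: OnError,
    max_rows: Optional[int] = None,
    stop_on_empty_row: bool = False,
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Convert rows to the output schema; return (records, error_records)."""
    records: List[Dict[str, Any]] = []
    errors: List[Dict[str, Any]] = []
    for row_number, row in enumerate(rows, start=1):
        if max_rows is not None and row_number > max_rows:
            break  # row limit reached
        if all(value in (None, "") for value in row.values()):
            if stop_on_empty_row:
                break  # terminate processing at the first empty row
            continue   # assumption: otherwise empty rows are skipped
        try:
            # Keep only the schema columns and convert each value.
            records.append({col: conv(row[col]) for col, conv in schema.items()})
        except (KeyError, TypeError, ValueError) as exc:
            if on_error is OnError.STOP:
                raise RuntimeError(f"row {row_number}: {exc}") from exc
            if on_error is OnError.WRITE_ERROR:
                # The row number travels with the error record.
                errors.append({"row_number": row_number, "row": row, "error": str(exc)})
            # OnError.IGNORE: the record is dropped silently.
    return records, errors
```

With `OnError.WRITE_ERROR`, a row whose value cannot be converted ends up in the second returned list together with its row number, which is what the last use-case asks for.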

Design

  • There will be an option box where the user specifies whether files should be reprocessed. The user will also specify a memory table that keeps track of all processed files. If the user chooses not to reprocess files, the memory table is checked for each input file: if it does not contain the file name, the file is processed and the memory table is updated; otherwise processing of that file is skipped.

  • All the Excel files must contain the sheet name or number specified for processing; otherwise a runtime exception will be thrown and processing will be terminated.
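
The file-selection and reprocessing rule described above could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the plugin's code: `select_files` is a hypothetical name, a plain Python set stands in for the memory table, the regex is matched against the file name only, and the table is keyed by the full path.

```python
import re
from typing import Iterable, List, Set


def select_files(
    paths: Iterable[str],
    pattern: str,
    processed: Set[str],
    reprocess: bool,
) -> List[str]:
    """Return the files to process in this run and record them as processed.

    `processed` stands in for the memory table (assumption: a plain set
    keyed by the full path).
    """
    regex = re.compile(pattern)
    selected: List[str] = []
    for path in paths:
        file_name = path.rsplit("/", 1)[-1]
        if not regex.fullmatch(file_name):
            continue  # file name does not match the user-supplied regex
        if not reprocess and path in processed:
            continue  # already processed and reprocessing is switched off
        selected.append(path)
        processed.add(path)  # update the memory table
    return selected
```

For example, with `/data/a.xls` already in the table and reprocessing switched off, calling `select_files(["/data/a.xls", "/data/b.xlsx"], r".*\.xlsx?", table, reprocess=False)` selects only `/data/b.xlsx` and adds it to the table.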

...

More details will be added based on findings.

Example

 

Questions/Clarifications

 1. Clarifications:

     Requirement: User is able to specify what should happen when there is an error in processing

     Understanding:

     User would have an input field to provide the action to be taken when an error occurs. Each action will be mapped to a constant:

...

    E. Should we also give the user the ability to skip the first row in case the sheet has headers?

 

Assumptions

1. All the Excel files/sheets specified should contain the required output columns; otherwise the file is treated as an error Excel file.

...