General
To support use cases of migrating files from on-premises systems to Google Cloud, there is a need for comprehensive file handling capabilities. This includes FileList, FileCompression, FileDecompression, FileEncryption, FileDecryption, etc. A few file-level plugins, such as FileMove and FileDelete, are already available in CDAP, and this set needs to be expanded.
UseCase 1
Proposed Design
- FileList Plugin - BatchSource Plugin - Implement a new FileList plugin (BatchSource plugin) with capability similar to the current FileSource plugin. Instead of reading the file contents, it passes the file names with full URIs to the following actions in the pipeline.
- FileCompressEncrypt Sink Plugin - Transform SparkSink Plugin - Implement a compression plugin similar to the Field Compression plugin that accepts an input file URI, reads the file, compresses it, stores it temporarily on the same node, and emits the compressed file's URI for the next processing action to use.
- Invoke the current Google Cloud Storage plugin to persist the file to Cloud Storage.
- FileEncrypt Plugin - Transform Plugin - Plugin supporting PGP encryption of files using a public key.
- FileDecrypt Plugin - Transform Plugin - Plugin supporting PGP decryption using a private key stored in CDAP Secrets.
- A new plugin of type SparkSink that reads the file names from the FileList plugin, compresses each file using gzip / snappy, encrypts it using a PGP public key, and persists it to Google Cloud Storage, HDFS, or the file system.
FileList Plugin
This is a BatchSource plugin similar to the current FileSource plugin, but it only lists the file names with full URIs and does not read the contents of the files.
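The listing behaviour can be illustrated with a minimal Python sketch. It is not CDAP plugin code: the function name `list_file_uris` and its parameters are hypothetical, and it covers only the local file system (no FTP / SFTP). It shows the idea of emitting one full URI per file, optionally recursing, without opening any file.

```python
from pathlib import Path
from typing import Iterator


def list_file_uris(path: str, recursive: bool = False) -> Iterator[str]:
    """Yield the full file:// URI of each regular file under `path`.

    Hypothetical helper for illustration; file contents are never read.
    """
    root = Path(path).resolve()
    if root.is_file():
        # A single file was given: emit just that file.
        yield root.as_uri()
        return
    # "**/*" walks subdirectories as well; "*" stays at the top level.
    pattern = "**/*" if recursive else "*"
    for entry in sorted(root.glob(pattern)):
        if entry.is_file():
            yield entry.as_uri()
```

Each yielded URI corresponds to one output record with the FileName field described in the table.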
...
Section | Field | Type | Description |
---|---|---|---|
Basic Configuration | Path | String | Path of the file or directory (text field). This should also support other file sources such as FTP / SFTP. |
Basic Configuration | Recursive Processing | Boolean | List files recursively. True / False |
Output Schema | FileName | String | Record with FileName with full URI |
...
Queries
- If SFTP or FTP needs to be supported, it is not clear how the credential information can be shared with the next step in the process.
FileCompressEncrypt Plugin
This plugin takes the input file name passed from the FileList plugin, opens a file input stream using the URI, compresses the file using the Gzip or Snappy libraries, and stores the result locally on the node. It then encrypts the file using a PGP public key and persists it.
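The compression step can be sketched in Python as follows. This is an illustration under stated assumptions, not the plugin implementation: `compress_to_local` and `work_dir` are hypothetical names, only `file://` URIs and Gzip are handled, and the PGP encryption and persist steps are omitted because they need a third-party library.

```python
import gzip
import shutil
import tempfile
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse
from urllib.request import url2pathname


def compress_to_local(file_uri: str, work_dir: Optional[str] = None) -> str:
    """Gzip the file behind a file:// URI into a local directory.

    Returns the URI of the compressed copy (hypothetical helper).
    """
    src = Path(url2pathname(urlparse(file_uri).path))
    # Assumption: a fresh temp directory is used when no work_dir is given.
    out_dir = Path(work_dir or tempfile.mkdtemp(prefix="filecompress-"))
    dst = out_dir / (src.name + ".gz")
    # Copy in 1 MiB chunks so a large file is never held fully in memory.
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=1024 * 1024)
    return dst.resolve().as_uri()
```

The returned URI is what a downstream encrypt or persist step would receive in place of the original file name.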
Plugin Properties
Section | Field | Type | Description |
---|---|---|---|
Basic Configuration | Input FileName | String | Full name of the file including path (URI) |
Basic Configuration | Compress File | Boolean | True / False |
Basic Configuration | Compression Algorithm | Dropdown | Snappy / Gzip |
Output Schema | Compressed FileName (TBD) | String | Record with FileName with full URI |
Output Schema | Compressed Content (TBD) | File Stream | |
Queries
- Should this plugin just store the file locally and pass on the new compressed file name, or should it read the file, compress the contents, and pass the compressed contents on as a stream? The issue with a stream is that for large files it might not be efficient, or we might have to do everything in memory, which is not ideal.
- If we store the files locally, we need some way to clean them up after processing.
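One possible answer to the cleanup question is to tie the local copy to a scratch directory that is removed as soon as the persist step returns. The Python sketch below illustrates that option only; `compress_and_persist` and the `persist` callback are hypothetical names standing in for whatever writes the file to its final destination.

```python
import gzip
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def compress_and_persist(src: str, persist: Callable[[Path], None]) -> None:
    """Compress `src` into a scratch directory, pass it to `persist`, then clean up."""
    with tempfile.TemporaryDirectory(prefix="filecompress-") as scratch:
        compressed = Path(scratch) / (Path(src).name + ".gz")
        # Stream the file through gzip rather than loading it into memory.
        with open(src, "rb") as fin, gzip.open(compressed, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        # Hypothetical callback, e.g. an upload to the final storage location.
        persist(compressed)
    # Leaving the with-block deletes the scratch directory and the compressed
    # copy, including when persist raised an exception.
```

This keeps the local file approach (no full in-memory buffering) while bounding how long the temporary copy lives on the node.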
File Encrypt Plugin
...
String ( List) | Gzip / Snappy. Applicable only if above Compress File is set to true | |
Encrypt File | Boolean | True |
PGP Public Key Path | String | Location of PGP public key. Path to File |
PGP Public Key Access Userid | String | User ID to access the public key in case security is enabled |
PGP Public Key Access password | String | Password to access the key file |
OutFilePath | String | Path to store the output file from the sink. The output file name follows the format <InputfileName Suffix>.gz.pgp. The file path URI can point to a local file system, HDFS, or GCS (Google Cloud Storage). |
MoveInput | Boolean | True / False - Move the source input file to a different path so that the next run of the pipeline does not process the same file again. |
MoveFilePath | String | Path to move the input on successful processing of the file. |
Queries
- What is the best approach to track processed files so they are not processed again? The proposal is to move the input files to a different directory after successful processing, so they do not get picked up in the next run.
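The move-after-success proposal can be sketched in Python as below. The names `process_and_archive`, `input_dir`, `processed_dir`, and `process` are hypothetical; the sketch assumes local directories and only illustrates that a file is moved once processing succeeded and is left in place for retry otherwise.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_and_archive(input_dir: str, processed_dir: str,
                        process: Callable[[Path], None]) -> List[str]:
    """Process each file in `input_dir`; move it to `processed_dir` only on success."""
    done = Path(processed_dir)
    done.mkdir(parents=True, exist_ok=True)
    moved = []
    for src in sorted(Path(input_dir).iterdir()):
        if not src.is_file():
            continue
        try:
            process(src)
        except Exception:
            # Leave failed files where they are so the next run retries them.
            continue
        shutil.move(str(src), str(done / src.name))
        moved.append(src.name)
    return moved
```

In terms of the properties above, `processed_dir` plays the role of MoveFilePath, and the move is skipped entirely when MoveInput is false.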
UseCase 2
FileDecompressDecrypt Plugin
The plugin supports decrypting files using a PGP private key and decompressing them.
Plugin Properties
Section | Field | Type | Description |
---|---|---|---|
Basic Configuration | Path | String | Full name of the file including path (URI) |
Output Schema | | | |
File DeCrypt Plugin
The plugin supports decrypting files using a PGP private key. The private key can be loaded as a CDAP secret or provided as an input file location.
Plugin Properties
Section | Field | Type | Description |
---|---|---|---|
Basic Configuration | FileName | | Full name of the file including path (URI). Path containing file name or directory of files. |
Basic Configuration | Recursive Processing | Boolean | True / False |
Basic Configuration | DeCompress File | Boolean | True / False |
Basic Configuration | DeCompression Algorithm | String (List) | Gzip / Snappy. Applicable only if DeCompress File above is set to true |
Basic Configuration | Decrypt File | Boolean | True |
Basic Configuration | PGP Private Key Path | String | Location of the PGP private key. Path to file |
Basic Configuration | PGP Private Key Access Userid | String | User ID to access the private key in case security is enabled |
Basic Configuration | PGP Private Key Access password | String | Password to access the key file |
Basic Configuration | MoveInput | Boolean | True / False - Move the source input file to a different path so that the next run of the pipeline does not process the same file again. |
Basic Configuration | MoveFilePath | String | Path to move the input to on successful processing of the file. |
Output Schema | Output | String | Each row read from the file. |