FileManagement Plugins - FileCompress / FileDecompress




General 



To support migrating files from on-premises systems to Google Cloud, comprehensive file handling capabilities are needed, including FileList, FileCompression, FileDecompression, FileEncryption, and FileDecryption. CDAP currently offers only a few file-level plugins, such as FileMove and FileDelete, and this set needs to be expanded.



UseCase 1





Proposed Design 





  1. FileList Plugin (BatchSource plugin): Implement a new FileList plugin of type BatchSource with capabilities similar to the current FileSource plugin. Instead of reading the file contents, it passes only the file names with full URI to the subsequent steps in the pipeline.

  2. FileCompressEncrypt Sink Plugin (SparkSink plugin): Implement a new plugin of type SparkSink that reads the file names from the FileList plugin, compresses each file using gzip or Snappy, encrypts it using a PGP public key, and persists it to Google Cloud Storage, HDFS, or the local file system.

FileList Plugin 

This is a BatchSource plugin similar to the current FileSource plugin, but it only lists the file names with full URI and does not read the contents of the files.

Plugin Properties

Section | Field | Type | Description
Basic Configuration | Path | String | Path to the file or directory (text field). This should also support other file sources such as FTP / SFTP.
Basic Configuration | Recursive Processing | Boolean | List files recursively (True / False).
Output Schema | FileName | String | Record with the file name with full URI.
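
The listing behaviour described above can be sketched with the JDK alone. This is a minimal illustration for a local file system only, not the plugin implementation: the class name FileListSketch and the method listFileUris are hypothetical, and the FTP / SFTP, HDFS, and GCS sources mentioned in the table are not covered.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of CDAP: lists files under a local path and
// returns each one as a full URI string, without reading any file contents.
final class FileListSketch {

  static List<String> listFileUris(Path root, boolean recursive) throws IOException {
    // Depth 1 visits only the direct children of root; MAX_VALUE walks the whole tree.
    int maxDepth = recursive ? Integer.MAX_VALUE : 1;
    try (Stream<Path> paths = Files.walk(root, maxDepth)) {
      return paths
          .filter(Files::isRegularFile)                      // skip directories
          .map(p -> p.toAbsolutePath().toUri().toString())   // e.g. file:///data/in/a.txt
          .sorted()
          .collect(Collectors.toList());
    }
  }
}
```

Each returned string corresponds to one FileName record in the output schema. If the path points at a single file rather than a directory, the list contains just that file.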



Queries 

  • If SFTP or FTP needs to be supported, it is not clear how the credential information can be shared with the next step in the pipeline.

FileCompressEncrypt Plugin


This plugin takes the input file name passed from the FileList plugin, obtains the file input stream from the URI, compresses the file using the Gzip or Snappy libraries, encrypts it using a PGP public key, and persists the result.



Plugin Properties

Section | Field | Type | Description
Basic Configuration | Input FileName | String | Full name of the file including path (URI).
Basic Configuration | Compress File | Boolean | True / False.
Basic Configuration | Compression Algorithm | String (List) | Gzip / Snappy. Applicable only if Compress File above is set to true.
Basic Configuration | Encrypt File | Boolean | True.
Basic Configuration | PGP Public Key Path | String | Location of the PGP public key (path to file).
Basic Configuration | PGP Public Key Access Userid | String | User ID to access the public key in case security is enabled.
Basic Configuration | PGP Public Key Access password | String | Password to access the key file.
Basic Configuration | OutFilePath | String | Path to store the output file from the sink. The output file name follows the format <InputfileName Suffix>.gz.pgp. The file path URI can refer to a local file system, HDFS, or GCS (Google Cloud Storage).
Basic Configuration | MoveInput | Boolean | True / False. Move the source input file to a different path so that the same file is not processed in the next run of the pipeline.
Basic Configuration | MoveFilePath | String | Path to move the input file to after it has been processed successfully.
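
The compression step and the output naming can be sketched as follows. This is a hedged illustration using only the JDK: the class CompressSketch and its methods are hypothetical, the naming rule is one reading of the <InputfileName Suffix>.gz.pgp format, and only gzip on a local file system is shown. Snappy and PGP encryption are left out because neither ships with the JDK; both would need a third-party library.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of CDAP: gzip step of the proposed sink.
final class CompressSketch {

  // Assumed naming rule: append ".gz" when compressing and ".pgp" when encrypting.
  static String outputName(String inputName, boolean compress, boolean encrypt) {
    return inputName + (compress ? ".gz" : "") + (encrypt ? ".pgp" : "");
  }

  // Streams the input file through gzip into outDir and returns the new file's path.
  static Path gzip(Path input, Path outDir) throws IOException {
    String name = outputName(input.getFileName().toString(), true, false);
    Path target = outDir.resolve(name);
    try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(target))) {
      Files.copy(input, out);  // copies all bytes of the input into the gzip stream
    }
    return target;
  }
}
```

Streaming the file rather than loading it into memory keeps the sketch usable for large inputs. An encryption step would wrap the same output stream before the bytes reach storage.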





Queries 

  1. What is the best approach to track processed files so that they are not processed again? The proposal is to move input files to a different directory after successful processing so that they are not picked up in the next run.
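
The move-after-processing proposal could look roughly like this on a local file system. The class MoveInputSketch and the method moveProcessed are hypothetical names, and overwriting an existing file in the target directory is an assumption the proposal does not address.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of CDAP: moves a processed input file aside.
final class MoveInputSketch {

  // Moves a successfully processed input file into moveDir, creating the directory if needed.
  static Path moveProcessed(Path input, Path moveDir) throws IOException {
    Files.createDirectories(moveDir);
    Path target = moveDir.resolve(input.getFileName());
    // Assumption: a file with the same name from an earlier run is overwritten.
    return Files.move(input, target, StandardCopyOption.REPLACE_EXISTING);
  }
}
```

The move should be called only after the output file has been written successfully, so that a failed run leaves the input in place for the next attempt.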





UseCase 2







FileDecompressDecrypt Plugin

The plugin supports decrypting files using a PGP private key and decompressing them.



Plugin Properties

Section | Field | Type | Description
Basic Configuration | Path | String | Path containing the file name or a directory of files.
Basic Configuration | Recursive Processing | Boolean | True / False.
Basic Configuration | DeCompress File | Boolean | True / False.
Basic Configuration | DeCompression Algorithm | String (List) | Gzip / Snappy. Applicable only if DeCompress File above is set to true.
Basic Configuration | DeEncrypt File | Boolean | True.
Basic Configuration | PGP Private Key Path | String | Location of the PGP private key (path to file).
Basic Configuration | PGP Private Key Access Userid | String | User ID to access the private key in case security is enabled.
Basic Configuration | PGP Private Key Access password | String | Password to access the key file.
Basic Configuration | MoveInput | Boolean | True / False. Move the source input file to a different path so that the same file is not processed in the next run of the pipeline.
Basic Configuration | MoveFilePath | String | Path to move the input file to after it has been processed successfully.
Output Schema | Output | String | Each row read from the file.
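
The decompression path, which emits one output record per row, can be sketched with the JDK as below. The class DecompressSketch and the method readGzipRows are hypothetical. The sketch assumes gzip input, UTF-8 text, and a local file, and it leaves out PGP decryption, which needs a third-party library.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of CDAP: reads the rows of a gzip-compressed text file.
final class DecompressSketch {

  // Returns each line of the decompressed file as one row (assumes UTF-8 text).
  static List<String> readGzipRows(Path gzFile) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(Files.newInputStream(gzFile)), StandardCharsets.UTF_8))) {
      return reader.lines().collect(Collectors.toList());
    }
  }
}
```

A real source plugin would emit rows as they are read instead of collecting them into a list. The list is used here only to keep the example short.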









Created in 2020 by Google Inc.