ADLS Batch Source

Plugin version: 0.15.0

Azure Data Lake Store Batch Source reads data from Azure Data Lake Store files and converts it into StructuredRecord.

Configuration

Property

Macro Enabled?

Description

Reference Name

No

Required. Name used to uniquely identify this source for lineage, annotating metadata, etc.

Azure Data Lake Store Path

Yes

Required. Path to the files under the Azure Data Lake Store directory.

Azure Data Lake Store Client Id

Yes

Required. Microsoft Azure client ID, which is typically the Application ID.

Azure Data Lake Store Refresh Token

Yes

Required. Refresh token URL used to access Microsoft Azure Data Lake Store.

Azure Data Lake Store Credentials

Yes

Required. Key used to access Microsoft Azure Data Lake Store.

Maximum Split Size

Yes

Optional. Maximum split size for each mapper in the MapReduce job.

Default is 128 MB.

Regex Path Filter

Yes

Optional. Regex to filter out filenames in the path. To use the TimeFilter, enter timefilter. The TimeFilter assumes that the files it reads follow the log file naming convention YYYY-MM-DD-HH-mm-SS-Tag. If the timeTable field is left blank, the TimeFilter reads files from the previous hour: if the current time is 2015-06-16-15 (June 16th 2015, 3 PM), it reads files whose names contain 2015-06-16-14. If the timeTable field is set, it reads files that have not yet been read.

Read files recursively

No

Optional. Boolean value to determine if files are to be read recursively from the path.

Default is false.

Path Field

No

Optional. If specified, each output record will include a field with this name that contains the file URI that the record was read from. Requires a customized version of CombineFileInputFormat, so it cannot be used if an inputFormatClass is given.

Use File Name as Path Field

No

Optional. If true and a pathField is specified, only the filename will be used. If false, the full URI will be used.

Default is false.

Input Format Class

Yes

Optional. Name of the input format class, which must be a subclass of FileInputFormat. Defaults to CombineTextInputFormat. 

File System Properties

Yes

Optional. A JSON string representing a map of properties needed for the distributed file system.

Ignore Non-Existing Folders

No

Optional. Whether to ignore the path when the directory or file does not exist. If set to true, a missing folder is treated as zero input and a warning is logged.

Default is false.

Time Table

Yes

Optional. Name of the Table that keeps track of the last time files were read in. 

Output Schema

No

Required. The output schema for the data.
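
The previous-hour behaviour described for the TimeFilter under Regex Path Filter (with timeTable left blank) can be illustrated with a small sketch. This is not the plugin's implementation: the function names `previous_hour_tag` and `select_previous_hour` and the sample filenames are hypothetical, and the sketch only assumes the YYYY-MM-DD-HH-mm-SS-Tag naming convention stated above.

```python
from datetime import datetime, timedelta


def previous_hour_tag(now: datetime) -> str:
    """Return the YYYY-MM-DD-HH tag for the hour before `now` (hypothetical helper)."""
    return (now - timedelta(hours=1)).strftime("%Y-%m-%d-%H")


def select_previous_hour(filenames, now):
    """Keep only the filenames that contain the previous hour's tag."""
    tag = previous_hour_tag(now)
    return [name for name in filenames if tag in name]


# Hypothetical filenames following YYYY-MM-DD-HH-mm-SS-Tag.
files = [
    "2015-06-16-14-05-00-web.log",
    "2015-06-16-15-01-30-web.log",
    "2015-06-16-13-59-59-web.log",
]

# Current time from the description above: June 16th 2015, 3 PM.
selected = select_previous_hour(files, datetime(2015, 6, 16, 15, 0))
print(selected)
```

With the current time set to 2015-06-16 15:00, only the file whose name contains 2015-06-16-14 is selected, matching the example in the Regex Path Filter description.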

Example

This example connects to Microsoft Azure Data Lake Store and reads the files found in the specified directory. It uses the Azure Data Lake Store Path adl://xyz.azuredatalakestore.net/adls, together with the Azure Data Lake Store Client Id, Azure Data Lake Store Refresh Token, and Azure Data Lake Store Credentials:

Property

Value

Reference Name

store

Azure Data Lake Store Path

adl://xyz.azuredatalakestore.net/adls

Azure Data Lake Store Client Id

2016c0cb-9b0a-411d-9976-457112a6baca

Azure Data Lake Store Refresh Token

https://login.windows.net/6f3d9678-d0b4-4d7e-ac55-128e30605fac/oauth2/token

Azure Data Lake Store Credentials

d1cF7CwFJKlMWXPz30OZ0XD8DErPsSWf0zXyH4iDzKA=

Read files recursively

false

Output Schema

{"type":"record","name":"etlSchemaBody","fields":[{"name":"offset","type":"long"},{"name":"body","type":"string"}]}
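
To make the Output Schema value easier to read, the following sketch parses it with Python's standard `json` module and lists each field with its type. The schema string is the one from the example, written with plain quotes; the helper name `field_types` is hypothetical and is not part of the plugin.

```python
import json

# Output Schema value from the example above, written with plain quotes.
OUTPUT_SCHEMA = (
    '{"type":"record","name":"etlSchemaBody","fields":['
    '{"name":"offset","type":"long"},'
    '{"name":"body","type":"string"}]}'
)


def field_types(schema_json: str) -> dict:
    """Map each field name in a record schema to its declared type (hypothetical helper)."""
    schema = json.loads(schema_json)
    return {field["name"]: field["type"] for field in schema["fields"]}


print(field_types(OUTPUT_SCHEMA))
```

Each record produced with this schema therefore carries two fields: a long named offset and a string named body.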

 

Created in 2020 by Google Inc.