ADLS Batch Source
Plugin version: 0.15.0
Azure Data Lake Store Batch Source reads data from files in Azure Data Lake Store and converts it into StructuredRecords.
Configuration
Property | Macro Enabled? | Description |
|---|---|---|
Reference Name | No | Required. Name used to uniquely identify this source for lineage, annotating metadata, etc. |
Azure Data Lake Store Path | Yes | Required. Path to the files under an Azure Data Lake Store directory. |
Azure Data Lake Store Client Id | Yes | Required. Microsoft Azure client ID, which is typically the Application ID. |
Azure Data Lake Store Refresh Token | Yes | Required. Refresh URL used to access Microsoft Azure Data Lake Store. |
Azure Data Lake Store Credentials | Yes | Required. Key used to access Microsoft Azure Data Lake Store. |
Maximum Split Size | Yes | Optional. Maximum split size for each mapper in the MapReduce job. Default is 128 MB. |
Regex Path Filter | Yes | Optional. Regex to filter out filenames in the path. To use the TimeFilter, input |
Read files recursively | No | Optional. Boolean value to determine if files are to be read recursively from the path. Default is false. |
Path Field | No | Optional. If specified, each output record will include a field with this name that contains the file URI that the record was read from. Requires a customized version of CombineFileInputFormat, so it cannot be used if an inputFormatClass is given. |
Use File Name as Path Field | No | Optional. If true and a pathField is specified, only the filename will be used. If false, the full URI will be used. Default is false. |
Input Format Class | Yes | Optional. Name of the input format class, which must be a subclass of FileInputFormat. Defaults to CombineTextInputFormat. |
File System Properties | Yes | Optional. A JSON string representing a map of properties needed for the distributed file system. |
Ignore Non-Existing Folders | No | Optional. Whether to ignore the path when the directory or file does not exist. If set to true, a missing folder is treated as zero input and a warning is logged. Default is false. |
Time Table | Yes | Optional. Name of the Table that keeps track of the last time files were read in. |
Output Schema | No | Required. The output schema for the data. |
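
A few of the optional properties in the table above interact with each other, and a short Python sketch may make those rules easier to follow. It is an illustration only, not the plugin's implementation: the function names are hypothetical, the dictionary keys `pathField` and `inputFormatClass` are taken from the descriptions above, and the `adl://` URI in the usage note is an assumed example value.

```python
import json
import posixpath
from urllib.parse import urlparse

# 128 MB, the documented default for Maximum Split Size.
DEFAULT_MAX_SPLIT_SIZE = 128 * 1024 * 1024


def parse_file_system_properties(raw):
    """Parse the File System Properties JSON string into a dict of strings."""
    if not raw:
        return {}
    props = json.loads(raw)
    if not isinstance(props, dict):
        raise ValueError("File System Properties must be a JSON object")
    return {str(key): str(value) for key, value in props.items()}


def path_field_value(file_uri, filename_only=False):
    """Return what the path field would hold: the full URI or only the file name."""
    if filename_only:
        return posixpath.basename(urlparse(file_uri).path)
    return file_uri


def validate(config):
    """Reject the combination the table describes as unsupported."""
    if config.get("pathField") and config.get("inputFormatClass"):
        raise ValueError("pathField cannot be used when inputFormatClass is given")
```

For instance, `path_field_value("adl://xyz.azuredatalakestore.net/logs/a.txt", filename_only=True)` yields only the file name, while leaving `filename_only` at its default returns the URI unchanged.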
Example
This example connects to Microsoft Azure Data Lake Store and reads the files found in the specified directory. It uses the Azure Data Lake Store Path xyz.azuredatalakestore.net together with the Azure Data Lake Store Client Id, Azure Data Lake Store Refresh Token, and Azure Data Lake Store Credentials:
Property | Value |
|---|---|
Reference Name | |
Azure Data Lake Store Path | |
Azure Data Lake Store Client Id | |
Azure Data Lake Store Refresh Token | |
Azure Data Lake Store Credentials | |
Read files recursively | |
Output Schema | |
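
As a rough illustration of the example above, the following Python sketch collects the properties into a dictionary and checks that every property marked Required in the Configuration table has a value. The keys are the display names from this page. Every value in angle brackets is a placeholder to be replaced with your own settings, and `missing_required` is a hypothetical helper, not part of the plugin.

```python
# Properties marked "Required" in the Configuration table.
REQUIRED = [
    "Reference Name",
    "Azure Data Lake Store Path",
    "Azure Data Lake Store Client Id",
    "Azure Data Lake Store Refresh Token",
    "Azure Data Lake Store Credentials",
    "Output Schema",
]


def missing_required(properties):
    """Return the required property names that are absent or blank."""
    return [
        name
        for name in REQUIRED
        if not str(properties.get(name, "")).strip()
    ]


# Placeholder values only; none of these are real settings.
example = {
    "Reference Name": "<reference-name>",
    "Azure Data Lake Store Path": "<path under xyz.azuredatalakestore.net>",
    "Azure Data Lake Store Client Id": "<client-id>",
    "Azure Data Lake Store Refresh Token": "<refresh-url>",
    "Azure Data Lake Store Credentials": "<key>",
    "Read files recursively": "<true-or-false>",
    "Output Schema": "<output-schema>",
}

print(missing_required(example))
```

An empty list means all required properties are filled in. Removing or blanking an entry makes its name appear in the result.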
Created in 2020 by Google Inc.