ADLS Batch Source
Plugin version: 0.15.0
Azure Data Lake Store Batch Source reads data from Azure Data Lake Store files and converts it into StructuredRecord.
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Reference Name | No | Required. Name used to uniquely identify this source for lineage, annotating metadata, etc. |
Azure Data Lake Store Path | Yes | Required. Path to the files under an Azure Data Lake Store directory. |
Azure Data Lake Store Client Id | Yes | Required. Microsoft Azure client Id, which is typically the Application Id. |
Azure Data Lake Store Refresh Token | Yes | Required. Refresh URL used to access Microsoft Azure Data Lake Store. |
Azure Data Lake Store Credentials | Yes | Required. Key used to access Microsoft Azure Data Lake Store. |
Maximum Split Size | Yes | Optional. Maximum split-size for each mapper in the MapReduce Job. Default is 128 MB. |
Regex Path Filter | Yes | Optional. Regex to filter out filenames in the path. To use the TimeFilter, input |
Read files recursively | No | Optional. Boolean value to determine if files are to be read recursively from the path. Default is false. |
Path Field | No | Optional. If specified, each output record will include a field with this name that contains the file URI that the record was read from. Requires a customized version of CombineFileInputFormat, so it cannot be used if an inputFormatClass is given. |
Use File Name as Path Field | No | Optional. If true and a pathField is specified, only the filename will be used. If false, the full URI will be used. Default is false. |
Input Format Class | Yes | Optional. Name of the input format class, which must be a subclass of FileInputFormat. Defaults to CombineTextInputFormat. |
File System Properties | Yes | Optional. A JSON string representing a map of properties needed for the distributed file system. |
Ignore Non-Existing Folders | No | Optional. Whether to ignore the path when the directory or file does not exist. If set to true, a missing folder is treated as zero input and a warning is logged. Default is false. |
Time Table | Yes | Optional. Name of the Table that keeps track of the last time files were read in. |
Output Schema | No | Required. The output schema for the data. |
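
The optional properties above can be prepared outside the pipeline before they are entered. The following is a minimal sketch, not the plugin's own code: it assumes that the Regex Path Filter keeps filenames that fully match the pattern, and the helper names, sample filenames, and the `io.file.buffer.size` entry are illustrative assumptions only.

```python
import json
import re

# 128 MB expressed in bytes, matching the documented default for Maximum Split Size.
DEFAULT_MAX_SPLIT_SIZE = 128 * 1024 * 1024


def filter_filenames(filenames, pattern):
    """Keep only the names that fully match the regex.

    Assumption: matching names are kept; the plugin's exact matching
    semantics are not stated in this document.
    """
    compiled = re.compile(pattern)
    return [name for name in filenames if compiled.fullmatch(name)]


def file_system_properties(props):
    """Serialize a property map into a JSON string for File System Properties."""
    return json.dumps(props, sort_keys=True)


# Hypothetical filenames, used only to try the pattern locally.
names = ["events-2020-01-01.csv", "events-2020-01-02.csv", "notes.txt"]
csv_only = filter_filenames(names, r".*\.csv")

# Illustrative property map; replace with whatever the file system needs.
fs_props = file_system_properties({"io.file.buffer.size": "65536"})
```

Trying a pattern against a few known filenames in this way can catch an overly broad or overly narrow regex before the pipeline runs.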
Example
This example connects to Microsoft Azure Data Lake Store and reads the files found in the specified directory. It uses the Azure Data Lake Store Path xyz.azuredatalakestore.net together with the Azure Data Lake Store Client Id, Azure Data Lake Store Refresh Token, and Azure Data Lake Store Credentials:
Property | Value |
---|---|
Reference Name | |
Azure Data Lake Store Path | |
Azure Data Lake Store Client Id | |
Azure Data Lake Store Refresh Token | |
Azure Data Lake Store Credentials | |
Read files recursively | |
Output Schema | |
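
Before deploying a pipeline, it can help to confirm that every property marked Required in the Configuration table has a value. The sketch below is a hypothetical pre-flight check, not part of the plugin: it keys the configuration by the display names from the table, and every value except the path is a placeholder to be replaced.

```python
# Properties marked "Required" in the Configuration table.
REQUIRED_PROPERTIES = [
    "Reference Name",
    "Azure Data Lake Store Path",
    "Azure Data Lake Store Client Id",
    "Azure Data Lake Store Refresh Token",
    "Azure Data Lake Store Credentials",
    "Output Schema",
]


def missing_required(config):
    """Return the required property names that are absent or empty."""
    return [name for name in REQUIRED_PROPERTIES if not config.get(name)]


# Placeholder values only; substitute real values for your own account.
example_config = {
    "Reference Name": "<reference-name>",
    "Azure Data Lake Store Path": "xyz.azuredatalakestore.net",
    "Azure Data Lake Store Client Id": "<client-id>",
    "Azure Data Lake Store Refresh Token": "<refresh-url>",
    "Azure Data Lake Store Credentials": "<credential-key>",
    "Output Schema": "<output-schema>",
}

problems = missing_required(example_config)
if problems:
    print("Missing required properties: " + ", ".join(problems))
```

The check only looks for presence; it does not verify that the credentials or the path are valid.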
Created in 2020 by Google Inc.