Microsoft Azure Blob Store Batch Source
Plugin version: 0.15.0
Batch source to use Microsoft Azure Blob Storage as a source.
Use this source when you need to read from Microsoft Azure Blob Storage. For example, you may want to read in files from Microsoft Azure Blob Storage, parse them, and then store them in a Microsoft SQL Server Database.
The plugin requires Hadoop 2.8 as a dependency.
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Reference Name | No | Required. This will be used to uniquely identify this source for lineage, annotating metadata, etc. |
Path | Yes | Required. The path on Microsoft Azure Blob Storage to use as input. The path uses filename expansion (globbing) to read files. The path must start with |
Account | Yes | Required. The Microsoft Azure Blob Storage account to use. The account must end with |
Authentication Method | No | Required. The authentication method to use to connect to Microsoft Azure. Can be either Default is |
Azure Blob Store Storage Key | Yes | Optional. The storage key for the container on the Microsoft Azure Storage account. Must be a valid base64 encoded storage key provided by Microsoft Azure. Required when authentication method is set to |
SAS Token | Yes | Optional. The SAS token to use to connect to the specified container. Required when authentication method is set to |
Container | Yes | Optional. The container to connect to. Required when authentication method is set to |
Maximum Split Size | Yes | Optional. Maximum split size for each mapper in the MapReduce job. Defaults to 128 MB. |
Regex Path Filter | Yes | Optional. Regular expression used to filter the files in the path. The expression is applied to the complete path, and only the files that match it are read. To use the TimeFilter, enter ‘timefilter’. The TimeFilter assumes that the files follow the File Log naming convention of ‘YY-MM-DD-HH-mm-SS-Tag’. If the Time Table field is left blank, the TimeFilter reads the files from the previous hour. For example, if the current time is 2015-06-16-15 (June 16, 2015, 3 PM), it reads the files that contain ‘2015-06-16-14’ in the filename. If the Time Table field is set, it reads the files that have not yet been read. Defaults to '*', which means that no files are filtered out. |
Read files recursively | No | Optional. Boolean value to determine if files are to be read recursively from the path. Default is false. |
Path Field | No | Optional. If specified, each output record will include a field with this name that contains the URI of the file that the record was read from. This requires a customized version of CombineFileInputFormat, so it cannot be used if an Input Format Class is specified. |
Use File Name as Path Field | No | Optional. If set to Default is false. |
Input Format Class | Yes | Optional. Name of the InputFormatClass, which must be a subclass of FileInputFormat. Defaults to CombinePathTrackingInputFormat, which is a customized version of CombineTextInputFormat that records the file path each record was read from. |
File System Properties | Yes | Optional. A JSON string representing a map of properties needed for the distributed file system. |
Ignore Non-Existing Folders | No | Optional. Whether to ignore the path when the directory or file does not exist. If set to true, a missing folder is treated as empty input and a warning is logged. Default is false. |
Time Table | Yes | Optional. Name of the table that keeps track of the times at which files were read in. If this is null or empty, the Regex Path Filter is used to filter filenames. |
Output Schema | No | Required. Schema of the table to read. This can be fetched by clicking the Get Schema button. |
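The previous-hour behavior described for the Regex Path Filter can be sketched in a few lines of Python. This is only an illustration of the rule stated in the table, not the plugin's implementation. It assumes the four-digit-year token shown in the ‘2015-06-16-14’ example and that the regular expression must match the complete path. The function names and file paths are hypothetical.

```python
import re
from datetime import datetime, timedelta


def previous_hour_token(now: datetime) -> str:
    """Return the 'YYYY-MM-DD-HH' token for the hour before `now`.

    Assumption: the token uses a four-digit year, as in the
    '2015-06-16-14' example from the property description.
    """
    return (now - timedelta(hours=1)).strftime("%Y-%m-%d-%H")


def select_files(paths, pattern):
    """Keep only the paths whose complete path matches `pattern`.

    Assumption: the filter requires a full match, not a partial one.
    """
    compiled = re.compile(pattern)
    return [p for p in paths if compiled.fullmatch(p)]


# Hypothetical file paths, used only for illustration.
paths = [
    "/logs/2015-06-16-13-59-00-app",
    "/logs/2015-06-16-14-05-00-app",
    "/logs/2015-06-16-15-01-00-app",
]

now = datetime(2015, 6, 16, 15, 0, 0)  # June 16, 2015, 3 PM
token = previous_hour_token(now)

# Match any complete path that contains the previous-hour token.
selected = select_files(paths, ".*" + re.escape(token) + ".*")
print(token)
print(selected)
```

Because the token is derived by subtracting one hour from the current time, a run shortly after midnight selects files from the last hour of the previous day.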
Example
This example connects to Microsoft Azure Blob Storage and reads in files found in the specified directory. This example uses Microsoft Azure Storage mystorageaccount.blob.core.windows.net, using the mystorageaccount account name:
Property | Value |
---|---|
Path | |
Account | |
Authentication Method | |
Azure Blob Store Storage Key | |
Read files recursively | |
Ignore Non-Existing Folders | |
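The Azure Blob Store Storage Key property must hold a valid base64-encoded key, so it can help to check a value before entering it. The sketch below shows one way to do that with Python's standard library. It only checks the encoding, not whether Microsoft Azure accepts the key. The helper name and the sample key are hypothetical.

```python
import base64
import binascii


def is_valid_base64(key: str) -> bool:
    """Return True if `key` is a non-empty, strictly valid base64 string."""
    if not key:
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(key, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


# Hypothetical key material, encoded here only to have a valid sample.
sample_key = base64.b64encode(b"hypothetical-key-bytes").decode("ascii")

print(is_valid_base64(sample_key))
print(is_valid_base64("not base64!"))
```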
Created in 2020 by Google Inc.