Microsoft Azure Blob Store Batch Source

Plugin version: 0.15.0

Batch source to use Microsoft Azure Blob Storage as a source.

Use this source when you need to read from Microsoft Azure Blob Storage. For example, you may want to read in files from Microsoft Azure Blob Storage, parse them, and then store them in a Microsoft SQL Server Database.

This plugin requires Hadoop 2.8 as a dependency.

Configuration

Property

Macro Enabled?

Description

Reference Name

No

Required. This will be used to uniquely identify this source for lineage, annotating metadata, etc.

Path

Yes

Required. The path on Microsoft Azure Blob Storage to use as input. The path uses filename expansion (globbing) to read files. The path must start with wasb:// or wasbs://, for example, wasb://mycontainer@mystorageaccount.blob.core.windows.net/filename.txt.

Account

Yes

Required. The Microsoft Azure Blob Storage account to use. The account must end with .blob.core.windows.net. For example, mystorageaccount.blob.core.windows.net, where mystorageaccount is the Microsoft Azure Storage account name.

Authentication Method

No

Required. The authentication method to use to connect to Microsoft Azure. Can be either Storage Account Key or SAS Token.

Default is Storage Account Key.

Azure Blob Store Storage Key

Yes

Optional. The storage key for the container on the Microsoft Azure Storage account. Must be a valid base64 encoded storage key provided by Microsoft Azure. Required when authentication method is set to Storage Account Key.

SAS Token

Yes

Optional. The SAS token to use to connect to the specified container. Required when authentication method is set to SAS Token.

Container

Yes

Optional. The container to connect to. Required when authentication method is set to SAS Token.

Maximum Split Size

Yes

Optional. Maximum split size for each mapper in the MapReduce job. Defaults to 128 MB.

Regex Path Filter

Yes

Optional. Regular expression used to select files in the path. It is applied to the complete path, and only the files that match the pattern are read. To use the TimeFilter, enter ‘timefilter’. The TimeFilter assumes that it is reading files that follow the File Log naming convention of ‘YY-MM-DD-HH-mm-SS-Tag’. If the Time Table field is left blank, the TimeFilter reads files from the previous hour. For example, if the current time is 2015-06-16-15 (June 16, 2015, 3 PM), it reads the files that contain ‘2015-06-16-14’ in the filename. If the Time Table field is set, it reads the files that have not yet been read.

Defaults to '*' which indicates no files will be filtered.

Read files recursively

No

Optional. Boolean value to determine if files are to be read recursively from the path.

Default is false.

Path Field

No

Optional. If specified, each output record will include a field with this name that contains the URI of the file that the record was read from. This requires a customized version of CombineFileInputFormat, so it cannot be used if an Input Format Class is specified.

Use File Name as Path Field

No

Optional. If set to true and the Path Field is specified, only the filename will be used. If set to false, the full URI will be used.

Default is false.

Input Format Class

Yes

Optional. Name of the InputFormatClass, which must be a subclass of FileInputFormat. Defaults to CombinePathTrackingInputFormat, which is a customized version of CombineTextInputFormat that records the file path each record was read from.

File System Properties

Yes

Optional. A JSON string representing a map of properties needed for the distributed file system. 

Ignore Non-Existing Folders

No

Optional. Whether to ignore the path when the directory or file does not exist. If set to true, a missing folder is treated as empty input and a warning is logged.

Default is false.

Time Table

Yes

Optional. Name of the table that keeps track of the last time files were read in. If this is null or empty, the Regex Path Filter is used to filter filenames.

Output Schema

No

Required. Schema of the table to read. This can be fetched by clicking the Get Schema button.
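
The previous-hour behavior described for the TimeFilter under Regex Path Filter can be illustrated with a short Python sketch. This is not the plugin's implementation. The function names `previous_hour_token` and `timefilter_matches` and the sample filenames are hypothetical, and the sketch assumes a four-digit year in filenames, as in the ‘2015-06-16-14’ example, with the Time Table field left blank.

```python
from datetime import datetime, timedelta


def previous_hour_token(now: datetime) -> str:
    """Return the year-month-day-hour token for the hour before `now`."""
    # Assumption: four-digit year, matching the '2015-06-16-14' example.
    return (now - timedelta(hours=1)).strftime("%Y-%m-%d-%H")


def timefilter_matches(filenames, now):
    """Keep only the filenames that contain the previous-hour token."""
    token = previous_hour_token(now)
    return [name for name in filenames if token in name]


# Hypothetical filenames that follow the date-time-tag naming pattern.
files = [
    "2015-06-16-13-59-59-app.log",
    "2015-06-16-14-05-00-app.log",
    "2015-06-16-15-00-00-app.log",
]

# Current time is June 16, 2015, 3 PM, so only the 2 PM file is selected.
selected = timefilter_matches(files, datetime(2015, 6, 16, 15, 0))
print(selected)
```

Subtracting a `timedelta` rather than decrementing the hour by hand keeps the token correct across midnight and month boundaries.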

Example

This example connects to Microsoft Azure Blob Storage and reads in the files found at the specified path. It uses the storage account mystorageaccount.blob.core.windows.net, where mystorageaccount is the account name:

Property

Value

Path

wasb://mycontainer@mystorageaccount.blob.core.windows.net/filename.txt

Account

mystorageaccount.blob.core.windows.net

Authentication Method

storageAccountKey

Azure Blob Store Storage Key

XXXXXEEESSS/YYYY=

Read files recursively

false

Ignore Non-Existing Folders

false
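
Before deploying a pipeline, it can help to check the rules stated in the Configuration table: the path prefix, the account suffix, and the fields that become required for each authentication method. The following Python sketch applies those rules to the example values above. The dictionary keys (`path`, `account`, `authenticationMethod`, `storageKey`, `sasToken`, `container`) and the `sasToken` method value are assumptions made for illustration; only `storageAccountKey` appears in this document, and the plugin's real property names may differ.

```python
def validate_config(config: dict) -> list:
    """Return a list of problems found in a source configuration."""
    errors = []

    # The path must start with wasb:// or wasbs://.
    if not config.get("path", "").startswith(("wasb://", "wasbs://")):
        errors.append("path must start with wasb:// or wasbs://")

    # The account must end with .blob.core.windows.net.
    if not config.get("account", "").endswith(".blob.core.windows.net"):
        errors.append("account must end with .blob.core.windows.net")

    # Storage Account Key is the documented default authentication method.
    method = config.get("authenticationMethod", "storageAccountKey")
    if method == "storageAccountKey":
        if not config.get("storageKey"):
            errors.append("storage key is required for Storage Account Key")
    elif method == "sasToken":  # assumed value name for SAS Token
        if not config.get("sasToken"):
            errors.append("SAS token is required for SAS Token")
        if not config.get("container"):
            errors.append("container is required for SAS Token")
    else:
        errors.append("unknown authentication method: " + str(method))

    return errors


# Values taken from the example table; the key names are hypothetical.
example = {
    "path": "wasb://mycontainer@mystorageaccount.blob.core.windows.net/filename.txt",
    "account": "mystorageaccount.blob.core.windows.net",
    "authenticationMethod": "storageAccountKey",
    "storageKey": "XXXXXEEESSS/YYYY=",
}

problems = validate_config(example)
print(problems)
```

Switching `authenticationMethod` to the SAS token method without supplying a token and a container would report both missing fields.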





Created in 2020 by Google Inc.