File Streaming Source

Plugin version: 2.11.0

Watches a directory and streams the contents of any new files added to it. Files must be atomically moved or renamed into the directory.

Use this source to read files from a directory in a streaming context. For example, you might want to read access logs as they are moved into an archive directory.
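The atomic-move requirement above can be met on the producer side by writing each file in full to a staging location and only then renaming it into the monitored directory. The following is a minimal sketch for a local file system, not part of the plugin itself. The function name `publish_atomically` and the `staging_dir` parameter are hypothetical, and the sketch assumes the staging and monitored directories are on the same file system, since a rename is only atomic in that case.

```python
import os
import tempfile


def publish_atomically(data: bytes, staging_dir: str, watched_dir: str, file_name: str) -> str:
    """Write data to a staging file, then rename it into the monitored directory.

    Assumption: staging_dir and watched_dir are on the same file system,
    so os.replace performs a single atomic rename.
    """
    # Create the file outside the monitored directory so that the source
    # never observes a partially written file.
    fd, tmp_path = tempfile.mkstemp(suffix=".tmp", dir=staging_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the rename
        final_path = os.path.join(watched_dir, file_name)
        os.replace(tmp_path, final_path)  # the file appears in one step
    except BaseException:
        # Clean up the staging file if the write or the rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path
```

If the two directories are on different file systems, `os.replace` raises `OSError` instead of silently falling back to a copy, which makes a misconfigured staging location visible early.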

Configuration

| Property | Macro Enabled? | Description |
| --- | --- | --- |
| Reference Name | No | Required. This will be used to uniquely identify this source for lineage, annotating metadata, etc. |
| File Path | Yes | Required. The path of the directory to monitor. Files must be written to the monitored directory by “moving” them from another location within the same file system. File names starting with . are ignored. |
| Ignore Threshold | Yes | Optional. Ignore files that are older than this many seconds. Default is 60. |
| Extension Whitelist | Yes | Optional. Comma-separated list of file extensions to accept. If not specified, all files in the directory will be read. Otherwise, only files with an extension in this list will be read. |
| Format | No | Required. Format of files in the directory. Supported formats are ‘text’, ‘csv’, ‘tsv’, ‘clf’, ‘grok’, and ‘syslog’. The default format is ‘text’. |
| Output Schema | No | Required. Schema of files in the directory. |
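The three filtering rules described above (hidden files, Ignore Threshold, Extension Whitelist) can be combined into a single check. The sketch below illustrates how they might interact and is not the plugin's actual implementation. The function name `would_be_read` is hypothetical. It assumes that file age is measured from the modification time, that the extension is the text after the last dot, and that matching is case-sensitive. None of these details are confirmed by this page.

```python
import os
from typing import Optional


def would_be_read(
    file_name: str,
    modified_at: float,
    now: float,
    ignore_threshold: int = 60,
    extension_whitelist: Optional[str] = None,
) -> bool:
    """Return True if a file would pass the filters described in the table.

    modified_at and now are timestamps in seconds (assumption: age is
    computed from the file's modification time).
    """
    # File names starting with "." are ignored.
    if file_name.startswith("."):
        return False

    # With a whitelist, only files with a listed extension are accepted.
    if extension_whitelist:
        allowed = {
            ext.strip().lstrip(".")
            for ext in extension_whitelist.split(",")
            if ext.strip()
        }
        extension = os.path.splitext(file_name)[1].lstrip(".")
        if extension not in allowed:
            return False

    # Files older than the threshold (in seconds) are ignored.
    return (now - modified_at) <= ignore_threshold
```

For example, with the default threshold, `would_be_read("events.csv", modified_at=100.0, now=130.0)` accepts the file, while the same call with `now=200.0` rejects it as too old.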

Example

This example reads from the ‘/data/events’ directory on an HDFS cluster whose namenode is running on namenode.example.com. The source monitors the directory for new files and parses them as comma-separated files with three columns: timestamp, user, and action.

| Property | Value |
| --- | --- |
| Reference Name | fileRT |
| File Path | hdfs://namenode.example.com/data/events |
| Format | csv |
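To make the three-column layout concrete, the sketch below parses comma-separated lines into records with the fields timestamp, user, and action. It only illustrates the shape of the data and is not how the plugin parses files. The function name `parse_events` and the sample lines are made up for this example. All values are kept as strings here, whereas the plugin would use the types declared in the Output Schema.

```python
import csv
import io

# Column names taken from the example description above.
FIELDS = ["timestamp", "user", "action"]


def parse_events(text: str) -> list:
    """Parse comma-separated lines into one dict per line."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS)
    return [dict(row) for row in reader]


# Hypothetical file contents, invented for illustration.
sample = "1578000000,alice,login\n1578000005,bob,logout\n"

for record in parse_events(sample):
    print(record["timestamp"], record["user"], record["action"])
```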



Created in 2020 by Google Inc.