File Streaming Source
Plugin version: 2.11.0
Watches a directory and streams the contents of any new files added to it. Files must be atomically moved or renamed into the directory.
Use this source to read files from a directory in a streaming context. For example, you might want to read access logs as they are moved into an archive directory.
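The requirement that files be atomically moved or renamed can be met on the producer side by writing each file somewhere else first and renaming it into the monitored directory only once it is complete. The following is a minimal sketch for a local POSIX file system. The function name `publish_file` and the staging directory are hypothetical and not part of the plugin. On HDFS the equivalent step would be a rename through an HDFS client.

```python
import os
import tempfile


def publish_file(staging_dir: str, watched_dir: str, name: str, data: bytes) -> str:
    """Write data under staging_dir, then rename it into watched_dir.

    Both directories are assumed to live on the same file system, which is
    what makes the final rename atomic.
    """
    staged = os.path.join(staging_dir, name)
    with open(staged, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename

    final = os.path.join(watched_dir, name)
    # os.replace performs a rename; a reader of watched_dir sees either
    # no file or the complete file, never a partially written one.
    os.replace(staged, final)
    return final


if __name__ == "__main__":
    # Hypothetical layout: one temporary root holding both directories.
    root = tempfile.mkdtemp()
    staging = os.path.join(root, "staging")
    watched = os.path.join(root, "events")
    os.makedirs(staging)
    os.makedirs(watched)
    print(publish_file(staging, watched, "access.log", b"GET /index.html\n"))
```

Writing directly into the monitored directory is what this pattern avoids, since the source could otherwise pick up a file before the writer has finished.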
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Reference Name | No | Required. This will be used to uniquely identify this source for lineage, annotating metadata, etc. |
File Path | Yes | Required. The path of the directory to monitor. Files must be written to the monitored directory by “moving” them from another location within the same file system. File names starting with . are ignored. |
Ignore Threshold | Yes | Optional. Ignore files that are older than this many seconds. Default is 60. |
Extension Whitelist | Yes | Optional. Comma-separated list of file extensions to accept. If not specified, all files in the directory will be read. Otherwise, only files with an extension in this list will be read. |
Format | No | Required. Format of files in the directory. Supported formats are ‘text’, ‘csv’, ‘tsv’, ‘clf’, ‘grok’, and ‘syslog’. The default format is ‘text’. |
Output Schema | No | Required. Schema of files in the directory. |
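The file-selection rules in the table (dot-prefixed names ignored, an age threshold in seconds, an optional extension whitelist) can be illustrated with a small sketch. This is not the plugin's implementation. The function name `should_read` is hypothetical, and it assumes that a file's age is measured from its modification time and that extensions are compared without the leading dot.

```python
import os
from typing import Optional, Set


def should_read(path: str, modified_time: float, now: float,
                ignore_threshold: int = 60,
                extensions: Optional[Set[str]] = None) -> bool:
    """Return True if a file would be accepted under the documented rules."""
    name = os.path.basename(path)

    # File names starting with "." are ignored.
    if name.startswith("."):
        return False

    # Files older than the threshold (in seconds) are ignored.
    # Assumption: age is derived from the modification time.
    if now - modified_time > ignore_threshold:
        return False

    # With no whitelist, every remaining file is read.
    if not extensions:
        return True

    # Assumption: the whitelist holds extensions without the leading dot.
    ext = os.path.splitext(name)[1].lstrip(".")
    return ext in extensions
```

For example, with the default threshold of 60 seconds and a whitelist of `{"csv"}`, a file `events.csv` modified 10 seconds ago would be accepted, while `.events.csv` or `events.txt` would not.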
Example
This example reads from the ‘/data/events’ directory on an HDFS cluster whose namenode is running on namenode.example.com. The source monitors the directory for any new files that are added and parses them as comma-separated files with three columns: timestamp, user, and action.
Property | Value |
---|---|
Reference Name | |
File Path | |
Format | |
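To make the three-column layout in this example concrete, the sketch below parses comma-separated text into records with the fields timestamp, user, and action. It only illustrates the record shape and is not how the plugin parses files. The function name `parse_events` and the sample line are hypothetical, and all values are kept as strings because the column types are not stated here.

```python
import csv
import io

# Column names taken from the example description above.
FIELDS = ["timestamp", "user", "action"]


def parse_events(text: str) -> list:
    """Parse comma-separated text into one dict per non-empty line."""
    reader = csv.reader(io.StringIO(text))
    # Assumption: every column is kept as a string.
    return [dict(zip(FIELDS, row)) for row in reader if row]


if __name__ == "__main__":
    # Hypothetical sample line, not taken from real data.
    for record in parse_events("1500000000,alice,login\n"):
        print(record)
```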
Created in 2020 by Google Inc.