XML Reader Batch Source

Plugin version: 2.11.0

The XML Reader plugin is a source plugin that allows users to read XML files stored on HDFS.

A typical use case is reading XML files that have been dropped into HDFS. These files can range from small to very large. The XML Reader reads and parses the files, and when it is used in conjunction with the XMLParser plugin, individual fields can be extracted. This reader emits one XML event, specified by the Node Path property, for each file read.

Configuration

Property

Macro Enabled?

Description

Reference Name

No

Required. This will be used to uniquely identify this source for lineage, annotating metadata, etc.

Path

Yes

Required. Path to the file(s) to be read. If a directory is specified, terminate the path name with a '/'. The path supports glob syntax as described in the Java Documentation.

Node Path

Yes

Required. Node path (XPath) to emit as an individual event from the XML schema. Example: '/book/price' to read only the price from under the book node. For more information about XPaths, see the Java Documentation.

Action After Processing File

No

Required. Action to be taken after processing of the XML file. Possible actions are: (DELETE) delete from HDFS; (ARCHIVE) archive to the target location; and (MOVE) move to the target location.

Default is None.

Reprocessing Required

No

Required. Specifies whether the files should be reprocessed. If set to No, the files are tracked and will not be processed again on future runs of the pipeline.

Default is Yes.

Temporary Folder

Yes

Required. An existing folder path with read and write access for the current user. This is required for storing temporary files containing paths of the processed XML files. These temporary files will be read at the end of the job to update the file track table.

Default is /tmp.

File Pattern

Yes

Optional. The regular expression pattern used to select specific files. This should be used in cases when the glob syntax in the Path is not precise enough. See examples in the “Usage Notes” below.

Target Folder

Yes

Optional. Target folder path, used when Action After Processing File is set to ARCHIVE or MOVE. The target folder must be an existing directory.

Enable processing external entities

Yes

Optional. Enables processing of external entities while reading an XML file. Note: Enable external entities only if necessary, because doing so poses a security risk of malicious code execution. For background, read about the XML External Entity (XXE) processing vulnerability.

Default is Off.

Enable XML parser to support DTDs

No

Optional. Enables DTD support while processing an XML file. This property must be set to false if external entities need to be evaluated.

Default is Off.

Output Schema

No

Required. The output schema for the data.
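
The two parser-related properties above default to Off because external entities and DTDs are the usual route for XXE attacks. As a rough illustration of what "Off" means at the parser level, the following sketch hardens a standard JAXP `DocumentBuilderFactory`. It is not the plugin's own code: the class `HardenedXmlSketch` and its helper methods are hypothetical names, and the mapping from the plugin properties to these JAXP features is an assumption made for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;

// Hypothetical illustration; not part of the XML Reader plugin.
final class HardenedXmlSketch {

  // Builds a JAXP factory that rejects DOCTYPE declarations and never
  // resolves external entities (assumed equivalent of both properties "Off").
  static DocumentBuilderFactory hardenedFactory() throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
    factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
    factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
    factory.setXIncludeAware(false);
    factory.setExpandEntityReferences(false);
    return factory;
  }

  // Returns true if the document parses under the hardened settings,
  // false if the parser rejects it for any reason.
  static boolean parses(String xml) {
    try {
      hardenedFactory()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return true;
    } catch (Exception e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String plain = "<catalog><book id=\"bk104\"/></catalog>";
    String withDoctype =
        "<!DOCTYPE catalog [<!ENTITY x SYSTEM \"file:///etc/passwd\">]>"
            + "<catalog>&x;</catalog>";
    System.out.println("plain document accepted: " + parses(plain));
    System.out.println("DOCTYPE document accepted: " + parses(withDoctype));
  }
}
```

With these settings a document that declares a DOCTYPE is rejected outright, so an external entity can never be resolved. Relaxing either setting should be a deliberate choice for trusted input only.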

Usage Notes

When specifying a regular expression for filtering files, you must use glob syntax in the folder path. This usually means ending the path with '/*'.

Here are some regular expression pattern examples:

  1. Use '^' to select files with names starting with 'catalog', such as '^catalog'.

  2. Use '$' to select files with names ending with 'catalog.xml', such as 'catalog.xml$'.

  3. Use '.*' to select files with a name that contains 'catalogBook', such as 'catalogBook.*'.
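
To see how a glob in the Path and a regular expression in the File Pattern narrow the selection together, the following sketch applies both to a list of candidate paths. It is only an approximation: `FilePatternSketch` and `select` are hypothetical names, it uses `java.nio` glob matching on a Unix-style file system rather than HDFS, and it assumes the regular expression is applied to the file name with "find" semantics, which the plugin may or may not do.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the XML Reader plugin.
final class FilePatternSketch {

  // Returns the names of the files whose full path matches the glob and
  // whose file name contains a match for the regular expression.
  static List<String> select(String glob, String regex, List<String> paths) {
    PathMatcher globMatcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
    Pattern pattern = Pattern.compile(regex);
    List<String> selected = new ArrayList<>();
    for (String candidate : paths) {
      Path path = Paths.get(candidate);
      String name = path.getFileName().toString();
      // Assumption: the regex is tested against the file name only.
      if (globMatcher.matches(path) && pattern.matcher(name).find()) {
        selected.add(name);
      }
    }
    return selected;
  }

  public static void main(String[] args) {
    List<String> paths = Arrays.asList(
        "/cdap/source/xmls/catalog.xml",
        "/cdap/source/xmls/catalogBook1.xml",
        "/cdap/source/xmls/orders.xml",
        "/cdap/source/xmls/old/catalog.xml");
    String glob = "/cdap/source/xmls/*";
    System.out.println(select(glob, "^catalog", paths));
    System.out.println(select(glob, "catalog.xml$", paths));
    System.out.println(select(glob, "catalogBook.*", paths));
  }
}
```

In this sketch the glob `*` does not cross directory boundaries, so the file under `old/` is excluded before the regular expression is ever tested.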

Example

This example reads data from the folder hdfs:/cdap/source/xmls/ and emits XML records on the basis of the node path /catalog/book/title. It will generate structured records with the fields offset, fileName, and record. It will move the XML files to the target folder hdfs:/cdap/target/xmls/ and update the processed file information in the table named trackingTable.

Property

Value

Reference Name

referenceName

Path

hdfs:/cdap/source/xmls/*

Node Path

/catalog/book/title

Action After Processing File

Move

Reprocessing Required

No

Temporary Folder

hdfs:/cdap/target/xmls/

File Pattern

^catalog.*

Target Folder

hdfs:/cdap/target/xmls/

For this XML as an input:

<catalog>
  <book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price><base>5.95</base><tax><surcharge>13.00</surcharge><excise>13.00</excise></tax></price>
    <publish_date>2001-03-10</publish_date>
    <description><name><name>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</name></name></description>
  </book>
  <book id="bk105">
    <author>Corets, Eva</author>
    <title>The Sundered Grail</title>
    <genre>Fantasy</genre>
    <price><base>5.95</base><tax><surcharge>14.00</surcharge><excise>14.00</excise></tax></price>
    <publish_date>2001-09-10</publish_date>
    <description><name>The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.</name></description>
  </book>
</catalog>

The output records will be:

offset

filename

record

2

hdfs:/cdap/source/xmls/catalog.xml

<title>Oberon's Legacy</title>

13

hdfs:/cdap/source/xmls/catalog.xml

<title>The Sundered Grail</title>
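
The node selection in this example can be reproduced outside a pipeline with the standard `javax.xml.xpath` API. The sketch below is an assumption-laden illustration, not the plugin's implementation: `NodePathSketch` and `textAt` are hypothetical names, the whole document is loaded into memory instead of being streamed, and the method returns only the text of each matching node rather than the serialized element shown in the record column.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; not part of the XML Reader plugin.
final class NodePathSketch {

  // Evaluates the node path against the XML text and returns the text
  // content of every matching node, in document order.
  static List<String> textAt(String xml, String nodePath) {
    List<String> values = new ArrayList<>();
    try {
      XPath xpath = XPathFactory.newInstance().newXPath();
      NodeList nodes = (NodeList) xpath.evaluate(
          nodePath, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
      for (int i = 0; i < nodes.getLength(); i++) {
        values.add(nodes.item(i).getTextContent());
      }
    } catch (XPathExpressionException e) {
      throw new IllegalArgumentException("Cannot evaluate " + nodePath, e);
    }
    return values;
  }

  public static void main(String[] args) {
    // Abbreviated version of the sample input above.
    String xml =
        "<catalog>"
            + "<book id=\"bk104\"><title>Oberon's Legacy</title></book>"
            + "<book id=\"bk105\"><title>The Sundered Grail</title></book>"
            + "</catalog>";
    for (String title : textAt(xml, "/catalog/book/title")) {
      System.out.println(title);
    }
  }
}
```

Each value corresponds to one output record in the table above, one per `title` element matched by the node path.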



Created in 2020 by Google Inc.