XML Parser Transformation

Plugin version: 2.11.0

The XML Parser Transform uses XPath to extract fields from a complex XML event. This plugin should generally be used in conjunction with the XML Reader Batch Source. The XML Reader will provide individual events to the XML Parser, which will be responsible for extracting fields from the events and mapping them to the output schema.

The transform takes an input record that contains XML events or records, parses it using the specified XPaths, and returns a structured record according to the specified schema. For example, this plugin can be used with the XML Reader Batch Source to extract values from XMLNews documents and create structured records that are easier to query.

Configuration

Property

Macro Enabled?

Description

Input field to parse as an XML record

Yes

Required. The field in the input record that is the source of the XML event or record.

XML encoding

Yes

Required. The source XML character set encoding.

Default is UTF-8.

XPath Mappings

No

Required. Mapping of field names to the XPaths of the XML record. A comma-separated list, each element of which is a field name, followed by a colon, followed by an XPath expression. XPath 1.0 is supported, and location paths can include predicates. Example: <field-name>:<XPath expression>.

Field Name Schema Type Mapping

No

Required. Mapping of field names in the output schema to data types. A comma-separated list, each element of which is a field name followed by a colon and a type, where the field names are the same as those used in xPathMappings, and the type is one of: boolean, int, long, float, double, bytes, or string. Example: <field-name>:<data-type>.

Error handling

No

Required. The action to take in case of an error.

  • Ignore error and continue

  • Exit on error: Stops processing upon encountering an error

  • Write to error dataset: Writes the error record to an error dataset and continues.

Fail on Array

No

Optional. Whether to fail when an XPath matches more than one element (an array). If false, the first element will be chosen.

Default is false.

Disallow Doctype DTD

No

Optional. Prevents processing of any DTDs while reading XML files. The plugin itself defaults this to false, but when the plugin is configured through the UI it is set to true. This guards against XXE (XML External Entity) vulnerabilities while reading the XML file.

Load external DTD

No

Optional. Enables loading of external DTDs while reading the XML file. Sets http://apache.org/xml/features/load-external-dtd

Default is Off.

Enable External Parameter Entities

No

Optional. Enables external parameter entities while reading the XML file. Sets http://xml.org/sax/features/external-parameter-entities

Default is Off.

Enable External General Entities

No

Optional. Enables external general entities while reading the XML file. Sets http://xml.org/sax/features/external-general-entities

Default is Off.

Output Schema

No

Required. The output schema for the data.
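The DTD and entity properties in the table above correspond to parser feature switches. The following is a minimal sketch of how such switches are commonly applied with the JDK's DocumentBuilderFactory; it is not the plugin's actual code. The class and method names (SecureXmlFactory, newFactory, parses) are hypothetical, and the load-external-dtd identifier used here is the Xerces "nonvalidating/load-external-dtd" feature, which is an assumption about what that property maps to.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical helper for illustration only; not part of the plugin.
class SecureXmlFactory {
  static final String DISALLOW_DOCTYPE =
      "http://apache.org/xml/features/disallow-doctype-decl";
  // Assumption: the Xerces feature name, which differs slightly from the URL in the table.
  static final String LOAD_EXTERNAL_DTD =
      "http://apache.org/xml/features/nonvalidating/load-external-dtd";
  static final String EXTERNAL_PARAMETER_ENTITIES =
      "http://xml.org/sax/features/external-parameter-entities";
  static final String EXTERNAL_GENERAL_ENTITIES =
      "http://xml.org/sax/features/external-general-entities";

  // Builds a factory with each switch set explicitly, mirroring the four properties.
  static DocumentBuilderFactory newFactory(boolean disallowDoctype, boolean loadExternalDtd,
      boolean externalParameterEntities, boolean externalGeneralEntities) {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    try {
      factory.setFeature(DISALLOW_DOCTYPE, disallowDoctype);
      factory.setFeature(LOAD_EXTERNAL_DTD, loadExternalDtd);
      factory.setFeature(EXTERNAL_PARAMETER_ENTITIES, externalParameterEntities);
      factory.setFeature(EXTERNAL_GENERAL_ENTITIES, externalGeneralEntities);
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException("XML parser does not support a requested feature", e);
    }
    return factory;
  }

  // Returns true if the document parses, false if the parser rejects it.
  static boolean parses(DocumentBuilderFactory factory, String xml) {
    try {
      factory.newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
      return true;
    } catch (Exception e) {
      return false;
    }
  }

  public static void main(String[] args) {
    DocumentBuilderFactory strict = newFactory(true, false, false, false);
    String withDoctype = "<!DOCTYPE a [<!ENTITY e \"x\">]><a>&e;</a>";
    System.out.println(parses(strict, "<a/>"));        // no DOCTYPE, so it is accepted
    System.out.println(parses(strict, withDoctype));   // DOCTYPE is rejected
  }
}
```

With the doctype switch enabled, any document that carries a DOCTYPE declaration is rejected before entities are resolved, which is why the UI enables it by default. The parser may also log the rejection to standard error.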

Example

This example parses an XML record received in the "body" field of the input record following the XPath Mappings for each field name. The output structured record will be created using the type specified for each field in the Field Name Schema Type Mapping. Only years and prices will be passed on for books with a price over 35.00:

Property

Value

Input field to parse as an XML record

body

XML encoding

UTF-8

XPath Mappings

category://book/@category, title://book/title, year:/bookstore/book[price>35.00]/year, price:/bookstore/book[price>35.00]/price, subcategory://book/subcategory

Field Name Schema Type Mapping

category:string, title:string, year:int, price:double, subcategory:string

Error handling

Ignore error and continue
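Both mapping properties use the same comma-separated <field-name>:<value> format. The sketch below shows one way such a string could be split into an ordered map; it is illustrative only and not the plugin's parser. The class name MappingParser is hypothetical. It splits each entry on the first colon so that the XPath part is kept intact, and it assumes the XPath itself contains no commas (for example, no multi-argument function calls).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the plugin.
class MappingParser {

  // Splits "name:value, name:value" into an insertion-ordered map.
  // Assumption: values contain no commas.
  static Map<String, String> parse(String mappings) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String entry : mappings.split(",")) {
      String trimmed = entry.trim();
      if (trimmed.isEmpty()) {
        continue; // tolerate a trailing comma
      }
      int colon = trimmed.indexOf(':'); // first colon separates name from value
      if (colon <= 0 || colon == trimmed.length() - 1) {
        throw new IllegalArgumentException("Expected <field-name>:<value> but got: " + trimmed);
      }
      result.put(trimmed.substring(0, colon).trim(), trimmed.substring(colon + 1).trim());
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> xpaths = parse(
        "category://book/@category, title://book/title, year:/bookstore/book[price>35.00]/year");
    Map<String, String> types = parse("category:string, title:string, year:int");
    // Pair each field's XPath with its declared type.
    for (String field : xpaths.keySet()) {
      System.out.println(field + " (" + types.get(field) + ") <- " + xpaths.get(field));
    }
  }
}
```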

For example, suppose the transform receives these input records:

offset

body

1

<bookstore><book category="cooking"><subcategory><type>Continental</type><genre>European cuisines</genre></subcategory><title lang="en">Everyday Italian</title><author>Giada DeLaurentiis</author><year>2005</year><price>30.00</price></book></bookstore>

2

<bookstore><book category="children"><subcategory><type>Series</type><genre>fantasy literature</genre></subcategory><title lang="en">Harry Potter</title><author>J. K. Rowling</author><year>2005</year><price>49.99</price></book></bookstore>

The output records will contain:

category

title

year

price

subcategory

cooking

Everyday Italian

null

null

<subcategory><type>Continental</type><genre>European cuisines</genre></subcategory>

children

Harry Potter

2005

49.99

<subcategory><type>Series</type><genre>fantasy literature</genre></subcategory>

Here, because the subcategory element contains child nodes, the plugin returns the complete subcategory node (along with its child elements) as a string, for example <subcategory><type>Continental</type><genre>European cuisines</genre></subcategory>. This ensures that the plugin returns a single XML event for a structured record instead of the two child events <type>Continental</type> and <genre>European cuisines</genre>.
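The behavior shown in this example can be approximated with the JDK's standard XPath API. The sketch below is an assumption about how the described results could be produced, not the plugin's implementation; the class name XPathExtractSketch and its methods are hypothetical. It returns null when nothing matches, the serialized node when the first match has child elements, and the text content otherwise. It always takes the first match, and it omits the DTD hardening discussed in the configuration section.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of the plugin.
class XPathExtractSketch {

  // Evaluates one XPath against one XML record and returns a string value or null.
  static String extract(String xml, String xpath) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      NodeList matches = (NodeList) XPathFactory.newInstance().newXPath()
          .evaluate(xpath, doc, XPathConstants.NODESET);
      if (matches.getLength() == 0) {
        return null; // e.g. the price predicate filtered the book out
      }
      Node first = matches.item(0); // first match only
      return hasChildElements(first) ? serialize(first) : first.getTextContent();
    } catch (Exception e) {
      throw new IllegalArgumentException("Could not evaluate XPath: " + xpath, e);
    }
  }

  // True when the node has at least one element child (not just text).
  static boolean hasChildElements(Node node) {
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
        return true;
      }
    }
    return false;
  }

  // Writes the node and its children back out as markup, without an XML declaration.
  static String serialize(Node node) throws Exception {
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    StringWriter out = new StringWriter();
    transformer.transform(new DOMSource(node), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) {
    String record = "<bookstore><book category=\"cooking\"><subcategory><type>Continental</type>"
        + "<genre>European cuisines</genre></subcategory><title lang=\"en\">Everyday Italian</title>"
        + "<author>Giada DeLaurentiis</author><year>2005</year><price>30.00</price></book></bookstore>";
    System.out.println(extract(record, "//book/@category"));
    System.out.println(extract(record, "/bookstore/book[price>35.00]/year"));
    System.out.println(extract(record, "//book/subcategory"));
  }
}
```

Run against the first input record, the three calls should mirror the category, year, and subcategory values in the first output row: the year is null because the price of 30.00 does not satisfy the predicate.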



Created in 2020 by Google Inc.