XML Parser Transformation
Plugin version: 2.11.0
The XML Parser Transform uses XPath to extract fields from a complex XML event. This plugin should generally be used in conjunction with the XML Reader Batch Source. The XML Reader will provide individual events to the XML Parser, which will be responsible for extracting fields from the events and mapping them to the output schema.
The transform takes an input record that contains XML events or records, parses them using the specified XPaths, and returns a structured record according to the specified schema. For example, this plugin can be used in conjunction with the XML Reader Batch Source to extract values from XMLNews documents and create structured records that are easier to query.
Configuration
Property | Macro Enabled? | Description |
---|---|---|
Input field to parse as an XML record | Yes | Required. The field in the input record that is the source of the XML event or record. |
XML encoding | Yes | Required. The source XML character set encoding. Default is UTF-8. |
XPath Mappings | No | Required. Mapping of the field names to the XPaths of the XML record. A comma-separated list, each element of which is a field name, followed by a colon, followed by an XPath expression. XPath 1.0 is supported, and location paths can include predicates. Example : |
Field Name Schema Type Mapping | No | Required. Mapping of field names in the output schema to data types. A comma-separated list, each element of which is a field name, followed by a colon, followed by a type. The field names must be the same as those used in the xPathMappings, and the type is one of: boolean, int, long, float, double, bytes, or string. Example : |
Error handling | No | Required. The action to take in case of an error. |
Fail on Array | No | Optional. Whether to fail when an XPath evaluates to an array (more than one matching node). If false, the first element will be chosen. Default is false. |
Disallow Doctype DTD | No | Optional. Prevents processing of any DTDs while reading XML files. This defaults to |
Load external DTD | No | Optional. Enables loading of external DTDs while reading XML files. Sets the http://apache.org/xml/features/load-external-dtd feature. Default is Off. |
Enable External Parameter Entities | No | Optional. Enables external parameter entities while reading XML files. Sets the http://xml.org/sax/features/external-parameter-entities feature. Default is Off. |
Enable External General Entities | No | Optional. Enables external general entities while reading XML files. Sets the http://xml.org/sax/features/external-general-entities feature. Default is Off. |
Output Schema | No | Required. The output schema for the data. |
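The DTD and entity options in the table correspond to parser features that can be set on a standard Java `DocumentBuilderFactory`. The following is a minimal sketch of that idea, not the plugin's actual code: the class and method names are hypothetical, and it assumes the JDK's built-in Xerces-based parser. It uses the feature URI `http://apache.org/xml/features/nonvalidating/load-external-dtd`, which is the Xerces name for external DTD loading and may differ from what the plugin sets internally.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

// Hypothetical helper, for illustration only; not part of the plugin.
final class SecureXmlFactorySketch {
    // Feature URIs as understood by the JDK's built-in parser (assumption).
    static final String DISALLOW_DOCTYPE =
        "http://apache.org/xml/features/disallow-doctype-decl";
    static final String LOAD_EXTERNAL_DTD =
        "http://apache.org/xml/features/nonvalidating/load-external-dtd";
    static final String EXTERNAL_PARAMETER_ENTITIES =
        "http://xml.org/sax/features/external-parameter-entities";
    static final String EXTERNAL_GENERAL_ENTITIES =
        "http://xml.org/sax/features/external-general-entities";

    // Builds a factory whose DTD and entity handling mirrors the four options.
    static DocumentBuilderFactory newFactory(boolean disallowDoctype,
                                             boolean loadExternalDtd,
                                             boolean externalParameterEntities,
                                             boolean externalGeneralEntities)
            throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setFeature(DISALLOW_DOCTYPE, disallowDoctype);
        factory.setFeature(LOAD_EXTERNAL_DTD, loadExternalDtd);
        factory.setFeature(EXTERNAL_PARAMETER_ENTITIES, externalParameterEntities);
        factory.setFeature(EXTERNAL_GENERAL_ENTITIES, externalGeneralEntities);
        return factory;
    }
}
```

With `disallowDoctype` set to true, parsing any document that contains a `<!DOCTYPE ...>` declaration fails with a `SAXException`, which blocks the common XML external entity (XXE) attack patterns at the earliest point.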
Example
This example parses an XML record received in the "body" field of the input record following the XPath Mappings for each field name. The output structured record will be created using the type specified for each field in the Field Name Schema Type Mapping. Only years and prices will be passed on for books with a price over 35.00:
Property | Value |
---|---|
Input field to parse as an XML record | body |
XML encoding | |
XPath Mappings | category://book/@category, title://book/title, year:/bookstore/book[price>35.00]/year, price:/bookstore/book[price>35.00]/price, subcategory://book/subcategory |
Field Name Schema Type Mapping | category:string, title:string, year:int, price:double, subcategory:string |
Error handling | |
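Both mapping properties use the same format: a comma-separated list of `name:value` elements. As a rough illustration of how such a string could be split, here is a minimal sketch. The class name is hypothetical and this is not the plugin's parser. It splits each element on its first colon only, so colons inside the XPath are kept. It also assumes the XPath itself contains no commas, so an expression such as `contains(a, 'b')` would break this naive split.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of the plugin.
final class MappingParserSketch {
    // Turns "name:value,name:value" into an ordered map of name -> value.
    static Map<String, String> parse(String spec) {
        Map<String, String> mappings = new LinkedHashMap<>();
        for (String element : spec.split(",")) {
            String trimmed = element.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing comma
            }
            // Split on the FIRST colon only; the rest belongs to the value.
            int colon = trimmed.indexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Expected name:value but got: " + trimmed);
            }
            mappings.put(trimmed.substring(0, colon).trim(),
                         trimmed.substring(colon + 1).trim());
        }
        return mappings;
    }

    public static void main(String[] args) {
        Map<String, String> xpaths = parse(
            "category://book/@category,year:/bookstore/book[price>35.00]/year");
        Map<String, String> types = parse("category:string,year:int");
        // Pair each field's XPath with its declared output type.
        xpaths.forEach((field, xpath) ->
            System.out.println(field + " (" + types.get(field) + ") <- " + xpath));
    }
}
```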
For example, suppose the transform receives these input records:
offset | body |
---|---|
1 | <bookstore><book category="cooking"><subcategory><type>Continental</type><genre>European cuisines</genre></subcategory><title lang="en">Everyday Italian</title><author>Giada DeLaurentiis</author><year>2005</year><price>30.00</price></book></bookstore> |
2 | <bookstore><book category="children"><subcategory><type>Series</type><genre>fantasy literature</genre></subcategory><title lang="en">Harry Potter</title><author>J. K. Rowling</author><year>2005</year><price>49.99</price></book></bookstore> |
The output records will contain:
category | title | year | price | subcategory |
---|---|---|---|---|
cooking | Everyday Italian | null | null | <subcategory><type>Continental</type><genre>European cuisines</genre></subcategory> |
children | Harry Potter | 2005 | 49.99 | <subcategory><type>Series</type><genre>fantasy literature</genre></subcategory> |
Here, since the subcategory element contains child nodes, the plugin returns the complete subcategory node (along with its child elements) as a string: <subcategory><type>Continental</type><genre>European cuisines</genre></subcategory>. This ensures that the plugin returns a single XML event for a structured record instead of the two child events, <type>Continental</type> and <genre>European cuisines</genre>.
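The behavior in this example can be approximated with the standard `javax.xml.xpath` API, which implements XPath 1.0. The sketch below is an assumption about how such extraction might look, not the plugin's implementation, and the class and method names are hypothetical. It returns null when an XPath matches nothing (as for year and price in the first record), the serialized element when the match has child elements, and the text value otherwise.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only; not part of the plugin.
final class XPathExtractSketch {
    // Evaluates one XPath against one XML record and returns a string or null.
    static String extract(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        // NODE yields the first match in document order, or null if none.
        Node node = (Node) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODE);
        if (node == null) {
            return null;
        }
        return hasChildElements(node) ? serialize(node) : node.getTextContent();
    }

    static boolean hasChildElements(Node node) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                return true;
            }
        }
        return false;
    }

    // Writes the node and its children back out as XML text.
    static String serialize(Node node) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(node), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String record = "<bookstore><book category=\"cooking\"><subcategory>"
            + "<type>Continental</type><genre>European cuisines</genre></subcategory>"
            + "<title lang=\"en\">Everyday Italian</title><year>2005</year>"
            + "<price>30.00</price></book></bookstore>";
        System.out.println(extract(record, "//book/@category"));
        System.out.println(extract(record, "/bookstore/book[price>35.00]/year"));
        System.out.println(extract(record, "//book/subcategory"));
    }
}
```

Converting the extracted strings to the types named in the Field Name Schema Type Mapping (for example `Integer.parseInt` for `int`) would be a separate step after extraction.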
Created in 2020 by Google Inc.