The HTTP Batch source plugin is available in the Hub.

Plugin version: 1.4.0

This plugin reads data from HTTP/HTTPS pages. Paginated APIs are supported. Data in JSON, XML, CSV, TSV, TEXT and BLOB formats is supported.

Configuration

General Properties

Property	Macro Enabled?	Description
Reference Name	No	Required. Name used to uniquely identify this source for lineage, annotating metadata, etc.
URL	Yes	Required. URL of the first page to fetch. The URL must start with a protocol (e.g. http://).
HTTP Method	Yes	Required. HTTP request method.
Headers	Yes	Optional. Headers to send with each HTTP request.
Request Body	Yes	Optional. Body to send with each HTTP request.

Format Properties

Property	Macro Enabled?	Description
Format	Yes	Required. Format of the HTTP response. This determines how the response is converted into output records. Possible values are: JSON - retrieves all records from the given JSON path and transforms them into records according to the mapping; XML - retrieves all records from the given XPath and transforms them into records according to the mapping; TSV - tab-separated values, with columns mapped to record fields in the order they are listed in the schema; CSV - comma-separated values, with columns mapped to record fields in the order they are listed in the schema; Text - transforms a single line of text into a single record with a string field `body` containing the result; BLOB - transforms the entire response into a single record with a byte array field `body` containing the result. Default is JSON.
JSON/XML Result Path	Yes	Path to the results. When the format is XML, this is an XPath. When the format is JSON, this is a JSON path. See “JSON/XML Result Path Examples” below.
JSON/XML Fields Mapping	Yes	Optional. Mapping of fields in a record to fields in a retrieved element. The left column contains the name of the schema field. The right column contains the path to that field, relative to the element. It can be either an XPath or a JSON path. See “JSON/XML Fields Mapping Example” below.
CSV Skip First Row	Yes	Required. Whether to skip the first row of the HTTP response. This is usually set if the first row is a header row.

JSON/XML Result Path Examples

JSON path example:

{
     "errors": [],
     "response": {
       "books": [
         {
           "id": "1159142",
           "title": "Agile Web Development with Rails",
           "author": "Sam Ruby, Dave Thomas, David Heinemeier Hansson",
           "printInfo": {
             "page": 488,
             "coverType": "hard",
             "publisher": "Pragmatic Bookshelf"
           }
         },
         {
           "id": "2375753",
           "title": "Flask Web Development",
           "author": "Miguel Grinberg",
           "printInfo": {
             "page": 543,
             "coverType": "hard",
             "publisher": "O'Reilly Media, Inc"
           }
         },
         {
           "id": "547307",
           "title": "Alex Homer, ASP.NET 2.0 Visual Web Developer 2005",
           "author": "David Sussman",
           "printInfo": {
             "page": 543,
             "coverType": "hard",
             "publisher": "unknown"
           }
         }
       ]
     }
}

JSON path to fetch books is /response/books. However, if we need to fetch only printInfo we can specify /response/books/printInfo as well.
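
The result path can be read as a walk down the parsed response. The following is a minimal sketch in Python, not the plugin's implementation: `select` is a hypothetical helper, and the way it applies a path step to every element of an array is an assumption about how /response/books/printInfo yields one result per book.

```python
import json

def select(node, path):
    """Follow a slash-separated path; arrays are flattened along the way."""
    nodes = [node]
    for key in filter(None, path.split("/")):
        next_nodes = []
        for current in nodes:
            # Assumption: a path step applied to an array applies to each element.
            items = current if isinstance(current, list) else [current]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_nodes.append(item[key])
        nodes = next_nodes
    # Flatten a trailing array so that each element becomes one result.
    records = []
    for current in nodes:
        records.extend(current if isinstance(current, list) else [current])
    return records

# Trimmed copy of the example response.
response = json.loads("""
{
  "errors": [],
  "response": {
    "books": [
      {"id": "1159142", "printInfo": {"page": 488, "coverType": "hard"}},
      {"id": "2375753", "printInfo": {"page": 543, "coverType": "hard"}}
    ]
  }
}
""")

books = select(response, "/response/books")
print_info = select(response, "/response/books/printInfo")
print(len(books), [info["page"] for info in print_info])
```

With this trimmed input, the first path returns the two book objects and the second returns only their printInfo objects.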

XPath example:

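<bookstores>
    <bookstore>
        <book category="cooking">
            <author>Giada De Laurentiis</author>
            <year>2005</year>
            <price>
                <value>15.0</value>
                <policy>Discount up to 50%</policy>
            </price>
        </book>
        <book category="web">
            <author>James McGovern</author>
            <author>Per Bothner</author>
            <year>2003</year>
            <price>
                <value>49.99</value>
                <policy>No discount</policy>
            </price>
        </book>
        ...
    </bookstore>
    <bookstore>
        ...
    </bookstore>
</bookstores>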

XPath to fetch all books is /bookstores/bookstore/book. However, a more precise selection can be made, for example /bookstores/bookstore/book[@category='web'].
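
The two XPath expressions can be tried out with Python's standard library. This is only an illustrative sketch: the sample document and its child element names (`author`, `year`) are hypothetical, and `xml.etree.ElementTree` supports a limited XPath subset that accepts only paths relative to an element, so the leading /bookstores step is expressed by starting from the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with a bookstores/bookstore/book layout.
SAMPLE = """
<bookstores>
  <bookstore>
    <book category="cooking">
      <author>Giada De Laurentiis</author>
      <year>2005</year>
    </book>
    <book category="web">
      <author>James McGovern</author>
      <author>Per Bothner</author>
      <year>2003</year>
    </book>
  </bookstore>
</bookstores>
"""

root = ET.fromstring(SAMPLE)  # root is the <bookstores> element

# Equivalent of /bookstores/bookstore/book
all_books = root.findall("./bookstore/book")

# Equivalent of /bookstores/bookstore/book[@category='web']
web_books = root.findall("./bookstore/book[@category='web']")

web_authors = [a.text for book in web_books for a in book.findall("author")]
print(len(all_books), web_authors)
```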

JSON/XML Fields Mapping Example

Example response:

{
   "startAt":1,
   "maxResults":5,
   "total":15599,
   "issues":[
      {
         "id":"20276",
         "key":"NETTY-14",
         "fields":{
            "issuetype":{
               "name":"Bug",
               "subtask":false
            },
            "fixVersions":[
               "4.1.37"
            ],
            "description":"Test description for NETTY-14",
            "project":{
               "id":"10301",
               "key":"NETTY",
               "name":"Netty-HTTP",
               "projectCategory":{
                  "id":"10002",
                  "name":"Infrastructure"
               }
            }
         }
      },
      {
         "id":"19124",
         "key":"NETTY-13",
         "fields":{
            "issuetype":{
               "self":"https://issues.cask.co/rest/api/2/issuetype/4",
               "name":"Improvement",
               "subtask":false
            },
            "fixVersions":[

            ],
            "description":"Test description for NETTY-13",
            "project":{
               "id":"10301",
               "key":"NETTY",
               "name":"Netty-HTTP",
               "projectCategory":{
                  "id":"10002",
                  "name":"Infrastructure"
               }
            }
         }
      }
   ]
}

Assume the result path is /issues.

The mapping is:

Field Name	Field Path
type	/fields/issuetype/name
description	/fields/description
projectCategory	/fields/project/projectCategory/name
isSubtask	/fields/issuetype/subtask
fixVersions	/fields/fixVersions

The result records are:

key	type	isSubtask	description	projectCategory	fixVersions
NETTY-14	Bug	false	Test description for NETTY-14	Infrastructure	["4.1.37"]
NETTY-13	Improvement	false	Test description for NETTY-13	Infrastructure	[]

Note: The field key was mapped without being included in the mapping. Mapping entries like key: /key can be omitted as long as the field is present in the schema.
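
The mapping can be pictured as one path lookup per schema field. The sketch below is an illustration under assumptions, not the plugin's code: `resolve` and `to_record` are hypothetical helpers, and the fallback to `/<field name>` for unmapped fields mirrors the note about key.

```python
def resolve(element, path):
    """Walk a slash-separated path inside one element; None if a step is missing."""
    current = element
    for key in filter(None, path.split("/")):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def to_record(element, schema_fields, mapping):
    # Fields without a mapping entry fall back to "/<field name>", like key.
    return {name: resolve(element, mapping.get(name, "/" + name))
            for name in schema_fields}

mapping = {
    "type": "/fields/issuetype/name",
    "description": "/fields/description",
    "projectCategory": "/fields/project/projectCategory/name",
    "isSubtask": "/fields/issuetype/subtask",
    "fixVersions": "/fields/fixVersions",
}
schema_fields = ["key", "type", "isSubtask", "description",
                 "projectCategory", "fixVersions"]

# First element under /issues from the example response, trimmed to the mapped fields.
issue = {
    "id": "20276",
    "key": "NETTY-14",
    "fields": {
        "issuetype": {"name": "Bug", "subtask": False},
        "fixVersions": ["4.1.37"],
        "description": "Test description for NETTY-14",
        "project": {"projectCategory": {"id": "10002", "name": "Infrastructure"}},
    },
}

record = to_record(issue, schema_fields, mapping)
print(record)
```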

Authentication

OAuth2 Properties

Property	Macro Enabled?	Description
OAuth2 Enabled	No	Optional. If set to True, the plugin will perform OAuth2 authentication. Default is False.
Auth URL	Yes	Optional. Endpoint for the authorization server used to retrieve the authorization code.
Token URL	Yes	Optional. Endpoint for the resource server, which exchanges the authorization code for an access token.
Client ID	Yes	Optional. Client identifier obtained during the Application registration process.
Client Secret	Yes	Optional. Client secret obtained during the Application registration process.
Scopes	Yes	Optional. Scope of the access request, which might have multiple space-separated values.
Refresh Token	Yes	Optional. Token used to obtain the access token, which is the end product of the OAuth2 flow.

Service Account Properties

Property	Macro Enabled?	Description
Service Account Type	Yes	Optional. Select one of the following options: File Path - file path where the service account is located; JSON - JSON content of the service account.
Service Account File Path	Yes	Optional. Path on the local file system of the service account key used for authorization. Can be set to ‘auto-detect’ when running on a Dataproc cluster. When running on other clusters, the file must be present on every node in the cluster.
Service Account JSON	Yes	Optional. Contents of the service account JSON file.
Service Account Scope	Yes	Optional. Additional Google credential scopes required to access the entered URL. The cloud-platform scope is included by default. See https://developers.google.com/identity/protocols/oauth2/scopes for more information. Scope example:

Basic Authentication Properties

Property	Macro Enabled?	Description
Username	Yes	Optional. Username for basic authentication.
Password	Yes	Optional. Password for basic authentication.
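
For reference, a username and password are turned into a request header as follows under standard HTTP Basic authentication (RFC 7617). This sketch assumes the plugin follows that standard scheme; `basic_auth_header` is a hypothetical helper, and the credentials are the example pair from the RFC.

```python
import base64

def basic_auth_header(username, password):
    """Build the Authorization header for HTTP Basic authentication."""
    # The credentials are joined with a colon and Base64-encoded.
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": "Basic " + token}

print(basic_auth_header("Aladdin", "open sesame"))
```

Base64 is an encoding, not encryption, so Basic authentication should only be used over HTTPS.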

HTTP Proxy Properties

Property	Macro Enabled?	Description
Proxy URL	Yes	Optional. Proxy URL. Must contain a protocol, address, and port.
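
A proxy URL can be checked for the three required parts before it is entered. The sketch below is a hypothetical pre-check, not the plugin's validation logic; `missing_proxy_parts` and the example host are assumptions.

```python
from urllib.parse import urlsplit

def missing_proxy_parts(proxy_url):
    """Return which of protocol, address, and port are absent from the URL."""
    parts = urlsplit(proxy_url)
    try:
        port = parts.port
    except ValueError:
        # Raised when the port is not a valid number.
        port = None
    missing = []
    if not parts.scheme:
        missing.append("protocol")
    if not parts.hostname:
        missing.append("address")
    if port is None:
        missing.append("port")
    return missing

print(missing_proxy_parts("http://proxy.example.com:3128"))
print(missing_proxy_parts("http://proxy.example.com"))
```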

HTTP Error Handling Properties

Property	Macro Enabled?	Description
HTTP Errors Handling	No	Optional. Defines the error handling strategy to use for certain HTTP response codes. The left column contains a regular expression for the HTTP status code. The right column contains the action taken when it matches. If the HTTP status code matches multiple regular expressions, the first one specified in the mapping is used. See “HTTP Error Handling Example” below.
Non-HTTP Error Handling	No	Required. Error handling strategy to use when the HTTP response cannot be transformed to an output record. Possible values are: Stop on error - Fails pipeline due to erroneous record. Send to error - Sends erroneous record’s text to error port and continues. Skip on error - Ignores erroneous records. Default is `Stop on error`.
Retry Policy	No	Required. Policy used to calculate delay between retries. Default is `Exponential`.
Linear Retry Interval	Yes	Optional. Interval between retries. Used only if the retry policy is “Linear”. Default is 30.
Max Retry Duration	Yes	Optional. Maximum time in seconds retries can take. Default is 600.
Connect Timeout	Yes	Optional. Maximum time in seconds connection initialization is allowed to take. Default is 120.
Read Timeout	Yes	Optional. Maximum time in seconds fetching data from the server is allowed to take. Default is 120.
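
The interaction between the retry policy and Max Retry Duration can be sketched as a schedule of waits. This is an illustration only: the starting delay of 1 second and the doubling factor for the exponential policy are assumptions, as is treating the linear interval as seconds, and `retry_delays` is a hypothetical helper.

```python
def retry_delays(policy, max_retry_duration=600, linear_interval=30, base_delay=1):
    """Return the waits (in seconds) scheduled before retries stop.

    base_delay and the doubling factor are assumptions for illustration.
    """
    delays, elapsed, attempt = [], 0, 0
    while True:
        if policy == "linear":
            delay = linear_interval
        elif policy == "exponential":
            delay = base_delay * 2 ** attempt
        else:
            raise ValueError("unknown retry policy: " + policy)
        # Stop once the next wait would exceed the total retry budget.
        if elapsed + delay > max_retry_duration:
            return delays
        delays.append(delay)
        elapsed += delay
        attempt += 1

print(retry_delays("linear", max_retry_duration=100))
print(retry_delays("exponential", max_retry_duration=100))
```

Under these assumptions, a linear policy retries at a steady pace, while an exponential policy retries quickly at first and then backs off.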

HTTP Error Handling Example

HTTP Code Regexp	Error Handling
2..	Success
401	Retry and fail
4..	Fail
5..	Retry and send to error
.*	Fail

Note: Pagination types “Link in response header”, “Link in response body”, “Token in response body” do not support “Send to error”, “Skip”, “Retry and send to error”, “Retry and skip” options.
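
The first-match rule matters in this table because 401 also matches 4.. and every code matches the final catch-all. A minimal sketch of the lookup, assuming each expression must match the whole status code (`action_for` is a hypothetical helper, not a plugin function):

```python
import re

# Rules in the order they appear in the example table.
RULES = [
    ("2..", "Success"),
    ("401", "Retry and fail"),
    ("4..", "Fail"),
    ("5..", "Retry and send to error"),
    (".*", "Fail"),
]

def action_for(status_code, rules=RULES):
    """Return the action of the first rule whose regex matches the status code."""
    code = str(status_code)
    for pattern, action in rules:
        # Assumption: the expression has to match the entire status code.
        if re.fullmatch(pattern, code):
            return action
    return None

for code in (200, 401, 404, 503, 302):
    print(code, action_for(code))
```

Moving the 401 row below the 4.. row would make it unreachable, because 4.. would always match first.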

Pagination Properties

Property	Macro Enabled?	Description
Pagination Type	No	Optional. Strategy used to determine how to get the next page. Possible values are: None, Link in response header, Link in response body, Token in response body, Increment an index, Custom. Depending on the pagination type you select, other properties must also be set. Default is None.
Pagination Type: None	No	Only single page is loaded.
Wait Time Between Pages (milliseconds)	Yes	Optional. Time in milliseconds to wait between HTTP requests for the next page. Default is 0.
Pagination Type: Link in response header	No	The response contains a “Link” header, which contains a URL marked as “next”. Example:
Pagination Type: Link in response body	No	Every page contains the URL of the next page. This pagination type is only supported for JSON and XML formats. Pagination continues until no next page field is present or until a page contains no elements.
Next Page JSON/XML Field Path	Yes	A JSON path or an XPath to a field which contains the next page URL. It can be either a relative or an absolute URL. Example page response: Next page field path is `_links/next`.
Pagination Type: Token in response body	No	Every page contains a token, which is appended as a URL parameter to obtain the next page. This pagination type is only supported for JSON and XML formats. Pagination continues until no next page token is present on the page or until a page contains no elements.
Next Page Token Path	Yes	A JSON path or an XPath to a field which contains the next page token.
Next Page Url Parameter	Yes	A parameter which is appended to the URL in order to specify the next page token. Example plugin config: First page response: The next page fetched by the plugin will be the URL with `&pageToken=CAEQAA` appended.
Pagination Type: Increment an index	No	Pagination by incrementing a {pagination.index} placeholder value in the URL. For this pagination type, the URL must contain this placeholder.
Start Index	Yes	Start value of the {pagination.index} placeholder.
Max Index	Yes	Maximum value of the {pagination.index} placeholder. If empty, pagination continues until a page with no elements is returned.
Index Increment	Yes	A value which the {pagination.index} placeholder is incremented by. Increment can be negative.
Pagination Type: Custom	No	Pagination using user-provided code. The code decides how to retrieve the next page URL based on the previous page's contents and headers, and when to finish pagination.
Custom Pagination Python Code	No	Code which retrieves the next page URL based on the previous page's contents and headers. Example code: The above code iterates over the first five pages of searchcode.com results. When ‘None’ is returned, the iteration stops.
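
Three of the pagination types above reduce to small URL manipulations, sketched here in Python. These helpers are hypothetical and only approximate what the plugin does: the `api.example.com` URLs are placeholders, the Link header parsing ignores commas inside URLs or parameter values, and the index expansion covers positive increments with a Max Index set.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

PLACEHOLDER = "{pagination.index}"

def next_link(link_header):
    """Link in response header: return the URL marked rel="next", or None."""
    # Simplification: assumes no commas inside URLs or parameter values.
    for url, params in re.findall(r"<([^>]*)>([^,]*)", link_header):
        if re.search(r'\brel\s*=\s*"?next\b', params):
            return url
    return None

def with_token(url, parameter, token):
    """Token in response body: append the token as a URL parameter."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True) + [(parameter, token)]
    return urlunsplit(parts._replace(query=urlencode(query)))

def index_urls(url, start, max_index, increment=1):
    """Increment an index: expand the placeholder from start up to max_index."""
    if PLACEHOLDER not in url:
        raise ValueError("URL must contain " + PLACEHOLDER)
    if increment <= 0:
        raise ValueError("this sketch only covers positive increments")
    return [url.replace(PLACEHOLDER, str(i))
            for i in range(start, max_index + 1, increment)]

header = ('<https://api.example.com/items?page=2>; rel="next", '
          '<https://api.example.com/items?page=5>; rel="last"')
print(next_link(header))
print(with_token("https://api.example.com/items?max=5", "pageToken", "CAEQAA"))
print(index_urls("https://api.example.com/items?page=" + PLACEHOLDER, 1, 3))
```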

SSL/TLS Properties

Property	Macro Enabled?	Description
Verify HTTPS Trust Certificates	Yes	Optional. If False, untrusted certificates (e.g. self-signed) will not lead to an error. Do not disable this in a production environment on a network you do not entirely trust, especially the public internet. Default is True.
Keystore File	Yes	Optional. A path to a file which contains keystore.
Keystore Type	Yes	Optional. Format of a keystore. Default is Java KeyStore (JKS).
Keystore Password	Yes	Optional. Password for a keystore. If a keystore is not password protected leave it empty.
Keystore Key Algorithm	Yes	Optional. An algorithm used for keystore. Default is SunX509.
TrustStore File	Yes	Optional. A path to a file that contains truststore.
TrustStore Type	Yes	Optional. Format of a truststore. Default is Java KeyStore (JKS).
TrustStore Password	Yes	Optional. Password for a truststore. If a truststore is not password protected leave it empty.
TrustStore Key Algorithm	Yes	Optional. An algorithm used for truststore. Default is SunX509.
Transport Protocols	Yes	Optional. Transport protocols which are allowed for connection. Default is TLSv1.2.
Cipher Suites	Yes	Optional. Cipher suites which are allowed for connection. Colons, commas, or spaces are acceptable separators.
