HTTP Streaming Source

The HTTP Streaming source plugin is available in the Hub.

This plugin periodically reads data from HTTP/HTTPS pages. Paginated APIs are supported: the plugin reads all available pages and then waits for new pages to appear. The supported data formats are JSON, XML, CSV, TSV, TEXT and BLOB.
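The read-then-wait behavior can be pictured with a minimal Python sketch. It is not the plugin's implementation: `read_available_pages`, the integer page cursor, and the convention that a missing page is reported as `None` are all assumptions made for illustration. The `max_pages` argument plays the role of the Max Pages Per Fetch property described under Configuration.

```python
def read_available_pages(fetch_page, next_page, max_pages=None):
    """Read pages until none is available or max_pages is reached.

    fetch_page is a hypothetical callable that returns the page content,
    or None when the requested page does not exist yet.
    """
    pages = []
    while max_pages is None or len(pages) < max_pages:
        page = fetch_page(next_page)
        if page is None:
            # No new page yet; the caller would wait and retry later.
            break
        pages.append(page)
        next_page += 1
    return pages, next_page


# A dictionary stands in for the remote API (assumption for the example).
available = {1: "page-1", 2: "page-2", 3: "page-3"}
first_batch, cursor_after_first = read_available_pages(available.get, 1, max_pages=2)
second_batch, cursor_after_second = read_available_pages(available.get, cursor_after_first)
```

With a limit of two, the first call stops after two pages even though a third one is available; the second call, with no limit, reads the remaining page and stops at the first missing one.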

Configuration

Property	Macro Enabled?	Description
General
Reference Name	No	Required. Name used to uniquely identify this source for lineage, annotating metadata, etc.
URL	Yes	Required. URL of the first page to fetch. The URL must start with a protocol (e.g. http://).
HTTP Method	Yes	Required. HTTP request method.
Headers	Yes	Optional. Headers to send with each HTTP request.
Request Body	Yes	Optional. Body to send with each HTTP request.
Max Pages Per Fetch	Yes	Optional. Maximum number of pages added to an RDD in one blocking read. If empty, no maximum is enforced.
Format
Format	Yes	Required. Format of the HTTP response. This determines how the response is converted into output records. Possible values are: JSON. Retrieves all records from the given JSON path and transforms them into records according to the mapping. XML. Retrieves all records from the given XPath and transforms them into records according to the mapping. TSV. Tab-separated values. Columns are mapped to record fields in the order they are listed in the schema. CSV. Comma-separated values. Columns are mapped to record fields in the order they are listed in the schema. Text. Transforms a single line of text into a single record with a string field `body` containing the result. BLOB. Transforms the entire response into a single record with a byte array field `body` containing the result. Default is JSON.
JSON/XML Result Path	Yes	Optional. Path to the results. When the format is XML, this is an XPath. When the format is JSON, this is a JSON path. For examples, see below.
JSON/XML Fields Mapping	Yes	Optional. Mapping of record fields to fields in the retrieved element. The left column contains the name of the schema field. The right column contains the path to the field, relative to the element. The path can be either an XPath or a JSON path. For an example, see below.
CSV Skip First Row	Yes	Optional. Whether to skip the first row of the HTTP response. This is usually set if the first row is a header row. Default is false.
OAuth2
OAuth2 Enabled	No	Required. If true, the plugin performs OAuth2 authentication. Default is false.
Auth URL	Yes	Optional. Endpoint for the authorization server used to retrieve the authorization code.
Token URL	Yes	Optional. Endpoint for the resource server, which exchanges the authorization code for an access token.
Client ID	Yes	Optional. Client identifier obtained during the Application registration process.
Client Secret	Yes	Optional. Client secret obtained during the Application registration process.
Scopes	Yes	Optional. Scope of the access request, which might have multiple space-separated values.
Refresh Token	Yes	Optional. Token used to obtain the access token, which is the end product of OAuth2.
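To show how the OAuth2 properties relate to each other, here is a minimal Python sketch that builds (but does not send) a standard refresh-token request as defined in RFC 6749, section 6. The source does not describe the exact request the plugin sends, so the wire format, the function name `build_refresh_request`, and the example URL and credentials are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_refresh_request(token_url, client_id, client_secret, refresh_token, scopes=None):
    """Build a POST request that exchanges a refresh token for an access token.

    Assumption: client credentials are sent in the form body; some servers
    expect HTTP Basic authentication instead.
    """
    params = {
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    if scopes:
        # Multiple scopes are passed as one space-separated string.
        params["scope"] = scopes
    body = urlencode(params).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(token_url, data=body, headers=headers, method="POST")


# Placeholder values; nothing is sent over the network here.
request = build_refresh_request(
    "https://auth.example.com/token", "my-client", "my-secret", "my-refresh-token",
    scopes="read write",
)
```

Passing the request to `urllib.request.urlopen` would perform the exchange; a server that follows the standard returns a JSON document containing an `access_token` field.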

JSON/XML Result Path Examples

JSON path example:

{
     "errors": [],
     "response": {
       "books": [
         {
           "id": "1159142",
           "title": "Agile Web Development with Rails",
           "author": "Sam Ruby, Dave Thomas, David Heinemeier Hansson",
           "printInfo": {
             "page": 488,
             "coverType": "hard",
             "publisher": "Pragmatic Bookshelf"
           }
         },
         {
           "id": "2375753",
           "title": "Flask Web Development",
           "author": "Miguel Grinberg",
           "printInfo": {
             "page": 543,
             "coverType": "hard",
             "publisher": "O'Reilly Media, Inc"
           }
         },
         {
           "id": "547307",
           "title": "Alex Homer, ASP.NET 2.0 Visual Web Developer 2005",
           "author": "David Sussman",
           "printInfo": {
             "page": 543,
             "coverType": "hard",
             "publisher": "unknown"
           }
         }
       ]
     }
}

The JSON path to fetch books is /response/books. However, if we need to fetch only printInfo, we can specify /response/books/printInfo as well.
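The two paths above can be illustrated with a small Python sketch. This is not the plugin's parser: the helper `select` is hypothetical, and it assumes that a path segment applied to an array is applied to each element of that array, which is how /response/books/printInfo would yield one record per book.

```python
import json


def select(document, path):
    """Resolve a slash-separated path, descending into every array element."""
    current = [document]
    for key in path.strip("/").split("/"):
        next_level = []
        for node in current:
            # Treat an array as its elements so the key applies to each one.
            items = node if isinstance(node, list) else [node]
            for item in items:
                if isinstance(item, dict) and key in item:
                    next_level.append(item[key])
        current = next_level
    # If the path ends at an array, its elements are the records.
    records = []
    for node in current:
        if isinstance(node, list):
            records.extend(node)
        else:
            records.append(node)
    return records


# A shortened version of the response above.
response = json.loads("""
{
  "errors": [],
  "response": {
    "books": [
      {"id": "1159142", "printInfo": {"page": 488, "coverType": "hard"}},
      {"id": "2375753", "printInfo": {"page": 543, "coverType": "hard"}}
    ]
  }
}
""")

books = select(response, "/response/books")
print_infos = select(response, "/response/books/printInfo")
```

Here `books` holds the two book objects and `print_infos` holds only their nested printInfo objects.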

XPath example:

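<bookstores>
  <bookstore>
     <book category="cooking">
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>
         <value>15.0</value>
         <policy>Discount up to 50%</policy>
        </price>
     </book>
     <book category="web">
        <author>James McGovern</author>
        <author>Per Bothner</author>
        <year>2003</year>
        <price>
         <value>49.99</value>
         <policy>No discount</policy>
        </price>
     </book>
     ...
  </bookstore>
  <bookstore>
     ...
  </bookstore>
</bookstores>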

The XPath to fetch all books is /bookstores/bookstore/book. A more precise selection is also possible, e.g. /bookstores/bookstore/book[@category='web'].
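Both selections can be tried out with Python's standard `xml.etree.ElementTree`, which supports a subset of XPath that includes attribute predicates. The sample document below is a simplified stand-in written for this sketch, not the plugin's input, and the plugin's own XPath engine may behave differently. ElementTree resolves paths relative to the root element, so /bookstores/bookstore/book becomes ./bookstore/book.

```python
import xml.etree.ElementTree as ET

# Simplified sample document (assumption for illustration).
SAMPLE = """<bookstores>
  <bookstore>
    <book category="cooking"><year>2005</year></book>
    <book category="web"><year>2003</year></book>
  </bookstore>
  <bookstore>
    <book category="web"><year>2010</year></book>
  </bookstore>
</bookstores>"""

root = ET.fromstring(SAMPLE)  # root is the <bookstores> element

# Equivalent of /bookstores/bookstore/book
all_books = root.findall("./bookstore/book")

# Equivalent of /bookstores/bookstore/book[@category='web']
web_books = root.findall("./bookstore/book[@category='web']")
web_years = [book.findtext("year") for book in web_books]
```

The first query returns every book in every bookstore; the second keeps only the books whose category attribute equals web.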

JSON/XML Fields Mapping Example

Example response:

{
   "startAt":1,
   "maxResults":5,
   "total":15599,
   "issues":[
      {
         "id":"20276",
         "key":"NETTY-14",
         "fields":{
            "issuetype":{
               "name":"Bug",
               "subtask":false
            },
            "fixVersions":[
               "4.1.37"
            ],
            "description":"Test description for NETTY-14",
            "project":{
               "id":"10301",
               "key":"NETTY",
               "name":"Netty-HTTP",
               "projectCategory":{
                  "id":"10002",
                  "name":"Infrastructure"
               }
            }
         }
      },
      {
         "id":"19124",
         "key":"NETTY-13",
         "fields":{
            "issuetype":{
               "self":"https://issues.cask.co/rest/api/2/issuetype/4",
               "name":"Improvement",
               "subtask":false
            },
            "fixVersions":[

            ],
            "description":"Test description for NETTY-13",
            "project":{
               "id":"10301",
               "key":"NETTY",
               "name":"Netty-HTTP",
               "projectCategory":{
                  "id":"10002",
                  "name":"Infrastructure"
               }
            }
         }
      }
   ]
}

Assume the result path is /issues.

The mapping is:

Field Name	Field Path
type	/fields/issuetype/name
description	/fields/description
projectCategory	/fields/project/projectCategory/name
isSubtask	/fields/issuetype/subtask
fixVersions	/fields/fixVersions

The result records are:

key	type	isSubtask	description	projectCategory	fixVersions
NETTY-14	Bug	false	Test description for NETTY-14	Infrastructure	["4.1.37"]
NETTY-13	Improvement	false	Test description for NETTY-13	Infrastructure	[]

Note that the field key was mapped without being included in the mapping. Mapping entries like key: /key can be omitted as long as the field is present in the schema.
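The mapping rules can be sketched in a few lines of Python. This is an illustration rather than the plugin's code: `get_path` and `to_record` are hypothetical helpers, and the fallback of an unmapped schema field to a top-level key of the same name is modeled on the behavior described for key.

```python
def get_path(element, path):
    """Follow a slash-separated path inside one element; None if absent."""
    value = element
    for key in path.strip("/").split("/"):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def to_record(element, schema_fields, mapping):
    """Build one output record from one element of the result path."""
    record = {}
    for field in schema_fields:
        # Unmapped schema fields fall back to a top-level key of the same name.
        path = mapping.get(field, "/" + field)
        record[field] = get_path(element, path)
    return record


mapping = {
    "type": "/fields/issuetype/name",
    "description": "/fields/description",
    "projectCategory": "/fields/project/projectCategory/name",
    "isSubtask": "/fields/issuetype/subtask",
    "fixVersions": "/fields/fixVersions",
}
schema_fields = ["key", "type", "isSubtask", "description", "projectCategory", "fixVersions"]

# The first element under /issues from the example response.
issue = {
    "id": "20276",
    "key": "NETTY-14",
    "fields": {
        "issuetype": {"name": "Bug", "subtask": False},
        "fixVersions": ["4.1.37"],
        "description": "Test description for NETTY-14",
        "project": {"projectCategory": {"id": "10002", "name": "Infrastructure"}},
    },
}

record = to_record(issue, schema_fields, mapping)
```

The resulting `record` has the same values as the NETTY-14 row in the table above, including key, which is resolved through the fallback instead of an explicit mapping entry.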