Table of Contents

...

The plugin uses Apache HttpComponents HttpClient, the successor of Commons HttpClient, as its HTTP framework.

It appears to be the most widely used and best community-supported framework. Solutions and workarounds for most problems are easy to find, which makes plugin development and maintenance easier. The framework has built-in support for compression, HTTPS tunneling, digest authentication and many other features.

Properties:

Section | Name | Description | Default | Widget | Validations
General




URL

The URL to request. "{pagination.index}" can be included in the URL to represent the changing part needed by some pagination strategies.

E.g:

https://my.api.com/api/v1/user?maxResults=10&name=John&pageNumber={pagination.index}


Text Box | Validate that it contains a protocol.
HTTP Method

Possible values:

  • GET
  • PUT
  • POST
  • DELETE
  • HEAD
GET | Radio group
Headers | Key-value map of headers
KeyValue
DropdownThis is

Request Body

Text Area | No validation [1]
Connect Timeout

Maximum time to wait for a connection to the server, in seconds.

0 - wait forever

120 | Text Box | If is_number and >= 0
Read Timeout

Maximum time to wait for data, in seconds.

0 - wait forever

120 | Text Box | If is_number and >= 0
Error Handling
Error Handling Per Status
Error Handling

HTTP Errors Handling

This is a map in which the user defines which error status codes produce which results. Possible values are: RETRY, FAIL, SKIP, SEND_TO_ERROR, ALERT.

Example:

500: RETRY

404: SEND_TO_ERROR

*: FAIL

Wildcard (*) means "otherwise" or "for all other codes do ..".


If the field is empty, any status code >= 400 will yield a pipeline failure.


KeyValue Dropdown

If SEND_TO_ERROR, SKIP or SEND_TO_ALERTS is used and the current pagination type does not support it, throw a validation error. [2]

Non-HTTP Error Handling

Handling of type casting and any other unhandled exceptions thrown during transformation of a record:

Possible values are:

  • "Skip on error" - ignores any errors
  • "Stop on error" - fails pipeline
  • "Send to error" - send to error handler
Stop on error | Dropdown list | If "Send to error" or "Skip on error" is used and the current pagination type does not support it, throw a validation error. [2]
Retry Policy

Possible values are:

    • Exponential
    • Linear

Exponential

Radio group
Linear Retry Interval | The interval between retries (seconds) | 30
Text Box | If is_number and >= 0
Retry Count | Total number of retries to make before failing | 5 | Text Box | If is_number and >= 0
Basic authentication
Username | Used for basic authentication. | Text Box
Password | Used for basic authentication. | Password
HTTP Proxy:
Proxy URI
Number
  • if not set and retryPolicy is linear, fail.
Max retry duration | Maximum number of seconds to spend on retries | 600 | Number
Connect Timeout

Maximum time to wait for a connection to the server, in seconds.

0 - wait forever

120 | Number
Read Timeout

Maximum time to wait for data, in seconds.

0 - wait forever

120 | Number
Basic authentication
Username | Used for basic authentication. | Text Box
Password | Used for basic authentication. | Password
HTTP Proxy:

Proxy URL

Example: http://proxy.com:8080

Note for me: test this with https proxies.


Text Box
Username

Text Box
Password

Password


[1] Unfortunately, we cannot validate the request body. Although the body of an API request is most commonly JSON (for JSON APIs) or XML (for XML SOAP APIs), in theory it can be anything.

[2] Pagination types in which the next page URL is taken from the previous page are the ones which do not support SEND_TO_ERROR or IGNORE.
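
The lookup in the "HTTP Errors Handling" map can be sketched as follows. This is only an illustration, not the plugin code: the function name resolve_action and the "SUCCESS" label are hypothetical, and two behaviors are assumptions (status codes below 400 never consult the map; a non-empty map with no matching entry and no wildcard falls back to FAIL).

```python
from typing import Dict


def resolve_action(status_code: int, handling: Dict[str, str]) -> str:
    """Return the action configured for an HTTP status code.

    `handling` maps a status code (as a string) or the wildcard "*"
    to one of the actions, e.g. {"500": "RETRY", "*": "FAIL"}.
    """
    # Assumption: only error codes are looked up in the map.
    if status_code < 400:
        return "SUCCESS"  # hypothetical label for "no error handling needed"
    # From the text: an empty map means any status code >= 400 fails the pipeline.
    if not handling:
        return "FAIL"
    exact = handling.get(str(status_code))
    if exact is not None:
        return exact
    # Wildcard means "for all other codes"; assumption: FAIL if it is absent.
    return handling.get("*", "FAIL")


example = {"500": "RETRY", "404": "SEND_TO_ERROR", "*": "FAIL"}
print(resolve_action(500, example), resolve_action(404, example), resolve_action(403, example))
```

An exact match takes precedence over the wildcard, so the order of entries in the map does not matter.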

...

Parallelization

There are two reasons why we should not parallelize the requests:

...

Name | Description | Default | Widget | Validations

Pagination type

Possible values are:

  • None
  • Link in response header
  • Link in response body
  • Increment an index
  • Token in Response Body
  • Custom
NoneDropdown listSelect
  • "Link in response body": Next Page field is set
  • "Token in Response Body": "Next Page Token Path" and "Next Page Url Parameter" are set.
  • "Increment an index": {pagination.index} is in url, start index, increment are set.
  • Custom: python code is set.
Start Index | Initial value for the index which replaces {pagination.index} in the URL. See example here
Text Box
  • If set and pagination type is not "Increment an index", fail.
  • If set and no {pagination.index} in url, fail.
  • Assert if is_number
Max Index

Max value for index which replaces {pagination.index} in url.

If this is empty, plugin will load pages until no results or 404 is returned.

Also plugin may stop iteration before max index is reached, if no more records.


Text Box
  • If set and pagination type is not "Increment an index", fail.
  • If set and no {pagination.index} in url, fail.
  • Assert if is_number
Index Increment | Increment value for the index which replaces {pagination.index} in the URL.
Text Box
  • If set and pagination type is not "Increment an index", fail.
  • If set and no {pagination.index} in url, fail.
  • Assert if is_number
Next Page JSON/XML Field Path

Path to the field in the JSON or XML response which contains the next page URL. See an example here


Text Box
  • If set and pagination type is not "Link in response body", fail.
  • If the content type is not XML or JSON, fail.
Next Page Token Path | Path to a field in the JSON or XML response containing the next page token.
  • If set and pagination type is not "Token in Response Body", fail.
  • If the content type is not XML or JSON, fail.
  • Validate to have at least one element
Next Page Url Parameter | For type "Token in Response Body", the name of the URL parameter in which the next page token is added to the URL.
Text Box
  • If set and pagination type is not "Token in Response Body", fail.
Custom Pagination Python Code | A code fragment which determines how the next page URL is generated and when to finish iteration. For more info see Custom Pagination
Python code | If set and pagination type is not "Custom", fail.
Wait time between pages | The number of milliseconds to wait before requesting the next page. | 1000 | Number
  • Assert if is_number and > 0.
  • If not set and Pagination type is not "None", fail.

The above is a bit messy because we cannot dynamically change the content of a widget depending on the pagination type, which makes it a mix of properties for different pagination types. This is not very user-friendly for the end user. For now I will add a placeholder which says which pagination type each property corresponds to.

Pagination type is none

Plugin will request a single page.

...

The plugin stops reading when a page returns no records or a 404, or when max_index is reached (if it is not empty).
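
The stop conditions above can be sketched as follows. The names index_page_urls, read_all and fetch_page are hypothetical, and treating Max Index as inclusive is an assumption; the real plugin would issue HTTP requests where this sketch calls the fetch_page callback.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple

PLACEHOLDER = "{pagination.index}"


def index_page_urls(url_template: str, start_index: int, increment: int,
                    max_index: Optional[int] = None) -> Iterator[str]:
    """Yield page URLs with the placeholder replaced by a growing index."""
    if PLACEHOLDER not in url_template:
        raise ValueError("URL must contain " + PLACEHOLDER)
    if increment <= 0:
        raise ValueError("increment must be positive")
    index = start_index
    # Assumption: max_index is inclusive; None means "no upper bound".
    while max_index is None or index <= max_index:
        yield url_template.replace(PLACEHOLDER, str(index))
        index += increment


def read_all(urls: Iterable[str],
             fetch_page: Callable[[str], Tuple[int, List[str]]]) -> List[str]:
    """Collect records until a page is empty or returns 404."""
    records: List[str] = []
    for url in urls:
        status, page = fetch_page(url)  # hypothetical callback: (status, records)
        if status == 404 or not page:
            break
        records.extend(page)
    return records
```

Because index_page_urls is a generator, an empty Max Index does not loop forever: read_all stops pulling URLs as soon as a page is empty or missing.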

...

Different APIs use very different styles of pagination. In the simple cases they return a link in a header or in some field of the response JSON.

...

Pagination by next page token

Here is an example of pagination from the YouTube API. The nextPageToken field contains a token, which should be included in the URL to get the next page, e.g. "&page_token=CAEQAA".

...

Code Block
${url}
${url}&nextPageToken=${nextPageToken1}
${url}&nextPageToken=${nextPageToken2}
...
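
The URL sequence above can be sketched like this. It is a minimal illustration under assumptions: token_pages and fetch_json are hypothetical names, the token path is a simple slash-separated path, and the base URL already contains a query string (so the token is appended with "&", as in the sequence above).

```python
from typing import Any, Callable, Dict, Iterator, Optional
from urllib.parse import quote


def get_by_path(document: Any, path: str) -> Any:
    """Follow a slash-separated path such as /a/b; return None if missing."""
    node = document
    for key in path.strip("/").split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def token_pages(url: str, token_path: str, url_parameter: str,
                fetch_json: Callable[[str], Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield pages, following the token found at token_path in each page."""
    next_url: Optional[str] = url
    while next_url is not None:
        page = fetch_json(next_url)  # hypothetical callback returning parsed JSON
        yield page
        token = get_by_path(page, token_path)
        # No token in the response means this was the last page.
        next_url = f"{url}&{url_parameter}={quote(str(token), safe='')}" if token else None
```

Here token_path plays the role of "Next Page Token Path" and url_parameter the role of "Next Page Url Parameter".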

...


Custom pagination

Different APIs use very different styles of pagination. In the simple cases they return a link in a header or in some field of the response JSON.

  • For example, consider an API where the user wants to paginate by time in the following way: &start_time={something}&end_time={something+10000}. Two dependent variables are involved here. It would be very problematic to make something like this configurable via a widget.
  • Let's imagine another case. The user wants to download a web server directory, so the "pages" in this case are files on the web server. Let's say they analyze or back up a whole site. So we need to paginate based on the results of parsing HTML.
  • Let's assume another example. The user wants to skip certain pages in an API. Let's say the API pagination is time based, meaning something like "&start_time=1389075585" is appended to the URL, but the user only wants to get pages for the weekends.
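
To make the first case concrete, here is a sketch of what user-supplied pagination code for a time window could look like. The function name next_page_url, its parameters and the stop_at bound are all hypothetical; the interface the plugin actually expects from custom code is not shown here.

```python
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(current_url: str, window: int = 10000,
                  stop_at: int = 1389175585) -> Optional[str]:
    """Shift the start_time/end_time window forward; None ends the iteration."""
    parts = urlsplit(current_url)
    query = parse_qs(parts.query)
    # The next window starts where the current one ended.
    start = int(query["end_time"][0])
    if start >= stop_at:
        return None
    query["start_time"] = [str(start)]
    query["end_time"] = [str(min(start + window, stop_at))]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(next_page_url("https://api.example.com/events?start_time=0&end_time=10000",
                    window=10000, stop_at=25000))
```

The two dependent variables are derived from a single value (the previous end_time), which is exactly the kind of logic that is awkward to express through widget fields.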

...

Section | Name | Description | Default | Widget | Validations
Format




Format

Possible values:

  • JSON
  • XML
  • TSV
  • DelimitedCSV
  • Text
Radio group
  • Blob

Dropdown list
JSON/XML Result Path

For JSON a simple slash separated path is used e.g. /library/books/items.

For XML an XPath is used.


Text Box | Fail if used with a non JSON/XML format
Delimiter (Delimited Type) | Used only for the delimited type. For CSV this should be a comma. | Text Box | Fail if used with a non-delimited type


JSON/XML Fields Mapping

Mapping of schema field name to jsonPath (past the result path).

Example (Jira API):

FieldName | FieldPath
name | /key
type | /fields/issuetype/name
description | /fields/description
projectCategory | /fields/project/projectCategory/name
isSubtask | /fields/issuetype/subtask
fixVersions | /fields/fixVersions

Schema fields which are not in the map will use the default mapping fieldName: /fieldName.



  • if key is not present in schema fail
  • if used for non JSON/XML fail
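
How the result path and the fields mapping work together can be sketched as follows. This is an illustration only: extract_records and get_by_path are hypothetical names, paths are treated as plain slash-separated keys, and the "/issues" result path in the usage example is an assumption about the Jira response shape.

```python
from typing import Any, Dict, List


def get_by_path(document: Any, path: str) -> Any:
    """Follow a slash-separated path such as /fields/issuetype/name."""
    node = document
    for key in path.strip("/").split("/"):
        if key == "":
            continue  # path "/" refers to the document itself
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def extract_records(response: Dict[str, Any], result_path: str,
                    schema_fields: List[str],
                    fields_mapping: Dict[str, str]) -> List[Dict[str, Any]]:
    """Build one flat record per element found at result_path."""
    records = []
    for item in get_by_path(response, result_path) or []:
        record = {}
        for field in schema_fields:
            # Fields missing from the map default to /fieldName.
            path = fields_mapping.get(field, "/" + field)
            record[field] = get_by_path(item, path)
        records.append(record)
    return records


response = {"issues": [{"key": "CDAP-1",
                        "fields": {"issuetype": {"name": "Bug", "subtask": False}}}]}
mapping = {"name": "/key", "type": "/fields/issuetype/name"}
print(extract_records(response, "/issues", ["name", "type"], mapping))
```

Field paths are resolved relative to each element under the result path, which matches "past the result path" in the description above.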

1 JSON format

JSON entries are converted into StructuredRecord using StructuredRecordStringConverter.java

To specify where records should be taken from, the user needs to set the JSON Result Path.

Example:

Code Block
{
 "pageInfo": {
  "totalResults": 208,
  "resultsPerPage": 2
 },
 "items": [
  {
   "kind": "youtube#searchResult",
   "etag": "\"Bdx4f4ps3xCOOo1WZ91nTLkRZ_c/yrJNwvacPS7tA7BQCQmeIZr7fg8\"",
   "id": {
    "kind": "youtube#channel",
    "channelId": "UCfkRcekMTa5GA2DdNKba7Jg"
   },
   "snippet": {
    "publishedAt": "2015-02-12T22:12:43.000Z",
    "channelId": "UCfkRcekMTa5GA2DdNKba7Jg",
    "title": "Cask",
    "description": "Founded by developers for developers, Cask is an open source big data software company focused on developers. Cask's flagship offering, the Cask Data ...",
    "thumbnails": {
     ...
    },
    "channelTitle": "Cask",
    "liveBroadcastContent": "upcoming"
   }
  },
  {
   "kind": "youtube#searchResult",
   "etag": "\"Bdx4f4ps3xCOOo1WZ91nTLkRZ_c/uv6u8PSG0DsOqN9m77o06Jl4LnA\"",
   "id": {
    "kind": "youtube#video",
    "videoId": "ntOXeYecj7o"
   },
   "snippet": {
    "publishedAt": "2016-12-21T19:32:03.000Z",
    "channelId": "UCfkRcekMTa5GA2DdNKba7Jg",
    "title": "Cask Product Tour",
    "description": "In this video, we take you on a product tour of CDAP, CDAP extensions and the Cask Market. Cask Data Application Platform (CDAP) is the first Unified ...",
    "thumbnails": {
     ...
    },
    "channelTitle": "Cask",
    "liveBroadcastContent": "none"
   }
  }
 ]
}

...

We may add XML parsing functionality to https://github.com/cdapio/cdap/blob/release/6.0/cdap-formats/src/main/java/io/cdap/cdap/format/ (similar to what we have for JSON), so that other projects can re-use it.

XML below will be used as basis for examples in this section.

...

2.1 STEP 1 - Get XML by XPath

XML parsing is done by the default Java DOM parser, which can select items by a specified XPath. XPath is very flexible: it allows the user to select nodes by attribute value, to group nodes from different parents into a single result, to choose nodes conditionally, and so on.

Some XPath examples:

Code Block
/bookstores/bookstore/book[position()<3]
//title[@lang]
//title[@lang='en']
/bookstores/bookstore/book/price[text()] # convert all subelements to string
/bookstores/bookstore/book[price>35.00]/title
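
For readers who want to try such selections quickly, here is a small sketch using Python's xml.etree.ElementTree rather than the Java DOM parser the plugin uses. Two caveats: ElementTree supports only a subset of XPath (attribute predicates work, numeric comparisons such as price>35.00 do not, so that filter is done in Python), and the XML document below is an assumption reconstructed for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample document, for illustration only.
BOOKSTORES_XML = """
<bookstores>
  <bookstore id="1">
    <book category="cooking">
      <title lang="en">Everyday Italian</title>
      <author>Giada De Laurentiis</author>
      <year>2005</year>
      <price policy="Discount up to 50%">15</price>
    </book>
    <book category="web">
      <title lang="en">XQuery Kick Start</title>
      <author>James McGovern</author>
      <author>Per Bothner</author>
      <year>2003</year>
      <price policy="No discount">49.99</price>
    </book>
  </bookstore>
</bookstores>
""".strip()

root = ET.fromstring(BOOKSTORES_XML)  # root is the <bookstores> element

# Equivalent of //title[@lang='en']
english_titles = [t.text for t in root.findall(".//title[@lang='en']")]

# Equivalent of /bookstores/bookstore/book[price>35.00]/title,
# with the numeric comparison done in Python.
expensive_titles = [
    book.findtext("title")
    for book in root.findall("./bookstore/book")
    if float(book.findtext("price")) > 35.00
]

print(english_titles)
print(expensive_titles)
```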

...

Code Block
{"bookstores": {"bookstore": [
    {
        "book": [
            {
                "year": 2005,
                "author": "Giada De Laurentiis",
                "price": {
                    "value": 15,
                    "policy": "Discount up to 50%"
                },
                "category": "cooking",
                "title": {
                    "lang": "en",
                    "content": "Everyday Italian"
                }
            },
            {
                "year": 2003,
                "author": [
                    "James McGovern",
                    "Per Bothner"
                ],
                "price": {
                    "value": 49.99,
                    "policy": "No discount"
                },
                "category": "web",
                "title": {
                    "lang": "en",
                    "content": "XQuery Kick Start"
                }
            },
            ...
        ],
        "id": 1
    },
    {
        ...
        "id": 2
    }
]}}

On a side note: the author fields have different types in the JSON above. This will be handled: if the schema field type is union and a single value appears where a list is expected, we treat it as a list with a single element.
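
The single-value-to-list rule can be sketched as below. The function name normalize_list_fields and the idea of passing the list-typed field names explicitly are assumptions made for illustration; the plugin would derive them from the schema.

```python
from typing import Any, Dict, Iterable


def normalize_list_fields(record: Dict[str, Any],
                          list_fields: Iterable[str]) -> Dict[str, Any]:
    """Wrap a single value into a one-element list for list-typed fields."""
    normalized = dict(record)  # do not mutate the caller's record
    for field in list_fields:
        if field in normalized and not isinstance(normalized[field], list):
            normalized[field] = [normalized[field]]
    return normalized


print(normalize_list_fields({"author": "Giada De Laurentiis", "year": 2005}, ["author"]))
```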

2.3 STEP 3 - Generate StructuredRecord from JSON

Converting JSON into StructuredRecord is a simple task. We do this via StructuredRecordStringConverter.java. Example:

Code Block
{
    "year": 2003,
    "author": [
        "James McGovern",
        "Per Bothner"
    ],
    "price": {
        "value": 49.99,
        "policy": "No discount"
    },
    "category": "web",
    "title": {
        "lang": "en",
        "content": "XQuery Kick Start"
    }
}

will yield records compatible with below schema:

Code Block
year: string
author: array
price: record
   - value:double
   - policy:string
category: string
title: record
   - lang:string
   - content:string

3 Delimited format

Will use the functionality from cdap-formats/DelimitedStringsRecordFormat.java to validate the schema and convert input TSV/CSV to StructuredRecord. The class supports a columns mapping as the last field of the schema.

4 Text format

Will use the functionality from cdap-formats/TextRecordFormat.java to validate the schema and convert input text to StructuredRecord.

OAuth2

1.1 General concepts

Before using OAuth2, the user usually has to create an application on the service's site (e.g. Twitter) and register a redirect URI, then receive a client_id and client_secret, which will be used by the application during authentication.

The further steps are shown in the diagram below:

[Diagram: OAuth2 authorization steps; image not included]

We are implementing grant types "Authorization Code Grant" and "Refresh Token". Other types are not suitable or rarely used. Click here for some context.

Properties:

...

CDAP will start a local server on the given url. Only localhost urls are allowed. For more info click here.

This is the URL to which the service calls back with the authCode after the user enters a username and password and agrees to grant the permissions.

This URL is also usually configured when registering the OAuth2 application in the service (e.g. Twitter). If the URL registered there is not equal to the one we send, OAuth2 will get denied.

...

A page to which the user is directed to enter their credentials.

Example: https://www.facebook.com/dialog/oauth

...

An endpoint where CDAP can exchange the authCode for an accessToken and refreshToken, or refresh the accessToken.

Example: https://graph.facebook.com/v3.3/oauth/access_token

...

User should obtain this when registering the OAuth2 application in the service (e.g. Twitter).

...

This is optional.

Scope is a mechanism in OAuth 2.0 to limit an application's access to a user's account. An application can request one or more scopes, this information is then presented to the user in the consent screen, and the access token issued to the application will be limited to the scopes granted.

...

Note: OAuth2 implementation will reside in one of CDAP core modules, since this will need to be re-used by different plugins.

1.2 UI Changes Required. Need to expose an authentication dialog pop-up window to a user

We need to show the service's authentication dialog, where the user enters a username and password and agrees to grant certain access to our application.

This will require change to UI. We can implement this as plugin-function:

Code Block
"plugin-function": {
  "method": "POST",
  "widget": "showPageToUser",
  "output-property": "void",
  "plugin-method": "showPageToUser"
}

...

  1. Once the user clicks the button, UI runs showPageToUser to get url and headers from plugin using post-function. Example of return from plugin:

    Code Block
    url = https://www.facebook.com/v3.3/dialog/oauth?
      client_id=3MVG96_7YM2sI9wT6c13RTPp6RDeRBPFc0F5sYfIrKBZPdTK2Yr7jiTwq8u3ykXyBHtlf3lnNMWSN1rqfjA_y
      &redirect_uri=http://localhost:27435
    headers = {} # empty

    On a side note: the URL syntax is established by an RFC: https://tools.ietf.org/html/rfc6749#section-4.1.1

    IMPORTANT: During this call plugin will also start a callback server (for more information click here)

  2. UI shows the page.
  3. Once the page is closed, the UI makes another call to the plugin.
    During this call the plugin waits for the callback to complete. It then exchanges the received authCode for an accessToken/refreshToken pair and returns it to the UI.
  4. UI populates the widget field refreshToken with the value. 
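
Building the URL returned in step 1 can be sketched as follows, using the query parameters defined in RFC 6749 section 4.1.1 (response_type, client_id, redirect_uri, and the optional scope). The function name build_auth_url and the example client id are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def build_auth_url(auth_url: str, client_id: str, redirect_uri: str,
                   scope: Optional[str] = None) -> str:
    """Compose the authorization request URL for the Authorization Code Grant."""
    params = {
        "response_type": "code",  # required by RFC 6749 for this grant type
        "client_id": client_id,
        "redirect_uri": redirect_uri,
    }
    if scope:
        params["scope"] = scope  # optional
    return auth_url + "?" + urlencode(params)


print(build_auth_url("https://www.facebook.com/v3.3/dialog/oauth",
                     "my-client-id", "http://localhost:27435"))
```

Note that urlencode percent-encodes the redirect URI, so the callback address survives as a single query parameter value.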

refreshToken usually has a permanent lifetime (unlike accessToken), unless invalidated manually by user. With this token during every pipeline run we can ask for an accessToken from tokenUrl. No need for callback server or UI/user involvement at this point.

If OAuth2 is enabled and the refreshToken field is not populated, pipeline deployment will fail. So effectively the user will be asked to authenticate once, and the information for further authentication will be saved in the widget.

...

We will need to start an HTTP callback server on localhost, say http://localhost:27435. The port should be statically configured via the widget. We cannot use a random ephemeral port, since the callback_url needs to be constant: it is saved on the service provider's side (e.g. Facebook) as a static configuration of the OAuth2 application.

After user enters his credentials on authentication page (of let's say Facebook), browser will redirect the response authCode to this server. Since this request is done by client browser (not remote server), this address is not required to be public IP address and we can safely use localhost.
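
What the callback server has to do with the redirected request can be sketched as follows. Per RFC 6749 the authorization code arrives in the "code" query parameter and failures arrive in "error"; the function name extract_auth_code is hypothetical, and the HTTP server that would pass it the request path is left out.

```python
from urllib.parse import parse_qs, urlsplit


def extract_auth_code(request_path: str) -> str:
    """Extract the authorization code from the redirect request path."""
    query = parse_qs(urlsplit(request_path).query)
    if "error" in query:
        # The service reports a denied or failed authorization this way.
        raise RuntimeError("authorization failed: " + query["error"][0])
    codes = query.get("code")
    if not codes:
        raise ValueError("callback request does not contain a code parameter")
    return codes[0]


print(extract_auth_code("/?code=SplxlOBeZQQYbYS6WxSbIA&state=xyz"))
```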

1.4 Refreshing access tokens

Since we save the refresh token instead of the access token (which is short-lived), we need to obtain an access token during every pipeline run. This is a very simple process: a single request returns an access token.

Code Block
POST /token HTTP/1.1
Host: server.example.com
Content-Type: application/x-www-form-urlencoded

grant_type=refresh_token&refresh_token=tGzv3JOkF0XG5Qx2TlKWIA
&client_id=s6BhdRkqt3&client_secret=7Fjfp0ZBr1KtDRbnfVdmIw
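
The request above can be assembled like this. It is a sketch that only builds the headers and form body and does not send anything; the function name build_refresh_request is hypothetical.

```python
from typing import Dict, Tuple
from urllib.parse import urlencode


def build_refresh_request(refresh_token: str, client_id: str,
                          client_secret: str) -> Tuple[Dict[str, str], str]:
    """Return (headers, body) for a refresh_token grant request."""
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    body = urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    })
    return headers, body


headers, body = build_refresh_request("tGzv3JOkF0XG5Qx2TlKWIA", "s6BhdRkqt3",
                                      "7Fjfp0ZBr1KtDRbnfVdmIw")
print(body)
```

The body would be sent with POST to the configured Token URL.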

Facebook case

All the APIs I checked (Google APIs, PayPal, WordPress, Microsoft Azure, Okta) support refreshing the access token; this is in fact part of the RFC. The only API which does not is Facebook. Instead, it uses its own concept called fb_exchange_token. Here's more info. Since Facebook is widely used, I suggest we simply add an ugly check "if url contains facebook.com" (or, in fancy Java terms, create a factory class which creates OAuth2 handlers depending on the URL provided), then save the long-lived access token instead of the refreshToken and refresh it the way Facebook wants. The factory can then be used to implement behavior for other services with non-default OAuth2 implementations.
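
The factory idea can be sketched as follows. The class and function names are hypothetical, and the handlers are empty placeholders; matching on the host name instead of a plain "url contains" check is a design choice of this sketch, so that unrelated hosts which merely contain the string are not matched.

```python
from urllib.parse import urlsplit


class DefaultOAuth2Handler:
    """Placeholder: refreshes the access token with a standard refresh token."""
    name = "default"


class FacebookOAuth2Handler:
    """Placeholder: would store a long-lived access token and refresh it Facebook's way."""
    name = "facebook"


def create_handler(token_url: str):
    """Pick an OAuth2 handler based on the host of the token URL."""
    host = (urlsplit(token_url).hostname or "").lower()
    if host == "facebook.com" or host.endswith(".facebook.com"):
        return FacebookOAuth2Handler()
    return DefaultOAuth2Handler()


print(create_handler("https://graph.facebook.com/v3.3/oauth/access_token").name)
```

Supporting another service with a non-default flow would then mean adding one more handler class and one more branch in create_handler.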

...


OAuth2

Moved design information into a separate doc: Plugin OAuth2 Common Module

Properties:

Name | Description | Default | Widget | Validations
OAuth2 Enabled | True or false. | false | Radio group |
Auth URL | A page to which the user is directed to enter their credentials. Example: https://www.facebook.com/dialog/oauth | | Text Box | Must be empty if OAuth2 is disabled and non-empty if enabled.
Token URL | An endpoint where CDAP can exchange the authCode for an accessToken and refreshToken, or refresh the accessToken. Example: https://graph.facebook.com/v3.3/oauth/access_token | | Text Box | Must be empty if OAuth2 is disabled and non-empty if enabled.
Client ID | The user should obtain this when registering the OAuth2 application in the service (e.g. Twitter). | | Text Box | Must be empty if OAuth2 is disabled and non-empty if enabled.
Client Secret | The user should obtain this when registering the OAuth2 application in the service (e.g. Twitter). | | Password | Must be empty if OAuth2 is disabled and non-empty if enabled.
Scope | This is optional. Scope is a mechanism in OAuth 2.0 to limit an application's access to a user's account. An application can request one or more scopes; this information is then presented to the user in the consent screen, and the access token issued to the application will be limited to the scopes granted. | | Text Box | Must be empty if OAuth2 is disabled.
Refresh Token | This is populated by the button "Login via OAuth 2.0". Since we save the refresh token (not the access token, which is short-lived), this should be done only once, during initial pipeline deployment. For more information click here. The UI should put the actual value into the secure store and use the macro function ${secure(key)} as the value for extra safety. | | | Fail if empty and OAuth2 is enabled.

SSL/TLS

Some general definitions for more context:

...

Should we provide an option for the user to skip the identity check during an HTTPS connection? This is not recommended anywhere you read about it, but it might be useful when the user is testing an API which is still in development.

If the URL starts with "https", the plugin will try to use TLS by default.

Name | Description | Default | Widget | Validations
Verify HTTPs Trust Certificates | If false, connections to untrusted HTTPS sources are allowed. | true | |
Keystore File | Path to a keystore file | | Text Box | Check if file exists
Keystore Type | According to Oracle docs, there are 3 supported keystore types. Possible values: Java KeyStore (JKS), Java Cryptography Extension KeyStore (JCEKS), PKCS #12 | JKS | Radio Group |
Keystore Password | Leave empty if the keystore is not password protected | | Password | Try to load keystore with given password
Keystore Key Algorithm | SunX509 is default in Java. | SunX509 | Text Box |
TrustStore File | Path to a truststore file. If empty, use the default Java truststores. | | Text Box | Check if file exists
TrustStore Type | According to Oracle docs, there are 3 supported truststore types. Possible values: Java KeyStore (JKS), Java Cryptography Extension KeyStore (JCEKS), PKCS #12 | JKS | Radio Group |
TrustStore Password | Leave empty if the truststore is not password protected | | Password | Try to load truststore with given password
Truststore Trust Algorithm | | SunX509 | Text Box |
Transport Protocols | The user can add multiple protocols, which will be offered by the client during the handshake. | TLSv1.2 | Array | Validate if names are correct
Cipher Suites | The user can add multiple cipher suites. They will be offered by the client during the handshake. If empty, use the default cipher suites. This is a text box with a comma-separated list of ciphers, since there can sometimes be 20, 30 or more ciphers and adding each of them manually to an array would not be practical. | | Text Box | Validate if supported by current java implementation

...