Introduction

There are many datasets available from web URLs, including government data such as http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data or datasets from https://www.data.gov/. The problem is that in order to set up a repeatable process for pulling this information into your cluster, you must first download the data and write it into a file before you can use it in a batch workflow. Ideally, a single pipeline could be configured to pull that data in, store it in a temporary file on HDFS, and then kick off a Spark or MapReduce workflow to process and load it into the cluster. This would also make demos and example pipelines from the marketplace much easier to leverage, since in most cases no local configuration would be needed.
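The core of the flow described above is streaming an HTTP response into a file without buffering it in memory. This is a minimal Python sketch of that behavior, not the plugin implementation; the `download_to_file` and `copy_stream` helpers are illustrative names, and writing to HDFS rather than a local path would go through the Hadoop FileSystem API instead of `open`.

```python
import urllib.request

def copy_stream(src, dst, chunk_size=8192):
    """Copy a stream in fixed-size chunks so large downloads never
    have to fit in memory; returns the total number of bytes copied."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        dst.write(chunk)
        total += len(chunk)

def download_to_file(url, path, timeout=60):
    """Fetch `url` and stream the response body into `path`."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(path, "wb") as out:
        return copy_stream(response, out)
```

Chunked copying is what allows the plugin to satisfy the "files in excess of 1 GB" requirement without holding the whole payload in memory.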

Use case(s)

  • I would like to store consumer complaints data from http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data in my cluster. The data is updated nightly and is 260 MB, so I would like to build a pipeline that runs every 24 hours and refreshes the data from the site. Using this plugin, I configure it to pull data from the URL in CSV format and store it in HDFS. I then configure a File source, a CSV parser, and a Table sink to process the data in Hydrator.
  • The inventory team publishes a product inventory XML feed that contains all current products and the quantity of each item. I would like to configure a pipeline to query that service every hour and write the data into a table. Using this plugin, I configure it to request information from the URL, provide my credentials and expected format as request headers, and write the results to the /tmp/ directory in HDFS so that the rest of the pipeline can process the data.
  • I have partnered with a 3rd-party company that is sending me a large amount of data in a gzip file. The file is stored on a webserver somewhere, and I am given a URL to download it. Using this plugin, I configure it to pull the binary file, store the .gz file in HDFS, and use the File source to natively read that data for processing.

User Stories

  • As a pipeline developer, I would like to fetch data from an external web service or URL by providing the request method (GET, POST), URL, payload (if POST), request headers, charset (if text, otherwise binary), timeouts, and a file path in HDFS.
  • As a pipeline developer, I would like the option to flatten multi-line JSON and XML responses into a single line by removing newlines and additional spaces.
  • As a pipeline developer, I would like to download files in excess of 1 GB without failing.
  • As a pipeline developer, I would like the plugin to retry a configurable number of times before failing the pipeline.
  • As a pipeline developer, I would like to be able to download text or binary data and store it in HDFS for further processing.
  • As a pipeline developer, I would like to be able to send basic auth credentials by providing a username and password in the config.
  • As a pipeline developer, I would like to be able to read from HTTP and HTTPS endpoints.
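The flattening story above (collapsing a multi-line JSON or XML response into one line by removing newlines and the extra whitespace around them) can be sketched in a few lines. This is a minimal Python sketch of the intended behavior; `flatten_response` is an illustrative name, not part of the plugin:

```python
import re

def flatten_response(text):
    """Collapse a pretty-printed multi-line response into a single
    line by deleting each newline together with the whitespace
    immediately around it (the indentation of the next line)."""
    return re.sub(r"\s*\n\s*", "", text).strip()
```

For example, a pretty-printed JSON body `{"a": 1}` spread across three lines becomes a single line, which makes the file usable with record-per-line sources downstream.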

Plugin Type

  •  Action

Configurables

This section defines properties that are configurable for this plugin. 

Each property below lists its user-facing name, type, description, any constraints, and whether it is macro enabled.

  • HDFS File Path (String): The location to write the data in HDFS. Macro enabled: yes.
  • URL (String): Required. The URL to fetch data from. Macro enabled: yes.
  • Request Method (Select): The HTTP request method. Constraints: GET, POST.
  • Request Body (String): Optional request body. Macro enabled: yes.
  • Request Headers (KeyValue): An optional string of header values to send in each request, where the keys and values are delimited by a colon (":") and each pair is delimited by a newline ("\n"). Macro enabled: yes.
  • Text or Binary? (Select): Should the data be written as text (JSON, XML, txt files) or binary (zip, gzip, images) data? Constraints: Text, Binary.
  • Charset (Select): If text data is selected, the charset of the text being returned. Defaults to UTF-8. Constraints: ISO-8859-1, US-ASCII, UTF-8, UTF-16, UTF-16BE, UTF-16LE.
  • Should Follow Redirects? (Select): Whether to automatically follow redirects. Defaults to true. Constraints: true, false.
  • Number of Retries (Select): The number of times the request should be retried if it fails. Defaults to 0. Constraints: 0 to 10.
  • Connect Timeout (String): The time in milliseconds to wait for a connection. Set to 0 for infinite. Defaults to 60000 (1 minute).
  • Read Timeout (String): The time in milliseconds to wait for a read. Set to 0 for infinite. Defaults to 60000 (1 minute).

Design / Implementation Tips

  • Use HTTPPoller and HTTPCallback in the Hydrator plugins repository as a reference.
  • The workflow token should contain the file path for the data that was written, so that a File source later in the pipeline can read from it.
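The retry behavior from the configurables (retry a failed request up to a configured number of times before failing the run) can be sketched generically. This is a minimal Python sketch under the assumption that any exception from the fetch counts as a retryable failure; a real implementation would likely retry only on transient errors such as timeouts or 5xx responses:

```python
def fetch_with_retries(fetch, retries):
    """Call `fetch()` and return its result, retrying up to
    `retries` additional times; re-raises the last error once
    all attempts are exhausted."""
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as err:  # real code: catch transient errors only
            last_error = err
    raise last_error
```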

Design

Approach(s)

Properties

Security

Limitation(s)

Future Work


Test Case(s)


Sample Pipeline

Please attach one or more sample pipeline(s) and associated data. 

Pipeline #1

Pipeline #2

Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature