Many datasets are available from web URLs. Examples include government data such as http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data or datasets from https://www.data.gov/. The problem is that in order to set up a repeatable process to pull this information into your cluster, you must first download the data and write it to a file before you can use it in a batch workflow. Ideally, there should be a way to configure a single pipeline that pulls the data in, stores it in a temporary file on HDFS, and then kicks off a Spark or MapReduce workflow to process and load it into the cluster. This would also make demos and example pipelines from the marketplace much easier to leverage, since no local configuration would be needed in most cases.
Use case(s)
I would like to store consumer complaints data from http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data in my cluster. The data is updated nightly and is 260 MB, so I would like to build a pipeline that runs every 24 hours and refreshes the data from the site. Using this plugin, I configure it to pull data from the URL in CSV format and store it in HDFS. Then I configure a File source, a CSV parser, and a Table sink to process the data in Hydrator.
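To make this concrete, here is a minimal sketch, assuming nothing beyond standard Java and the Hadoop FileSystem API, of what the action boils down to for this use case: issue an HTTP GET and stream the CSV response into a file in HDFS. The URL, path, and timeouts are placeholders, not the plugin's actual defaults.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HttpToHdfsSketch {
  public static void main(String[] args) throws Exception {
    URL source = new URL("http://example.com/complaints.csv");  // placeholder URL
    Path target = new Path("/tmp/complaints/complaints.csv");   // placeholder HDFS path

    HttpURLConnection conn = (HttpURLConnection) source.openConnection();
    conn.setRequestMethod("GET");
    conn.setConnectTimeout(60_000);
    conn.setReadTimeout(60_000);

    FileSystem fs = FileSystem.get(new Configuration());
    try (InputStream in = conn.getInputStream();
         OutputStream out = fs.create(target, true)) {          // overwrite on each nightly run
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
    } finally {
      conn.disconnect();
    }
  }
}
```

Streaming the response in fixed-size chunks rather than buffering it in memory is what keeps downloads of this size (and larger) manageable.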
The inventory team publishes a product inventory XML feed that contains all the current products and the quantity of each item. I would like to configure a pipeline that queries that service every hour and writes the data into a table. Using this plugin, I configure it to request information from the URL, provide my credentials and expected format as request headers, and write the results to the /tmp/ directory in HDFS so that the rest of the pipeline can process the data.
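For this use case the only addition is the request headers; a hypothetical helper like the following could apply the credentials and expected format to the outgoing connection (the header names and token are illustrative, not required by the plugin):

```java
import java.net.HttpURLConnection;

public class InventoryFeedHeaders {
  // Hypothetical header values; in the plugin these would come from the Request Headers config.
  static void applyHeaders(HttpURLConnection conn) {
    conn.setRequestProperty("Authorization", "Bearer <api-token>");  // credentials
    conn.setRequestProperty("Accept", "application/xml");            // expected response format
  }
}
```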
I have partnered with a third-party company that is sending me a large amount of data in a gzip file. The file is stored on a web server somewhere, and I am given a URL to download it. Using this plugin, I configure it to pull the binary file, store the .gz file on HDFS, and use the File source to natively read that data for processing.
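Because the download is stored byte-for-byte in binary mode, the .gz lands in HDFS unchanged. Purely for illustration, reading it back downstream might look like the sketch below; the path is a placeholder, and in practice the File source handles the decompression natively.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadStoredGzip {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path stored = new Path("/tmp/partner-feed/data.gz");  // placeholder HDFS path
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(fs.open(stored)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // process each decompressed record here
      }
    }
  }
}
```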
User Stories
As a pipeline developer, I would like to fetch data from an external web service or URL by providing the request method (GET, POST), URL, payload (if POST), request headers, charset (if text, otherwise binary), timeouts, and a file path in HDFS.
As a pipeline developer, I would like the option to flatten multi-line JSON and XML responses into a single line by removing newlines and additional spaces (see the sketch after this list).
As a pipeline developer, I would like to download files in excess of 1 GB without failing.
As a pipeline developer, I would like the plugin to retry a configurable number of times before failing the pipeline (also covered in the sketch after this list).
As a pipeline developer, I would like to be able to download text or binary data and store it in HDFS for further processing.
As a pipeline developer, I would like to be able to send basic auth credentials by providing a username and password in the config.
As a pipeline developer, I would like to be able to read from HTTP and HTTPS endpoints.
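A minimal sketch of how the retry, flattening, and basic auth stories above might fit together. The class and method names are hypothetical, not the plugin's actual API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class HttpFetchWithRetries {

  /** Retries the fetch up to maxRetries times before giving up (failing the pipeline). */
  static byte[] fetchWithRetries(String urlString, String user, String password,
                                 int maxRetries) throws IOException, InterruptedException {
    IOException lastFailure = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(60_000);
        conn.setReadTimeout(60_000);
        // Basic auth built from the configured username and password.
        String credentials = Base64.getEncoder()
            .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + credentials);
        try (InputStream in = conn.getInputStream()) {
          return in.readAllBytes();
        } finally {
          conn.disconnect();
        }
      } catch (IOException e) {
        lastFailure = e;
        Thread.sleep(1_000L * (attempt + 1));  // simple linear backoff between attempts
      }
    }
    throw lastFailure;  // all retries exhausted: surface the error to fail the pipeline
  }

  /** Collapses a multi-line JSON or XML response onto a single line. */
  static String flatten(String response) {
    // Remove newlines and squeeze runs of whitespace down to single spaces.
    return response.replaceAll("\\r?\\n", " ").replaceAll("\\s{2,}", " ").trim();
  }
}
```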
Plugin Type
Action
Configurables
This section defines properties that are configurable for this plugin.
| User Facing Name | Type | Description | Constraints | Macro Enabled? |
| ---------------- | ---- | ----------- | ----------- | -------------- |
| HDFS File Path | String | The location to write the data in HDFS. | | yes |
| URL | String | Required. The URL to fetch data from. | | yes |
| Request Method | Select | The HTTP request method. | GET, POST | |
| Request Body | String | Optional request body. | | yes |
| Request Headers | KeyValue | An optional string of header values to send in each request, where the keys and values are delimited by a colon (":") and each pair is delimited by a newline ("\n"). | | yes |
| Text or Binary? | Select | Should the data be written as text (JSON, XML, txt files) or binary (zip, gzip, images)? | Text, Binary | |
| Charset | Select | If text data is selected, this should be the charset of the text being returned. Defaults to UTF-8. | | |
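For example, a Request Headers value with "Authorization:Basic <credentials>" on one line and "Accept:application/json" on the next could be parsed and applied roughly as follows. This is a sketch of the format described above, not the plugin's actual parsing code.

```java
import java.net.HttpURLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

public class RequestHeadersParser {

  /** Parses one "key:value" pair per line into an ordered map. */
  static Map<String, String> parse(String rawHeaders) {
    Map<String, String> headers = new LinkedHashMap<>();
    if (rawHeaders == null || rawHeaders.isEmpty()) {
      return headers;  // the field is optional
    }
    for (String line : rawHeaders.split("\n")) {
      int colon = line.indexOf(':');
      if (colon > 0) {
        headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
      }
    }
    return headers;
  }

  /** Applies the parsed headers to the outgoing request. */
  static void apply(HttpURLConnection conn, Map<String, String> headers) {
    headers.forEach(conn::setRequestProperty);
  }
}
```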