Many datasets are available from web URLs. Examples include government data such as http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data or datasets from https://www.data.gov/. The problem is that in order to set up a repeatable process to pull this information into your cluster, you must first download the data and write it to a file before you can use it in a batch workflow. Ideally, there should be a way to configure a single pipeline that pulls the data in, stores it in a temporary file on HDFS, and then kicks off a Spark or MapReduce workflow to process and load it into the cluster. This would also make demos and example pipelines from the marketplace much easier to leverage, since no local configuration would be needed in most cases.
Use case(s)
I would like to store consumer complaints data from http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data in my cluster. The data is updated nightly and is 260 MB, so I would like to build a pipeline that runs every 24 hours and refreshes the data from the site. Using this plugin, I configure it to pull data from the URL in CSV format and store it in HDFS. Then I configure a File source, a CSV parser, and a Table sink to process the data in Hydrator.
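To make this concrete, here is a minimal sketch, assuming nothing beyond standard Java and the Hadoop FileSystem API, of what the action boils down to for this use case: issue an HTTP GET and stream the CSV response into a file in HDFS. The URL, path, and timeouts are placeholders, not the plugin's actual defaults.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HttpToHdfsSketch {
  public static void main(String[] args) throws Exception {
    URL source = new URL("http://example.com/complaints.csv");  // placeholder URL
    Path target = new Path("/tmp/complaints/complaints.csv");   // placeholder HDFS path

    HttpURLConnection conn = (HttpURLConnection) source.openConnection();
    conn.setRequestMethod("GET");
    conn.setConnectTimeout(60_000);
    conn.setReadTimeout(60_000);

    FileSystem fs = FileSystem.get(new Configuration());
    try (InputStream in = conn.getInputStream();
         OutputStream out = fs.create(target, true)) {          // overwrite on each nightly run
      byte[] buffer = new byte[8192];
      int read;
      while ((read = in.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
    } finally {
      conn.disconnect();
    }
  }
}
```

Streaming the response in fixed-size chunks rather than buffering it in memory is what keeps downloads of this size (and larger) manageable.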
The inventory team publishes a product inventory XML feed that contains all the current products and the quantity of each item. I would like to configure a pipeline that queries that service every hour and writes the data into a table. Using this plugin, I configure it to request information from the URL, provide my credentials and expected format as request headers, and write the results to the /tmp/ directory in HDFS so that the rest of the pipeline can process the data.
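For this use case the only addition is the request headers; a hypothetical helper like the following could apply the credentials and expected format to the outgoing connection (the header names and token are illustrative, not required by the plugin):

```java
import java.net.HttpURLConnection;

public class InventoryFeedHeaders {
  // Hypothetical header values; in the plugin these would come from the Request Headers config.
  static void applyHeaders(HttpURLConnection conn) {
    conn.setRequestProperty("Authorization", "Bearer <api-token>");  // credentials
    conn.setRequestProperty("Accept", "application/xml");            // expected response format
  }
}
```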
I have partnered with a third-party company that is sending me a large amount of data in a gzip file. The file is stored on a web server somewhere, and I am given a URL to download it. Using this plugin, I configure it to pull the binary file, store the .gz file on HDFS, and use the File source to natively read that data for processing.
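Because the download is stored byte-for-byte in binary mode, the .gz lands in HDFS unchanged. Purely for illustration, reading it back downstream might look like the sketch below; the path is a placeholder, and in practice the File source handles the decompression natively.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadStoredGzip {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path stored = new Path("/tmp/partner-feed/data.gz");  // placeholder HDFS path
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(fs.open(stored)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // process each decompressed record here
      }
    }
  }
}
```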
User Stories
As a pipeline developer, I would like to fetch data from an external web service or URL by providing the request method (GET, POST), URL, payload (if POST), request headers, charset (if text, otherwise binary), timeouts, and a file path in HDFS.
As a pipeline developer, I would like the option to flatten multi-line JSON and XML responses into a single line by removing newlines and additional spaces (see the sketch after this list).
As a pipeline developer, I would like to download files in excess of 1 GB without failing.
As a pipeline developer, I would like the plugin to retry a configurable number of times before failing the pipeline (also covered in the sketch after this list).
As a pipeline developer, I would like to be able to download text or binary data and store it in HDFS for further processing.
As a pipeline developer, I would like to be able to send basic auth credentials by providing a username and password in the config.
As a pipeline developer, I would like to be able to read from HTTP and HTTPS endpoints.
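A minimal sketch of how the retry, flattening, and basic auth stories above might fit together. The class and method names are hypothetical, not the plugin's actual API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class HttpFetchWithRetries {

  /** Retries the fetch up to maxRetries times before giving up (failing the pipeline). */
  static byte[] fetchWithRetries(String urlString, String user, String password,
                                 int maxRetries) throws IOException, InterruptedException {
    IOException lastFailure = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(60_000);
        conn.setReadTimeout(60_000);
        // Basic auth built from the configured username and password.
        String credentials = Base64.getEncoder()
            .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + credentials);
        try (InputStream in = conn.getInputStream()) {
          return in.readAllBytes();
        } finally {
          conn.disconnect();
        }
      } catch (IOException e) {
        lastFailure = e;
        Thread.sleep(1_000L * (attempt + 1));  // simple linear backoff between attempts
      }
    }
    throw lastFailure;  // all retries exhausted: surface the error to fail the pipeline
  }

  /** Collapses a multi-line JSON or XML response onto a single line. */
  static String flatten(String response) {
    // Remove newlines and squeeze runs of whitespace down to single spaces.
    return response.replaceAll("\\r?\\n", " ").replaceAll("\\s{2,}", " ").trim();
  }
}
```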
Plugin Type
Action
Configurables
This section defines properties that are configurable for this plugin.
| User Facing Name | Type | Description | Constraints | Macro Enabled? |
| ---------------- | ---- | ----------- | ----------- | -------------- |
| HDFS File Path | String | The location to write the data in HDFS. | | yes |
| URL | String | Required. The URL to fetch data from. | | yes |
| Request Method | Select | The HTTP request method. | GET, POST | |
| Request Body | String | Optional request body. | | yes |
| Request Headers | KeyValue | An optional string of header values to send in each request, where the keys and values are delimited by a colon (":") and each pair is delimited by a newline ("\n"). | | yes |
| Text or Binary? | Select | Should the data be written as text (JSON, XML, txt files) or binary (zip, gzip, images)? | Text, Binary | |
| Charset | Select | If text data is selected, this should be the charset of the text being returned. Defaults to UTF-8. | | |
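For example, a Request Headers value with "Authorization:Basic <credentials>" on one line and "Accept:application/json" on the next could be parsed and applied roughly as follows. This is a sketch of the format described above, not the plugin's actual parsing code.

```java
import java.net.HttpURLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

public class RequestHeadersParser {

  /** Parses one "key:value" pair per line into an ordered map. */
  static Map<String, String> parse(String rawHeaders) {
    Map<String, String> headers = new LinkedHashMap<>();
    if (rawHeaders == null || rawHeaders.isEmpty()) {
      return headers;  // the field is optional
    }
    for (String line : rawHeaders.split("\n")) {
      int colon = line.indexOf(':');
      if (colon > 0) {
        headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
      }
    }
    return headers;
  }

  /** Applies the parsed headers to the outgoing request. */
  static void apply(HttpURLConnection conn, Map<String, String> headers) {
    headers.forEach(conn::setRequestProperty);
  }
}
```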