Introduction
There are many datasets available from web URLs. Examples include government data such as http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data or datasets from https://www.data.gov/. The problem is that to set up a repeatable process for pulling this information into your cluster, you must first download the data and write it into a file before you can use it in a batch workflow. Ideally, a single pipeline could be configured to pull that data in, store it in a temporary file on HDFS, and then kick off a Spark or MapReduce workflow to process and load it into the cluster. This would also make demos and example pipelines from the marketplace much easier to leverage, since in most cases no local configuration would be needed.
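The core flow described above can be sketched as follows. This is an illustrative Python stand-in (the actual plugin would be a Java action writing to HDFS); `fetch_to_file` is a hypothetical helper name, and the destination here is a local path rather than HDFS:

```python
import shutil
import urllib.request

def fetch_to_file(url, dest_path, timeout=60):
    """Stream the response body at `url` into `dest_path`.

    Streaming with copyfileobj avoids buffering the whole response in
    memory, which matters for the >1 GB downloads in the user stories.
    """
    with urllib.request.urlopen(url, timeout=timeout) as resp, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest_path
```

A downstream File source would then read from the returned path, exactly as in the consumer-complaints use case below.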
Use case(s)
- I would like to store consumer complaints data from http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data in my cluster. The data is updated nightly and is about 260 MB, so I would like to build a pipeline that runs every 24 hours and refreshes the data from the site. Using this plugin, I configure it to pull data from the URL in CSV format and store it in HDFS. I then configure a File source, a CSV parser, and a Table sink to process the data in Hydrator.
- The inventory team publishes a product inventory XML feed that contains all current products and their quantities. I would like to configure a pipeline that queries that service every hour and writes the data into a table. Using this plugin, I configure it to request information from the URL, provide my credentials and expected format as request headers, and write the results to the /tmp/ directory in HDFS so that the rest of the pipeline can process the data.
- I have partnered with a third-party company that is sending me a large amount of data in a gzip file. The file is stored on a web server somewhere, and I am given a URL to download it. Using this plugin, I configure it to pull the binary file, store the .gz file in HDFS, and use the File source to natively read that data for processing.
User Storie(s)
- As a pipeline developer, I would like to fetch data from an external web service or URL by providing the request method (GET or POST), URL, payload (if POST), request headers, charset (if text; otherwise binary), timeouts, and a file path in HDFS.
- As a pipeline developer, I would like the option to flatten multi-line JSON and XML responses into a single line by removing newlines and extra spaces.
- As a pipeline developer, I would like to download files in excess of 1 GB without failure.
- As a pipeline developer, I would like the plugin to retry a configurable number of times before failing the pipeline.
- As a pipeline developer, I would like to be able to download text or binary data and store it in HDFS for further processing.
- As a pipeline developer, I would like to be able to send basic auth credentials by providing a username and password in the config.
- As a pipeline developer, I would like to be able to read from both HTTP and HTTPS endpoints.
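The flattening story above could be implemented roughly as sketched here (a Python illustration under stated assumptions: JSON is flattened by re-serializing without whitespace, and XML by collapsing whitespace between tags with a deliberately simplistic regex; the real plugin might use a proper XML parser instead):

```python
import json
import re

def flatten_json(text):
    # Parse and re-serialize with compact separators, which drops all
    # newlines and indentation from the original response.
    return json.dumps(json.loads(text), separators=(",", ":"))

def flatten_xml(text):
    # Collapse newlines and runs of whitespace that sit between tags.
    # Note: this does not touch whitespace inside text nodes.
    return re.sub(r">\s+<", "><", text.strip())
```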
Plugin Type
- Action
Configurables
This section defines properties that are configurable for this plugin.
User Facing Name | Type | Description | Constraints | Macro Enabled? |
---|---|---|---|---|
HDFS File Path | String | Required. The location in HDFS to write the data to. | | yes |
URL | String | Required. The URL to fetch data from. | | yes |
Request Method | Select | The HTTP request method. | GET, POST | |
Request Body | String | Optional request body. | | yes |
Request Headers | KeyValue | An optional set of header key/value pairs to send with each request. | | yes |
Text or Binary? | Select | Should the data be written as text (JSON, XML, txt files) or binary (zip, gzip, images) data? | Text, Binary | |
Charset | Select | If text data is selected, the charset of the text being returned. Defaults to UTF-8. | "ISO-8859-1", "US-ASCII", "UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE" | |
Should Follow Redirects? | Select | Whether to automatically follow redirects. Defaults to true. | true, false | |
Number of Retries | Select | The number of times the request should be retried if it fails. Defaults to 0. | 0,1,2,3,4,5,6,7,8,9,10 | |
Connect Timeout | String | The time in milliseconds to wait for a connection. Set to 0 for infinite. Defaults to 60000 (1 minute). | | |
Read Timeout | String | The time in milliseconds to wait for a read. Set to 0 for infinite. Defaults to 60000 (1 minute). | | |
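For the consumer-complaints use case, the property values might look like the following. This is illustrative only: the JSON property keys are hypothetical until the plugin's config class is implemented.

```json
{
  "hdfsFilePath": "/tmp/consumer-complaints.csv",
  "url": "http://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data",
  "requestMethod": "GET",
  "outputFormat": "Text",
  "charset": "UTF-8",
  "followRedirects": "true",
  "numRetries": "3",
  "connectTimeout": "60000",
  "readTimeout": "60000"
}
```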
Design / Implementation Tips
- Use the HTTPPoller and HTTPCallback Hydrator plugins as a reference.
- The workflow token should contain the file path of the data that was written, so that a File source can read from it.
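The "Number of Retries" behavior from the user stories could follow a simple bounded-retry loop. A Python sketch (the real plugin, in Java, would wrap its HTTP call the same way; the delay parameter is an assumption not specified in the requirements):

```python
import time

def with_retries(fn, num_retries=0, delay_seconds=1.0):
    """Call `fn`; on failure, retry up to `num_retries` more times,
    then re-raise the last error. Mirrors the 'Number of Retries'
    property, which defaults to 0 (no retries)."""
    last_error = None
    for attempt in range(num_retries + 1):
        try:
            return fn()
        except Exception as err:
            last_error = err
            if attempt < num_retries:
                time.sleep(delay_seconds)
    raise last_error
```

Only after the final attempt fails should the action fail the pipeline run.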
Design
Approach(s)
Properties
Security
Limitation(s)
Future Work
- Some future work – HYDRATOR-99999
- Another future work – HYDRATOR-99999
Test Case(s)
- Test case #1
- Test case #2
Sample Pipeline
Please attach one or more sample pipeline(s) and associated data.
Pipeline #1
Pipeline #2
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature