The developer wants to load webpage click and view data (customer id, timestamp, action, url) into a partitioned fileset. After loading the data, the developer wants to de-duplicate records and calculate how many times each customer clicked and viewed over the past hour, past day, and past month.
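The de-duplication and windowed counting described above can be sketched outside of any pipeline framework. The following is a minimal illustration in plain Python, not the pipeline implementation itself. The tuple layout, the function names `dedupe` and `count_actions`, and the choice of 30 days for "past month" are all assumptions made for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumed record layout: (customer_id, timestamp, action, url),
# where timestamp is a datetime and action is "click" or "view".

# Assumed window lengths; "month" is approximated as 30 days.
WINDOWS = {
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "month": timedelta(days=30),
}


def dedupe(records):
    """Drop exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


def count_actions(records, now):
    """Return {window_name: Counter({(customer_id, action): count})}."""
    counts = {name: Counter() for name in WINDOWS}
    for customer_id, ts, action, _url in dedupe(records):
        age = now - ts
        if age < timedelta(0):
            continue  # skip records timestamped after `now`
        for name, length in WINDOWS.items():
            if age <= length:
                counts[name][(customer_id, action)] += 1
    return counts
```

In a pipeline, the same logic would roughly correspond to a de-duplication stage followed by a GROUP BY on (customer id, action) with a count aggregate, run once per time window.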

User Stories:

  1. (3.4) A developer should be able to create pipelines that contain aggregations (GROUP BY -> count/sum/unique)
  2. (3.5) A developer should be able to control the order in which parts of a pipeline run, so that some parts run before others. For example, in a pipeline with multiple sources, one source -> sink branch runs before another source -> sink branch.
  3. (3.5) A developer should be able to use a Spark ML job as a pipeline stage
  4. (3.4) A developer should be able to rerun failed pipeline runs without reconfiguring the pipeline
  5. (3.4) A developer should be able to de-duplicate records in a pipeline
  6. (3.5) A developer should be able to join multiple branches of a pipeline
  7. (3.5) A developer should be able to use an Explore action as a pipeline stage
  8. (3.5) A developer should be able to create pipelines that contain Spark Streaming jobs
  9. (3.5) A developer should be able to create pipelines that run based on various conditions, including input data availability and Kafka events
