Amazon Kinesis source and sink

Overview

Amazon Kinesis Streams (similar to Kafka) provides infrastructure to collect and process large streams of data in real time. It works on the concept of shards: by default a stream can have up to 25 shards (the limit is region-specific). Streams are capable of rapid and continuous data ingestion and aggregation, and ingested data becomes available for processing in near real time.

Motivation

By adding the capability to send data to and receive data from Kinesis streams, users will be able to integrate CDAP with applications already running on AWS infrastructure. We will create a Hydrator source and sink for Kinesis, which will allow users to include Kinesis in their Hydrator pipelines. Users will be able to consume events from an existing Kinesis stream as a source and send data processed in a pipeline back to a Kinesis stream, providing more options for data processing. Users with a heavy dependency on AWS infrastructure can have a single point of data exchange between CDAP and AWS and can create access points within Kinesis for their various AWS applications.

Integrating Kinesis will also help Hydrator store and process formats that it doesn't currently support, such as processing images as blob objects.


Requirements

Kinesis provides a stream producer library (KPL) and a stream consumer library (KCL). We will explore both options, with and without these native libraries, in order to keep the plugins lightweight. Users will be able to supply their AWS credentials and add Kinesis streams to Hydrator. The sketch below illustrates the lightweight approach.
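
A minimal sketch of the lightweight (non-KPL) approach, assuming the AWS SDK for Java v1 is on the classpath. The stream name, region, and credential values are placeholders, not part of the design:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class KinesisSinkSketch {
  public static void main(String[] args) {
    // Placeholder credentials; in the plugin these would come from user configuration.
    BasicAWSCredentials credentials =
        new BasicAWSCredentials("<accessKeyId>", "<secretAccessKey>");

    // Lightweight client from the plain AWS SDK, avoiding the native KPL dependency.
    AmazonKinesis client = AmazonKinesisClientBuilder.standard()
        .withRegion("us-east-1")
        .withCredentials(new AWSStaticCredentialsProvider(credentials))
        .build();

    // A single record: the partition key determines the target shard,
    // and the data travels as an opaque blob (ByteBuffer).
    PutRecordRequest request = new PutRecordRequest()
        .withStreamName("example-stream")
        .withPartitionKey("key")
        .withData(ByteBuffer.wrap("hello kinesis".getBytes(StandardCharsets.UTF_8)));

    PutRecordResult result = client.putRecord(request);
    System.out.println("Wrote to shard " + result.getShardId()
        + " at sequence number " + result.getSequenceNumber());
  }
}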

Data is ingested into Kinesis in the form of blobs. The sink plugin should be able to accept various formats, convert them into a blob, and push it to Kinesis. Similarly, the data received from Kinesis is in key-value form with the value being a blob, so the source plugin should be able to convert it back into string format, as sketched below.
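
A minimal sketch of the blob conversion in both directions, assuming UTF-8 string payloads. Record is the AWS SDK model class returned by GetRecords; the helper names are illustrative only:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesis.model.Record;

public class BlobConversionSketch {

  // Sink direction: turn an incoming string value into the blob Kinesis expects.
  public static ByteBuffer toBlob(String value) {
    return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
  }

  // Source direction: turn the blob carried by a Kinesis record back into a string.
  public static String fromRecord(Record record) {
    ByteBuffer data = record.getData();
    byte[] bytes = new byte[data.remaining()];
    data.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}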


Inputs

Input                          Required    Default
Access Key ID                  Yes
Secret Access Key              Yes
Stream name                    Yes
Stream start point             No          SHARD.LATEST
Key (data key)                 Yes         key
Value (data to be stored)      Yes
Shard name                     No          Next available shard
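
A hypothetical sketch of how these inputs could map to a plugin configuration class. The property names and the PluginConfig base class from cdap-api are shown for illustration only and are not a finalized design:

import javax.annotation.Nullable;

import co.cask.cdap.api.annotation.Description;
import co.cask.cdap.api.plugin.PluginConfig;

// Hypothetical configuration mirroring the inputs table above.
public class KinesisConfig extends PluginConfig {

  @Description("AWS access key ID")
  private String accessKeyId;

  @Description("AWS secret access key")
  private String secretAccessKey;

  @Description("Name of the Kinesis stream")
  private String streamName;

  @Nullable
  @Description("Where to start reading the stream; defaults to SHARD.LATEST")
  private String streamStartPoint;

  @Description("Data key; defaults to 'key'")
  private String key;

  @Description("Data to be stored")
  private String value;

  @Nullable
  @Description("Shard name; defaults to the next available shard")
  private String shardName;
}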


Limitations

By default a shard can only ingest data at a rate of 1 MB/sec, and a stream can have at most 25 shards.
We will need one plugin instance per shard; a sketch of a single-shard consumer loop follows.
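
A minimal sketch of what one per-shard consumer instance could look like, assuming the AWS SDK for Java. The stream name and shard ID are placeholders, and the LATEST iterator type corresponds to the SHARD.LATEST default listed above:

import java.util.List;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.GetRecordsRequest;
import com.amazonaws.services.kinesis.model.GetRecordsResult;
import com.amazonaws.services.kinesis.model.GetShardIteratorRequest;
import com.amazonaws.services.kinesis.model.Record;

public class ShardConsumerSketch {
  public static void main(String[] args) throws InterruptedException {
    AmazonKinesis client = AmazonKinesisClientBuilder.defaultClient();

    // Each plugin instance owns exactly one shard.
    GetShardIteratorRequest iteratorRequest = new GetShardIteratorRequest()
        .withStreamName("example-stream")
        .withShardId("shardId-000000000000")
        .withShardIteratorType("LATEST");
    String shardIterator = client.getShardIterator(iteratorRequest).getShardIterator();

    while (shardIterator != null) {
      GetRecordsResult result = client.getRecords(
          new GetRecordsRequest().withShardIterator(shardIterator).withLimit(100));
      List<Record> records = result.getRecords();
      System.out.println("Fetched " + records.size() + " records from the shard");

      // Advance the iterator and back off briefly to stay within per-shard read limits.
      shardIterator = result.getNextShardIterator();
      Thread.sleep(1000);
    }
  }
}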

