AWS DynamoDB batch sink

Introduction

A batch sink that writes data from Hydrator pipelines into DynamoDB tables.

Use case(s)

  • An organization wants to parse the logs generated by a system and store the metadata in DynamoDB tables.

User Stories

Plugin Type

Batch Source
Batch Sink 
Real-time Source
Real-time Sink
Action
Post-Run Action
Aggregate
Join
Spark Model
Spark Compute

Configurables

This section defines properties that are configurable for this plugin. 

| User Facing Name | Type | Description | Constraints |
| --- | --- | --- | --- |
| Table name | String | Name of the DynamoDB table | Naming convention constraints from AWS |
| Primary key fields | List<Map<String,String>> | Primary key fields of the table | There should be at least 1 primary key |
| endpoint url | String | AWS endpoint URL for the DynamoDB instance | Optional; can be constructed from the region id |
| region id | String | AWS region id for the DynamoDB instance | |
| access id | String | AWS access id | |
| access key | password | AWS access key | |
| Primary key types | List<Map<String,String>> | Key types for the primary keys, used for creating the table | The primary key type can only have 2 values: HASH and RANGE |
| Read capacity units | Long | The number of strongly consistent reads per second for items up to 4 KB in size | |
| Write capacity units | Long | The number of 1 KB writes per second | |
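As a rough illustration of how the two capacity settings relate to item size, the sketch below estimates the units a workload would need. It assumes the usual DynamoDB rounding rule (item sizes are rounded up to the next 4 KB for strongly consistent reads and to the next 1 KB for writes). The function names are hypothetical and are not part of the plugin.

```python
import math


def read_capacity_units(item_size_bytes: int, reads_per_second: int) -> int:
    """Units needed for strongly consistent reads (one unit covers up to 4 KB)."""
    units_per_read = max(1, math.ceil(item_size_bytes / 4096))
    return units_per_read * reads_per_second


def write_capacity_units(item_size_bytes: int, writes_per_second: int) -> int:
    """Units needed for writes (one unit covers up to 1 KB)."""
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return units_per_write * writes_per_second


print(read_capacity_units(6 * 1024, 10))  # 6 KB items, 10 reads/s -> 20
print(write_capacity_units(2560, 10))     # 2.5 KB items, 10 writes/s -> 30
```

An estimate like this may help when choosing values for a table that the sink creates itself.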

 

Design / Implementation Tips

Design

DynamoDB Sink JSON format:

{
  "name": "DynamoDb",
  "type": "batchsink",
  "properties": {
    "endpointUrl": "",
    "regionId": "us-east-1",
    "accessKey": "xyz",
    "secretAccessKey": "abc",
    "tableName": "Movies",
    "primaryKeyFields": "Id:N",
    "primaryKeyTypes": "Id:HASH",
    "readCapacityUnits": "10",
    "writeCapacityUnits": "10"
  }
}
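In the configuration above, endpointUrl is empty and only regionId is set. A minimal sketch of how the endpoint might be derived in that case follows. It assumes the standard regional endpoint pattern dynamodb.<region>.amazonaws.com (with the .cn suffix for China regions) and falls back to us-west-2, the default named in this design. The helper resolve_endpoint is hypothetical, not the plugin's actual code.

```python
import json

DEFAULT_REGION = "us-west-2"  # default region named in this design


def resolve_endpoint(properties: dict) -> str:
    """Return endpointUrl if it is set, otherwise derive it from regionId."""
    endpoint = (properties.get("endpointUrl") or "").strip()
    if endpoint:
        return endpoint
    region = (properties.get("regionId") or "").strip() or DEFAULT_REGION
    if region == "getCurrentRegion":
        # Resolving this value needs EC2 instance metadata, which this sketch does not query.
        raise ValueError("getCurrentRegion must be resolved on EC2 first")
    # Regions in the China partition use a different DNS suffix.
    suffix = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return f"https://dynamodb.{region}.{suffix}"


stage = json.loads(
    '{"name": "DynamoDb", "type": "batchsink", '
    '"properties": {"endpointUrl": "", "regionId": "us-east-1"}}'
)
print(resolve_endpoint(stage["properties"]))  # https://dynamodb.us-east-1.amazonaws.com
```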

 

Approach(es)

  1. A drop-down listing the regions will be provided so that the user can select the AWS region in which to connect to DynamoDB. Supported regions are:
    "us-gov-west-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2", "eu-west-1", "eu-west-2", "eu-central-1", "ap-south-1", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "ap-northeast-2", "sa-east-1", "cn-north-1", "ca-central-1", "getCurrentRegion".
    (Referred from: http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/ and http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region)

  2. If the user does not select a region, the default region, us-west-2, will be used.

  3. Selecting getCurrentRegion from the list returns a Region object representing the region the application is running in, when running in EC2. If it is called from a non-EC2 environment, it returns null.

  4. The plugin will support the following CDAP data types in the schema: String, Number (int, long, float, double), Boolean, NULL, Map, List, Array of String, Array of Number, and Bytes (converted to binary when stored in DynamoDB).

  5. A key-value drop-down takes the names of the primary key fields and their attribute types. The drop-down will allow the following values: String, Number (int, long, float, double), Boolean, NULL, Map, List, Array of String, and Array of Number.

  6. A key-value drop-down takes the names of the primary key fields and their attribute types. The drop-down will have the following values: "N" (number), "S" (string), and "B" (binary; the byte[] value received from the previous stage will be converted to binary when storing the data in DynamoDB).

  7. The plugin will retry writing the data up to 3 times before failing.

  8. Maximum batch size = 25 items, maximum individual item size = 400 KB, maximum total request size for each batch = 16 MB

    (http://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html)
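The type mapping, the batch limit, and the retry rule listed above can be sketched together as a small write path. This is an illustration under stated assumptions, not the plugin's implementation. The put_batch callable is a hypothetical stand-in for the BatchWriteItem call and returns the requests that were not processed. "Retry 3 times" is read here as three attempts in total. Arrays are written as lists ("L"), with string and number sets left out. The 400 KB and 16 MB size limits are not checked.

```python
import base64
import time
from typing import Any, Callable, Dict, Iterator, List

MAX_BATCH_ITEMS = 25  # BatchWriteItem limit on requests per call
MAX_ATTEMPTS = 3      # assumption: three attempts in total per batch


def to_attribute_value(value: Any) -> Dict[str, Any]:
    """Map a Python value to DynamoDB's JSON attribute-value form."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers are sent as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (bytes, bytearray)):
        return {"B": base64.b64encode(bytes(value)).decode("ascii")}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    if isinstance(value, (list, tuple)):
        return {"L": [to_attribute_value(v) for v in value]}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def chunk(items: List[Any], size: int = MAX_BATCH_ITEMS) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_all(records: List[Dict[str, Any]],
              put_batch: Callable[[List[dict]], List[dict]],
              max_attempts: int = MAX_ATTEMPTS,
              backoff_seconds: float = 0.0) -> int:
    """Write records in batches, resending unprocessed requests; return the count written."""
    requests = [
        {"PutRequest": {"Item": {k: to_attribute_value(v) for k, v in r.items()}}}
        for r in records
    ]
    written = 0
    for batch in chunk(requests):
        pending = batch
        for attempt in range(1, max_attempts + 1):
            pending = put_batch(pending)  # hypothetical hook; returns unprocessed requests
            if not pending:
                break
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)
        if pending:
            raise RuntimeError(
                f"{len(pending)} requests unprocessed after {max_attempts} attempts")
        written += len(batch)
    return written
```

Passing the sender in as a callable keeps the batching and retry logic testable without an AWS connection.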

 

Properties

  • endpointUrl: The AWS endpoint URL (see http://docs.aws.amazon.com/general/latest/gr/rande.html#ddb_region). Optional; it can be constructed from regionId.

  • regionId: The AWS region in which to connect to DynamoDB.

  • accessKey: Access key for AWS DynamoDB.

  • secretAccessKey: Secret access key for AWS DynamoDB.

  • tableName: The table to write the data to. If the specified table does not exist, it will be created using the primary key attributes, the key schema, and the read and write capacity units.

  • primaryKeyFields: A comma-separated list of key-value pairs representing the primary key and its attribute type. The attribute type can have the following values: "N"(Number), "S"(String) or "B"(binary).

  • primaryKeyTypes: A comma-separated list of key-value pairs representing the primary key and its key type. The key type can have the following values: "HASH" or "RANGE".

  • readCapacityUnits: The maximum number of strongly consistent reads consumed per second before DynamoDB returns a ThrottlingException. This is used when creating a new table if the table name specified by the user does not exist.

  • writeCapacityUnits: The maximum number of writes consumed per second before DynamoDB returns a ThrottlingException. This is used when creating a new table if the table name specified by the user does not exist.
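To make the primaryKeyFields and primaryKeyTypes formats concrete, the sketch below parses both strings and builds the attribute definitions and key schema that a CreateTable request expects. It assumes the name:value pair syntax used in the sample configuration ("Id:N", "Id:HASH"). The function names are hypothetical and the validation rules are a reasonable guess at what the plugin would enforce.

```python
from typing import Dict, List, Tuple

VALID_ATTRIBUTE_TYPES = {"N", "S", "B"}
VALID_KEY_TYPES = {"HASH", "RANGE"}


def parse_pairs(spec: str) -> Dict[str, str]:
    """Parse 'name:value,name:value' into an ordered dict."""
    pairs: Dict[str, str] = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, value = entry.partition(":")
        if not sep or not name.strip() or not value.strip():
            raise ValueError(f"expected name:value, got {entry!r}")
        pairs[name.strip()] = value.strip()
    return pairs


def build_key_definitions(primary_key_fields: str,
                          primary_key_types: str) -> Tuple[List[dict], List[dict]]:
    """Return (AttributeDefinitions, KeySchema) for a CreateTable request."""
    fields = parse_pairs(primary_key_fields)
    types = parse_pairs(primary_key_types)
    if not fields:
        raise ValueError("at least one primary key field is required")
    if set(fields) != set(types):
        raise ValueError("primaryKeyFields and primaryKeyTypes must name the same keys")
    for name, attr_type in fields.items():
        if attr_type not in VALID_ATTRIBUTE_TYPES:
            raise ValueError(f"{name}: attribute type must be N, S or B")
    for name, key_type in types.items():
        if key_type not in VALID_KEY_TYPES:
            raise ValueError(f"{name}: key type must be HASH or RANGE")
    if list(types.values()).count("HASH") != 1 or len(types) > 2:
        raise ValueError("expected exactly one HASH key and at most one RANGE key")
    attribute_definitions = [
        {"AttributeName": n, "AttributeType": t} for n, t in fields.items()
    ]
    # DynamoDB expects the HASH element first in the key schema.
    key_schema = sorted(
        ({"AttributeName": n, "KeyType": t} for n, t in types.items()),
        key=lambda k: k["KeyType"] != "HASH",
    )
    return attribute_definitions, key_schema


attrs, schema = build_key_definitions("Id:N", "Id:HASH")
print(attrs)   # [{'AttributeName': 'Id', 'AttributeType': 'N'}]
print(schema)  # [{'AttributeName': 'Id', 'KeyType': 'HASH'}]
```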

Security

  • The AWS access keys should be password fields and should be macro-enabled.



NFR

1. This plugin should be able to write the structured records received from the previous stage to the DynamoDB table successfully.

2. Only performance measurement is in scope as part of NFR.

Limitation(s)

Future Work

Test Case(s)

DynamoDB batch sink - using partition key.

DynamoDB batch sink - using partition and sort key.

DynamoDB batch sink - using 100k data

Sample Pipeline

Checklist

User stories documented 
User stories reviewed 
Design documented 
Design reviewed 
Feature merged 
Examples and guides 
Integration tests 
Documentation for feature 
Short video demonstrating the feature

Created in 2020 by Google Inc.