Introduction

One of the most efficient ways to move data from Amazon Redshift to S3 is the UNLOAD command: http://docs.aws.amazon.com/redshift/latest/dg/t_Unloading_tables.html
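
For orientation, the sketch below shows the bare mechanics of issuing an UNLOAD statement over JDBC. The cluster endpoint, credentials, and bucket are placeholders rather than anything prescribed by this design, and the Redshift JDBC driver is assumed to be on the classpath.

Code Block
languagejava
titleIssuing UNLOAD over JDBC (sketch)
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class UnloadSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder cluster URL; the format matches the Configurables section below.
    String url = "jdbc:redshift://<endpoint-address>:<endpoint-port>/<db-name>";
    try (Connection conn = DriverManager.getConnection(url, "masteruser", "password");
         Statement stmt = conn.createStatement()) {
      // UNLOAD wraps a select statement and writes its result set to the S3 prefix.
      stmt.execute(
          "UNLOAD ('select * from venue') "
        + "TO 's3://mybucket/venue_' "
        + "CREDENTIALS 'aws_access_key_id=<access-key>;aws_secret_access_key=<secret-key>'");
    }
  }
}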

Use case(s)

  • A financial customer would like to quickly unload financial reports into S3 that have been generated from processing in Redshift. The pipeline would start with a Redshift to S3 action and then leverage the S3 source to read that data into a processing pipeline.

User Story(s)

  • As a user, I would like to unload data from Redshift to S3 using the UNLOAD command.
  • I would like to authenticate with IAM credentials as well as ID and secret key pairs.
  • I would like to use that S3 data as an input into a Hydrator pipeline.
  • I would like the location of the data to be passed via workflow token so that the next plugin can use it in a macro.


Plugin Type

  • Action

Configurables

This section defines properties that are configurable for this plugin.

User Facing Name | Type | Description | Constraints
Query | String | Select statement to be used for unloading the data |
Access Key | String | AWS access key for S3 | Optional
Secret Access Key | String | AWS secret access key for S3 | Optional
AWS IAM Role | String | IAM role | Optional (cannot be specified together with access and secret access keys)
S3 Data Path | String | Amazon S3 bucket (including key prefix); should be of the format 's3://object-path/name-prefix' |
Output Path Token | String | Key used to store the location of the data, passed via workflow token so that the next plugin can use it in a macro; default is "filePath" | Optional
Manifest | Boolean | Whether a manifest file is to be created while unloading the data into S3; default is false | Optional
S3 Delimiter | String | The delimiter by which fields in a character-delimited file are separated | Optional
Parallel | Boolean | Whether to write data in parallel to multiple files, according to the number of slices in the cluster, or to a single file; default is true | Optional
Compression | String | Unload a compressed file of type BZIP2 or GZIP; default is NONE | Optional
Allow Over-Write | Boolean | By default, UNLOAD fails if it finds files that it would possibly overwrite; if ALLOWOVERWRITE is specified, UNLOAD will overwrite existing files, including the manifest file; default is false | Optional
Add Quotes | Boolean | Places quotation marks around each unloaded data field, so that Amazon Redshift can unload data values that contain the delimiter itself; if you use ADDQUOTES, you must specify REMOVEQUOTES in the COPY if you reload the data; default is false | Optional
Escape | Boolean | If true, an escape character (\) is placed before every occurrence of: linefeed (\n), carriage return (\r), the delimiter character specified for the unloaded data, the escape character (\), and the quote character " or ' (if both ESCAPE and ADDQUOTES are specified); default is false | Optional
JDBC Redshift Cluster Database URL | String | JDBC Redshift DB URL for connecting to the Redshift cluster; should include the port and DB name, in the format "jdbc:redshift://<endpoint-address>:<endpoint-port>/<db-name>" |
Redshift Master User | String | Master user for Redshift |
Master User Password | String | Master user password |
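
As a hedged sketch of how these configurables could be folded into the UNLOAD statement, the helper below assembles the authorization clause and option list; the method and parameter names are illustrative rather than the plugin's actual code.

Code Block
languagejava
titleAssembling the UNLOAD statement from the configurables (sketch)
static String buildUnloadStatement(String query, String s3DataPath, String accessKey,
                                   String secretAccessKey, String iamRole, String delimiter,
                                   boolean manifest, boolean parallel, String compression,
                                   boolean allowOverwrite, boolean addQuotes, boolean escape) {
  StringBuilder unload = new StringBuilder();
  // Single quotes inside the select statement must be doubled when embedded in UNLOAD.
  unload.append("UNLOAD ('").append(query.replace("'", "''")).append("') ");
  unload.append("TO '").append(s3DataPath).append("' ");
  if (iamRole != null && !iamRole.isEmpty()) {
    // An IAM role and a key pair are mutually exclusive (see the constraints above).
    unload.append("IAM_ROLE '").append(iamRole).append("' ");
  } else {
    unload.append("CREDENTIALS 'aws_access_key_id=").append(accessKey)
          .append(";aws_secret_access_key=").append(secretAccessKey).append("' ");
  }
  if (manifest) { unload.append("MANIFEST "); }
  if (delimiter != null && !delimiter.isEmpty()) {
    unload.append("DELIMITER '").append(delimiter).append("' ");
  }
  if (!parallel) { unload.append("PARALLEL OFF "); }
  if (compression != null && !"NONE".equals(compression)) {
    unload.append(compression).append(" "); // GZIP or BZIP2
  }
  if (allowOverwrite) { unload.append("ALLOWOVERWRITE "); }
  if (addQuotes) { unload.append("ADDQUOTES "); }
  if (escape) { unload.append("ESCAPE "); }
  return unload.toString().trim();
}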


Design / Implementation Tips

Design

1. The user can connect to the S3 bucket using either an access and secret access key pair or an IAM role.
2. To use an IAM role for unloading the data, the role should have GET, LIST and PUT access on the bucket.
3. Existing S3 source plugins read data from regions that support signature versions 2 and 4. Hence, the RedshiftToS3 plugin will support only these regions: "us-east-1", "us-west-1", "us-west-2", "ap-southeast-1", "ap-southeast-2", "ap-northeast-1", "eu-west-1" and "sa-east-1".
4. UNLOAD automatically creates files using Amazon S3 server-side encryption with AWS-managed encryption keys (SSE-S3). UNLOAD does not support Amazon S3 server-side encryption with encryption keys from SSE-KMS or a customer-supplied key (SSE-C).
5. The plugin will therefore always unload data using Amazon S3 server-side encryption (SSE-S3).
6. The user should provide the complete JDBC Redshift DB URL for connecting to the Redshift cluster. Any error in the connection string will result in a run-time exception.
7. If the table referenced in the UNLOAD's select query does not exist, the plugin will throw a run-time exception.
8. If the user passes a name-prefix in the S3 data path, the plugin will append * to the path and pass this new location as a workflow token. The next stage will then use file path globbing to read all unloaded files starting with the prefix (see the sketch below).
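
A minimal sketch of note 8 is shown below, assuming the Hydrator action API of this era (co.cask.cdap.etl.api.action); the class shape and field names are illustrative, not the final plugin code.

Code Block
languagejava
titlePublishing the output path as a workflow token (sketch)
import co.cask.cdap.etl.api.action.Action;
import co.cask.cdap.etl.api.action.ActionContext;

public class RedshiftToS3Action extends Action {
  private String s3DataPath;      // e.g. "s3://mybucket/test_"
  private String outputPathToken; // token key; assumed to default to "filePath"

  @Override
  public void run(ActionContext context) throws Exception {
    // ... execute the UNLOAD statement here (see the Properties section) ...
    String key = (outputPathToken == null || outputPathToken.isEmpty())
        ? "filePath" : outputPathToken;
    // Append '*' so the next stage can glob every unloaded file sharing the prefix,
    // e.g. test_0000_part_00, test_0001_part_00, ...
    context.getArguments().set(key, s3DataPath + "*");
  }
}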

     

Approach(s)

1. The plugin will check that the query starts with select and contains a from clause (see the sketch below). Any other error in the SQL statement will result in a run-time failure.
2. The user can provide either the credentials (accessKey and secretAccessKey) or an IAM role (which has GET, LIST and PUT permissions on the bucket) for unloading the data from Redshift to S3.
3. The S3 data path is the full path, including the bucket name, to the location on Amazon S3 where Amazon Redshift will write the output file objects, including the manifest file if MANIFEST is specified. The object names are prefixed with the name-prefix.
4. A single ASCII character used to separate fields in the output file, such as a pipe (|), a comma (,), or a tab (\t), can be specified using the config parameter "delimiter". The default delimiter is a pipe character. If the data contains the delimiter character, the user will need to specify the ESCAPE option to escape the delimiter, or use ADDQUOTES to enclose the data in double quotes.
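
Approach 1 amounts to a lightweight sanity check before the query is embedded in the UNLOAD statement; a sketch follows (the method name is illustrative), with anything subtler left for Redshift to reject at run time.

Code Block
languagejava
titleValidating the unload query (sketch)
static void validateQuery(String query) {
  String normalized = query.trim().toLowerCase();
  // Only verify the basic "select ... from ..." shape; Redshift validates the rest.
  if (!normalized.startsWith("select") || !normalized.contains(" from ")) {
    throw new IllegalArgumentException(
        "Query must be a select statement with a from clause, but was: " + query);
  }
}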

     

Properties

Code Block
languagejs
titleRedshiftToS3 - Action: JSON
{
  "name": "RedshiftToS3Action",
  "type": "action",
  "properties": {
    "accessKey": "accessKey",
    "secretAccessKey": "secretAccessKey",
    "iamRole": "",
    "query": "select * from venue",
    "s3DataPath": "s3://mybucket/test_",
    "outputPathToken": "",
    "manifest": "false",
    "delimiter": ",",
    "parallel": "false",
    "compression": "NONE",
    "allowOverwrite": "true",
    "addQuotes": "true",
    "escape": "false",
    "redshiftClusterURL": "jdbc:redshift://test.cntuu3e3qg5h.us-west-1.redshift.amazonaws.com:5439/redshiftdb",
    "redshiftMasterUser": "admin",
    "redshiftMasterPassword": "admin"
  }
}
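
For reference, the sample configuration above would translate into roughly the following UNLOAD statement; it is shown as a Java constant to make the quoting explicit, and is an illustration rather than captured plugin output.

Code Block
languagejava
titleStatement produced by the sample configuration (sketch)
// manifest and escape are false and compression is NONE, so those options are omitted;
// parallel=false becomes PARALLEL OFF.
static final String SAMPLE_UNLOAD =
    "UNLOAD ('select * from venue') "
  + "TO 's3://mybucket/test_' "
  + "CREDENTIALS 'aws_access_key_id=accessKey;aws_secret_access_key=secretAccessKey' "
  + "DELIMITER ',' PARALLEL OFF ALLOWOVERWRITE ADDQUOTES";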

     

Security

Limitation(s)

1. The Amazon S3 bucket where Amazon Redshift will write the output files must reside in the same region as the cluster (http://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html).
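
This limitation could be surfaced as a pre-flight check instead of a run-time UNLOAD failure. The sketch below assumes the AWS SDK for Java v1 is available; the method name and the way the cluster region is obtained are illustrative.

Code Block
languagejava
titleChecking that bucket and cluster share a region (sketch)
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

static void checkBucketRegion(String bucketName, String clusterRegion) {
  AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
  // SDK v1 reports the legacy "US" location for buckets in us-east-1.
  String bucketRegion = s3.getBucketLocation(bucketName);
  if ("US".equals(bucketRegion)) {
    bucketRegion = "us-east-1";
  }
  if (!bucketRegion.equals(clusterRegion)) {
    throw new IllegalArgumentException("S3 bucket is in " + bucketRegion
        + " but the Redshift cluster is in " + clusterRegion + "; they must match.");
  }
}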

Future Work

  • Some future work – HYDRATOR-99999
  • Another future work – HYDRATOR-99999

Test Case(s)

  • Test case #1
  • Test case #2

Sample Pipeline

Please attach one or more sample pipeline(s) and associated data.

  • RedshiftToS3 action - Using keys for connection
  • RedshiftToS3 action - Using IAM Role for connection
  • RedshiftToS3 action - Using compression option GZIP
  • RedshiftToS3 action - Using parallel true


Checklist

  •  User stories documented
  •  User stories reviewed
  •  Design documented
  •  Design reviewed
  •  Feature merged
  •  Examples and guides
  •  Integration tests
  •  Documentation for feature
  •  Short video demonstrating the feature