Installation on Amazon EMR using Bootstrap Actions

This section describes installing CDAP on Amazon EMR clusters using the Amazon EMR "Run If" Bootstrap Action to:

  • Install necessary EMR components;

  • Restrict CDAP installation to the EMR master node;

  • Download, install, and automatically configure CDAP for EMR; and

  • Run all services as the 'cdap' user.

Information on Amazon EMR is available online.

CDAP 6.2 is compatible with Amazon EMR 4.6.0 through 5.3.1.
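Before creating a cluster, it can help to confirm that the intended release label falls inside this range. The following is a minimal sketch, not part of CDAP or the AWS tooling: the function names version_le and emr_release_supported are hypothetical, and the comparison assumes a sort that supports -V (GNU coreutils).

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks an EMR release label against the supported
# range stated above (4.6.0 through 5.3.1).

MIN_EMR="4.6.0"
MAX_EMR="5.3.1"

# version_le A B: succeeds when version A <= version B.
# Assumes GNU sort, which provides -V (version sort).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# emr_release_supported LABEL: accepts "emr-5.2.0" or plain "5.2.0".
emr_release_supported() {
  local version="${1#emr-}"   # strip the optional "emr-" prefix
  version_le "$MIN_EMR" "$version" && version_le "$version" "$MAX_EMR"
}

# Example usage:
#   emr_release_supported emr-5.2.0 && echo "supported"
```

The function only returns an exit status, so it can be used as a guard in a larger provisioning script.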

Using the Create Cluster Wizard

For any settings not listed or specified below, we recommend using the default settings.

  1. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/.

  2. Choose "Create cluster."

  3. In Advanced Options, Step 1: Software and Steps, set:

    • Vendor: Amazon

    • Release: emr-4.6.0 through emr-5.3.1

    • Software: Hadoop, HBase, Spark

    • No auto-terminate


    EMR Create Cluster Wizard: Step 1: Software and Steps

  4. In Step 2: Hardware, set:

    • Network: use defaults

    • EC2 Subnet: use defaults

    • Master

      • EC2 Instance type: m3.xlarge

      • Instance count: 1

    • Core

      • EC2 Instance type: m3.xlarge

      • Instance count: 4 (as a minimum)

    • Task

      • Instance count: 0 (not required)

    EMR Create Cluster Wizard: Step 2: Hardware

  5. In Step 3: General Cluster Settings, enable:

    • Logging

    • Debugging

    • Termination protection (no auto-terminate)



    EMR Create Cluster Wizard: Step 3: General Cluster Settings

  6. In Step 3: General Cluster Settings, add a Bootstrap Action:

    • Type: Run If

    • Optional arguments:

      instance.isMaster=true "curl https://downloads.cdap.io/emr/install-6.0.0.sh | sudo bash -s"

     

    EMR Create Cluster Wizard: Add Bootstrap Action

  7. In Step 4: Security, keep the default settings, and then add a security group (see the next step).

    EMR Create Cluster Wizard: Step 4: Security

  8. In Step 4: Security, assign additional EC2 Security Groups to the master node:

    • Master (one of the following):

      • A Security Group with ports 11011 and 11015 open; or

      • An SSH Tunnel

    EMR Create Cluster Wizard: Assigning additional security group to master node
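
For repeatable setups, the wizard settings above can be approximated with the AWS CLI. The following is a sketch under stated assumptions, not an officially supported procedure: the cluster name, log bucket, key pair name, and security group ID are placeholders that do not come from this guide, the latest compatible release is picked as the default, and the script only prints the command unless DRY_RUN=0 is set. The Run If bootstrap action is referenced by its published Amazon S3 path.

```shell
#!/usr/bin/env bash
# Sketch: AWS CLI approximation of the Create Cluster wizard settings above.
# LOG_URI, KEY_NAME, and MASTER_SG are placeholders (assumptions); replace
# them with values from your own account.

RELEASE_LABEL="${RELEASE_LABEL:-emr-5.3.1}"
LOG_URI="${LOG_URI:-s3://example-bucket/emr-logs/}"
KEY_NAME="${KEY_NAME:-example-key}"
MASTER_SG="${MASTER_SG:-sg-00000000}"

# Install command taken from the Bootstrap Action step above.
INSTALL_CMD='curl https://downloads.cdap.io/emr/install-6.0.0.sh | sudo bash -s'

# "Run If" runs the install command only where instance.isMaster is true.
BOOTSTRAP_JSON=$(printf '[{"Name":"Install CDAP","Path":"s3://elasticmapreduce/bootstrap-actions/run-if","Args":["instance.isMaster=true","%s"]}]' "$INSTALL_CMD")

create_cdap_cluster() {
  local arg
  # Build the argument list once so it can be printed or executed.
  set -- emr create-cluster \
    --name "CDAP on EMR" \
    --release-label "$RELEASE_LABEL" \
    --applications Name=Hadoop Name=HBase Name=Spark \
    --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge \
      InstanceGroupType=CORE,InstanceCount=4,InstanceType=m3.xlarge \
    --bootstrap-actions "$BOOTSTRAP_JSON" \
    --ec2-attributes "KeyName=$KEY_NAME,AdditionalMasterSecurityGroups=$MASTER_SG" \
    --log-uri "$LOG_URI" \
    --enable-debugging \
    --termination-protected \
    --no-auto-terminate \
    --use-default-roles

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Dry run (default): print the command for review instead of running it.
    printf 'aws'
    for arg in "$@"; do printf " '%s'" "$arg"; done
    printf '\n'
  else
    aws "$@"
  fi
}

create_cdap_cluster
```

Network and subnet options are omitted so that the account defaults apply, matching the wizard steps. If an SSH tunnel is used instead of a security group, the AdditionalMasterSecurityGroups entry can be dropped.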

Once the cluster is created, CDAP services start automatically. Allow about 10 minutes after the cluster enters the Waiting state for the services to become available.
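
Rather than waiting a fixed time, a script can poll until CDAP responds. The following is a minimal sketch with several assumptions: that the CDAP HTTP endpoint is reachable on port 11015 (directly, or on localhost through an SSH tunnel), and that a request to the path /ping succeeds once the service is up. The host, port, and probe path are configurable because they are not confirmed by this guide, and the function names probe_cdap and wait_for are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: poll a CDAP endpoint until it answers.
# CDAP_HOST, CDAP_PORT, and PROBE_PATH are assumptions; adjust as needed.

CDAP_HOST="${CDAP_HOST:-localhost}"
CDAP_PORT="${CDAP_PORT:-11015}"
PROBE_PATH="${PROBE_PATH:-/ping}"

# probe_cdap: succeeds when an HTTP request to the probe URL succeeds.
probe_cdap() {
  curl --silent --fail --max-time 5 --output /dev/null \
    "http://${CDAP_HOST}:${CDAP_PORT}${PROBE_PATH}"
}

# wait_for ATTEMPTS DELAY CMD...: runs CMD until it succeeds, sleeping
# DELAY seconds between tries; fails after ATTEMPTS unsuccessful tries.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Example usage (up to 40 tries, 30 seconds apart, roughly 20 minutes):
#   wait_for 40 30 probe_cdap && echo "CDAP is responding"
```

The retry budget in the usage example is deliberately larger than the roughly 10 minutes mentioned above, since startup time can vary.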

Created in 2020 by Google Inc.