Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Table of Contents

Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post

Introduction

The CDAP 4.0 UI is designed to provide operational insights about both - CDAP services as well as other service providers such as YARN, HBase and HDFS. The CDAP platform will need to expose additional APIs to surface this information.

Goals

The operational APIs should surface information for the Management Screen

Image ModifiedImage Modified

These designs translate into the following requirements:

  • CDAP Uptime
    • P1: Should indicate the time (number of hours, days?) for which the CDAP Master process has been running. 
    • P2: In an HA environment, it would be nice to indicate the time of the last master failover.
  • CDAP System Services
    • P1: Should indicate the current number of instances.
    • P1: Should have a way to scale services.
    • P1: Should show service logs
    • P2: Node name where container started
    • P2: Container name
    • P2: master.services YARN application name
  • Middle Drawer:
    • CDAP:
      • P1* (Stretch goal - only possible if there's a straightforward approach): # of masters, routers, kafka-servers, auth-
servers
      • servers 
      • P1: Router requests - # 200s, 404s, 500s
      • P1: # namespaces, artifacts, apps, programs, datasets, streams, views
      • P1: Transaction snapshot summary (invalid, in-progress, committing, committed)
      • P1: Logs/Metrics service lags
      • P2: Last GC pause time
    • HDFS:
      • P1: Space metrics:
yotal
      • total, free, used
      • P1: Nodes:
yotal
      • total, healthy, decommissioned, decommissionInProgress
      • P1: Blocks: missing, corrupt, under-replicated
    • YARN:
      • P1: Nodes: total, new, running, unhealthy, decommissioned, lost, rebooted
      • P1: Apps: total, submitted, accepted, running, failed, killed, new,  new_saving
      • P1: Memory: total, used, free
      • P1: Virtual Cores: total, used, free
      • P1: Queues: total, stopped, running, max_capacity, current_capacity
    • HBase
      • P1: Nodes: total_regionservers, live_regionservers, dead_regionservers, masters
      • P1: No. of namespaces, tables
      • P2: Last major compaction (time + info)
    • Zookeeper: Most of these are from the output of echo mntr | nc localhost 2181
    P1
        • P2: Num of alive connections
    P1
        • P2: Num of znodes
    P1
        • P2: Num of watches
    P1
        • P2: Num of ephemeral nodes
    P1
        • P2: Data size
    P1
        • P2: Open file descriptor count
    P1
    P1
        • P2: # of topics
    P1
        • P2Message in rate
    P1
        • P2Request rate
    P1
        • P2# of under replicated partitions
    P1
        • P2Partition counts
      • Sentry
      P1
          • P2: # of roles
      P1
          • P2: # of privileges
      P1
          • P2: memory: total, used, available
      P1
          • P2: requests per second
          • any more?
        • KMS
          • TBD: Having a hard time hitting the JMX endpoint for KMS
      • Component Overview
        • P1: YARN, HDFS, HBase
      , Zoookeeper, Kafka, Hive
        • P1: For each component: version, url, logs_url
        • P2:
      Sentry
        •  Zoookeeper, Kafka,
      KMS
        • Hive
        • P2: Sentry, KMS
        • P2: Distribution info
        • P2: Plus button - to store custom components and version, url, logs_url for each.

      User Stories

      1. As a CDAP admin, I would like a single place to perform health checks and monitoring for CDAP system services as well as service providers that CDAP depends upon. 
      2. As a CDAP admin, I would like to have insights into the health of all CDAP system services including master, log saver, explore container, metrics processor, metrics, streams, transaction server and dataset executor
      3. As a CDAP admin, I would like to know information about my CDAP setup including the version of CDAP
      4. As a CDAP admin, I would like to know the uptime of CDAP including optionally the time since the last failover in an HA scenario
      5. As a CDAP admin, I would like to know the versions and (optionally) links to the web UI and logs if available of the underlying infrastructure components.
      6. As a CDAP admin, I would like to have operational insights including
      statistics
      1. stats such as request rate, node status, available compute as well as storage capacity for the underlying infrastructure components that CDAP relies upon. These insights should help me understand the health of these components as well as help in root cause analysis in case CDAP fails or performs poorly.
       

      Design

      Data Sources

      Versions

      • CDAP - co.cask.cdap.common.utils.ProjectInfo
      • HBase - co.cask.cdap.data2.util.hbase.HBaseVersion
      • YARN - org.apache.hadoop.yarn.util.YarnVersionInfo
      • HDFS - org.apache.hadoop.util.VersionInfo
      • Zookeeper - No client API available. Will have to build a utility around echo stat | nc localhost 2181
      • Hive - org.apache.hive.common.util.HiveVersionInfo

      URL

      • CDAP - $(dashboard.bind.address) + $(dashboard.bind.port)
      • YARN - $(yarn.resourcemanager.webapp.address)
      • HDFS -  $(dfs.namenode.http-address)
      • HBase - hbaseAdmin.getClusterStatus().getMaster().toString()

      HDFS

      DistributedFileSystem - For HDFS

      statistics

      stats

      YARN

      YarnClient - for YARN

      statistics

      stats and info

      HBase

      HBaseAdmin - for HBase

      statistics

      stats and info

      Kafka

      JMX

      Reference: https://github.com/linkedin/kafka-monitor

      Zookeeper

      Option 1: Four letter commands - mntr. Drawbacks: mntr was introduced in 3.5.0 - users may be running older versions of Zookeeper

      Option 2: Zookeeper also exposes JMX - https://zookeeper.apache.org/doc/trunk/zookeeperJMX.html

      HiveServer2

      TBD

      Sentry

      JMX

      The following is available by enabling the sentry web service (ref: http://www.cloudera.com/documentation/enterprise/latest/topics/sg_sentry_metrics.html) and querying for metrics (API: http://[sentry-service-host]:51000/metrics?pretty=true).

      Code Block
      languagejava
      titleSentry JMX output
      collapsetrue
      {
        "version" : "3.0.0",
        "gauges" : {
          "buffers.direct.capacity" : {
            "value" : 57344
          },
          "buffers.direct.count" : {
            "value" : 5
          },
          "buffers.direct.used" : {
            "value" : 57344
          },
          "buffers.mapped.capacity" : {
            "value" : 0
          },
          "buffers.mapped.count" : {
            "value" : 0
          },
          "buffers.mapped.used" : {
            "value" : 0
          },
          "gc.PS-MarkSweep.count" : {
            "value" : 0
          },
          "gc.PS-MarkSweep.time" : {
            "value" : 0
          },
          "gc.PS-Scavenge.count" : {
            "value" : 2
          },
          "gc.PS-Scavenge.time" : {
            "value" : 26
          },
          "memory.heap.committed" : {
            "value" : 1029701632
          },
          "memory.heap.init" : {
            "value" : 1073741824
          },
          "memory.heap.max" : {
            "value" : 1029701632
          },
          "memory.heap.usage" : {
            "value" : 0.17999917863585554
          },
          "memory.heap.used" : {
            "value" : 185345448
          },
          "memory.non-heap.committed" : {
            "value" : 31391744
          },
          "memory.non-heap.init" : {
            "value" : 24576000
          },
          "memory.non-heap.max" : {
            "value" : 136314880
          },
          "memory.non-heap.usage" : {
            "value" : 0.2187954829289363
          },
          "memory.non-heap.used" : {
            "value" : 29825080
          },
          "memory.pools.Code-Cache.usage" : {
            "value" : 0.029324849446614582
          },
          "memory.pools.PS-Eden-Space.usage" : {
            "value" : 0.6523454156767787
          },
          "memory.pools.PS-Old-Gen.usage" : {
            "value" : 1.1440740671897877E-4
          },
          "memory.pools.PS-Perm-Gen.usage" : {
            "value" : 0.32970512204053926
          },
          "memory.pools.PS-Survivor-Space.usage" : {
            "value" : 0.22010480095358456
          },
          "memory.total.committed" : {
            "value" : 1061093376
          },
          "memory.total.init" : {
            "value" : 1098317824
          },
          "memory.total.max" : {
            "value" : 1166016512
          },
          "memory.total.used" : {
            "value" : 215170528
          },
          "org.apache.sentry.provider.db.service.persistent.SentryStore.group_count" : {
            "value" : 3
          },
          "org.apache.sentry.provider.db.service.persistent.SentryStore.privilege_count" : {
            "value" : 0
          },
          "org.apache.sentry.provider.db.service.persistent.SentryStore.role_count" : {
            "value" : 132
          },
          "threads.blocked.count" : {
            "value" : 1
          },
          "threads.count" : {
            "value" : 38
          },
          "threads.daemon.count" : {
            "value" : 27
          },
          "threads.deadlocks" : {
            "value" : [ ]
          },
          "threads.new.count" : {
            "value" : 0
          },
          "threads.runnable.count" : {
            "value" : 6
          },
          "threads.terminated.count" : {
            "value" : 0
          },
          "threads.timed_waiting.count" : {
            "value" : 8
          },
          "threads.waiting.count" : {
            "value" : 23
          }
        },
        "counters" : { },
        "histograms" : { },
        "meters" : { },
        "timers" : {
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.create-role" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-privilege" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-role" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-privilege" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-role" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-authorizable" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-role" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-for-provider" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-roles-by-group" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.rename-privilege" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-privilege" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          },
          "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-role" : {
            "count" : 0,
            "max" : 0.0,
            "mean" : 0.0,
            "min" : 0.0,
            "p50" : 0.0,
            "p75" : 0.0,
            "p95" : 0.0,
            "p98" : 0.0,
            "p99" : 0.0,
            "p999" : 0.0,
            "stddev" : 0.0,
            "m15_rate" : 0.0,
            "m1_rate" : 0.0,
            "m5_rate" : 0.0,
            "mean_rate" : 0.0,
            "duration_units" : "seconds",
            "rate_units" : "calls/second"
          }
        }
      }

      KMS

      KMS also exposes JMX via the endpoint http://host:16000/kms/jmx

      TODO: CDAP Master Uptime?

      Caching

      It is not possible to hit HBase/YARN/HDFS for every request from the UI. As a result, the result of the statistics API will have to be cached, with a configurable time to live. The cache will be keyed by the service provider name, with the value as the statistics for the service provider. In addition to a time-to-live, the cache will also need to be invalidated when:

    • In the ProgramController and ProgramRuntimeService class hierarchy, so YARN statistics can be updated
    • In the Entity (app, dataset, stream) lifecycle (creation and deletion) when storage statistics will need to be updated 
    • P3: In the authorization class hierarchy, when the authorization statistics may need to be updated. However, we may choose to not do this, because we eventually want to get out of the business of managing authorization policies in CDAP
    • TODO: Invalidation for Kafka and Zookeeper

      Implementation

      Operational Stats Extensions

      The service provider stats fetchers will be implemented as extensions. Each such extension will be installed by CDAP in the master/ext/operations directory as jar files. Each jar file in this directory will be scanned for implementations of the OperationalStats interface defined below. 

      Code Block
      /**
       * Interface for all operational stats emitted using the operational stats extension framework.
       *
       * To emit stats using this framework, create JMX {@link MXBean} interfaces, then have the implementations of those
       * interfaces extend this class. At runtime, all sub-classes of this class will be registered with the
       * {@link MBeanServer} with the <i>name</i> property determined by {@link #getServiceName()} and the <i>type</i>
       * property determined by {@link #getStatType()}.
       */
      public interface OperationalStats {
        /**
         * Returns the service name for which this operational stat is emitted. Service names are case-insensitive, and will
         * be converted to lower case.
         */
        String getServiceName();
      
        /**
         * Returns the type of the stat. Stat types are case-insensitive, and will be converted to lower case.
         */
        String getStatType();
      
        /**
         * Collects the stats that are reported by this object.
         */
        void collect() throws IOException;
      }
      • There will be one implementation of this interface for every service provider and stat type.
      • For example, HDFSStorage can be an implementation that provides stats of type "storage" for the service provider HDFS. 
      • A single jar file may contain multiple implementations of this interface. 
      • These classes will be loaded in a separate classloader, but there will not be classloader isolation, so extensions will have classes from the CDAP classloader available. 
      • CDAP will provide a core extension installed at master/ext/operations/core/cdap-operations-extensions-core.jar which will contain stats for some standard service providers. Additional services can be configured by implementing OperationalStats for the service, and placing the jar file under master/ext/operations/

      Collecting and reporting stats

      For collecting and reporting OperationalStats, the JMX API will be used. Hence, in addition to implementing the OperationalStats interface so it can be recognized as an operational extension, each implementation should also define and implement a Java MXBean interface.

      After loading an operational stats extension, the MXBean it implements will be registered with the MBeanServer using the property name set to the value returned from getServiceName and type set to the value returned from getStatType. These properties can then be used to create an ObjectName to retrieve the stats from JMX.

      TODO: CDAP Master Uptime?

      Caching

      The collect method of every operational stats extension will be called at a configurable time interval, and is expected to refresh its stats. A call to an accessor method in the MXBean will simply return the current value cached inside the class.

      API changes

      New REST APIs

      The following REST APIs will be exposed from app fabric.

      List Service Providers

      This API lists

      PathMethodDescriptionResponse CodeResponse
      /v3/system/serviceprovidersGETLists all the available service providers and (optionally) minimal info about each (version, url and logs_url)

      Path

      GET /v3/system/serviceproviders

      Output

      code

      200 - On success

      500 - Any internal errors

      Code Block
      languagejava
      titleList Service Providers Response
      collapsetrue
      {
        "hdfs": {
          "version": "2.7.0",
          "url": "http://localhost:50070",
          "url": "http://localhost:50070/logs/"
        },
        "yarn": {
          "version": "2.7.0",
          "url": "http://localhost:8088",
          "logs": "http://localhost:8088/logs/"
        },
        "hbase": {
          "version": "1.0.0",
          "url": "http://localhost:50070",
          "logs": "http://localhost:60010/logs/"
        },
        "hive": {
          "version": 1.2
        },
        "zookeeper": {
          "version": "3.4.2"
        },
        "kafka": {
          "version": "2.10"
        }
      }

      Get Service Provider Statistics

      Path
      GET
      /v3/system/serviceproviders/{service-provider-name}/stats
      Response
      GETReturns stats for the specified service provider

      200 OK -

      Statistics

      stats for the specified service provider were successfully fetched

      503 Unavailable - Could not contact the service provider for status

      404 Not found - Service provider not found (not in the list returned by the list service providers API)

      Output

      500 - Any other internal errors

      Code Block
      languagejava
      titleCDAP
      Statistics Outputcollapsetrue{
      Stats Response
      collapsetrue
      {
          "services": {
            "masters": 2,
            "kafka-servers": 2,
            "routers": 1,
            "auth-servers": 1
          },
          "entities": {
            "namespaces": 10,
            "
      services
      apps": 
      {
      46,
            "
      masters
      artifacts": 
      2
      23,
            "
      kafka-servers
      datasets": 
      2
      68,
            "
      routers
      streams": 
      1,
      34,
            "programs": 78
          }
      }
      Code Block
      languagejava
      titleHDFS Stats Response
      collapsetrue
      {
          "storage": {
            "
      auth-servers
      total": 
      1
      3452759234,
            "
      namespaces
      used": 
      10
      34525543,
          
      },
        "available": 3443555345
       
      "entities":
       
      {
        },
          "
      apps
      nodes": 
      46,
      {
            "
      artifacts
      total": 
      23
      40,
            "
      datasets
      healthy": 
      68
      36,
            "
      streams
      decommissioned": 
      34
      3,
            "
      programs
      decommissionInProgress": 
      78
      1
          }
      } Code Block
      languagejava
      titleHDFS Statistics Output
      collapsetrue
      {
      ,
          "
      space
      blocks": {
            "
      total
      missing": 
      3452759234
      33,
            "
      used
      corrupt": 
      34525543
      3,
            "
      available
      underreplicated": 5
      
      3443555345
          }
      }
      ,
      Code Block
      languagejava
      titleYARN Stats Response
      collapsetrue
      {
          "nodes": {
            "total": 
      40
      35,
            "
      healthy
      new": 
      36
      0,
            "
      decommissioned
      running": 
      3
      30,
            "
      decommissionInProgress
      unhealthy": 1
      ,
       
      },
           "
      blocks
      decommissioned": 
      {
      2,
            "
      missing
      lost": 
      33
      1,
            "
      corrupt
      rebooted": 1
        
      3,
        },
          "
      underreplicated
      apps": 
      5
      {
      
      }
       
      } Code Block
      languagejava
      titleYARN Statistics Output
      collapsetrue
      {
           "
      nodes
      total": 
      {
      30,
            "
      total
      submitted": 
      35
      2,
            "
      new
      accepted": 
      0
      4,
            "running": 
      30
      20,
            "
      unhealthy
      failed": 1,
            "
      decommissioned
      killed": 
      2
      3,
            "
      lost
      new": 
      1
      0,
            "
      rebooted
      new_saving": 
      1
      0
          },
          "
      apps
      memory": {
            "total": 
      30
      8192,
            "
      submitted
      used": 
      2
      7168,
            "
      accepted
      available": 
      4,
      1024
          
      "running": 20,
      },
          "
      failed
      virtualCores": 
      1,
      {
            "
      killed
      total": 
      3
      36,
            "
      new
      used": 
      0
      12,
            "
      new_saving
      available": 
      0
      24
          },
          "
      memory
      queues": {
            "total": 
      8192
      10,
            "
      used
      stopped": 
      7168
      2,
            "
      available
      running": 
      1024
      8,
       
      },
           "
      virtualCores
      maxCapacity": 
      {
      32,
            "
      total
      currentCapacity":
      36,
       21
          }
      }
      Code Block
      languagejava
      titleHBase Stats Response
      collapsetrue
      {
          "
      used
      nodes": 
      12,
      {
            "
      available
      totalRegionServers": 
      24
      37,
       
      },
           "
      queues
      liveRegionServers": 
      {
      34,
            "
      total
      deadRegionServers": 
      10
      3,
            "
      stopped
      masters": 3
        
      2,
        },
          "
      running
      entities": 
      8,
      {
            "
      maxCapacity
      tables": 
      32
      56,
            "
      currentCapacity
      namespaces": 43
       
      21
         
      } }
      Code Block
      languagejava
      titleHBase Statistics Output
      collapsetrue
      {
          "nodes": {
            "totalRegionServers": 37,
            "liveRegionServers": 34,
            "deadRegionServers": 3,
            "masters": 3
          },
          "tables": 56,
          "namespaces": 43
      }

      TODO: Add output for Kafka, Zookeeper, Sentry, KMS

      CLI Impact or Changes

      New CLI commands will have to be added to front the two new APIs.

      List Service Providers

      list service providers

      Get Service Provider Statistics

      get statistics for service provider <service-provider>

      UI Impact or Changes

      The Management screen on the CDAP 4.0 UI will have to be implemented using the APIs exposed by this design in addition to existing APIs for getting System Service Status and Logs

      Security Impact

      Impact on Infrastructure Outages

      Test Scenarios
      }
      }

      TODO: Add responses for Kafka, Zookeeper, Sentry, KMS

      CLI Impact or Changes

      New CLI commands will have to be added to front the two new APIs.

      List Service Providers

      list service providers

      Get Service Provider Stats

      get stats for service provider <service-provider>

      UI Impact or Changes

      The Management screen on the CDAP 4.0 UI will have to be implemented using the APIs exposed by this design in addition to existing APIs for getting System Service Status and Logs

      Security Impact

      Currently CDAP does not enforce authorization for the system services APIs - 

      Jira Legacy
      serverCask Community Issue Tracker
      columnskey,summary,type,created,updated,due,assignee,reporter,priority,status,resolution
      serverId45b48dee-c8d6-34f0-9990-e6367dc2fe4b
      keyCDAP-6917
      . The APIs in this design should enforce the same authorization policies as will be implemented for the system services APIs. Ideally, only users that have ADMIN privileges on the CDAP instance should be able to execute these APIs successfully.

      Impact on Infrastructure Outages

      Test Scenarios

      Test IDTest DescriptionExpected Results
       T1Positive test for list API Should return all the configured service providers
       T2Positive test for stats of each service provider Should return the appropriate details for each service provider 
      T3 Stop a configured storage provider and hit the API to get its statsShould return 503 with a proper error message 
      T4Hit the API to get stats of a non-existent API Should return 404 with a proper error message 

      Releases

      Release 4.0.0

      • Ground work for collecting
      statistics
      • stats from infrastructure components.
      • Focus on HDFS, YARN, HBase
      , Hive, Kafka, Zookeeper (in that order)

      Release 4.1.0

      • More components such as Hive, Kafka, Zookeeper, Sentry, KMS  (in that order).

      Related Work

      N/A

      Future work

       

      Checklist

      •  User Stories Documented
      •  User Stories Reviewed
      •  Design Reviewed
      •  APIs reviewed
      •  Release priorities assigned
      •  Test cases reviewed
      •  Blog post
      • TBD