Introduction

The CDAP 4.0 UI is designed to provide operational insights about both CDAP services and other service providers such as YARN, HBase and HDFS. The CDAP platform will need to expose additional APIs to surface this information.

Goals

The operational APIs should surface information for the Management Screen

These designs translate into the following requirements:

  • CDAP Uptime
    • P1: Should indicate the time (number of hours, days?) for which the CDAP Master process has been running. 
    • P2: In an HA environment, it would be nice to indicate the time of the last master failover.
  • CDAP System Services
    • P1: Should indicate the current number of instances.
    • P1: Should have a way to scale services.
    • P1: Should show service logs
    • P2: Node name where container started
    • P2: Container name
    • P2: master.services YARN application name
  • Middle Drawer:
    • CDAP:
      • P1: # of masters, routers, kafka-servers, auth-servers
      • P1: Router requests - # 200s, 404s, 500s
      • P1: # namespaces, artifacts, apps, programs, datasets, streams, views
      • P1: Transaction snapshot summary (invalid, in-progress, committing, committed)
      • P1: Logs/Metrics service lags
      • P2: Last GC pause time
    • HDFS:
      • P1: Space metrics: total, free, used
      • P1: Nodes: total, healthy, decommissioned, decommissionInProgress
      • P1: Blocks: missing, corrupt, under-replicated
    • YARN:
      • P1: Nodes: total, new, running, unhealthy, decommissioned, lost, rebooted
      • P1: Apps: total, submitted, accepted, running, failed, killed, new,  new_saving
      • P1: Memory: total, used, free
      • P1: Virtual Cores: total, used, free
      • P1: Queues: total, stopped, running, max_capacity, current_capacity
    • HBase
      • P1: Nodes: total_regionservers, live_regionservers, dead_regionservers, masters
      • P1: No. of namespaces, tables
      • P2: Last major compaction (time + info)
    • Zookeeper: Most of these are from the output of echo mntr | nc localhost 2181
      • P1: Num of alive connections
      • P1: Num of znodes
      • P1: Num of watches
      • P1: Num of ephemeral nodes
      • P1: Data size
      • P1: Open file descriptor count
      • P1: Max file descriptor count
    • Kafka
    • Sentry
      • P1: # of roles
      • P1: # of privileges
      • P1: memory: total, used, available
      • P1: requests per second
      • any more?
    • KMS
      • TBD: Having a hard time hitting the JMX endpoint for KMS
  • Component Overview
    • P1: YARN, HDFS, HBase, Zookeeper, Kafka, Hive
    • P1: For each component: version, url, logs_url
    • P2: Sentry, KMS
    • P2: Distribution info
    • P2: Plus button - to store custom components and version, url, logs_url for each.

User Stories

  1. As a CDAP admin, I would like to have insights into the health of all CDAP system services, including master, log saver, explore container, metrics processor, metrics, streams, transaction server and dataset executor.
  2. As a CDAP admin, I would like to know information about my CDAP setup, including the version of CDAP.
  3. As a CDAP admin, I would like to know the uptime of CDAP and, optionally, the time since the last failover in an HA scenario.
  4. As a CDAP admin, I would like to know the versions of the underlying infrastructure components and, optionally, links to their web UIs and logs where available.
  5. As a CDAP admin, I would like to have operational insights including statistics such as request rate, node status, available compute as well as storage capacity for the underlying infrastructure components that CDAP relies upon. These insights should help me understand the health of these components as well as help in root cause analysis in case CDAP fails or performs poorly.

 

Design

Data Sources

Versions

  • CDAP - co.cask.cdap.common.utils.ProjectInfo
  • HBase - co.cask.cdap.data2.util.hbase.HBaseVersion
  • YARN - org.apache.hadoop.yarn.util.YarnVersionInfo
  • HDFS - org.apache.hadoop.util.VersionInfo
  • Zookeeper - No client API available. Will have to build a utility around echo stat | nc localhost 2181
  • Hive - org.apache.hive.common.util.HiveVersionInfo
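
The Zookeeper utility mentioned above could start from something like the following sketch. It only parses the text returned by the stat command; ZkVersionParser is a hypothetical name, not an existing CDAP class, and the sketch assumes the first line of the stat output has the form "Zookeeper version: <version>, built on <date>".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an existing CDAP class) that extracts the version from `stat` output. */
final class ZkVersionParser {
  // Assumed shape of the first line of `stat` output:
  // "Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT"
  private static final Pattern VERSION =
      Pattern.compile("^Zookeeper version:\\s*([^,\\s]+)", Pattern.MULTILINE);

  private ZkVersionParser() { }

  /** Returns the version token from raw `stat` output, or null if no version line is present. */
  static String parse(String statOutput) {
    Matcher matcher = VERSION.matcher(statOutput);
    return matcher.find() ? matcher.group(1) : null;
  }
}
```

Returning null rather than throwing lets the caller omit the version field when the server does not answer the command.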

URL

  • CDAP - $(dashboard.bind.address) + $(dashboard.bind.port)
  • YARN - $(yarn.resourcemanager.webapp.address)
  • HDFS -  $(dfs.namenode.http-address)
  • HBase - hbaseAdmin.getClusterStatus().getMaster().toString()

HDFS

DistributedFileSystem - For HDFS statistics

YARN

YarnClient - for YARN statistics and info

HBase

HBaseAdmin - for HBase statistics and info

Kafka

JMX

Reference: https://github.com/linkedin/kafka-monitor

Zookeeper

Option 1: Four letter commands - mntr. Drawback: mntr was introduced in Zookeeper 3.4.0, so it is not available to users running older versions of Zookeeper.

Option 2: Zookeeper also exposes JMX - https://zookeeper.apache.org/doc/trunk/zookeeperJMX.html
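
As a rough illustration of Option 1, the sketch below sends a four letter command over a plain socket and parses the tab-separated key/value lines that mntr returns. ZkMntr and its method names are hypothetical, and the sketch assumes the server closes the connection after replying, which is how the four letter commands normally behave.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not an existing CDAP class) for Zookeeper four letter commands. */
final class ZkMntr {
  private ZkMntr() { }

  /** Sends a four letter command such as "mntr" and returns the raw reply. */
  static String send(String host, int port, String command) throws IOException {
    try (Socket socket = new Socket(host, port)) {
      socket.getOutputStream().write(command.getBytes(StandardCharsets.US_ASCII));
      socket.getOutputStream().flush();
      ByteArrayOutputStream reply = new ByteArrayOutputStream();
      InputStream in = socket.getInputStream();
      byte[] chunk = new byte[4096];
      int read;
      // Read until the server closes the connection.
      while ((read = in.read(chunk)) != -1) {
        reply.write(chunk, 0, read);
      }
      return new String(reply.toByteArray(), StandardCharsets.UTF_8);
    }
  }

  /** Parses "key<TAB>value" lines, e.g. "zk_znode_count\t4"; lines without a tab are skipped. */
  static Map<String, String> parse(String mntrOutput) {
    Map<String, String> stats = new LinkedHashMap<>();
    for (String line : mntrOutput.split("\n")) {
      int tab = line.indexOf('\t');
      if (tab > 0) {
        stats.put(line.substring(0, tab).trim(), line.substring(tab + 1).trim());
      }
    }
    return stats;
  }
}
```

A caller would combine the two, for example ZkMntr.parse(ZkMntr.send("localhost", 2181, "mntr")), and read keys such as zk_num_alive_connections, zk_znode_count and zk_watch_count from the resulting map.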

HiveServer2

TBD

Sentry

JMX

The following is available by enabling the Sentry web service (ref: http://www.cloudera.com/documentation/enterprise/latest/topics/sg_sentry_metrics.html) and querying its metrics endpoint (API: http://[sentry-service-host]:51000/metrics?pretty=true).

Sentry JMX output:
{
  "version" : "3.0.0",
  "gauges" : {
    "buffers.direct.capacity" : {
      "value" : 57344
    },
    "buffers.direct.count" : {
      "value" : 5
    },
    "buffers.direct.used" : {
      "value" : 57344
    },
    "buffers.mapped.capacity" : {
      "value" : 0
    },
    "buffers.mapped.count" : {
      "value" : 0
    },
    "buffers.mapped.used" : {
      "value" : 0
    },
    "gc.PS-MarkSweep.count" : {
      "value" : 0
    },
    "gc.PS-MarkSweep.time" : {
      "value" : 0
    },
    "gc.PS-Scavenge.count" : {
      "value" : 2
    },
    "gc.PS-Scavenge.time" : {
      "value" : 26
    },
    "memory.heap.committed" : {
      "value" : 1029701632
    },
    "memory.heap.init" : {
      "value" : 1073741824
    },
    "memory.heap.max" : {
      "value" : 1029701632
    },
    "memory.heap.usage" : {
      "value" : 0.17999917863585554
    },
    "memory.heap.used" : {
      "value" : 185345448
    },
    "memory.non-heap.committed" : {
      "value" : 31391744
    },
    "memory.non-heap.init" : {
      "value" : 24576000
    },
    "memory.non-heap.max" : {
      "value" : 136314880
    },
    "memory.non-heap.usage" : {
      "value" : 0.2187954829289363
    },
    "memory.non-heap.used" : {
      "value" : 29825080
    },
    "memory.pools.Code-Cache.usage" : {
      "value" : 0.029324849446614582
    },
    "memory.pools.PS-Eden-Space.usage" : {
      "value" : 0.6523454156767787
    },
    "memory.pools.PS-Old-Gen.usage" : {
      "value" : 1.1440740671897877E-4
    },
    "memory.pools.PS-Perm-Gen.usage" : {
      "value" : 0.32970512204053926
    },
    "memory.pools.PS-Survivor-Space.usage" : {
      "value" : 0.22010480095358456
    },
    "memory.total.committed" : {
      "value" : 1061093376
    },
    "memory.total.init" : {
      "value" : 1098317824
    },
    "memory.total.max" : {
      "value" : 1166016512
    },
    "memory.total.used" : {
      "value" : 215170528
    },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.group_count" : {
      "value" : 3
    },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.privilege_count" : {
      "value" : 0
    },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.role_count" : {
      "value" : 132
    },
    "threads.blocked.count" : {
      "value" : 1
    },
    "threads.count" : {
      "value" : 38
    },
    "threads.daemon.count" : {
      "value" : 27
    },
    "threads.deadlocks" : {
      "value" : [ ]
    },
    "threads.new.count" : {
      "value" : 0
    },
    "threads.runnable.count" : {
      "value" : 6
    },
    "threads.terminated.count" : {
      "value" : 0
    },
    "threads.timed_waiting.count" : {
      "value" : 8
    },
    "threads.waiting.count" : {
      "value" : 23
    }
  },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.create-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-privilege" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-privilege" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-authorizable" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-for-provider" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-roles-by-group" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.rename-privilege" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-privilege" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-role" : {
      "count" : 0,
      "max" : 0.0,
      "mean" : 0.0,
      "min" : 0.0,
      "p50" : 0.0,
      "p75" : 0.0,
      "p95" : 0.0,
      "p98" : 0.0,
      "p99" : 0.0,
      "p999" : 0.0,
      "stddev" : 0.0,
      "m15_rate" : 0.0,
      "m1_rate" : 0.0,
      "m5_rate" : 0.0,
      "mean_rate" : 0.0,
      "duration_units" : "seconds",
      "rate_units" : "calls/second"
    }
  }
}
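
The role and privilege counts required for the Sentry section are the SentryStore.role_count and SentryStore.privilege_count gauges in the output above. The sketch below pulls a single integer gauge out of that payload with a regular expression, purely to keep the example free of dependencies; SentryGauges is a hypothetical name, and a real implementation would more likely use a JSON parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an existing CDAP or Sentry class) for reading integer gauges. */
final class SentryGauges {
  private SentryGauges() { }

  /**
   * Returns the integer value of the named gauge, or null if the gauge is missing
   * or its value is not an integer (fractional gauges such as memory.heap.usage are not matched).
   */
  static Long gauge(String metricsJson, String name) {
    Pattern pattern = Pattern.compile(
        Pattern.quote("\"" + name + "\"") + "\\s*:\\s*\\{\\s*\"value\"\\s*:\\s*(-?\\d+)\\s*\\}");
    Matcher matcher = pattern.matcher(metricsJson);
    return matcher.find() ? Long.valueOf(matcher.group(1)) : null;
  }
}
```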

KMS

KMS also exposes JMX via the endpoint http://host:16000/kms/jmx

TODO: CDAP Master Uptime?
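
One possible starting point for the uptime TODO, assuming the code runs inside the CDAP Master JVM, is the standard RuntimeMXBean from the JDK. UptimeSketch and its methods are hypothetical names, and the sketch does not cover the time of the last failover in an HA setup.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not an existing CDAP class) for reporting process uptime. */
final class UptimeSketch {
  private UptimeSketch() { }

  /** Milliseconds since the current JVM started, as reported by the JDK's RuntimeMXBean. */
  static long jvmUptimeMillis() {
    return ManagementFactory.getRuntimeMXBean().getUptime();
  }

  /** Formats a duration in milliseconds as days, hours and minutes, e.g. "2d 3h 15m". */
  static String format(long millis) {
    long days = TimeUnit.MILLISECONDS.toDays(millis);
    long hours = TimeUnit.MILLISECONDS.toHours(millis) % 24;
    long minutes = TimeUnit.MILLISECONDS.toMinutes(millis) % 60;
    return days + "d " + hours + "h " + minutes + "m";
  }
}
```

Returning the raw millisecond value from the API and leaving the formatting to the UI would keep the "hours or days" question from the requirements open.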

Caching

It is not feasible to query HBase, YARN and HDFS for every request from the UI. The results of the statistics API will therefore be cached with a configurable time-to-live. The cache will be keyed by service provider name, with that provider's statistics as the value. In addition to expiring on the time-to-live, cache entries will also need to be invalidated from the following places:

  1. In the ProgramController and ProgramRuntimeService class hierarchy, so YARN statistics can be updated
  2. In the Entity (app, dataset, stream) lifecycle (creation and deletion) when storage statistics will need to be updated 
  3. P3: In the authorization class hierarchy, when the authorization statistics may need to be updated. However, we may choose to not do this, because we eventually want to get out of the business of managing authorization policies in CDAP
  4. TODO: Invalidation for Kafka and Zookeeper
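
The caching scheme described above could be sketched roughly as follows. StatsCache, its loader function and the injectable clock are all hypothetical names introduced for illustration; the clock is passed in only so that expiry can be tested without sleeping.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical TTL cache keyed by service provider name (not an existing CDAP class). */
final class StatsCache<V> {
  private static final class Entry<V> {
    final V value;
    final long loadedAt;

    Entry(V value, long loadedAt) {
      this.value = value;
      this.loadedAt = loadedAt;
    }
  }

  private final ConcurrentMap<String, Entry<V>> entries = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock;
  private final Function<String, V> loader;

  StatsCache(long ttlMillis, LongSupplier clock, Function<String, V> loader) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
    this.loader = loader;
  }

  /** Returns cached statistics, reloading them if the entry is missing or older than the TTL. */
  V get(String provider) {
    final long now = clock.getAsLong();
    Entry<V> entry = entries.compute(provider, (name, old) ->
        (old == null || now - old.loadedAt >= ttlMillis)
            ? new Entry<V>(loader.apply(name), now)
            : old);
    return entry.value;
  }

  /** Drops the entry for one provider, e.g. "yarn" after a program starts or stops. */
  void invalidate(String provider) {
    entries.remove(provider);
  }
}
```

In this sketch the lifecycle hooks listed above would call invalidate("yarn") or invalidate("hdfs"), and the next UI request would trigger a fresh load. The loader runs while the map entry is locked, so it should not call back into the same cache.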

API changes

New REST APIs

The following REST APIs will be exposed from app fabric.

List Service Providers

This API lists all the available service providers and (optionally) minimal info about each (version, url and logs_url)

Path

GET /v3/system/serviceproviders

Output

{
  "hdfs": {
    "version": "2.7.0",
    "url": "http://localhost:50070",
    "logs": "http://localhost:50070/logs/"
  },
  "yarn": {
    "version": "2.7.0",
    "url": "http://localhost:8088",
    "logs": "http://localhost:8088/logs/"
  },
  "hbase": {
    "version": "1.0.0",
    "url": "http://localhost:60010",
    "logs": "http://localhost:60010/logs/"
  },
  "hive": {
    "version": "1.2"
  },
  "zookeeper": {
    "version": "3.4.2"
  },
  "kafka": {
    "version": "2.10"
  }
}

Get Service Provider Statistics

Path

GET /v3/system/serviceproviders/{service-provider-name}/stats

Response

200 OK - Statistics for the specified service provider were successfully fetched

503 Service Unavailable - Could not contact the service provider for statistics

404 Not Found - Service provider not found (not in the list returned by the List Service Providers API)
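
The mapping from outcomes to the response codes above might look like the following sketch. StatsStatus and statusFor are hypothetical names, and the sketch assumes that any exception thrown while fetching statistics means the provider could not be contacted.

```java
import java.util.Set;
import java.util.concurrent.Callable;

/** Hypothetical helper (not an existing CDAP class) that picks the HTTP status for a stats request. */
final class StatsStatus {
  private StatsStatus() { }

  /**
   * Returns 404 if the provider is not in the known list, 503 if fetching its
   * statistics throws, and 200 otherwise.
   */
  static int statusFor(String provider, Set<String> knownProviders, Callable<?> fetch) {
    if (!knownProviders.contains(provider)) {
      return 404;
    }
    try {
      fetch.call();
      return 200;
    } catch (Exception e) {
      // Assumption: any failure here is treated as "provider unreachable".
      return 503;
    }
  }
}
```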

Output

CDAP Statistics Output:
{
    "services": {
      "masters": 2,
      "kafka-servers": 2,
      "routers": 1,
      "auth-servers": 1,
      "namespaces": 10
    },
    "entities": {
      "apps": 46,
      "artifacts": 23,
      "datasets": 68,
      "streams": 34,
      "programs": 78
    }
}

HDFS Statistics Output:
{
    "space": {
      "total": 3452759234,
      "used": 34525543,
      "available": 3443555345
    },
    "nodes": {
      "total": 40,
      "healthy": 36,
      "decommissioned": 3,
      "decommissionInProgress": 1
    },
    "blocks": {
      "missing": 33,
      "corrupt": 3,
      "underreplicated": 5
    }
}

YARN Statistics Output:
{
    "nodes": {
      "total": 35,
      "new": 0,
      "running": 30,
      "unhealthy": 1,
      "decommissioned": 2,
      "lost": 1,
      "rebooted": 1
    },
    "apps": {
      "total": 30,
      "submitted": 2,
      "accepted": 4,
      "running": 20,
      "failed": 1,
      "killed": 3,
      "new": 0,
      "new_saving": 0
    },
    "memory": {
      "total": 8192,
      "used": 7168,
      "available": 1024
    },
    "virtualCores": {
      "total": 36,
      "used": 12,
      "available": 24
    },
    "queues": {
      "total": 10,
      "stopped": 2,
      "running": 8,
      "maxCapacity": 32,
      "currentCapacity": 21
    }
}

HBase Statistics Output:
{
    "nodes": {
      "totalRegionServers": 37,
      "liveRegionServers": 34,
      "deadRegionServers": 3,
      "masters": 3
    },
    "tables": 56,
    "namespaces": 43
}

TODO: Add output for Kafka, Zookeeper, Sentry, KMS

CLI Impact or Changes

New CLI commands will have to be added to front the two new APIs.

List Service Providers

list service providers

Get Service Provider Statistics

get statistics for service provider <service-provider>
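
To make the relationship between the two CLI commands and the REST APIs concrete, here is a minimal sketch of the translation. CliToRest is a hypothetical name and does not reflect how the CDAP CLI is actually structured; only the command patterns and paths come from this page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical translator (not an existing CDAP class) from CLI command text to REST path. */
final class CliToRest {
  private static final Pattern GET_STATS =
      Pattern.compile("^get statistics for service provider (\\S+)$");

  private CliToRest() { }

  /** Returns the REST path for a supported command, or null if the command is not recognized. */
  static String toPath(String command) {
    String trimmed = command.trim();
    if (trimmed.equals("list service providers")) {
      return "/v3/system/serviceproviders";
    }
    Matcher matcher = GET_STATS.matcher(trimmed);
    if (matcher.matches()) {
      return "/v3/system/serviceproviders/" + matcher.group(1) + "/stats";
    }
    return null;
  }
}
```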

UI Impact or Changes

The Management screen in the CDAP 4.0 UI will have to be implemented using the APIs exposed by this design, in addition to the existing APIs for getting System Service Status and Logs.

Security Impact

Impact on Infrastructure Outages

Test Scenarios

Releases

Release 4.0.0

  • Ground work for collecting statistics from infrastructure components.
  • Focus on HDFS, YARN, HBase, Hive, Kafka, Zookeeper (in that order)

Release 4.1.0

  • More components such as Sentry, KMS.

Related Work

  • N/A

Future work

 

Checklist

  •  User Stories Documented
  •  User Stories Reviewed
  •  Design Reviewed
  •  APIs reviewed
  •  Release priorities assigned
  •  Test cases reviewed
  •  Blog post