Overview
The CDAP 4.0 UI is designed to provide operational insights about both CDAP services and other service providers such as YARN, HBase, and HDFS. The CDAP platform will need to expose additional APIs to surface this information.
Requirements
The operational APIs should surface the information shown on the Management Screen.
These designs translate into the following requirements:
- CDAP Uptime
- P1: Should indicate the time (number of hours, days?) for which the CDAP Master process has been running.
- P2: In an HA environment, it would be nice to indicate the time of the last master failover.
- CDAP System Services:
- P1: Should indicate the current number of instances.
- P1: Should have a way to scale services.
- P1: Should show service logs
- P2: Node name where container started
- P2: Container name
- P2: master.services YARN application name
- Middle Drawer:
- CDAP:
- P1: # of masters, routers, kafka-servers, auth-servers
- P1: Router requests - # 200s, 404s, 500s
- P1: # namespaces, artifacts, apps, programs, datasets, streams, views
- P1: Transaction snapshot summary (invalid, in-progress, committing, committed)
- P1: Logs/Metrics service lags
- P2: Last GC pause time
- HDFS:
- P1: Space metrics: total, free, used
- P1: Nodes: total, healthy, decommissioned, decommissionInProgress
- P1: Blocks: missing, corrupt, under-replicated
- YARN:
- P1: Nodes: total, new, running, unhealthy, decommissioned, lost, rebooted
- P1: Apps: total, submitted, accepted, running, failed, killed, new, new_saving
- P1: Memory: total, used, free
- P1: Virtual Cores: total, used, free
- P1: Queues: total, stopped, running, max_capacity, current_capacity
- HBase
- P1: Nodes: total_regionservers, live_regionservers, dead_regionservers, masters
- P1: No. of namespaces, tables
- P2: Last major compaction (time + info)
- Zookeeper: Most of these are from the output of echo mntr | nc localhost 2181
- P1: Num of alive connections
- P1: Num of znodes
- P1: Num of watches
- P1: Num of ephemeral nodes
- P1: Data size
- P1: Open file descriptor count
- P1: Max file descriptor count
- Kafka
- JMX Metrics that Kafka exposes: https://kafka.apache.org/documentation#monitoring
- https://github.com/linkedin/kafka-monitor may have some clues
- Sentry
- P1: # of roles
- P1: # of privileges
- P1: memory: total, used, available
- P1: requests per second
- any more?
- KMS
- CDAP:
- Component Overview
- P1: YARN, HDFS, HBase, Zookeeper, Kafka, Hive
- P1: For each component: version, url, logs_url
- P2: Sentry, KMS
- P2: Distribution info
- P2: Plus button - to store custom components and version, url, logs_url for each.
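The Zookeeper metrics listed above come back from the mntr command as one tab-separated key/value pair per line. A minimal sketch of a parser follows. The class name ZkMntrParser and its methods are hypothetical helpers, not part of CDAP or ZooKeeper, and the key names in the usage note are the ones ZooKeeper 3.4 is expected to emit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses the text returned by ZooKeeper's "mntr" command. */
final class ZkMntrParser {

  private ZkMntrParser() { }

  /**
   * Splits the output into lines and each line at its first tab.
   * Lines without a tab (or with an empty key) are skipped.
   */
  static Map<String, String> parse(String mntrOutput) {
    Map<String, String> metrics = new LinkedHashMap<>();
    for (String line : mntrOutput.split("\\r?\\n")) {
      int tab = line.indexOf('\t');
      if (tab <= 0) {
        continue;
      }
      metrics.put(line.substring(0, tab).trim(), line.substring(tab + 1).trim());
    }
    return metrics;
  }

  /** Returns the metric as a long, or -1 if it is absent or not numeric. */
  static long getLong(Map<String, String> metrics, String key) {
    String value = metrics.get(key);
    if (value == null) {
      return -1L;
    }
    try {
      return Long.parseLong(value);
    } catch (NumberFormatException e) {
      return -1L;
    }
  }
}
```

With this sketch, the P1 items would map to keys such as zk_num_alive_connections, zk_znode_count, zk_watch_count, zk_ephemerals_count, zk_approximate_data_size, zk_open_file_descriptor_count and zk_max_file_descriptor_count, for example ZkMntrParser.getLong(metrics, "zk_znode_count").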
Design
Data Sources
Data for these APIs will be sourced using:
- DistributedFileSystem - For HDFS statistics
- YarnClient - for YARN statistics and info
- HBaseAdmin - for HBase statistics and info
- Configuration and HBaseConfiguration - For HDFS, YARN and HBase info
Versions
- CDAP - co.cask.cdap.common.utils.ProjectInfo
- HBase - co.cask.cdap.data2.util.hbase.HBaseVersion
- YARN - org.apache.hadoop.yarn.util.YarnVersionInfo
- HDFS - org.apache.hadoop.util.VersionInfo
- Zookeeper - No client API available. Will have to build a utility around echo stat | nc localhost 2181
- Hive - org.apache.hive.common.util.HiveVersionInfo
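Since Zookeeper has no client API for its version, the utility mentioned above could send the stat command over a plain socket and read the first line of the reply. The sketch below assumes the reply begins with a line of the form "Zookeeper version: <version>, built on <date>". The class ZkFourLetterWord and both method names are hypothetical, and newer ZooKeeper releases may require four-letter commands to be explicitly enabled on the server.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical utility that mimics "echo stat | nc host port". */
final class ZkFourLetterWord {

  private ZkFourLetterWord() { }

  /** Sends a four-letter command and returns everything the server writes before closing. */
  static String send(String host, int port, String command, int timeoutMillis) throws IOException {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMillis);
      socket.setSoTimeout(timeoutMillis);
      OutputStream out = socket.getOutputStream();
      out.write(command.getBytes(StandardCharsets.US_ASCII));
      out.flush();
      socket.shutdownOutput(); // signal end of request, as nc does at end of input

      InputStream in = socket.getInputStream();
      ByteArrayOutputStream reply = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      int read;
      while ((read = in.read(buffer)) != -1) {
        reply.write(buffer, 0, read);
      }
      return new String(reply.toByteArray(), StandardCharsets.UTF_8);
    }
  }

  /**
   * Extracts the version from "stat" output, or returns null if no version line is found.
   * The returned string keeps any build suffix (for example "3.4.6-1569965").
   */
  static String parseVersion(String statOutput) {
    String prefix = "Zookeeper version:";
    for (String line : statOutput.split("\\r?\\n")) {
      if (line.startsWith(prefix)) {
        String rest = line.substring(prefix.length()).trim();
        int comma = rest.indexOf(',');
        return comma >= 0 ? rest.substring(0, comma) : rest;
      }
    }
    return null;
  }
}
```

A caller would combine the two, for example ZkFourLetterWord.parseVersion(ZkFourLetterWord.send("localhost", 2181, "stat", 5000)). Keeping the parsing separate from the socket code lets the parser be unit tested without a running Zookeeper.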
URL
- CDAP - $(dashboard.bind.address) + $(dashboard.bind.port)
- YARN - $(yarn.resourcemanager.webapp.address)
- HDFS - $(dfs.namenode.http-address)
- HBase - hbaseAdmin.getClusterStatus().getMaster().toString()
REST API
The following REST APIs will be exposed from app fabric.
Info
Path
/v3/system/serviceproviders/info
Output
{
  "hdfs": {
    "version": "2.7.0",
    "url": "http://localhost:50070",
    "logs": "http://localhost:50070/logs/"
  },
  "yarn": {
    "version": "2.7.0",
    "url": "http://localhost:8088",
    "logs": "http://localhost:8088/logs/"
  },
  "hbase": {
    "version": "1.0.0",
    "url": "http://localhost:60010",
    "logs": "http://localhost:60010/logs/"
  },
  "hive": {
    "version": "1.2"
  },
  "zookeeper": {
    "version": "3.4.2"
  },
  "kafka": {
    "version": "2.10"
  }
}
Statistics
Path
/v3/system/serviceproviders/statistics
Output
{
  "cdap": {
    "masters": 2,
    "kafka-servers": 2,
    "routers": 1,
    "auth-servers": 1,
    "namespaces": 10,
    "apps": 46,
    "artifacts": 23,
    "datasets": 68,
    "streams": 34,
    "programs": 78
  },
  "hdfs": {
    "space": { "total": 3452759234, "used": 34525543, "available": 3443555345 },
    "nodes": { "total": 40, "healthy": 36, "decommissioned": 3, "decommissionInProgress": 1 },
    "blocks": { "missing": 33, "corrupt": 3, "underreplicated": 5 }
  },
  "yarn": {
    "nodes": { "total": 35, "new": 0, "running": 30, "unhealthy": 1, "decommissioned": 2, "lost": 1, "rebooted": 1 },
    "apps": { "total": 30, "submitted": 2, "accepted": 4, "running": 20, "failed": 1, "killed": 3, "new": 0, "new_saving": 0 },
    "memory": { "total": 8192, "used": 7168, "available": 1024 },
    "virtualCores": { "total": 36, "used": 12, "available": 24 },
    "queues": { "total": 10, "stopped": 2, "running": 8, "maxCapacity": 32, "currentCapacity": 21 }
  },
  "hbase": {
    "nodes": { "totalRegionServers": 37, "liveRegionServers": 34, "deadRegionServers": 3, "masters": 3 },
    "tables": 56,
    "namespaces": 43
  }
}
Sentry
The following is available by enabling the sentry web service (ref: http://www.cloudera.com/documentation/enterprise/latest/topics/sg_sentry_metrics.html) and querying for metrics (API: http://[sentry-service-host]:51000/metrics?pretty=true).
{ "version" : "3.0.0", "gauges" : { "buffers.direct.capacity" : { "value" : 57344 }, "buffers.direct.count" : { "value" : 5 }, "buffers.direct.used" : { "value" : 57344 }, "buffers.mapped.capacity" : { "value" : 0 }, "buffers.mapped.count" : { "value" : 0 }, "buffers.mapped.used" : { "value" : 0 }, "gc.PS-MarkSweep.count" : { "value" : 0 }, "gc.PS-MarkSweep.time" : { "value" : 0 }, "gc.PS-Scavenge.count" : { "value" : 2 }, "gc.PS-Scavenge.time" : { "value" : 26 }, "memory.heap.committed" : { "value" : 1029701632 }, "memory.heap.init" : { "value" : 1073741824 }, "memory.heap.max" : { "value" : 1029701632 }, "memory.heap.usage" : { "value" : 0.17999917863585554 }, "memory.heap.used" : { "value" : 185345448 }, "memory.non-heap.committed" : { "value" : 31391744 }, "memory.non-heap.init" : { "value" : 24576000 }, "memory.non-heap.max" : { "value" : 136314880 }, "memory.non-heap.usage" : { "value" : 0.2187954829289363 }, "memory.non-heap.used" : { "value" : 29825080 }, "memory.pools.Code-Cache.usage" : { "value" : 0.029324849446614582 }, "memory.pools.PS-Eden-Space.usage" : { "value" : 0.6523454156767787 }, "memory.pools.PS-Old-Gen.usage" : { "value" : 1.1440740671897877E-4 }, "memory.pools.PS-Perm-Gen.usage" : { "value" : 0.32970512204053926 }, "memory.pools.PS-Survivor-Space.usage" : { "value" : 0.22010480095358456 }, "memory.total.committed" : { "value" : 1061093376 }, "memory.total.init" : { "value" : 1098317824 }, "memory.total.max" : { "value" : 1166016512 }, "memory.total.used" : { "value" : 215170528 }, "org.apache.sentry.provider.db.service.persistent.SentryStore.group_count" : { "value" : 3 }, "org.apache.sentry.provider.db.service.persistent.SentryStore.privilege_count" : { "value" : 0 }, "org.apache.sentry.provider.db.service.persistent.SentryStore.role_count" : { "value" : 132 }, "threads.blocked.count" : { "value" : 1 }, "threads.count" : { "value" : 38 }, "threads.daemon.count" : { "value" : 27 }, "threads.deadlocks" : { "value" : [ ] }, 
"threads.new.count" : { "value" : 0 }, "threads.runnable.count" : { "value" : 6 }, "threads.terminated.count" : { "value" : 0 }, "threads.timed_waiting.count" : { "value" : 8 }, "threads.waiting.count" : { "value" : 23 } }, "counters" : { }, "histograms" : { }, "meters" : { }, "timers" : { "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.create-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 
0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-authorizable" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-for-provider" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-roles-by-group" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.rename-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, 
"mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" } } }
TODO: CDAP Master Uptime?
Caching
It is not practical to query HBase, YARN, and HDFS on every request from the UI. The result of the statistics API will therefore have to be cached, with a configurable timeout. Details TBD.
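Until the details are settled, one possible shape for such a cache is a single value that is reloaded once it is older than the configured timeout. The sketch below is only an illustration under that assumption. The class ExpiringValue is hypothetical, and the clock is passed in so that expiry can be tested without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical time-based cache holding one value, e.g. the statistics response. */
final class ExpiringValue<T> {

  private final Supplier<T> loader;   // performs the expensive HBase/YARN/HDFS calls
  private final long timeoutMillis;   // the configurable timeout
  private final LongSupplier clock;   // returns the current time in milliseconds

  private T value;
  private long loadedAt;
  private boolean loaded;

  ExpiringValue(Supplier<T> loader, long timeoutMillis, LongSupplier clock) {
    this.loader = loader;
    this.timeoutMillis = timeoutMillis;
    this.clock = clock;
  }

  /** Returns the cached value, reloading it first if it is missing or has expired. */
  synchronized T get() {
    long now = clock.getAsLong();
    if (!loaded || now - loadedAt >= timeoutMillis) {
      value = loader.get();
      loadedAt = now;
      loaded = true;
    }
    return value;
  }
}
```

In production code the clock would typically be System::currentTimeMillis. Because get() is synchronized, concurrent UI requests arriving after expiry trigger only one reload, at the cost of blocking the other callers while it runs. Guava's Suppliers.memoizeWithExpiration offers similar behavior and may be a simpler choice if Guava is already on the classpath.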