...
- CDAP Uptime
- P1: Should indicate the time (number of hours, days?) for which the CDAP Master process has been running.
- P2: In an HA environment, it would be nice to indicate the time of the last master failover.
- CDAP System Services:
- P1: Should indicate the current number of instances.
- P1: Should have a way to scale services.
- P1: Should show service logs
- P2: Node name where container started
- P2: Container name
- P2: master.services YARN application name
- Middle Drawer:
- CDAP:
- P1: # of masters, routers, kafka-servers, auth-servers
- P1: Router requests - # 200s, 404s, 500s
- P1: # namespaces, artifacts, apps, programs, datasets, streams, views
- P1: Transaction snapshot summary (invalid, in-progress, committing, committed)
- P1: Logs/Metrics service lags
- P2: Last GC pause time
- HDFS:
- P1: Space metrics: total, free, used
- P1: Nodes: total, healthy, decommissioned, decommissionInProgress
- P1: Blocks: missing, corrupt, under-replicated
- YARN:
- P1: Nodes: total, new, running, unhealthy, decommissioned, lost, rebooted
- P1: Apps: total, submitted, accepted, running, failed, killed, new, new_saving
- P1: Memory: total, used, free
- P1: Virtual Cores: total, used, free
- P1: Queues: total, stopped, running, max_capacity, current_capacity
- HBase
- P1: Nodes: total_regionservers, live_regionservers, dead_regionservers, masters
- P1: No. of namespaces, tables
- P2: Last major compaction (time + info)
- Zookeeper: Most of these are from the output of echo mntr | nc localhost 2181
- P1: Num of alive connections
- P1: Num of znodes
- P1: Num of watches
- P1: Num of ephemeral nodes
- P1: Data size
- P1: Open file descriptor count
- P1: Max file descriptor count
- Kafka
- JMX Metrics that Kafka exposes: https://kafka.apache.org/documentation#monitoring
- https://github.com/linkedin/kafka-monitor may have some clues
- Sentry
- P1: # of roles
- P1: # of privileges
- P1: memory: total, used, available
- P1: requests per second
- any more?
- KMS
- CDAP:
- Component Overview
- P1: YARN, HDFS, HBase, Zookeeper, Kafka, Hive
- P1: For each component: version, url, logs_url
- P2: Sentry, KMS
- P2: Distribution info
- P2: Plus button - to store custom components and version, url, logs_url for each.
Design
Data Sources
Data for these APIs will be sourced using:
- DistributedFileSystem - For HDFS statistics
- YarnClient - for YARN statistics and info
- HBaseAdmin - for HBase statistics and info
- Configuration and HBaseConfiguration - For HDFS, YARN and HBase info
Versions
- CDAP - co.cask.cdap.common.utils.ProjectInfo
- HBase - co.cask.cdap.data2.util.hbase.HBaseVersion
- YARN - org.apache.hadoop.yarn.util.YarnVersionInfo
- HDFS - org.apache.hadoop.util.VersionInfo
- Zookeeper - No client API available. Will have to build a utility around echo stat | nc localhost 2181
- Hive - org.apache.hive.common.util.HiveVersionInfo
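The Zookeeper utility mentioned in the list above could be sketched as follows. This is a minimal sketch, not existing CDAP code: the class name ZkVersionProbe and its methods are hypothetical, and it assumes the server answers the stat four-letter command with a first line of the form "Zookeeper version: <version>, built on <date>".

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, roughly equivalent to: echo stat | nc localhost 2181
final class ZkVersionProbe {
  // Assumed format of the first line of the "stat" response.
  private static final Pattern VERSION_LINE =
      Pattern.compile("^Zookeeper version:\\s*([^,\\s]+)", Pattern.MULTILINE);

  // Sends a four-letter command and returns the raw text response.
  static String fourLetterCommand(String host, int port, String command) throws IOException {
    try (Socket socket = new Socket(host, port)) {
      socket.setSoTimeout(5000);
      OutputStream out = socket.getOutputStream();
      out.write(command.getBytes(StandardCharsets.US_ASCII));
      out.flush();
      ByteArrayOutputStream response = new ByteArrayOutputStream();
      InputStream in = socket.getInputStream();
      byte[] buffer = new byte[4096];
      int read;
      // The server closes the connection after replying, which ends the loop.
      while ((read = in.read(buffer)) != -1) {
        response.write(buffer, 0, read);
      }
      return new String(response.toByteArray(), StandardCharsets.UTF_8);
    }
  }

  // Extracts the version token that follows "Zookeeper version:" up to the first comma.
  static Optional<String> parseVersion(String statOutput) {
    Matcher matcher = VERSION_LINE.matcher(statOutput);
    return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
  }

  public static void main(String[] args) throws IOException {
    String stat = fourLetterCommand("localhost", 2181, "stat");
    System.out.println(parseVersion(stat).orElse("unknown"));
  }
}
```

Keeping the parsing separate from the socket call lets the version extraction be unit tested without a running Zookeeper.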
...
- CDAP - $(dashboard.bind.address) + $(dashboard.bind.port)
- YARN - $(yarn.resourcemanager.webapp.address)
- HDFS - $(dfs.namenode.http-address)
- HBase - hbaseAdmin.getClusterStatus().getMaster().toString()
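As an illustration of how the configuration-based URLs in the list above might be assembled, here is a small sketch. It is an assumption-laden example: java.util.Properties stands in for the real CDAP and Hadoop configuration objects, the class ComponentUrls is hypothetical, the http scheme is assumed, and the property values in main are example values only.

```java
import java.util.Properties;

// Hypothetical helper; Properties is a stand-in for the real configuration classes.
final class ComponentUrls {
  // Joins the two dashboard settings into scheme://host:port form (http assumed).
  static String cdapUrl(Properties conf) {
    return "http://" + conf.getProperty("dashboard.bind.address")
        + ":" + conf.getProperty("dashboard.bind.port");
  }

  // For settings that already hold host:port, such as the YARN and HDFS web addresses.
  static String webUrl(Properties conf, String key) {
    return "http://" + conf.getProperty(key);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // Example values only.
    conf.setProperty("dashboard.bind.address", "localhost");
    conf.setProperty("dashboard.bind.port", "9999");
    conf.setProperty("yarn.resourcemanager.webapp.address", "localhost:8088");
    conf.setProperty("dfs.namenode.http-address", "localhost:50070");

    System.out.println(cdapUrl(conf));
    System.out.println(webUrl(conf, "yarn.resourcemanager.webapp.address"));
    System.out.println(webUrl(conf, "dfs.namenode.http-address"));
  }
}
```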
HDFS
DistributedFileSystem - For HDFS statistics
YARN
YarnClient - for YARN statistics and info
HBase
HBaseAdmin - for HBase statistics and info
Kafka
JMX
Zookeeper
Option 1: Four letter commands - mntr
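A minimal sketch of Option 1, assuming the mntr response is a series of lines, each holding a key and a value separated by a tab. The class ZkMntrParser is hypothetical; the raw text would be obtained by sending mntr to the Zookeeper client port, and the sample values below are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the tab-separated key/value lines printed by "mntr".
final class ZkMntrParser {
  static Map<String, String> parse(String mntrOutput) {
    Map<String, String> stats = new LinkedHashMap<>();
    for (String line : mntrOutput.split("\\r?\\n")) {
      int tab = line.indexOf('\t');
      if (tab <= 0) {
        continue; // skip blank or malformed lines
      }
      stats.put(line.substring(0, tab).trim(), line.substring(tab + 1).trim());
    }
    return stats;
  }

  public static void main(String[] args) {
    // Illustrative sample, not captured from a real server.
    String sample = "zk_num_alive_connections\t4\n"
        + "zk_znode_count\t120\n"
        + "zk_watch_count\t15\n"
        + "zk_ephemerals_count\t7\n"
        + "zk_approximate_data_size\t20480\n"
        + "zk_open_file_descriptor_count\t35\n"
        + "zk_max_file_descriptor_count\t4096\n";
    Map<String, String> stats = parse(sample);
    System.out.println("znodes: " + stats.get("zk_znode_count"));
    System.out.println("watches: " + stats.get("zk_watch_count"));
  }
}
```

The keys in the sample correspond to the P1 Zookeeper items in the requirements: alive connections, znodes, watches, ephemeral nodes, data size, and open and max file descriptor counts.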
HiveServer2
TBD
Sentry
JMX
The following is available by enabling the sentry web service (ref: http://www.cloudera.com/documentation/enterprise/latest/topics/sg_sentry_metrics.html) and querying for metrics (API: http://[sentry-service-host]:51000/metrics?pretty=true).
    {
      "version" : "3.0.0",
      "gauges" : {
        "buffers.direct.capacity" : { "value" : 57344 },
        "buffers.direct.count" : { "value" : 5 },
        "buffers.direct.used" : { "value" : 57344 },
        "buffers.mapped.capacity" : { "value" : 0 },
        "buffers.mapped.count" : { "value" : 0 },
        "buffers.mapped.used" : { "value" : 0 },
        "gc.PS-MarkSweep.count" : { "value" : 0 },
        "gc.PS-MarkSweep.time" : { "value" : 0 },
        "gc.PS-Scavenge.count" : { "value" : 2 },
        "gc.PS-Scavenge.time" : { "value" : 26 },
        "memory.heap.committed" : { "value" : 1029701632 },
        "memory.heap.init" : { "value" : 1073741824 },
        "memory.heap.max" : { "value" : 1029701632 },
        "memory.heap.usage" : { "value" : 0.17999917863585554 },
        "memory.heap.used" : { "value" : 185345448 },
        "memory.non-heap.committed" : { "value" : 31391744 },
        "memory.non-heap.init" : { "value" : 24576000 },
        "memory.non-heap.max" : { "value" : 136314880 },
        "memory.non-heap.usage" : { "value" : 0.2187954829289363 },
        "memory.non-heap.used" : { "value" : 29825080 },
        "memory.pools.Code-Cache.usage" : { "value" : 0.029324849446614582 },
        "memory.pools.PS-Eden-Space.usage" : { "value" : 0.6523454156767787 },
        "memory.pools.PS-Old-Gen.usage" : { "value" : 1.1440740671897877E-4 },
        "memory.pools.PS-Perm-Gen.usage" : { "value" : 0.32970512204053926 },
        "memory.pools.PS-Survivor-Space.usage" : { "value" : 0.22010480095358456 },
        "memory.total.committed" : { "value" : 1061093376 },
        "memory.total.init" : { "value" : 1098317824 },
        "memory.total.max" : { "value" : 1166016512 },
        "memory.total.used" : { "value" : 215170528 },
        "org.apache.sentry.provider.db.service.persistent.SentryStore.group_count" : { "value" : 3 },
        "org.apache.sentry.provider.db.service.persistent.SentryStore.privilege_count" : { "value" : 0 },
        "org.apache.sentry.provider.db.service.persistent.SentryStore.role_count" : { "value" : 132 },
        "threads.blocked.count" : { "value" : 1 },
        "threads.count" : { "value" : 38 },
        "threads.daemon.count" : { "value" : 27 },
        "threads.deadlocks" : { "value" : [ ] },
        "threads.new.count" : { "value" : 0 },
        "threads.runnable.count" : { "value" : 6 },
        "threads.terminated.count" : { "value" : 0 },
        "threads.timed_waiting.count" : { "value" : 8 },
        "threads.waiting.count" : { "value" : 23 }
      },
      "counters" : { },
      "histograms" : { },
      "meters" : { },
      "timers" : {
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.create-role" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-privilege" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-role" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-privilege" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-role" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-authorizable" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-role" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-for-provider" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-roles-by-group" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.rename-privilege" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-privilege" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        },
        "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-role" : {
          "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0,
          "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0,
          "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0,
          "duration_units" : "seconds", "rate_units" : "calls/second"
        }
      }
    }
KMS
KMS also exposes JMX via the endpoint http://host:16000/kms/jmx.
REST API
The following REST APIs will be exposed from app fabric.
Info
Path
/v3/system/serviceproviders/info
Output
    {
      "hdfs": {
        "version": "2.7.0",
        "url": "http://localhost:50070",
        "logs": "http://localhost:50070/logs/"
      },
      "yarn": {
        "version": "2.7.0",
        "url": "http://localhost:8088",
        "logs": "http://localhost:8088/logs/"
      },
      "hbase": {
        "version": "1.0.0",
        "url": "http://localhost:60010",
        "logs": "http://localhost:60010/logs/"
      },
      "hive": {
        "version": "1.2"
      },
      "zookeeper": {
        "version": "3.4.2"
      },
      "kafka": {
        "version": "2.10"
      }
    }
Statistics
Path
/v3/system/serviceproviders/statistics
Output
    {
      "cdap": {
        "masters": 2,
        "kafka-servers": 2,
        "routers": 1,
        "auth-servers": 1,
        "namespaces": 10,
        "apps": 46,
        "artifacts": 23,
        "datasets": 68,
        "streams": 34,
        "programs": 78
      },
      "hdfs": {
        "space": { "total": 3452759234, "used": 34525543, "available": 3443555345 },
        "nodes": { "total": 40, "healthy": 36, "decommissioned": 3, "decommissionInProgress": 1 },
        "blocks": { "missing": 33, "corrupt": 3, "underreplicated": 5 }
      },
      "yarn": {
        "nodes": { "total": 35, "new": 0, "running": 30, "unhealthy": 1, "decommissioned": 2, "lost": 1, "rebooted": 1 },
        "apps": { "total": 30, "submitted": 2, "accepted": 4, "running": 20, "failed": 1, "killed": 3, "new": 0, "new_saving": 0 },
        "memory": { "total": 8192, "used": 7168, "available": 1024 },
        "virtualCores": { "total": 36, "used": 12, "available": 24 },
        "queues": { "total": 10, "stopped": 2, "running": 8, "maxCapacity": 32, "currentCapacity": 21 }
      },
      "hbase": {
        "nodes": { "totalRegionServers": 37, "liveRegionServers": 34, "deadRegionServers": 3, "masters": 3 },
        "tables": 56,
        "namespaces": 43
      }
    }
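For illustration, a client could call the two endpoints as sketched below. Only the /v3/system/serviceproviders/info and /v3/system/serviceproviders/statistics paths come from this page; the class ServiceProvidersClient, the base URL argument, and the timeouts are assumptions, and authentication is not handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical client for the proposed endpoints; no authentication handling.
final class ServiceProvidersClient {
  static final String BASE_PATH = "/v3/system/serviceproviders/";

  // Builds the full endpoint URL; resource is "info" or "statistics".
  static String endpoint(String baseUrl, String resource) {
    String trimmed = baseUrl.endsWith("/")
        ? baseUrl.substring(0, baseUrl.length() - 1)
        : baseUrl;
    return trimmed + BASE_PATH + resource;
  }

  // Issues a GET request and returns the response body as text.
  static String get(String url) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    conn.setRequestMethod("GET");
    conn.setConnectTimeout(5000);
    conn.setReadTimeout(5000);
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
      StringBuilder body = new StringBuilder();
      String line;
      while ((line = reader.readLine()) != null) {
        body.append(line).append('\n');
      }
      return body.toString();
    } finally {
      conn.disconnect();
    }
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("usage: ServiceProvidersClient <base-url>");
      return;
    }
    System.out.println(get(endpoint(args[0], "info")));
    System.out.println(get(endpoint(args[0], "statistics")));
  }
}
```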
TODO: CDAP Master Uptime?
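One possible way to address this TODO, sketched under the assumption that the master process reports its own JVM uptime: java.lang.management exposes the uptime in milliseconds, which can be rendered as days and hours as the requirement suggests. The class MasterUptime and the output format are hypothetical, and failover time in an HA setup is not covered.

```java
import java.lang.management.ManagementFactory;
import java.time.Duration;

// Hypothetical helper that formats the uptime of the JVM it runs in.
final class MasterUptime {
  // Renders a duration as days, hours and minutes, for example "2d 1h 5m".
  static String format(Duration uptime) {
    long days = uptime.toDays();
    long hours = uptime.toHours() % 24;
    long minutes = uptime.toMinutes() % 60;
    return String.format("%dd %dh %dm", days, hours, minutes);
  }

  public static void main(String[] args) {
    // Milliseconds since this JVM started.
    long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
    System.out.println(format(Duration.ofMillis(uptimeMillis)));
  }
}
```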
...