...
- CDAP Uptime
- P1: Should indicate the time (number of hours, days?) for which the CDAP Master process has been running.
- P2: In an HA environment, it would be nice to indicate the time of the last master failover.
- CDAP System Services:
- P1: Should indicate the current number of instances.
- P1: Should have a way to scale services.
- P1: Should show service logs
- P2: Node name where container started
- P2: Container name
- P2: master.services YARN application name
- Middle Drawer:
- CDAP:
- P1: # of masters, routers, kafka-servers, auth-servers
- P1: Router requests - # 200s, 404s, 500s
- P1: # namespaces, artifacts, apps, programs, datasets, streams, views
- P1: Transaction snapshot summary (invalid, in-progress, committing, committed)
- P1: Logs/Metrics service lags
- P2: Last GC pause time
- HDFS:
- P1: Space metrics: total, free, used
- P1: Nodes: total, healthy, decommissioned, decommissionInProgress
- P1: Blocks: missing, corrupt, under-replicated
- YARN:
- P1: Nodes: total, new, running, unhealthy, decommissioned, lost, rebooted
- P1: Apps: total, submitted, accepted, running, failed, killed, new, new_saving
- P1: Memory: total, used, free
- P1: Virtual Cores: total, used, free
- P1: Queues: total, stopped, running, max_capacity, current_capacity
- HBase
- P1: Nodes: total_regionservers, live_regionservers, dead_regionservers, masters
- P1: No. of namespaces, tables
- P2: Last major compaction (time + info)
- Zookeeper: Most of these are from the output of echo mntr | nc localhost 2181
- P1: Num of alive connections
- P1: Num of znodes
- P1: Num of watches
- P1: Num of ephemeral nodes
- P1: Data size
- P1: Open file descriptor count
- P1: Max file descriptor count
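The Zookeeper numbers above could be read from the tab-separated key/value lines that `mntr` prints. The following is a minimal sketch, not the planned implementation: the class name `MntrParser` is hypothetical and the sample values are made up. The items in the list correspond to the `mntr` keys `zk_num_alive_connections`, `zk_znode_count`, `zk_watch_count`, `zk_ephemerals_count`, `zk_approximate_data_size`, `zk_open_file_descriptor_count` and `zk_max_file_descriptor_count`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: parses the "key<TAB>value" lines printed by ZooKeeper's mntr command. */
final class MntrParser {

  /** Returns the mntr statistics in the order they were printed. */
  static Map<String, String> parse(String mntrOutput) {
    Map<String, String> stats = new LinkedHashMap<>();
    for (String line : mntrOutput.split("\\r?\\n")) {
      int tab = line.indexOf('\t');
      if (tab <= 0) {
        continue; // skip blank or malformed lines
      }
      stats.put(line.substring(0, tab).trim(), line.substring(tab + 1).trim());
    }
    return stats;
  }

  public static void main(String[] args) {
    // Illustrative sample only; real output comes from: echo mntr | nc localhost 2181
    String sample = "zk_num_alive_connections\t1\n"
        + "zk_znode_count\t4\n"
        + "zk_watch_count\t0\n"
        + "zk_ephemerals_count\t0\n"
        + "zk_approximate_data_size\t27\n"
        + "zk_open_file_descriptor_count\t23\n"
        + "zk_max_file_descriptor_count\t1024\n";
    Map<String, String> stats = parse(sample);
    System.out.println("znodes = " + stats.get("zk_znode_count"));
    System.out.println("watches = " + stats.get("zk_watch_count"));
  }
}
```

Values are kept as strings because some `mntr` fields (for example the version and the server state) are not numeric; callers can convert the counters they need.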
- Kafka
- JMX Metrics that Kafka exposes: https://kafka.apache.org/documentation#monitoring
- https://github.com/linkedin/kafka-monitor may have some clues
- Sentry
- The following is available by enabling the sentry web service (ref: http://www.cloudera.com/documentation/enterprise/latest/topics/sg_sentry_metrics.html) and querying for metrics (API: http://[sentry-service-host]:51000/metrics?pretty=true). TODO: Shortlist.
- P1: # of roles
- P1: # of privileges
- P1: memory: total, used, available
- P1: requests per second
- any more?
- KMS
- CDAP:
- Component Overview
- P1: YARN, HDFS, HBase, Zookeeper, Kafka, Hive
- P1: For each component: version, url, logs_url
- P2: Distribution info
- P2: Plus button - to store custom components and version, url, logs_url for each.
Design
Data Sources
Data for these APIs will be sourced using:
- DistributedFileSystem - For HDFS statistics
- YarnClient - for YARN statistics and info
- HBaseAdmin - for HBase statistics and info
- Configuration and HBaseConfiguration - For HDFS, YARN and HBase info
Versions
- CDAP - co.cask.cdap.common.utils.ProjectInfo
- HBase - co.cask.cdap.data2.util.hbase.HBaseVersion
- YARN - org.apache.hadoop.yarn.util.YarnVersionInfo
- HDFS - org.apache.hadoop.util.VersionInfo
- Zookeeper - No client API available. Will have to build a utility around echo stat | nc localhost 2181
- Hive - org.apache.hive.common.util.HiveVersionInfo
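For the Zookeeper entry, the utility around `echo stat | nc localhost 2181` could look roughly like the sketch below. It assumes the server answers the four-letter word `stat` with a first line of the form `Zookeeper version: <version>, built on <date>` and then closes the connection. The class name `ZkVersionProbe`, the method names and the 3-second timeout are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical utility: asks a ZooKeeper server for its version using the "stat" command. */
final class ZkVersionProbe {

  // Matches e.g. "Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT"
  private static final Pattern VERSION_LINE =
      Pattern.compile("^Zookeeper version:\\s*([^,\\s]+)", Pattern.MULTILINE);

  /** Sends a four-letter word and reads the reply until the server closes the socket. */
  static String sendFourLetterWord(String host, int port, String command, int timeoutMillis)
      throws IOException {
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), timeoutMillis);
      socket.setSoTimeout(timeoutMillis);
      OutputStream out = socket.getOutputStream();
      out.write(command.getBytes(StandardCharsets.US_ASCII));
      out.flush();
      InputStream in = socket.getInputStream();
      ByteArrayOutputStream buffer = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      int read;
      while ((read = in.read(chunk)) != -1) {
        buffer.write(chunk, 0, read);
      }
      return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }
  }

  /** Extracts the version token from "stat" output, or returns null if it is absent. */
  static String parseVersion(String statOutput) {
    Matcher matcher = VERSION_LINE.matcher(statOutput);
    return matcher.find() ? matcher.group(1) : null;
  }

  public static void main(String[] args) throws IOException {
    // Assumes a ZooKeeper server is listening on localhost:2181.
    String stat = sendFourLetterWord("localhost", 2181, "stat", 3000);
    System.out.println("Zookeeper version: " + parseVersion(stat));
  }
}
```

The returned token may carry a build suffix after the release number; whether to strip it before reporting the version is left open here.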
URL
- CDAP - $(dashboard.bind.address) + $(dashboard.bind.port)
- YARN - $(yarn.resourcemanager.webapp.address)
- HDFS - $(dfs.namenode.http-address)
- HBase - hbaseAdmin.getClusterStatus().getMaster().toString()
REST API
The following REST APIs will be exposed from app fabric.
Info
Path
/v3/system/serviceproviders/info
Output
{
  "hdfs": {
    "version": "2.7.0",
    "url": "http://localhost:50070",
    "logs": "http://localhost:50070/logs/"
  },
  "yarn": {
    "version": "2.7.0",
    "url": "http://localhost:8088",
    "logs": "http://localhost:8088/logs/"
  },
  "hbase": {
    "version": "1.0.0",
    "url": "http://localhost:60010",
    "logs": "http://localhost:60010/logs/"
  },
  "hive": {
    "version": "1.2"
  },
  "zookeeper": {
    "version": "3.4.2"
  },
  "kafka": {
    "version": "2.10"
  }
}
Statistics
Path
/v3/system/serviceproviders/statistics
Output
{
  "cdap": {
    "masters": 2,
    "kafka-servers": 2,
    "routers": 1,
    "auth-servers": 1,
    "namespaces": 10,
    "apps": 46,
    "artifacts": 23,
    "datasets": 68,
    "streams": 34,
    "programs": 78
  },
  "hdfs": {
    "space": {
      "total": 3452759234,
      "used": 34525543,
      "available": 3443555345
    },
    "nodes": {
      "total": 40,
      "healthy": 36,
      "decommissioned": 3,
      "decommissionInProgress": 1
    },
    "blocks": {
      "missing": 33,
      "corrupt": 3,
      "underreplicated": 5
    }
  },
  "yarn": {
    "nodes": {
      "total": 35,
      "new": 0,
      "running": 30,
      "unhealthy": 1,
      "decommissioned": 2,
      "lost": 1,
      "rebooted": 1
    },
    "apps": {
      "total": 30,
      "submitted": 2,
      "accepted": 4,
      "running": 20,
      "failed": 1,
      "killed": 3,
      "new": 0,
      "new_saving": 0
    },
    "memory": {
      "total": 8192,
      "used": 7168,
      "available": 1024
    },
    "virtualCores": {
      "total": 36,
      "used": 12,
      "available": 24
    },
    "queues": {
      "total": 10,
      "stopped": 2,
      "running": 8,
      "maxCapacity": 32,
      "currentCapacity": 21
    }
  },
  "hbase": {
    "nodes": {
      "totalRegionServers": 37,
      "liveRegionServers": 34,
      "deadRegionServers": 3,
      "masters": 3
    },
    "tables": 56,
    "namespaces": 43
  }
}
Sentry
The following is available by enabling the sentry web service (ref: http://www.cloudera.com/documentation/enterprise/latest/topics/sg_sentry_metrics.html) and querying for metrics (API: http://[sentry-service-host]:51000/metrics?pretty=true).
{
  "version": "3.0.0",
  "gauges": {
    "buffers.direct.capacity": { "value": 57344 },
    "buffers.direct.count": { "value": 5 },
    "buffers.direct.used": { "value": 57344 },
    "buffers.mapped.capacity": { "value": 0 },
    "buffers.mapped.count": { "value": 0 },
    "buffers.mapped.used": { "value": 0 },
    "gc.PS-MarkSweep.count": { "value": 0 },
    "gc.PS-MarkSweep.time": { "value": 0 },
    "gc.PS-Scavenge.count": { "value": 2 },
    "gc.PS-Scavenge.time": { "value": 26 },
    "memory.heap.committed": { "value": 1029701632 },
    "memory.heap.init": { "value": 1073741824 },
    "memory.heap.max": { "value": 1029701632 },
    "memory.heap.usage": { "value": 0.17999917863585554 },
    "memory.heap.used": { "value": 185345448 },
    "memory.non-heap.committed": { "value": 31391744 },
    "memory.non-heap.init": { "value": 24576000 },
    "memory.non-heap.max": { "value": 136314880 },
    "memory.non-heap.usage": { "value": 0.2187954829289363 },
    "memory.non-heap.used": { "value": 29825080 },
    "memory.pools.Code-Cache.usage": { "value": 0.029324849446614582 },
    "memory.pools.PS-Eden-Space.usage": { "value": 0.6523454156767787 },
    "memory.pools.PS-Old-Gen.usage": { "value": 1.1440740671897877E-4 },
    "memory.pools.PS-Perm-Gen.usage": { "value": 0.32970512204053926 },
    "memory.pools.PS-Survivor-Space.usage": { "value": 0.22010480095358456 },
    "memory.total.committed": { "value": 1061093376 },
    "memory.total.init": { "value": 1098317824 },
    "memory.total.max": { "value": 1166016512 },
    "memory.total.used": { "value": 215170528 },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.group_count": { "value": 3 },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.privilege_count": { "value": 0 },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.role_count": { "value": 132 },
    "threads.blocked.count": { "value": 1 },
    "threads.count": { "value": 38 },
    "threads.daemon.count": { "value": 27 },
    "threads.deadlocks": { "value": [ ] },
    "threads.new.count": { "value": 0 },
    "threads.runnable.count": { "value": 6 },
    "threads.terminated.count": { "value": 0 },
    "threads.timed_waiting.count": { "value": 8 },
    "threads.waiting.count": { "value": 23 }
  },
  "counters": { },
  "histograms": { },
  "meters": { },
  "timers": {
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.create-role": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-privilege": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-role": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-privilege": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-role": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-authorizable": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-role": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-for-provider": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-roles-by-group": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.rename-privilege": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-privilege": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-role": {
      "count": 0, "max": 0.0, "mean": 0.0, "min": 0.0,
      "p50": 0.0, "p75": 0.0, "p95": 0.0, "p98": 0.0, "p99": 0.0, "p999": 0.0,
      "stddev": 0.0, "m15_rate": 0.0, "m1_rate": 0.0, "m5_rate": 0.0, "mean_rate": 0.0,
      "duration_units": "seconds", "rate_units": "calls/second"
    }
  }
}
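The role and privilege counts requested for Sentry appear in the metrics response as the `SentryStore.role_count` and `SentryStore.privilege_count` gauges. The sketch below shows one way to pull a single integer gauge out of that response without adding a JSON dependency. It is an illustration only: the class name `SentryGauges` is hypothetical, the regular expression assumes the `"name" : { "value" : N }` shape seen in the sample, and a real implementation would more likely use a JSON parser.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts integer gauges from the Sentry /metrics JSON text. */
final class SentryGauges {

  static final String STORE_PREFIX =
      "org.apache.sentry.provider.db.service.persistent.SentryStore.";

  /** Returns the integer value of the named gauge, or empty if it is missing or not an integer. */
  static OptionalLong gauge(String metricsJson, String gaugeName) {
    Pattern pattern = Pattern.compile(
        "\"" + Pattern.quote(gaugeName) + "\"\\s*:\\s*\\{\\s*\"value\"\\s*:\\s*(-?\\d+)\\s*\\}");
    Matcher matcher = pattern.matcher(metricsJson);
    return matcher.find()
        ? OptionalLong.of(Long.parseLong(matcher.group(1)))
        : OptionalLong.empty();
  }

  public static void main(String[] args) {
    // Trimmed copy of the sample response; a real caller would fetch it over HTTP.
    String sample = "{ \"gauges\" : { "
        + "\"" + STORE_PREFIX + "privilege_count\" : { \"value\" : 0 }, "
        + "\"" + STORE_PREFIX + "role_count\" : { \"value\" : 132 } } }";
    System.out.println("roles = " + gauge(sample, STORE_PREFIX + "role_count"));
    System.out.println("privileges = " + gauge(sample, STORE_PREFIX + "privilege_count"));
  }
}
```

Fractional gauges such as `memory.heap.usage` are deliberately not matched by this pattern, so the helper returns empty for them rather than a truncated number.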
TODO: CDAP Master Uptime?
...