Introduction
The CDAP 4.0 UI is designed to provide operational insights into both CDAP services and other service providers such as YARN, HBase, and HDFS. The CDAP platform will need to expose additional APIs to surface this information.
Goals
The operational APIs should surface the information needed by the Management screen.
The designs for that screen translate into the following requirements:
- CDAP Uptime
- P1: Should indicate the time (number of hours, days?) for which the CDAP Master process has been running.
- P2: In an HA environment, it would be nice to indicate the time of the last master failover.
- CDAP System Services:
- P1: Should indicate the current number of instances.
- P1: Should have a way to scale services.
- P1: Should show service logs
- P2: Node name where container started
- P2: Container name
- P2: master.services YARN application name
- Middle Drawer:
- CDAP:
- P1: # of masters, routers, kafka-servers, auth-servers
- P1: Router requests - # 200s, 404s, 500s
- P1: # namespaces, artifacts, apps, programs, datasets, streams, views
- P1: Transaction snapshot summary (invalid, in-progress, committing, committed)
- P1: Logs/Metrics service lags
- P2: Last GC pause time
- HDFS:
- P1: Space metrics: total, free, used
- P1: Nodes: total, healthy, decommissioned, decommissionInProgress
- P1: Blocks: missing, corrupt, under-replicated
- YARN:
- P1: Nodes: total, new, running, unhealthy, decommissioned, lost, rebooted
- P1: Apps: total, submitted, accepted, running, failed, killed, new, new_saving
- P1: Memory: total, used, free
- P1: Virtual Cores: total, used, free
- P1: Queues: total, stopped, running, max_capacity, current_capacity
- HBase
- P1: Nodes: total_regionservers, live_regionservers, dead_regionservers, masters
- P1: No. of namespaces, tables
- P2: Last major compaction (time + info)
- Zookeeper: Most of these are from the output of echo mntr | nc localhost 2181
- P1: Num of alive connections
- P1: Num of znodes
- P1: Num of watches
- P1: Num of ephemeral nodes
- P1: Data size
- P1: Open file descriptor count
- P1: Max file descriptor count
- Kafka
- JMX Metrics that Kafka exposes: https://kafka.apache.org/documentation#monitoring
- P1: # of topics
- P1: Message in rate
- P1: Request rate
- P1: # of under replicated partitions
- P1: Partition counts
- Sentry
- P1: # of roles
- P1: # of privileges
- P1: memory: total, used, available
- P1: requests per second
- any more?
- KMS
- TBD: Having a hard time hitting the JMX endpoint for KMS
- CDAP:
- Component Overview
- P1: YARN, HDFS, HBase, Zookeeper, Kafka, Hive
- P1: For each component: version, url, logs_url
- P2: Sentry, KMS
- P2: Distribution info
- P2: Plus button - to store custom components and version, url, logs_url for each.
User Stories
- As a CDAP admin, I would like to have insights into the health of all CDAP system services including master, log saver, explore container, metrics processor, metrics, streams, transaction server and dataset executor
- As a CDAP admin, I would like to know information about my CDAP setup including the version of CDAP
- As a CDAP admin, I would like to know the uptime of CDAP including optionally the time since the last failover in an HA scenario
- As a CDAP admin, I would like to know the versions and (optionally) links to the web UI and logs if available of the underlying infrastructure components.
- As a CDAP admin, I would like to have operational insights including statistics such as request rate, node status, available compute as well as storage capacity for the underlying infrastructure components that CDAP relies upon. These insights should help me understand the health of these components as well as help in root cause analysis in case CDAP fails or performs poorly.
Design
Data Sources
Versions
- CDAP - co.cask.cdap.common.utils.ProjectInfo
- HBase - co.cask.cdap.data2.util.hbase.HBaseVersion
- YARN - org.apache.hadoop.yarn.util.YarnVersionInfo
- HDFS - org.apache.hadoop.util.VersionInfo
- Zookeeper - No client API available. Will have to build a utility around echo stat | nc localhost 2181
- Hive - org.apache.hive.common.util.HiveVersionInfo
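A minimal sketch of what the Zookeeper version utility could look like, written in Python for brevity. It assumes the `stat` response has already been read from the socket and that its first line has the form `Zookeeper version: <version>-<build>, built on <date>`. The function name `parse_zookeeper_version` is hypothetical and not part of CDAP.

```python
import re

# Assumed shape of the first line of the `stat` response, for example:
#   Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
_VERSION_LINE = re.compile(r"^Zookeeper version:\s*(\d+(?:\.\d+)*)", re.MULTILINE)


def parse_zookeeper_version(stat_output):
    """Return the dotted version found in `stat` output, or None if absent."""
    match = _VERSION_LINE.search(stat_output)
    return match.group(1) if match else None
```

Returning `None` rather than raising lets the caller omit the version for Zookeeper when the command is unavailable, while still listing the component.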
URL
- CDAP - $(dashboard.bind.address) + $(dashboard.bind.port)
- YARN - $(yarn.resourcemanager.webapp.address)
- HDFS - $(dfs.namenode.http-address)
- HBase - hbaseAdmin.getClusterStatus().getMaster().toString()
HDFS
DistributedFileSystem - For HDFS statistics
YARN
YarnClient - for YARN statistics and info
HBase
HBaseAdmin - for HBase statistics and info
Kafka
JMX
Reference: https://github.com/linkedin/kafka-monitor
Zookeeper
Option 1: Four-letter commands (mntr). Drawback: mntr was introduced in Zookeeper 3.4.0, so users running older versions of Zookeeper will not have it.
Option 2: Zookeeper also exposes JMX - https://zookeeper.apache.org/doc/trunk/zookeeperJMX.html
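For Option 1, the sketch below shows one way the `mntr` response could be fetched and reduced to the Zookeeper statistics listed in the requirements. It is in Python for brevity and is not the CDAP implementation. The `zk_*` keys are the ones `mntr` prints as tab-separated pairs; the output names (`connections`, `znodes`, and so on) and both function names are assumptions made for illustration.

```python
import socket

# mntr key -> statistic name. The right-hand names are assumptions for this sketch.
MNTR_FIELDS = {
    "zk_num_alive_connections": "connections",
    "zk_znode_count": "znodes",
    "zk_watch_count": "watches",
    "zk_ephemerals_count": "ephemeralNodes",
    "zk_approximate_data_size": "dataSize",
    "zk_open_file_descriptor_count": "openFileDescriptors",
    "zk_max_file_descriptor_count": "maxFileDescriptors",
}


def send_four_letter_command(host, port, command, timeout=5.0):
    """Send a four-letter command (such as 'mntr') and return the raw response."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after responding
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(text):
    """Extract the statistics of interest from tab-separated mntr output."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # not a key/value line
        name = MNTR_FIELDS.get(key.strip())
        if name is not None:
            stats[name] = int(value.strip())
    return stats
```

A caller would combine the two, for example `parse_mntr(send_four_letter_command("localhost", 2181, "mntr"))`. Keys that are missing from the response are simply absent from the result, which keeps the parser tolerant of differences between Zookeeper versions.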
HiveServer2
TBD
Sentry
JMX
The following is available by enabling the sentry web service (ref: http://www.cloudera.com/documentation/enterprise/latest/topics/sg_sentry_metrics.html) and querying for metrics (API: http://[sentry-service-host]:51000/metrics?pretty=true).
```json
{
  "version" : "3.0.0",
  "gauges" : {
    "buffers.direct.capacity" : { "value" : 57344 },
    "buffers.direct.count" : { "value" : 5 },
    "buffers.direct.used" : { "value" : 57344 },
    "buffers.mapped.capacity" : { "value" : 0 },
    "buffers.mapped.count" : { "value" : 0 },
    "buffers.mapped.used" : { "value" : 0 },
    "gc.PS-MarkSweep.count" : { "value" : 0 },
    "gc.PS-MarkSweep.time" : { "value" : 0 },
    "gc.PS-Scavenge.count" : { "value" : 2 },
    "gc.PS-Scavenge.time" : { "value" : 26 },
    "memory.heap.committed" : { "value" : 1029701632 },
    "memory.heap.init" : { "value" : 1073741824 },
    "memory.heap.max" : { "value" : 1029701632 },
    "memory.heap.usage" : { "value" : 0.17999917863585554 },
    "memory.heap.used" : { "value" : 185345448 },
    "memory.non-heap.committed" : { "value" : 31391744 },
    "memory.non-heap.init" : { "value" : 24576000 },
    "memory.non-heap.max" : { "value" : 136314880 },
    "memory.non-heap.usage" : { "value" : 0.2187954829289363 },
    "memory.non-heap.used" : { "value" : 29825080 },
    "memory.pools.Code-Cache.usage" : { "value" : 0.029324849446614582 },
    "memory.pools.PS-Eden-Space.usage" : { "value" : 0.6523454156767787 },
    "memory.pools.PS-Old-Gen.usage" : { "value" : 1.1440740671897877E-4 },
    "memory.pools.PS-Perm-Gen.usage" : { "value" : 0.32970512204053926 },
    "memory.pools.PS-Survivor-Space.usage" : { "value" : 0.22010480095358456 },
    "memory.total.committed" : { "value" : 1061093376 },
    "memory.total.init" : { "value" : 1098317824 },
    "memory.total.max" : { "value" : 1166016512 },
    "memory.total.used" : { "value" : 215170528 },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.group_count" : { "value" : 3 },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.privilege_count" : { "value" : 0 },
    "org.apache.sentry.provider.db.service.persistent.SentryStore.role_count" : { "value" : 132 },
    "threads.blocked.count" : { "value" : 1 },
    "threads.count" : { "value" : 38 },
    "threads.daemon.count" : { "value" : 27 },
    "threads.deadlocks" : { "value" : [ ] },
    "threads.new.count" : { "value" : 0 },
    "threads.runnable.count" : { "value" : 6 },
    "threads.terminated.count" : { "value" : 0 },
    "threads.timed_waiting.count" : { "value" : 8 },
    "threads.waiting.count" : { "value" : 23 }
  },
  "counters" : { },
  "histograms" : { },
  "meters" : { },
  "timers" : {
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.create-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-authorizable" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-for-provider" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-roles-by-group" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.rename-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" },
    "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }
  }
}
```
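The Sentry requirements (number of roles, number of privileges, and memory totals) can be read out of a metrics document of this shape. The sketch below, in Python for brevity, is one possible reduction and not the CDAP implementation. It assumes the document has already been fetched and parsed into a dictionary, treats `memory.total.max` as "total", and derives "available" as max minus used; those mappings and the name `summarize_sentry_metrics` are assumptions.

```python
# Prefix shared by the SentryStore gauges in the metrics document.
STORE_PREFIX = "org.apache.sentry.provider.db.service.persistent.SentryStore."


def summarize_sentry_metrics(metrics):
    """Reduce a parsed Sentry metrics document to the statistics of interest."""
    gauges = metrics.get("gauges", {})

    def gauge(name):
        # Each gauge is stored as {"value": ...}; a missing gauge yields None.
        entry = gauges.get(name)
        return entry.get("value") if entry else None

    total = gauge("memory.total.max")    # assumption: "total" means the max
    used = gauge("memory.total.used")
    available = None
    if total is not None and used is not None:
        available = total - used         # assumption: available = max - used

    return {
        "roles": gauge(STORE_PREFIX + "role_count"),
        "privileges": gauge(STORE_PREFIX + "privilege_count"),
        "memory": {"total": total, "used": used, "available": available},
    }
```

Applied to the sample response, this would report 132 roles and 0 privileges. A requests-per-second figure is left out because the sample does not make clear which timer rates should be aggregated.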
KMS
KMS also exposes JMX via the endpoint http://host:16000/kms/jmx.
TODO: CDAP Master Uptime?
Caching
It is not practical to hit HBase/YARN/HDFS for every request from the UI. The results of the statistics API will therefore have to be cached, with a configurable time-to-live. The cache will be keyed by the service provider name, with the statistics for that service provider as the value. In addition to expiring on the time-to-live, cache entries will also need to be invalidated at the following points:
- In the ProgramController and ProgramRuntimeService class hierarchy, so that YARN statistics can be updated
- In the entity (app, dataset, stream) lifecycle (creation and deletion), when storage statistics will need to be updated
- P3: In the authorization class hierarchy, when the authorization statistics may need to be updated. However, we may choose to not do this, because we eventually want to get out of the business of managing authorization policies in CDAP
- TODO: Invalidation for Kafka and Zookeeper
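The caching behaviour described above can be sketched as a small time-to-live cache keyed by service provider name, with an explicit invalidation hook for lifecycle events. This is an illustration in Python, not the CDAP implementation; the class name `StatsCache`, the `loader` callback, and the injectable `clock` are all assumptions introduced for the example.

```python
import time


class StatsCache:
    """TTL cache of statistics, keyed by service provider name (sketch only)."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # hypothetical callback: provider name -> stats
        self._ttl = ttl_seconds    # the configurable time-to-live
        self._clock = clock        # injectable so that expiry can be tested
        self._entries = {}         # provider name -> (expiry time, stats)

    def get(self, provider):
        """Return cached statistics, reloading them if missing or expired."""
        now = self._clock()
        entry = self._entries.get(provider)
        if entry is not None and now < entry[0]:
            return entry[1]
        stats = self._loader(provider)
        self._entries[provider] = (now + self._ttl, stats)
        return stats

    def invalidate(self, provider):
        """Drop the entry for one provider, e.g. after a program starts or stops."""
        self._entries.pop(provider, None)
```

In this sketch a program lifecycle hook would call `invalidate("yarn")` and an entity lifecycle hook would call `invalidate("hdfs")` or `invalidate("hbase")`, so the next UI request reloads fresh statistics. The class is not thread-safe as written; a real implementation would need locking or a concurrent cache.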
API changes
New REST APIs
The following REST APIs will be exposed from app fabric.
Path | Method | Description | Response Code | Response
---|---|---|---|---
/v3/system/serviceproviders | GET | Lists all the available service providers and (optionally) minimal info about each (version, url and logs_url) | 200 - On success; 500 - Any internal errors | See List Service Providers
/v3/system/serviceproviders/{service-provider-name}/stats | GET | Returns statistics for the specified service provider | 200 OK - Statistics for the specified service provider were successfully fetched; 503 Unavailable - Could not contact the service provider for status; 404 Not found - Service provider not found (not in the list returned by the list service providers API); 500 - Any other internal errors | See Get Service Provider Statistics
List Service Providers
This API lists all the available service providers and (optionally) minimal info about each (version, url and logs_url)
Path
GET /v3/system/serviceproviders
Output
```json
{
  "hdfs": {
    "version": "2.7.0",
    "url": "http://localhost:50070",
    "logs": "http://localhost:50070/logs/"
  },
  "yarn": {
    "version": "2.7.0",
    "url": "http://localhost:8088",
    "logs": "http://localhost:8088/logs/"
  },
  "hbase": {
    "version": "1.0.0",
    "url": "http://localhost:60010",
    "logs": "http://localhost:60010/logs/"
  },
  "hive": {
    "version": "1.2"
  },
  "zookeeper": {
    "version": "3.4.2"
  },
  "kafka": {
    "version": "2.10"
  }
}
```
Get Service Provider Statistics
Path
GET /v3/system/serviceproviders/{service-provider-name}/stats
Response
200 OK - Statistics for the specified service provider were successfully fetched
503 Unavailable - Could not contact the service provider for status
404 Not found - Service provider not found (not in the list returned by the list service providers API)
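The mapping from outcomes to response codes can be sketched as follows, in Python for brevity. This is not the app fabric handler; `get_provider_stats`, the `fetchers` registry, and the `ProviderUnavailable` exception are hypothetical names used only to illustrate the 200/404/503/500 decision.

```python
class ProviderUnavailable(Exception):
    """Hypothetical error raised when a service provider cannot be contacted."""


def get_provider_stats(name, fetchers):
    """Return (status code, body) for a statistics request.

    `fetchers` maps each known service provider name to a zero-argument
    callable that returns its statistics as a dictionary.
    """
    fetcher = fetchers.get(name)
    if fetcher is None:
        # Not in the list returned by the list service providers API.
        return 404, {"error": "Service provider not found: " + name}
    try:
        return 200, fetcher()
    except ProviderUnavailable as e:
        return 503, {"error": str(e)}
    except Exception as e:
        # Any other internal error.
        return 500, {"error": str(e)}
```

Checking the registry before calling the fetcher keeps an unknown name (404) distinct from a known provider that is temporarily unreachable (503).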
Output
CDAP:
```json
{
  "services": {
    "masters": 2,
    "kafka-servers": 2,
    "routers": 1,
    "auth-servers": 1,
    "namespaces": 10
  },
  "entities": {
    "apps": 46,
    "artifacts": 23,
    "datasets": 68,
    "streams": 34,
    "programs": 78
  }
}
```
HDFS:
```json
{
  "space": {
    "total": 3452759234,
    "used": 34525543,
    "available": 3443555345
  },
  "nodes": {
    "total": 40,
    "healthy": 36,
    "decommissioned": 3,
    "decommissionInProgress": 1
  },
  "blocks": {
    "missing": 33,
    "corrupt": 3,
    "underreplicated": 5
  }
}
```
YARN:
```json
{
  "nodes": {
    "total": 35,
    "new": 0,
    "running": 30,
    "unhealthy": 1,
    "decommissioned": 2,
    "lost": 1,
    "rebooted": 1
  },
  "apps": {
    "total": 30,
    "submitted": 2,
    "accepted": 4,
    "running": 20,
    "failed": 1,
    "killed": 3,
    "new": 0,
    "new_saving": 0
  },
  "memory": {
    "total": 8192,
    "used": 7168,
    "available": 1024
  },
  "virtualCores": {
    "total": 36,
    "used": 12,
    "available": 24
  },
  "queues": {
    "total": 10,
    "stopped": 2,
    "running": 8,
    "maxCapacity": 32,
    "currentCapacity": 21
  }
}
```
HBase:
```json
{
  "nodes": {
    "totalRegionServers": 37,
    "liveRegionServers": 34,
    "deadRegionServers": 3,
    "masters": 3
  },
  "tables": 56,
  "namespaces": 43
}
```
TODO: Add output for Kafka, Zookeeper, Sentry, KMS
CLI Impact or Changes
New CLI commands will have to be added to front the two new APIs.
List Service Providers
list service providers
Get Service Provider Statistics
get statistics for service provider <service-provider>
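As an illustration of how the two commands could front the two REST APIs, the sketch below maps a command line to the HTTP method and path it would call. It is written in Python for brevity and does not reflect the actual CDAP CLI framework; `cli_to_request` is a hypothetical helper.

```python
import re
from urllib.parse import quote

# Matches: get statistics for service provider <service-provider>
_STATS_COMMAND = re.compile(r"^get statistics for service provider (\S+)$")


def cli_to_request(command):
    """Translate a CLI command into the (method, path) of the REST call."""
    normalized = " ".join(command.split())  # collapse repeated whitespace
    if normalized == "list service providers":
        return "GET", "/v3/system/serviceproviders"
    match = _STATS_COMMAND.match(normalized)
    if match:
        provider = quote(match.group(1), safe="")  # escape for use in a path
        return "GET", "/v3/system/serviceproviders/%s/stats" % provider
    raise ValueError("Unrecognized command: " + command)
```

For example, `cli_to_request("get statistics for service provider hdfs")` yields the path `/v3/system/serviceproviders/hdfs/stats`.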
UI Impact or Changes
The Management screen on the CDAP 4.0 UI will have to be implemented using the APIs exposed by this design, in addition to the existing APIs for getting system service status and logs.
Security Impact
Currently, CDAP does not enforce authorization for the system services APIs (tracked in a Jira issue). Once authorization is enforced, only users with ADMIN privileges on the CDAP instance should be able to execute these APIs successfully.
Impact on Infrastructure Outages
Test Scenarios
Releases
Release 4.0.0
- Ground work for collecting statistics from infrastructure components.
- Focus on HDFS, YARN, HBase, Hive, Kafka, Zookeeper (in that order)
Release 4.1.0
- More components such as Sentry, KMS.
Related Work
Future work
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post