- Created by Bhooshan Mogal, last modified on Oct 25, 2016
Introduction
The CDAP 4.0 UI is designed to provide operational insights about both CDAP services and other service providers such as YARN, HBase and HDFS. The CDAP platform will need to expose additional APIs to surface this information.
Goals
The operational APIs should surface the information needed by the Management Screen.
The designs for that screen translate into the following requirements:
- CDAP Uptime
- P1: Should indicate the time (number of hours, days?) for which the CDAP Master process has been running.
- P2: In an HA environment, it would be nice to indicate the time of the last master failover.
- CDAP System Services:
- P1: Should indicate the current number of instances.
- P1: Should have a way to scale services.
- P1: Should show service logs
- P2: Node name where container started
- P2: Container name
- P2: master.services YARN application name
- Middle Drawer:
- CDAP:
- P1: # of masters, routers, kafka-servers, auth-servers
- P1: Router requests - # 200s, 404s, 500s
- P1: # namespaces, artifacts, apps, programs, datasets, streams, views
- P1: Transaction snapshot summary (invalid, in-progress, committing, committed)
- P1: Logs/Metrics service lags
- P2: Last GC pause time
- HDFS:
- P1: Space metrics: total, free, used
- P1: Nodes: total, healthy, decommissioned, decommissionInProgress
- P1: Blocks: missing, corrupt, under-replicated
- YARN:
- P1: Nodes: total, new, running, unhealthy, decommissioned, lost, rebooted
- P1: Apps: total, submitted, accepted, running, failed, killed, new, new_saving
- P1: Memory: total, used, free
- P1: Virtual Cores: total, used, free
- P1: Queues: total, stopped, running, max_capacity, current_capacity
- HBase
- P1: Nodes: total_regionservers, live_regionservers, dead_regionservers, masters
- P1: No. of namespaces, tables
- P2: Last major compaction (time + info)
- Zookeeper: Most of these are from the output of
echo mntr | nc localhost 2181
- P1: Num of alive connections
- P1: Num of znodes
- P1: Num of watches
- P1: Num of ephemeral nodes
- P1: Data size
- P1: Open file descriptor count
- P1: Max file descriptor count
- Kafka
- JMX Metrics that Kafka exposes: https://kafka.apache.org/documentation#monitoring
- P1: # of topics
- P1: Message in rate
- P1: Request rate
- P1: # of under replicated partitions
- P1: Partition counts
- Sentry
- P1: # of roles
- P1: # of privileges
- P1: memory: total, used, available
- P1: requests per second
- any more?
- KMS
- TBD: Having a hard time hitting the JMX endpoint for KMS
- CDAP:
- Component Overview
- P1: YARN, HDFS, HBase, Zookeeper, Kafka, Hive
- P1: For each component: version, url, logs_url
- P2: Sentry, KMS
- P2: Distribution info
- P2: Plus button - to store custom components and version, url, logs_url for each.
User Stories
Design
Data Sources
Versions
- CDAP - co.cask.cdap.common.utils.ProjectInfo
- HBase - co.cask.cdap.data2.util.hbase.HBaseVersion
- YARN - org.apache.hadoop.yarn.util.YarnVersionInfo
- HDFS - org.apache.hadoop.util.VersionInfo
- Zookeeper - No client API available. Will have to build a utility around: echo stat | nc localhost 2181
- Hive - org.apache.hive.common.util.HiveVersionInfo
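Since Zookeeper has no client API for its version, the utility around the stat command could look roughly like the sketch below. It assumes the first line of the stat reply has the form "Zookeeper version: <version>-<revision>, built on <date>"; the function name parse_zookeeper_version is hypothetical, and fetching the raw reply from the server is left out.

```python
import re
from typing import Optional


def parse_zookeeper_version(stat_output: str) -> Optional[str]:
    """Extract the dotted version number from the text returned by `stat`.

    Returns None when no version line is present (for example, when the
    server rejected the command).
    """
    # Assumption: the reply contains "Zookeeper version: 3.4.5-1392090, built on ..."
    match = re.search(r"Zookeeper version:\s*([0-9]+(?:\.[0-9]+)*)", stat_output)
    return match.group(1) if match else None
```

Returning None rather than raising lets the caller omit the version field for Zookeeper when the command is unavailable.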
URL
- CDAP - $(dashboard.bind.address) + $(dashboard.bind.port)
- YARN - $(yarn.resourcemanager.webapp.address)
- HDFS - $(dfs.namenode.http-address)
- HBase - hbaseAdmin.getClusterStatus().getMaster().toString()
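As an illustration of the configuration-based lookups above, the following sketch builds the CDAP, YARN and HDFS URLs from a plain dictionary of configuration properties. The property names come from the list above; the function name resolve_provider_urls, the dictionary-based configuration, and the default http scheme are assumptions made for the example. HBase is omitted because its URL comes from the HBase client rather than from configuration.

```python
from typing import Dict


def resolve_provider_urls(conf: Dict[str, str], scheme: str = "http") -> Dict[str, str]:
    """Build service provider URLs from configuration properties.

    Providers whose properties are missing are simply left out of the result.
    """
    urls: Dict[str, str] = {}

    # CDAP: bind address and port are two separate properties.
    host = conf.get("dashboard.bind.address")
    port = conf.get("dashboard.bind.port")
    if host and port:
        urls["cdap"] = f"{scheme}://{host}:{port}"

    # YARN and HDFS: each property already holds a host:port pair.
    for provider, key in (
        ("yarn", "yarn.resourcemanager.webapp.address"),
        ("hdfs", "dfs.namenode.http-address"),
    ):
        address = conf.get(key)
        if address:
            urls[provider] = f"{scheme}://{address}"
    return urls
```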
HDFS
DistributedFileSystem - For HDFS statistics
YARN
YarnClient - for YARN statistics and info
HBase
HBaseAdmin - for HBase statistics and info
Kafka
JMX
Reference: https://github.com/linkedin/kafka-monitor
Zookeeper
Option 1: Four-letter commands (mntr). Drawback: mntr was introduced in Zookeeper 3.4.0, so it is not available to users running older versions of Zookeeper
Option 2: Zookeeper also exposes JMX - https://zookeeper.apache.org/doc/trunk/zookeeperJMX.html
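A minimal sketch of Option 1 follows. It sends the mntr command over a plain socket and maps the tab-separated reply onto the P1 Zookeeper statistics listed in the requirements. The zk_* keys are the ones mntr reports; the output field names and both function names are hypothetical, and servers that do not support or allow mntr will produce an empty result.

```python
import socket
from typing import Dict

# mntr key -> field name in the statistics output (field names are assumptions)
MNTR_FIELDS = {
    "zk_num_alive_connections": "connections",
    "zk_znode_count": "znodes",
    "zk_watch_count": "watches",
    "zk_ephemerals_count": "ephemeralNodes",
    "zk_approximate_data_size": "dataSize",
    "zk_open_file_descriptor_count": "openFileDescriptors",
    "zk_max_file_descriptor_count": "maxFileDescriptors",
}


def send_four_letter_command(host: str, port: int, command: str,
                             timeout: float = 5.0) -> str:
    """Send a four-letter command such as 'mntr' and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(mntr_output: str) -> Dict[str, int]:
    """Pick the statistics of interest out of the mntr reply."""
    stats: Dict[str, int] = {}
    for line in mntr_output.splitlines():
        parts = line.split(None, 1)  # key and value are separated by a tab
        if len(parts) != 2:
            continue
        key, value = parts
        field = MNTR_FIELDS.get(key)
        if field is not None and value.strip().isdigit():
            stats[field] = int(value.strip())
    return stats
```

A caller would combine the two as parse_mntr(send_four_letter_command("localhost", 2181, "mntr")). The file descriptor counts are only reported on platforms where the JVM exposes them, so those fields may be absent.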
HiveServer2
TBD
Sentry
JMX
The following is available by enabling the sentry web service (ref: http://www.cloudera.com/documentation/enterprise/latest/topics/sg_sentry_metrics.html) and querying for metrics (API: http://[sentry-service-host]:51000/metrics?pretty=true).
{ "version" : "3.0.0", "gauges" : { "buffers.direct.capacity" : { "value" : 57344 }, "buffers.direct.count" : { "value" : 5 }, "buffers.direct.used" : { "value" : 57344 }, "buffers.mapped.capacity" : { "value" : 0 }, "buffers.mapped.count" : { "value" : 0 }, "buffers.mapped.used" : { "value" : 0 }, "gc.PS-MarkSweep.count" : { "value" : 0 }, "gc.PS-MarkSweep.time" : { "value" : 0 }, "gc.PS-Scavenge.count" : { "value" : 2 }, "gc.PS-Scavenge.time" : { "value" : 26 }, "memory.heap.committed" : { "value" : 1029701632 }, "memory.heap.init" : { "value" : 1073741824 }, "memory.heap.max" : { "value" : 1029701632 }, "memory.heap.usage" : { "value" : 0.17999917863585554 }, "memory.heap.used" : { "value" : 185345448 }, "memory.non-heap.committed" : { "value" : 31391744 }, "memory.non-heap.init" : { "value" : 24576000 }, "memory.non-heap.max" : { "value" : 136314880 }, "memory.non-heap.usage" : { "value" : 0.2187954829289363 }, "memory.non-heap.used" : { "value" : 29825080 }, "memory.pools.Code-Cache.usage" : { "value" : 0.029324849446614582 }, "memory.pools.PS-Eden-Space.usage" : { "value" : 0.6523454156767787 }, "memory.pools.PS-Old-Gen.usage" : { "value" : 1.1440740671897877E-4 }, "memory.pools.PS-Perm-Gen.usage" : { "value" : 0.32970512204053926 }, "memory.pools.PS-Survivor-Space.usage" : { "value" : 0.22010480095358456 }, "memory.total.committed" : { "value" : 1061093376 }, "memory.total.init" : { "value" : 1098317824 }, "memory.total.max" : { "value" : 1166016512 }, "memory.total.used" : { "value" : 215170528 }, "org.apache.sentry.provider.db.service.persistent.SentryStore.group_count" : { "value" : 3 }, "org.apache.sentry.provider.db.service.persistent.SentryStore.privilege_count" : { "value" : 0 }, "org.apache.sentry.provider.db.service.persistent.SentryStore.role_count" : { "value" : 132 }, "threads.blocked.count" : { "value" : 1 }, "threads.count" : { "value" : 38 }, "threads.daemon.count" : { "value" : 27 }, "threads.deadlocks" : { "value" : [ ] }, 
"threads.new.count" : { "value" : 0 }, "threads.runnable.count" : { "value" : 6 }, "threads.terminated.count" : { "value" : 0 }, "threads.timed_waiting.count" : { "value" : 8 }, "threads.waiting.count" : { "value" : 23 } }, "counters" : { }, "histograms" : { }, "meters" : { }, "timers" : { "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.create-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.drop-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.grant-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 
0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-authorizable" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-by-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-privileges-for-provider" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.list-roles-by-group" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.rename-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, 
"mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-privilege" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" }, "org.apache.sentry.provider.db.service.thrift.SentryPolicyStoreProcessor.revoke-role" : { "count" : 0, "max" : 0.0, "mean" : 0.0, "min" : 0.0, "p50" : 0.0, "p75" : 0.0, "p95" : 0.0, "p98" : 0.0, "p99" : 0.0, "p999" : 0.0, "stddev" : 0.0, "m15_rate" : 0.0, "m1_rate" : 0.0, "m5_rate" : 0.0, "mean_rate" : 0.0, "duration_units" : "seconds", "rate_units" : "calls/second" } } }
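As a rough sketch, the P1 Sentry statistics could be derived from a metrics document shaped like the sample above. The gauge and timer names are taken from that sample. Treating memory.total.max as total memory, computing available memory as max minus used, and summing the timers' mean_rate values as requests per second are assumptions made for illustration, as are the function name and output field names.

```python
from typing import Any, Dict, Optional

SENTRY_STORE = "org.apache.sentry.provider.db.service.persistent.SentryStore."


def summarize_sentry_metrics(metrics: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce the Sentry /metrics JSON (already parsed) to a small summary."""
    gauges = metrics.get("gauges", {})
    timers = metrics.get("timers", {})

    def gauge(name: str) -> Optional[Any]:
        entry = gauges.get(name)
        return entry.get("value") if entry else None

    total = gauge("memory.total.max")
    used = gauge("memory.total.used")
    # Assumption: available memory is the difference between max and used.
    available = total - used if total is not None and used is not None else None

    return {
        "roles": gauge(SENTRY_STORE + "role_count"),
        "privileges": gauge(SENTRY_STORE + "privilege_count"),
        "memory": {"total": total, "used": used, "available": available},
        # Assumption: overall request rate is the sum of per-call mean rates.
        "requestsPerSecond": sum(t.get("mean_rate", 0.0) for t in timers.values()),
    }
```

The input would typically be the result of json.loads on the response body of the metrics endpoint.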
KMS
KMS also exposes JMX via the endpoint http://host:16000/kms/jmx.
TODO: CDAP Master Uptime?
Caching
It is not feasible to query HBase/YARN/HDFS on every request from the UI. The results of the statistics API will therefore have to be cached, with a configurable time-to-live. The cache will be keyed by the service provider name, with the statistics for that service provider as the value. In addition to expiring on time-to-live, cache entries will also need to be invalidated at the following points:
- In the ProgramController and ProgramRuntimeService class hierarchy, so YARN statistics can be updated
- In the Entity (app, dataset, stream) lifecycle (creation and deletion) when storage statistics will need to be updated
- P3: In the authorization class hierarchy, when the authorization statistics may need to be updated. However, we may choose to not do this, because we eventually want to get out of the business of managing authorization policies in CDAP
- TODO: Invalidation for Kafka and Zookeeper
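The caching behavior described above can be sketched as a small time-to-live cache keyed by service provider name, with an explicit invalidate hook for the lifecycle events in the list. This is an illustration only: the class name StatsCache, the loader callback, and the injectable clock are hypothetical and not part of CDAP.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StatsCache:
    """Caches per-provider statistics with a configurable time-to-live."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # fetches fresh stats for a provider name
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, provider: str) -> Any:
        """Return cached stats, reloading them if missing or expired."""
        now = self._clock()
        entry = self._entries.get(provider)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        stats = self._loader(provider)
        self._entries[provider] = (now, stats)
        return stats

    def invalidate(self, provider: str) -> None:
        """Drop the cached stats, e.g. after a program starts or an entity is deleted."""
        self._entries.pop(provider, None)
```

In this sketch a program lifecycle hook would call invalidate("yarn"), and an entity creation or deletion hook would call invalidate for the affected storage provider; the next request then reloads fresh statistics. The sketch is not thread-safe.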
API changes
New REST APIs
The following REST APIs will be exposed from app fabric.
List Service Providers
This API lists all the available service providers and (optionally) minimal info about each (version, url and logs_url)
Path
GET /v3/system/serviceproviders
Output
{ "hdfs": { "version": "2.7.0", "url": "http://localhost:50070", "logs": "http://localhost:50070/logs/" }, "yarn": { "version": "2.7.0", "url": "http://localhost:8088", "logs": "http://localhost:8088/logs/" }, "hbase": { "version": "1.0.0", "url": "http://localhost:60010", "logs": "http://localhost:60010/logs/" }, "hive": { "version": "1.2" }, "zookeeper": { "version": "3.4.2" }, "kafka": { "version": "2.10" } }
Get Service Provider Statistics
Path
GET /v3/system/serviceproviders/{service-provider-name}/stats
Response
200 OK - Statistics for the specified service provider were successfully fetched
503 Service Unavailable - Could not contact the service provider for status
404 Not Found - Service provider not found (not in the list returned by the list service providers API)
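The mapping from outcomes to the response codes above could be sketched as follows. The function name handle_stats_request, the fetchers dictionary (provider name to a zero-argument callable), and the error body format are hypothetical; the sketch also assumes that a failure to reach the provider surfaces as an OSError, which in Python covers connection and timeout errors.

```python
from typing import Any, Callable, Dict, Tuple


def handle_stats_request(provider: str,
                         fetchers: Dict[str, Callable[[], Dict[str, Any]]]
                         ) -> Tuple[int, Dict[str, Any]]:
    """Return an (HTTP status, body) pair for a stats request."""
    fetch = fetchers.get(provider)
    if fetch is None:
        # Not in the list returned by the list service providers API.
        return 404, {"error": f"Service provider '{provider}' not found"}
    try:
        stats = fetch()
    except OSError as e:
        # Includes ConnectionError and TimeoutError.
        return 503, {"error": f"Could not contact '{provider}': {e}"}
    return 200, stats
```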
Output
CDAP:
{ "services": { "masters": 2, "kafka-servers": 2, "routers": 1, "auth-servers": 1, "namespaces": 10 }, "entities": { "apps": 46, "artifacts": 23, "datasets": 68, "streams": 34, "programs": 78 } }
HDFS:
{ "space": { "total": 3452759234, "used": 34525543, "available": 3443555345 }, "nodes": { "total": 40, "healthy": 36, "decommissioned": 3, "decommissionInProgress": 1 }, "blocks": { "missing": 33, "corrupt": 3, "underreplicated": 5 } }
YARN:
{ "nodes": { "total": 35, "new": 0, "running": 30, "unhealthy": 1, "decommissioned": 2, "lost": 1, "rebooted": 1 }, "apps": { "total": 30, "submitted": 2, "accepted": 4, "running": 20, "failed": 1, "killed": 3, "new": 0, "new_saving": 0 }, "memory": { "total": 8192, "used": 7168, "available": 1024 }, "virtualCores": { "total": 36, "used": 12, "available": 24 }, "queues": { "total": 10, "stopped": 2, "running": 8, "maxCapacity": 32, "currentCapacity": 21 } }
HBase:
{ "nodes": { "totalRegionServers": 37, "liveRegionServers": 34, "deadRegionServers": 3, "masters": 3 }, "tables": 56, "namespaces": 43 }
TODO: Add output for Kafka, Zookeeper, Sentry, KMS
CLI Impact or Changes
UI Impact or Changes
Security Impact
Impact on Infrastructure Outages
Test Scenarios
Releases
Release 4.0.0
Release 4.1.0
Related Work
- Work #1
- Work #2
- Work #3
Future work
Checklist
- User Stories Documented
- User Stories Reviewed
- Design Reviewed
- APIs reviewed
- Release priorities assigned
- Test cases reviewed
- Blog post