...

The context of a metric is typically enclosed in a hierarchy of contexts. For example, the Spark context is enclosed in the application context, which in turn is enclosed in the namespace context. A metric can always be queried (and aggregated) relative to any enclosing context.

System Metric

Context

All Mappers of a MapReduce

namespace:<namespace-id> app:<app-id> mapreduce:<mapreduce-id> tasktype:m

All Reducers of a MapReduce

namespace:<namespace-id> app:<app-id> mapreduce:<mapreduce-id> tasktype:r

One Run of a MapReduce

namespace:<namespace-id> app:<app-id> mapreduce:<mapreduce-id> run:<run-id>

One MapReduce

namespace:<namespace-id> app:<app-id> mapreduce:<mapreduce-id>

All MapReduce programs of an application

namespace:<namespace-id> app:<app-id> mapreduce:*

One service

namespace:<namespace-id> app:<app-id> service:<service-id>

All services of an application

namespace:<namespace-id> app:<app-id> service:*

One Spark program

namespace:<namespace-id> app:<app-id> spark:<spark-id>

All Spark programs of an application

namespace:<namespace-id> app:<app-id> spark:*

One worker

namespace:<namespace-id> app:<app-id> worker:<worker-id>

All workers of an application

namespace:<namespace-id> app:<app-id> worker:*

All components of an application

namespace:<namespace-id> app:<app-id>

All components of all applications

namespace:<namespace-id> app:*

Dataset metrics are available at the dataset level, but they can also be queried down to the worker, service, Mapper, or Reducer level:

Dataset Metric

Context

A single dataset in the context of a specific application

namespace:<namespace-id> dataset:<dataset-id> app:<app-id>

A single dataset

namespace:<namespace-id> dataset:<dataset-id>

All datasets

namespace:<namespace-id> dataset:*
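The context tables above can be sketched as a small helper that turns an ordered list of tag/value pairs into repeated tag= query parameters. This is a minimal illustration, not part of the API: the function name context_to_tags and the MapReduce name PurchaseHistoryBuilder are hypothetical. Dropping trailing pairs yields an enclosing context.

```python
from urllib.parse import urlencode


def context_to_tags(context):
    """Encode (tag, value) pairs as repeated tag=<name>:<value> parameters.

    ':' and '*' are kept unescaped so the result reads like the contexts
    in the tables above.
    """
    pairs = [("tag", f"{name}:{value}") for name, value in context]
    return urlencode(pairs, safe=":*")


# Hypothetical MapReduce name, used only for illustration.
mappers = [
    ("namespace", "default"),
    ("app", "PurchaseHistory"),
    ("mapreduce", "PurchaseHistoryBuilder"),
    ("tasktype", "m"),
]

print(context_to_tags(mappers))      # all Mappers of one MapReduce
print(context_to_tags(mappers[:2]))  # enclosing application context
```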

Available System Metrics

Note: A user metric may have the same name as a system metric. To distinguish them, prepend the respective prefix, user or system, to the metric name when querying.

These metrics are available in a dataset context:

Dataset Metric

Description

system.dataset.store.bytes

Number of bytes written

system.dataset.store.ops

Operations (reads and writes) performed

system.dataset.store.reads

Read operations performed

system.dataset.store.writes

Write operations performed

These metrics are available in a Mappers or Reducers context (specify whether a Mapper or Reducer context is desired, as shown above):

Mappers or Reducers Metric

Description

system.process.completion

A number from 0 to 100 indicating the progress of the Map or Reduce phase

system.process.entries.in

Number of entries read in by the Map or Reduce phase

system.process.entries.out

Number of entries written out by the Map or Reduce phase

These metrics are available in a service context:

Service Metric

Description

system.requests.count

Number of requests made to the service

system.response.successful.count

Number of successful requests completed by the service

system.response.server.error.count

Number of failures seen by the service

These metrics are available in a Spark context, where <spark-id> depends on the Spark program being queried:

Spark Metric

Description

system.<spark-id>.driver.BlockManager.disk.diskSpaceUsed_MB

Disk space used by the Block Manager

system.<spark-id>.driver.BlockManager.memory.maxMem_MB

Maximum memory given to the Block Manager

system.<spark-id>.driver.BlockManager.memory.memUsed_MB

Memory used by the Block Manager

system.<spark-id>.driver.BlockManager.memory.remainingMem_MB

Memory remaining to the Block Manager

system.<spark-id>.driver.DAGScheduler.job.activeJobs

Number of active jobs

system.<spark-id>.driver.DAGScheduler.job.allJobs

Total number of jobs

system.<spark-id>.driver.DAGScheduler.stage.failedStages

Number of failed stages

system.<spark-id>.driver.DAGScheduler.stage.runningStages

Number of running stages

system.<spark-id>.driver.DAGScheduler.stage.waitingStages

Number of waiting stages

These metrics are available for services, in either the system services component context or the user services context:

Request and Response Metric

Description

system.request.received

Number of requests received for the service

system.response.successful

Number of successful responses sent

system.response.{server-error, client-error}

Number of server-error or client-error responses sent

These metrics are available for every application context:

Application Logging Metric

Description

system.app.log.{error, info, warn}

Number of error, info, or warn log messages logged by an application or applications

These logging metrics are available for system services, in the system component context:

System Services Logging Metric

Description

system.services.log.{error, info, warn}

Number of error, info, or warn log messages logged by a system service or system services

These processing metrics are available for system services, in the system component context:

System Services Metric Processor Metric

Description

metrics.<metric.processor.id>.process.count

Number of metrics processed by metric processor instance

metrics.<metric.processor.id>.process.delay.ms

Metrics processing delay in milliseconds: the difference between the timestamp of the last processed metric and the current time

These metrics are available for the CDAP transaction service:

Transaction Metric

Description

system.start.{short, long}

Number of short or long transactions started

system.start.{short, long}.latency

Time taken (in milliseconds) to start short or long transactions

system.wal.append.count

Number of transaction edits added to the write-ahead log

system.{canCommit, commit, committed, inprogress, invalidate, abort}

Number of transactions in a specified transaction state

system.{canCommit, commit, committed, inprogress, invalidate, abort}.latency

Time taken (in milliseconds) to perform a specified transaction state update

system.{invalid, committing, committed, inprogress}.size

Number of transactions of a specified type that are active

These metrics are available for the CDAP transactional messaging service:

Transactional Messaging System Metric

Description

system.persist.requested

Number of message persist requests

system.persist.success

Number of message persist requests that succeeded

system.persist.failure

Number of message persist requests that failed

system.persist.queue.size

Number of messages in the queue that are persisted in one batch

system.cache.add.requests

Number of entries requested to add to the messaging cache

system.cache.entries.added

Number of entries added to the messaging cache

system.cache.entries.removed

Number of entries removed from the messaging cache

system.cache.add.reduce.weight

Number of times that the cache reduce weight logic was executed while adding entries to the cache. This number ideally should be very small for the cache to have good performance.

system.cache.scan.reduce.weight

Number of times that the cache reduce weight logic was executed while scanning the cache. This number should ideally be relatively small and steady over time.

system.cache.scan.requests

Number of scan requests on the messaging cache

system.cache.weight

The current weight of the cache, measured in bytes

These metrics are available for the YARN cluster resources:

YARN Cluster Metric

Description

system.resources.{total, available, used}.memory

Size (in megabytes) of total, available, or used cluster memory

system.resources.{total, available, used}.vcores

Number of total, available, or used cluster virtual cores

Searches and Queries

The process of retrieving a metric involves these steps:

...

You can also define the query to search in a given context across all values of one or more tags provided in the context by specifying * as a value for a tag. See the examples below for its use.

Parameter

Description

context [Optional]

Metrics context to search within. If not provided, the search is performed across all contexts. Consists of a collection of tags.

Examples

HTTP Method

POST /v3/metrics/search?target=tag

Returns

[{"name":"namespace","value":"default"},{"name":"namespace","value":"system"}]

Description

Returns all first-level tags; in this case, two namespaces.

 

 

HTTP Method

POST /v3/metrics/search?target=tag&tag=namespace:default

Returns

[{"name":"app","value":"PurchaseHistory"},
 {"name":"component","value":"gateway"},
 {"name":"dataset","value":"frequentCustomers"},
 {"name":"dataset","value":"history"},
 {"name":"dataset","value":"purchases"},
 {"name":"dataset","value":"userProfiles"}]

Description

Returns all tags of the given parent context; in this case, all entities in the default namespace.

HTTP Method

POST /v3/metrics/search?target=tag&tag=namespace:default&tag=app:PurchaseHistory&tag=run:*

Returns

[
{"name": "spark", "value": "PurchaseTracker"}
]

Description

Queries all available contexts within the PurchaseHistory application across all of its runs; in this case, a single Spark program.
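A tag search returns a flat list of name/value pairs. As a minimal sketch (the helper name group_tags is hypothetical), the response for the default namespace shown above can be grouped by tag name to see which entity types are available at the next level:

```python
import json
from collections import defaultdict


def group_tags(response_text):
    """Group the {"name", "value"} entries of a tag search by tag name."""
    grouped = defaultdict(list)
    for entry in json.loads(response_text):
        grouped[entry["name"]].append(entry["value"])
    return dict(grouped)


# Response body copied from the namespace:default example above.
sample = (
    '[{"name":"app","value":"PurchaseHistory"},'
    ' {"name":"component","value":"gateway"},'
    ' {"name":"dataset","value":"frequentCustomers"},'
    ' {"name":"dataset","value":"history"},'
    ' {"name":"dataset","value":"purchases"},'
    ' {"name":"dataset","value":"userProfiles"}]'
)

for name, values in group_tags(sample).items():
    print(name, values)
```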

Search for Metrics

To search for the available metrics within a given context, perform an HTTP POST request:

Code Block
POST /v3/metrics/search?target=metric&tag=<context>

Parameter

Description

context

Metrics context to search within. Consists of a collection of tags.

Example

HTTP Method

POST /v3/metrics/search?target=metric&tag=namespace:default&tag=app:PurchaseHistory

Returns

["system.process.events.in","system.process.events.processed","system.process.instance", "system.process.tuples.attempt.read","system.process.tuples.read"]

Description

Returns all metrics in the context of the application PurchaseHistory of the default namespace; in this case, returns a list of system and (possibly) user-defined metrics.

 

 

HTTP Method

POST /v3/metrics/search?target=metric&tag=namespace:default&tag=app:SportResults&tag=service:UploadService

Returns

["system.dataset.store.ops","system.dataset.store.reads","system.requests.count", "system.response.successful.count", "user.uploads.completed"]

Description

Returns all metrics in the context of the service UploadService of the application SportResults of the default namespace; in this case, returns a list of system and user-defined metrics.
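Because system and user metrics are distinguished only by their prefix, a metric search result can be split client-side. The following is a small sketch using the UploadService response above; the helper name split_by_scope is hypothetical.

```python
import json


def split_by_scope(metric_names):
    """Separate metric names into 'system' and 'user' lists by prefix.

    The prefix is removed from each name; names with any other prefix
    are ignored.
    """
    scopes = {"system": [], "user": []}
    for name in metric_names:
        scope, _, rest = name.partition(".")
        if scope in scopes:
            scopes[scope].append(rest)
    return scopes


# Response body copied from the UploadService example above.
sample = json.loads(
    '["system.dataset.store.ops","system.dataset.store.reads",'
    '"system.requests.count", "system.response.successful.count",'
    ' "user.uploads.completed"]'
)

print(split_by_scope(sample))
```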

Querying a Metric

Once you know the context and the metric to query, you can formulate a request for the metrics data.

...

Code Block
POST /v3/metrics/query?tag=<context>&metric=<metric>&<time-range>[&groupBy=<tags>]

Parameter

Description

context

Metrics context to search within, a collection of tags

metric

Metric(s) being queried, a collection of metric names

time-range

Either a time range, or aggregate=true for the total since the application was deployed

tags (optional)

Tag list by which to group results (optional)
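The query format above can be assembled programmatically. This is a sketch under stated assumptions: the function name build_query_path is hypothetical, and passing several metrics as repeated metric= parameters is an assumption based on the "collection of metric names" wording, mirroring how tag= and groupBy= are repeated elsewhere on this page.

```python
from urllib.parse import urlencode


def build_query_path(context, metrics, time_range=None, group_by=()):
    """Build the path and query string for a metrics query.

    context    -- ordered (tag, value) pairs
    metrics    -- metric names, including the user. or system. prefix
    time_range -- dict such as {"start": "now-1h", "end": "now"};
                  when omitted, aggregate=true is sent explicitly
    group_by   -- optional tag names to group results by
    """
    params = [("tag", f"{name}:{value}") for name, value in context]
    params += [("metric", metric) for metric in metrics]
    params += list((time_range or {"aggregate": "true"}).items())
    params += [("groupBy", tag) for tag in group_by]
    return "/v3/metrics/query?" + urlencode(params, safe=":*")


hello_world = [("namespace", "default"), ("app", "HelloWorld")]
print(build_query_path(hello_world, ["system.process.events.processed"]))
print(build_query_path(
    hello_world,
    ["user.names.bytes"],
    time_range={"start": "now-1h", "end": "now", "resolution": "1m"},
    group_by=["app"],
))
```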

Query Examples

HTTP Method

POST /v3/metrics/query?tag=namespace:default&tag=app:HelloWorld&metric=system.process.events.processed&aggregate=true

Returns

{"startTime":0,"endTime":1429327964,"series":[{"metricName":"system.process.events.processed","grouping":{},"data":[{"time":0,"value":1}]}]}

Description

Using a System metric, system.process.events.processed

 

 

HTTP Method

POST /v3/metrics/query?tag=namespace:default&tag=app:HelloWorld&tag=run:13ac3a50-a435-49c8-a752-83b3c1e1b9a8&metric=user.names.bytes&aggregate=true

Returns

{"startTime":0,"endTime":1429328212,"series":[{"metricName":"user.names.bytes","grouping":{},"data":[{"time":0,"value":8}]}]}

Description

Querying the User-defined metric names.bytes by its run-ID

 

 

HTTP Method

POST /v3/metrics/query?tag=namespace:default&tag=app:HelloWorld&metric=user.names.bytes

Returns

{"startTime":0,"endTime":1429475995,"series":[]}

Description

Using a User-defined metric, names.bytes, in a service's Handler; the query was made before any data had been entered, so it returns an empty series

 

 

HTTP Method

POST /v3/metrics/query?tag=namespace:default&tag=app:HelloWorld&metric=user.names.bytes

Returns

{"startTime":0,"endTime":1429477901,"series":[{"metricName":"user.names.bytes","grouping":{},"data":[{"time":0,"value":44}]}]}

Description

Using a User-defined metric, names.bytes in a service's Handler

Query Results

Results from a query are returned as a JSON string, in the format:

Code Block
{"startTime":<start-time>, "endTime":<end-time>, "series":<series-array>}

Name

Description

start-time

Start time, in seconds, with 0 being from the beginning of the query records

end-time

End time, in seconds

series-array

An array of metric results, which can be a single time series, multiple time series, or none (an empty array)
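As an illustration of the result format, the following sketch flattens a response into one tuple per series. The helper name read_series is hypothetical; the field names (metricName, grouping, data, time, value) are taken from the example responses on this page.

```python
import json


def read_series(response_text):
    """Return (metric name, grouping, [(time, value), ...]) per series."""
    result = json.loads(response_text)
    return [
        (
            series["metricName"],
            series["grouping"],
            [(point["time"], point["value"]) for point in series["data"]],
        )
        for series in result["series"]
    ]


# Response bodies copied from the query examples above.
with_data = (
    '{"startTime":0,"endTime":1429477901,"series":[{"metricName":'
    '"user.names.bytes","grouping":{},"data":[{"time":0,"value":44}]}]}'
)
no_data = '{"startTime":0,"endTime":1429475995,"series":[]}'

print(read_series(with_data))
print(read_series(no_data))
```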

If a particular metric has no value, a query will return an empty array in the "series" of the results, such as:

...

In a query, the optional groupBy parameter defines a list of tags whose values are used to build multiple time series. All data points that have the same values in tags specified in the groupBy parameter will form a single time series. You can define multiple tags for grouping by providing a list, similar to a tag combination list.

Tag List

Description

groupBy=app

Retrieves the time series for each application

groupBy=app&groupBy=spark

Retrieves a time series for each App and Spark combination.
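The grouping rule can be illustrated client-side. This sketch is not the server's implementation: the function name group_points, the shape of the input points, and the sample tag values are all hypothetical, and summing values that share a timestamp within a group is an assumption made only to keep the example small.

```python
from collections import defaultdict


def group_points(points, group_by):
    """Form one time series per distinct combination of group_by tag values."""
    series = defaultdict(lambda: defaultdict(int))
    for point in points:
        key = tuple(point["tags"][tag] for tag in group_by)
        # Assumption: values with the same timestamp in a group are summed.
        series[key][point["time"]] += point["value"]
    return {key: sorted(by_time.items()) for key, by_time in series.items()}


# Hypothetical data points carrying app and spark tags.
points = [
    {"tags": {"app": "A", "spark": "S1"}, "time": 1, "value": 2},
    {"tags": {"app": "A", "spark": "S2"}, "time": 1, "value": 3},
    {"tags": {"app": "B", "spark": "S1"}, "time": 1, "value": 5},
    {"tags": {"app": "A", "spark": "S1"}, "time": 2, "value": 4},
]

print(group_points(points, ["app"]))           # like groupBy=app
print(group_points(points, ["app", "spark"]))  # like groupBy=app&groupBy=spark
```

Grouping by app alone produces two series here, while grouping by app and spark produces three, one per combination that actually occurs in the data.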

An example method (re-formatted to fit):

...

By default, queries without a time range retrieve a value based on aggregate=true.

Parameter

Description

aggregate=true

Total aggregated value for the metric since the application was deployed. If the metric is a gauge type, the aggregate will return the latest value set for the metric.

start=<time>&end=<time>

Time range defined by start and end times, where the times are either in seconds since the start of the Epoch, or a relative time, using now and times added to it.

count=<count>

Number of time intervals since start with length of time interval defined by resolution. If count=60 and resolution=1s, the time range would be 60 seconds in length.

resolution=[1s|1m|1h|auto]

Time resolution in seconds, minutes or hours; or if "auto", one of {1s, 1m, 1h} is used based on the time difference.

With a specific time range, a resolution can be included to retrieve a series of data points for a metric. By default, 1 second resolution is used. Acceptable values are noted above. If resolution=auto, the resolution will be determined based on a time difference calculated between the start and end times:

  • (endTime - startTime) > 36000 seconds (ten hours), resolution will be 1 hour;

  • (endTime - startTime) >  600 seconds (ten minutes), resolution will be 1 minute;

  • otherwise, resolution will be 1 second.
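The rules for resolution=auto listed above can be written as a short function. This is a sketch of the documented thresholds only; the function name auto_resolution is hypothetical.

```python
def auto_resolution(start_time, end_time):
    """Pick the resolution that resolution=auto would use.

    Both arguments are in seconds since the start of the Epoch.
    """
    span = end_time - start_time
    if span > 36000:  # more than ten hours
        return "1h"
    if span > 600:    # more than ten minutes
        return "1m"
    return "1s"


print(auto_resolution(1385625600, 1385629200))  # a one-hour range
```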

Time Range

Description

start=now-30s&end=now

The last 30 seconds. The start time is given in seconds relative to the current time. You can apply simple math, using now for the current time, s for seconds, m for minutes, h for hours and d for days. For example: now-5d-12h is 5 days and 12 hours ago.

start=1385625600&end=1385629200

From Thu, 28 Nov 2013 08:00:00 GMT to Thu, 28 Nov 2013 09:00:00 GMT, both given as seconds since the start of the Epoch.

start=1385625600&count=3600&resolution=1s

The same as before, the count given as a number of time intervals, each 1 second.

start=1385625600&end=1385629200&resolution=1m

From Thu, 28 Nov 2013 08:00:00 GMT to Thu, 28 Nov 2013 09:00:00 GMT, with 1 minute resolution, will return 61 data points with metrics aggregated for each minute.

start=1385625600&end=1385632800&resolution=1h

From Thu, 28 Nov 2013 08:00:00 GMT to Thu, 28 Nov 2013 10:00:00 GMT, with 1 hour resolution, will return 3 data points with metrics aggregated for each hour.
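The time arithmetic in the table above can be checked with a short sketch. Both helper names are hypothetical. Accepting "+" as well as "-" in relative times is an assumption, and the point-count formula (both endpoints included) is inferred from the 61 and 3 data points quoted in the table rather than from a stated rule.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def resolve_time(expr, now):
    """Resolve an Epoch-seconds string or a relative time such as now-5d-12h."""
    if expr.isdigit():
        return int(expr)
    match = re.fullmatch(r"now((?:[+-]\d+[smhd])*)", expr)
    if match is None:
        raise ValueError(f"unsupported time expression: {expr!r}")
    offset = 0
    for sign, amount, unit in re.findall(r"([+-])(\d+)([smhd])", match.group(1)):
        delta = int(amount) * UNIT_SECONDS[unit]
        offset += delta if sign == "+" else -delta
    return now + offset


def expected_points(start, end, resolution_seconds):
    """Number of data points when both range endpoints are included."""
    return (end - start) // resolution_seconds + 1


now = 1385629200  # Thu, 28 Nov 2013 09:00:00 GMT
print(resolve_time("now-30s", now))
print(expected_points(1385625600, 1385629200, 60))
print(expected_points(1385625600, 1385632800, 3600))
```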

Example:

Code Block
POST /v3/metrics/query?tag=namespace:default&tag=app:CountRandom&
  metric=system.process.events.processed&start=now-1h&end=now&resolution=1m

...