Metrics Microservices

Use the CDAP Metrics Microservices to retrieve the metrics created and saved by CDAP.

As applications process data, CDAP collects metrics about the application’s behavior and performance. Some of these metrics are similar for every application, such as how many events are processed and how many data operations are performed, and are called system or CDAP metrics.

Other metrics are user-defined and differ from application to application.

All methods or endpoints described in this API have a base URL (typically http://<host>:11015 or https://<host>:10443) that precedes the resource identifier, as described in the Microservices Conventions. These methods return a status code, as listed in the Microservices Status Codes.

Metrics Data

Metrics data is identified by a combination of context and name.

A metrics context consists of a collection of tags. Each tag is composed of a tag name and a tag value.

Metrics contexts are hierarchical: they are rooted in the CDAP instance and extend through namespaces and applications down to the individual components.

For example, the metrics context:

namespace:default app:PurchaseHistory spark:PurchaseTracker

is a context that identifies a Spark program. It has a parent context, namespace:default app:PurchaseHistory, which identifies the parent application.

Each level of the context is described by a pair, composed of a tag name and a value, such as:

  • namespace:default (tag name: namespace, value: default)

  • app:PurchaseHistory (tag name: app, value: PurchaseHistory)

  • spark:PurchaseTracker (tag name: spark, value: PurchaseTracker)
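As a minimal sketch of how these tag pairs compose a context, the following Python splits a context string into (tag name, value) pairs and derives the parent context by dropping the last pair. The helper names parse_context and parent_context are illustrative only and are not part of CDAP.

```python
def parse_context(context):
    """Split a context such as 'namespace:default app:PurchaseHistory'
    into an ordered list of (tag name, value) pairs."""
    pairs = []
    for token in context.split():
        # Split on the first colon only, so a value may itself contain colons.
        name, _, value = token.partition(":")
        pairs.append((name, value))
    return pairs


def parent_context(context):
    """Return the enclosing context by dropping the last tag pair."""
    pairs = parse_context(context)[:-1]
    return " ".join(f"{name}:{value}" for name, value in pairs)


context = "namespace:default app:PurchaseHistory spark:PurchaseTracker"
print(parse_context(context))
print(parent_context(context))
```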

A metric name is either a name generated by CDAP, which is prefixed with system, or a name set by a developer when writing an application, which is prefixed with user.

The system metrics vary depending on the context; a list is available of common system metrics for different contexts.

User metrics are defined by the application developer and thus are completely dependent on what the developer sets.

In both cases, searches using this API show, for a given context, all available metrics.

Available Contexts

The context of a metric is typically enclosed into a hierarchy of contexts. For example, the Spark context is enclosed in the application context, which in turn is enclosed in the namespace context. A metric can always be queried (and aggregated) relative to any enclosing context.

System Metric

Context

All Mappers of a MapReduce

namespace:<namespace-id> app:<app-id> mapreduce:<mapreduce-id> tasktype:m

All Reducers of a MapReduce

namespace:<namespace-id> app:<app-id> mapreduce:<mapreduce-id> tasktype:r

One Run of a MapReduce

namespace:<namespace-id> app:<app-id> mapreduce:<mapreduce-id> run:<run-id>

One MapReduce

namespace:<namespace-id> app:<app-id> mapreduce:<mapreduce-id>

All MapReduce of an application

namespace:<namespace-id> app:<app-id> mapreduce:*

One service

namespace:<namespace-id> app:<app-id> service:<service-id>

All services of an application

namespace:<namespace-id> app:<app-id> service:*

One Spark program

namespace:<namespace-id> app:<app-id> spark:<spark-id>

All Spark programs of an application

namespace:<namespace-id> app:<app-id> spark:*

One worker

namespace:<namespace-id> app:<app-id> worker:<worker-id>

All workers of an application

namespace:<namespace-id> app:<app-id> worker:*

All components of an application

namespace:<namespace-id> app:<app-id>

All components of all applications

namespace:<namespace-id> app:*

Dataset metrics are available at the dataset level, but they can also be queried down to the worker, service, Mapper, or Reducer level:

Dataset Metric

Context

A single dataset in the context of a specific application

namespace:<namespace-id> dataset:<dataset-id> app:<app-id>

A single dataset

namespace:<namespace-id> dataset:<dataset-id>

All datasets

namespace:<namespace-id> dataset:*

Available System Metrics

Note: A user metric may have the same name as a system metric. They are distinguished by prepending the respective prefix when querying: user or system.

Dataset Metrics

These metrics are available in a dataset context:

Dataset Metric

Description

system.dataset.store.bytes

Number of bytes written

system.dataset.store.ops

Operations (reads and writes) performed

system.dataset.store.reads

Read operations performed

system.dataset.store.writes

Write operations performed

Mappers or Reducers Metrics

These metrics are available in a Mappers or Reducers context (specify whether a Mapper or Reducer context is desired, as shown above):

Mappers or Reducers Metrics

Description

system.process.completion

A number from 0 to 100 indicating the progress of the Map or Reduce phase

system.process.entries.in

Number of entries read in by the Map or Reduce phase

system.process.entries.out

Number of entries written out by the Map or Reduce phase

Service Metrics

These metrics are available in a service context:

Service Metrics

Description

system.requests.count

Number of requests made to the service

system.response.successful.count

Number of successful requests completed by the service

system.response.server.error.count

Number of failures seen by the service

Spark Metrics

These metrics are available in a Spark context, where <spark-id> depends on the Spark program being queried:

Spark Metrics

Description

system.<spark-id>.driver.BlockManager.disk.diskSpaceUsed_MB

Disk space used by the Block Manager

system.<spark-id>.driver.BlockManager.memory.maxMem_MB

Maximum memory given to the Block Manager

system.<spark-id>.driver.BlockManager.memory.memUsed_MB

Memory used by the Block Manager

system.<spark-id>.driver.BlockManager.memory.remainingMem_MB

Memory remaining to the Block Manager

system.<spark-id>.driver.DAGScheduler.job.activeJobs

Number of active jobs

system.<spark-id>.driver.DAGScheduler.job.allJobs

Total number of jobs

system.<spark-id>.driver.DAGScheduler.stage.failedStages

Number of failed stages

system.<spark-id>.driver.DAGScheduler.stage.runningStages

Number of running stages

system.<spark-id>.driver.DAGScheduler.stage.waitingStages

Number of waiting stages

Request and Response Metrics

These metrics are available for services, for the system services component context or the user services context:

Request and Response Metrics

Description

system.request.received

Number of requests received for the service

system.response.successful

Number of successful responses sent

system.response.{server-error, client-error}

Number of server-error or client-error responses sent

Application Logging Metrics

These metrics are available for every application context:

Application Logging Metrics

Description

system.app.log.{error, info, warn}

Number of error, info, or warn log messages logged by an application or applications

System Services Logging Metrics

These logging metrics are available for system services, in the system component context:

System Services Logging Metrics

Description

system.services.log.{error, info, warn}

Number of error, info, or warn log messages logged by a system service or system services

System Services Metric Processor Metrics

These processing metrics are available for system services, in the system component context:

System Services Metric Processor Metrics

Description

metrics.<metric.processor.id>.process.count

Number of metrics processed by the metric processor instance

metrics.<metric.processor.id>.process.delay.ms

Metrics processing delay in milliseconds: the difference between the last metric's timestamp and the current time

Transaction Metrics

These metrics are available for the CDAP transaction service:

Transaction Metrics

Description

system.start.{short, long}

Number of short or long transactions started

system.start.{short, long}.latency

Time taken (in milliseconds) to start short or long transactions

system.wal.append.count

Number of transaction edits added to the write-ahead log

system.{canCommit, commit, committed, inprogress, invalidate, abort}

Number of transactions in a specified transaction state

system.{canCommit, commit, committed, inprogress, invalidate, abort}.latency

Time taken (in milliseconds) to perform a specified transaction state update

system.{invalid, committing, committed, inprogress}.size

Number of transactions of a specified type that are active

Transactional Messaging System Metrics

These metrics are available for the CDAP transactional messaging service:

Transactional Messaging System Metrics

Description

system.persist.requested

Number of message persist requests

system.persist.success

Number of message persist requests that succeeded

system.persist.failure

Number of message persist requests that failed

system.persist.queue.size

Number of messages in the queue that are persisted in one batch

system.cache.add.requests

Number of entries requested to add to the messaging cache

system.cache.entries.added

Number of entries added to the messaging cache

system.cache.entries.removed

Number of entries removed from the messaging cache

system.cache.add.reduce.weight

Number of times that the cache reduce weight logic was executed while adding entries to the cache. This number ideally should be very small for the cache to have good performance.

system.cache.scan.reduce.weight

Number of times that the cache reduce weight logic was executed while scanning the cache. This number ideally should be relatively small and steady over time.

system.cache.scan.requests

Number of scan requests on the messaging cache

system.cache.weight

The current weight of the cache, measured in bytes

YARN Cluster Metrics

These metrics are available for the YARN cluster resources:

YARN Cluster Metrics

Description

system.resources.{total, available, used}.memory

Size (in megabytes) of total, available, or used cluster memory

system.resources.{total, available, used}.vcores

Number of total, available, or used cluster virtual cores

CDAP Program Metrics

These metrics are available for CDAP programs:

Program Metrics

Version Introduced

Description

system.program.provisioning.delay.seconds

 

Measures time taken by program to transition from provisioning to starting

system.program.starting.delay.seconds

6.6.0

Measures time taken by program to transition from provisioning to running

system.program.run.seconds

 

Measures time taken by program to transition from provisioning to any complete state

system.program.completed.runs

 

Number of successful program runs

system.program.failed.runs

 

Number of failed program runs

system.program.killed.runs

 

Number of killed program runs

system.program.rejected.runs

 

Number of rejected program runs

system.flowcontrol.launching.count

6.6.0

Number of top-level program launching requests in the system

system.flowcontrol.running.count

6.6.0

Number of running top-level programs in the system

User Metrics

These metrics are available for pipeline connections:

Pipeline Connection Metrics

Pipeline Connection Metrics

Version Introduced

Description

users.connections.count

6.7.0

Number of create connection requests.

users.connections.deleted.count

6.7.0

Number of delete connection requests.

users.connections.get.count

6.7.0

Number of get connection requests.

users.connections.browse.count

6.7.0

Number of browse connection requests.

users.connections.sample.count

6.7.0

Number of sample connection requests.

users.connections.spec.count

6.7.0

Number of specification connection requests.

users.upload.file.count

6.7.0

Number of upload file connection requests.

Query Tips

Global Count Example

POST /v3/metrics/query?target=metric&metric=user.connections.count

Group by Connection Type Example

POST /v3/metrics/query?target=metric&metric=user.connections.count&groupBy=tpe

Searches and Queries

The process of retrieving a metric involves these steps:

  1. Obtain (usually through a search) the correct context for a metric;

  2. Obtain (usually through a search within the context) the available metrics;

  3. Query for a specific metric, supplying the context and any parameters.

Search for Contexts

To search for the available contexts, perform an HTTP request:

POST /v3/metrics/search?target=tag[&tag=<context>]

The optional <context> defines a metrics context to search within. If it is not provided, the search is performed across all data. The available contexts that are returned can be used to query for lower-level contexts.

You can also define the query to search in a given context across all values of one or more tags provided in the context by specifying * as a value for a tag. See the examples below for its use.
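The request path for such a search can be assembled as in the sketch below. It only builds the path and query string; sending the POST request to the base URL is left out. The function name search_path is an illustrative assumption, not a CDAP client API.

```python
from urllib.parse import urlencode


def search_path(target, tags=()):
    """Build the path and query string for a metrics search.

    target is 'tag' (search for contexts) or 'metric' (search for metrics);
    tags is a sequence of 'name:value' strings, where value may be '*'.
    """
    params = [("target", target)] + [("tag", tag) for tag in tags]
    # Keep ':' and '*' unescaped so the result reads like the requests shown here.
    return "/v3/metrics/search?" + urlencode(params, safe=":*")


print(search_path("tag"))
print(search_path("tag", ["namespace:default", "app:PurchaseHistory", "run:*"]))
```

Each tag in the context becomes its own tag parameter, and a * value widens the search to all values of that tag.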

Parameter

Description

context [Optional]

Metrics context to search within. If not provided, the search is performed across all contexts. Consists of a collection of tags.

Examples

HTTP Method

POST /v3/metrics/search?target=tag

Returns

[{"name":"namespace","value":"default"},{"name":"namespace","value":"system"}]

Description

Returns all first-level tags; in this case, two namespaces.

 

 

HTTP Method

POST /v3/metrics/search?target=tag&tag=namespace:default

Returns

[{"name":"app","value":"PurchaseHistory"},
 {"name":"component","value":"gateway"},
 {"name":"dataset","value":"frequentCustomers"},
 {"name":"dataset","value":"history"},
 {"name":"dataset","value":"purchases"},
 {"name":"dataset","value":"userProfiles"}]

Description

Returns all tags of the given parent context; in this case, all entities in the default namespace.

 

 

HTTP Method

POST /v3/metrics/search?target=tag&tag=namespace:default&tag=app:PurchaseHistory&tag=run:*

Returns

[
{"name": "spark", "value": "PurchaseTracker"}
]

Description

Queries all available contexts within the PurchaseHistory application for any run.
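Because each search returns only the next level of tags, a client typically appends each returned tag to the context it searched within. The following sketch shows one way to do that with a response body like the ones above; the function name contexts_from_search is an illustrative assumption.

```python
import json


def contexts_from_search(response_text, parent=""):
    """Turn a tag-search response into full context strings by
    appending each returned tag to the parent context."""
    tags = json.loads(response_text)
    return [f"{parent} {t['name']}:{t['value']}".strip() for t in tags]


first_level = '[{"name":"namespace","value":"default"},{"name":"namespace","value":"system"}]'
print(contexts_from_search(first_level))

second_level = '[{"name":"app","value":"PurchaseHistory"}]'
print(contexts_from_search(second_level, parent="namespace:default"))
```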

Search for Metrics

To search for the available metrics within a given context, perform an HTTP POST request:

POST /v3/metrics/search?target=metric&tag=<context>

Parameter

Description

context

Metrics context to search within. Consists of a collection of tags.

Example

HTTP Method

POST /v3/metrics/search?target=metric&tag=namespace:default&tag=app:PurchaseHistory

Returns

["system.process.events.in","system.process.events.processed","system.process.instance", "system.process.tuples.attempt.read","system.process.tuples.read"]

Description

Returns all metrics in the context of the application PurchaseHistory of the default namespace; in this case, returns a list of system and (possibly) user-defined metrics.

 

 

HTTP Method

POST /v3/metrics/search?target=metric&tag=namespace:default&tag=app:SportResults&tag=service:UploadService

Returns

["system.dataset.store.ops","system.dataset.store.reads","system.requests.count", "system.response.successful.count", "user.uploads.completed"]

Description

Returns all metrics in the context of the service UploadService of the application SportResults of the default namespace; in this case, returns a list of system and user-defined metrics.

Querying a Metric

Once you know the context and the metric to query, you can formulate a request for the metrics data.

In general, a metrics query is performed by making an HTTP POST request, with parameters supplied either in the URL or in the body of the request. If you submit the parameters in the body, you can make multiple queries with a single request.

Metric parameters include:

  • tag values for filtering by context;

  • metric names (multiple metric names can be queried in each request);

  • time range or aggregate=true for an aggregated result; and

  • tag values for grouping results (optional)

To query a metric within a given context, perform an HTTP POST request:

Parameter

Description

context

Metrics context to search within, a collection of tags

metric

Metric(s) being queried, a collection of metric names

time-range

A time range, or aggregate=true for the total since the application was deployed

tags (optional)

Tag list by which to group results (optional)
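The parameters above can be combined into a request path as in this sketch. It builds the query string only and does not send the request; the function name query_path and its keyword-argument convention for the time range are assumptions made for illustration.

```python
from urllib.parse import urlencode


def query_path(tags, metrics, group_by=(), **time_range):
    """Build the path and query string for a metrics query.

    tags:       'name:value' strings that filter by context
    metrics:    metric names, including the system. or user. prefix
    group_by:   tag names by which to group results (optional)
    time_range: for example aggregate='true', or start='now-30s', end='now'
    """
    params = [("tag", t) for t in tags]
    params += [("metric", m) for m in metrics]
    params += [("groupBy", g) for g in group_by]
    params += list(time_range.items())
    # Keep ':' and '*' unescaped so the result reads like the requests shown here.
    return "/v3/metrics/query?" + urlencode(params, safe=":*")


print(query_path(
    ["namespace:default", "app:HelloWorld"],
    ["system.process.events.processed"],
    aggregate="true",
))
```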

Query Examples

HTTP Method

POST /v3/metrics/query?tag=namespace:default&tag=app:HelloWorld&metric=system.process.events.processed&aggregate=true

Returns

{"startTime":0,"endTime":1429327964,"series":[{"metricName":"system.process.events.processed","grouping":{},"data":[{"time":0,"value":1}]}]}

Description

Using a System metric, system.process.events.processed

 

 

HTTP Method

POST /v3/metrics/query?tag=namespace:default&tag=app:HelloWorld&tag=run:13ac3a50-a435-49c8-a752-83b3c1e1b9a8&metric=user.names.bytes&aggregate=true

Returns

{"startTime":0,"endTime":1429328212,"series":[{"metricName":"user.names.bytes","grouping":{},"data":[{"time":0,"value":8}]}]}

Description

Querying the User-defined metric names.bytes by its run-ID

 

 

HTTP Method

POST /v3/metrics/query?tag=namespace:default&tag=app:HelloWorld&metric=user.names.bytes

Returns

{"startTime":0,"endTime":1429475995,"series":[]}

Description

Using a User-defined metric, names.bytes in a service's Handler, called before any data entered, returning an empty series

 

 

HTTP Method

POST /v3/metrics/query?tag=namespace:default&tag=app:HelloWorld&metric=user.names.bytes

Returns

{"startTime":0,"endTime":1429477901,"series":[{"metricName":"user.names.bytes","grouping":{},"data":[{"time":0,"value":44}]}]}

Description

Using a User-defined metric, names.bytes in a service's Handler

Query Results

Results from a query are returned as a JSON string, in the format:

Name

Description

start-time

Start time, in seconds, with 0 being from the beginning of the query records

end-time

End time, in seconds

series-array

An array of metric results, which can be one series, a multiple time series, or none (an empty array)

If a particular metric has no value, a query will return an empty array in the "series" of the results, such as {"startTime":0,"endTime":1429475995,"series":[]}.

You can also receive such a result from querying a metric that does not exist, either because it does not exist at the context given or because the query is incorrectly formulated. For example, a query that appends ?aggregate=true to the metric name (instead of &aggregate=true) will return the empty result, as the metric name will be interpreted as "user.names.bytes?aggregate=true" instead of "user.names.bytes".
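A client reading these results needs to handle both a populated and an empty series array. The sketch below parses a result body shaped like the ones returned in the query examples; the function name read_series is an illustrative assumption.

```python
import json


def read_series(result_text):
    """Return one (metric name, grouping, [(time, value), ...]) tuple
    per series in a query result; an empty series array gives []."""
    result = json.loads(result_text)
    return [
        (s["metricName"], s["grouping"], [(p["time"], p["value"]) for p in s["data"]])
        for s in result["series"]
    ]


sample = (
    '{"startTime":0,"endTime":1429327964,"series":[{"metricName":'
    '"system.process.events.processed","grouping":{},'
    '"data":[{"time":0,"value":1}]}]}'
)
empty = '{"startTime":0,"endTime":1429475995,"series":[]}'

print(read_series(sample))
print(read_series(empty))
```

Since an empty series can mean either "no value yet" or "no such metric", a caller that gets [] may want to confirm the metric name with a metrics search first.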

Querying for Multiple Metrics

Retrieving multiple metrics at once can be accomplished by issuing an HTTP POST request with a JSON list as the request body that enumerates the name and attributes for each metric. The format of the request and the JSON body depends on whether the metrics share the same context or are being called for different contexts.

Multiple Metrics with the Same Context

Retrieving multiple metrics at once for the same contexts can be accomplished by issuing a request as in previous examples, but providing the additional metrics. For example:

The result (pretty-printed to fit) would be:

Multiple Metrics with Different Contexts

Retrieving multiple metrics at once for different contexts can be accomplished by issuing a request with a JSON list as the request body that enumerates the name, attributes and context for each metric. Use an HTTP POST request:

with the arguments as a JSON string in the body. The format of the JSON follows this structure (pretty-printed):

Queries are identified by a <query-id> (in the example above, query1 and query2; in the example below, eventsIn and eventsOut). The <query-id> is then used in the returned result to identify the series.

For example, to retrieve multiple metrics using a curl call (command and results reformatted to fit):

If the context of the requested metric or metric itself doesn't exist, the system returns a status 200 (OK) with JSON formed following the above description, with an empty series for values:

Querying for Multiple Time Series

In a query, the optional groupBy parameter defines a list of tags whose values are used to build multiple time series. All data points that have the same values in tags specified in the groupBy parameter will form a single time series. You can define multiple tags for grouping by providing a list, similar to a tag combination list.

Tag List

Description

groupBy=app

Retrieves the time series for each application

groupBy=app&groupBy=spark

Retrieves a time series for each App and Spark combination.
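To illustrate the grouping rule, the following sketch groups a handful of in-memory data points by the values of the groupBy tags. It is a client-side illustration of the semantics only, not how CDAP implements grouping, and it does not model aggregation of points that share a timestamp. The sample data, including the SportResults Spark program name ScoreCounter, is hypothetical.

```python
from collections import defaultdict


def group_points(points, group_by):
    """Group (tags, time, value) data points into one series per
    distinct combination of the group_by tag values."""
    series = defaultdict(list)
    for tags, time, value in points:
        key = tuple(tags.get(name) for name in group_by)
        series[key].append((time, value))
    return dict(series)


# Hypothetical data points, each carrying its full set of tags.
points = [
    ({"app": "PurchaseHistory", "spark": "PurchaseTracker"}, 1, 5),
    ({"app": "PurchaseHistory", "spark": "PurchaseTracker"}, 2, 7),
    ({"app": "SportResults", "spark": "ScoreCounter"}, 1, 3),
]

print(group_points(points, ["app"]))
print(group_points(points, ["app", "spark"]))
```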

An example method (re-formatted to fit):

returns the user.customers.count metric in the context of the application PurchaseHistory of the default namespace, for the specified time range, and grouped by spark: PurchaseHistoryTracker (results reformatted to fit):

Querying by a Time Range

The time range of a metric query can be specified in various ways: either aggregate=true to retrieve the total aggregated since the application was deployed or, in the case of dataset metrics, since a dataset was created; or as a start and end to define a specific range and return a series of data points.

By default, queries without a time range retrieve a value based on aggregate=true.

Parameter

Description

aggregate=true

Total aggregated value for the metric since the application was deployed. If the metric is a gauge type, the aggregate will return the latest value set for the metric.

start=<time>&end=<time>

Time range defined by start and end times, where the times are either in seconds since the start of the Epoch, or a relative time, using now and times added to it.

count=<count>

Number of time intervals since start with length of time interval defined by resolution. If count=60 and resolution=1s, the time range would be 60 seconds in length.

resolution=[1s|1m|1h|auto]

Time resolution in seconds, minutes or hours; or if "auto", one of {1s, 1m, 1h} is used based on the time difference.

With a specific time range, a resolution can be included to retrieve a series of data points for a metric. By default, 1 second resolution is used. Acceptable values are noted above. If resolution=auto, the resolution will be determined based on a time difference calculated between the start and end times:

  • (endTime - startTime) > 36000 seconds (ten hours), resolution will be 1 hour;

  • (endTime - startTime) >  600 seconds (ten minutes), resolution will be 1 minute;

  • otherwise, resolution will be 1 second.
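The selection rule for resolution=auto can be written out as a small function. This is a sketch of the thresholds listed above, with start and end given in seconds; the function name auto_resolution is an illustrative assumption.

```python
def auto_resolution(start_time, end_time):
    """Pick the resolution used for resolution=auto from the
    difference between the end and start times, in seconds."""
    span = end_time - start_time
    if span > 36000:  # more than ten hours
        return "1h"
    if span > 600:  # more than ten minutes
        return "1m"
    return "1s"


print(auto_resolution(1385625600, 1385629200))  # a one-hour range
```

Note that the comparisons are strict, so a range of exactly ten hours still uses 1 minute resolution and a range of exactly ten minutes still uses 1 second resolution.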

Time Range

Description

start=now-30s&end=now

The last 30 seconds. The start time is given in seconds relative to the current time. You can apply simple math, using now for the current time, s for seconds, m for minutes, h for hours and d for days. For example: now-5d-12h is 5 days and 12 hours ago.

start=1385625600&end=1385629200

From Thu, 28 Nov 2013 08:00:00 GMT to Thu, 28 Nov 2013 09:00:00 GMT, both given as seconds since the start of the Epoch.

start=1385625600&count=3600&resolution=1s

The same range as the previous entry, with the length given as a count of 3600 time intervals of 1 second each.

start=1385625600&end=1385629200&resolution=1m

From Thu, 28 Nov 2013 08:00:00 GMT to Thu, 28 Nov 2013 09:00:00 GMT, with 1 minute resolution. Returns 61 data points with metrics aggregated for each minute.

start=1385625600&end=1385632800&resolution=1h

From Thu, 28 Nov 2013 08:00:00 GMT to Thu, 28 Nov 2013 10:00:00 GMT, with 1 hour resolution. Returns 3 data points with metrics aggregated for each hour.
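The relative time syntax and the data point counts in the table can be checked with a short sketch. The parser below is an assumption about the grammar based on the examples given (now followed by signed terms with the units s, m, h, and d); CDAP's own parser may accept more or less than this. The helper names resolve_time and data_points are illustrative.

```python
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def resolve_time(expr, now):
    """Resolve 'now-5d-12h' style expressions, or plain epoch seconds,
    to seconds since the start of the Epoch."""
    if not expr.startswith("now"):
        return int(expr)
    rest = expr[len("now"):]
    terms = re.findall(r"([+-])(\d+)([smhd])", rest)
    # Reject input containing anything the pattern did not consume.
    if "".join(sign + amount + unit for sign, amount, unit in terms) != rest:
        raise ValueError(f"cannot parse time expression: {expr!r}")
    seconds = now
    for sign, amount, unit in terms:
        delta = int(amount) * UNIT_SECONDS[unit]
        seconds += delta if sign == "+" else -delta
    return seconds


def data_points(start, end, resolution_seconds):
    """Number of points when both ends of the range are included."""
    return (end - start) // resolution_seconds + 1


print(resolve_time("now-30s", now=1385629200))
print(data_points(1385625600, 1385629200, 60))
print(data_points(1385625600, 1385632800, 3600))
```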

Example:

This will return the value of the metric system.process.events.processed for the last hour at one-minute intervals.

For aggregates, you cannot specify a time range. As an example, to return the total number of input objects processed since the application CountRandom was deployed, assuming that CDAP has not been stopped or restarted:

If a metric is a gauge type, the aggregate will return the latest value set for the metric. For example, this request will retrieve the completion percentage for the map-stage of the MapReduce PurchaseHistoryBuilder (reformatted to fit):

Querying by Run-ID

Each execution of a program (MapReduce, Spark, service, worker) has an associated run-ID that uniquely identifies that program's run. You can query metrics for a program by its run-ID to retrieve the metrics for a particular run. See Run Records and Schedule for details on retrieving active and historical program runs.

When querying by run-ID, it is specified in the context—in the collection of tags—after the program-id with the tag run:

Examples of using a run-ID (with both commands and results reformatted to fit):

The last example will return (where "time"=0 means aggregated total number, and endTime is the time of the query) something similar to:

Query Tips

  • User-defined metrics are always prefixed with the word user, and must be queried by using that prefix with the metric name.

    For example, to request the user-defined metric uploads.completed for the SportResults application's UploadService:
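    One way to form such a request path is sketched below in Python. The context tags and metric name come from the search example for the SportResults application; the helper name user_metric and the choice of aggregate=true are assumptions made for illustration.

```python
from urllib.parse import urlencode


def user_metric(name):
    """Return the metric name with the 'user.' prefix, adding it if missing."""
    return name if name.startswith("user.") else "user." + name


params = [
    ("tag", "namespace:default"),
    ("tag", "app:SportResults"),
    ("tag", "service:UploadService"),
    ("metric", user_metric("uploads.completed")),
    ("aggregate", "true"),
]
# Keep ':' unescaped so the tags read as name:value pairs.
path = "/v3/metrics/query?" + urlencode(params, safe=":")
print(path)
```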

 

Created in 2020 by Google Inc.