Metrics Microservices
Use the CDAP Metrics Microservices to retrieve the metrics created and saved by CDAP.
As applications process data, CDAP collects metrics about the application’s behavior and performance. Some of these metrics are similar for every application, such as how many events are processed and how many data operations are performed, and are called system or CDAP metrics.
Other metrics are user-defined and differ from application to application.
All methods or endpoints described in this API have a base URL (typically http://<host>:11015 or https://<host>:10443) that precedes the resource identifier, as described in the Microservices Conventions. These methods return a status code, as listed in the Microservices Status Codes.
Metrics Data
Metrics data is identified by a combination of context and name.
A metrics context consists of a collection of tags. Each tag is composed of a tag name and a tag value.
Metrics contexts are hierarchical, rooted in the CDAP instance, and extend through namespaces, applications, and down to the individual components.
For example, the metrics context:
namespace:default app:PurchaseHistory spark:PurchaseTracker
is a context that identifies a Spark program. It has a parent context, namespace:default app:PurchaseHistory, which identifies the parent application.
Each level of the context is described by a pair, composed of a tag name and a value, such as:
namespace:default (tag name: namespace, value: default)
app:PurchaseHistory (tag name: app, value: PurchaseHistory)
spark:PurchaseTracker (tag name: spark, value: PurchaseTracker)
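The tag structure described above can be illustrated with a minimal Python sketch. The helper names `parse_context` and `parent_context` are hypothetical and are not part of CDAP; the sketch only assumes that a context is written as space-separated `name:value` tags, as in the example above.

```python
def parse_context(context: str) -> list[tuple[str, str]]:
    """Split 'name:value name:value ...' into ordered (tag name, value) pairs."""
    pairs = []
    for token in context.split():
        # Split on the first colon only, so values may themselves contain colons.
        name, _, value = token.partition(":")
        pairs.append((name, value))
    return pairs


def parent_context(context: str) -> str:
    """Drop the last (most specific) tag to obtain the enclosing context."""
    return " ".join(context.split()[:-1])


context = "namespace:default app:PurchaseHistory spark:PurchaseTracker"
print(parse_context(context))
print(parent_context(context))
```

Applied to the Spark context above, `parent_context` yields the application context, which matches the parent relationship described in the text.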
A metrics name is either a name generated by CDAP and prepended with system, or a name set by a developer when writing an application and prepended with user.
The system metrics vary depending on the context; a list is available of common system metrics for different contexts.
User metrics are defined by the application developer and thus are completely dependent on what the developer sets.
In both cases, searches using this API show, for a given context, all available metrics.
Available Contexts
The context of a metric is typically enclosed in a hierarchy of contexts. For example, the Spark context is enclosed in the application context, which in turn is enclosed in the namespace context. A metric can always be queried (and aggregated) relative to any enclosing context.
System Metric | Context |
---|---|
All Mappers of a MapReduce |
|
All Reducers of a MapReduce |
|
One Run of a MapReduce |
|
One MapReduce |
|
All MapReduce of an application |
|
One service |
|
All services of an application |
|
One Spark program |
|
All Spark programs of an application |
|
One worker |
|
All workers of an application |
|
All components of an application |
|
All components of all applications |
|
Dataset metrics are available at the dataset level, but they can also be queried down to the worker, service, Mapper, or Reducer level:
Dataset Metric | Context |
---|---|
A single dataset in the context of a specific application |
|
A single dataset |
|
All datasets |
|
Available System Metrics
Note: A user metric may have the same name as a system metric. They are distinguished by prepending the respective prefix when querying: user or system.
Dataset Metrics
These metrics are available in a dataset context:
Dataset Metric | Description |
---|---|
| Number of bytes written |
| Operations (reads and writes) performed |
| Read operations performed |
| Write operations performed |
Mappers or Reducers Metrics
These metrics are available in a Mappers or Reducers context (specify whether a Mapper or Reducer context is desired, as shown above):
Mappers or Reducers Metrics | Description |
---|---|
| A number from 0 to 100 indicating the progress of the Map or Reduce phase |
| Number of entries read in by the Map or Reduce phase |
| Number of entries written out by the Map or Reduce phase |
Service Metrics
These metrics are available in a service context:
Service Metrics | Description |
---|---|
| Number of requests made to the service |
| Number of successful requests completed by the service |
| Number of failures seen by the service |
Spark Metrics
These metrics are available in a Spark context, where <spark-id> depends on the Spark program being queried:
Spark Metrics | Description |
---|---|
| Disk space used by the Block Manager |
| Maximum memory given to the Block Manager |
| Memory used by the Block Manager |
| Memory remaining to the Block Manager |
| Number of active jobs |
| Total number of jobs |
| Number of failed stages |
| Number of running stages |
| Number of waiting stages |
Request and Response Metrics
These metrics are available for services, for the system services component context or the user services context:
Request and Response Metrics | Description |
---|---|
| Number of requests received for the service |
| Number of successful responses sent |
| Number of |
Application Logging Metrics
These metrics are available for every application context:
Application Logging Metrics | Description |
---|---|
| Number of |
System Services Logging Metrics
These logging metrics are available for system services, in the system component context:
System Services Logging Metrics | Description |
---|---|
| Number of |
System Services Metric Processor Metrics
These processing metrics are available for system services, in the system component context:
System Services Metric Processor Metrics | Description |
---|---|
| Number of metrics processed by metric processor instance |
| Metrics processing delay in milliseconds. Difference between last metric's timestamp and current time |
Transaction Metrics
These metrics are available for the CDAP transaction service:
Transaction Metrics | Description |
---|---|
| Number of |
| Time taken (in milliseconds) to start |
| Number of transaction edits added to the write-ahead log |
| Number of transactions in a specified transaction state |
| Time taken (in milliseconds) to perform a specified transaction state update |
| Number of transactions of a specified type that are active |
Transactional Messaging System Metrics
These metrics are available for the CDAP transactional messaging service:
Transactional Messaging System Metrics | Description |
---|---|
| Number of message persist requests |
| Number of message persist requests succeeded |
| Number of message persist requests failed |
| Number of messages in the queue that are persisted in one batch |
| Number of entries requested to add to the messaging cache |
| Number of entries added to the messaging cache |
| Number of entries removed from the messaging cache |
| Number of times that the cache reduce weight logic was executed while adding entries to the cache. This number ideally should be very small for the cache to have good performance. |
| Number of times that the cache reduce weight logic was executed while scanning the cache. This number ideally should be relatively small and steady over time. |
| Number of scan requests on the messaging cache |
| The current weight of the cache, measured in bytes |
YARN Cluster Metrics
These metrics are available for the YARN cluster resources:
YARN Cluster Metrics | Description |
---|---|
| Size (in megabytes) of total, available, or used cluster memory |
| Number of total, available, or used cluster virtual cores |
CDAP Program Metrics
These metrics are available for CDAP programs:
Program Metrics | Version Introduced | Description |
---|---|---|
|
| Measures time taken by program to transition from provisioning to starting |
| 6.6.0 | Measures time taken by program to transition from provisioning to running |
|
| Measures time taken by program to transition from provisioning to any complete state |
|
| Number of successful program runs |
|
| Number of failed program runs |
|
| Number of killed program runs |
|
| Number of rejected program runs |
| 6.6.0 | Number of top-level program launching requests in the system |
| 6.6.0 | Number of running top-level programs in the system |
User Metrics
These metrics are available for pipeline connections:
Pipeline Connection Metrics
Pipeline Connection Metrics | Version Introduced | Description |
---|---|---|
| 6.7.0 | Number of create connection requests. |
| 6.7.0 | Number of delete connection requests. |
| 6.7.0 | Number of get connection requests. |
| 6.7.0 | Number of browse connection requests. |
| 6.7.0 | Number of sample connection requests. |
| 6.7.0 | Number of specification connection requests. |
| 6.7.0 | Number of upload file connection requests. |
Query Tips
Global Count Example
POST v3/metrics/query?target=metric&metric=user.connections.count
Group By connection Type Example
POST v3/metrics/query?target=metric&metric=user.connections.count&groupBy=tpe
Searches and Queries
The process of retrieving a metric involves these steps:
Obtain (usually through a search) the correct context for a metric;
Obtain (usually through a search within the context) the available metrics;
Query for a specific metric, supplying the context and any parameters.
Search for Contexts
To search for the available contexts, perform an HTTP POST request:
POST /v3/metrics/search?target=tag[&tag=<context>]
The optional <context> defines a metrics context to search within. If it is not provided, the search is performed across all data. The available contexts that are returned can be used to query for a lower level of contexts.
You can also define the query to search in a given context across all values of one or more tags provided in the context by specifying * as a value for a tag. See the examples below for its use.
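As a rough illustration of how such a search request might be assembled, the Python sketch below builds the request path for a context search. The function name `context_search_path` is hypothetical, and the sketch assumes that each tag of the context is passed as its own `tag=<name>:<value>` query parameter; confirm this against your CDAP version before relying on it.

```python
from urllib.parse import urlencode


def context_search_path(tags=()):
    """Build a context-search path; `tags` is a sequence of (name, value) pairs.

    Assumption: each tag is sent as a separate tag=<name>:<value> parameter.
    """
    params = [("target", "tag")]
    params += [("tag", f"{name}:{value}") for name, value in tags]
    # Keep ':' and '*' readable instead of percent-encoding them.
    return "/v3/metrics/search?" + urlencode(params, safe=":*")


# No context: search across all data.
print(context_search_path())
# Search within the default namespace across all applications.
print(context_search_path([("namespace", "default"), ("app", "*")]))
```

The resulting path is appended to the base URL and sent as an HTTP POST request.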
Parameter | Description |
---|---|
| Metrics context to search within. If not provided, the search is performed across all contexts. Consists of a collection of tags. |
Examples
HTTP Method |
|
---|---|
Returns |
|
Description | Returns all first-level tags; in this case, two namespaces. |
|
|
HTTP Method |
|
Returns |
|
Description | Returns all tags of the given parent context; in this case, all entities in the default namespace. |
|
|
HTTP Method |
|
Returns |
|
Description | Queries all available contexts within the PurchaseHistory application for any run. |
Search for Metrics
To search for the available metrics within a given context, perform an HTTP POST request:
POST /v3/metrics/search?target=metric&tag=<context>
Parameter | Description |
---|---|
| Metrics context to search within. Consists of a collection of tags. |
Example
HTTP Method |
|
---|---|
Returns |
|
Description | Returns all metrics in the context of the application PurchaseHistory of the default namespace; in this case, returns a list of system and (possibly) user-defined metrics. |
|
|
HTTP Method |
|
Returns |
|
Description | Returns all metrics in the context of the service UploadService of the application SportResults of the default namespace; in this case, returns a list of system and user-defined metrics. |
Querying a Metric
Once you know the context and the metric to query, you can formulate a request for the metrics data.
In general, a metrics query is performed by making an HTTP POST request, with parameters supplied either in the URL or in the body of the request. If you submit the parameters in the body, you can make multiple queries with a single request.
Metric parameters include:
tag values for filtering by context;
metric names (multiple metric names can be queried in each request);
time range or aggregate=true for an aggregated result; and
tag values for grouping results (optional).
To query a metric within a given context, perform an HTTP POST request:
Parameter | Description |
---|---|
| Metrics context to search within, a collection of tags |
| Metric(s) being queried, a collection of metric names |
| A time range or |
| Tag list by which to group results (optional) |
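The four kinds of parameters listed above can be combined into a single request path, as in the following Python sketch. The function name `metric_query_path` is hypothetical. The sketch assumes that tags are sent as repeated `tag=<name>:<value>` parameters, metrics as repeated `metric=<name>` parameters, and that multiple groupBy tags are joined with commas; these conventions are assumptions and should be checked against your CDAP version.

```python
from urllib.parse import urlencode


def metric_query_path(tags, metrics, group_by=(), aggregate=True):
    """Build a metrics query path from context tags, metric names and options."""
    params = [("tag", f"{name}:{value}") for name, value in tags]
    params += [("metric", metric) for metric in metrics]
    if aggregate:
        # Request the total aggregated value instead of a time series.
        params.append(("aggregate", "true"))
    if group_by:
        # Assumption: several groupBy tags are passed as a comma-separated list.
        params.append(("groupBy", ",".join(group_by)))
    return "/v3/metrics/query?" + urlencode(params, safe=":,")


print(metric_query_path(
    tags=[("namespace", "default"), ("app", "PurchaseHistory")],
    metrics=["user.customers.count"],
    group_by=["spark"],
))
```

Building the query string with `urlencode` rather than by hand helps avoid separator mistakes between parameters.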
Query Examples
HTTP Method |
|
---|---|
Returns |
|
Description | Using a System metric, system.process.events.processed |
|
|
HTTP Method |
|
Returns |
|
Description | Querying the User-defined metric names.bytes by its run-ID |
|
|
HTTP Method |
|
Returns |
|
Description | Using a User-defined metric, names.bytes in a service's Handler, called before any data entered, returning an empty series |
|
|
HTTP Method |
|
Returns |
|
Description | Using a User-defined metric, names.bytes in a service's Handler |
Query Results
Results from a query are returned as a JSON string, in the format:
Name | Description |
---|---|
| Start time, in seconds, with 0 being from the beginning of the query records |
| End time, in seconds |
| An array of metric results, which can be one series, a multiple time series, or none (an empty array) |
If a particular metric has no value, a query will return an empty array in the "series" of the results, such as:
You can also receive such a result from querying a metric that does not exist, either because it does not exist at the context given or if the query is incorrectly formulated:
will return the empty result, as the metric name will be interpreted as "user.names.bytes?aggregate=true" instead of "user.names.bytes".
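The effect of such a malformed query can be reproduced with Python's standard URL parser. The two request paths below are hypothetical examples constructed for illustration; the first uses a second `?` where an `&` was intended, so the trailing text becomes part of the metric name.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical paths: the first one uses '?' where '&' was intended.
malformed = "/v3/metrics/query?tag=namespace:default&metric=user.names.bytes?aggregate=true"
well_formed = "/v3/metrics/query?tag=namespace:default&metric=user.names.bytes&aggregate=true"


def metric_names(path):
    """Return the values of the 'metric' parameter as a URL parser sees them."""
    return parse_qs(urlsplit(path).query).get("metric", [])


print(metric_names(malformed))    # metric name includes "?aggregate=true"
print(metric_names(well_formed))  # metric name is "user.names.bytes"
```

Only the first `?` in a URL starts the query string, so every later parameter must be separated with `&`.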
Querying for Multiple Metrics
Retrieving multiple metrics at once can be accomplished by issuing an HTTP POST request with a JSON list as the request body that enumerates the name and attributes for each metric. The format of the request and the JSON body depends on whether the metrics share the same context or are being called for different contexts.
Multiple Metrics with the Same Context
Retrieving multiple metrics at once for the same context can be accomplished by issuing a request as in previous examples, but providing the additional metrics. For example:
The result (pretty-printed to fit) would be:
Multiple Metrics with Different Contexts
Retrieving multiple metrics at once for different contexts can be accomplished by issuing a request with a JSON list as the request body that enumerates the name, attributes and context for each metric. Use an HTTP POST request:
with the arguments as a JSON string in the body. The format of the JSON follows this structure (pretty-printed):
Queries are identified by a <query-id> (in the example above, query1, query2; in the example below, eventsIn, eventsOut). The <query-id> is then used in the returned result to identify the series.
For example, to retrieve multiple metrics using a curl call (command and results reformatted to fit):
If the context of the requested metric or the metric itself doesn't exist, the system returns a status 200 (OK) with JSON formed following the above description, with an empty series for values:
Querying for Multiple Time Series
In a query, the optional groupBy parameter defines a list of tags whose values are used to build multiple time series. All data points that have the same values in tags specified in the groupBy parameter will form a single time series. You can define multiple tags for grouping by providing a list, similar to a tag combination list.
Tag List | Description |
---|---|
| Retrieves the time series for each application |
| Retrieves a time series for each App and Spark combination. |
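The grouping rule for groupBy can be sketched in Python as follows. This illustrates the semantics only and is not how CDAP implements it; the function name `group_into_series`, the sample data points, and the program name OtherTracker are all hypothetical.

```python
from collections import defaultdict


def group_into_series(points, group_by):
    """Group (tags, time, value) points into one series per groupBy tag combination."""
    series = defaultdict(list)
    for tags, time, value in points:
        # Points sharing the same values for every groupBy tag land in one series.
        key = tuple(tags.get(name) for name in group_by)
        series[key].append((time, value))
    return dict(series)


points = [
    ({"app": "PurchaseHistory", "spark": "PurchaseTracker"}, 1, 10),
    ({"app": "PurchaseHistory", "spark": "PurchaseTracker"}, 2, 12),
    ({"app": "PurchaseHistory", "spark": "OtherTracker"}, 1, 3),
]

print(group_into_series(points, ["app"]))
print(group_into_series(points, ["app", "spark"]))
```

Grouping by the application tag alone produces one series, while grouping by both the application and Spark tags produces one series per combination.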
An example method (re-formatted to fit):
returns the user.customers.count metric in the context of the application PurchaseHistory of the default namespace, for the specified time range, and grouped by spark: PurchaseHistoryTracker (results reformatted to fit):
Querying by a Time Range
The time range of a metric query can be specified in various ways: either aggregate=true to retrieve the total aggregated since the application was deployed or, in the case of dataset metrics, since a dataset was created; or as a start and end to define a specific range and return a series of data points.
By default, queries without a time range retrieve a value based on aggregate=true.
Parameter | Description |
---|---|
| Total aggregated value for the metric since the application was deployed. If the metric is a gauge type, the aggregate will return the latest value set for the metric. |
| Time range defined by start and end times, where the times are either in seconds since the start of the Epoch, or a relative time, using |
| Number of time intervals since start with length of time interval defined by resolution. If |
| Time resolution in seconds, minutes or hours; or if "auto", one of |
With a specific time range, a resolution can be included to retrieve a series of data points for a metric. By default, 1 second resolution is used. Acceptable values are noted above. If resolution=auto, the resolution will be determined based on a time difference calculated between the start and end times:
(endTime - startTime) > 36000 seconds (ten hours), resolution will be 1 hour;
(endTime - startTime) > 600 seconds (ten minutes), resolution will be 1 minute;
otherwise, resolution will be 1 second.
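The resolution=auto rule above can be written as a small Python function. The name `auto_resolution` is hypothetical, and the function returns the resolution as a number of seconds rather than as the parameter string the API accepts.

```python
def auto_resolution(start_time: int, end_time: int) -> int:
    """Return the resolution, in seconds, chosen for a time range given in seconds."""
    span = end_time - start_time
    if span > 36000:   # more than ten hours
        return 3600    # 1 hour
    if span > 600:     # more than ten minutes
        return 60      # 1 minute
    return 1           # 1 second


print(auto_resolution(0, 7200))  # a two-hour range
```

Both thresholds are strict, so a range of exactly ten hours still uses a 1 minute resolution and a range of exactly ten minutes still uses a 1 second resolution.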
Time Range | Description |
---|---|
| The last 30 seconds. The start time is given in seconds relative to the current time. You can apply simple math, using |
| From |