Best practices guide on Plugin Validation

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction

A CDAP pipeline is composed of various plugins that are configured by users as the pipeline is being developed. While building a pipeline, the pipeline developer can provide invalid plugin configurations or an invalid schema. For example, the BigQuery sink plugin can have an output schema that does not match the underlying BigQuery table, or the provided bucket name may contain invalid characters. The pipeline developer can use the new validation endpoint to validate each stage before deploying the pipeline. In order to fail fast and provide a better user experience, the validation endpoint should return all possible validation errors from a given stage when it is called.

Goals

The purpose of this document is to provide best practices for using the new validation APIs in CDAP plugins.

Validation Api usage in plugins

The purpose of using the validation APIs in plugins is to collect validation errors as early as possible. The FailureCollector API is used to collect multiple ValidationFailures.

Failure collection using FailureCollector

CDAP plugins override the configurePipeline() method, which is used to configure the stage at deploy time. The same method is called by the validation endpoint as well. In order to collect multiple validation failures, the FailureCollector API is exposed through the stage configurer so that configurations can be validated in this method. Sample usage of the FailureCollector API looks as below:

@Override
public void configurePipeline(PipelineConfigurer configurer) {
   StageConfigurer stageConfigurer = configurer.getStageConfigurer();
   // get failure collector from stage configurer
   FailureCollector collector = stageConfigurer.getFailureCollector();
   // use failure collector to collect multiple validation failures
   config.validate(collector); 
   validatePartitionProperties(collector); 
   validateConfiguredSchema(configuredSchema, collector);
   stageConfigurer.setOutputSchema(configuredSchema);
}
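
The config.validate(collector) call above is assumed to be a method on the plugin's config class (a PluginConfig) that records failures on the collector instead of throwing. A minimal sketch, assuming a 'bucket' property similar to the example later in this document:

public void validate(FailureCollector collector) {
   // skip checks for properties that are still macros at configure time
   if (!containsMacro("bucket") && bucket != null && !bucket.matches("[a-z0-9._-]+")) {
      collector.addFailure("Allowed characters are lowercase letters, numbers, '.', '_', and '-'.",
                           "Bucket name should only contain allowed characters.")
               .withConfigProperty("bucket");
   }
   // additional property checks go here; the collector accumulates all failures
}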

Adding ValidationFailures to FailureCollector

A validation failure is made up of 3 components:

  • Message - Represents a validation error message
  • Corrective action - An optional corrective action that represents an action to be taken by the user to correct the error
  • Causes - Represents one or more causes for the validation failure. Each cause can have one or more attributes. These attributes are used to highlight different sections of the plugin on the UI.

Example:

In the BigQuery source, if the bucket config contains invalid characters, a new validation failure is added to the collector with a `stageConfig` cause attribute as below:

Pattern p = Pattern.compile("[a-z0-9._-]+");
if (!p.matcher(bucket).matches()) {
   collector.addFailure("Allowed characters are lowercase letters, numbers, '.', '_', and '-'.",
                        "Bucket name should only contain allowed characters.")
            .withConfigProperty("bucket");
}


While a ValidationFailure allows plugins to add a cause with arbitrary attributes, the ValidationFailure API provides various utility methods to create validation failures with common causes that can be used to highlight the appropriate UI sections. Below is the list of common causes and their associated plugin usage:

1. Stage config cause

Purpose: Indicates an error in the stage property

Scenario: User has provided an invalid bucket name for the BigQuery source plugin

Example: 

collector.addFailure("Allowed characters are lowercase characters, numbers,'.', '_', and '-'", 
                     "Bucket name should only contain allowed characters.'")
                     .withConfigProperty("bucket");


2. Plugin not found cause

Purpose: Indicates a plugin not found error

Scenario: User is trying to use a plugin/JDBC driver that has not been deployed

Example:

collector.addFailure("Unable to load JDBC driver class 'com.mysql.jdbc.Driver'.",
                     "Jar with JDBC driver class 'com.mysql.jdbc.Driver' must be deployed")
                     .withPluginNotFound("driver", "mysql", "jdbc");
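
In a database plugin, this failure is typically added when the driver plugin cannot be located at configure time. A minimal sketch, assuming the plugin looks up the driver with usePluginClass() on the pipeline configurer (the plugin id, name, and type values are illustrative):

Class<?> driverClass = configurer.usePluginClass("jdbc", "mysql", "driver",
                                                 PluginProperties.builder().build());
if (driverClass == null) {
   collector.addFailure("Unable to load JDBC driver class 'com.mysql.jdbc.Driver'.",
                        "Jar with JDBC driver class 'com.mysql.jdbc.Driver' must be deployed.")
            .withPluginNotFound("driver", "mysql", "jdbc");
}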


3. Config element cause

Purpose: Indicates an error in a single element of a list of values for a given config property

Scenario: User has provided a field to keep in the projection transform that does not exist in the input schema

Example: 

collector.addFailure("Field to keep 'non_existing_field' does not exist in the input schema",
                     "Field to keep must be present in the input schema")
                     .withConfigElement("keep", "non_existing_field");
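
In practice, this cause is usually added while iterating over the elements of the list-valued property, so that every invalid element is highlighted. A minimal sketch, assuming 'keep' is a comma-separated list of field names and 'inputSchema' is the stage's input schema:

for (String fieldToKeep : keep.split(",")) {
   // the config element cause points at the specific offending value in the list
   if (inputSchema.getField(fieldToKeep) == null) {
      collector.addFailure(
         String.format("Field to keep '%s' does not exist in the input schema.", fieldToKeep),
         "Field to keep must be present in the input schema.")
         .withConfigElement("keep", fieldToKeep);
   }
}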


4. Input schema field  cause

Purpose: Indicates an error in input schema field

Scenario: User is using the BigQuery sink plugin, which does not support fields of type record

Example:

collector.addFailure("Input field 'record_field' is of unsupported type.",
                     "Field 'record_field' must be of primitive type.")
                     .withInputSchemaField("record_field", null);
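
Such failures are typically collected while iterating over the input schema fields. A minimal sketch, assuming record type fields are unsupported by the sink:

for (Schema.Field field : inputSchema.getFields()) {
   // unwrap nullable schemas before checking the type
   Schema fieldSchema = field.getSchema().isNullable() ? field.getSchema().getNonNullable() : field.getSchema();
   if (fieldSchema.getType() == Schema.Type.RECORD) {
      collector.addFailure(
         String.format("Input field '%s' is of unsupported type.", field.getName()),
         String.format("Field '%s' must be of primitive type.", field.getName()))
         .withInputSchemaField(field.getName(), null);
   }
}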


5. Output schema field cause

Purpose: Indicates an error in output schema field

Scenario: User has provided an output schema field that does not exist in the BigQuery source table

Example:

collector.addFailure("Output field 'non_existing' does not exist in table 'xyz'.",
                     "Field 'non_existing' must be present in table 'xyz'.")
                     .withOutputSchemaField("non_existing", null);


Cause Associations

While validating the plugin configurations, a single validation failure can have multiple causes. Below are a few examples of associated causes:

Example 1

The database source has username and password as co-dependent properties. If a password is provided but the username is not, the plugin can add a single validation failure with 2 causes as below:

collector.addFailure("Missing username",
                     "Username and password must be provided'")
                     .withConfigProperty("username").withConfigProperty("password");


Example 2

The Projection transform received an input schema and output schema that are incompatible for a field, such that the input field cannot be converted to the output field type. In that case, a new validation failure can be created with 2 different causes as below:

collector.addFailure("Input field 'record_type' can not be converted to string",
                     "Field 'record_type' must be of primitive type'")
                     .withConfigProperty("convert").withInputSchemaField("record_type");

Guidelines to incorporate new validation apis

  1. Validate all the stage config properties.
    1. Make sure all the configuration properties are validated. Properties provided as macros cannot be validated at deploy time, so they should be validated as early as possible at runtime, for example in the prepareRun() method (see the sketch at the end of this item).
    2. Make sure all the validation errors are captured in the failure collector and no validation related exceptions are thrown from the configurePipeline() method.
    3. Handle macro enabled properties correctly so that null pointer exceptions are not thrown during validation.
    4. Make sure source and sink output schemas match the underlying storage during validation.
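
    For macro-enabled properties, runtime validation in a batch plugin's prepareRun() can look like the sketch below (assuming the config exposes the same validate(FailureCollector) method used at configure time):

    @Override
    public void prepareRun(BatchSinkContext context) {
      // macros are resolved by this point, so macro-enabled properties can now be validated
      FailureCollector collector = context.getFailureCollector();
      config.validate(collector);
      // throws a ValidationException if any failures were collected
      collector.getOrThrowException();
      // ... rest of prepareRun
    }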

  2. Handling validation for co-dependent properties:
    1. It is possible that some of the config properties are dependent on each other. For example, if the JDBC driver can not be loaded, no further validation should be performed. In such cases, use the getOrThrowException() method to throw a ValidationException.

  3. If a private method called by the validate() method returns an object, throw the result of getOrThrowException() at the end of the method as shown below:

    private String someMethod() {
      switch (someVar) {
        // cases
      }
      // if control comes here, it means failure
      failureCollector.addFailure(...);
      // throw validation exception so that the compiler knows that exception is being thrown which eliminates 
      // the need to have a statement that returns null at the end of this method
      throw failureCollector.getOrThrowException();
    }


  4. Handling logical types:
    1. Make sure to add checks for logical types wherever applicable.
    2. Use display names so that the appropriate logical type names appear in the error message. For example:

      if (fieldSchema.getLogicalType() != null || fieldSchema.getType() != Schema.Type.BYTES) {
         collector.addFailure(
            // If schema represents a logical type schema, Schema.getDisplayName() will make sure appropriate display name is being used for logical types 
            String.format("Field '%s' must be of type 'bytes' but is of type '%s'.", field.getName(), fieldSchema.getDisplayName()),
            String.format("Make sure field '%s' is of type 'bytes'.", field.getName()))
            .withConfigProperty("audiofield").withInputSchemaField(audioFieldName, null);
      }

Related Work
