WIP : Cloud plugins user experience improvement in data-pipelines

Checklist

  • User Stories Documented
  • User Stories Reviewed
  • Design Reviewed
  • APIs reviewed
  • Release priorities assigned
  • Test cases reviewed
  • Blog post

Introduction 

Simplify and improve the user experience of cloud-based plugins such as GCS, S3, and BigQuery for CDAP data-pipelines in a cloud environment.

Goals

When CDAP is provisioned in a cloud environment such as Google Cloud or AWS, provide autofill suggestions for fields such as bucket_name, dataset_id, etc. in the GCS, BigQuery, and S3 plugins used in data-pipelines.

When CDAP is provisioned in a cloud environment such as Google Cloud or AWS, disable fields such as credentials, project_information, etc. in the GCS, BigQuery, and S3 plugins, since these values are available by default for the CDAP instance.


User Stories 

  • A user who has provisioned CDAP on GCP and is configuring the Google Cloud Storage plugin in a data-pipeline expects auto suggestions for the bucket_name and path fields based on the buckets they have access to.
  • A user who has provisioned CDAP on GCP and is configuring the Google BigQuery plugin in a data-pipeline expects auto suggestions for the dataset_id and table_id fields based on the datasets they have access to.

  • As a plugin developer, for Google Cloud plugins I would like to mark certain fields as disabled in the GCP environment to improve the user experience.
  • As a plugin developer, for AWS plugins I would like to mark certain fields as disabled in the AWS environment to improve the user experience.

  • A user who has provisioned CDAP on GCP and is configuring the GCS/BigQuery plugin in a data-pipeline does not expect to provide credentials and project information; they expect these fields to be hidden or disabled.

Design

Autofill suggestions for cloud plugins


When users wrangle their data in data-prep and then create a pipeline from data-prep, the source plugin and its fields are filled in automatically, and the user doesn’t have to provide them again.

However, a user can also start with the GCS/BigQuery/S3 plugins directly in the data-pipeline view. We would like to improve their experience in the cloud by providing autofill suggestions with the available values for fields such as buckets, dataset_id, etc.

CDAP plugins can expose endpoints that provide additional functionality, typically to retrieve schema information. For example, the Database plugin executes a query and returns its schema, and the UI uses this schema as the output schema of the database source.

We can leverage the existing plugin endpoints feature to add endpoints to the cloud plugins that list buckets, paths, datasets, etc.



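import com.google.cloud.storage.Bucket;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.util.ArrayList;
import java.util.List;
import javax.annotation.Nullable;
import javax.ws.rs.Path;

// Plugin endpoint that lists the names of all buckets visible to the project.
@Path("bucketList")
public List<String> getBucketList(ProjectInfo request) {
  List<String> buckets = new ArrayList<>();
  Storage storage = getStorage(request);
  // iterateAll() walks every page; getValues() would return only the current page.
  for (Bucket bucket : storage.list().iterateAll()) {
    buckets.add(bucket.getName());
  }
  return buckets;
}

// Request body; both fields may be omitted when CDAP runs on GCP.
class ProjectInfo {
  @Nullable
  public String serviceAccountFilePath;
  @Nullable
  public String projectName;
}

// Builds a Storage client, using the environment defaults for anything not provided.
private Storage getStorage(ProjectInfo project) {
  StorageOptions.Builder storageOptionsBuilder = StorageOptions.newBuilder();
  if (project.serviceAccountFilePath != null) {
    // loadServiceCredentialsFromFile is a plugin helper that is not shown here.
    storageOptionsBuilder.setCredentials(loadServiceCredentialsFromFile(project.serviceAccountFilePath));
  }
  if (project.projectName != null) {
    storageOptionsBuilder.setProjectId(project.projectName);
  }
  return storageOptionsBuilder.build().getService();
}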


Note:

serviceAccountFilePath and projectName can be null on a CDAP instance running in the GCP cloud environment. In a local sandbox or other non-cloud environment, users who provide the service account and project information can still use this endpoint to get autofill suggestions for buckets, etc.


Question: Should the backend support a result-size limit and an offset when listing buckets, bucket contents, etc.?
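
If the answer is yes, one option is offset/limit paging applied inside the endpoint. The following is a minimal sketch, not a proposed API: `ListingPager`, `offset`, and `limit` are hypothetical names, and an in-memory list stands in for the names returned by the storage client.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: none of these names exist in CDAP or the GCS plugin.
class ListingPager {

  // Returns at most `limit` names starting at `offset`. An offset past the
  // end of the list yields an empty page instead of an exception.
  static List<String> page(List<String> names, int offset, int limit) {
    if (offset < 0 || limit < 0) {
      throw new IllegalArgumentException("offset and limit must be non-negative");
    }
    if (offset >= names.size()) {
      return Collections.emptyList();
    }
    // Compute the end index in long arithmetic so a huge limit cannot overflow.
    int end = (int) Math.min((long) names.size(), (long) offset + limit);
    return names.subList(offset, end);
  }

  public static void main(String[] args) {
    List<String> buckets = Arrays.asList("bucket-a", "bucket-b", "bucket-c");
    System.out.println(page(buckets, 1, 5)); // prints [bucket-b, bucket-c]
  }
}
```

Offsets are simple for the UI to work with, but they require the backend to fetch everything before the offset. A token-based scheme that mirrors the storage client's own paging would be an alternative worth weighing.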

Disabling unnecessary fields in cloud plugins

When CDAP is running in a cloud environment, allowing users to edit certain fields such as credentials, project_information, etc. would be confusing or misleading. Disabling these fields is the more robust choice.

Approach #1 - Annotating fields in plugin config with disabled - Considered

We want to allow plugin developers to tag certain fields as disabled in certain runtime environments.



GCSConfig
public static class GCSSourceConfig extends FileSourceConfig {
 @Name("project")
 @Description("Project ID")
 @Macro
 @Nullable
 @Disabled("GCS")
 public String project;

 @Name("serviceFilePath")
 @Description("Service account file path.")
 @Macro
 @Nullable
 @Disabled("GCS")
 public String serviceAccountFilePath;
}

This requires a platform change: the PluginPropertyField class needs a new field that holds the list of environments in which the property is disabled.



PluginPropertyField
public class PluginPropertyField {
 // Existing fields
 private final String name;
 private final String description;
 private final String type;
 private final boolean required;
 private final boolean macroSupported;
 private final boolean macroEscapingEnabled;
 // New field
 private final List<String> disabled;
}

When the CDAP UI receives the plugin config, it can disable a field if that field's disabled list is not empty and contains the current CDAP environment.
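
The flow for Approach #1 can be sketched end to end as follows. This is an illustration only: the `@Disabled` annotation does not exist in CDAP today, and `SampleConfig`, `DisabledFields`, `disabledEnvironments`, and `isDisabled` are hypothetical names. The sketch also assumes that the annotation accepts one or more environment names, that `GCP` is the name of the environment, and that names are compared case-insensitively.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical annotation proposed by Approach #1; not part of CDAP today.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface Disabled {
  String[] value();
}

// Stand-in for a plugin config class.
class SampleConfig {
  @Disabled("GCP")
  public String project;

  public String bucket;
}

class DisabledFields {

  // Platform side: reads the environments in which a config field is disabled.
  static List<String> disabledEnvironments(Class<?> configClass, String fieldName) {
    Field field;
    try {
      field = configClass.getDeclaredField(fieldName);
    } catch (NoSuchFieldException e) {
      throw new IllegalArgumentException("Unknown field: " + fieldName, e);
    }
    Disabled disabled = field.getAnnotation(Disabled.class);
    return disabled == null
        ? Collections.<String>emptyList()
        : Arrays.asList(disabled.value());
  }

  // UI side: the check applied to the list carried by PluginPropertyField.
  static boolean isDisabled(List<String> disabledIn, String currentEnvironment) {
    for (String environment : disabledIn) {
      if (environment.equalsIgnoreCase(currentEnvironment)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> environments = disabledEnvironments(SampleConfig.class, "project");
    System.out.println(isDisabled(environments, "gcp")); // prints true
  }
}
```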

Approach #2 - Providing disabled information through widget properties - Preferred

  1. Add a new widget property in the UI for marking a field as disabled. Plugin developers can use it when writing their widget JSON.

  2. Example: in the GCSFile-batchsource.json widget, the credential field gets an additional property that disables it in GCP.

  3. Widget properties are stored as plugin properties in the artifact store. The UI queries the widget properties, which carry additional information about the fields and their widget types.

  4. If the widget properties mark the credential field as disabled in GCP and the environment is GCP, the UI can use that information to disable the field.



GCS-Widget-Snippet.json
{
  "label" : "Service Account and Project",
  "properties" : [
    {
      "widget-type": "textbox",
      "label": "Service Account File Path",
      "name": "serviceFilePath",
      "widget-attributes" : {
        "placeholder": "Path to service account file (Local to host running on).",
        "disabled": "mode:gcp"
      }
    },
    {
      "widget-type": "textbox",
      "label": "Project Id",
      "name": "project",
      "widget-attributes" : {
        "placeholder": "The Project Id of GCS.",
        "disabled": "mode:gcp"
      }
    },
    {
      "widget-type": "textbox",
      "label": "Bucket Name",
      "name": "bucket",
      "widget-attributes" : {
        "placeholder": "Temporary Google Cloud Storage bucket name."
      }
    }

  ]
}
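
How the UI might interpret the `disabled` attribute can be sketched as follows. The UI is not written in Java, so this only illustrates the rule. The sketch assumes that the value has the form `mode:<environment>` as in the snippet above, that matching is case-insensitive, and that a missing or malformed value leaves the field enabled. `WidgetDisabledRule` and `isDisabled` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper illustrating the rule; not existing CDAP UI code.
class WidgetDisabledRule {
  private static final String PREFIX = "mode:";

  // Returns true when the widget attributes disable the field in currentMode.
  static boolean isDisabled(Map<String, String> widgetAttributes, String currentMode) {
    String rule = widgetAttributes.get("disabled");
    if (rule == null || !rule.startsWith(PREFIX)) {
      return false; // no rule, or an unrecognized format: keep the field editable
    }
    String mode = rule.substring(PREFIX.length()).trim();
    return !mode.isEmpty() && mode.equalsIgnoreCase(currentMode);
  }

  public static void main(String[] args) {
    Map<String, String> attributes = new HashMap<>();
    attributes.put("placeholder", "The Project Id of GCS.");
    attributes.put("disabled", "mode:gcp");
    System.out.println(isDisabled(attributes, "GCP")); // prints true
    System.out.println(isDisabled(attributes, "aws")); // prints false
  }
}
```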


Why is Approach #2 preferred?

Marking a field as disabled is a UI-specific feature. Other annotations used on plugin property fields in the plugin config, such as @Nullable and @Macro, have backend logic behind them, whereas disabled only affects the user experience. Hence Approach #2 seems the more reasonable solution.

Note on CDAP instance

For either of the approaches, we need a way for CDAP UI to get information about the platform the CDAP instance is on.


Error Handling

When do we test the connections that are created by default? If there are any exceptions while testing or using a created connection, how do we surface them?

If we test the connections when they are created, multiple exceptions could occur during data-prep initialization, which would result in a bad user experience.

Testing a connection when the user selects it might be a better time, since we can then provide the appropriate result or error message. It would be good to get the UI team's opinion on this.
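
The second option, testing at selection time, could look roughly like the sketch below. All names are hypothetical: `ConnectionTester` stands in for whatever call actually verifies a connection, and `Result` for whatever the UI needs in order to show a status or an error message.

```java
// Hypothetical sketch; these types do not exist in CDAP.
class ConnectionCheck {

  // Stand-in for the real connectivity test.
  interface ConnectionTester {
    void test(String connectionName) throws Exception;
  }

  // Outcome handed back to the UI for display.
  static final class Result {
    final boolean ok;
    final String message;

    Result(boolean ok, String message) {
      this.ok = ok;
      this.message = message;
    }
  }

  // Invoked when the user selects a connection, not when it is created, so a
  // broken default connection surfaces one targeted error instead of several
  // failures during initialization.
  static Result checkOnSelect(String connectionName, ConnectionTester tester) {
    try {
      tester.test(connectionName);
      return new Result(true, "Connection '" + connectionName + "' is usable.");
    } catch (Exception e) {
      return new Result(false, "Connection '" + connectionName + "' failed: " + e.getMessage());
    }
  }

  public static void main(String[] args) {
    Result result = checkOnSelect("gcs-default", name -> {
      throw new IllegalStateException("missing credentials");
    });
    System.out.println(result.message);
    // prints Connection 'gcs-default' failed: missing credentials
  }
}
```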

UI Impact or Changes

  • Autofill suggestions for cloud plugin fields will require a new UI widget type. The current widget for plugin endpoints relies on an explicit button to get the schema, whereas the new widget type would call the endpoint implicitly for those fields, similar to the stream/dataset selector.

  • A new UI widget attribute for disabled fields, which the UI understands and uses to disable those fields.
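
The filtering step of such an autofill widget can be sketched as follows. This is only an illustration in Java of logic that would live in the UI. It assumes that the endpoint has already returned the candidate names, that matching is a case-insensitive prefix match, and that the number of suggestions is capped. `AutofillSuggestions` and `suggest` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper; not existing CDAP UI code.
class AutofillSuggestions {

  // Returns up to `max` candidates that start with the text typed so far.
  static List<String> suggest(List<String> candidates, String typed, int max) {
    String prefix = typed == null ? "" : typed.toLowerCase(Locale.ROOT);
    List<String> matches = new ArrayList<>();
    for (String candidate : candidates) {
      if (matches.size() >= max) {
        break;
      }
      if (candidate.toLowerCase(Locale.ROOT).startsWith(prefix)) {
        matches.add(candidate);
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    List<String> buckets = Arrays.asList("logs-prod", "logs-dev", "exports");
    System.out.println(suggest(buckets, "LOG", 10)); // prints [logs-prod, logs-dev]
  }
}
```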

Created in 2020 by Google Inc.