...

Reusable pipelines have a regular pipeline structure, but use macros to add dynamic variables to plugin configurations, so that you can specify the variable substitutions at runtime.

Fields that can be made dynamic using macros have a blue M icon on the far right in the Plugin Configuration section:

...

Use the pricing calculator to generate a cost estimate based on your projected usage.

Before you begin

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project.

    Note: If you don't plan to keep the resources that you create in this procedure, create a project instead of selecting an existing project. After you finish these steps, you can delete the project, removing all resources associated with the project.

    Go to the project selector page

  2. Make sure that billing is enabled for your Cloud project. Learn how to confirm that billing is enabled for your project.

  3. Enable the Cloud Storage APIs.
    Enable the APIs

  4. Download the following sample file, args.json, and upload it to a Cloud Storage bucket. This is the JSON file that the GCS Argument Setter will read:

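    If you prefer the command line, you can upload args.json with the gsutil tool. BUCKET_NAME below is a placeholder for a bucket in your project:

    Code Block
    gsutil cp args.json gs://BUCKET_NAME/args.json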

Read from Cloud Storage

  1. In the Pipeline Studio, add a GCS batch source to the canvas.

  2. Click the Properties button.

  3. In the Reference name field, enter a name.

  4. In the Project ID field, enter your Google Cloud Project ID.

  5. In the Path field, enter ${input.path}. This macro sets the Cloud Storage input path for each pipeline run (see the example after this procedure).

  6. (Optional) In the Output Schema panel, remove the offset field from the output schema by clicking the trash icon in the offset field row.
    Recommended: The offset field isn’t required for text sources. Including only required fields in a pipeline is a best practice that can streamline troubleshooting and lineage analysis.

  7. In the Format field, select text.
    The format of the GCS source must match the format of the source file whose path the GCS Argument Setter supplies at runtime.

  8. For Service Account, select JSON.

  9. In the Service Account JSON field, paste the entire contents of the Service Account JSON.

  10. Click Validate to make sure you don't have any errors.

  11. Click the X button to exit the Properties dialog box.
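
At runtime, CDAP substitutes the value of the input.path runtime argument into the Path field. For example, with the args.json file used later in this tutorial, the macro resolves as follows:

Code Block
${input.path}  ->  gs://reusable-pipeline-tutorial/user-emails.txt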

Transform your data

  1. Add a Wrangler transformation to the pipeline.

  2. On the Wrangler transformation, click the Properties button.

  3. In Input field name, enter body.

  4. In the Recipe field, enter ${directives}. This macro sets the Wrangler transform logic for each pipeline run (see the example after this procedure).

  5. Click Validate to make sure you don't have any errors.

  6. Click the X button to exit the Properties dialog box.
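
At runtime, the value of the directives runtime argument becomes the Wrangler recipe. With the args.json file used later in this tutorial, it resolves to the single directive below; a recipe can also contain multiple directives, one per line:

Code Block
send-to-error !dq:isEmail(body)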

Write to Cloud Storage

  1. Add a GCS sink to the pipeline.

  2. On the GCS sink, click the Properties button.

  3. In the Reference name field, enter a name.

  4. In the Path field, enter the path of a Cloud Storage bucket in your project, where your pipeline can write the output files. If you don't have a Cloud Storage bucket, create one.

  5. For Service Account, select JSON.

  6. In the Service Account JSON field, paste the entire contents of the Service Account JSON.

  7. In the Format field, select json.
    Note: You can select any output format, but for this tutorial, we’ll write to a JSON file.

  8. Click Validate to make sure you don't have any errors.

  9. Click the X button to exit the Properties dialog box.
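
When the pipeline runs, the sink writes the output records to files under this path. With the test data used later in this tutorial, only the valid email addresses reach the sink, so the output would look roughly like the following sketch (assuming one JSON object per line):

Code Block
{"body":"alice@example.com"}
{"body":"bob@example.com"}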

Add the argument setter

  1. From the Plugin Palette, under Conditions and Actions, select the GCS Argument Setter plugin.

  2. In the Pipeline Studio canvas, connect the GCS Argument Setter to the GCS batch source.

  3. On the GCS Argument Setter plugin, click the Properties button.

  4. In the Project ID field, enter your Google Cloud Project ID.

  5. In the Path field, enter the path to the args.json file you uploaded to your bucket.
    The file contains the following content:

    Code Block
    {
      "arguments" : [
        {
          "name": "input.path",
          "value": "gs://reusable-pipeline-tutorial/user-emails.txt"
        },
        {
          "name": "directives",
          "value": "send-to-error !dq:isEmail(body)"
        }
      ]
    }

    The first of the two arguments is the value for input.path. The path gs://reusable-pipeline-tutorial/user-emails.txt is a publicly accessible object in Cloud Storage that contains the following test data:

    Code Block
    alice@example.com
    bob@example.com
    craig@invalid@example.com

    The second argument is the value for directives. The value send-to-error !dq:isEmail(body) configures Wrangler to filter out any line that isn't a valid email address. For example, craig@invalid@example.com is filtered out. Because both values are supplied at runtime, you can rerun the same deployed pipeline against different data by changing args.json (see the sketch after this procedure).

  6. In the Service Account Type field, select JSON.

  7. In the Service Account JSON field, paste the entire contents of the Service Account JSON.

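The following sketch of an alternative args.json keeps the same validation logic but reads a different input file; the bucket and file names are placeholders for your own values:

Code Block
{
  "arguments" : [
    {
      "name": "input.path",
      "value": "gs://BUCKET_NAME/more-user-emails.txt"
    },
    {
      "name": "directives",
      "value": "send-to-error !dq:isEmail(body)"
    }
  ]
}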

Deploy and run your pipeline

  1. In the top bar of the Pipeline Studio page, click Name your pipeline. Enter a name for your pipeline, and then click Save.

  2. Click Deploy.

  3. Next to Run, click the drop-down arrow to open the Runtime Arguments panel and view the macro (runtime) arguments input.path and directives. Leave the value fields blank; this notifies CDAP that the GCS Argument Setter plugin in the pipeline will set the values of these arguments at runtime.

  4. Click Run.
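
You can also start runs programmatically through the CDAP REST API. The following is a minimal sketch, assuming a Cloud Data Fusion instance; INSTANCE_ID, REGION, and the pipeline name reusable-pipeline are placeholders for your own values:

Code Block
# Look up the CDAP API endpoint of your Data Fusion instance.
export CDAP_ENDPOINT=$(gcloud beta data-fusion instances describe \
    --location=REGION --format="value(apiEndpoint)" INSTANCE_ID)

# Start the pipeline's batch workflow. No runtime arguments are passed,
# because the GCS Argument Setter sets input.path and directives at runtime.
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "${CDAP_ENDPOINT}/v3/namespaces/default/apps/reusable-pipeline/workflows/DataPipelineWorkflow/start"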

Cleaning up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial, either delete the project that contains the resources, or keep the project and delete the individual resources.

...

  • Everything in the project is deleted. If you used an existing project for this tutorial, when you delete it, you also delete any other work you've done in the project.

  • Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.

  1. In the Cloud Console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.

  3. In the dialog, type the project ID, and then click Shut down to delete the project.

Related Topics

Google Cloud Storage Argument Setter Action

Reusable Pipelines