Deduplicate: Preview succeeds, but the deployed pipeline fails with an intermittent error after 9 minutes when using the Deduplicate analytics plugin with a BigQuery source and a BigQuery sink (BQ source -> Deduplicate -> BQ sink).

Description

After filling in the mandatory fields of the Deduplicate analytics plugin and configuring a pipeline from a BigQuery source to Deduplicate and then to a BigQuery sink, the pipeline fails with an intermittent error after 9 minutes.

Steps to reproduce:

  1. Launch the latest CDAP (6.8.0)

  2. Add a BigQuery source, the Deduplicate plugin, and a BigQuery sink, then connect the source to the Deduplicate plugin and the plugin to the sink. Fill in the properties and click Validate.

  3. Preview the pipeline. The preview succeeds and records are shown in the preview data.

  4. Deploy the pipeline. It remains in the Running status for more than 9 minutes, then fails with an intermittent error.

  5. The same pipeline works in the earlier version, 6.7.0, where it succeeds.

  6. A screen recording of the intermittent error and the JSON that was used are attached.

Expected: The pipeline should succeed, and a table should be created in BigQuery containing the records deduplicated on the configured fields.
Actual: The pipeline fails with an intermittent error after 9 minutes.
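
To make the expected result concrete, here is a minimal sketch of deduplicating records on a set of fields. It is an illustration only, not the plugin's implementation: the `deduplicate` function, the `id` and `name` fields, and the keep-the-first-record rule are all assumptions made for this example, and the actual plugin may select which record to keep differently depending on its configuration.

```python
from typing import Dict, Iterable, List, Sequence

Record = Dict[str, object]


def deduplicate(records: Iterable[Record], unique_fields: Sequence[str]) -> List[Record]:
    """Keep the first record seen for each distinct combination of unique_fields.

    Hypothetical helper for illustration; values of unique_fields must be hashable.
    """
    seen = set()
    result: List[Record] = []
    for record in records:
        # Build the deduplication key from the chosen fields only.
        key = tuple(record[field] for field in unique_fields)
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


# Hypothetical input rows; "id" stands in for the configured unique field.
rows = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "b"},
    {"id": 2, "name": "c"},
]
unique_rows = deduplicate(rows, ["id"])
```

Under these assumptions, the sink table would hold one row per distinct `id`, which is the property to check once the pipeline run succeeds.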

Release Notes

None

Attachments

15
  • 02 Jun 2022, 05:24 PM
  • 16 May 2022, 03:52 PM
  • 16 May 2022, 03:38 PM
  • 16 May 2022, 03:38 PM
  • 16 May 2022, 03:38 PM
  • 11 May 2022, 07:54 AM
  • 11 May 2022, 07:53 AM
  • 11 May 2022, 06:08 AM
  • 29 Apr 2022, 07:14 AM
  • 28 Apr 2022, 07:42 AM
  • 27 Apr 2022, 08:12 PM
  • 21 Apr 2022, 01:38 PM

Activity

Vitalii Tymchyshyn June 8, 2022 at 10:22 PM

This looks like a regression that should be fixed in 6.8.0. We should not change the memory settings, as this is not what users will do out of the box. The regression needs a proper investigation and fix.

Sanjana Sandeep June 7, 2022 at 10:23 PM

Could you please confirm whether the solution to the problem (see my comment below) is to increase the JVM memory as suggested further below (JAVA_OPTS="-Xmx12G" ./bin/cdap sandbox start)? This problem occurs only in the local sandbox.

Sanjana Sandeep June 3, 2022 at 5:57 PM

To summarize the problem: when we use analytics plugins (Group By, Deduplicate, Distinct) together with a GCS or BigQuery sink, we see performance issues (probably in Spark execution) when running on the local sandbox.

Surya June 3, 2022 at 1:43 PM

Pipelines with the performance/memory issue:

  1. BigQuery-Distinct-GCS
  2. GCS-Distinct-BigQuery
  3. File-Distinct-GCS
  4. BigQuery-Distinct-BigQuery

Pipelines with no performance/memory issue:

  1. GCS-Distinct-File
  2. File-Joiner-Distinct/Deduplicate-Trash
  3. BigQuery-Distinct-File
  4. File-Distinct-Trash
  5. BigQuery-Distinct-Trash

All of these were run on the local sandbox.
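
The runs listed above can be tabulated to see which stage separates the slow pipelines from the healthy ones. The sketch below only encodes the source-plugin-sink combinations from this comment; the variable and function names are made up for illustration, and it says nothing about the root cause.

```python
# (source, analytics stage, sink) combinations reported on the local sandbox.
WITH_ISSUE = [
    ("BigQuery", "Distinct", "GCS"),
    ("GCS", "Distinct", "BigQuery"),
    ("File", "Distinct", "GCS"),
    ("BigQuery", "Distinct", "BigQuery"),
]
WITHOUT_ISSUE = [
    ("GCS", "Distinct", "File"),
    ("File", "Joiner-Distinct/Deduplicate", "Trash"),
    ("BigQuery", "Distinct", "File"),
    ("File", "Distinct", "Trash"),
    ("BigQuery", "Distinct", "Trash"),
]


def sources(pipelines):
    """Set of source types used by the given pipelines."""
    return {source for source, _, _ in pipelines}


def sinks(pipelines):
    """Set of sink types used by the given pipelines."""
    return {sink for _, _, sink in pipelines}


# Sinks seen only in slow runs vs. only in healthy runs.
slow_sinks = sinks(WITH_ISSUE)
healthy_sinks = sinks(WITHOUT_ISSUE)
sinks_in_both = slow_sinks & healthy_sinks

# Sources that appear in both groups cannot explain the difference on their own.
sources_in_both = sources(WITH_ISSUE) & sources(WITHOUT_ISSUE)

print("sinks with issue:   ", sorted(slow_sinks))
print("sinks without issue:", sorted(healthy_sinks))
print("sinks in both:      ", sorted(sinks_in_both))
print("sources in both:    ", sorted(sources_in_both))
```

In this data every source type appears in both groups, while the sink types do not overlap: the affected runs all write to GCS or BigQuery, and the unaffected runs all write to File or Trash.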

Swati Dubey June 3, 2022 at 1:29 PM

The GroupBy analytics plugin has the same performance issue as the Deduplicate and Distinct plugins. I tested GroupBy with GCS as the source and File as the sink: the pipeline succeeded in 30 seconds, whereas previously it took 20 minutes to succeed with BQ.

Details

Assignee

Reporter

Priority

Created April 13, 2022 at 10:48 AM
Updated July 12, 2022 at 8:44 PM