GroupBy with CollectSet followed by a Python transform returns error

Description

The steps to reproduce the issue:

  • Create a Source with a CSV file, for example:

The schema is name: string and value: string. 

  • Add a GroupBy transform that groups by name and applies CollectSet to the value column.

  • Add an empty Python transform.

  • Add a Sink (a Trash sink is enough for testing).

When the pipeline is run, it fails with a cast exception.

 

With CollectList it works correctly.

 

A sample pipeline that reads from GCS is attached (real paths and project info removed).

 

 

 

Release Notes

Fixed an issue in the Python transform that caused it to fail on certain types of array inputs.

Attachments

1

Activity

Albert Shau December 18, 2020 at 12:19 AM

Albert Shau December 17, 2020 at 10:23 PM

Albert Shau February 26, 2020 at 6:37 PM

The bug is at https://github.com/data-integrations/python-plugins/blob/develop/src/main/java/io/cdap/plugin/python/transform/PythonObjectsEncoder.java#L55.

The encoder should not assume that the underlying object is a List. A value of 'array' type can be a Java array or any Java Collection.
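To illustrate the point about array values, here is a minimal sketch of how an encoder could normalize such a value before handing it to Python. It is not the plugin's actual code: the class name ArrayValues and the method toList are hypothetical, and it assumes (without confirmation from this ticket) that CollectSet produces a Set-like collection while CollectList produces a List, which would explain why only the former triggers the cast exception.

```java
import java.lang.reflect.Array;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper for illustration only; not part of python-plugins.
final class ArrayValues {
  private ArrayValues() { }

  // Copies an 'array'-typed field value into a List without assuming
  // which concrete Java type backs it.
  static List<Object> toList(Object value) {
    if (value == null) {
      return null;
    }
    if (value instanceof Collection) {
      // Covers List, Set, and any other Collection implementation.
      return new ArrayList<>((Collection<?>) value);
    }
    if (value.getClass().isArray()) {
      // java.lang.reflect.Array works for Object[] and primitive arrays alike.
      int length = Array.getLength(value);
      List<Object> result = new ArrayList<>(length);
      for (int i = 0; i < length; i++) {
        result.add(Array.get(value, i));
      }
      return result;
    }
    throw new IllegalArgumentException(
        "Unsupported value for an array field: " + value.getClass().getName());
  }
}
```

With a helper along these lines, a direct cast such as (List<?>) value would be replaced by ArrayValues.toList(value), so a Set, a List, and a Java array would all take the same path.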

Fixed

Details

Created February 26, 2020 at 3:58 PM
Updated January 13, 2021 at 1:11 AM
Resolved December 18, 2020 at 12:25 AM