Introduction

CDAP provides a Python Evaluator transform that allows user-provided Python code to be executed in a transform plugin. The current solution uses Jython 2.5.x and does not support the use of standard or native libraries.

Use case(s)

  • The Python Evaluator transform should allow standard Python libraries to be used in user-provided code.
  • The Python Evaluator transform should allow native third-party libraries to be used (numpy, scipy, and other scientific libraries).
  • Python 3 should be supported.
  • Publish a benchmark of running transformations on a large dataset (~10 million records).
  • Code for the Python transform should be separated out into a repo in the data-integrations org.

Deliverables 

  • Source code in data integrations org
  • Performance test on a large dataset
  • Integration test code 
  • Relevant documentation in the source repo and reference documentation section in plugin

Relevant links 

Plugin Type

  •  Batch Source
  •  Batch Sink 
  •  Real-time Source
  •  Real-time Sink
  •  Transform
  •  Action
  •  Post-Run Action
  •  Aggregate
  •  Join
  •  Spark Model
  •  Spark Compute

Configurables

This section defines properties that are configurable for this plugin. 

| User Facing Name          | Type   | Description | Constraints |
|---------------------------|--------|-------------|-------------|
| PYTHONPATH                | String | PYTHONPATH to libraries that the user may want to use. Empty by default and not required; Python still has its default PYTHONPATH. | |
| Path to Python binary     | String | Path to the Python binary. The user can provide a Python binary of any version; both python2 and python3 will work. This field is required. Default is empty. | |
| Timeout of Python process | Long   | The Java process has to block and wait for the Python process to finish, so some kind of timeout is needed to avoid blocking the app. Default: 600 seconds. | |
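To make the meaning of the three settings concrete, here is a minimal sketch in Python. It is only an illustration of the intended semantics: the plugin itself would do the equivalent from Java, and the helper name `run_python` and its parameters are hypothetical, not part of the plugin.

```python
import os
import subprocess
import sys


def run_python(binary, script, pythonpath="", timeout_seconds=600):
    """Hypothetical helper: run `script` with the interpreter at `binary`."""
    env = dict(os.environ)
    if pythonpath:
        # Only override PYTHONPATH when the user supplied one; otherwise
        # the interpreter keeps its default module search path.
        env["PYTHONPATH"] = pythonpath
    # subprocess.run kills the child and raises subprocess.TimeoutExpired
    # if the child has not finished within `timeout_seconds`.
    result = subprocess.run(
        [binary, "-c", script],
        env=env,
        timeout=timeout_seconds,
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()
```

Calling `run_python(sys.executable, "print('hello')")` runs the snippet with the current interpreter; passing a different path (for example /usr/bin/python3) corresponds to the "Path to Python binary" setting.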



Design

Previously we used Jython, which emulates how CPython works. However, it cannot load libraries written in C, which includes some standard Python libraries and most scientific libraries (numpy and scipy are largely written in C) that are very useful for data analysis. Also, Jython has not been released for Python 3.

So the solution is to use a real CPython process and the Python library Py4J (some discussion and explanation of why we chose this library can be found in the comments). This library allows a Python process to communicate with the JVM via plain sockets to exchange data. It also converts class hierarchies and primitive types between Java and Python.

Here's how the process goes:

  1. The transformation starts with some Python code provided by the user.
  2. Java code starts a Py4J gateway to communicate with the Python process over plain sockets.
  3. Java code calls "join()" on the Py4J gateway thread and waits for it to shut down (this happens before the Python process exits).
    a) If it does not finish within the configured timeout (600 seconds by default), the pipeline fails due to a timeout.
  4. We run the code below using plain CPython (e.g. /usr/bin/python):

    Code Block
    from py4j.java_gateway import JavaGateway
    gateway = JavaGateway()                   # wait and connect to the JVM
    
    ###### here we insert the code from UI. It is an actual transform function #######
    # def transform(record, emitter, context):
    #     # transformations the user has written
    #     emitter.emit(record)
    ##################################################################################
    
    
    # app is an instance of a special Java class that contains the list
    # of records and some other information needed to run the transform
    app = gateway.entry_point
    
    
    for entity in app.entities:
      transform(entity, app.emitter, app.context)
    
    
    gateway.shutdown()

    This code includes the section that the user provided from the UI, as well as some extra code to connect to the JVM via Py4J and to run the transform code.

  5. The data changed by the transform is sent to the JVM by calling emitter.emit(input).
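The user-provided part of the steps above can be tried out locally without a JVM. The sketch below is an assumption-laden illustration: `ListEmitter` and `run_locally` are hypothetical stand-ins for the Java emitter and the driver loop, and records are assumed to behave like dicts of field name to value, which the real record objects may not.

```python
class ListEmitter:
    """Hypothetical stand-in for the Java emitter: collects emitted records."""

    def __init__(self):
        self.records = []

    def emit(self, record):
        self.records.append(record)


def transform(record, emitter, context):
    # Example user code. Assumption: a record behaves like a dict.
    name = (record.get("name") or "").strip()
    if not name:
        return  # drop records that have no usable name
    record["name"] = name.upper()
    emitter.emit(record)


def run_locally(entities, context=None):
    """Mimic the driver loop: call transform once per input record."""
    emitter = ListEmitter()
    for entity in entities:
        transform(entity, emitter, context)
    return emitter.records
```

For example, `run_locally([{"name": " alice "}, {"name": ""}])` keeps only the first record, with its name normalized. A record that is never passed to `emitter.emit` is simply not sent onward.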

Limitation(s)

  • Users will have to install the Py4J library themselves (for either python2 or python3). This can be done by running "pip install py4j" or "pip3 install py4j".
  • Users will have to install Python themselves.


Unfortunately, there is no easy way to install the py4j library automatically along with the transform plugin. We have no knowledge of which version of Python the user is going to use (python2 or python3). Also, the plugin is a single jar which does not include any additional files. For users who install CDAP from CM, Ambari, or a Docker image, we could however pre-install the py4j library.
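Because the library has to be installed by hand, it can help to check that the chosen interpreter actually sees it before running a pipeline. A minimal sketch follows; the helper name `is_installed` is hypothetical, and `importlib.util.find_spec` is available on Python 3 only (on Python 2 a plain `try: import py4j` would serve the same purpose).

```python
import importlib.util


def is_installed(module_name):
    """Return True if the running interpreter can find a top-level module."""
    return importlib.util.find_spec(module_name) is not None
```

Running this with the same binary that is configured in the plugin and calling `is_installed("py4j")` tells you whether "pip install py4j" was applied to that interpreter rather than to a different Python on the machine.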

Justification of using Py4J

We had to choose among the three most popular solutions that enable executing Python code within a Java application. Here is a comparison table.



|  | Py4J | Jep | Jpy |
|---|---|---|---|
| License | BSD | zlib/libpng | Apache |
| Mechanism of work | CPython process and Java thread transfer data using sockets | JNI (via *.so library which has access to Python) | JNI (via *.so library which has access to Python) |
| Anticipated performance | Normal | High (due to low-level mechanism) | High+ (due to low-level mechanism). Project states it was specifically designed to meet high performance goals |
| Needs architecture-dependent files (*.so) | - | + | + |
| Importing native libraries | Any native libraries | Can read almost all, including numpy, scipy | Can read almost all, including numpy, scipy |
| Stability | Very high | High | High |
| Ability to choose Python version to use | + | - | - |
| Installation/Running | Simple | Simple | Slightly advanced (building from sources and setting paths) |


Needs architecture-dependent files (*.so)

If a library needs a *.so file, things get a bit more complicated. We would need to ask the user to install the library with a simple command before using the transform in a pipeline, because these *.so files depend not only on the processor architecture but also on the specific version of Python they are used with. I will create another comment to discuss this, because there are other points to make.
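The coupling between compiled extension files and a specific interpreter can be seen from Python itself. The sketch below is illustrative only (the function name `extension_build_info` is hypothetical); it reports the file suffix that the running interpreter expects for compiled extension modules. On CPython 3 for Linux this suffix typically embeds the interpreter version and platform, for example `.cpython-311-x86_64-linux-gnu.so`.

```python
import platform
import sysconfig


def extension_build_info():
    """Describe what a compiled extension must match for this interpreter."""
    return {
        "python_version": platform.python_version(),
        "machine": platform.machine(),
        # Suffix the interpreter looks for when importing compiled modules.
        "ext_suffix": sysconfig.get_config_var("EXT_SUFFIX"),
    }
```

Running this under two different Python installations on the same machine will usually show different suffixes, which is why a binary built for one of them cannot simply be reused by the other.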

Importing native libraries / Stability

Py4J uses a real Python process, so there are no problems here; everything will work. Jep and Jpy use real CPython as well, but they call it via *.so libraries, which is not the usual way, and a presumably small number of libraries can have problems with this approach. From the Jep docs:

works with most numpy, scipy, pandas, tensorflow, matplotlib, cvxpy

Jep should work with any pure Python modules. CPython extensions and Cython modules may or may not work correctly

Jep doesn't work correctly with some CPython extensions due to how those extensions were coded. Oftentimes projects were not designed towards the idea of running in embedded sub-interpreters instead of in a standard, single Python interpreter. When running Jep, these types of problems have been known to appear as a JVM crash, an attribute inexplicably set to None, or a thread deadlock.

The developers claim that in the latest version the JVM crash should no longer appear with such libraries, provided another method of loading the module is used.

Ability to choose python version to use

Py4J allows the user to choose any Python they have installed. Organizations might have multiple different versions, which is normal. Jep and Jpy install a *.so library that is compiled for and bound to a specific Python version, so after installation the user cannot choose which Python to use.
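Since the interpreter is chosen by path, its version can be checked up front by asking the binary itself. This is a small sketch under the assumption that such a check is wanted; `python_major_version` is a hypothetical helper, not part of the plugin.

```python
import subprocess
import sys


def python_major_version(binary):
    """Ask the interpreter at `binary` for its major version (2 or 3)."""
    out = subprocess.check_output(
        [binary, "-c", "import sys; print(sys.version_info[0])"]
    )
    return int(out.decode().strip())
```

For instance, `python_major_version(sys.executable)` reports the version of the interpreter running the check, while passing the path configured in the plugin reports the version that the transform would actually run under.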

Conclusion

As we can see, Py4J wins in most categories, so with it we can provide a much better user experience: users can choose the Python version from the UI, install the library with ease (since no *.so is needed), and get a much more stable library (one that won't crash the JVM). Py4J is also already used in big projects like Spark, which suggests that it should be stable. Its only drawback is that performance is expected to be somewhat lower than that of the other two libraries. However, since the transform is not the performance bottleneck for pipelines, this should not be a problem for us.




Checklist

  •  User stories documented 
  •  User stories reviewed 
  •  Design documented 
  •  Design reviewed 
  •  Feature merged 
  •  Examples and guides 
  •  Integration tests 
  •  Documentation for feature 
  •  Short video demonstrating the feature