Introduction
The Run plugin allows a user to run any executable binary that is installed and available on all Hadoop nodes. The user code processes the input record and returns an output record to be processed further downstream in the pipeline.
Use-Case
Enterprises often have existing tools or systems that perform complex transformations of data. These tools are time-tested and have been running in production for a long time. As more and more processing moves to Hadoop, users would like to transition gradually to running on Hadoop. In this case, they would like to run the tools as is or with minor modifications. They have the tool or binary installed on all Hadoop nodes, and they would like to pass each record being processed into the tool and retrieve the results back into the pipeline.
User Stories
- User should be able to specify the full path to the binary or just the binary name
- User should be able to specify the arguments for the binary to be executed
- User should be provided a specification of how the record is passed to the binary (needs to be designed)
- If the binary executable doesn’t exist, is not in the path, or is not executable, the user should be notified appropriately at runtime
- User is able to see the errors in the log if the executable writes errors to STDERR
- User executable is able to read the record from STDIN
- User executable is able to write the record to STDOUT
- User executable is able to write the error records to a different file descriptor
- User should be able to specify the error dataset to which data returned from the executable's special file descriptor is written
- User will make sure the binary and its dependencies are available on all machines of the cluster; no capability needs to be added to the plugin for marshaling the executable
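To make the STDIN/STDOUT/STDERR stories concrete, here is a minimal sketch of what a user executable could look like. The record format is still to be designed, so this assumes one comma-separated record per line. The `transform` rule, the function names, and the use of file descriptor 3 for error records are all hypothetical and not part of the plugin specification.

```python
import sys


def transform(record: str) -> str:
    """Hypothetical transformation: upper-case every comma-separated field."""
    fields = record.split(",")
    if any(field == "" for field in fields):
        raise ValueError("empty field in record: " + record)
    return ",".join(field.upper() for field in fields)


def run(records_in, records_out, errors_out, log_out) -> None:
    """Read one record per line, write results, and route bad records elsewhere."""
    for line in records_in:
        record = line.rstrip("\n")
        if not record:
            continue
        try:
            records_out.write(transform(record) + "\n")
        except ValueError as exc:
            log_out.write(f"error: {exc}\n")   # diagnostic text, meant for the log
            errors_out.write(record + "\n")    # the rejected record itself


def main() -> None:
    # Assumption: the caller has opened file descriptor 3 for error records.
    with open(3, "w") as error_records:
        run(sys.stdin, sys.stdout, error_records, sys.stderr)

# A real script would end by calling main().
```

Keeping the record loop in `run` separate from the descriptor wiring in `main` lets the same logic be exercised with in-memory streams before the binary is installed on the cluster.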
Conditions
- The binary and its dependencies must be available on all machines of the cluster prior to execution of the binary.
- Arguments must be in the proper sequence and in a supported format. Any mismatch in the sequence of the arguments will result in failure of execution.
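The availability condition, together with the user story about notifying the user when the binary is missing or not executable, could be checked roughly as follows before the first record is processed. This is a sketch only: `resolve_binary` and `BinaryNotRunnableError` are hypothetical names, and the actual plugin may detect and report these conditions differently.

```python
import os
import shutil


class BinaryNotRunnableError(Exception):
    """Hypothetical error type raised when the configured binary cannot be run."""


def resolve_binary(binary_name: str) -> str:
    """Return a runnable path for binary_name, or raise with a specific reason."""
    if os.path.dirname(binary_name):
        # A path was given: check it directly.
        if not os.path.isfile(binary_name):
            raise BinaryNotRunnableError(f"binary does not exist: {binary_name}")
        if not os.access(binary_name, os.X_OK):
            raise BinaryNotRunnableError(f"binary is not executable: {binary_name}")
        return binary_name
    # Only a name was given: search the directories listed in PATH.
    found = shutil.which(binary_name)
    if found is None:
        raise BinaryNotRunnableError(f"binary not found on PATH: {binary_name}")
    return found
```

A .jar started through `java -jar` does not itself need the executable bit, so for that type the check would more sensibly be applied to the `java` launcher and only the existence check to the jar file.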
Design
For both design approaches, the following assumptions/considerations apply:
- In the following cases:
a) Binary with no arguments.
b) Binary that takes a file or DB name (not STDIN or the Hydrator source stage)
the user is not restricted from executing the binary. However, the Hydrator source stage will not be utilized, and the binary will run repeatedly, once for each record coming through the source stage.
- The types of binary executable supported by the plugin are:
a) .bat
b) .exe
c) .sh
d) .jar
A dropdown will be provided for the user to select the executable type.
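The once-per-record execution model described above can be sketched as follows, assuming each record is handed to the binary as one line on STDIN. The function name `run_per_record`, the timeout, and the line-based record format are assumptions made for illustration. The plugin itself would more likely build on the Java `ExternalProgramExecutor` referenced under Implementation Tips.

```python
import subprocess
from typing import Iterable, Iterator, List, Tuple


def run_per_record(
    command: List[str],
    records: Iterable[str],
    timeout_seconds: float = 30.0,
) -> Iterator[Tuple[str, str, int]]:
    """Start the binary once per record; yield (stdout, stderr, exit code)."""
    for record in records:
        completed = subprocess.run(
            command,
            input=record + "\n",      # the record is written to the binary's STDIN
            capture_output=True,      # collect STDOUT and STDERR separately
            text=True,
            timeout=timeout_seconds,
        )
        yield completed.stdout.rstrip("\n"), completed.stderr, completed.returncode
```

Starting a new process for every record is simple but pays the process start-up cost each time, which is worth keeping in mind for binaries with a slow start such as a JVM.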
Design Approach 1:
- User can pass the input to the binary either through "Hydrator Source Stage" or "STDIN". A dropdown will be provided to select between the options.
- Arguments for the binary will be passed in this format. For example,
java -jar <Example.jar> <runtime/variable field1> <runtime/variable field2>..... <Other fixed/static named & unnamed arguments>
Note: All the static/fixed named or unnamed arguments will be passed after the actual input arguments coming through Hydrator Source or STDIN.
- Arguments should be in sequence. Any mismatch in the sequence of the arguments will result in failure of execution.
Run Plugin Properties:
- binaryType : Type of the binary to be executed. For example, .jar or .bat or .sh or .exe.
- binaryName : Full path to the binary to be executed.
- inputSource : Source to read the data for runtime/variable input arguments for binary to be executed. For example, Hydrator Source Stage or STDIN
- variableArguments : A comma-separated sequence of the fields which will be used as the input source for the runtime/variable arguments. For example, firstname, or firstname,lastname in case of multiple arguments. Please make sure that the arguments are in the proper sequence.
- fixedArguments : A space-separated sequence of the fixed input arguments that will be passed to the binary to be executed. Please make sure that the arguments are in the proper sequence. All the fixed input arguments follow the runtime/variable input arguments.
- outputFields : A comma-separated sequence of field name and its type which will be used to store the final output.
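As an illustration of the outputFields property, the sketch below parses a comma-separated list of name and type pairs. It assumes each pair is written as `name:type`, as in an entry such as `FinalOutput:string`. The function name and the validation rules are hypothetical.

```python
from typing import List, Tuple


def parse_output_fields(spec: str) -> List[Tuple[str, str]]:
    """Split 'name:type,name:type' into (name, type) pairs, keeping their order."""
    fields = []
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, separator, type_name = entry.partition(":")
        if not separator or not name.strip() or not type_name.strip():
            raise ValueError(f"expected name:type, got {entry!r}")
        fields.append((name.strip(), type_name.strip().lower()))
    return fields
```

The list keeps the declared order so that values returned by the binary can be matched to output fields positionally.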
Run Input Json Format:
{
  "name": "Run",
  "type": "transform",
  "properties": {
    "binaryType": "jar",
    "binaryName": "/opt/cdap/Runner.jar",
    "inputSource": "HydratorSource",
    "variableArguments": "firstname,lastname",
    "fixedArguments": "256 1024 -Dcheckstyle=true"
  }
}
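Under Design Approach 1, the command line for one record could be assembled from these properties roughly as follows. This is a sketch with several assumptions: the field list is read from a property named `variableArguments`, only the .jar type needs a launcher (`java -jar`, as in the example format above), and the other types are started directly. `build_command` and `LAUNCHERS` are hypothetical names.

```python
import shlex
from typing import Dict, List

# Assumption: a .jar is started through "java -jar"; other types run directly.
LAUNCHERS = {"jar": ["java", "-jar"]}


def build_command(properties: Dict[str, str], record: Dict[str, str]) -> List[str]:
    """Order: launcher, binary, runtime/variable arguments, then fixed arguments."""
    binary_type = properties["binaryType"].lstrip(".").lower()
    command = LAUNCHERS.get(binary_type, []) + [properties["binaryName"]]

    # Runtime/variable arguments come from the record, in the configured order.
    field_names = [
        name.strip()
        for name in properties.get("variableArguments", "").split(",")
        if name.strip()
    ]
    command += [str(record[name]) for name in field_names]

    # Fixed arguments are space-separated and follow the variable ones.
    command += shlex.split(properties.get("fixedArguments", ""))
    return command
```

With the properties above and a record of `{"firstname": "John", "lastname": "Doe"}`, this yields `java -jar /opt/cdap/Runner.jar John Doe 256 1024 -Dcheckstyle=true` as an argument list. Passing a list rather than a single shell string avoids quoting problems when a field value contains spaces.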
Design Approach 2:
In this approach, the user has to provide the complete sequence of arguments in one text box. To distinguish between static and runtime arguments, runtime arguments will be enclosed in the placeholder ${variable-name}. For example, ${fname} ${lname} 256 1024 -Dcheckstyle=true
The values for the placeholders could come from the fields of the Hydrator source stage or from STDIN.
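One possible way to expand the placeholders is sketched below. The argument string is split into tokens first and the placeholders are substituted afterwards, so a value that contains spaces still ends up as a single argument. The function name `expand_arguments` and the error raised for an unknown placeholder are assumptions, not part of the design.

```python
import re
import shlex
from typing import Dict, List

PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def expand_arguments(input_arguments: str, values: Dict[str, str]) -> List[str]:
    """Tokenize the argument string, then replace ${name} inside each token."""

    def lookup(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder ${{{name}}}")
        return values[name]

    return [PLACEHOLDER.sub(lookup, token) for token in shlex.split(input_arguments)]
```

For example, expanding `${fname} ${lname} 256 1024 -Dcheckstyle=true` with `fname` set to `Mary Ann` and `lname` set to `Doe` gives five arguments, the first of which is `Mary Ann`.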
Run Plugin Properties:
- binaryName : The name of the binary, or the full path to the binary, to be executed.
- binaryType : Type of the binary to be executed. For example, .jar or .bat or .sh or .exe. Default is .jar.
- inputArguments: Sequence of the arguments for the binary to be executed. Variable arguments are passed inside the '${}' placeholder. For example: ${fname} ${lname} 256 1024 -Dcheckstyle=true
- sourceForVariableArguments: A comma-separated sequence of the placeholders, each with the source for its input. The source for a variable argument or placeholder can be a field from the input record or STDIN. For example: ${fname}:Firstname,${lname}:STDIN
Run Plugin Json Format:
{
  "name": "Run",
  "type": "transform",
  "properties": {
    "binaryName": "Runner",
    "binaryType": ".jar",
    "inputArguments": "${fname} ${lname} 256 1024 -Dcheckstyle=true",
    "sourceForVariableArguments": "${fname}:Firstname,${lname}:STDIN",
    "outputFields": "FinalOutput:string"
  }
}
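The sourceForVariableArguments value pairs each placeholder with its source. A minimal sketch of parsing that mapping and resolving it against one input record is shown below. How a value actually arrives through STDIN is not specified in this design, so the sketch simply assumes a single string has already been read from it. `parse_sources` and `resolve_values` are hypothetical names.

```python
import re
from typing import Dict

ENTRY = re.compile(r"^\$\{([^}]+)\}:(.+)$")


def parse_sources(spec: str) -> Dict[str, str]:
    """Turn '${a}:FieldA,${b}:STDIN' into {'a': 'FieldA', 'b': 'STDIN'}."""
    sources = {}
    for entry in spec.split(","):
        match = ENTRY.match(entry.strip())
        if match is None:
            raise ValueError(f"expected ${{name}}:source, got {entry!r}")
        sources[match.group(1)] = match.group(2).strip()
    return sources


def resolve_values(
    sources: Dict[str, str], record: Dict[str, str], stdin_value: str
) -> Dict[str, str]:
    """Look up each placeholder in the record, or use the STDIN value."""
    values = {}
    for name, source in sources.items():
        # Assumption: the literal source name "STDIN" selects the STDIN value.
        values[name] = stdin_value if source == "STDIN" else str(record[source])
    return values
```

The resulting dictionary maps placeholder names to values and can then be substituted into the inputArguments string.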
Note: More details will be added based on the findings.
Implementation Tips
- Please reuse and/or modify ExternalProgramExecutor (https://github.com/caskdata/cdap-apps/blob/develop/TwitterSentiment/src/main/java/co/cask/cdap/apps/flowlet/ExternalProgramExecutor.java)
Checklist
- User stories documented
- User stories reviewed
- Design documented
- Design reviewed
- Feature merged
- Examples and guides
- Integration tests
- Documentation for feature
- Short video demonstrating the feature