...

  • run(): Used to implement the functionality of the plugin.

  • configurePipeline(): Used to perform any validation on the application configuration required by this plugin, or to create any streams or datasets if the fieldName for a stream or dataset is not a macro.

Example:

...

  • run(): Used to implement the functionality of the plugin.

  • configurePipeline(): Used to perform any validation on the application configuration required by this plugin, or to create any streams or datasets if the fieldName for a stream or dataset is not a macro.

Example:

...

  • prepareRun(): Used to configure the input for each run of the pipeline. If the fieldName for a stream or dataset is a macro, the stream or dataset is created during this stage. This is called by the client that will submit the job for the pipeline run.

  • onRunFinish(): Used to run any required logic at the end of a pipeline run. This is called by the client that submitted the job for the pipeline run.

  • configurePipeline(): Used to perform any validation on the application configuration required by this plugin, or to create any streams or datasets if the fieldName for a stream or dataset is not a macro.

  • initialize(): Initialize the Batch Source. Guaranteed to be executed before any call to the plugin’s transform method. This is called by each executor of the job. For example, if the MapReduce engine is being used, each mapper will call this method.

  • destroy(): Destroy any resources created by initialize. Guaranteed to be executed after all calls to the plugin’s transform method have been made. This is called by each executor of the job. For example, if the MapReduce engine is being used, each mapper will call this method.

  • transform(): This method will be called for every input key-value pair generated by the batch job. By default, the value is emitted to the subsequent stage.
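
To make these lifecycle methods concrete, here is a minimal sketch of a batch source that turns each input line into a record. It assumes the io.cdap.cdap.etl.api.batch signatures (older CDAP releases use co.cask.cdap packages instead); the LineSource name and its "body" field are illustrative, not from the original text:

import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.api.data.schema.Schema;
import io.cdap.cdap.api.dataset.lib.KeyValue;
import io.cdap.cdap.etl.api.Emitter;
import io.cdap.cdap.etl.api.PipelineConfigurer;
import io.cdap.cdap.etl.api.batch.BatchSource;
import io.cdap.cdap.etl.api.batch.BatchSourceContext;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Sketch: a batch source that emits one record per input line.
@Plugin(type = BatchSource.PLUGIN_TYPE)
@Name("LineSource")
public class LineSource extends BatchSource<LongWritable, Text, StructuredRecord> {
  private static final Schema SCHEMA = Schema.recordOf(
      "line", Schema.Field.of("body", Schema.of(Schema.Type.STRING)));

  @Override
  public void configurePipeline(PipelineConfigurer configurer) {
    // Deployment-time validation; declare the output schema for later stages.
    configurer.getStageConfigurer().setOutputSchema(SCHEMA);
  }

  @Override
  public void prepareRun(BatchSourceContext context) {
    // Called by the client submitting the job: configure the input for this
    // run here, for example with context.setInput(...).
  }

  @Override
  public void transform(KeyValue<LongWritable, Text> input, Emitter<StructuredRecord> emitter) {
    // Called once per key-value pair produced by the batch job.
    emitter.emit(StructuredRecord.builder(SCHEMA)
        .set("body", input.getValue().toString())
        .build());
  }
}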

...

  • prepareRun(): Used to configure the output for each run of the pipeline. This is called by the client that will submit the job for the pipeline run.

  • onRunFinish(): Used to run any required logic at the end of a pipeline run. This is called by the client that submitted the job for the pipeline run.

  • configurePipeline(): Used to perform any validation on the application configuration required by this plugin, or to create any streams or datasets if the fieldName for a stream or dataset is not a macro.

  • initialize(): Initialize the Batch Sink. Guaranteed to be executed before any call to the plugin’s transform method. This is called by each executor of the job. For example, if the MapReduce engine is being used, each mapper will call this method.

  • destroy(): Destroy any resources created by initialize. Guaranteed to be executed after all calls to the plugin’s transform method have been made. This is called by each executor of the job. For example, if the MapReduce engine is being used, each mapper will call this method.

  • transform(): This method will be called for every object that is received from the previous stage. The logic inside the method will transform the object to the key-value pair expected by the Batch Sink's output format. If you don't override this method, the incoming object is set as the key and the value is set to null.
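
As a counterpart, here is a minimal sketch of a batch sink that maps each incoming record to the key-value pair its output format expects. The io.cdap.cdap packages are again assumed, and TextSink and the "body" field are illustrative:

import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.api.dataset.lib.KeyValue;
import io.cdap.cdap.etl.api.Emitter;
import io.cdap.cdap.etl.api.batch.BatchSink;
import io.cdap.cdap.etl.api.batch.BatchSinkContext;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Sketch: a batch sink that writes each record's "body" field as text.
@Plugin(type = BatchSink.PLUGIN_TYPE)
@Name("TextSink")
public class TextSink extends BatchSink<StructuredRecord, Text, NullWritable> {

  @Override
  public void prepareRun(BatchSinkContext context) {
    // Called by the client submitting the job: configure the output for this
    // run here, for example with context.addOutput(...).
  }

  @Override
  public void transform(StructuredRecord input, Emitter<KeyValue<Text, NullWritable>> emitter) {
    // Map the incoming object to the key-value pair the output format expects.
    emitter.emit(new KeyValue<>(new Text(String.valueOf(input.get("body"))), NullWritable.get()));
  }
}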

...

  • configurePipeline(): Used to perform any validation on the application configuration required by this plugin, or to create any streams or datasets if the fieldName for a stream or dataset is not a macro.

  • initialize(): Initialize the Batch Aggregator. Guaranteed to be executed before any call to the plugin’s groupBy or aggregate methods. This is called by each executor of the job. For example, if the MapReduce engine is being used, each mapper will call this method.

  • destroy(): Destroy any resources created by initialize. Guaranteed to be executed after all calls to the plugin’s groupBy or aggregate methods have been made. This is called by each executor of the job. For example, if the MapReduce engine is being used, each mapper will call this method.

  • groupBy(): This method will be called for every object that is received from the previous stage. This method returns zero or more group keys for each object it receives. Objects with the same group key will be grouped together for the aggregate method.

  • aggregate(): This method is called after every object has been assigned its group keys. It is called once for each group key emitted by the groupBy method, and receives the group key as well as an iterator over all objects that have that group key. Objects emitted in this method are the output for this stage.
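
A minimal sketch of the groupBy/aggregate pair, counting records per value of a hypothetical "word" field (io.cdap.cdap packages assumed):

import java.util.Iterator;

import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.api.data.schema.Schema;
import io.cdap.cdap.etl.api.Emitter;
import io.cdap.cdap.etl.api.batch.BatchAggregator;

// Sketch: groups records by a "word" field and emits one count per group.
@Plugin(type = BatchAggregator.PLUGIN_TYPE)
@Name("RecordCounter")
public class RecordCounter extends BatchAggregator<String, StructuredRecord, StructuredRecord> {
  private static final Schema OUTPUT = Schema.recordOf("count",
      Schema.Field.of("word", Schema.of(Schema.Type.STRING)),
      Schema.Field.of("count", Schema.of(Schema.Type.LONG)));

  @Override
  public void groupBy(StructuredRecord record, Emitter<String> emitter) {
    // Zero or more group keys may be emitted per record; here exactly one.
    emitter.emit(String.valueOf(record.get("word")));
  }

  @Override
  public void aggregate(String groupKey, Iterator<StructuredRecord> group,
                        Emitter<StructuredRecord> emitter) {
    // Called once per group key, with an iterator over that group's records.
    long count = 0;
    while (group.hasNext()) {
      group.next();
      count++;
    }
    emitter.emit(StructuredRecord.builder(OUTPUT).set("word", groupKey).set("count", count).build());
  }
}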

...

  • configurePipeline(): Used to create any streams or datasets, or perform any validation on the application configuration that is required by this plugin.

  • initialize(): Initialize the Batch Joiner. Guaranteed to be executed before any call to the plugin’s joinOn or merge methods. This is called by each executor of the job. For example, if the MapReduce engine is being used, each mapper will call this method.

  • prepareRun(): Prepares a pipeline run. This is called before every run of the pipeline to help set it up. Here you can set properties such as the number of partitions to use when joining and the join key class, if they are not known at compile time.

  • destroy(): Destroy any resources created by the initialize method. Guaranteed to be executed after all calls to the plugin’s joinOn or merge methods have been made. This is called by each executor of the job. For example, if the MapReduce engine is being used, each mapper will call this method.

  • joinOn(): This method will be called for every object that is received from the previous stage. This method returns a join key for each object it receives. Objects with the same join key will be grouped together for the merge method.

  • getJoinConfig(): This method will be called by the CDAP Pipeline to find out the type of join to be performed. The config specifies which input stages are requiredInputs. Records from a required input will always be present in the merge() method. Records from a non-required input will only be present in the merge() method if they meet the join criteria. In other words, if there are no required inputs, a full outer join is performed. If all inputs are required inputs, an inner join is performed.

  • merge(): This method is called after each object has been assigned a join key. The method receives a join key, an iterator over all objects with that join key, and the stage that emitted the object. Objects emitted by this method are the output for this stage.
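
A minimal sketch of a joiner keyed on a hypothetical "id" field. The io.cdap.cdap.etl.api locations of JoinConfig and JoinElement, and the stage names "customers" and "purchases", are assumptions; because both stages are listed as required inputs, this configures an inner join:

import java.util.Arrays;

import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.etl.api.Emitter;
import io.cdap.cdap.etl.api.JoinConfig;
import io.cdap.cdap.etl.api.JoinElement;
import io.cdap.cdap.etl.api.batch.BatchJoiner;
import io.cdap.cdap.etl.api.batch.BatchJoinerContext;

// Sketch: joins two stages on an "id" field; both inputs are required,
// so the pipeline performs an inner join.
@Plugin(type = BatchJoiner.PLUGIN_TYPE)
@Name("IdJoiner")
public class IdJoiner extends BatchJoiner<String, StructuredRecord, StructuredRecord> {

  @Override
  public void prepareRun(BatchJoinerContext context) {
    // Per-run setup, e.g. setting the number of partitions to use.
  }

  @Override
  public String joinOn(String stageName, StructuredRecord record) {
    // Records that return the same join key are grouped together for merge().
    return String.valueOf(record.get("id"));
  }

  @Override
  public JoinConfig getJoinConfig() {
    // No required inputs => full outer join; all inputs required => inner join.
    return new JoinConfig(Arrays.asList("customers", "purchases"));
  }

  @Override
  public void merge(String joinKey, Iterable<JoinElement<StructuredRecord>> joinRow,
                    Emitter<StructuredRecord> emitter) {
    // Each JoinElement carries a record plus the name of the stage that emitted it.
    for (JoinElement<StructuredRecord> element : joinRow) {
      emitter.emit(element.getInputRecord());
    }
  }
}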

...

  • configurePipeline(): Used to perform any validation on the application configuration required by this plugin, or to create any streams or datasets if the fieldName for a stream or dataset is not a macro.

  • transform(): This method is given a Spark RDD (Resilient Distributed Dataset) containing every object that is received from the previous stage. This method then performs Spark operations on the input to transform it into an output RDD that will be sent to the next stage.
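
A minimal sketch of a SparkCompute transform that filters the input RDD; the NonNullFilter name and "body" field are illustrative (io.cdap.cdap packages assumed):

import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.etl.api.batch.SparkCompute;
import io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext;
import org.apache.spark.api.java.JavaRDD;

// Sketch: drops records whose "body" field is null; the returned RDD
// becomes the input of the next stage.
@Plugin(type = SparkCompute.PLUGIN_TYPE)
@Name("NonNullFilter")
public class NonNullFilter extends SparkCompute<StructuredRecord, StructuredRecord> {

  @Override
  public JavaRDD<StructuredRecord> transform(SparkExecutionPluginContext context,
                                             JavaRDD<StructuredRecord> input) {
    // Arbitrary Spark operations are allowed here.
    return input.filter(record -> record.get("body") != null);
  }
}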

...

  • configurePipeline(): Used to perform any validation on the application configuration required by this plugin, or to create any streams or datasets if the fieldName for a stream or dataset is not a macro.

  • run(): This method is given a Spark RDD (Resilient Distributed Dataset) containing every object that is received from the previous stage. This method then performs Spark operations on the input and usually saves the result to a dataset.
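
A minimal sketch of a SparkSink; unlike SparkCompute it terminates the pipeline branch and persists results itself. CountSink is illustrative, and the context types (SparkPluginContext for prepareRun) are assumed from the io.cdap.cdap.etl.api.batch package:

import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.api.data.format.StructuredRecord;
import io.cdap.cdap.etl.api.batch.SparkExecutionPluginContext;
import io.cdap.cdap.etl.api.batch.SparkPluginContext;
import io.cdap.cdap.etl.api.batch.SparkSink;
import org.apache.spark.api.java.JavaRDD;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch: runs a Spark job over the input and logs the result; a real sink
// would typically save the result to a dataset instead.
@Plugin(type = SparkSink.PLUGIN_TYPE)
@Name("CountSink")
public class CountSink extends SparkSink<StructuredRecord> {
  private static final Logger LOG = LoggerFactory.getLogger(CountSink.class);

  @Override
  public void prepareRun(SparkPluginContext context) {
    // Per-run setup before the Spark job starts.
  }

  @Override
  public void run(SparkExecutionPluginContext context, JavaRDD<StructuredRecord> input) {
    // Any Spark operations can run here; persist the result as needed.
    LOG.info("Processed {} records", input.count());
  }
}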

...

  • configurePipeline(): Used to perform any validation on the application configuration required by this plugin, or to create any streams or datasets if the fieldName for a stream or dataset is not a macro.

  • getStream(): Returns the JavaDStream that will be used as a source in the pipeline.
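
A minimal sketch of a streaming source built on Spark Streaming's socket receiver; the host and port are placeholders, and the io.cdap.cdap.etl.api.streaming types are assumed:

import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.etl.api.streaming.StreamingContext;
import io.cdap.cdap.etl.api.streaming.StreamingSource;
import org.apache.spark.streaming.api.java.JavaDStream;

// Sketch: emits each line read from a socket as a String.
@Plugin(type = StreamingSource.PLUGIN_TYPE)
@Name("SocketSource")
public class SocketSource extends StreamingSource<String> {

  @Override
  public JavaDStream<String> getStream(StreamingContext context) {
    // The returned DStream is consumed by the downstream stages of the pipeline.
    return context.getSparkStreamingContext().socketTextStream("localhost", 9999);
  }
}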

...

  • configurePipeline(): Used to perform any validation on the application configuration required by this plugin, or to create any streams or datasets if the fieldName for a stream or dataset is not a macro.

  • getWidth(): Returns the width, in seconds, of the windows created by this plugin. Must be a multiple of the batchInterval of the pipeline.

  • getSlideInterval(): Returns the slide interval, in seconds, of the windows created by this plugin. Must be a multiple of the batchInterval of the pipeline.
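
A minimal sketch of a windower producing 60-second windows that slide every 10 seconds; with a batchInterval of 10 seconds, both values are valid multiples (MinuteWindow is an illustrative name):

import io.cdap.cdap.api.annotation.Name;
import io.cdap.cdap.api.annotation.Plugin;
import io.cdap.cdap.etl.api.streaming.Windower;

// Sketch: each window covers the last 60 seconds and is produced every 10 seconds.
@Plugin(type = Windower.PLUGIN_TYPE)
@Name("MinuteWindow")
public class MinuteWindow extends Windower {

  @Override
  public long getWidth() {
    return 60;
  }

  @Override
  public long getSlideInterval() {
    return 10;
  }
}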

...