...

Use the guide (link) for recommended executor CPU and memory values, which can be set on the Resources tab. The number of executors can also be set there. If the source and sink systems can handle it, scaling the number of executors linearly will also scale throughput linearly. In our experiments, linearly scaling the executors for a pipeline (DataStreamGenerator -> Wrangler -> Kafka) increased throughput up to a point, after which it started decreasing because the Kafka Producer sink couldn’t handle the load.

...

Use the guide (link) to create a Dataproc cluster of the correct size for your pipeline. It is suggested to create the Dataproc cluster in the same region as the system the pipeline reads from or writes to. For example, if your Kafka cluster is in the us-east1 region and your pipeline writes data into a Kafka Producer sink, create the Dataproc cluster in the us-east1 region. In our experiments, running a pipeline with a Kafka Producer sink on a Dataproc cluster in the same region as the Kafka cluster improved throughput by about 24 times compared to running it in a different region.

...

Another thing to consider is that a topic can have at most as many consumers reading from it as it has partitions. If you anticipate more consumers in the future, add more topic partitions. In our experiments, increasing the number of partitions helped increase throughput up to a point, after which it started decreasing. For our experiment, we used a real-time pipeline (DataStreamGenerator -> Wrangler -> Kafka) with a Kafka cluster of 3 brokers. As can be seen from the chart below, throughput increased until we hit 9 partitions and then started decreasing.
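The consumer-to-partition limit follows from how keyed records are assigned to partitions and how consumer groups divide partitions among members. The sketch below illustrates the idea with a plain `hashCode`-based assignment; Kafka's actual default partitioner uses a murmur2 hash, so this is an illustration, not the real algorithm.

```java
// Simplified sketch of keyed partition assignment. Kafka's default
// partitioner actually uses murmur2 over the serialized key; plain
// hashCode() is used here only for illustration.
public class PartitionSketch {

    // Map a record key to one of numPartitions partitions.
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the modulo result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 9;
        for (String key : new String[] {"user-1", "user-2", "user-3"}) {
            System.out.println(key + " -> partition " + partitionFor(key, partitions));
        }
        // Within a consumer group, each partition is read by at most one
        // consumer, so a 9-partition topic supports at most 9 active
        // consumers per group; extra consumers sit idle.
    }
}
```

Because assignment is a pure function of the key, all records with the same key land in the same partition, which is what preserves per-key ordering.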

Configure the Kafka Producer sink (Ref.)

Once the Kafka cluster size has been finalized, the following properties can be set on the Kafka producer in CDAP to fine-tune performance.
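As one illustration of what such tuning looks like, the snippet below builds a producer configuration using standard Kafka producer property names (`acks`, `linger.ms`, `batch.size`, `compression.type`, `buffer.memory`). The specific values are example settings, not recommendations from this guide, and the helper name is hypothetical.

```java
import java.util.Properties;

// Illustrative producer-side tuning knobs using standard Kafka producer
// config keys. The values below are examples only; tune them against your
// own workload and broker capacity.
public class ProducerTuning {

    static Properties tunedProducerConfig(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("acks", "1");                  // wait for the leader's ack only
        props.put("linger.ms", "5");             // wait up to 5 ms to fill a batch
        props.put("batch.size", "65536");        // 64 KB batches
        props.put("compression.type", "snappy"); // trade CPU for network bandwidth
        props.put("buffer.memory", "67108864");  // 64 MB client-side send buffer
        return props;
    }

    public static void main(String[] args) {
        System.out.println(tunedProducerConfig("localhost:9092"));
    }
}
```

Larger batches and a small linger window generally raise throughput at the cost of a little latency, which is usually the right trade for a high-volume sink.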

...

async. In previous versions of CDAP, the Kafka Producer sink used a synchronous send, which blocks the call until the record is written to the topic and an acknowledgement is received. It now uses an asynchronous send by default, which returns immediately after storing the record in the producer's buffer. This improved the performance of the Kafka Producer sink by 2.5 times: in our experiments, for a DataStreamGenerator -> Wrangler -> Kafka pipeline with a Kafka setup of 3 brokers and 9 topic partitions, the pipeline wrote 236 million records/hour, compared to 82 million records/hour previously. Note that asynchronous send is now the default behavior and no property needs to be set.
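The speedup comes from overlapping sends with in-flight acknowledgements instead of paying one full round trip per record. The toy model below (not CDAP or Kafka client code) simulates a broker that takes about 1 ms per record and compares blocking on every ack against buffering all sends and flushing once at the end.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy model of synchronous vs. asynchronous sends. The "broker" is a thread
// pool where each delivery takes ~1 ms; async sends keep many deliveries in
// flight at once, while sync sends serialize the round trips.
public class AsyncVsSync {
    private static final ExecutorService BROKER = Executors.newFixedThreadPool(8);

    // Simulated network round trip to the broker for one record.
    private static CompletableFuture<Void> deliver() {
        return CompletableFuture.runAsync(() -> {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, BROKER);
    }

    // Synchronous style: block for every acknowledgement before continuing.
    static long syncSendMillis(int records) throws Exception {
        long t0 = System.nanoTime();
        for (int i = 0; i < records; i++) {
            deliver().get();
        }
        return (System.nanoTime() - t0) / 1_000_000;
    }

    // Asynchronous style: fire off every send, then flush once at the end.
    static long asyncSendMillis(int records) {
        long t0 = System.nanoTime();
        CompletableFuture<?>[] inFlight = new CompletableFuture<?>[records];
        for (int i = 0; i < records; i++) {
            inFlight[i] = deliver();
        }
        CompletableFuture.allOf(inFlight).join();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("sync:  " + syncSendMillis(200) + " ms");
        System.out.println("async: " + asyncSendMillis(200) + " ms");
        BROKER.shutdown();
    }
}
```

With 200 records, the synchronous loop pays roughly 200 sequential round trips, while the asynchronous version finishes in a fraction of that time because up to 8 deliveries overlap. Real producers also batch buffered records, which compounds the gain.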

...