Streaming HTTP handlers

Services can be used for both ingesting and egressing data. However, in the current CDAP release (3.2.0), they have the following limitations:

  • Every method call of a service handler is executed in a transaction. The transaction timeout is typically configured at around 30 seconds, so if a handler method needs longer than that to complete, the transaction fails.

  • The content of the HTTP request is always buffered in memory, so the handler cannot receive large payloads. It would be better to stream the content.

  • If a transaction conflict occurs, the handler has no control over how that error is handled.

Here are some use cases where these limitations get in the way:

  1. A service handler to upload partitions to a partitioned file set:

    • With each request, a large file is received. 

    • Metadata about the file is received in the HTTP headers.

    • Based on the metadata, the handler determines the partition key for the file.

    • The content of the request is consumed and streamed to a file.

    • The handler validates the file (possibly using a checksum, or by validating its size or number of records).

    • The handler may also parse the content as it is streamed and validate it using lookups in a dataset. 

    • The handler registers the file as a new partition.

    • If an error occurs in any of these steps, the file must be deleted or moved to a quarantine area, and a record of the error may need to be saved to a dataset.

    • The same cleanup applies if there is a transaction conflict.

    • In case of an error, the handler has control over the HTTP response.
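The streaming, validation, and cleanup steps of use case 1 can be sketched in plain Java. This is only an illustration under stated assumptions: the class StreamingUpload, its methods, and the choice of SHA-256 as the checksum are hypothetical and not part of the CDAP API. The sketch shows a request body copied to a file in fixed-size chunks, so memory use stays bounded, and the file removed if the copy or the validation fails.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration; not part of the CDAP API.
class StreamingUpload {

  /**
   * Copies the body to the target file in 8 KB chunks while computing a SHA-256 digest.
   * The target file is deleted if the copy fails or the checksum does not match.
   * Returns the number of bytes written.
   */
  static long streamToFile(InputStream body, Path target, String expectedSha256Hex)
      throws IOException {
    MessageDigest digest = newSha256();
    long total = 0;
    try (InputStream in = new DigestInputStream(body, digest);
         OutputStream out = Files.newOutputStream(target)) {
      byte[] buffer = new byte[8192];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
        total += n;
      }
    } catch (IOException e) {
      // Both streams are already closed here, so the partial file can be removed.
      Files.deleteIfExists(target);
      throw e;
    }
    String actual = toHex(digest.digest());
    if (!actual.equalsIgnoreCase(expectedSha256Hex)) {
      // Validation failed: delete the file (a real handler might quarantine it instead).
      Files.deleteIfExists(target);
      throw new IOException("Checksum mismatch for " + target.getFileName());
    }
    return total;
  }

  /** Convenience method: hex-encoded SHA-256 of a byte array. */
  static String sha256Hex(byte[] data) {
    return toHex(newSha256().digest(data));
  }

  private static MessageDigest newSha256() {
    try {
      return MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException(e);
    }
  }

  private static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder(bytes.length * 2);
    for (byte b : bytes) {
      sb.append(String.format("%02x", b));
    }
    return sb.toString();
  }
}
```

In a real handler, the expected checksum would presumably come from an HTTP header, and registering the partition would happen only after streamToFile returns successfully.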

  2. A service handler to download large files:

    • Similar to use case 1, but simpler: no writes happen, so no transaction conflicts can occur.

    • The request is small, but the response may be very large and take a long time to send.
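For the download case, a minimal sketch of chunked sending might look as follows. The class ChunkedDownload and its send method are hypothetical names, and a plain OutputStream stands in for whatever response object the handler would actually write to. The point is that only one chunk is held in memory at a time, regardless of the file size.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of the CDAP API.
class ChunkedDownload {

  /**
   * Sends the file to the response stream one chunk at a time.
   * Returns the total number of bytes sent.
   */
  static long send(Path file, OutputStream response, int chunkSize) throws IOException {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    long sent = 0;
    byte[] chunk = new byte[chunkSize];
    try (InputStream in = Files.newInputStream(file)) {
      int n;
      while ((n = in.read(chunk)) != -1) {
        response.write(chunk, 0, n);
        // Flush after each chunk so data reaches the client without accumulating.
        response.flush();
        sent += n;
      }
    }
    return sent;
  }
}
```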

  3. A handler to receive a sequence of records and process them one by one:

    • Processing a record may mean storing it in a dataset, or looking it up in a dataset.

    • The response may indicate how many records were successfully processed (some may fail because of conflicts).

    • The response may contain a new record for every record received.

    • Processing should continue in case of an error (even a transaction conflict).

    • Each record may need to be processed in its own transaction.
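The record-by-record behavior of use case 3 can be sketched as a loop that gives each record its own transaction and keeps going after a failure. Everything named here is an assumption made for illustration: RecordProcessor, TxConflictException, and Result are hypothetical types, not CDAP classes, and the transaction itself is assumed to be started and committed inside processInTransaction.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical types for illustration; none of these are part of the CDAP API.
class RecordByRecord {

  /** Signals that the transaction for a single record could not be committed. */
  static class TxConflictException extends Exception {
    TxConflictException(String message) {
      super(message);
    }
  }

  /** Processes one record inside its own short transaction. */
  interface RecordProcessor {
    void processInTransaction(String record) throws TxConflictException;
  }

  /** Summary that a handler could turn into the HTTP response. */
  static final class Result {
    final int succeeded;
    final List<String> failed;

    Result(int succeeded, List<String> failed) {
      this.succeeded = succeeded;
      this.failed = failed;
    }
  }

  /** Processes records one by one; a failure affects only the record that caused it. */
  static Result processAll(Iterator<String> records, RecordProcessor processor) {
    int succeeded = 0;
    List<String> failed = new ArrayList<>();
    while (records.hasNext()) {
      String record = records.next();
      try {
        processor.processInTransaction(record);
        succeeded++;
      } catch (TxConflictException | RuntimeException e) {
        // Remember the failed record and continue with the next one.
        failed.add(record);
      }
    }
    return new Result(succeeded, failed);
  }
}
```

Because the records are consumed through an Iterator, they can be read lazily from the request stream rather than collected in memory first.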

Created in 2020 by Google Inc.