Services can be used for ingest and egress of data. In the current CDAP release (3.2.0), however, there are limitations on what a service handler can do:
- Every method call of a service handler is executed in a transaction. The transaction timeout is typically configured to about 30 seconds. This means that if a handler method needs longer than that to complete, the transaction fails.
- The content of the HTTP request is always buffered up in memory, hence the handler cannot receive large data. It would be better to stream the content.
- If a transaction conflict occurs, the handler has no control over how that error is handled.
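To illustrate the second limitation, here is a minimal sketch of what streaming a request body could look like, as opposed to buffering it in memory. It is plain Java and does not use any CDAP API: the class `StreamingCopy` and the method `streamToFile` are hypothetical names, and the sketch assumes the request body is available as an `InputStream`, which the current handler API does not provide.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of CDAP.
class StreamingCopy {

  // Size of the only buffer held in memory, independent of the body size.
  static final int BUFFER_SIZE = 8 * 1024;

  // Copies the body to the target file chunk by chunk and returns the
  // number of bytes written.
  static long streamToFile(InputStream body, Path target) throws IOException {
    long total = 0;
    byte[] buffer = new byte[BUFFER_SIZE];
    try (OutputStream out = Files.newOutputStream(target)) {
      int n;
      while ((n = body.read(buffer)) != -1) {
        out.write(buffer, 0, n);
        total += n;
      }
    }
    return total;
  }

  public static void main(String[] args) throws IOException {
    // Simulated request body; a real body would arrive from the network.
    byte[] payload = new byte[1_000_000];
    Path target = Files.createTempFile("upload", ".bin");
    try {
      long written = streamToFile(new ByteArrayInputStream(payload), target);
      System.out.println(written + " bytes written, file size " + Files.size(target));
    } finally {
      Files.deleteIfExists(target);
    }
  }
}
```

With this approach the handler holds at most `BUFFER_SIZE` bytes at a time, so the size of the upload is bounded by disk space rather than by heap size.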
Here are some use cases where these limitations get in the way:
- A service handler to upload partitions to a partitioned file set:
  - With each request, a large file is received.
  - Metadata about the file is received in the HTTP headers.
  - Based on the metadata, the handler determines the partition key for the file.
  - The content of the request is consumed and streamed to a file.
  - The handler validates the file (possibly using a checksum, or by validating its size or number of records).
  - The handler may also parse the content as it is streamed and validate it using lookups in a dataset.
  - The handler registers the file as a new partition.
  - If an error occurs in any of these steps, the file must be deleted or moved to a quarantine area; possibly a record of the error needs to be saved to a dataset.
  - If there is a transaction conflict, the same applies.
  - Also, in case of an error, the handler has control over the HTTP response.
- A service handler to serve large files for download:
  - Similar to the upload use case above, except that it is simpler because no writes happen (and therefore no conflicts).
  - Also, the request is small, but the response may be very large and take a long time to send.
- A handler to receive a sequence of records and to process them one by one:
  - Processing a record may mean storing it in a dataset, or looking it up in a dataset.
  - The response may indicate how many records were successfully processed (some may have conflicts).
  - The response may contain a new record for every record received.
  - The processing should continue in case of an error (even a transaction conflict).
  - Possibly each record must be processed in its own transaction.
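The record-by-record use case can be sketched as follows. This is only an illustration of the desired control flow under stated assumptions, not an existing CDAP API: `ConflictException`, `RecordTransaction`, `Result`, and `processAll` are hypothetical names, and the transaction itself is simulated by a lambda.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of CDAP.
class PerRecordProcessing {

  // Stand-in for the error raised when a transaction fails with a conflict.
  static class ConflictException extends Exception {
    ConflictException(String message) {
      super(message);
    }
  }

  // Assumed abstraction: executes the work for one record in its own transaction.
  interface RecordTransaction {
    void execute(String record) throws ConflictException;
  }

  // Summary that the handler could return in the HTTP response.
  static class Result {
    final int succeeded;
    final List<String> failed;

    Result(int succeeded, List<String> failed) {
      this.succeeded = succeeded;
      this.failed = failed;
    }
  }

  // Processes every record; a failure of one record does not stop the others.
  static Result processAll(List<String> records, RecordTransaction tx) {
    int succeeded = 0;
    List<String> failed = new ArrayList<>();
    for (String record : records) {
      try {
        tx.execute(record);
        succeeded++;
      } catch (ConflictException | RuntimeException e) {
        // The handler decides what to do: here it records the failure and continues.
        failed.add(record);
      }
    }
    return new Result(succeeded, failed);
  }

  public static void main(String[] args) {
    // Simulated transaction that conflicts for records starting with "conflict".
    RecordTransaction tx = record -> {
      if (record.startsWith("conflict")) {
        throw new ConflictException("simulated conflict on " + record);
      }
    };
    Result result = processAll(Arrays.asList("a", "conflict-b", "c"), tx);
    System.out.println(result.succeeded + " succeeded, failed: " + result.failed);
  }
}
```

Because each record runs in its own short transaction, a conflict affects only that record, and the total request duration is no longer bounded by a single transaction timeout.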