The Transactional Messaging System (TMS) is a CDAP service that provides a "publish-and-subscribe" messaging system that understands transactions, and that guarantees the ordering and persistence of messages.
It uses HBase for the persistent and durable storage of messages, and supports both transactional and non-transactional message publishing and consumption.
Topics are created in TMS using both a namespace and a topic name. Note that while it is possible to fetch from the system namespace, attempting to publish to the system namespace will result in an error.
Because TMS does not use authorization, you can only create topics in the same namespace that your application is running in. For that reason, the name of the current namespace is not required to create a topic. However, you can publish to all topics in all namespaces (except the system namespace) and fetch from all topics in all namespaces.
A valid topic name can only contain characters from the Alphanumeric Extra Extended Character Set, as described in the Naming Conventions: alphanumeric characters, underscores (_), or hyphens (-).
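The naming rule above can be checked before attempting to create a topic. The following is a minimal sketch; `TopicNames` and `isValid` are hypothetical names that are not part of the CDAP API, and the sketch assumes that "alphanumeric" means ASCII letters and digits and that an empty name is invalid.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the CDAP API. */
final class TopicNames {

  // Assumption: ASCII letters and digits, underscores, or hyphens; at least one character.
  private static final Pattern VALID = Pattern.compile("[A-Za-z0-9_-]+");

  private TopicNames() { }

  /** Returns true if the whole name consists only of the allowed characters. */
  static boolean isValid(String name) {
    return name != null && VALID.matcher(name).matches();
  }

  public static void main(String[] args) {
    System.out.println(isValid("order_events-v2")); // prints "true"
    System.out.println(isValid("order.events"));    // prints "false" (the dot is not allowed)
  }
}
```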
Messages in TMS are published as either non-transactional or transactional messages.
Non-transactional messaging in TMS works in a manner very similar to other messaging systems:
Publish a message (also called a payload) to a topic.
Fetch (also called subscribe or consume) from a topic.
TMS provides strong ordering guarantees for the consumers of a topic:
When fetching from the same topic, every consumer will see the exact same order of messages.
A consumer that fetches from the same topic more than once, starting from the same point, will always see the messages in the exact same order.
Each message has a timestamp, which can be thought of as the published time, or the time the system persisted the message. Messages are uniquely identified by a concatenation of the timestamp, the topic, and a sequence ID (for distinguishing messages published in the same millisecond).
Under a highly concurrent load, the actual ordering of messages will be arbitrary, but is guaranteed to be consistent when fetched. With non-transactional messages, messages are available for consumption (fetching) as soon as the method call that publishes them returns.
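The identification and ordering rules can be illustrated with a toy in-memory topic. This is only a sketch under stated assumptions and not how TMS stores messages: the `ToyTopic` class, the `timestamp:topic:sequence` string format of the ID, and the caller-supplied timestamp are all hypothetical and chosen for readability.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy in-memory topic; a hypothetical model, not the TMS implementation. */
final class ToyTopic {

  /** A stored payload together with its identifier. */
  private static final class Entry {
    final String id;
    final String payload;

    Entry(String id, String payload) {
      this.id = id;
      this.payload = payload;
    }
  }

  private final String name;
  private final List<Entry> log = new ArrayList<>();
  private long lastTimestamp = -1;
  private int sequence = 0;

  ToyTopic(String name) {
    this.name = name;
  }

  /** Appends a payload and returns its ID; the order of the log is the order every fetch sees. */
  synchronized String publish(long timestampMillis, String payload) {
    // Never let the stored timestamp go backwards, so IDs follow the log order.
    long ts = Math.max(timestampMillis, lastTimestamp);
    // The sequence ID distinguishes messages persisted in the same millisecond.
    sequence = (ts == lastTimestamp) ? sequence + 1 : 0;
    lastTimestamp = ts;
    String id = ts + ":" + name + ":" + sequence; // assumed format, for illustration only
    log.add(new Entry(id, payload));
    return id;
  }

  /** Returns up to limit payloads stored after the given ID; a null ID means "from the start". */
  synchronized List<String> fetch(String afterId, int limit) {
    List<String> result = new ArrayList<>();
    boolean started = (afterId == null);
    for (Entry entry : log) {
      if (!started) {
        started = entry.id.equals(afterId);
        continue;
      }
      if (result.size() == limit) {
        break;
      }
      result.add(entry.payload);
    }
    return result;
  }
}
```

Because concurrent publishers are serialized when they append to the log, the order they end up in is arbitrary, but every fetch that starts from the same ID walks the same log and therefore returns the same sequence.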
Note that non-transactional consumers see all messages of a topic, including messages that are currently in a transaction.
With transactional messages, messages are not available (published) until the transaction has been successfully committed.
For example, a data pipeline might follow these steps:
Open a transaction
Do some work
Publish a message
Do additional work
Commit the transaction
If the transaction is successful, the message is now visible to transactional consumers.
If the transaction is unsuccessful, the message is rolled back and is never seen by transactional consumers.
However, as noted above, non-transactional consumers see all messages of a topic, including the messages that were published and are currently in a transaction.
With transactional publishing, all the work in a transaction will appear atomically to downstream consumers who are also transactional. It is not necessary that those consumers be in the same transaction; instead, they merely need to be in a transaction themselves.
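The visibility rules described above can be sketched with a toy model. Everything here is hypothetical (`ToyTxTopic`, `Tx`, and the method names are not TMS classes), the model is single-threaded, and it simplifies two points that the text does not specify: rolled-back messages are simply dropped from the log, and a transactional fetch just filters out messages that are still in progress.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of message visibility; a hypothetical sketch, not the TMS implementation. */
final class ToyTxTopic {

  private enum State { IN_PROGRESS, COMMITTED }

  private static final class Entry {
    final String payload;
    State state;

    Entry(String payload, State state) {
      this.payload = payload;
      this.state = state;
    }
  }

  /** A transaction that remembers the messages published inside it. */
  final class Tx {
    private final List<Entry> published = new ArrayList<>();

    void publish(String payload) {
      Entry entry = new Entry(payload, State.IN_PROGRESS);
      published.add(entry);
      log.add(entry); // stored right away, but not yet committed
    }

    void commit() {
      for (Entry entry : published) {
        entry.state = State.COMMITTED;
      }
    }

    void rollback() {
      // Assumption: rolled-back messages are removed from the toy log.
      log.removeAll(published);
    }
  }

  private final List<Entry> log = new ArrayList<>();

  Tx begin() {
    return new Tx();
  }

  /** Non-transactional publish: visible to every consumer as soon as this returns. */
  void publish(String payload) {
    log.add(new Entry(payload, State.COMMITTED));
  }

  /** Transactional consumers see only committed messages. */
  List<String> fetchTransactional() {
    return fetch(false);
  }

  /** Non-transactional consumers also see messages that are still in a transaction. */
  List<String> fetchNonTransactional() {
    return fetch(true);
  }

  private List<String> fetch(boolean includeInProgress) {
    List<String> result = new ArrayList<>();
    for (Entry entry : log) {
      if (includeInProgress || entry.state == State.COMMITTED) {
        result.add(entry.payload);
      }
    }
    return result;
  }
}
```

In this model, a message published through `Tx.publish` appears in `fetchNonTransactional()` immediately, but appears in `fetchTransactional()` only after `commit()` has been called.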
Example Publish and Subscribe
Consider a workflow that modifies a dataset, and at the same time publishes a notification to a topic.
If the workflow publishes to a topic non-transactionally, a problem can arise because there is no guarantee that the notification will be published only after the dataset commit. If it instead publishes transactionally to a TMS topic, transactional consumers are guaranteed to see the notification only if the write to the dataset is successfully committed.
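As a stand-alone illustration of this guarantee, the sketch below models a single transaction that covers both a dataset write and a notification. It does not use the CDAP API: `ToyWorkflow`, `Work`, and `executeInTransaction` are hypothetical names, and the dataset and topic are plain in-memory collections.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of one transaction spanning a dataset and a topic; hypothetical, not CDAP code. */
final class ToyWorkflow {

  /** The body of a transaction; it buffers its dataset writes and notifications and may fail. */
  interface Work {
    void run(Map<String, String> datasetWrites, List<String> notifications) throws Exception;
  }

  /** Committed dataset state. */
  final Map<String, String> dataset = new HashMap<>();

  /** Notifications visible to transactional consumers. */
  final List<String> topic = new ArrayList<>();

  /** Runs the work, then applies its dataset writes and notifications together or not at all. */
  boolean executeInTransaction(Work work) {
    Map<String, String> writes = new HashMap<>();
    List<String> notifications = new ArrayList<>();
    try {
      work.run(writes, notifications);
    } catch (Exception e) {
      return false; // rollback: both buffers are discarded
    }
    // Commit: the dataset change and the notification become visible together.
    dataset.putAll(writes);
    topic.addAll(notifications);
    return true;
  }
}
```

A consumer that reads `topic` in this model can rely on the corresponding entry already being present in `dataset`. If the work throws, neither the write nor the notification is ever visible.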
TMS only supports explicit transactions and does not support publishing from a long-running transaction, such as a mapper, reducer, or Spark executor.
Javadocs describing the TMS Java API are available in the package io.cdap.cdap.api.messaging.