Microservices communications: Publishing messages to a bus in a reliable way
In microservices design we often need a service to publish messages or events to a message bus in order to notify other services that something interesting has happened.
Perhaps a user has subscribed to our newsletter and the Email Sender service needs to know so it can send a welcome email, or the Reservation Analytics service needs to be aware of all the reservations and cancellations within our restaurants platform to keep the business informed.
No matter which messaging solution we choose, eg. Kafka, RabbitMQ, Azure Service Bus, there is a common design problem that we need to be aware of and avoid.
Imagine that an interesting change is about to happen in our system; let’s say a user has filled out their information and is about to click the
Register button. Upon processing this request, the User microservice needs to perform two actions:
- persist the new user’s information to its database, and
- publish a
UserRegisteredevent to a message broker
Both steps are important for our system to function properly, so they should either both succeed or fail. For example, if the user info gets persisted but the event doesn’t get published, then the
Subscription service won’t know about the new user and it won’t activate their trial subscription.
It might sound essential that the two steps above are performed together, but this double write can lead to a number of problems. For example:
- When the message broker is be down or unreachable: A message broker is an independent system with its own uptime/SLA, which means it can be down when we attempt to use it. We also communicate with it over the network, which is unreliable by definition, so the broker can be unreachable when we try to send a message.
- When the service is killed mid-way: As soon as the database persistence step completes, the service dies unexpectedly. The message publishing step doesn’t happen at all.
- When requests have to be processed fast: Talking to a database and a message broker means that we talk to two different systems over a network, which adds to the total response time. If our service’s response SLA is tight, this might be an issue.
- When we need to republish old events: Imagine that we have just deployed a new microservice and it needs to catch up with all events that occurred within the last 30 days. The TTL (time to live) for messages in our bus is 7 days, so most of the events have disappeared. We need a way to republish these missing events without changing the state of any elements in the database. If those two actions are coupled we have a problem.
There are a few patterns we can leverage to avoid double writes. Which one to choose depends on the situation, as we need to consider some design aspects of our system, most importantly the database technology used by the service in question.
The premise of all the solutions presented here is embracing the eventual consistency model, while atomically performing each of the required steps.
That way we can guarantee that services will get to know about the changes they are interested in, just not at the exact time they happen but rather a few moments later (eventually). After all, sacrificing consistency in favor of high availability and partition tolerance is generally a good idea for cloud applications¹.
In addition, the service’s database becomes the source of truth for messages/events emitted to other consumers. This allows us not just to have better visibility into what and when something gets published, but also be able to republish items if necessary.
Note that by using these patterns, the event consumers need to ensure
idempotency, meaning that they should guarantee that they won’t process the same event more than once if the producing service happens to publish it multiple times into the bus.
¹ For more information on the subject you can read about The CAP Theorem. Note that within a microservices based application, not all of the components need to follow the same consistency model. For example, a
Payments component could favor consistency over availability.
Note that in Domain Driven Design terms, we only talk about eventual consistency between microservices / aggregate roots, not within the same one. An architect should be careful when designing subdomains to keep consistency levels inside one context, and apply proper failure semantics across different domains. (Many thanks to Hrvoje Hudoletnjak for contributing this paragraph to the article.)
1. The Transactional Outbox Pattern
For applications that use a relational database to manage their state, we can use a special
Outbox table as a staging area for messages to be published.
We use local ACID transactions to make changes to the application’s state, and as part of such a transaction we also insert a message/event record into the Outbox table.
Then we can have a separate
Relay process that reads messages from the Outbox table and publishes them to a message bus within a separate local ACID transaction.
It’s a highly convenient pattern, especially when we need an existing application to start publishing events. Most importantly, this patterns allows us to work directly with high-level events, in contrast with some other patterns as we’ll see next.
On the other hand, there is additional programming work to be done when writing/reading outbox messages, so there’s always the chance that a developer forgets to update this part when making changes to the main application logic and data shape.
2. The Transactional Log Tailing Pattern
For applications that use a database which supports log tailing, we can have a
Relay process that taps into the transaction logs, transforms detected changes into messages/events and publishes them to a message bus. Although this approach is similar to the Transactional Outbox, it comes with different pros and cons.
On the bright side, there is less development work involved as we don’t need to make any additional inserts when persisting state. On the other hand, the transaction logs are usually in a raw, lower-level format and there is some work involved to transform them to high-level events.
This approach works for both relational and NoSQL databases, as long as they support the log tailing feature. A good example of a NoSQL implementation of this pattern is Microsoft Azure Cosmos DB with the Change Feed feature.
When working with Cosmos DB, the
Relay service taps into the Change Feed API and receives notifications for data changes, which can then publish to a message bus. The Change Feed feature also works great with the full Event Sourcing pattern, which we’ll see next.
3. The Full Event Sourcing Pattern
With Event Sourcing we persist the state of a business entity, such as a
Reservation or a
User, as a stream of state-changing events. To make a change to a business entity we append a new event to the stream, and to recreate an entity’s state we replay all events in the stream.
Since with Event Sourcing we don’t edit records but rather only insert new ones, we always change state in an inherently atomic way. The challenge lies in publishing events to a bus in an atomic way as well.
When working with Event Sourcing, we refer to the database as the
Event Store. Although there are products specialized in storing events, such as Greg Young’s Event Store, we can essentially use most SQL or NoSQL database products and adjust our custom event sourcing implementation to account for each database’s strengths and weaknesses.
The way we publish events reliably depends on the Event Store. For example, with a SQL implementation we can have an
Events table to store domain events, and a flag that indicates whether a record has been published. It’s similar to the Outbox pattern, but we don’t need the additional Outbox entries.
Transactional Log Tailing is also a good option since with Event Sourcing it’s much easier to know exactly what to listen for and act upon. Finally, when using Cosmos DB as an Event Store, the Change Feed feature fits naturally into the design.
I have successfully used both relational (Azure SQL) and NoSQL/document-oriented (Azure Cosmos DB) databases as an event store. The jury is still out regarding which approach is better, as there are cost, scalability and other factors to consider.
Nevertheless, you need to remember that with software engineering there are no silver bullets, hence the decision depends on the system and requirements at hand.
More reasons why reliable messaging is so important
Distributed Transactions using the
Two-Phase Commit Protocol (2PC) is not an option for microservice-based applications. Not only it restricts our options for Databases, Message Buses, etc. as they need to explicitly support the protocol, but can also significantly reduce the availability of our system; for a distributed commit to succeed, all participating components have to be online and reachable.
Some business transactions may still need to span across many microservices. Since each microservice has its own private database and Distributed Transactions have to be avoided by design, we need a different approach for keeping data consistent.
That is where the Saga pattern comes into play. Although the patern’s details are outside the scope of this post, the important thing to note is that Sagas heavily rely on messaging.
That is one more reason why messaging is so important for microservice-based applications, so we need to take extra care that is done in a reliable fashion.
Messaging is in the heart of microservice-based application design, so we need to make sure that is done in a reliable and predictable way.
To that end there are several solutions we can leverage, depending on the database technology we use and its features, plus a number of design aspects of our system.
Message/event consumers need to handle events idempotently, meaning that the downstream processing needs to ensure that no undesired duplication of effects happens, no matter how many times they come across it in the bus. (Many thanks to Ruben Bartelink for contributing this paragraph to the article.)
Finally, when we need to keep data consistent across microservices we should embrace Eventual Consistency, avoid 2PC and use Sagas instead. The Saga pattern heavily relies on messaging.
Eventual Consistency - Image Source
Want to build expertise in microservices? Check out my complete Collection of Microservices Books and don’t hesitate to drop me a line with your feedback.
Until next time!
Become a DrinkBird insider
Get updates on my newest stories, tutorials, interesting books I come across etc. Subscribe Very low frequency. Your privacy, guaranteed. Unsubscribe any time.
Full disclosure: the following are Amazon affiliate links. Using them to purchase a book won't cost you extra, but will help me buy more books.