
I see events everywhere; who will organize them?

When a new paradigm of communication between applications appears, most developers are excited to migrate and start using it without thinking about the possible problems. Using events instead of classic styles of communication like SOAP, REST, or GraphQL is no exception: it introduces a lot of benefits, like decoupling the producers from the consumers, making it possible to change flows, and increasing performance, but it is not all benefits.

The theory behind events looks excellent, without any problems. Still, when you start to work with events, many challenges appear: how to document them in a way that anyone can understand the idea and the body of each event, what to do when an exception appears, how to deal with modifications to the body of the events, and other problems related to the coordination of the event flows.

In this article, you will learn about some common problems that can appear and some possible solutions to solve them or mitigate their impact.

PROBLEM #1 – orchestration vs choreography

Defining how the events and the applications interact is one of the most documented situations. Books, articles, and conference talks explain the main idea of each approach. Still, it's difficult to see which of them is the best alternative for a specific situation because most of them explain things in a general way, so when you start to implement events, you choose one alternative over the other, and after some time the problems appear.

To recap, two alternatives or possible implementations of the SAGA pattern are choreography and orchestration. In the first one, a set of applications produces and consumes events without any of them controlling the entire flow; each knows neither the entire flow nor the events that compensate for every kind of problem. On the other hand, you have orchestration, where one or more applications manage part of the flow or the entire flow, which implies knowing many things related to the business logic.

ORCHESTRATION VS CHOREOGRAPHY

The main problem lies in the complexity of the business logic and the number of applications that need to interact; this is one of the reasons why it is possible to choose the wrong approach. A possible solution, instead of choosing one SAGA strategy or the other, could be a mix of both: you split your platform into different domains, sections, or groups, where each of them has a specific microservice or application that works as an orchestrator, and the interaction between different domains/groups/sections works with the choreography approach. The main benefit of this approach is that no single application knows or listens to all the events from the different domains, which reduces the complexity of the platform and simplifies the problems of changing the format of different events.

ORCHESTRATION AND CHOREOGRAPHY

Consider this approach when your platform contains a lot of microservices and most of them interact using events. If you have a simple scenario with a couple of events, or the number of microservices is just a few, choosing only one of the possible implementations of SAGA could be a good option.

PROBLEM #2 – Testing

Testing an application implies many challenges, but when you introduce events, the complexity increases because unit tests only cover part of the logic; you also need integration tests that consume events from a topic that exists on AWS, for example.

There are two alternatives at that point: the first is to do a manual test in some environment, testing the entire flow, which is more like an end-to-end test than an integration test. The other alternative is to use a Docker image to simulate the infrastructure; for example, in the case of AWS, you have Localstack to simulate the SNS topic.

Using Localstack implies some setup: you create a Dockerfile and add a bash file that creates all the topics and queues you need, like the following.

# Pin the Localstack version so the test infrastructure is reproducible
FROM localstack/localstack:0.14.3

# Start only the services the tests need and expose Localstack on localhost
ENV SERVICES=sqs,sns DEBUG=1 DEFAULT_REGION=us-east-1 HOSTNAME_EXTERNAL=localhost DOCKER_HOST=unix:///var/run/docker.sock

# Init scripts in this directory run at startup to create topics and queues
VOLUME /docker-entrypoint-initaws.d/
VOLUME /var/run/docker.sock

# Localstack's single edge port
EXPOSE 4566

The VOLUME /docker-entrypoint-initaws.d/ will contain a file more or less like the following:

#!/usr/bin/env bash

set -euo pipefail

# enable debug
# set -x


echo "configuring sns/sqs"
echo "==================="
# https://gugsrs.com/localstack-sqs-sns/
LOCALSTACK_HOST=localhost
AWS_REGION=us-east-1
LOCALSTACK_DUMMY_ID=000000000000

get_all_queues() {
  awslocal --endpoint-url=http://${LOCALSTACK_HOST}:4566 sqs list-queues
}

create_queue() {
  local QUEUE_NAME_TO_CREATE=$1
  awslocal --endpoint-url=http://${LOCALSTACK_HOST}:4566 sqs create-queue --queue-name ${QUEUE_NAME_TO_CREATE} --attributes FifoQueue=true,ContentBasedDeduplication=true
}

get_all_topics() {
  awslocal --endpoint-url=http://${LOCALSTACK_HOST}:4566 sns list-topics
}

create_topic() {
  local TOPIC_NAME_TO_CREATE=$1
  awslocal --endpoint-url=http://${LOCALSTACK_HOST}:4566 sns create-topic --name ${TOPIC_NAME_TO_CREATE} --attributes FifoTopic=true,ContentBasedDeduplication=true
}

link_queue_and_topic() {
  local TOPIC_ARN_TO_LINK=$1
  local QUEUE_ARN_TO_LINK=$2
  awslocal --endpoint-url=http://${LOCALSTACK_HOST}:4566 sns subscribe --topic-arn ${TOPIC_ARN_TO_LINK} --protocol sqs --notification-endpoint ${QUEUE_ARN_TO_LINK} --attributes RawMessageDelivery=true
}

guess_queue_arn_from_name() {
  local QUEUE_NAME=$1
  echo "arn:aws:sqs:${AWS_REGION}:${LOCALSTACK_DUMMY_ID}:$QUEUE_NAME"
}

guess_topic_arn_from_name() {
  local TOPIC_NAME=$1
  echo "arn:aws:sns:${AWS_REGION}:${LOCALSTACK_DUMMY_ID}:$TOPIC_NAME"
}

send_message_to_queue() {
  local QUEUE=$1
  local MESSAGE=$2
  awslocal --endpoint-url=http://${LOCALSTACK_HOST}:4566 sqs send-message --queue-url $QUEUE --message-body $MESSAGE --message-group-id "saraza" --message-deduplication-id "saraza"
} 

ORDER_QUEUE_NAME="it-api-checkout-events-orders.fifo"
# example names for the topic and the assertions queue; adjust them to your platform
EVENTS_TOPIC_NAME="it-api-checkout-events.fifo"
ASSERTIONS_QUEUE_NAME="it-api-checkout-assertions.fifo"

echo "creating queue: $ORDER_QUEUE_NAME"
ORDER_QUEUE_URL=$(create_queue ${ORDER_QUEUE_NAME})
echo "created queue: $ORDER_QUEUE_URL"

echo "creating topic: $EVENTS_TOPIC_NAME"
EVENTS_TOPIC=$(create_topic ${EVENTS_TOPIC_NAME})
echo "created topic: $EVENTS_TOPIC"

echo "linking topic $EVENTS_TOPIC to queue $ASSERTIONS_QUEUE_NAME"
LINKING_RESULT=$(link_queue_and_topic $(guess_topic_arn_from_name $EVENTS_TOPIC_NAME) $(guess_queue_arn_from_name $ASSERTIONS_QUEUE_NAME))
echo "linking done:"
echo "$LINKING_RESULT"

echo "all topics are:"
echo "$(get_all_topics)"
echo "all queues are:"
echo "$(get_all_queues)"

A good companion to Localstack for writing integration tests is Karate, which makes the tests simple to implement and write. I suggest that you create the event in each particular test, as in the following example:

Feature: consume orders

 Background:
    # AppUrl is the base path of the application
    * url AppUrl

  Scenario: test PRODUCT_SOLD event
    # Insert the event into the topic
    * def eventQuery = read('json/xxxx/event-query-scenario-.txt')
    Given url 'http://localhost:' + localstackPort + '/000000000000/it-api-checkout-order.fifo?' + eventQuery
    When method get
    Then status 200

    # Check if the event was processed
    Given url AppUrl
    And path 'order/1'
    When method get
    Then status 200

This is a possible approach to solve the integration problem; if you use Kafka instead of AWS SNS, the idea is more or less the same, but instead of using Localstack, you will use a Docker image of Kafka.
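If you take the Kafka route, one way to manage that Docker image from the tests themselves is Testcontainers; this is my suggestion rather than something the article's stack requires, and the image tag and test names below are examples. A minimal JUnit 5 sketch:

import org.junit.jupiter.api.Test;
import org.testcontainers.containers.KafkaContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import org.testcontainers.utility.DockerImageName;

@Testcontainers
class OrderEventsIT {

    // Starts a disposable Kafka broker in Docker for the duration of the test class
    @Container
    static final KafkaContainer KAFKA =
            new KafkaContainer(DockerImageName.parse("confluentinc/cp-kafka:7.4.0"));

    @Test
    void consumesOrderEvent() {
        // Point your producer/consumer configuration at the container
        String bootstrapServers = KAFKA.getBootstrapServers();
        // ... publish a test event and assert on the side effects, as in the Karate example
    }
}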

One last suggestion: if your application sends an event to some topic, the test should check that the event exists and, after that, delete it.
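A minimal sketch of that check with the AWS SDK v2 against Localstack; the queue URL, helper name, and dummy credentials are examples:

import java.net.URI;
import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.Message;

public class EventAssertions {

    // Example URL; it matches the queue created in the init script above
    private static final String QUEUE_URL =
            "http://localhost:4566/000000000000/it-api-checkout-events-orders.fifo";

    public static void assertEventPublishedAndCleanUp(String expectedFragment) {
        try (SqsClient sqs = SqsClient.builder()
                .endpointOverride(URI.create("http://localhost:4566")) // Localstack edge port
                .region(Region.US_EAST_1)
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create("test", "test")))   // dummy credentials
                .build()) {

            // Poll the queue, waiting up to 5 seconds for the event to show up
            Message message = sqs.receiveMessage(b -> b.queueUrl(QUEUE_URL)
                            .waitTimeSeconds(5).maxNumberOfMessages(1))
                    .messages().stream().findFirst()
                    .orElseThrow(() -> new AssertionError("no event was published"));

            if (!message.body().contains(expectedFragment)) {
                throw new AssertionError("unexpected event: " + message.body());
            }
            // Delete it so the next test starts with an empty queue
            sqs.deleteMessage(b -> b.queueUrl(QUEUE_URL).receiptHandle(message.receiptHandle()));
        }
    }
}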

PROBLEM #3 – thin vs fat messages

The size or the amount of information each message contains is essential for the different applications. You need to consider the tradeoff between including a lot of attributes in the body of the message, which perhaps not all the consumers need, and only adding the IDs of the different elements, which reduces the size of the message but means each consumer must request certain microservices or applications to obtain the information it needs.

THIN VS FAT

There is no correct strategy for all situations; choosing one instead of the other depends on the context. But if you send an event that a lot of different applications consume, it could be a better option to send only the IDs of the different elements, and each application will have the responsibility of obtaining the information it needs to work.
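To make the tradeoff concrete, here is a hypothetical pair of payloads for the same business fact, first fat and then thin:

// Fat event: everything a consumer might need travels inside the message itself
record ProductSoldFat(
        String orderId,
        String customerName,
        String customerEmail,
        String productDescription,
        double price,
        String shippingAddress) {}

// Thin event: only identifiers; each consumer fetches afterwards what it needs
record ProductSoldThin(String orderId, String customerId, String productId) {}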

PROBLEM #4 – versioning

This problem is not exclusive to events; the same happens when you need to introduce disruptive changes on the endpoints of a microservice.

There are many alternatives to tackle this problem, like using Semantic Versioning to indicate which version of the event each message represents. The consumer could filter out events with a version it does not support, or introduce some strategy to parse the message into different objects depending on which particular version it is.
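A minimal sketch of that parsing strategy, assuming each message carries a top-level version attribute and using Jackson; the event classes and field names are hypothetical:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class VersionedEventParser {

    private final ObjectMapper mapper = new ObjectMapper();

    public Object parse(String rawMessage) throws Exception {
        JsonNode root = mapper.readTree(rawMessage);
        // The producer stamps every message with the version of its schema
        String version = root.path("version").asText("1.0.0");
        String major = version.split("\\.")[0];

        // Route each major version to the object that understands it;
        // unknown versions are rejected instead of being half-parsed
        return switch (major) {
            case "1" -> mapper.treeToValue(root, OrderEventV1.class);
            case "2" -> mapper.treeToValue(root, OrderEventV2.class);
            default -> throw new IllegalArgumentException("unsupported version " + version);
        };
    }
}

record OrderEventV1(String version, String orderId) {}
record OrderEventV2(String version, String orderId, String channel) {}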

Another alternative is to delegate the responsibility of registering the different versions of the events and indicate if they are backward-compatible in an Event Schema Registry; for example, in the case of Kafka, you can use Schema Registry. Before sending a message, the producer obtains the schema of the event, creates the message, and publishes it on the broker; the consumer does the same to deserialize the message.
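On the producer side, for the Kafka case, this is mostly configuration; a sketch assuming the Confluent KafkaAvroSerializer and a local Schema Registry, where the URLs, topic name, and schema are examples:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroOrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // The Avro serializer registers/fetches the schema in the registry before
        // publishing, which is where the compatibility rules are enforced
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        // Example schema with a single field
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"OrderCreated\","
                + "\"fields\":[{\"name\":\"orderId\",\"type\":\"string\"}]}");
        GenericRecord event = new GenericData.Record(schema);
        event.put("orderId", "order-1");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "order-1", event));
        }
    }
}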

PROBLEM #5 – Documentation

Documenting the events is one of the biggest problems if you have not done it since the beginning of the platform's creation, because it implies investing a lot of time to model all the events, the consumers, and the producers.

Why is it so important, you may ask? The main problem is what happens when some small attribute in a message changes. If you don't know the impact of the change or the connections between the different applications, it isn't easy to understand how the platform works. To solve this particular problem, there are different strategies, some more sophisticated and others more trivial.

The trivial solution implies that you create a document in some tool like Confluence, Notion, or Google Docs, or any that you prefer, and put in it all the information about the consumers of an event and the format of the message. As you can imagine, this solution is not the best because keeping a map of the entire platform is complex, and not all developers like to write a long document with tons of information.

Conversely, some solutions are simple and close to the developers' tools, like AsyncAPI, which is to event-driven applications what OpenAPI 3 is to documenting REST applications.

This approach is great because it generates the documentation dynamically, and it's simple to share with other developers. Still, again, you don't have the entire picture of how the platform works, which is so important. To solve this particular problem, there is a tool called EventCatalog, which allows you to create documentation of the different applications and the messages that go from one application to another.

Example of the documentation using EventCatalog

One of the main benefits is that it's simple to use: you declare everything in a set of plain files, and you can version the changes. But the main problem with this mechanism is that someone needs to keep the documentation updated with the latest changes.

One tool that combines using AsyncAPI to generate the documentation dynamically with a view of how the different applications interact could be Backstage, which Spotify created.

Consider that any of these solutions could work for you depending on the size of the platform and the number of events. The worst scenario is not implementing any solution to document your events, so analyze the tradeoffs of each tool and choose the right one for you.

PROBLEM #6 – processing failures

Errors processing events are to be expected, just as a simple HTTP request can produce, for example, an internal server error. In most cases, the queues have some retry strategy by default; in other cases, you need to configure the number of attempts before the message is sent to the DLQ (dead-letter queue).
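With SQS, for example, that number of attempts lives in the queue's redrive policy; a sketch with the AWS SDK v2, where the queue URL, DLQ ARN, and count are examples:

import java.util.Map;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.QueueAttributeName;

public class RedriveConfig {
    public static void main(String[] args) {
        try (SqsClient sqs = SqsClient.create()) {
            // After 3 failed receives, SQS moves the message to the DLQ;
            // the target ARN and the count below are example values
            String redrivePolicy = "{\"deadLetterTargetArn\":"
                    + "\"arn:aws:sqs:us-east-1:000000000000:orders-dlq.fifo\","
                    + "\"maxReceiveCount\":\"3\"}";
            sqs.setQueueAttributes(b -> b
                    .queueUrl("http://localhost:4566/000000000000/it-api-checkout-events-orders.fifo")
                    .attributes(Map.of(QueueAttributeName.REDRIVE_POLICY, redrivePolicy)));
        }
    }
}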

When something terrible happens, you need to consider many things, like creating alerts that detect the problem and notify you; there are many tools to do it, like Signoz, NewRelic, Dynatrace, and many more. Another associated issue is what happens with the status of distributed transactions: after a certain number of retries, do you emit an event that rolls back some operation, or do you change nothing, send the event to the DLQ, and decide later, with some strategy, what to do with the flow?

One habit that helps you understand a problem and reproduce the situation with a test is to log everything about the event your application consumes. This gives you a quick way to know whether the issue is in the application or in the information the event contains.
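A small sketch of that habit, assuming SLF4J as the logging facade: log the payload exactly as received, before any deserialization can fail.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class OrderEventListener {

    private static final Logger log = LoggerFactory.getLogger(OrderEventListener.class);

    public void onMessage(String rawPayload) {
        // Log the event exactly as received, before parsing can fail, so the
        // same payload can be copied into a test to reproduce the issue
        log.info("received event payload={}", rawPayload);
        // ... deserialize and process
    }
}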

PROBLEM #7 – RESEND INFORMATION

Not all the events that fail during the different processing attempts are unrecoverable; there are situations when one application consumes an event while another application that provides certain vital information is down. In this situation, after some attempts, the event goes directly to the DLQ.

When you have messages on the DLQ, you have two alternatives: if the error or problem is unrecoverable, you can discard the message; otherwise, you can solve the problem by reprocessing the event again. A possible implementation of this solution implies that you re-send the event to the queue as a new message, but how can you do this? And how can you do the same on the multiple DLQs that your platform has?

One option is to create an endpoint or mechanism on each application that shows the messages on the DLQ and provides the logic to move or resend the message again. This introduces the problem of having the same code or solution across multiple applications, and it implies that someone needs to know the different URLs.

DISTRIBUTED PROCESSOR

An alternative to this approach could be to create one application that consumes the information from the DLQs of all the events and, with some back office, lets you decide manually what happens with each event.

RE-PROCESSOR SERVICE

In both cases, you need to decide what you want to do with all the events on the DLQ; the question is which approach reduces the complexity for you.
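Whichever approach you pick, the core operation is the same: read a message from the DLQ, re-publish it, and delete the original. A sketch with SQS and the AWS SDK v2, where the queue URLs are examples:

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.Message;

public class DlqReprocessor {

    // Example URLs; in a central re-processor these would come from configuration
    private static final String DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders-dlq";
    private static final String MAIN_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders";

    public static void main(String[] args) {
        try (SqsClient sqs = SqsClient.create()) {
            for (Message message : sqs.receiveMessage(b -> b.queueUrl(DLQ_URL)
                    .maxNumberOfMessages(10)).messages()) {
                // Re-publish to the original queue as a brand-new message...
                sqs.sendMessage(b -> b.queueUrl(MAIN_URL).messageBody(message.body()));
                // ...and remove it from the DLQ only after the resend succeeded
                sqs.deleteMessage(b -> b.queueUrl(DLQ_URL).receiptHandle(message.receiptHandle()));
            }
        }
    }
}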

An alternative to the previous solutions is to create a cron job on each application that iterates over the messages on the DLQ and decides whether the events can be reprocessed depending on some information, for example, whether the error was related to some deployment or whether some resource of the infrastructure was unhealthy. Still, it's a possible solution instead of doing everything manually.

PROBLEM #8 – naming conventions

In most cases, the naming conventions of the events depend on each company, like the names of the microservices or the UI components. Still, the problem with choosing a wrong or non-descriptive name for the events is the complexity when something bad happens and you need to check the logs. Let me show an example: imagine that you work on an e-commerce website with tons of events, and for some reason, the order of one client was canceled. You check the logs and see that there was a problem processing the event on process-order, but you have no information about which application produced the error; in this situation, it could be difficult to find which applications are involved in the problem.

The idea of naming conventions is that all the events follow the same pattern, which gives you the necessary information to understand which part of the entire platform has a problem. Following this approach, some possible naming alternatives could be:

<namespace>-<product>-<event-type>

<application>-<data-type>-<event-type>

<organization>-<application-name>-<event-type>-<event>

The previous examples are some of the most common, but you can choose whichever alternative best represents your company's situation. Consider adding some restriction or validation on the broker to prevent creating new topics that do not follow the pattern; this could be automatic, or manual, where someone is responsible for checking each name.
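The automatic variant can be as small as a validation in whatever tooling creates the topics; a sketch for the first pattern above, where the regex is the part you adapt to your own convention:

import java.util.regex.Pattern;

public class TopicNamePolicy {

    // <namespace>-<product>-<event-type>: at least three lowercase, dash-separated words
    private static final Pattern NAME_PATTERN =
            Pattern.compile("^[a-z0-9]+(-[a-z0-9]+){2,}$");

    public static void requireValidName(String topicName) {
        if (!NAME_PATTERN.matcher(topicName).matches()) {
            throw new IllegalArgumentException(
                    "topic '" + topicName + "' does not follow <namespace>-<product>-<event-type>");
        }
    }

    public static void main(String[] args) {
        requireValidName("checkout-orders-created"); // passes
        requireValidName("OrdersTopic");             // fails fast with a clear message
    }
}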

PROBLEM #9 – order of the events

Receiving events out of order is an inherent risk of asynchronous communication, so you need some mechanism to deal with it. The order of the events is not only about the messages for the same event in a queue; in a distributed architecture, you need to know whether all the events associated with a particular flow before yours were executed correctly or not.

A question could appear in your mind: why does this problem happen? There is no unique explanation, but possible answers could be that some event was processed incorrectly, or that someone misunderstood the flow of events and produced the wrong event. It's challenging to think of a solution based on one particular problem; you need to find a mechanism that guarantees the consistency of the information in any situation.

Let's explain with one example: imagine that you work in e-commerce, and the microservice that processes a purchase receives a message to deliver the products to the client, but you never received the event from the payment service that tells you whether the payment was approved or not. In this example, a possible solution is to create a state machine that checks the current status and which transitions are valid before executing them.
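A minimal sketch of that state machine, with hypothetical statuses for the purchase flow; each event is applied only if it is a valid transition from the current status:

import java.util.Map;
import java.util.Set;

public class OrderStateMachine {

    enum Status { CREATED, PAYMENT_APPROVED, DELIVERED, CANCELLED }

    // Which target statuses are reachable from each current status
    private static final Map<Status, Set<Status>> VALID_TRANSITIONS = Map.of(
            Status.CREATED, Set.of(Status.PAYMENT_APPROVED, Status.CANCELLED),
            Status.PAYMENT_APPROVED, Set.of(Status.DELIVERED, Status.CANCELLED),
            Status.DELIVERED, Set.of(),
            Status.CANCELLED, Set.of());

    private Status current = Status.CREATED;

    public void apply(Status next) {
        // Reject out-of-order events: a DELIVERED event cannot be applied
        // while the order is still waiting for PAYMENT_APPROVED
        if (!VALID_TRANSITIONS.get(current).contains(next)) {
            throw new IllegalStateException("invalid transition " + current + " -> " + next);
        }
        current = next;
    }
}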

Another solution, instead of having an attribute containing the status of the operation/order/purchase, is to request the different microservices to check whether the other events finished. Here, the complexity lies in making many requests to validate something.

PROBLEM #10 – no experience

The previous problems have more or less clear alternatives to solve them, but if most of your team, or the teams in the company, do not have experience with the use of events, you will have a problem if you do not tackle it as soon as possible.

Someone in the company needs to research how to implement and solve the different problems that an event-driven architecture introduces, creating some archetype or template with the basics of consuming and publishing an event. Perhaps, after reading the previous sentence, you think that is obvious. Still, some companies decide to use events without analyzing the possible problems, so at some point in the future, the platform contains many issues, with different implementations using the same framework or library.

If you are an expert in event-driven architecture, try to create a small webinar or a talk to explain the best practices, but focus on the problems, to show your audience which catastrophe could occur in some situations, like processing a payment for a customer.

WHAT’S NEXT?

There are many resources about different topics connected with the management of events, but only a few tackle the problems described here.

Other resources could be great for solving some particular problems related to the documentation.

CONCLUSION

Using events on a platform can be great and gives you a lot of benefits, like the possibility of working in parallel with other teams and changing the flows. Still, it is best to focus on solving the problems before they appear, because it's not simple and implies analyzing different alternatives and discussing them with other members of your team or company.

There are a lot of possible problems that could appear when using events. You can't tackle all of them, so I suggest you prioritize and solve first the issues that could produce significant pain; for example, deciding which type of SAGA (choreography or orchestration) you will use is more relevant than the naming conventions.

One last thing: do not be frustrated if you choose one approach to solve a problem instead of another and, after some time, new problems appear. There is no unique way to solve the issues of using events, but all your decisions must be documented and discussed with other partners or colleagues. Hence, I suggest using ADRs (Architecture Decision Records), which are an excellent way to track architectural choices.

Author: Andres Sacco

Andres Sacco has been a developer since 2007, working in different languages, including Java, PHP, NodeJs, Scala, and Kotlin. His background is mostly in Java and the libraries and frameworks associated with this language. In most of the companies he worked for, he researched new technologies to improve the performance, stability, and quality of each company's applications.

In 2017 he started to look for new ways to optimize the transfer of data between applications to reduce the cost of infrastructure. He suggested some actions, some applicable to all the microservices and others to just a few. All this work concluded with the creation of a series of theoretical-practical projects, which are available on Manning.com.

Recently, he published a book with Apress about the latest version of Scala. He also published a set of theoretical-practical projects about uncommon ways of testing, like architecture tests and chaos engineering.

He has given internal courses to different audiences, like developers, business analysts, and commercial people. He also participates as a technical reviewer for books from the publishers Manning, Apress, and Packt.
