While developing our new ePages product using a microservices architecture, we came to experience the challenges involved in operating these kinds of systems. Debugging is hard, but debugging a distributed system of microservices is even harder. In this series of blog posts we are going to share what we learned, starting with the topic of logging in a distributed environment.
It’s only after shipping to production…
After a couple of very busy weeks or even months of development, you finally shipped your brand new system of microservices to production. You impatiently wait for the first real customer feedback, and are ready to implement the next big killer feature of your system. Then nothing is more irritating than your product owner telling you that an essential feature is sporadically behaving “strangely” and asking you to investigate.
…that the real fun begins!
But thanks to your well-oiled monitoring system, you were already informed about this hours ago. Here is how we at ePages deal with these situations to find out what is going wrong within our microservices.
Our first step when investigating is to consult the logs created in our production environment. But since our microservices run as Docker containers somewhere in the cloud, it would be very cumbersome to open a terminal connection to each of them and use command line tools like `grep` to search through log files written to disk.
Log aggregation to the rescue
As described in The Twelve-Factor App, logs should be treated as a stream of events. Containers should not write them to disk, but instead send them directly to `stdout`, where they are collected and processed by specialized infrastructure components.
This logging infrastructure is responsible for gathering all log events from all runtimes in the system, transforming them into a format for easy searching, and storing them for later retrieval.
Dashboards and other visualization tools can then be used to start digging through the haystack of log events in order to find the needle we are looking for.
Structured log events
Tools like Logstash or Fluentd are Open Source solutions for gathering and transforming log events.
They provide one part of a technology stack often abbreviated as ELK or EFK, completed by Elasticsearch for storing and Kibana for visualizing log events.
While such tools are very flexible with regard to their input data formats, we already create log events in a structure that eases (or even lets us completely skip) the transformation step.
For our microservices running in a Java Virtual Machine (JVM), we use the combination of Logback and the Logback JSON encoder to produce log events in JSON, while still integrating seamlessly with all logging calls in our application.
The JSON format created out of the box is already a good fit for Logstash, but it’s also possible to introduce a custom JSON structure for more advanced use cases.
A typical Java application can be configured by adding the Logback JSON encoder to the classpath and including this `logback.xml` file:
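A minimal sketch of such a configuration might look like this, using the Logstash encoder class shipped with the Logback JSON encoder (the exact appender setup will vary per service):

```xml
<configuration>
  <!-- send all log events to stdout, encoded as JSON by the Logstash encoder -->
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder"/>
  </appender>

  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```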
An excerpt of JSON log events produced during startup of a Spring Boot application looks like this:
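The following lines are an illustrative sketch of what such events might look like (timestamps, logger names, and the service name `product-service` are made up for this example):

```json
{"@timestamp":"2016-08-30T10:15:22.123+02:00","@version":1,"message":"Starting Application on host-1 with PID 1","logger_name":"com.example.Application","thread_name":"main","level":"INFO","app":"product-service"}
{"@timestamp":"2016-08-30T10:15:24.456+02:00","@version":1,"message":"Started Application in 5.3 seconds","logger_name":"com.example.Application","thread_name":"main","level":"INFO","app":"product-service"}
```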
The properties `@timestamp`, `@version` and `message` are what the ELK stack expects in order to visualize log events without any further configuration.
We also added the property `app` to include the application name of the microservice in all log events, using the following technique.
Using Mapped Diagnostic Context
SLF4J, our logging abstraction of choice, comes with a feature called “Mapped Diagnostic Context” (MDC), which allows us to store arbitrary key/value pairs that are automatically attached to each log event without explicitly including them in the message we want to log. After request processing, it is important to clean up the MDC and remove all values that were stored in it, to prevent them from leaking into subsequent, unrelated log events. This can be done automatically by leveraging Java’s try-with-resources feature:
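A minimal sketch of this pattern, using SLF4J’s `MDC.putCloseable()` (the class and method names around it are made up for illustration):

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class GreetingService {

    private static final Logger log = LoggerFactory.getLogger(GreetingService.class);

    public void greet(String userName) {
        // putCloseable() stores the key/value pair in the MDC and returns an
        // MDCCloseable that removes it again when the try block is left
        try (MDC.MDCCloseable mdc = MDC.putCloseable("user", userName)) {
            // every log event emitted inside this block carries the "user" property
            log.info("Sending greeting");
        }
    }
}
```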
The Logback JSON encoder already knows how to include MDC key/value pairs in the log events. An example output generated while processing a request for user `fred` looks like this:
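An illustrative event (again with made-up timestamp, logger, and service name) could then contain the additional `user` property:

```json
{"@timestamp":"2016-08-30T10:16:01.456+02:00","@version":1,"message":"Sending greeting","logger_name":"com.example.GreetingService","thread_name":"http-nio-8080-exec-1","level":"INFO","app":"product-service","user":"fred"}
```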
Now it is easy to search for all log events that include a property named `user` containing the value `fred`.
Correlating log events
Most requests to a microservices system are processed by the orchestrated effort of multiple containers, and all of their log events are aggregated centrally. When inspecting the log stream, we want to correlate events from different containers based on matching properties to get the full picture of all participating microservices. This correlation is useful, for example, to filter all events that happened while processing one particular request, identified by a unique, so-called Correlation ID.
Single point of entry
It is not uncommon to have one container acting as an API gateway in front of all the other containers. With its powerful reverse proxy capabilities, nginx is a robust choice for fulfilling this role. Using a custom log format, it is also possible to log every HTTP access directly in our desired JSON structure:
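A sketch of such a log format definition could look like the following (the exact field selection is an assumption, not our precise configuration; `escape=json` and `$request_id` require a reasonably recent nginx version):

```nginx
# inside the http { ... } block
log_format json_log escape=json
    '{'
        '"@timestamp":"$time_iso8601",'
        '"@version":1,'
        '"app":"nginx",'
        '"correlation-id":"$request_id",'
        '"message":"$request",'
        '"status":"$status",'
        '"remote_addr":"$remote_addr",'
        '"request_time":"$request_time"'
    '}';

access_log /dev/stdout json_log;
```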
Notice how the `$request_id` nginx variable is used to define an additional JSON property for every request, containing a randomly generated hexadecimal value that serves as the Correlation ID. It is also passed to the downstream service in a custom HTTP header `X-Correlation-ID`, and returned to the calling client as a response header:
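Forwarding and returning the header might then be configured like this (the upstream name `backend` is just a placeholder):

```nginx
location / {
    # pass the Correlation ID to the downstream service ...
    proxy_set_header X-Correlation-ID $request_id;
    # ... and return it to the calling client
    add_header X-Correlation-ID $request_id;

    proxy_pass http://backend;
}
```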
Adding this `correlation-id` to the MDC for every request to our downstream service can be achieved using a servlet filter:
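One possible shape of such a filter, sketched here on top of Spring’s `OncePerRequestFilter` (class and constant names are made up, and it assumes the API gateway always sets the header):

```java
import java.io.IOException;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.web.filter.OncePerRequestFilter;

public class CorrelationIdFilter extends OncePerRequestFilter {

    private static final String HEADER_NAME = "X-Correlation-ID";
    private static final String MDC_KEY = "correlation-id";

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
            FilterChain filterChain) throws ServletException, IOException {
        // store the Correlation ID received from the API gateway in the MDC,
        // and remove it again once the request has been fully processed
        try (MDC.MDCCloseable mdc = MDC.putCloseable(MDC_KEY, request.getHeader(HEADER_NAME))) {
            filterChain.doFilter(request, response);
        }
    }
}
```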
Now all container log events emitted as part of the same request share the same `correlation-id`, finally allowing them to be correlated:
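The result might look like the following two events from different containers, tied together by the same `correlation-id` (all values are illustrative):

```json
{"@timestamp":"2016-08-30T10:17:42.005+02:00","@version":1,"app":"nginx","correlation-id":"0b15d1e07c1f2d9aa5d1c2f4e3b4a596","message":"GET /greetings/fred HTTP/1.1","status":"200"}
{"@timestamp":"2016-08-30T10:17:42.021+02:00","@version":1,"app":"product-service","correlation-id":"0b15d1e07c1f2d9aa5d1c2f4e3b4a596","message":"Sending greeting","level":"INFO","user":"fred"}
```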
Outlook
In an upcoming blog post of this series we will have a look at how we can enhance the JSON log events with even more helpful information for debugging, and how to investigate a request spanning multiple services, also known as distributed tracing.