Error Tracing and Routing in Microservice Architectures

Even the most well written code will fail at some point, even if merely due to uncontrollable external dependencies, and a developer will be tasked with figuring out exactly what happened and fixing it going forward. For monolithic projects, this can be as simple as searching the relevant log file for the error and analyzing the stack trace, but with the current trend towards distributed processes, this task becomes increasingly difficult. Nowadays, many projects consist of an increasingly complicated densely-connected graph of “microservices” – small independent services, which each perform a specific function and pass data around to others to accomplish complicated tasks. This introduces a variety of complications to the error-tracing process, as the developer must now search through dozens of log files on as many servers, each typically containing the disjointed mess of data that results from multithreaded processes. This process is error-prone, time consuming, and irritating, resulting in increasingly large delays and confusion when errors occur.

In Client Reporting, we face this problem often. A seemingly simple task, such as creating a report, may touch a dozen separate log files in a few seconds for example, we have services handling dependency checking, data gathering, report creation, binding, storage, approval, queueing, publishing, and distribution, all running multiple instances, before returning a generic error message with no hint as to where the error occurred. It was clear that we needed a more sophisticated error-tracing infrastructure.

Enter the Exception Monitoring Service, our concept of a Java server and library that unobtrusively plugs into existing services to intercept, merge, and provide easy visualization of distributed logs. Additionally, it provides Exception Handlers that can be used to automate responses when errors occur. Once integrated into a set of microservices via a few simple configuration changes, the Exception Monitoring Service will transparently package a set of contextual information into all communications, including workflow identifiers, application names, usernames, and exception information. It then uses this information to merge relevant chunks of disparate log files into one cohesive picture of what we will call a ‘workflow’ – a sequence of service calls kicked off by a business user, or batch job in order to execute a specific high-level task. By automatically merging the relevant parts of the logs of each of these services, the Exception Monitoring Service takes what was once spread across thousands of lines in tens of log files, and makes it immediately available and searchable on a single page. With this in place, developers can configure handlers to run when exceptions occur, ranging from simply emailing them a stack trace to updating dashboards automatically.

While the idea behind this is simple, creating such a service and integrating it into our existing infrastructure is complex. From an integration standpoint, challenges include: transparently intercepting and forwarding log messages, transparently packaging context information into service calls, and tracking workflows across servers. A central issue here lies in the keyword ‘transparent’; we need to minimize the work required to integrate existing services into this framework. Requiring extensive changes to our existing codebase introduces cost, effort, and risk, which presents a barrier to widespread use. Additionally, we have the challenges of reconstructing accurate merged logs, visualizing them, and designing handler interfaces. Below we will describe our initial approach to solving these challenges.

Modifying Service Calls

When our services call each other, they typically send the name of the service method they want run, and whatever parameters are required. We need to include a set of contextual information into all calls, but we can’t require developers to change their code to do so. Ideally, we want each service, on receiving a request, to also receive a chunk of data containing the following:

  • Correlation Id: an identifier, generated when the first request in the service call chain is kicked off and is transferred between service calls for the duration of the call chain.
  • Requester: The username of the person who kicked off the initial service request.
  • SourceApplication: The name of the application that initiated the call chain.
  • SequenceNumber: An identifier, incremented by each service that shows where we are in the call chain.

The vast majority of our services communicate by means of protobuf messages, which are automatically generated objects based on a simple description of the desired message. As such, it isn’t possible to transparently modify these messages – they simply don’t have any space to store our context information. Luckily, these messages need to be translated into data streams to send over the network, and then rebuilt by the receiver.

In our applications, this translation is typically done by a set of interceptors included via Spring configuration. This gave us an opening: create our own interceptor that mimics existing behavior, but injects our own data into each message.

This ContextInterceptor could be swapped into the Spring configuration without any code changes, and silently pass context information along with service calls. A similar interceptor exists on the server side to listen for incoming messages, and a similar replacement with our version, the ContextListener, pulls the context out of the message before passing it on to the service. In JAX-RS, this could be accomplished using a ClientRequestFilter to insert the context into messages on the client side, and a corresponding ContainerRequestFilter to pull it back out server side. Schematically, what we’ve done looks like this:
BlackRock Exception Monitoring

With this change in place, we could transparently pass context information around and make it available on a thread-local basis. Next we needed a way to get it automatically sent to a central server in real time.

Intercepting Log Messages

Our existing codebase makes heavy use of the Apache Log4j infrastructure, which provides logging services via a set of Java classes called ‘appenders’. Developers use a configuration file, called log4j.xml, to configure appenders for their application. A typical appender, which writes all logs to the console, may look like this:

BlackRock Exception Monitoring

To avoid wrapping all existing logging statements, we decided to create our own ContextAppender, which could then be included via a configuration change to log4j.xml. We use this appender to forward log messages and context information to our Exception Monitoring Service. This is implemented as two sub-appenders. The first is a generic log4j SocketAppender, which is used to send all logs to an instance of the open source ELK stack (described below), where they are indexed and made available for visualization. This is customized to minimize delays due to connection errors, and to selectively drop information, such as server heartbeats, that don’t relate to a workflow. The second appender is a custom-built ExceptionAppender, which forwards exception data directly to our Exception Monitoring Service. This appender includes a lot more data, including stack traces and a hashmap of data the developer can provide. Because of the importance of not missing exceptions, we utilize our internal messaging system, which supports queuing should the Exception Monitoring Service itself be down.

Tracking Process Workflows & Reconstructing Merged Logs

When our services talk to each other, each service only knows its immediate neighbors in the call chain. It is in general impossible to determine where a workflow was kicked off, and by whom. We enable this via the contextual information we are now able to pass around. Specifically, our service automatically constructs a unique identifier whenever a new workflow is started, which is transferred between service calls and persists throughout the call chain. With this identifier in place, it is possible to isolate only those log entries that relate to a specific workflow, resulting in a single merged log for the entire workflow. For the most part, this is sufficient to generate a merged log, but occasional timing errors and network latency can cause this log to have some out of order entries.

To solve this, we include additional call-chain information that allows the order to be reconstructed, called a SequenceNumber. The naïve approach to sorting this information would be to include an integer that is incremented each time a line is written to the log. While this works within each microservice, it requires information to be sent back up the call chain in order to work across service calls, and thus fails completely for asynchronous calls. The reason for this failure can be illustrated by the following simple example.

Consider the case of two services, A and B. Service A writes a line to the log, calls service B, and then writes another line to the log. Without knowing how many times Service B wrote to its log, Service A can’t know what the ‘next’ SequenceNumber is. This requires that Service A wait until Service B is complete before continuing, which prohibits asynchronous service calls. Schematically, the log messages we would receive from this example would be something like this, with the SequenceNumber in angle brackets:

A: Calling B
B: Received Request
B: Responding to Request
A: Response returned from B

Effectively, we needed to develop a piece of information that could be used to sort the vertices of an arbitrary tree, according to a depth-first search, without requiring any information to be passed back up from child vertices to their parents. Additionally, we wanted this information to be easily sortable. We solved this by implementing the SequenceNumber as a string which transforms as follows.

Each service maintains its own SequenceNumber, and represents it as a string of lower case letters, with a representing 1, b representing 2, and z representing 26. After reaching z, a new a is appended to the end and the process continues, with za representing 27, zb representing 28, and so on. Using this ordering is critical because it preserves the lexagraphic order when sorted: za comes after z but before zb. When a service is called, it is provided with the current SequenceNumber, and appends .a to the end. This process allows the ancestors of a vertex in the call chain to act as a namespace, generating unique, ordered, SequenceNumbers:

A: Calling B
B: Received Request
B: Responding to Request
A: Response returned from B

By structuring the SequenceNumber this way, we were able to sort log messages coming from any number of interconnected services into logical order.

Designing Exception Handler Interfaces

Having centralized, searchable, log information is useful beyond simply being available for developers to use for debugging. For example, to make use of it, we included a set of generic Exception Handlers, which can be configured to run based on incoming exception data to perform a variety of tasks. Developers could set up handlers to run whenever a defined set of criteria is matched, such as the application causing the exception, the exception type, or who the requester is. As an additional example of using the Exception Monitor Service, a development team could create a handler that sends emails to a support mailbox whenever an exception occurs in one of their applications, containing a merged log detailing the event. This can also be used to send user friendly messages to nontechnical users when errors occur. For instance, if a user drops a file in some designated directory to be processed, this handler could ask them to drop it again if any errors occur.

To facilitate complex handlers, we included a method by which processes can store data to be sent to our service in the event of failure, which is made available in exception handlers. Using this method, developers can configure handlers to automatically perform very specific tasks, such as updating specific rows of a dashboard. Additionally, we provide a simple interface for designing custom handlers should developers require one.

In total, an integrated service architecture looks like this:

BlackRock Exception Monitoring

Visualizing Logs

Once we had solved the problems of getting logs and contextual information to our service, we needed a way to visualize and search it. We chose an existing, open source set of services called the ELK stack (Elasticsearch, Logstash, Kibana). Our ContextAppender forwards log data directly to a Logstash instance, which parses it and forwards the result to Elasticsearch to be indexed and stored. Kibana provides a web-based interface for visualization and searching of the data, along with real-time dashboard functionality. With this in place, developers can watch all the logs of their applications for exceptions in real time.

Here is an example cross-service merged log, generated in real-time as a workflow involving four separate services is invoked:
BlackRock Exception Monitoring

This merged log is generated by simply filtering on a specific ‘correlationId’, which was generated when the workflow was kicked off, and persisted throughout, and then sorting lexicographically by the ‘sequenceNumber’ to order the events appropriately. Many other columns of data are available, but at a glance, we can see that the process was kicked off by the ReportIngesterClient, and set off a chain of calls ending in the DocumentRepository, where a SocketException occurred.

With a proper Exception Handler in place, a developer would receive this log within seconds of the error occurring, and know the root cause within a minute or two. Contrasting this with an angry call from a user followed by the drawn out process of first finding, and then tracing this error across several lengthy and almost illegible log files shows the vast savings achieved. By using the Exception Monitoring Service, the developer’s approach to errors becomes proactive rather than reactive.