"That sounds...interesting" is a pretty common response when I explain to people that "I spend a lot of time working on logging, and no really, it's quite exciting." Yet I can never seem to drive home just why I find logging so fascinating as an area of technical inquiry.
In practice, logging is one of the few areas of data science in which one can work on the entire data lifecycle, from the generation of log data, to their transport to a data store, to their eventual analysis.
Most people, technical or otherwise, see logging as a relatively narrow thing. Programmers think of logging as
console.log("Interesting message"); or
logger.debug("Interesting message"); or, more commonly, the things you look at to diagnose problems. Non-technical people think of logging as some back-end thing that computer applications do rather than an essential source of data generation.
But logs really matter.
They matter to those aforementioned engineers who need the right information to diagnose their issues. They matter to cybersecurity teams that want to detect unauthorized system access. They matter to the marketer attempting to understand product usage. They matter to the data protection officer who seeks to understand which employee leaked sensitive data to the press.
In each of these cases, the social implications of properly leveraging log data are profound, but I'm just going to focus on the technical today. In particular, I want to explain why implementing and improving the full life cycle of logs is an interesting problem.
We're going to break down that life cycle at a high level, outlining some of the key components and interesting questions at each of the following stages:
- Generation
- Transport and Integration
- Analysis
The goal is to give you a mental schema from which to understand how to divide up the challenges of logging. But first let's start with a quick outline of what I'm actually talking about when I reference "logging."
What is logging?
Logging is the process of producing a record of something that occurs in a computer system. The broader concept of a log is important to virtually all computer systems, but I will defer to this exceptional post from Jay Kreps (one of the creators of Kafka), as he explains this lower level concept far better than I ever could.
Logging as understood here is a multi-stage process that touches on everything from the question of "What should we record?" to "How should we understand the data we just recorded?" Those questions vary depending upon the type of log, of which there are many:
- Error Logs: Logs that tell you about something that went wrong.
- Usage Logs: Logs that help track specific usage statistics.
- Server Logs: Logs about the operation of a specific server.
- Performance Logs: Logs designed to track a system's performance over time.
- Audit Logs: Logs that tell you what a user did.
In other words, a log is a record of something that happened that someone wants to track. Simple as that. Let's talk about where things get messy.
1. Log Generation
Sure, we know what a log is in a technical sense, but how are logs created and what thought goes into that? The short answer is "in code," but that doesn't do a great job of telling us anything at all.
It is true, though, that most logs are generated programmatically. This means that there is a programmer somewhere - let's call her Sarah - who explicitly decided that x should be recorded, where x could really refer to anything at all that can be captured in text.
But there is a ton of hidden complexity in that previous statement. Let's break down the things Sarah has to think about if she wants to record an action:
- Relevant Information: What information is relevant for the action being logged? Does Sarah understand the end use-case well enough to say what's important?
- Format: What format should the log be in? Should it be human readable? Machine readable? JSON and XML are very popular, but not necessarily suited for all types of logs.
- Enrichment: Does Sarah have everything available in the local scope to log the information she needs? If not, does she need to implement enrichment (the adding of additional information) of the log data?
- Performance: Are there unnecessary string conversions, memory allocations, etc. in the logging implementation that might impact performance?
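To make Sarah's choices a little more concrete, here is a minimal sketch using Python's standard logging module. It is only one way to approach the problem: the JsonFormatter class, the build_logger helper, and the field names (service, user_id, action) are my own illustrative assumptions, not part of any particular framework.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one line of JSON (the "Format" decision)."""

    def __init__(self, static_fields=None):
        super().__init__()
        # Enrichment: context that is not available at the call site.
        # The field names are hypothetical.
        self.static_fields = dict(static_fields or {})

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        payload.update(self.static_fields)
        # Values passed through `extra=` become attributes on the record.
        for key in ("user_id", "action"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream, static_fields=None):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(static_fields))
    logger.handlers = [handler]
    return logger


if __name__ == "__main__":
    buffer = io.StringIO()
    audit = build_logger("audit", buffer, {"service": "blog"})
    # "Relevant Information": who did what, passed as structured fields.
    audit.info(
        "User %s downloaded a report",
        "u-1",
        extra={"user_id": "u-1", "action": "download"},
    )
    print(buffer.getvalue(), end="")
```

The message arguments are passed separately rather than pre-formatted, so the string is only built if the record is actually emitted, which speaks to the performance concern in a small way.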
In other words, logging is a specific part of programming, but it's one that is closely tied to data generation and storage. Even beyond the individual programmer, a system architect might need to think about whether their product is tracking all of the information needed. Such questions might be simple in the case of error logging but far more complex in the case of audit workflows or analytic logging.
In practice, engineers will rely on all sorts of tools in the process of generating logs. Some of the most popular in the Java world include log4j, Logback, slf4j, and other less-used frameworks. These do a lot of the heavy lifting around processing known types of log messages at a certain "Level" of importance.
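The idea of a "Level" is not specific to Java. As a rough illustration, here is how the same concept looks in Python's standard logging module; the ListHandler class and the demo_levels function are hypothetical helpers I wrote for this sketch, and the messages are made up.

```python
import logging


class ListHandler(logging.Handler):
    """Collect messages in memory so the effect of levels is easy to see."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


def demo_levels(threshold):
    """Log at three levels and return what survives the threshold."""
    logger = logging.getLogger(f"levels.demo.{threshold}")
    logger.setLevel(threshold)
    logger.propagate = False
    handler = ListHandler()
    logger.handlers = [handler]

    # Arguments are interpolated lazily, only if the record is emitted.
    logger.debug("cache miss for key %s", "user:42")
    logger.info("request served")
    logger.error("database unreachable")
    return handler.messages
```

Calling demo_levels(logging.INFO) keeps the INFO and ERROR messages and drops the DEBUG one, while demo_levels(logging.DEBUG) keeps all three.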
Thus, the actual act of generating logs is dealt with in most places. Yet even these well-optimized frameworks cannot answer the difficult questions of logging, such as what to track and how to format that information. The above tools work for Java, but different tools exist for different types of applications. In practice, this reality can complicate the process of actually using logs.
2. Transport and Integration
So let's assume we have a perfect answer to the generation question. Our systems are outputting log data, consistently formatted, for the events we care about. That's great!
Unfortunately, there's not much we can do with our logs as written. Here is a single log taken from this blog, for reference (scroll right):
This is a log I would consider pretty well formatted, but it is still virtually indecipherable for someone who hasn't spent days of their life growing accustomed to staring at JSON data. Believe it or not, I've worked with a variety of groups (software engineers, for one) that actually read logs in such formats as an analytic method. A software engineer may be used to reading logs like this, but the approach simply doesn't scale across many systems, each of which produces hundreds to millions of logs per day.
Therefore, we need a way to extract the things we care about from high-scale log data. Here we encounter another interesting technical problem, particularly in real-time use cases such as monitoring real-time data streams and alerting.
If we assume that multiple logs are being written each second, and each is the size of the one above, we need a tool to analyze them. Unfortunately, most applications can't read log files very well. Most computer applications deal with structured data or at the very least data in a database. Therefore, the next challenge is to take logs from their source and get them to an analytic tool.
The process of batched and real-time transport of logs is different depending on the environment, but a few considerations are virtually always involved:
Where are the logs now?
- Are they on a server host? An Internet of Things Device? Your local computer?
Where do we want the logs to be?
- Depending on the analytic tool, logs will need to be replicated to different data storage mediums.
Do logs need to be processed in real-time?
- If we are attempting to stop malicious actors on our system, the answer is usually yes.
Are all logs in a consistent format (e.g. JSON, XML, log line, etc.)?
- Usually, the answer is no.
How mission-critical are these logs? Do we need to guarantee that we will not lose logs in transit?
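The format question tends to bite first. As a small sketch of what handling mixed formats might involve, the function below accepts either a JSON object or a plain log line. The line layout it expects (timestamp, upper-case level, message) is purely an assumption for illustration; real systems will each need their own pattern.

```python
import json
import re

# Hypothetical plain-text layout: "<timestamp> <LEVEL> <message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$"
)


def parse_log(raw):
    """Turn one raw log entry into a dict, whatever its source format."""
    raw = raw.strip()
    if raw.startswith("{"):
        try:
            parsed = json.loads(raw)
            if isinstance(parsed, dict):
                return parsed
        except json.JSONDecodeError:
            pass  # Fall through and try the plain-text pattern.
    match = LINE_PATTERN.match(raw)
    if match:
        return match.groupdict()
    # Keep unparseable entries instead of silently dropping them.
    return {"message": raw, "unparsed": True}
```

Returning unparseable entries with a flag, rather than discarding them, is one way to avoid losing logs in transit when a new format shows up unexpectedly.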
The full scope of work around log transport and integration could fill hundreds of pages (and follow-up posts!), but suffice to say there are a great number of tools that make such processes easier. To name a few of the most popular:
- Kafka: Serves as a log store, message broker, and log queue.
- Logstash: Takes things from input A and sends them to output B. Supports filtering on the way.
- FileBeat: Tracks files on disk and transports messages as they are written.
- Redis: Can serve as a message queue in transit to a data store.
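None of these tools is magic. To give a feel for the underlying idea, here is a toy sketch of "track a file, filter, and forward" using only the Python standard library. It is not how FileBeat, Logstash, or any real shipper is implemented; the read_new_lines and ship functions are hypothetical names, and an in-memory queue stands in for the destination.

```python
import io
import queue


def read_new_lines(stream, offset):
    """Return (complete lines written since `offset`, new offset)."""
    stream.seek(offset)
    lines = []
    while True:
        line = stream.readline()
        # Stop at end of data or at a line that is still being written.
        if not line or not line.endswith("\n"):
            break
        lines.append(line.rstrip("\n"))
        offset = stream.tell()
    return lines, offset


def ship(lines, out_queue, keep=lambda line: True):
    """Forward lines that pass the filter; return how many were sent."""
    shipped = 0
    for line in lines:
        if keep(line):
            out_queue.put(line)
            shipped += 1
    return shipped


if __name__ == "__main__":
    source = io.StringIO("INFO started\nERROR disk full\n")
    destination = queue.Queue()
    lines, position = read_new_lines(source, 0)
    ship(lines, destination, keep=lambda line: line.startswith("ERROR"))
    print(destination.get_nowait())
```

Remembering the offset is what lets a shipper pick up where it left off, and refusing to forward a half-written line avoids sending truncated records downstream.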
Data integration encompasses a huge scope of challenges, in part because every system is slightly different, so I'll leave it at that for now.
3. Analysis
Analysis finally gets us to the question of "So what?" Presumably we invested time in generating logs properly and infrastructure in transporting them so that we could ultimately analyze them.
Of course, I can't enumerate the exact workflow upon which an analyst might embark to analyze logs. I can, however, outline a few things that are universally useful for log analysis based on my experience.
Schema and Normalization
Remember how we mentioned up above that programmers have the option of defining format for logs? This extends to the fields in the logs, so you might have one log with the time the event occurred labelled as "time" and another labelled as "timestamp" and another labelled as "ts" or just "t".
Particularly when you are dealing with data from multiple systems (pretty much always), you need to normalize to a defined schema of event records.
Without this in place, analysis will be virtually impossible.
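As a minimal sketch of what that normalization might look like, the function below maps the four time field names mentioned above onto a single "timestamp" field. The canonical name, the assumption that numbers are Unix seconds, and the assumption that timestamps without a zone are UTC are all choices I made for illustration.

```python
from datetime import datetime, timezone

# Field names that different systems use for the event time.
TIME_ALIASES = ("timestamp", "time", "ts", "t")


def to_iso_utc(value):
    """Convert a Unix-seconds number or ISO 8601 string to ISO 8601 in UTC."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value, tz=timezone.utc).isoformat()
    text = str(value)
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: timestamps without a zone are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).isoformat()


def normalize(record):
    """Return a copy of the record with one canonical "timestamp" field."""
    normalized = {k: v for k, v in record.items() if k not in TIME_ALIASES}
    for alias in TIME_ALIASES:
        if alias in record:
            normalized["timestamp"] = to_iso_utc(record[alias])
            break
    return normalized
```

With this in place, a record logged as {"ts": 0} and one logged as {"time": "1970-01-01T00:00:00Z"} end up with the same "timestamp" value and can be compared directly.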
Filtering and Aggregation Capabilities
One of the most powerful tools in the analyst's arsenal when it comes to log analysis is the ability to answer the question of "Show me all clicks on this button." or "Tell me how many times this error was thrown." These types of filter and count actions are particularly common in the world of usage analytics.
Given the scale of log data, however, any system needs to be able to filter and aggregate at significant speed, and enabling these workflows should be a priority in the deployment of a log management system.
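At small scale, these two operations are easy to sketch in a few lines of Python. The version below works on an in-memory list of dictionaries, which a real log management system would replace with indexed storage; the function names and the record fields in the usage note are hypothetical.

```python
from collections import Counter


def filter_records(records, **criteria):
    """Keep records whose fields equal every given criterion."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


def count_by(records, field):
    """Count how many records share each value of `field`."""
    return Counter(record.get(field) for record in records)
```

For example, filter_records(logs, event="click", button="signup") answers "Show me all clicks on this button," and count_by(filter_records(logs, level="ERROR"), "error") answers "Tell me how many times this error was thrown."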
Alerting and Ongoing Monitoring
How do we act on log information as it is being generated? For cybersecurity workflows, this is a significant area of focus. While machine learning is frequently hailed as an answer to these challenges, it is but one approach to the question of alerting.
In practice, human-driven approaches tend to be the most effective here, particularly when analysts set up alerts based on their own knowledge of the log data.
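A human-written rule can be as simple as "alert when one user fails to log in several times within a minute." The sketch below shows one way to express such a rule in Python. The FailedLoginAlert class, the event name "login_failed", the field names, and the default thresholds are all assumptions for illustration, and timestamps are assumed to be numeric seconds.

```python
from collections import defaultdict, deque


class FailedLoginAlert:
    """Fire when a user has too many failed logins in a sliding window."""

    def __init__(self, threshold=3, window_seconds=60):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # user -> recent failure times

    def observe(self, record):
        """Return an alert message if the rule fires, otherwise None."""
        if record.get("event") != "login_failed":
            return None
        user, now = record["user"], record["timestamp"]
        times = self._recent[user]
        times.append(now)
        # Drop failures that have fallen out of the window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        if len(times) >= self.threshold:
            return (
                f"{len(times)} failed logins for {user} "
                f"within {self.window_seconds}s"
            )
        return None
```

The rule is evaluated one record at a time as logs arrive, which is what makes it usable for ongoing monitoring rather than only after-the-fact analysis.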
Deduplication
Oftentimes, the nature of computer systems causes us to record specific events multiple times. This isn't a major issue if we're recording errors, but it is if we want to get an accurate count of users on our platform. In such cases, data deduplication workflows can make log data significantly more valuable.
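The simplest form of deduplication keeps the first record seen for each key. Here is a short Python sketch of that idea; it assumes each record carries some identifying field (I use a hypothetical "event_id"), which is not a given in real log data.

```python
def deduplicate(records, key_fields=("event_id",)):
    """Keep the first record for each distinct combination of key fields."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(field) for field in key_fields)
        if key in seen:
            continue  # Already recorded this event once.
        seen.add(key)
        unique.append(record)
    return unique
```

When no single identifier exists, passing several fields, such as key_fields=("user", "action", "timestamp"), is one possible fallback, at the risk of collapsing events that were genuinely distinct.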
This raises the natural question of what tools exist to conduct these sorts of analyses. Perhaps the most popular answer to this question is Splunk, which has built a massive technology company purely from the analysis of logs.
That said, a number of other powerful tools exist for log processing:
Many, many more exist, but they obviously vary based on the application of log analysis. In follow-up posts, I may dig into some of these in more detail as we touch on particular types of logs.
The goal of this post is to touch on a lot of components of logging and, more generally, to show that logging and log analysis are a complex challenge.
Despite a wealth of tools that exist, it continues to be a fascinating problem with data integration challenges akin to those across enterprise data systems. The main difference is that you don't need to be a Fortune 500 company to encounter challenges with logging. Just deploy a few different software programs with a few hundred users and you'll see the challenges for yourself.
In other words, logging actually is interesting. More importantly, doing the work to generate and process logs really pays off. You can trust me on that one.