Every developer knows (or at least should know) about application logging and different log levels. After all, logging is one of the most important functionalities every piece of software should support. Yet, at the same time it is also one of the more difficult to do proper.
Logging itself is a fairly simple process. Basically, it comes down to “write some application state or event to somewhere”. Which could mean writing “Application started” to stdout, but it could also mean writing “2016-11-27T16:27:12.324Z [INFO] [thread: main] Application started in 100 ms” to a file, database, broker, etc. I will not explain how to log, how to write proper log messages or how to handle exceptions, but I will give an example why log levels matter, because I recently had an argument about it with a fellow programmer.
When I mean “log level”, I mean the usual ERROR, INFO, DEBUG, etc. classification contained in log messages. It makes sense to use at least these three levels. Many frameworks provide support for more than these, but these three are the bare minimum in my opinion.
Of course there are different opinions of whether a situation is an ERROR or rather an INFO classification. You can use whatever definition works for you and your colleagues, but USE them!.
Why different levels make a difference…
Imagine the following situation. You are on call and responsible for a system’s operation, when suddenly a red-light starts flashing, indicating the system isn’t working properly. Opening the log of the affected application, many events are shown. This log might look like this…
2016-11-27T16:27:12.324Z Business Event 1 occurred for customer 4711
2016-11-27T16:27:12.324Z Event invalid. Rejecting.
2016-11-27T16:27:12.325Z Business Event 2 occurred for customer 4712
2016-11-27T16:27:12.326Z Saving event to database
2016-11-27T16:27:12.326Z Executing some business logic
2016-11-27T16:27:12.401Z Business Event 3 occurred for customer 4711
2016-11-27T16:27:12.402Z Business Event persistence failed.
2016-11-27T16:27:12.402Z Database connection broken
2016-11-27T16:27:12.500Z Business Event 1 occurred for customer 4712
2016-11-27T16:27:12.501Z Event invalid. Rejecting.
2016-11-27T16:27:12.502Z Executing some business logic
Be honest… did you see the responsible log entry at first glance?
I didn’t. Ok.. I have to admit, this is a very, very simple and reduced example and the log messages are not the best.
Point being, though…
Do you have any indication of the importance of any of the above log messages to the system?
Does a rejected event leave the system in a operational state?
Will the broken database connection recover?
Developers will most likely have an idea of what’s going on, but who says, that developers will be in charge of monitoring the system?
Now imagine having many thousands, hundred-thousands, or even more of these entries and you are the one in charge of analyzing the latest outage. Not an easy task at all, right? (Even without the interference of thousands of additional stacktrace lines)
Ok, let’s try again with the following log…
2016-11-27T16:27:12.324Z [DEBUG] Business Event 1 occurred for customer 4711
2016-11-27T16:27:12.324Z [WARN] Event x invalid. Rejecting.
2016-11-27T16:27:12.325Z [DEBUG] Business Event 2 occurred for customer 4712
2016-11-27T16:27:12.326Z [DEBUG] Saving event to database
2016-11-27T16:27:12.326Z [INFO] Executing some business logic
2016-11-27T16:27:12.401Z [DEBUG] Business Event 3 occurred for customer 4711
2016-11-27T16:27:12.402Z [WARN] Business Event persistence failed.
2016-11-27T16:27:12.402Z [ERROR] Database connection broken
2016-11-27T16:27:12.500Z [DEBUG] Business Event 1 occurred for customer 4712
2016-11-27T16:27:12.501Z [WARN] Event invalid. Rejecting.
2016-11-27T16:27:12.502Z [INFO] Executing some business logic
How about now? Which of the above statements is most likely responsible for the outage? Not sure about you, but I’d bet on the ERROR event.
Finding the correct log level might not prove too easy. However, I think the most discussed question when it comes to log levels is: Is an unexpected exception automatically an ERROR?
Well? What do you think?
(Note: I’m not talking about the system-crashing type of exceptions, which ARE obviously errors. I’m talking about the exceptions, which occur from time to time but doesn’t make the system go BOOM.)
I for myself classify exceptions and events as ERROR if the “2 a.m. rule” applies. “If you’re on call, do you want to be woken up at 2 a.m. if this condition happens?” Based on this definition.