People are divided on how to log, what to log, how much to log. A never ending discussion this is. In addition many open source libraries are available for logging. Not to mention many standards. I am not going to go in details of what is available out there. Use Google to pick your poison. What I am going to discuss here is what I think makes most sense with available technology.
- Log file should be machine readable.
Why? How often do you or someone checks the file vs if you have a monitoring solution that scrapes the logs. If logs are in machine readable format it simplifies the monitoring solution, spends less resources both during logging and distilling for events. And do not forget if you want to process your logs later to find performance bottlenecks or create transactions for troubleshooting. Same goes for stats files which are usually the snapshots of current state of application. There are few advantages of machine readable files.
- It is fast to write.
Because you would keep the formatting to the minimum. For example use the raw unixtime (or millisecond or microsecond resolution). I would suggest at least millisecond resolution. And use a single character severity (e.g. D for debug, S for status, E for error and so on).
- It is fast to read and re-process for monitoring and management.
Because timestamp is already in machine form it is easy to process it. Severity of messages is single character which again simplifies matching.
- Log as much as possible but log what makes sense.
This one is a double-edge sword and should be used carefully. Storage is cheap but Disk IO is not which could become a performance bottleneck for real-time applications. On the other hand more logging you have, easier it will be to troubleshoot. So think through and log what is must for troubleshooting. Main two options in my opinion are to use remote syslog type logging OR use different thread in your application to log to disk. Remote syslog is a vetted solution which I do not need to go through here. I will explain the second solution in next post.
So what should you log? Lets see. I would log following in below order. I will explain my motive in future posts.
- Short-code Severity of event
This is obvious. That is certain that Error and Fatal conditions are interesting for any monitoring solution. I prefer a single character short name. For example S for status, D for debug, E for error, F for fatal, W for warning. You could also choose to use integer levels but using these single characters have a distinct advantage for readability of log messages. Though here I am discussing that file should be a machine readable and user should convert it to human readable, yet there are times when you may want to look at raw data and a single character would convey more information than integers which you have to decode. And with single character you still remain within my logging standard.
- Timestamp of the event
Do not need to format the timestamp. This way you save few CPU cycles that would otherwise be spent in converting a Unix timestamp to a readable format (usually YYYY-MM-DD HH:MM:SS.ms) during logging. You would grab the current timestamp from system, preferably in millisecond resolution. Why? Obviously for real-time application you need to see how quickly application is performing. So millisecond or microsecond resolutions will help identify the bottlenecks. In addition it will allow to sort events from multiple threads easily and correctly create the time-line of events. Second advantage comes in when distilling the file for information. If the timestamp was readable, then monitoring application would need to read more and convert it to machine readable. So you hit two birds with one stone by using raw timestamp.
- Thread identification for Multi-Threaded applications.
This will allow easy identification of various work streams or threads in your application. Maybe you have multiple threads for distributing single incoming stream of data. All threads will be logging similar messages. So having a thread identification would help you distinguish which thread is logging what and isolate which one is experiencing problems. Either you can use numeric IDs here OR use something that is more readable. I would not use the thread ID returned by ptherad_self that being long, instead use an easy to identify number/short code in application. For example M for main application thread; L For logging thread; S for status thread; D1, D2, D3 and so on for data distribution threads and so on.
- An error code if your application supports that already (optional)
Using error code would simplify monitoring again. You would report less data and monitoring would easily match it with monitoring rule. But if you do not have it then don’t fret. You could still use string patterns as we discuss in next point.
Must report what has actually happened. This could also include data, data structures or whatever you think is useful for troubleshooting. Make sure that there is some kind of string pattern or token that can be used by monitoring software.
- Filename, function, class name, line number or whatever else that makes sense
This is self explanatory. Either you can use built in macros that will report some of these or you can define short codes in your application.