I have the unfortunate task of reporting an unusual server outage that affected some clients today. Earlier we noticed errors coming from one of our core monitoring servers. Shortly thereafter, the server stopped responding. We escalated the issue to our hosting provider, who immediately began to triage the situation. Unfortunately, the diagnosis was not good.
It appears that at some point in the day the server, which runs Linux, experienced a kernel panic. This is the equivalent of the "Blue Screen of Death" in the Microsoft world: a fatal system error from which the operating system cannot recover without a restart.
System crashes are a normal part of IT operations, but what is extraordinary in this case is that the system seemed to keep operating partially during the crash, which meant the failure went undetected for much of the day. Although the system appeared to be running normally, it was not collecting data. Once the problem was isolated and the system restarted, we learned that stats were missing for many of the customers on this server, which brings me to the point of this update.
Although the physical machine was restarted and is functioning properly at this time, frankly we no longer trust it. So in the next day or two we will be migrating the data to a new server as a precautionary measure. Beyond replacing this server, we were unnerved by the fact that the failure escaped notice for so long, and we are working on new plans for earlier detection and greater redundancy.
Only a very small percentage of clients was affected. If you are wondering whether you were among them, simply take a look at your stats for today: if they show data throughout the day, you were not affected. If you see several hours of zero visitors when you would otherwise have expected some, then it's likely your account was on the server in question. Please keep in mind that when we move the data in the near future, the server may be unavailable for up to an hour; however, we will do our best to minimize any further downtime.
On behalf of the team, I apologize for any inconvenience this incident may have caused, and I want everyone to know that we continue to refine and improve the system every single day.
Thank you for your patronage and your kind understanding as we work through these issues.