Wow! Today we had our first-ever complete server failure. After two years of continuous operation, with a whole slew of servers churning through over 500 million pieces of data daily, it was bound to happen! So I'm glad that is over with. Now we have to fix it and move on.
I'd rather give you the summary than bore you with the details, so let me just say that one of our "Engines" (that's what we call the servers that do the heavy statistics gathering) had a RAID controller failure today, which corrupted all of the data on that server. This, my friends, is about as much damage as can occur in a single server.
RAID is a technology we use in our servers to mirror data across multiple hard drives. This means that if one hard drive fails, there is a copy of the data on another, and the machine continues operating. Unfortunately, in rare cases the controller itself can fail, which damages the data on all of the drives at once. This is when you pray to the God of backups that you have another copy of the data elsewhere.
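For the curious, here is a toy sketch of why mirroring protects against a dead drive but not against a bad controller. It is only an illustration: the `Drive` and `MirroredArray` classes and the `corrupt` flag are made up for this example and have nothing to do with how a real RAID controller is implemented.

```python
class Drive:
    """A hypothetical drive: a dict of block id -> bytes."""

    def __init__(self):
        self.blocks = {}
        self.failed = False


class MirroredArray:
    """Toy mirror: every write goes to every healthy drive."""

    def __init__(self, drive_count=2):
        self.drives = [Drive() for _ in range(drive_count)]

    def write(self, block_id, data, corrupt=False):
        # Assumption for illustration: a faulty controller sends the
        # same garbage to every drive, so every copy is damaged.
        payload = b"\x00" * len(data) if corrupt else data
        for drive in self.drives:
            if not drive.failed:
                drive.blocks[block_id] = payload

    def read(self, block_id):
        # Any surviving drive can serve the read.
        for drive in self.drives:
            if not drive.failed:
                return drive.blocks[block_id]
        raise IOError("no healthy drive left")


array = MirroredArray()
array.write(1, b"stats")

# One drive dies: the mirror still returns the data.
array.drives[0].failed = True
survived = array.read(1)

# The controller misbehaves: the remaining copy is overwritten too.
array.write(1, b"stats", corrupt=True)
after_controller_fault = array.read(1)
```

In this sketch `survived` still holds the original bytes, while `after_controller_fault` does not, which is why a separate backup matters even when the drives are mirrored.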
The Good, The Bad, and the Ugly
The bad news is that as I write this update, a small - but still significant - percentage of the Woopra user base is without their live analytics fix! This can also cause some related issues, such as the WordPress plugin reporting errors when it cannot reach the server while you are trying to check your stats in the WordPress dashboard.
The ugly news is that no statistics are being collected while the server is down, so there will be a gap in your reports for that period once it comes back up. This is because when the server that tracks your site is out of operation, our other servers do not temporarily pick up the slack. For now, we simply have to get the server back online as fast as possible.
The good news is that we do have the historical information backed up. Thank God that we are paranoid and built a whole system for the purpose of backing up the other systems. Our backup systems have backup systems. :-) So, once the physical server rebuild is completed, we will restore the data from backup and be back in operation for the thousands of you who are currently affected.
What We're Going to Do About It!
First of all, let me say how sorry we are for any inconvenience this is causing those of you who are affected. I'm going to guess that for some of you this probably came at the worst possible time! Like, your site just hit the Digg homepage and you wanted to watch the glorious traffic in real time! Or you were watching people as they shop during this Labor Day holiday.
Well, we feel your pain. And we are actually already several steps ahead of the curve on this one. We've been developing a network of internal systems to ensure redundancy and scalability in cases where these events occur so that they become transparent to the end user (that's you guys).
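To give a rough idea of what that kind of redundancy means, here is a simplified sketch. It is not our actual system: the `Engine` class, the `track` function, and the engine names are hypothetical, and a real setup would need health checks, replication, and a lot more.

```python
class Engine:
    """A hypothetical tracking server that stores events in memory."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.events = []

    def record(self, event):
        if not self.online:
            raise ConnectionError(f"{self.name} is down")
        self.events.append(event)


def track(event, primary, standby=None):
    """Try the primary engine, then the standby if one exists.

    Returns the name of the engine that stored the event,
    or None if the event was lost.
    """
    for engine in (primary, standby):
        if engine is None:
            continue
        try:
            engine.record(event)
            return engine.name
        except ConnectionError:
            continue  # this engine is down, try the next one
    return None


primary = Engine("engine-1")
standby = Engine("engine-2")
primary.online = False  # simulate the failed server

# Without a standby, the event is simply lost.
lost = track("pageview", primary)

# With a standby, another machine picks up the slack.
handled_by = track("pageview", primary, standby)
```

With no standby the event goes nowhere and `lost` is `None`; with one, the visit is recorded on the second machine and the person watching their stats never notices.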
In the near future, a complete failure of one machine will not interrupt service, because another machine will temporarily take over. Unfortunately, this failure exposed the weakness just before it went away.
I suppose I didn't really need to say that. But in summary, just know that we are working very hard to get service restored to those of you who are affected, and we will be posting updates via the usual Woopra Status Twitter account.
Thanks for your patience and understanding as we work to resolve this matter.