Wikipedia:Wikipedia Signpost/2005-02-28/Power outage

Wikipedia suffered a major interruption last Monday when the Wikimedia servers lost power after circuit breakers tripped at their colocation facility. Read-only service was restored after several hours, and editing resumed just over a day after the outage began. In the end, no significant data was lost.

The servers crashed at about 22:14 (UTC) on Monday, 21 February, due to a loss of power at Neutelligent, the Tampa, Florida company that provides hosting services for the primary set of Wikimedia servers. After some scrambling to determine why network access had been lost, Jimbo Wales contacted Neutelligent and learned that the cause was the tripping of circuit breakers.

Although some of the servers had a second power supply on a separate circuit, the breaker on that circuit tripped as well, leaving all of the servers entirely without power. Speculative theories suggested that because the tripping mechanism is magnetic, one circuit breaker tripping could have caused others around it to trip, or that the servers all switching over to the second circuit at once might have overloaded it. However, the actual reason the circuit breakers tripped remains unknown.

The recovery
Once power was restored, the developers faced the challenge of bringing the servers back online after an unplanned shutdown. This was complicated by the fact that every database that was actively replicating at the time of the crash ended up corrupted. Fortunately, one slave database server had been halted earlier and was storing updates in a log rather than applying them to its copy of the database.

This intact copy was used to restore read-only service initially, while the logged updates were applied to bring the database back to its state just before the power went out (at most, a few edits in progress at the moment of the crash may have been lost). Additional technical details can be found in the report on Meta. Wiki editing was restored at 22:26 (UTC) on Tuesday, 22 February, just over 24 hours after the initial crash.
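
The recovery approach described above, an older but consistent copy plus a log of updates not yet applied, can be illustrated with a toy sketch. This is not MySQL's replication code: the function name `replay_log`, the dictionary used as a "database", and the (title, text) log format are all assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical log entry: (page title, new text). Not MySQL's real log format.
LogEntry = Tuple[str, str]


def replay_log(snapshot: Dict[str, str], log: List[LogEntry]) -> Dict[str, str]:
    """Apply queued updates, in order, to a copy of the snapshot."""
    restored = dict(snapshot)  # leave the intact copy untouched
    for title, text in log:
        restored[title] = text  # later entries overwrite earlier ones
    return restored


# State of the halted replica: an older but consistent copy ...
snapshot = {"Main Page": "v1", "Tampa": "v1"}
# ... plus updates that were received but not yet applied.
pending = [("Tampa", "v2"), ("Florida", "v1"), ("Tampa", "v3")]

restored = replay_log(snapshot, pending)
```

Because the snapshot itself is never modified, it can keep serving read-only requests while the replay runs, which mirrors the order in which service came back.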

In order to get back to a full level of service, the database needed to be copied again to several servers, and features such as watchlists and contribution histories remained mostly unavailable over the next day. Performance also stayed rather slow for several days.

LiveJournal comparisons
The incident had several points in common with the power outage last month that took down another community website with a sizable database, LiveJournal. The LiveJournal servers, hosted by Internap in Seattle, Washington, crashed on 14 January when someone improperly pushed the EPO (Emergency Power Off) button at the facility. Like Wikimedia, LiveJournal did not have its own uninterruptible power supply for the servers as an additional form of backup; whether the local fire code would allow this has not been determined.

LiveJournal, which like Wikimedia uses a MySQL database, needed a similar amount of time to restore service to users after its power outage. It also had to deal with corruption in the databases that had been shut down. Based on LiveJournal's experience, some developers suggested that the corruption in the active copies of the database might have been caused by the RAID hardware incorrectly reporting that data had been written to the hard disk when it was in fact still held in cache.
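
The suspected failure mode concerns the gap between "written" and "actually on disk". As a rough sketch of what software can and cannot control here, the snippet below flushes a write through Python's buffer and asks the operating system to push it to the storage device. The helper name `durable_write` and the file name are assumptions for illustration, and this is not how MySQL itself is implemented.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the OS to push it to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # Python's buffer -> operating system
        os.fsync(f.fileno())  # operating system -> storage device
    # Caveat: if the storage hardware acknowledges the write while the data
    # still sits in a volatile cache, a power loss can discard it anyway.


tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "commit.log")
durable_write(path, b"edit 12345\n")

with open(path, "rb") as f:
    contents = f.read()
```

The point of the sketch is the caveat in the comment: the software can do everything right and still lose data if the hardware reports completion too early.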

Brad Fitzpatrick, founder of LiveJournal, expressed his sympathies over the situation as the developers were working to restore service.

Fundraising progress during downtime
Pages related to the ongoing Wikimedia fundraiser (see archived story) were naturally disabled along with everything else. As an emergency measure, a static version of the main fundraising page was temporarily hosted on Angela's website, but the traffic there, with a Slashdot story about the crash also contributing to the load, soon exceeded her bandwidth quota. About eight hours after power was lost, Wikimedia was able to make another backup of the fundraising page available on its own servers.

Wikimedia CFO Daniel Mayer reported that donations were not significantly affected by the downtime. PayPal donations for Monday and Tuesday remained fairly steady at levels somewhat below the initial surge from the start of the fundraiser. Mayer did indicate that donations increased after the Slashdot story was posted, about four hours after the power was cut off.