Planning to Fail

Anyone who has ever used a computer knows that from time to time, computers fail. Hard drives go bad, network cards stop working, software crashes; you name it, and it is bound to happen. So we all take precautions (at least we should!) in anticipation of these failures. We back up our critical data to another computer or to some form of removable media (CDs, DVDs, flash drives, etc.). It’s all part of using computers and the software that runs on them.

This is a very important consideration at Covenant Eyes because, if we have a failure at some point in our systems, it may cause an ‘Internet Outage’ for our customers.

This past Saturday, a failure in one of our power supplies caused an outage of less than an hour for our filter users. I won’t bore you with all the gory details, but some background is appropriate here.

We have power supplies that are designed to carry us through short power outages (we also have a huge generator that can run all of our computers for days if needed). In addition to these power supplies, we have backup power supplies in case the primary supplies fail (we eliminate single points of failure wherever we can). So on Saturday evening, when the primary supply failed, the backup kicked in the way it was supposed to. Soon after the failover, however, the backup overloaded under the increased demand placed on it, and so it failed too. This caused our computers to stop running (which usually happens when you pull the power plug on your computer).
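For readers who like to see the arithmetic, here is a minimal sketch of the kind of check that exposes this sort of overload: for each supply, ask whether the remaining supplies could carry the full load if that one were lost. The names (`PowerSupply`, `single_points_of_failure`) and the wattage figures are hypothetical, chosen only for illustration; they are not our actual equipment or numbers.

```python
from dataclasses import dataclass


@dataclass
class PowerSupply:
    name: str
    capacity_watts: float  # hypothetical rating, for illustration only


def survives_loss_of(supplies, load_watts, failed_name):
    """Return True if the supplies other than `failed_name` can carry the load."""
    remaining = sum(s.capacity_watts for s in supplies if s.name != failed_name)
    return remaining >= load_watts


def single_points_of_failure(supplies, load_watts):
    """List the supplies whose loss alone would overload what is left."""
    return [
        s.name
        for s in supplies
        if not survives_loss_of(supplies, load_watts, s.name)
    ]


# Made-up numbers: the backup is smaller than the load it must take over.
supplies = [PowerSupply("primary", 3000), PowerSupply("backup", 2000)]
print(single_points_of_failure(supplies, load_watts=2500))  # prints ['primary']
```

In this made-up case the backup exists, but losing the primary still takes everything down, because the backup cannot carry the whole load by itself.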

We have system monitoring tools in place that notified us immediately, and our tech team was on the spot within minutes to find out what caused the failure. We moved the power cords to a spare power supply and restarted the servers right away. Because the servers had gone down so suddenly (due to lack of power), they took a while to come back up. During this slow startup period, our filter users were still unable to use the Internet.

Why am I telling you all of this? Several reasons, I guess.

First, so that you will know that we take outages very seriously here. We have built redundancy into our hardware systems wherever possible. We have also designed our database and server software to be fault tolerant and to work correctly even if some of our computers are down (which, by the way, happens very rarely).

Second, to let you know that when outages occur, we learn from them. Monday morning we sat down as a team and went through the sequence of events to figure out how to prevent the same thing from ever occurring again. We are already working with our Dell representative to get newer and more powerful power supplies as quickly as possible. We have also designed a way for the filter to keep functioning properly even when one of our database servers is down.
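As a rough illustration of what "functioning when one database server is down" can look like, here is a minimal sketch of one common approach: try each server in turn and use the first one that answers. This is not a description of our actual filter code; the function `query_with_failover`, the `AllServersDown` exception, and the server names are all hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllServersDown(Exception):
    """Raised when no server in the list could answer the query."""


def query_with_failover(servers: Sequence[str], run_query: Callable[[str], T]) -> T:
    """Try each server in order and return the first successful result.

    `run_query` is a caller-supplied function (hypothetical here) that
    contacts one server and raises ConnectionError if it is unreachable.
    """
    errors = []
    for server in servers:
        try:
            return run_query(server)
        except ConnectionError as exc:
            # Remember the failure and move on to the next server.
            errors.append((server, exc))
    raise AllServersDown(errors)
```

With this pattern, a single unreachable server costs a retry rather than an outage; only if every server in the list fails does the caller see an error.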

Finally, to let you know that we do not take your business for granted. The technology team continues to look for ways to improve our accountability and filter software, the web site and accountability reports, as well as our server systems and our network infrastructure.
