Report on the “Black weekend” at our hosting provider
On Friday 3rd February 2012, at around 8pm, our site went offline due to a catastrophic hardware failure within our hosting provider’s data centre. The root cause was multiple simultaneous hardware failures within their central storage repository.
Our hosting provider had configured their hosting environment to tolerate multiple hardware failures; however, the extent of the failures was such that the environment was unable to recover from its failed state. The hosting infrastructure was designed with redundancy in mind at every layer.
Starting from the top, two enterprise firewalls serve the hosted environment and protect it from the nasties of the Internet. If one firewall were to fail, connectivity to the environment would not be lost: all traffic would seamlessly transfer to the secondary, non-faulty firewall. Below these, a cluster of two load balancers actively balances connections to the various servers within the hosting environment; should one load balancer fail, the other would handle all the traffic in the network without any loss of service.
We have four switches (two for external traffic and two for iSCSI traffic), and each server has multiple connections to each switch, allowing a single switch at either layer to fail without any loss of connectivity. A cluster of four high-end servers (Dell PowerEdge R710s with 64 GB RAM and RAID 1 OS drives) runs the Citrix XenServer virtualisation platform within this infrastructure. Each server has four network interfaces connecting to the four switches, and two power supplies connected to separate power feeds.
The XenServers are in turn connected to a large SAN (Storage Area Network) containing multiple large (1 TB) hard disk drives (HDDs). The SAN was configured with two disk volume groups: a RAID 5 group for virtual servers and a RAID 10 group for high-throughput storage. A RAID 5 configuration tolerates a single disk failure; a RAID 10 configuration can survive up to half of its disks failing, provided no two failed disks belong to the same mirrored pair. Both RAID groups also have a hot spare drive to allow automatic recovery from a single disk failure: the hot spare takes over the role of the failed drive, and the volume group rebuilds all active data onto it while waiting for a replacement HDD.
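The fault-tolerance rules above can be sketched in a few lines. This is purely illustrative: the disk counts are assumptions for the example, not our provider’s actual layout.

```python
def raid5_tolerance(n_disks):
    """RAID 5 stripes data with single parity across n_disks:
    it survives exactly one disk failure, regardless of group size."""
    return 1

def raid10_tolerance(n_disks):
    """RAID 10 mirrors disks in pairs, then stripes across the pairs.
    Guaranteed: any single failure. Best case: one disk per mirrored
    pair may fail (half the disks), but a second failure in the SAME
    pair is fatal. Returns (guaranteed, best_case)."""
    assert n_disks % 2 == 0, "RAID 10 needs an even number of disks"
    return 1, n_disks // 2

# Example: a 6-disk RAID 5 group and an 8-disk RAID 10 group.
print(raid5_tolerance(6))    # 1
print(raid10_tolerance(8))   # (1, 4)
```

This is why the report describes RAID 10’s tolerance as conditional: losing four of eight disks is survivable only if each loss falls in a different mirrored pair.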
The SAN itself has multiple protections against failure: it has two power supplies connected to two separate power feeds, and two independent RAID controllers connected to its disks, so that should one controller fail, the other takes over handling the data held on the hard drives.
Unfortunately, the incident on Friday 3rd February 2012 involved multiple simultaneous hardware failures within this SAN, which took the whole environment offline. A further outage occurred when the SAN came back online having lost its IQN (iSCSI Qualified Name), which in turn caused the virtualisation platform to fail.
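For readers unfamiliar with iSCSI naming: hosts locate a SAN target by its IQN, so if the target comes back with a missing or changed IQN, initiators can no longer attach its storage. A minimal sketch of the standard IQN shape (defined in RFC 3720); the names below are invented examples, not our provider’s actual identifiers.

```python
import re

# IQN format per RFC 3720: iqn.<yyyy-mm>.<reversed-domain>[:<unique-name>]
IQN_RE = re.compile(r"^iqn\.\d{4}-\d{2}\.[a-z0-9.-]+(:.+)?$")

print(bool(IQN_RE.match("iqn.2001-04.com.example:storage.disk1")))  # True
print(bool(IQN_RE.match("not-an-iqn")))                             # False
```

A XenServer storage repository references its iSCSI target by exactly such a name, which is why a lost IQN cascaded into a platform-wide failure rather than a storage-only one.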
Our hosting provider continues to monitor the new hosted environment in the same manner as the original one, and we are looking at ways to further improve and scale up our services to best meet the needs of our users.