23/08/07 – Datanet Service Outage – Explanation and Initial Report
Following the major service outage we experienced on 23rd August, please find below an explanation of the problem and our initial report into its cause and solution.
We would also like to offer our sincere apologies if you were affected by this incident. We at Datanet take our responsibilities to our customers very seriously and are very sorry for the disruption and inconvenience this technical problem caused you.
On the morning of Thursday 23rd August, one of the Datanet racks in the IXEurope data centre lost power when supply breakers at the IXEurope hosting facility tripped. The cause of the trip has not yet been determined; Datanet are working with IXEurope and are awaiting its report on this point.
Engineers quickly determined that several systems were affected, and a team of engineers was deployed to the main Heathrow data centre facility to join on-site technicians and begin systematically testing each server to determine the cause of the power failure. Power was then restored to each server until all facilities were back online.
The email systems were then found to have suffered a file system software failure on one of the storage devices. This required a file system rebuild, which took several hours to complete. Engineers worked through to 2.30am on Friday morning, at which point the mail system was restored. Our priority during this time was to safeguard historic, current and new email, to prevent any data loss, and to return services as soon as possible. We have subsequently had no reports of lost or missing email.
The Windows hosting platform also suffered file system corruption. Due to the nature of the problem, it took longer than normal to rectify, as a new Windows hosting platform had to be brought on stream. Datanet have since put in place new processes and procedures to prevent a recurrence of this Windows-specific problem.
The solution, moving forward
In the short term, we have replaced faulty hardware and increased our monitoring of this particularly sensitive part of our network. In the medium term, we have identified that, whilst critical systems are replicated across our network in geographically diverse locations, they rely on dependencies which also need to be replicated elsewhere. Providing this additional diversity will require the installation of new hardware and services in alternative locations, so that these dependent services are also more resilient in the future. Datanet will provide further updates on this work in future newsletters.
Once again, please accept our sincere apologies if you were impacted by this incident and please be assured that we are working hard to make sure this problem does not happen again.