Stability Changes to our Storage Systems

DEAR MEMBERS,

This report addresses recent service outages and the changes we plan to make over the coming weeks to improve the reliability of the service we provide for you.

SUCCESSFUL IMPROVEMENTS TO RELIABILITY OF UTILITIES (POWER AND NETWORK)

Many factors affect service availability. Our current system infrastructure was engineered and deployed in the summer of 2010 to address failures of critical utilities, such as power supply and network connectivity. Before then, these had been disrupted on numerous occasions by unforeseen external incidents such as lightning strikes and network attacks. Since our 2010 infrastructure upgrade, however, we are happy to report that we have achieved virtually 100% uptime for both power and network supplies.

STILL SOME DISTANCE TO GO WITH OVERALL SERVICE AVAILABILITY

Nevertheless, overall we have not been happy with the various other outages we have experienced this year. Whilst our current uptime is well above that of 2010 (98%), we are slightly down on the overall uptime we maintained during 2011 (99.75%) and are determined to regain and improve upon this in 2013, with a target of 99.9%.
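
To put these percentages in perspective, the short Python sketch below converts an uptime figure into the hours of downtime it implies over a year. It is illustrative arithmetic only: the function name and the 365-day year are assumptions made for this example and do not describe how our uptime is actually measured.

```python
# Illustrative only: converts an uptime percentage into implied downtime.
# HOURS_PER_YEAR assumes a 365-day year (an assumption, leap years ignored).
HOURS_PER_YEAR = 365 * 24


def downtime_hours(uptime_percent: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the hours of downtime implied by an uptime percentage over a period."""
    return (100.0 - uptime_percent) / 100.0 * period_hours


if __name__ == "__main__":
    # Uptime figures quoted in this report.
    figures = [("2010", 98.0), ("2011", 99.75), ("2013 target", 99.9)]
    for label, pct in figures:
        hours = downtime_hours(pct)
        print(f"{label}: {pct}% uptime is roughly {hours:.1f} hours of downtime per year")
```

On these assumptions, 98% uptime corresponds to roughly 175 hours of downtime a year, 99.75% to about 22 hours, and the 99.9% target to just under 9 hours.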

graph-01

STORAGE SYSTEMS ARE THE MAIN CAUSE OF OUTAGES

Many of you will no doubt have noticed that whenever we have experienced service availability problems, our service updates have generally referred to issues concerning our storage layer. During the past few months we have attempted to determine the precise cause of this instability and have implemented small changes to improve availability. One part of the system that frequently seems to be implicated in these outages is our Network File System (NFS) layer, which lets services share their storage over the network.

UPCOMING CHANGES TO OUR STORAGE SYSTEM – REMOVING NFS

This month we are implementing our biggest change yet: the removal of the entire network storage layer from our server infrastructure. Currently our storage servers exist separately from our mail access servers, but after this change we will have a direct storage access model, which should result in less data transfer overhead and a system that is not exposed to instabilities in NFS. If NFS has indeed been the culprit for these outages, we expect to see a marked improvement in service uptime.

A SEAMLESS MIGRATION TO THIS NEW SYSTEM

We have created a migration protocol that should require neither downtime to our service nor any changes to your own configurations. In practice, no data is actually being moved; we are simply redirecting traffic flows, and you should not notice any interruption.

Naturally we are closely monitoring this process so that, should something go wrong, we will be able to respond instantly. The whole process will be completed within the next two weeks.

THANK YOU

We appreciate your patience as we strenuously work to improve the service we provide. Once this important upgrade has been completed I will publish a more general update to review the changes and to report on future developments.

With best wishes,

sig2