No charge for February
By way of apology, I have decided to credit all subscribed accounts with an extra month free of charge, effectively waiving the fee for service throughout February. This credit will show up on your accounts shortly. I hope you will accept this token of goodwill as evidence of our deep regret for the events of the past week.
I am glad to report that your email service is back to normal. We are currently running our system on alternative infrastructure brought into action last week. This infrastructure is performing well, but it is not a long-term solution for performance and reliability as the service grows each year.
This report summarises the events leading up to and during the past week, and also looks to the future in terms of developing a sustainable architecture.
For those readers who are less interested in the detail and just want reassurance that the service will work, I have separated the technical information out into a later post.
In Autumn 2009 we recognised that stability problems had started to increase and identified our storage solution as the root cause. We researched various ways to evolve the storage subsystem and received much advice that a decent-quality Storage Area Network was generally the most reliable and highest-performing solution available. At this time we engaged our Data Centre providers to deploy their SAN solution, manufactured by HP, for Aluminati.
The SAN was tested over a period of 2-3 months, and we gradually migrated users to it over the quiet Christmas period. We noticed an immediate increase in performance, not just for accounts on the SAN but also for accounts on the original solution. It was clear that the original servers were straining under the load, and this move was a very necessary procedure.
This migration was designed to be slow and completely transparent to our members. As such it was still in progress at the time of the outage with approximately 80% of user data having been moved.
On Wednesday evening we noticed a sudden, massive drop in SAN performance, down to approximately 5-10% of capacity. This caused severe problems for the system as a whole, as data requests could not be served as fast as they arrived. The resulting backlog of requests led to the failure of the service over the evening of Wednesday 17th February.
Recognising that the problem lay with the SAN, we suspended access and raised the issue with our SAN providers. They had already started looking into the situation when we called, and we were confident that service would resume shortly. These SANs were built to recover from failure quickly.
Over the course of that night we monitored the performance of the SAN and noticed no improvement. At about 1am we confirmed that the issue had been escalated to HP engineers and they were investigating. They continued to work throughout the night but at 5am I had to make the call to abandon the SAN. I did not want the service to be down throughout Thursday and had lost confidence in the ability of the SAN to be recovered immediately.
Spare servers were brought online and configured to take data from the SAN. These were ready by about 7:30am. Details of what happened next are covered in my previous postings, so I will not repeat them here, apart from confirming that access to new mail (Phase 1) was successfully restored around midday on Thursday 18th February.
During the final restore phase, and shortly after, we had a few other issues to resolve. These included:
- Overload of a server due to the restore process. Resolved by increasing throttling.
- Sending of attachments failing. The attachments directory was still trying to write to the SAN; this was corrected on Saturday.
- POP accounts downloading mail again. We had to reindex mail after the restore; unfortunately this caused POP clients to think that all mail was new and download messages again.
- Inbound email delay on Sunday for some accounts. We identified a permissions issue with some of the restored accounts preventing final delivery. This was corrected on Sunday.
- Webmail access failing for some users – one of the webservers lost connection to the new mail stores and this had to be reset.
- Sending email timeouts – one of the webservers had its mail accept process stop. The usual monitoring and automatic recovery systems had not yet been re-enabled, as they had to be disabled during the recovery; these are now enabled again.
- Mail filter rules missing, meaning that filtered email was arriving in the inbox but not being moved to subfolders. These filters were restored after the rest of the email data.
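To illustrate the POP re-download issue above: a POP client remembers the unique ID (UIDL) of each message it has already fetched, and only downloads IDs it has not seen. When our reindex assigned new IDs, every existing message looked new to the client. The sketch below is purely illustrative (the UID values are hypothetical, not from our mail stores):

```python
# Why POP clients re-downloaded everything after the reindex:
# a client fetches only messages whose server-side UIDs it has
# not seen before. New UIDs therefore mean "new" messages.

def messages_to_download(server_uids, seen_uids):
    """Return the server UIDs the client considers new."""
    return [uid for uid in server_uids if uid not in seen_uids]

# Before the reindex: the client has seen all three UIDs, nothing to fetch.
seen = {"uid-001", "uid-002", "uid-003"}
print(messages_to_download(["uid-001", "uid-002", "uid-003"], seen))  # []

# After the reindex the same three messages carry fresh UIDs,
# so the client downloads all of them again.
print(messages_to_download(["uid-101", "uid-102", "uid-103"], seen))
# ['uid-101', 'uid-102', 'uid-103']
```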
Speaking with the President of our Data Centre last night, we finally determined that the cause of the SAN failure was a bad sub-node that had not failed badly enough for the SAN systems to automatically route around it. We are demanding answers to three questions:
- What exactly in the sub-node failed?
- Why was it not detected automatically?
- Why was a human expert engineer not able to identify the problem manually?
Until these questions are answered we will not be able to consider using the SAN again.
Our new long-term architecture is currently in planning, and we expect to move to it, with minimal interruption, over the next 4-6 weeks. This architecture is effectively a scaled-up version of what was working very well for us last year before it became too overloaded.
Learning from this experience, we have determined the following requirements for the new architecture:
- Highly reliable individual storage nodes.
- Lightly loaded servers, running at no more than 50% of capacity.
- Maintain near-live replicas of data on standby nodes.
- Failover ability to standby nodes within 15 minutes of detection.
- Continuous backups kept in triplicate.
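The failover requirement above can be sketched in a few lines: if a primary storage node fails its health check, traffic must be moved to its near-live standby replica before the detection deadline expires. This is a minimal illustration only; the node names, health-check mechanism, and 15-minute constant stand in for real tooling we have not described here:

```python
# Minimal sketch of the failover requirement: promote a standby
# replica when the primary node is unhealthy, within the window
# measured from the moment the failure was detected.

import time

FAILOVER_DEADLINE = 15 * 60  # seconds: standby must be live within 15 minutes

class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def is_healthy(self):
        # Stand-in for a real health check (ping, I/O probe, etc.).
        return self.healthy

def active_node(primary, standby, detected_at, now=None):
    """Return the node that should be serving traffic right now."""
    now = time.time() if now is None else now
    if primary.is_healthy():
        return primary
    # Primary is down: the standby must take over before the deadline.
    assert now - detected_at <= FAILOVER_DEADLINE, "missed failover window"
    return standby

primary = Node("store-1", healthy=False)
standby = Node("store-1-replica")
print(active_node(primary, standby, detected_at=time.time()).name)
# store-1-replica
```

In practice the standby holds the near-live replica described above, so promotion is a routing change rather than a data copy, which is what makes the 15-minute target achievable.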
We are confident that this will allow us to grow the system and improve stability to the levels you expect and demand.
I shall follow this report with a more detailed one explaining these future plans.