Debrief Part 2: Technical Detail and Future Systems

Last updated: 4:00PM 5/03/2010

Tags: Server, Update, News

Technical Detail

Original System Architecture: Single Replicated Pair

In 2007 we started to plan our next generation of data storage. We wished to have a resilient and replicated system that enabled us to not rely on any one machine staying up.

We went for a dual system configuration using a technology called DRBD (http://www.drbd.org/). Each storage system would run several large disks in a RAID 5 (allowing any single disk to fail without affecting the system) and be replicated in near-real-time to the second server. All access servers would be able to access these servers via a network file system (NFS).

Network Diagram of Replicated Pair Storage System:


Each server would have its storage stored as two slices, A and B. Server 1 would run Slice A as the primary and replicate Slice A to Server 2 who would hold that data on standby. Likewise Server 2 would run Slice B as primary, replicating to Server 1.

Should any server experience a total failure, we would be able to mount the standby slice and point all access servers to the backup server for both Slices. Once we repair the first server we can then move the slice back and continue operation as normal.
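The slice arrangement and the manual failover step can be sketched in a few lines of Python. This is an illustration only, not the DRBD configuration itself: the server names, the `Slice` record and the `fail_over` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    name: str
    primary: str   # server currently serving the slice
    standby: str   # server holding the near-real-time replica


def fail_over(slices, failed_server):
    """Promote the standby copy of every slice whose primary has failed."""
    promoted = []
    for s in slices:
        if s.primary == failed_server:
            # The old primary becomes the standby once it has been repaired.
            s.primary, s.standby = s.standby, s.primary
            promoted.append(s.name)
    return promoted


# Hypothetical layout matching the description: each server is primary
# for one slice and standby for the other.
slices = [
    Slice("A", primary="server1", standby="server2"),
    Slice("B", primary="server2", standby="server1"),
]

promoted = fail_over(slices, "server1")
print("Promoted slices:", promoted)
print({s.name: s.primary for s in slices})
```

After the call, both slices are served from the surviving server, which is the state the access servers would be repointed to.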

The downside to this system is that any storage failure would require manual intervention to bring up the backup slice and repoint the access servers. Additionally, the final restore phase of moving the backup slice back to its original server would require some additional maintenance downtime. The work would take only a matter of minutes but any downtime at all was something we wished to avoid.

When this system started to reach its capacity in Autumn of 2009, we started to investigate a solution that would not require such off-line maintenance periods and decided to test out a Storage Area Network (SAN) solution.

SAN Storage Architecture: “Highly reliable” solution

By the end of 2009 we had spent over two months testing the performance and reliability of an HP SAN solution provided by our Data Centre providers. Throughout this time the SAN performed flawlessly and never experienced any outages whatsoever.

The idea behind a SAN is that it is a very reliable system that you can depend on. We regularly heard phrases such as ‘never fail’ in our investigations of SAN systems and this gave us confidence that this would be the right solution for us. The SAN isn’t just one machine but a network of machines presented as one. Our mail access cluster only had to have knowledge of a single point of connectivity but in reality this was split across 4 highly powered storage servers. The SAN was designed so that if any one of those servers went down we wouldn’t be affected, and no downtime would be necessary.

Network Diagram of SAN System:


In December we decided to ‘pull the trigger’ and set our migration scripts into action. These were designed to run 24/7, and most of you will not have noticed your account move, as they intelligently worked around the times when you were online. Because the scripts were throttled right down, the entire migration would take several months; it was about 80% complete when the SAN system failed.

It is now clear that the problem was that one of the nodes had ‘semi-failed’. If the node had completely failed we would have been absolutely fine. The problem is that with this partial failure, the node was kept active in the cluster and dragged performance down.

There is obviously a flaw in the monitoring system which must be corrected before we would trust our users’ data to this system again. Our Data Centre management have told us that should this happen again they would be able to detect and correct the problem within minutes rather than days, but we have decided to allow their system to mature before we consider it for mission-critical use.
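The difference between a node that is down and one that has ‘semi-failed’ can be illustrated with a small sketch. This is not how the HP SAN monitors its nodes; the node names, the latency figures and the `slow_factor` threshold are assumptions chosen purely for illustration. The point is that a check which only asks "did the node respond?" would pass the slow node, whereas comparing each node against its peers flags it.

```python
from statistics import median


def classify_nodes(latencies_ms, slow_factor=5.0):
    """Label each node 'down', 'degraded' or 'healthy'.

    latencies_ms maps node name -> recent response time in ms,
    or None when the node did not respond at all.
    """
    responding = [v for v in latencies_ms.values() if v is not None]
    baseline = median(responding) if responding else None
    status = {}
    for node, latency in latencies_ms.items():
        if latency is None:
            status[node] = "down"
        elif latency > slow_factor * baseline:
            # Still answering, but far slower than its peers.
            status[node] = "degraded"
        else:
            status[node] = "healthy"
    return status


# Hypothetical readings: node3 still responds, but very slowly.
sample = {"node1": 4.0, "node2": 5.0, "node3": 250.0, "node4": 4.5}
status = classify_nodes(sample)
print(status)
```

A cluster could then take a "degraded" node out of service in the same way as a "down" one, rather than keeping it active.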

Future System: Multiple Replicated Pairs

During the outage, our priority was to restore access as soon as possible. Moving back to the two original servers was out of the question as they had already started to struggle with the load before the migration and we had grown since then.

Instead, we activated two additional machines and rapidly set them up as storage servers. We moved data from the SAN to these servers over the 3-4 days after the outage and stabilised our system. These servers are running independently (unreplicated) but are being backed up on a continuous basis. This is of course only an interim setup while we finalise a long-term solution.

Our replicated pair storage architecture performed well until it came under heavy load. At that point the system became slow, which, along with the other benefits we expected, prompted the move to the SAN system. Given the SAN is not viable at the moment, we have decided to evolve the original solution and are setting up a lightly-loaded multiple replicated pair system:

Network Diagram of Multiple Replicated Pairs


This appears to be a relatively simple improvement to the original system, and on the surface it is. Underneath, however, we will be introducing some more advanced configurations. We will increase the number of slices while keeping each one small, for faster restores, and we are investigating automatic failover functionality, now mature, that would reduce the downtime visible to users to a few seconds. There is also scope for combining some of the mail access cluster functions into the storage servers themselves to further increase performance. These will be investigated in due course.
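The reasoning behind smaller slices is simple arithmetic: the time to restore one slice is its size divided by the transfer rate. The sketch below uses made-up figures (a 1000 GB total and a steady 50 MB/s) purely to show the shape of the relationship; neither number is a measurement from our systems.

```python
def restore_hours(slice_size_gb, throughput_mb_per_s):
    """Hours needed to copy one slice at a steady transfer rate."""
    seconds = slice_size_gb * 1000 / throughput_mb_per_s
    return seconds / 3600


TOTAL_GB = 1000    # assumed total data, for illustration only
THROUGHPUT = 50    # assumed steady transfer rate in MB/s

for slice_count in (2, 8, 32):
    size = TOTAL_GB / slice_count
    hours = restore_hours(size, THROUGHPUT)
    print(f"{slice_count} slices of {size:.1f} GB: {hours:.2f} hours to restore one slice")
```

The total volume of data is unchanged, but a failure affecting one small slice can be put right far sooner than one affecting half of the store.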

In addition we will be adding capacity at regular intervals, well before any load-related issues are introduced. Many of our past issues with our NFS mounts were related to the storage subsystem being unable to keep pace with the number of storage read/write requests. A lightly loaded storage server will not suffer from these bottlenecks.

We will also be investing in a resilient infrastructure where possible - in particular looking to double up on power feeds and network connections. Last night’s brief power outage, affecting two out of three of our webmail servers, illustrated the importance of this.

These improvements are being worked on every day and will take a number of weeks to finalise. In implementing them we will do our very best to ensure that any impact on your day-to-day use of the system is avoided. Where necessary, we will schedule essential maintenance works to be carried out during off-peak evening hours and would appreciate your understanding.

I hope this brief gives you reasonable assurance that we are working as hard as we can to move on from the negative events of the last two weeks. We are certainly keen to avoid having to repeat the experience ourselves!

To close, I am keen to share that we have been working hard on a couple of exciting projects which will be back on course once we have stabilised the core service. These include a much upgraded webmail interface and we will release a preview-beta to you as soon as we have the basic functionality implemented. In our members’ survey many of you commented that the number one thing you want improved is speed and responsiveness. This new interface has been designed with that in mind.

Thank you again for your understanding and patience. With these improvements implemented, I look forward to much calmer sailing ahead.

Yours faithfully,

Daniel Watts
Managing Director

Director's Report: Debrief Part 1: Apology and Overview

Last updated: 3:00PM 24/02/2010

Tags: Server, Update, News

Dear Members,

No charge for February

By way of apology I have taken the decision to credit all subscribed accounts with an extra month free of charge; effectively asking no fee for service throughout February. This credit will show up on your accounts shortly. I hope you will accept this token of goodwill as evidence of our deep regret for the events of the past week.

I am glad to report that your email service is again back to normal. We are currently running our system on an alternative infrastructure brought into action last week. This infrastructure is performing well but is not a long-term solution in terms of ongoing performance and reliability as the service grows each year.

This report will both summarise the events leading up to and during the past week and also look to the future in terms of developing a sustainable architecture.

For those readers who are less interested in the detail and just want reassurance that the service will work I have separated the technical information out to follow in a later post.

Overview

In Autumn 2009 we recognised that stability problems had started to increase and identified our storage solution as the root cause. We researched various ways to evolve the storage subsystem and received much advice that using a decent quality Storage Area Network was generally the most reliable and highest performing solution available. At this time we engaged with our Data Centre providers to deploy their SAN solution, manufactured by the reputable company HP, for Aluminati.

The SAN was tested over a period of 2-3 months and we gradually migrated users over to it during the quiet Christmas period. We noticed an immediate increase in performance, not just for those accounts on the SAN but also for accounts on the original solution. It was clear that the original servers had been straining under the load and that this move was very necessary.

This migration was designed to be slow and completely transparent to our members. As such it was still in progress at the time of the outage with approximately 80% of user data having been moved.

On Wednesday evening we suddenly noticed a massive drop in performance of the SAN with it performing at approximately 5-10% of capacity. This caused massive problems for the system as a whole as it was not possible to feed data requests as fast as they came in. A build up of requests resulted in the subsequent failure of the service over the evening of Wednesday 17th February.
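The build-up described above follows from simple queueing arithmetic: whenever requests arrive faster than they can be served, the backlog grows by the difference every second. The sketch below uses assumed request rates (they are not measurements from our systems) to show how a drop to 5-10% of capacity turns a comfortable margin into a rapidly growing queue.

```python
def backlog_after(arrival_per_s, service_per_s, seconds):
    """Requests left waiting after `seconds`, starting from an empty queue."""
    growth = max(arrival_per_s - service_per_s, 0)
    return growth * seconds


NORMAL_SERVICE = 1000   # assumed requests/s the storage can serve when healthy
ARRIVALS = 400          # assumed requests/s arriving from the access servers

for fraction in (1.0, 0.10, 0.05):
    waiting = backlog_after(ARRIVALS, NORMAL_SERVICE * fraction, 600)
    print(f"{fraction:.0%} of capacity: {waiting:,.0f} requests waiting after 10 minutes")
```

At full capacity nothing queues; at a tenth of capacity or below, the same traffic leaves a backlog that keeps growing for as long as the slowdown lasts.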

Recognising that the problem was to do with the SAN we suspended access and raised the issue with our SAN providers. They had already started looking into the situation when we called and we were confident that service would resume shortly. These SANs were built to recover from failure quickly.

Over the course of that night we monitored the performance of the SAN and noticed no improvement. At about 1am we confirmed that the issue had been escalated to HP engineers and they were investigating. They continued to work throughout the night but at 5am I had to make the call to abandon the SAN. I did not want the service to be down throughout Thursday and had lost confidence in the ability of the SAN to be recovered immediately.

Spare servers were brought online and configured to take data from the SAN. These were ready by about 7:30am. Details of what happened next are covered in my previous postings so I will not repeat them here, other than to confirm that access to new mail (Phase 1) was successfully restored around midday on Thursday 18th February.

During the final restore phase, and shortly after, we had a few other issues to resolve on the way. These included:

  • Overload of server due to restore process. Resolved through increasing throttling.
  • Sending of attachments failing to work. This was because the attachments directory was still trying to write to the SAN. This was corrected on Saturday.
  • POP accounts downloading mail again. We had to reindex mail after the restore. Unfortunately this caused POP clients to think that all mail was new and download messages again.
  • Inbound email delay on Sunday for some accounts. We identified a permissions issue with some of the restored accounts preventing final delivery. This was corrected on Sunday.
  • Webmail access failing for some users – one of the webservers lost connection to the new mail stores and this had to be reset.
  • Sending email timeouts – one of the webservers had its mail accept process stop. Usual monitoring and automatic recovery systems had not been enabled yet as these had to be disabled during the recovery. These are now enabled again.
  • Mail filter rules missing, meaning that filtered email was arriving in the inbox but not being moved to subfolders. These filters were restored after the rest of the email data.

Speaking with the President of our Data Centre last night, we finally determined that the cause of the SAN failure was a bad sub-node that had not failed badly enough for the SAN systems to route around it automatically. We are demanding answers to three questions:

  • What exactly in the sub-node failed?
  • Why was it not detected automatically?
  • Why was a human expert engineer not able to identify the problem manually?

Until these questions are answered we will not be able to consider using the SAN again.

Our new long-term architecture is currently in planning and we expect to move to it, with minimal interruption, over the next 4-6 weeks. This architecture is effectively a scaled-up version of what was working very well for us last year until it became overloaded.

Learning from this experience, we have determined the following requirements for the new architecture:

  • Highly reliable individual storage nodes.
  • Lightly loaded servers running at 50% of capacity.
  • Maintain near-live replicas of data on standby nodes.
  • Failover ability to standby nodes within 15 minutes of detection.
  • Continuous backups kept in triplicate.
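
The measurable requirements in this list lend themselves to an automated check. The sketch below is one hypothetical way of expressing them; the `NodeStatus` record and its field names are invented for illustration and do not describe any existing monitoring tool. The first requirement (highly reliable nodes) is not something a simple threshold can verify, so it is left out.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    name: str
    load_fraction: float      # current load as a fraction of capacity
    has_live_replica: bool    # near-live replica present on a standby node
    failover_minutes: float   # measured time to switch to the standby
    backup_copies: int        # number of backup copies currently held


def unmet_requirements(node):
    """Return the list of requirements this node does not currently meet."""
    problems = []
    if node.load_fraction > 0.50:
        problems.append("load above 50% of capacity")
    if not node.has_live_replica:
        problems.append("no near-live replica")
    if node.failover_minutes > 15:
        problems.append("failover slower than 15 minutes")
    if node.backup_copies < 3:
        problems.append("fewer than three backup copies")
    return problems


# Hypothetical readings for two storage nodes.
good = NodeStatus("store1", 0.40, True, 10, 3)
bad = NodeStatus("store2", 0.85, False, 10, 2)
for node in (good, bad):
    print(node.name, unmet_requirements(node) or "meets all checked requirements")
```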

We are confident that this will allow us to grow the system and improve stability to the levels you expect and demand.

I shall follow this report with a more detailed one explaining these future plans.

Yours faithfully,

Daniel Watts
Managing Director

Last updated: 13:30 22/02/2010

Tags: Server, Update, News

The Director will be publishing a full write up of the recent SAN situation along with a corrective action report. This is expected to be finalised before Wednesday.

Last Updated: 14:30 19/02/2010

Tags: Server, Update, News

Dear Members,

As announced, Recovery Phase 2 began yesterday afternoon at about 4pm. This procedure covers the restoration of old emails to your accounts.

Recovery continued all evening and into the night with additional processes being started up at 10pm once regular traffic had started to wane. Transfers continued until early morning and we are happy to report that 75% of data has now been transferred.

Unfortunately the early morning increase in member activity coincided with a few large transfers being initiated and caused the servers to become unavailable. We had to throttle the migrations back further and restabilise the service (completed by about 9am). We've been stable since and are starting to gradually ramp up the restoration rate again to complete the last 25% of accounts. We will keep a close eye on progress and update you via the news section on the homepages.
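
The "throttle back, then ramp up gradually" approach described above can be sketched as a simple rule: cut the restoration rate sharply when the servers are overloaded, and otherwise raise it one step at a time. This is an illustrative sketch only, not the restoration scripts themselves; the load threshold, the limits and the `next_rate` helper are all assumptions.

```python
def next_rate(current_rate, server_load, high_load=0.8, min_rate=1, max_rate=32, step=1):
    """Pick the next number of concurrent transfers.

    Halve the rate when the server is overloaded, otherwise raise it
    by one step, always staying within [min_rate, max_rate].
    """
    if server_load > high_load:
        return max(min_rate, current_rate // 2)
    return min(max_rate, current_rate + step)


# Hypothetical load readings: a spike in member activity arrives mid-run.
rate = 8
for load in (0.5, 0.6, 0.95, 0.7, 0.6):
    rate = next_rate(rate, load)
    print(f"load {load:.0%} -> {rate} concurrent transfers")
```

Backing off quickly and recovering slowly favours keeping the service available to members over finishing the restoration as fast as possible.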

Many of you have taken advantage of our emergency recovery service for access to important emails and this is still available. Email help@aluminati.net with details and we will try to restore the information for you rapidly.

Thank you to the many of you who have expressed your understanding of the situation. Once the dust has settled we plan to take stock and give a full evaluation of what happened, why, and what we are going to do about it. I am sure you'll agree that right now the priority is to get everyone back on their feet.

Yours faithfully,

Daniel Watts
Managing Director

Director's Report: SAN FAILURE

Last updated: 09:00 18/02/2010

Tags: Server, Update, News

Dear Members,

A major failure has occurred within the email storage system.

Our central Storage Area Network started showing degraded performance at around 6pm Wednesday evening. As our systems depend on this system to store your data, incoming email and access requests began to pile up, causing the system in general to overload. Usually such an incident is a temporary affair and service may be restored within minutes. It seems however that our technicians who maintain the SAN have been unable to pinpoint the exact cause. The SAN supplier, HP, is now on the case and providing ongoing top level expert assistance.

Meanwhile our priority is to restore access to your accounts. We have a two phase plan:

  • Phase 1
    We will restore your accounts to allow NEW email to arrive. You will be able to login and view new email and reply to it. We expect this to be available by lunchtime today.

  • Phase 2
    We will then begin the restoration of your email data. This will take somewhat longer as we have to move nearly a terabyte of data back into your mailboxes. We will have a more precise estimate of completion time once this begins.

Meanwhile we will release in the next 30-60 minutes an emergency forwarding interface to allow you to forward your incoming email to another account of your choosing. This will be available from this front page.

We will keep you updated through the website News section as progress is made. Please understand that we are working as hard as possible to respond to this unprecedented situation and are executing our Disaster Recovery Plan (which began last night) with the utmost urgency.

Please accept my deepest apologies for this inconvenience. As you know we take a failure of this magnitude very seriously. This SAN system was believed by all to be a very reliable one and has good standing in the industry. Unfortunately we seem to have had considerable bad luck in receiving one that has not performed as reasonably expected.

If you have an emergency and must have information from a particular email please email help@aluminati.net and we will attempt to recover this data for you on a case by case basis. Please try to only use this for actual emergencies so we can help those most in need.

Yours faithfully,

Daniel Watts
Managing Director
Aluminati Network Group