Director's Report: Inbound Email Server Failure

Dear Members,

One of our inbound mail servers suffered an outage on Monday, which resulted in the loss of the email being processed on that server at the time. We take the integrity of our data extremely seriously, and this is the first incident to have caused any data loss within our email system. I apologise for the inconvenience this will have caused a number of our users. This report explains the incident itself, outlines its impact, describes our response and sets out the corrective and preventative actions we are taking.

Incident

At around 9:30pm on Monday 23rd November, one of our inbound mail servers suffered a sudden and total disk failure. Whilst the remainder of our mail infrastructure continued to operate, this server remained offline throughout the night. All of our key servers are protected by enterprise-grade “RAID” equipment, which guards against single disk failures: if one disk fails, the server can carry on operating normally. On Monday night, however, it appears that a fault in the server’s RAID, combined with maintenance work that involved physically relocating that server in our data centre, caused the RAID partition as a whole to become unstable and crash. On the morning of Tuesday 24th November we announced the delay to emails on that server, but attempts over the following 24 hours and more to restore the server and recover the data proved unsuccessful.

Impact

As a result of this server failure, any emails that were being processed in the queues on that machine at the time were lost. After a detailed forensic analysis of the available log data we have managed to recover the senders of the affected emails, but the available data did not allow us to identify the recipients. Discounting a substantial amount of spam, commercial bulk email and ‘do not reply’ style notifications, we have compiled a list of 2,504 senders that we believe were affected.

Response

We have hand-generated a polite ‘non delivery’ message to the senders of those emails, including the exact time that their email was received by our server. They have been invited to identify and resend any relevant emails. Naturally, automated mailers will not be able to react to this message, so you may need to take action yourself, for example by having a PayPal purchase receipt or a boarding pass sent to you again.

Corrective Actions

In response to this incident we have implemented the following changes:

Before any physical maintenance that could dislodge a hard drive from its RAID connection, every affected system must be double-checked to confirm that its RAID status is fully healthy. Whilst this cannot protect a server against a major physical impact, it reduces the risk of an incidental loose connection taking down a whole RAID set.
Isolation of email queues prior to maintenance. Emails are usually processed through our system within a few seconds. Prior to any maintenance we will block inbound email so that all queues flush through to the central storage units. This will minimise the amount of ‘live’ queue data held on a server while it is being relocated.
Better monitoring and alerting has been put in place to give advance notice of any signals that might indicate an upcoming disk or RAID failure.
We are also going to investigate the feasibility of real-time replication of email queues, meaning that any email that arrives on our servers would be replicated in real time to a second email server.
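
For technically minded readers, the first two changes above can be thought of as a single pre-maintenance gate. The following is only a minimal sketch of that idea, not a description of our production tooling: the `ServerState` fields, the `maintenance_blockers` function and the "optimal" status string are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    """Hypothetical snapshot of a mail server taken before maintenance."""
    raid_status: str        # assumed to read "optimal" when every disk is healthy
    inbound_blocked: bool   # True once new inbound email is being refused
    queued_messages: int    # messages still waiting in the local queues


def maintenance_blockers(state: ServerState) -> list[str]:
    """Return the reasons maintenance must not start; an empty list means go ahead."""
    blockers = []
    if state.raid_status != "optimal":
        blockers.append(f"RAID status is '{state.raid_status}', expected 'optimal'")
    if not state.inbound_blocked:
        blockers.append("inbound email is still being accepted")
    if state.queued_messages > 0:
        blockers.append(f"{state.queued_messages} message(s) still queued locally")
    return blockers


if __name__ == "__main__":
    # Illustrative values only: a degraded RAID set with mail still queued.
    state = ServerState(raid_status="degraded", inbound_blocked=True, queued_messages=12)
    for reason in maintenance_blockers(state):
        print("BLOCKED:", reason)
```

The point of returning a list of reasons rather than a simple yes or no is that the engineer carrying out the work can see exactly which precondition has not yet been met.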

Again, I would like to offer my apologies for this incident. It occurred on older infrastructure from which we are currently migrating all accounts. Our new infrastructure consists of brand new equipment offering much greater performance, as well as a more robust architecture that draws on the extensive experience we have gained from operating email services for you, our valued members, since 2002. Thank you for your understanding.

Best wishes,

Daniel Watts
Managing Director