Wednesday, March 17, 2010

Update on System Issues

Update (as of 10/19): All systems have continued running smoothly since the issue was resolved on Wednesday. The root cause was a defect in the replacement switch provided by HP, which caused issues with another, otherwise functional switch in the network. The problematic switch has been replaced and all systems are stable.


Update (as of 2:00 EST): All systems are back up and functioning properly. We are continuing to monitor the environment closely and remaining cautious until the root cause analysis is completed.

We have had some system issues today and I wanted to provide an update and some explanation here - Andera customers have been receiving regular communications from our Operations team but I thought I would add some color commentary.

On Monday, we had a hardware failure with one of the switches that sat between our Oracle database and storage array. A redundant switch picked up all the traffic, so the issue was limited to some performance problems during a two-hour window when configuration changes were taking place. A replacement switch was delivered yesterday and was scheduled to be put into production during the maintenance window already planned for this Sunday.

Unexpectedly, this morning we had a hardware failure in the backup switch. Because redundancy was not due to be restored until Sunday, this caused a connectivity issue between our database and storage array, resulting in a full system outage that is currently ongoing. Our Ops team, along with engineers from HP, is focused on getting yesterday's replacement switch into operation and restoring service. We will then conduct a full root cause analysis to figure out how two switches can fail within a couple of days of each other. We have some theories but need to investigate more closely once the system is back up. One thing that we do know for certain is that this is a hardware issue, not a capacity problem.

We have made very significant progress on system performance and resiliency over the last year, so this is out of character for us. Of course, we take this very seriously and fully understand the impact any Andera system issue has on our customers and partners.

I will post another update here once we have more information. Apologies for the inconvenience and thanks for your patience.
