Saturday, February 14th, 2009...21:28

Total Order Drop

Shortly before 7am GMT on February 12, 2009, Amazon.co.uk slowly began refusing to fulfill MP3 orders. By 8am, the MP3 store was failing to fulfill 95% of all orders, and it continued to do so until almost 2pm. I have often wondered how things like this come about, and now I have firsthand knowledge.

I was sloppy and did not fully test a change which I presumed to be fairly innocuous. Unfortunately, QA was also a bit careless during testing and missed the same test case I hadn’t executed. Even more unfortunately, this test case was the one that would have caught the problem which eventually made it to the live site.

Once the problematic code had made it to the live site, there was the issue of why it stayed there for so long. Again, a few problems snowballed, starting with our team not being notified of the issue for 1.5 hours. It took another hour to get the rollback started, which subsequently failed, and then another 30 minutes to get the proper permissions to start the process.

As a result of this, several process improvements will be made. Some of them are good (for example, some of the escalation paths will be changed to include our team when appropriate), but others just seem reactionary and will likely result in more overhead with little reward.

While Amazon wasn’t hemorrhaging thousands of dollars (pounds in this case) per hour, I’ve also been wondering if there isn’t some sort of penalty I and the others involved should pay. Sure, I had to write up a document explaining the root cause and what will be done to prevent this from happening again, and I’m quite embarrassed and angry with myself, but there are really no other repercussions that I’m aware of. Incorrect things were done, and there’s no ambiguity about that, but there’s also no specific penalty. There’s no good way to quantify the damage done to Amazon’s reputation. The money lost could be calculated, but there are no rules describing how that gets translated into penalties for the people involved. I’m not going to be put on bathroom duty, or have to peel potatoes, or write “I will thoroughly test my code” 1000 times on the whiteboard. Is the line of work I’m in above that, or does it just not make business sense to come up with these things? Whatever the reason, why are penalties so prevalent in other aspects of life?
