In the last few days the news has been full of stories about another high-profile IT failure, this time at the consumer banking giant RBS, whose IT upgrade (provided by CA) did not go according to plan. A planned upgrade appears to have caused significant performance problems, leaving the system unable to cope with day-to-day load and thousands of customers without access to their money or the means to pay their bills. While staff at the bank’s branches have apparently done the best they can in a situation not of their making, customers are still left frustrated and understandably angry. This is the kind of IT failure that loses customers and can ultimately topple a business. So what went wrong?
Well, I have no inside knowledge of these systems, but I would be willing to put good money on the events leading up to the failure following a pattern familiar to C2B2 and our consultants. We have performed technical fault identification and resolution for a number of customers after similar high-profile IT failures, and while we can usually pinpoint a technical cause of the problem and identify a fix, the real underlying cause is one of process and management rather than IT. The timeline of a modern IT project failure goes something like this:
An agile, feature-driven methodology is chosen as the software development process. This is not a problem in itself, but insufficient weight is given to the delivery of non-functional requirements.
Six months before the project delivers, everything is looking good, but the delivery of features is slightly behind schedule. The very limited performance testing done so far has identified no problems, so the decision is taken to re-allocate some of the testing time and budget to finishing the desired features.
Four weeks before the project delivers, ‘performance testing’ starts, running load against a cut-down version of the live environment. Tools like LoadRunner or JMeter are used to test the system, with a guess at how users might behave, based on an old version of the site with fewer features and a fraction of the expected live volumes. Testing focuses on the “sunny day scenarios”, when all hardware is working and an average number of users is doing “normal” things.
One week before the project delivers, ‘soak testing’ is performed by the developers: a small number of machines repeatedly hit the front page of the site for an hour, and if the site doesn’t fall over, testing is considered to have passed.
The project delivers. Users hit the site in large numbers, using it in a way completely different from what was expected. The site stays up for a while, but after a few hours it slows down and fails.
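The “guessed behaviour” problem in the timeline above can be reduced by driving load from a weighted mix of user journeys, ideally derived from real access logs, rather than hammering a single page. A minimal sketch in Python of how such a mix might be sampled (the scenario names and weights here are invented for illustration):

```python
import random

# Hypothetical action mix: (action name, relative weight).
# In practice these weights should come from production access
# logs, not from a guess.
SCENARIOS = [
    ("view_front_page", 30),
    ("search_products", 25),
    ("view_product", 25),
    ("add_to_basket", 12),
    ("checkout", 8),
]

def pick_actions(n, rng=random):
    """Sample n simulated user actions according to the weighted mix."""
    names = [name for name, _ in SCENARIOS]
    weights = [weight for _, weight in SCENARIOS]
    return rng.choices(names, weights=weights, k=n)

if __name__ == "__main__":
    sample = pick_actions(10_000)
    # Roughly 8% of simulated actions should be checkouts.
    print(sample.count("checkout") / len(sample))
```

Each sampled action would then be mapped to the corresponding sequence of HTTP requests in the load-test tool, so the traffic shape exercises the expensive paths (search, checkout) in realistic proportions instead of only the cheap, cacheable front page.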
The root cause of most IT failures we come across is a lack of proper performance testing with realistic user loads, and a lack of interpretation of the results when limited performance testing is performed. Throwing load at a system for one hour, with a binary outcome of “it fell over” or “it stayed up”, tells you nothing about your system: you have no idea whether it will fall over after 61 minutes or stay up forever. Real testing needs to start early, use representative loads, and run for extended periods of time. Monitoring should be in place so that bottlenecks can be identified and failure conditions extrapolated. When you come out of performance testing you should know where the weak points of your system are, what modes of failure it may demonstrate, and how it will behave under bursty traffic.
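One way to turn a soak run into data rather than a pass/fail verdict is to record response times throughout the run and fit a trend to them: a persistently rising slope flags gradual degradation (a leak, a growing queue) long before anything falls over. A minimal sketch, assuming the harness has already collected a list of per-request latencies in seconds (the function names and threshold are illustrative, not from any particular tool):

```python
import statistics

def latency_trend(samples):
    """Least-squares slope of latency versus request index.

    A positive slope sustained over a long run suggests the system
    is degrading even though no individual request has failed yet.
    """
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = statistics.fmean(samples)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def soak_report(samples, slope_threshold=0.001):
    """Summarise a soak run: percentiles plus a degradation flag."""
    ordered = sorted(samples)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    slope = latency_trend(samples)
    return {
        "median": statistics.median(samples),
        "p95": p95,
        "slope_per_request": slope,
        "degrading": slope > slope_threshold,
    }
```

Extrapolating that slope out to the expected live duty cycle is exactly the kind of analysis a binary “it stayed up for an hour” test cannot give you.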
When you bet significant parts of your budget and reputation on a software release, skimping on the testing and focussing on “sunny day scenarios” may sometimes save you a little money, but when the release fails it has the potential to cost far more in reputational damage than the IT system cost in the first place.
RBS are now facing a government enquiry into what went wrong, with the possibility of legislation forcing banks to disclose details of their IT systems and the causes of failures. I don’t imagine the CTO is having a particularly nice time of it right now, so if you have responsibility for IT systems, please don’t make the same mistakes.