It's been a while since I posted, I know. The delay hasn’t been for lack of interest, or even distraction, so much as the fact that I just haven’t been doing anything interesting. Most of my time lately has gone to fixing “little stuff.” You know, that old app that works most of the time but has an issue that wakes everyone up once a month. Or that emergency “gotta-have-it” report that no one will bother to look at. Or that special request from the local potentate that is meaningless, but that everything stops for because the potentate wants it.
It’s given me time to think about the overall question though: why do computer systems suck so bad?
If you’re bothering to read this blog, I assume you have to be a geek. Otherwise, you wouldn’t be wasting your time. But we all know that geeks like to waste their time, so you’re here. So, for a minute, pretend you’re not a geek. Let’s say you’re a high-powered manager in a decent organization. You were that guy who played second string on the football team in college and drank a bit too much the night before the early morning practice. Or you were that young woman whose social calendar was filled with must-do events and whose classes just kept getting in the way.
But you got your MBA and found you had a real aptitude for understanding and managing how companies actually make money.
Now, you find yourself going in each day and looking at a spreadsheet with your morning coffee. This spreadsheet gives you a snapshot of yesterday’s performance and today’s challenges. And it sucks. The tabs aren’t right. The numbers are usually off. The thing is often late. You’ve learned to rely on it, and you need it to get the competitive edge you crave. But the thing can’t even reliably calculate something as simple as a sales conversion rate.
You call your IT director and scream into your phone in what becomes a daily ritual. And you just can’t understand why creating a simple spreadsheet should be so beyond your IT department.
Why is it? You’ve got a bunch of good people who all work too many hours and really want to make it right. You’ve got a – maybe not ideal but – reasonable budget. What’s wrong?
What IT groups usually miss is the big picture and the end goal. That business manager cares about his or her spreadsheet. They couldn’t care less that their IT department just worked a 14 hour day to install a new version of a J2EE engine.
What tends to happen in IT is that you get a developer. Let’s say that developer’s name is Marvin (you know… the guy who wears white socks with black shoes and reads comic books at lunch). And he’s pretty good. The code that he writes incorporates complex logic to pull data across a variety of distributed platforms, converts data types – because we all know that duplicated data isn’t really cloned, it’s mutated – and does complex mathematical calculations that would make a physics professor blink. And it works pretty well – say about 90% of the time. Let’s say that 3 days each month (3 out of 30, or about 10%) it has some issue. It could be something small, like it tried to write data but couldn’t get a lock on one of the rows. It could be an all-out crash that was really simple to fix by just restarting the app, but required a little manual intervention. Or it could be anything in between – maybe Marvin has a bug in one of his calculations, so that if some value is less than zero he gets the wrong results. But, all in all, it only has issues 3 days each month. Not a huge deal. Someone calls Marvin, he gets out of bed, logs in, fixes the issue and republishes the data.
But suppose the server admin (Alvin) has the same success rate – 9 in 10. About 3 days each month, the server runs out of memory or disk, or drops offline without warning to auto-install new patches or something.
If the failure rate of Marvin’s process is 1 in 10 and the failure rate of Alvin’s server is 1 in 10, then the sum of the failure rates is 20%. But here’s the kicker. The whole is greater than the sum of the parts, because a failure in the server can cause downstream consequences that may not be visible instantly. (This is one of the things I can’t get my infrastructure staff to see, by the way.) So as part of his process, let’s say Marvin writes some temp files. When he gets called because the server did an unplanned restart and he needs to re-run his process, he logs in, goes to a command prompt and runs the process. But he runs it with his credentials, not the credentials of the scheduler’s account, thereby causing the scheduled process to run into privilege issues on its next run. Now he’s just created a problem that was not part of either 10% failure rate. It’s a perfectly reasonable mistake to make. It doesn’t make Marvin a bad guy.
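One cheap defense against that particular class of mistake is to have the job refuse to run under the wrong account in the first place. Here’s a minimal sketch of that kind of guard in Python – the service account name and the shape of the job are made up for illustration:

```python
import getpass
import sys

# Hypothetical service account the scheduler runs under; the name is made up.
EXPECTED_USER = "svc_reports"

def ensure_service_account() -> None:
    """Refuse to run if someone (say, a sleepy Marvin) launches the job
    under their own credentials instead of the scheduler's account."""
    current = getpass.getuser()
    if current != EXPECTED_USER:
        sys.exit(
            f"Refusing to run as '{current}'. Re-run as '{EXPECTED_USER}' "
            f"so temp files and locks keep the ownership the scheduled run expects."
        )

if __name__ == "__main__":
    ensure_service_account()
    # ...the actual data pull and calculations would go here...
```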
But the end result is that the total failure rate of the end process is now 10% for Marvin’s process + 10% for Alvin’s server + X% for mistakes made during cleanup + Y% for other failures in the process caused by fallout from the server crash – missing temp files, lost session data, etc.
In the end, the failure rate of the whole process may be something like 25% or 30%. And that likely doesn’t include the failure of whatever systems Marvin is pulling the data from. If those systems also have a 10% failure rate, and their failure can cause downstream problems, this can add another 15% or more to the overall failure rate of the process.
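To put some rough numbers on it, here’s a quick back-of-the-envelope calculation. The 10% figures are the ones from the story above; the coupling and cleanup penalties are illustrative guesses, not measurements:

```python
# Rough math for how per-component failure rates stack up.
p_marvin = 0.10   # Marvin's process: fails about 3 days out of 30
p_alvin  = 0.10   # Alvin's server: same rate

# If the two failed independently, the chance that something breaks on a
# given day is 1 - 0.9 * 0.9 = 19% -- close to the naive 20% sum.
p_base = 1 - (1 - p_marvin) * (1 - p_alvin)

# But they aren't independent, so add the coupling terms:
p_cleanup = 0.04   # X%: mistakes made during manual cleanup (a guess)
p_fallout = 0.05   # Y%: missing temp files, lost session data, etc. (a guess)
p_process = p_base + p_cleanup + p_fallout   # ~28%, in that 25-30% range

# Fold in the source systems Marvin pulls from and their own fallout:
p_sources = 0.15
p_total = min(1.0, p_process + p_sources)

print(f"process alone: {p_process:.0%}")   # ~28%
print(f"whole chain:   {p_total:.0%}")     # ~43%
```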
So the reason computer systems suck is that failure rates don’t just add up; they compound with each new component and each new interaction between components.
How do you fix it?
Naturally, you need to fix the individual issues. You need to get Marvin’s 10% failure rate down to 5% and then under 2%. But the real answer is that the systems need to be more loosely coupled and self-resilient. Any cross-system dependency needs to be identified and planned for. And someone needs to own – not the architecture of the individual pieces but – the architecture of the interaction between the pieces. This is probably the most overlooked piece of the puzzle.
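What does “self-resilient” look like in practice? One small piece of it is having each cross-system step retry transient failures on its own, instead of dying and waiting for the middle-of-the-night phone call. Here’s a rough sketch in Python – the step names are hypothetical:

```python
import random
import time

def with_retries(step, attempts=3, base_delay=2.0):
    """Run a step, retrying transient failures with exponential backoff
    instead of failing the whole chain the first time a lock is busy."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:      # real code would catch specific exceptions
            if attempt == attempts:
                raise                 # out of retries: now it's a real failure
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 1)
            print(f"{step.__name__} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Hypothetical steps -- the point is that each cross-system call is wrapped,
# so a hiccup in one piece doesn't cascade into the next.
def pull_sales_data():
    ...

def publish_spreadsheet():
    ...

if __name__ == "__main__":
    with_retries(pull_sales_data)
    with_retries(publish_spreadsheet)
```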
Someone once said: “the more complex you make the drain, the easier it is to stop up the plumbing.”