Peter Cochrane's Uncommon Sense: Reliability and downtime
Five ISPs and he still ends up cursing his computer like the rest of us
The concept of downtime has been with us for more than 100 years and emerged from the early telegraph and telephone network era of the nineteenth century. As soon as we moved into telecommunications and extended our reach and control, reliability and availability became important features of government, management and society.
Well into the era of the automated telephone a magic performance figure emerged as a design target for each individual telephone exchange or switch. This was necessary as telephone networks grew across continents and ultimately linked every nation on the planet, which by the way only occurred in my lifetime. The increasing number of concatenated switches for long distance communications demanded extremely high levels of reliability - the failure of one meant the failure of all. This is the 'weakest link in the chain' problem.
So there is now a celebrated figure of five nines, often quoted in the industry, which says that a switch has to have an availability (or uptime, in modern parlance) of 99.999 per cent - in other words a probability of 0.99999. In any one year of operation the totalised unavailability or downtime of a single switch has to be less than 0.001 per cent, or a probability of 0.00001, which is a total of only about 5.3 minutes in any single year.
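The conversion from an availability figure to minutes of downtime per year can be sketched in a few lines of Python. This is a minimal illustration, not part of the original column: the function name and the 365-day year are assumptions made for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years (assumption)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime in minutes for an availability between 0 and 1."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Compare a few commonly quoted availability levels.
    for label, a in [("two nines", 0.99), ("three nines", 0.999), ("five nines", 0.99999)]:
        print(f"{label}: {downtime_minutes_per_year(a):.1f} minutes/year")
```

For five nines this works out at roughly 5.3 minutes a year, while 99 per cent allows more than 3.5 days.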
As an engineer I can tell you that 99.999 per cent is not easily achieved in complex machines and presents a substantial challenge. It dictates the use of multiple battery power supplies, generally backed up by diesel generators, with many items of the control and switchgear at least duplicated by hot-standby circuits. All have to be switched over automatically and seamlessly, undetected by the customer, should any single component fail.
There are not many items of technology that can boast such a performance or indeed such a high reliability figure. But when you consider the concatenation of around five switches for a single in-country connection, or 10 for an international call, it becomes obvious why this is so necessary. The downtime for five concatenated switches increases to around 26 minutes a year, while 10 switches will see around 53 minutes a year. This is all still pretty impressive but barely adequate for some modern businesses, especially banking.
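The concatenation figures can be checked with a short sketch that treats the switches as a series chain: the call succeeds only if every switch is up. It assumes, for illustration only, that the switches fail independently and that each has the same 99.999 per cent availability; the function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years (assumption)


def series_availability(per_switch: float, switches: int) -> float:
    """Availability of a chain in which every switch must be working.

    Assumes independent failures, so the availabilities multiply.
    """
    return per_switch ** switches


def chain_downtime_minutes(per_switch: float, switches: int) -> float:
    """Expected yearly downtime in minutes for the whole chain."""
    return (1.0 - series_availability(per_switch, switches)) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in (1, 5, 10):
        minutes = chain_downtime_minutes(0.99999, n)
        print(f"{n:2d} switches: {minutes:.1f} minutes/year")
```

At these availability levels the chain's downtime is very nearly the single-switch downtime multiplied by the number of switches, which is where the figures of about 26 and 53 minutes come from.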
The number of customers served by each switch compounds all of these reliability figures - for 100,000 customers terminated on one switch we have the potential for 100,000 x 5.3 minutes of totalised downtime.
The computer industry looks on 99.999 per cent with some envy and often struggles to approach 99 per cent. Is your PC up and running for 99 per cent of the time or more? How about your ISP? In my experience ISPs have gone from struggling to give 90 per cent availability to now achieving 99 per cent. It is not that 99.999 per cent can't be achieved, it's just very tough to engineer as systems become increasingly complex. It is also very expensive.
Reliability is directly related to technological maturity and, as a general rule, the more we use and engage with a technology, the better we understand it and the more likely we are to achieve high reliability. This is axiomatic in the case of the automobile, for example, which over the last 25 years has gone from being a piece of technology to just a car. Today cars always work and very seldom fail. Go back 25 years and the converse was true.
In my experience when trying to connect to ISPs it is not unusual to hear a line engaged signal or to get a modem that doesn't respond correctly or some synchronization failure, not to mention disconnections due to protocol mangle. The opportunities for ISP connection failure are compounded by a variety of software and hardware suppliers that have immature technologies. This is in complete contrast to the well-established telephone network.
As I spend a good deal of my life on the move and still have to maintain phone and email communication no matter what, I have adopted strategies to combat the shortfall in performance of today's technology. So I have accounts with five ISPs. Roughly speaking, if each one is unavailable 10 per cent of the time and they fail independently, this gives me a downtime probability of about (0.1)^5 = 0.00001, or an availability of 99.999 per cent. Do I actually achieve this? Well, not exactly.
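The five-ISP estimate is the mirror image of the chain of switches: the accounts are redundant, so the connection fails only if all of them fail at once. A minimal sketch follows, assuming independent failures and a 90 per cent availability per ISP (the figure implied by the 0.1 in the estimate); the function name is hypothetical.

```python
from math import prod


def parallel_availability(availabilities):
    """Availability when any one working provider is enough.

    Assumes independent failures: the combined unavailability is the
    product of the individual unavailabilities.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    five_isps = [0.90] * 5  # each ISP down 10 per cent of the time (assumption)
    print(f"{parallel_availability(five_isps):.5f}")  # prints 0.99999
```

The independence assumption is optimistic. Anything the five accounts share, such as the laptop, its battery or the access line, is a common point of failure that this calculation ignores.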
Although my laptop is extremely robust and I carry a full backup hard drive in case of theft or severe damage, there are times when I run out of battery power, encounter a software glitch or can't get an adequate wireless connection. But I am achieving around 99.99 per cent when I do wish to connect. In any one year my enforced downtime is only around one hour. While I suspect this figure will remain a distant dream for single ISPs and most users, it is probably a reasonable target for our new mode of mobile business.
To be offline for a full day is clearly unacceptable for any business but most can live with one hour.
The proliferation of WLANs and Bluetooth may see new levels of connectivity realised by accident. If all the devices I carry can communicate wirelessly then I may see a combination of digital mobile phone and WLAN roaming agreements securing 99.99 per cent connections. I will access my ISP via any medium that happens to be available.
Many cities now have entire districts contiguously connected by Wi-Fi. Should Bluetooth also be adopted in the same way then connectivity options will also increase. But there is a further opportunity as the same technology is adopted for vehicular use. My laptop talking to a passing taxi, bus or train is not beyond feasibility and has already been demonstrated.
So it looks as if we may achieve the magic 99.999 per cent by default.
This column was typed after a full week of never using the telephone line to connect to the internet. Everywhere I worked there was either a LAN or WLAN available. But I despatched this to silicon.com from the back of a cab using my mobile phone on the way to Changi Airport in Singapore.