As we all know, Glitch Happens™. But did you know that the true cause of a glitch is rarely a single mistake? In the technology world, a series of mistakes must come together before a massive systems failure happens. Yet companies often tell the public that it was all just due to an “update” by someone on the technical team.
Telecom outages that hit all of Canada
When one of Canada’s telecom giants, Rogers, experienced a major outage in April 2021, they claimed it was simply a mistake made during a software update. The company issued an apology and promised it wouldn’t happen again.
Until it did.
A little over a year later, on July 8, 2022, Rogers had an even more massive outage and blamed it on an update again.
Do massive technologies go down just because of a software or system update? Yes and no. I’ll explain why shortly, but first let’s take a look at how the most recent Rogers nationwide outage affected not only their customers but also people who weren’t Rogers customers at all.
Rogers’ Second Outage in 15 Months
The outage in July was so serious that many of Rogers’ technical staff could not even access the Rogers network to see what the problem was, let alone fix it. Only those employees with emergency SIMs on other Canadian networks were able to use their phones to look into the situation.
For nearly one full day, 25% of Canadians couldn’t make a telephone call, reach Emergency Services (911), or use the internet. Absolutely nobody in Canada could make an e-payment or use their debit cards at all, whether they were a Rogers customer or not. Many Automated Teller Machines (ATMs) were affected, as were specific banks that relied upon the Rogers network. Flights were disrupted. And in Toronto, 25% of traffic lights were affected and none of the parking meters worked.
In another city, medical appointments for cancer patients requiring radiation therapy had to be rescheduled to another town.
A big outage highlights bigger problems in the organization
I’m not privy to exactly what goes on at Rogers, but I have helped other companies that have complex technologies, including banks and telecoms, improve their quality issues. When systems go down, there’s a bigger problem to fix than just how a programmer codes or how the software is updated.
One of my banking clients had been experiencing many “little” outages for months. That is, each month, dozens of individual applications would suddenly stop working. The ATMs would fail in one region for a day, the mortgage system would have a hiccup that prevented house sales from closing, bank accounts would freeze up, the online system would not be accessible, and so on.
Any one of these would be an irritant to the customer, but the problems kept compounding until these sorts of “hiccups” happened nearly 100 times per month, affecting every region of the country.
The IT Executive Team kept directing the technical staff to fix the individual problems — sort of like running around with a fly swatter while leaving the doors wide open — until a Board Member said, “No. This isn’t a programmer problem. This is something the IT Executive needs to fix.”
An Outage is Not Just a Technical Problem
This is when I was brought in to help sort the situation out. These outages weren’t just a technical problem, of course. After all, every bank and every other company that uses software runs updates and maintenance fixes. Why was this bank having so much trouble?
We discovered that there were three major issues impacting the company:
- the growing complexity of its technology
- poor quality and risk management practices
- limited capabilities of its test environments
The combination of these issues meant that no matter what the bank’s technical staff did, each time an update occurred, problems ensued.
Who can fix these problems?
The first two issues were fixed by the leadership team and showed impressive results. The bank’s leadership team simplified its technology and strengthened its quality and risk management practices, reducing the bank’s outages to fewer than 10 per month in two years. But the third issue, creating a full-blown test environment that was robust and could handle the types of tests that were absolutely necessary, was far too expensive to tackle.
To build the right type of test environment that could help deliver a reliable solution to the customer would cost anywhere from $50 – $60 million, plus millions of dollars for ongoing operational costs. It’s not that the bank couldn’t afford it. It’s that unless all of the other banks were mandated to have this sort of test environment, my client would be at a competitive disadvantage.
Companies require a level playing field
Unless all companies in a given industry are required to play by the same rules, a company that takes the responsible steps to deliver robust quality to its customers will be punished in the marketplace. This is why we must look even higher than the Executive floor or even the Board of Directors at any company providing critical services to its customers.
We must look to the politicians and the regulators to mandate a level playing field to ensure that companies do everything they can to reduce the likelihood of putting customers at risk. The telecom industry is a perfect example. Regulators used to do an adequate job of this in telecom, but they and the politicians have not kept up with the new technologies.
Getting Back to the Telecom Industry
In Canada, the telecommunications industry is regulated by the Canadian Radio-television and Telecommunications Commission (CRTC). As with many regulators, the CRTC is composed of industry experts who return to the industry when their time at the commission is over. Having telecom industry or technical experience isn’t really required to shape an industry to do right by its customers.
What is required is concern for the safety and well-being of the public.
For example, when Canada’s telecom companies wanted to introduce cell phones to the market, they balked at the notion of implementing the same “continuance of use” safeguards that, decades earlier, had been mandated by telecommunications regulators around the world. All telecom companies were told that landlines had to work even when the electricity failed.
Proactive Safeguarding
If you ever had a landline during a power outage, you may remember the relief you felt when you picked up the phone and reached loved ones to check that they were fine. You may even have had to call for help. Back when telecom regulators were working more on behalf of citizens than of the industry itself, they realized that even though few households actually had telephones, they were becoming more common and, in the future, more and more people would rely upon their telephones in times of need. Telephones were becoming powerful tools to reach out for help in an emergency. Regulators, from the UK to Canada to the US and around the world, insisted that telephone companies protect citizens by having double backups should the electricity fail. The UK considers communications to be “critical infrastructure,” and so, long ago, its landlines were supported by:
[A]n emergency generator within telephone exchanges … [that] kicks in during a power cut to provide electricity to critical infrastructure. These generators have enough fuel for seven days and are often backed up by batteries. These batteries are trickle fed every day to keep them topped up with nine days’ worth of power, providing an additional runway for the exchanges until the mains is restored.
As I mentioned, when cell phones came along, regulators did make a gentle push to have this level of redundancy included in the development of cellular networks. I remember reading that the telecom companies managed to wriggle out of this requirement by claiming it wasn’t necessary. After all, if the electricity failed, it wouldn’t matter that the cell phones wouldn’t work. People who needed help could just use their landlines to call Emergency Services.
Except who owns a landline these days?
Telecom regulators didn’t look forward in time to the day when most of us would rely solely on cell phones and be without a landline. They left the public unprotected.
[By the way, many people imagine they have a landline because they have a home phone, but it’s likely VoIP (Voice over Internet Protocol). When the electricity goes out, or the provider encounters a massive outage like the one Rogers experienced in July, those phones stop working too, and their owners won’t be able to get help, because the CRTC has not mandated that these services have backups. A true landline has built-in safeguards.]
Don’t we need safeguards for certain industries?
The most significant impact of the massive Rogers outage in July was probably the fact that many people could not contact Emergency Services or even call a friend for help. In response to the outage, the Canadian regulator asked Rogers to come up with ways to use other telecom providers as backup services. This is a great start as I suspect all telecom companies will want to have the same sort of backup arrangements, and the CRTC might very well mandate this for the entire industry.
But here’s another question: what happens if (and when) there is an electrical failure? It is not clear to me that the CRTC has ever mandated that cell phone towers must have a backup source of electricity, as original landline services are required to have.
Clearly, all telecom service providers need to align to the same safety measures so that the public is protected.
It would be wrong and unfair to expect only one telecom company to invest in public safety when the rest aren’t required to do so.
Other impactful industries
Companies that provide critical services to the public (like banks, telecoms, and those in the medical device industry) very much need to have regulators make sure that there’s a level playing field. While the Rogers outage was painful for their customers and no doubt very difficult for the company itself, it’s quite likely that the end result will be that the regulator requires all Canadian telecoms to strengthen the quality of this crucial service.
Similarly, it would help if other critical service industries that rest on the backbone of complex technologies were also encouraged to strengthen their product quality and reliability. To do this fairly, companies building products that we humans rely upon for our safety and well-being need regulators that set a level playing field so that companies that do the right thing are not placed at a financial disadvantage.