Around 4.05am on Wednesday November 8 2023, Optus suffered a nationwide network outage lasting well into the evening, more than 12 hours later.
Now, Optus has released some information on what happened, stating “we now know what the cause was and have taken steps to ensure it will not happen again”.
As a telecommunications expert, I believe we should have no confidence in this statement, because the poorly worded explanation leaves many questions unanswered.
Could a similar outage happen again? We don’t know – but there are ways to make it less likely.
How did the outage unfold?
The Optus outage caused all services to go offline. Landlines, mobile phones, home internet, small business and enterprise, and cloud connections all dropped out.
The most serious impact of the outage was that Optus landlines couldn’t dial 000 and Optus mobile phones were unable to connect to the 000 emergency call service unless the connection occurred through Telstra or Vodafone infrastructure.
More than 10 million Optus customers were affected by the outage, which brought Melbourne’s trains to a halt and left Optus’s small business customers unable to carry out EFTPOS transactions.
So, what went wrong with Optus?
Optus has revealed that a “routine software upgrade” triggered a cascading failure in the Optus internet protocol (IP) core network – the central backbone of their network that authorises device access and provides customer management.
In its statement, Optus said: “These routing information changes propagated through multiple layers in our network and exceeded preset safety levels on key routers which could not handle these. This resulted in those routers disconnecting from the Optus IP Core network to protect themselves.”
Routing information is used to find a path from one location on the internet to another – a router is a device that manages the traffic flows.
The explanation provided by Optus points to human error, confirming what industry experts had suspected. The flood of “routing information changes” that followed the upgrade overwhelmed key routers in the core network, causing them to disconnect and bringing the entire network to a halt.
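To make the idea of a “preset safety level” concrete, here is a minimal sketch in Python. It is an illustration only and not a description of Optus’s actual equipment or configuration: the `Router` class, the `max_routes` limit, the router names and the address ranges are all hypothetical. It assumes the safeguard works as a simple cap on how many routes a router will accept before it disconnects itself.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy router that accepts route announcements up to a preset limit."""
    name: str
    max_routes: int                       # hypothetical preset safety level
    routes: set = field(default_factory=set)
    connected: bool = True

    def receive(self, prefixes):
        """Accept announced prefixes; disconnect if the limit is exceeded."""
        if not self.connected:
            return
        self.routes.update(prefixes)
        if len(self.routes) > self.max_routes:
            # Self-protection: leave the network instead of exhausting resources.
            self.connected = False
            self.routes.clear()


def propagate(routers, prefixes):
    """Send the same routing changes to every router.

    Returns the names of the routers that are still connected afterwards.
    """
    for router in routers:
        router.receive(prefixes)
    return [router.name for router in routers if router.connected]


# Hypothetical network: two core routers with a low limit, one with a higher limit.
routers = [Router("core-1", 1000), Router("core-2", 1000), Router("edge-1", 5000)]

normal = {f"10.0.{i}.0/24" for i in range(200)}
flood = {f"172.16.{i // 256}.{i % 256}/32" for i in range(2000)}

print(propagate(routers, normal))  # all three routers stay connected
print(propagate(routers, flood))   # only edge-1 stays connected
```

In this toy model the same update reaches every router, so every router with the same limit drops out at the same moment. That is why a safeguard that protects each device individually can still take down the network as a whole.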
Should the outage have been prevented?
Outages of this kind are not uncommon – human error has led to major companies going offline in the past.
But an entire telecommunications network going offline is unusual. The network should be designed in such a way that redundancy (backups) and resiliency are built in from the outset.
Before a software upgrade occurs, there should be modelling, testing and several layers of sign-off.
In case something goes wrong, there should be infrastructure and system redundancy. An automated or manual procedure should exist to ensure the redundant systems become operational within a few minutes.
The scale of the outage suggests Optus has deficiencies in several areas, including engineering capability, testing, procedures, network redundancy and resilience.
Optus states they are “committed to learning from what has occurred” and will continue to work to “increase the resilience” of their network.
For this to lead to an effective outcome, Optus will need to carry out a review and put in place new processes, infrastructure and systems to prevent a similar outage in the future.
How do we know a similar outage won’t happen again?
We need stronger government regulation of Australian telecommunications network operators to provide better visibility of the redundancy and resilience of their networks. The Senate has commenced an inquiry into the Optus outage.
Telecommunications is an essential service. Australians should be able to connect to the 000 emergency call service at all times. Reliable access to medical services, EFTPOS and the internet is vital.
If necessary, penalties should be introduced into the Telecommunications Act 1997 to ensure telecommunications network operators implement and maintain “best practice” related to network operation, redundancy and resilience.