Published on August 20, 2020
How to deal with network service degradations (and keep CTOs happy)
“What is wrong with the network?” Not exactly the question that an operations team at a mobile network operator (MNO) wanted to hear when the chief technology officer (CTO) called. The CTO was attending a major sporting event, and said that neither he nor his wife had been able to access the mobile network for a few minutes. “We have to fix this,” the CTO told the team.
I recalled this story¹ recently when reading the news about a network service issue at a well-known MNO. Interestingly, mobile network availability and performance are taken for granted when things go well. But whenever issues occur, with service degradation ranging from slight to full-scale (i.e., outage), MNOs incur the wrath of their customers. So, how can MNOs deal with service degradations² in the best possible way?
Detect network service degradations
In our story, service degradation was reported by a “customer” who happened to be heading the MNO operations team. Although such queries are important for MNOs, the impact of network/service issues discovered by customers can be detrimental, including negative publicity on social media. Unsurprisingly, the need to detect issues as early as possible, ideally in real time, is frequently highlighted by MNOs.
Most MNOs also understand the importance of warning customers about expected or experienced degradation, and of prioritizing issue resolution based on business criteria. Network/service issues differ in nature and in impact, which necessitates service and customer impact analysis. In practice, some network faults may hardly be noticeable, while other issues affect key (e.g., enterprise) customers or CTOs attending sporting events…
Troubleshoot network service degradations
Detecting and prioritizing network/service issues is one thing. To address an issue (and, if possible, avoid it in the future), MNOs must find its root cause. In our story, the reported degradation was due to insufficient network capacity. Detailed analysis later showed that the sporting event spectators—who included many national/international roamers—caused network congestion through their unexpectedly high use of social media apps.
Unfortunately, issues may be far from straightforward to troubleshoot. Network/service complexity and visibility limitations regularly lead to hours, if not days, of investigation that involve different MNO employees, “war rooms”, etc. Some issues cannot even be fully resolved, becoming a recurring headache for CTOs. Yes, for their operations teams too.
Predict and preempt network service degradations
In effect, MNOs have been taking some form of preventative action for years. Still, that action has often been suboptimal. For example, adding cell capacity based on the demand expected (by the relevant MNO team) at the sporting event proved inadequate in our story. So, MNOs have been looking for more accurate, efficient and dynamic ways³ to predict and preempt network/service issues.
For 5G health- and life-related use cases and for industrial automation, where every millisecond matters and where degradations may have lethal repercussions, this preemptive capability becomes essential. At the same time, the experience of “traditional” mobile users, including CTOs, should not be overlooked.
Use AI and automation against network service degradations
How can we better detect, troubleshoot, predict and preempt service degradations? Enter artificial intelligence (AI) and automation⁴. Exact definition aside, AI constitutes a key enabler of intelligent operations. AI promises to help MNOs address increasing multilayer network/service complexity, unveil hidden relationships between seemingly unrelated events, learn from past incidents, identify issues early (e.g., the network congestion in our story) and more.
AI is also a crucial foundation for automated operations, including real-time, near-real-time or offline diagnostics and analysis. In summary, MNOs expect AI and automation to support their efforts to enhance the experience of users and, in the 5G era of massive and critical IoT (Internet of Things), of devices, and to meet strict service level agreements (SLAs). The ultimate goal: customers should hardly ever notice service degradations. In other words, no more unexpected calls from CTOs to operations teams.
For the record, the issue in our story could only be resolved after the event had ended, which prompted the MNO to look for new solutions to help identify and analyze service degradations. When will such degradations and such stories become memories of a challenging mobile network past? Not easy to tell. Also, for now, even attending an event in a packed stadium is not an option⁵. But we can reassure the CTO in our story, and every MNO CTO concerned about service degradations: we are going to fix this.
1 A colleague told me this story, related to a football match between two national teams in a packed stadium.
2 The degradation reference includes outages. Despite the debilitating effect of outages, service degradations such as reduced download speeds should not be overlooked, and will gain in significance with 5G.
3 As per the discussion about moving from a reactive-diagnostic to a predictive-prescriptive model of operations. In reality, the operational evolution of MNOs is a multifaceted process (e.g., it must incorporate faster reaction to network/service issues).
4 AI is a “hot” term that encompasses machine learning (ML). AI/ML-based solutions for network/service operations make use of various inputs, e.g., data from active/passive monitoring agents (or virtual probes). Intelligent automation can improve issue prioritization and troubleshooting, e.g., by enhancing the knowledge of topology (the links between network elements, services and customers).
5 Due to COVID-19, big sporting events have been cancelled/postponed while spectators are not permitted at events (e.g., football matches), with few exceptions.