Using OAM for Efficient and Cost-Effective Ethernet Backhaul Network Management

Mobile carriers are increasingly challenged by the ever-growing demand for bandwidth to support the data usage of their customers. Legacy technologies such as PDH (T1/E1, etc.) do not scale to meet the needs of these carriers because their cost increases directly with bandwidth. More and more, Ethernet is seen as the only solution that can meet the requirements of next-generation mobile backhaul networks from both a CAPEX and an OPEX perspective. In addition, this next-generation backhaul network infrastructure enables mobile carriers to deliver value-added services at incremental costs. It also provides new revenue streams for existing wireline providers and new access vendors who can sell wholesale services to mobile carriers by leveraging existing backhaul infrastructure and capacity.

Moving to a new technology comes with its own challenges. Both mobile carriers and wholesale access vendors are concerned about managing the service lifecycle effectively to reduce operational costs. They are also interested in understanding the performance of their backhaul links so that they can segment the network and isolate troublesome spots easily, optimize performance and increase customer quality of experience (QoE). Ideally, they would like to fix as many problems as possible without having to dispatch trucks, as truck rolls tend to be expensive.

For mobile carriers and wholesale access vendors, understanding the historical performance of their networks also means they can more effectively plan for the future. This includes keeping up with changing network quality trends over many dimensions such as time, regions, customers, partners, etc.

The Metro Ethernet Forum (MEF) has defined the mobile backhaul implementation agreement as part of MEF 22, which provides generic specifications for Ethernet backhaul architectures for mobile networks (2G, 3G and 4G) and explains how existing MEF specifications are applied. The implementation agreement contains guidelines for the architecture, equipment and operation of the mobile backhaul part of the network. One of the key components of MEF 22 is using Ethernet operations, administration and maintenance (OAM) for fault management and performance monitoring of the mobile backhaul network. Ethernet OAM draws on and includes existing standards such as IEEE 802.1ag for connectivity fault management (CFM) and ITU-T Y.1731 for performance monitoring.

This article describes how Ethernet OAM can be used to manage the Ethernet service lifecycle effectively and reduce the operational costs of a mobile backhaul network.

Managing the Service Lifecycle Effectively

The service lifecycle can be divided into four broad phases: network construction, service turn-up, service assurance and service troubleshooting. Network construction describes the physical wiring of the network elements. Service turn-up relates to validating that the Ethernet service has been installed and provisioned correctly. Service assurance relates to measuring the key performance indicators (KPIs) for continuous service (24/7/365); validating that service-level agreements (SLAs) are being met; and ensuring that any service problems are being detected. Service troubleshooting relates to isolating, diagnosing and fixing service problems that are detected and validating that the fix actually solved the problem.

Both the mobile carrier buying the service from a wholesale operator and the wholesale operator selling the service to the mobile carrier are interested in all aspects of the service lifecycle. For example, wholesale operators need to provide “birth certificates” to their customers to prove that they have turned up the service correctly and according to the SLA. Both wholesale operators and mobile carriers independently want to ensure the performance of the service—wholesale operators to make sure that they are living up to their SLAs and mobile carriers to be assured that they are getting the service they have paid for. Even for carriers that own their backhaul network, this demarcation between the wholesale portion of the network and the services that are carried over the network simplifies network management and enables easier isolation of problems.

Network Construction

A clean installation of the physical link between the tower and the mobile switching center is a necessary part of enabling any new service in the network. Before turning up new services, technicians must characterize the physical link that connects the tower to the network. In most legacy networks, the physical link is made of copper; in new 3G/4G deployments, fiber is becoming the preferred medium. In either case, the physical layer must be characterized to ensure it will perform as expected. This is done through either copper-pair characterization or fiber-to-the-tower characterization, which allows a field technician to locate physical-layer faults, qualify the copper local loop, measure link loss and return loss, qualify WDM channels with an optical spectrum analyzer (OSA), and validate connector cleanliness.

Service Turn-Up

Service turn-up relates to validating that the Ethernet service has been installed and provisioned correctly according to the SLA. Typically, the SLA includes a guaranteed level of throughput (a committed information rate, or CIR), a guaranteed level of burstable throughput (an excess information rate, or EIR), and guarantees on loss, latency and delay variation, etc. RFC 2544 from the Internet Engineering Task Force (IETF) provides a method for validating these metrics. It defines a set of tests (throughput, loss, latency and back-to-back frames) that can be run to ensure that the service meets the agreed-upon performance guarantees.
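As an illustration of the throughput test, the sketch below searches for the highest offered rate at which a trial loses no frames. RFC 2544 only says to raise the rate after a loss-free trial and lower it after a lossy one; the binary search, the function name find_throughput and the simulated circuit are assumptions made for this example and do not describe any particular test set.

```python
from typing import Callable


def find_throughput(send_trial: Callable[[float], int],
                    line_rate_mbps: float,
                    resolution_mbps: float = 1.0) -> float:
    """Return the highest tested rate (Mbit/s) at which a trial lost no frames.

    send_trial(rate) is a hypothetical callback that runs one trial at the
    given offered rate and returns the number of frames lost.
    """
    # If the circuit carries line rate without loss, no search is needed.
    if send_trial(line_rate_mbps) == 0:
        return line_rate_mbps

    low = 0.0              # highest rate known to be loss-free
    high = line_rate_mbps  # lowest rate known to lose frames
    while high - low > resolution_mbps:
        rate = (low + high) / 2
        if send_trial(rate) == 0:
            low = rate     # loss-free: try a higher rate next
        else:
            high = rate    # frames lost: try a lower rate next
    return low


def simulated_circuit(rate_mbps: float) -> int:
    """Stand-in for a real circuit: drops frames above 100 Mbit/s."""
    return 0 if rate_mbps <= 100.0 else int(rate_mbps - 100.0) + 1


throughput = find_throughput(simulated_circuit, line_rate_mbps=1000.0)
print(f"Measured throughput: {throughput:.1f} Mbit/s")
```

The measured value can then be compared with the CIR in the SLA; a real test would repeat the search for each frame size under test.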

The most important output of the service turn-up procedure is the birth certificate, a performance report that can be used as a circuit sign-off with the end user. The birth certificate is the documentation that inspires customer confidence as it provides a consistent procedure to get the service running quickly and efficiently; it can also be used as a baseline for future performance validation and comparisons.

Service turn-up can be accomplished in one of two ways: inside-out testing or outside-in testing. Inside-out testing refers to running a turn-up test from a central test head to a device at the service demarcation location that is in loopback. The device at the demarcation location can be a cell site router, a NID, a handheld or a simple loopback device. Outside-in testing refers to running the turn-up test from the demarcation location to a central test head. Inside-out testing tends to be more economical than outside-in testing because it does not require a technician or any other user at the demarcation location. By automating the procedures necessary to put the remote device in loopback, inside-out testing can be run by providers whenever it is deemed necessary.

An important aspect of service turn-up testing is to capture key service metrics in both directions (upstream and downstream). This allows the turn-up process to capture any direction-specific characteristics of the service. No matter how the turn-up tests are run, the results of the tests are used to generate birth certificates and archived for future comparisons.
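A minimal sketch of how per-direction results might be checked against the SLA to produce a birth certificate follows. The field names, the sample numbers and the pass/fail rules are illustrative assumptions, not the format of any product report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sla:
    cir_mbps: float
    max_loss_pct: float
    max_latency_ms: float
    max_delay_variation_ms: float


@dataclass(frozen=True)
class Measurement:
    throughput_mbps: float
    loss_pct: float
    latency_ms: float
    delay_variation_ms: float


def check_direction(m: Measurement, sla: Sla) -> dict:
    """Compare one direction's turn-up results with the SLA, metric by metric."""
    return {
        "throughput": m.throughput_mbps >= sla.cir_mbps,
        "loss": m.loss_pct <= sla.max_loss_pct,
        "latency": m.latency_ms <= sla.max_latency_ms,
        "delay_variation": m.delay_variation_ms <= sla.max_delay_variation_ms,
    }


def birth_certificate(results: dict, sla: Sla) -> dict:
    """Build a report keyed by direction; the circuit passes only if every
    metric passes in every direction."""
    checks = {direction: check_direction(m, sla) for direction, m in results.items()}
    passed = all(all(c.values()) for c in checks.values())
    return {"checks": checks, "passed": passed}


# Hypothetical SLA and measurements, for illustration only.
sla = Sla(cir_mbps=100.0, max_loss_pct=0.1, max_latency_ms=10.0,
          max_delay_variation_ms=2.0)
results = {
    "upstream": Measurement(100.0, 0.0, 6.5, 0.8),
    "downstream": Measurement(100.0, 0.0, 12.1, 0.9),  # latency exceeds the SLA
}
certificate = birth_certificate(results, sla)
print(certificate["passed"])
```

In this example a single averaged figure could hide the downstream latency problem; keeping the two directions separate exposes it.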

Service Assurance

Service assurance entails ensuring the performance characteristics of the Ethernet service on an ongoing basis (i.e., 24 hours a day, 365 days a year) and validating that the service performance meets the conditions stipulated in the SLAs. The performance data collected can be compared against pre-defined thresholds to generate alarms when performance levels are violated. Service assurance enables early detection of problems, improving network reliability and customer satisfaction. The ITU and IEEE standards described earlier (Y.1731 and 802.1ag) provide the key underpinnings for measuring service performance at a very low cost.

Both Y.1731 and 802.1ag define various mechanisms for collecting the data related to Ethernet service performance. The main mechanism is driven by the delay measurement message (DMM) in Y.1731. This message, together with the associated delay measurement reply (DMR), allows a provider to measure all the performance characteristics for a service, including loss, latency, delay variation and availability. A central test head at the service demarcation location periodically sends DMM messages to the demarcation device at the other end of the circuit. The demarcation device can again be a cell site router, a NID or a simple DMM responder. Whenever the remote device receives a DMM message, it replies with a DMR response. The test head uses this response to compute the KPIs for the circuit. By generating these messages continuously and periodically, a provider has all the data needed to understand the service performance.
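The sketch below shows one way a test head could turn DMM/DMR exchanges into KPIs. The two-way delay formula follows Y.1731: the round-trip time seen by the test head minus the time the responder held the frame. Counting unanswered DMMs as loss, taking delay variation as the difference between consecutive delays, and the class and function names are assumptions made for this example. Availability is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DmrReply:
    """Timestamps (microseconds) carried by one DMM/DMR exchange."""
    tx_f: int  # TxTimeStampf: DMM leaves the test head
    rx_f: int  # RxTimeStampf: DMM arrives at the responder
    tx_b: int  # TxTimeStampb: DMR leaves the responder
    rx_b: int  # RxTimeb: DMR arrives back at the test head


def two_way_delay(r: DmrReply) -> int:
    """Round-trip time minus responder processing time.

    Each difference uses a single clock, so the two devices do not need
    synchronized clocks for this measurement.
    """
    return (r.rx_b - r.tx_f) - (r.tx_b - r.rx_f)


def summarize(probes: List[Optional[DmrReply]]) -> dict:
    """Compute KPIs from a series of probes; None marks a DMM with no DMR."""
    delays = [two_way_delay(r) for r in probes if r is not None]
    lost = len(probes) - len(delays)
    # Delay variation between consecutive answered probes.
    variations = [abs(b - a) for a, b in zip(delays, delays[1:])]
    return {
        "loss_pct": 100.0 * lost / len(probes) if probes else 0.0,
        "avg_delay_us": sum(delays) / len(delays) if delays else None,
        "max_delay_variation_us": max(variations, default=0),
    }


# Made-up timestamps; the responder clock is offset by about one second.
probes = [
    DmrReply(0, 1_000_400, 1_000_450, 950),
    None,  # this DMM was never answered
    DmrReply(2_000, 1_002_400, 1_002_420, 3_120),
    DmrReply(4_000, 1_004_500, 1_004_530, 4_930),
]
kpis = summarize(probes)
print(kpis)
```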

Comparing the performance data collected above with thresholds set for that service helps identify any problems with the service. This can be used to create tickets for service outages and generate alarms for service degradations and failures. The data collected can be used to produce reports on the daily health of the service and can be integrated into other operational support systems (OSS), billing and troubleshooting systems. The wealth of data collected can also be used to yield audience-appropriate reports that can answer questions from various groups in a provider organization—operations, engineering, account managers, executives, etc.
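The comparison against thresholds can be sketched as follows. The two-level scheme (a degradation threshold and a failure threshold per KPI), the KPI names and the values are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def classify(kpis: Dict[str, float],
             thresholds: Dict[str, Tuple[float, float]]) -> List[Tuple[str, str]]:
    """Return (kpi_name, severity) for every KPI that crosses a threshold.

    thresholds maps a KPI name to (degradation_level, failure_level);
    higher values are assumed to be worse for every KPI listed.
    """
    events = []
    for name, (degradation_level, failure_level) in thresholds.items():
        value = kpis.get(name)
        if value is None:
            continue  # KPI not measured in this interval
        if value >= failure_level:
            events.append((name, "failure"))
        elif value >= degradation_level:
            events.append((name, "degradation"))
    return events


# Hypothetical thresholds for one service.
thresholds = {"loss_pct": (0.1, 1.0), "latency_ms": (8.0, 15.0)}

degraded = classify({"loss_pct": 0.0, "latency_ms": 9.2}, thresholds)
failed = classify({"loss_pct": 2.5, "latency_ms": 9.2}, thresholds)
print(degraded)
print(failed)
```

A failure event could open a trouble ticket, while a degradation event could raise an alarm for operations to investigate before customers notice.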

Ethernet Troubleshooting

Once a service problem is detected, service troubleshooting involves isolating the problem, diagnosing it and then fixing it. Service troubleshooting also encompasses validating that the fix has actually solved the problem. Effective troubleshooting requires the ability to segment the problem and understand which part of the network has contributed to the main issue affecting the service.

IEEE 802.1ag (CFM) provides many of the tools necessary for segmenting service problems. This standard defines mechanisms such as loopback and linktrace that allow a provider to test to various key points in the service path. Continuous testing to these key points, as well as to the demarcation location, also provides valuable historical information that can be used to identify hard-to-debug, transient problems. The biggest advantage of the CFM tests is that they can be run without taking customer traffic down. For even further troubleshooting, one or more of the turn-up sub-tests (throughput, latency, loss or back-to-back) can be used. Unlike the test procedure used in service turn-up, these sub-tests can be quick and of short duration.
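Segmentation with loopback tests can be sketched as a walk along the service path that stops at the first point that does not answer. The ordered path (which could come from a linktrace), the point names and the loopback_ok callback are assumptions for this example; a real system would send CFM loopback messages instead of consulting a set.

```python
from typing import Callable, Optional, Sequence, Tuple


def faulty_segment(path: Sequence[str],
                   loopback_ok: Callable[[str], bool],
                   origin: str = "test-head") -> Optional[Tuple[str, str]]:
    """Return the (last reachable, first unreachable) pair of points, or
    None when every point on the path answers the loopback test."""
    previous = origin
    for point in path:
        if not loopback_ok(point):
            return (previous, point)
        previous = point
    return None


# Hypothetical service path, ordered from the test head toward the cell site.
path = ["agg-switch", "metro-edge", "cell-site-nid"]
reachable = {"agg-switch"}  # simulated loopback results

segment = faulty_segment(path, lambda point: point in reachable)
print(segment)
```

Here the fault lies between the aggregation switch and the metro edge, which narrows down where a technician would need to look if a truck roll cannot be avoided.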

The key aspect of these service-troubleshooting procedures is to restore service performance and avoid truck rolls as much as possible. If a truck roll is necessary, a technician can be sent to the customer premises with a handheld unit and can run a series of tests including the RFC 2544 test to the central test head.

Service Lifecycle Management

EXFO’s BrixNGN solution gives both providers and their customers the tools necessary for successfully managing all the phases of the Ethernet service lifecycle. It includes service turn-up, assurance and troubleshooting as one integrated solution spanning the entire carrier Ethernet network. The solution consists of integrated software and hardware components, collectively called the Brix System, that analyze and display performance data collected from the measurement sources (called Verifiers) deployed throughout the carrier Ethernet network. In the Brix System, advanced performance management applications run on a central-site software engine called BrixWorx.

Managing the Service Lifecycle

Leveraging a comprehensive family of measurement sources (Verifiers), as well as third-party devices and industry standards, the Brix System provides service turn-up, 24/7 network and service performance assurance, and service troubleshooting. The RTU-310 Verifiers are used for service turn-up and troubleshooting; they can run RFC 2544 tests at line rate (up to 10G). The Brix 2500 and Brix 3000 Verifiers are used for service assurance and troubleshooting. They provide the OAM capabilities (802.1ag and Y.1731) required for continuous assurance of service performance. The BrixView system provides the historical reporting engine that allows operators to generate birth certificates as well as trend and comparison reports.

With BrixNGN, providers have the required visibility into their network as well as the service performance and quality to prove service-level objectives. BrixNGN evolves network assurance and engineering functions from a break/fix reactionary method to a proactive approach, enabling early detection and quick resolution of service-affecting issues.

Starting at the core and moving out to the edge and then to customers’ endpoints, BrixNGN allows providers to establish strategic points of demarcation that can be used to quickly identify and isolate problems. By continually monitoring critical KPIs such as availability, latency, packet loss and delay variation (jitter), providers can set thresholds and alarms that alert them to potential service degradations and to where in the network they are occurring. BrixNGN collects this information from Brix Verifiers and other devices, using industry standards such as 802.1ag, Y.1731 and TWAMP, to isolate problems through network segmentation and provide a cost-effective method of measuring service quality. With BrixNGN, providers have the information they need to significantly improve mean-time-to-repair (MTTR), reduce trouble tickets and provide more effective and efficient customer care.

Network Capacity Planning and Turn-Up Verification

BrixNGN enables providers to proactively assure and baseline network traffic patterns, throughput and link paths to ensure new services can be properly supported over next-generation networks. When the service goes live, providers can also use BrixNGN to conduct on-demand and scheduled tests to generate instant birth-certificate reports for turn-up validation and detailed troubleshooting results when a problem is identified. This affords providers the assurance that the service(s) worked as promised from the onset and provides a benchmark for future potential service issues.


Service-Level Management

When providers leverage best-effort data delivery systems such as Ethernet for their services, real-time SLAs are a requirement. The BrixNGN module feeds the BrixWorx™ correlation and analysis software engine with the performance and quality information needed to produce the advanced analytics and visualization (real-time dashboards, historical reports and customer portals) used to manage and continually prove SLAs. The Brix System reports address the needs of a broad audience, from technical to executive levels, to provide the business intelligence required for the organization to be successful. By delivering high-level, at-a-glance, audience-appropriate reports, deep diagnostic capabilities and customer-facing portals, the Brix System lets providers simplify SLA management and give customers visibility into their SLAs. The open architecture of the Brix System also allows providers to seamlessly integrate this award-winning converged service assurance solution with their existing OSS and business support systems (BSS) to provide a complete unified view of network and service performance.

Conclusion

Effective management of the service lifecycle is a must for both mobile carriers and wholesale operators to successfully manage their Ethernet backhaul networks. OAM standards provide the key to this effective management. It is in the interest of all providers to push their vendors to implement the same set of approved standards, as these enable interoperability and protect providers from being locked into one vendor’s way of doing things.

The MEF is working on a new message pair called SLM/SLR that alleviates some of the problems with the DMM message. Once the messages are ratified, they should be used in place of the DMM/DMR pair.

EXFO