Do we need to rethink network monitoring?
Earlier this year, I gave a talk at RIPE NCC 78 in Reykjavik, Iceland on how network monitoring evolved and kept on doing so. I've spoken about the importance of active (synthetic) networking monitoring and the requirement for deeper integration with passive network monitoring, along with requirements of a holistic approach when it comes to monitoring overall. Regardless of how or where you serve the traffic (Edge to data centres over the public Internet or over the private backbones, Inter/Intra Data Center Traffic, or if you moved on-premise infrastructure to the Cloud), approaches outlined in the blog post could be instrumental in meeting your business objectives. Moreover, these could result in preserving or even improving your performance and brand and improve your customer acquisition and retention. You need it, and your business expects it.
Regardless of whether you’re a knowledgeable computer user or not, there is a high probability that you’ve heard of, or have used, traceroute and ping.
People often ping google.com to test if the Internet ‘works’ and use traceroute to find out more about their network performance. These two essential troubleshooting utilities have served us well for quite a long time.
However, as the complexity of computer networks has increased, some of the deficiencies in those tools have emerged. For example, traceroute can fail to discover nodes or report false links, which can send troubleshooting in the wrong direction. Ping works pretty well, but it relies heavily on Internet Control Message Protocol (ICMP) and quite often these days, ICMP is either blocked or heavily policed.
These deficiencies inspired people to write better utilities. That’s how we got the Paris Traceroute, which solves the majority of the issues seen in traditional traceroute. Innovation didn’t stop there; we got tools such as MTR (my traceroute), that network engineers commonly resort to, for troubleshooting packet loss. There’s also Dublin Traceroute, which can peek beyond Network Address Translation (NAT) boundaries, and even complete suites of utilities such as NLNOG Ring. The list goes on.
The challenges in detecting problems
All these tools are used during the troubleshooting cycle once issues are discovered, but there are various ways in which issues are initially found. In the worst-case scenario, customers notice problems first, but often it is a network monitoring solution that detects problems and sends notifications.
Network monitoring solutions have long relied on classic ‘sources of truth’ such as Syslog and Simple Network Management Protocol (SNMP). More recently, with the rise of the Network Reliability Engineering (NRE) approach, developers noticed that many important network metrics and counters weren’t exposed, so they started developing newer data collection methods. These rely on establishing a remote session with the target device, then executing specific commands and storing the results in backend solutions for analysis and tend to be largely automated. For example, many popular networking vendors have implemented gRPC and streaming telemetry solutions.
However, there are challenges with all of these methods. SNMP collection may not have access to all the management information bases (MIBs) needed for sufficient visibility, or the monitoring platform may not support non-standard MIBs. Syslog can be configured to report only at certain severity levels and as a result, important messages can and often do get filtered out.
The automated approach adopted by the NRE teams has also shown that modern platforms have bothersome limitations. For example, it is quite easy to hit a maximum of allowed concurrent ssh sessions, and executing commands to gather detailed Multi-Protocol Label Switching (MPLS) label-switched path (LSP) statistics can create prohibitively high CPU overhead.
Furthermore, all of these mechanisms tax the compute resources that both management and control planes rely on, and can starve resources needed for critical control plane functions such as Best Path Selection.
Finally, some mechanisms such as gRPC, aren’t widely available on current network infrastructure platforms.
Is the network telemetry accurate?
The NRE approach, using Python and Go programming languages, and solutions such as Salt, NAPALM and Ansible means that much of the discovery and remediation of issues can be executed automatically. But once you gain confidence that automation can get information flowing properly, it’s only logical to question whether the telemetry generated by vendor equipment is in fact accurate.
Not only are there somewhat unusual issues with accuracy of data from network equipment, such as bit flipping caused by solar flares (where no in-depth root cause analysis has ever been provided), but it’s not uncommon for engineers to find that metrics aren’t available to aid their troubleshooting (sometimes only after several hours of being engaged with vendor technical support teams).
Is automation enough?
Nobody is going to argue that automation can’t significantly improve event response and help by remediating often repeated incidents that would otherwise consume engineering time. The investment put into automating those events pays off, in the form of more time available for engineers to spend on innovation.
However, the real question is whether automation alone is enough? Automation has helped, but let’s be honest, events often still go undetected for long periods. Or even worse, they get spotted by your users first, which brings with it multiple adverse effects such as loss of confidence in a brand or negative financial impact.
Going beyond passive data collection
Generally, to alert on a specific event, you need to be aware of the possibility of its occurrence. That means, alerts are codified based on previous occurrences.
Unfortunately, that is not how things work in real life on production networks. New events come up, counters may not be available, SNMP may not have a relevant MIB, data may not be supported by your monitoring solution, or gRPC won’t be supported on your platform. More fundamentally, getting all the data you might possibly need places a lot of strain on the networking devices themselves.
Passive data types aren’t bad. But they need to be complemented with synthetic or active monitoring. This means sending simulated user traffic (which is using the same characteristics as the real user traffic) to measure critical performance indicators such as packet loss and latency. An active monitoring approach with automation that provides fast response and remediation is a must. Especially when you now rely on so many networks that aren’t directly under your control, meaning you can’t collect passive data from the network devices.
A holistic approach is needed
Whether you’re working in network or service reliability, teams should adopt a more holistic approach and stop blaming each other.
No, the network is not an unlimited resource, as many developers tend to treat it. On the other side, not every issue should be addressed as a bug or as a service-related failure, as network engineers may attempt to prove.
From experience, we learnt that symptoms in one layer of the stack often represent issues in another one and vice versa. Therefore, it is quite essential to have full visibility on the service side.
All of these efforts combined provide you the opportunity to evolve your network monitoring to a state where you can reliably identify what the issue is and where it happened in a timely matter. You need it, and your business expects it.