Anomaly Rate By Type

Deep Dive into Understanding and Managing Unusual System Behavior

We recently added a more detailed anomaly rate chart to Netdata that breaks out the overall node anomaly rate by type. This lets you more easily see which parts of your infrastructure might be experiencing an uptick in anomalies when the overall node anomaly rate increases.

What is type?

The type is generally the prefix of the chart id in Netdata, and it controls where charts live within the menu on the overview page. For example, the mem.available chart has a type of mem, which is part of the reason it lives under the “Memory” section of the menu.

mem

You can read more about type, chart ids, and chart specifications in general in the docs on Netdata Learn.
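As a rough illustration of the "prefix of the chart id" idea, here is a minimal Python sketch. The helper name chart_type is hypothetical and not part of Netdata, and it assumes the type is simply everything before the first dot in the chart id, which matches the mem.available example above but may not hold for every chart.

```python
def chart_type(chart_id: str) -> str:
    """Return the assumed type of a chart: the text before the first dot.

    Assumption for illustration only: chart ids look like "<type>.<name>".
    If there is no dot, the whole id is returned unchanged.
    """
    return chart_id.split(".", 1)[0]


if __name__ == "__main__":
    for cid in ["mem.available", "net_packets.eth0", "disk.sda"]:
        print(cid, "->", chart_type(cid))
```

The example chart ids other than mem.available are made up for the demonstration.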

Having anomaly rates by type lets you quickly get a feel for which “physical” parts of your infrastructure are experiencing an uptick in anomalies when the overall node anomaly rate increases.

Anomaly rate by type

Since every metric in Netdata also has a corresponding anomaly-bit, it’s easy to calculate anomaly rates on any subset of metrics across your infrastructure. This is what we are doing here: we simply calculate the anomaly rate over all metrics that have a given type and then collect that as a “synthetic” or “derived” metric.
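The calculation described above can be sketched in a few lines of Python. This is not Netdata's actual implementation; it is a simplified model that assumes the anomaly bits are already available as lists of 0/1 values keyed by a (chart id, dimension) pair, and that the type is the part of the chart id before the first dot. The function name and the sample data are hypothetical.

```python
from collections import defaultdict


def anomaly_rate_by_type(anomaly_bits):
    """Compute the percentage of anomalous bits per chart type.

    anomaly_bits: dict mapping (chart_id, dimension) -> list of 0/1 bits,
    where 1 means the metric was flagged as anomalous at that time step.
    (Hypothetical input shape, chosen for illustration.)
    """
    total = defaultdict(int)      # number of bits seen per type
    anomalous = defaultdict(int)  # number of bits set to 1 per type
    for (chart_id, _dimension), bits in anomaly_bits.items():
        chart_type = chart_id.split(".", 1)[0]  # assumed: type is the id prefix
        total[chart_type] += len(bits)
        anomalous[chart_type] += sum(bits)
    # Skip types with no data to avoid dividing by zero.
    return {t: 100.0 * anomalous[t] / total[t] for t in total if total[t]}


if __name__ == "__main__":
    sample = {
        ("net.eth0", "received"): [0, 1, 1, 0],
        ("net.eth0", "sent"): [0, 0, 1, 0],
        ("mem.available", "avail"): [0, 0, 0, 0],
    }
    for chart_type, rate in sorted(anomaly_rate_by_type(sample).items()):
        print(f"{chart_type}: {rate:.1f}%")
```

In the sample data, 3 of the 8 net bits are set, so the net type has an anomaly rate of 37.5%, while mem stays at 0%. Pooling all bits of a type before dividing weights every metric equally, which is one reasonable reading of "the anomaly rate over all metrics that have a given type".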

Example 1 - spike in net anomalies

Here is an example of how you might find this useful. In the screenshot below, we can see that within the highlighted area the overall node anomaly rate has suddenly increased to around 3%. You can clearly see this in the anomaly_detection.anomaly_rate chart. This tells us that something is going on, but not much more than that.

Now, with the addition of the anomaly_detection.type_anomaly_rate chart just below, we can see that this spike is mostly made up of spikes within the net and net_packets types.

example1

If we quickly navigate to the network-related parts of the dashboard, sure enough, we can see some anomalies in the net and net_packets charts, where increases in sent and received traffic have been flagged as anomalous.

example1-net

Example 2 - some disk anomalies

Similarly, in the period following our initial anomaly, I was doing some troubleshooting on my disks, so we also see a small spike in the node anomaly rate. This time the anomaly_detection.type_anomaly_rate chart suggests it’s something related to various disk-related types.

example1-disk

And, sure enough, looking at some of my disk charts, I can see the activity that has been flagged as anomalous.

example1-disk-detail

Both of the examples above illustrate how you can easily use the anomaly_detection.type_anomaly_rate chart to quickly see what might be “underneath” the overall node anomaly rate when it increases.

Next steps - app and user anomaly rates

This is our first step (done in this PR) in providing lower-level anomaly rates below the summary node anomaly rate. We are planning to add anomaly rates by app and user soon so that you can easily see which applications and users are experiencing an uptick in anomalies. You can follow along in this GitHub issue, which is part of our wider “Machine Learning Roadmap” project, also publicly available on GitHub.

Of course, you can always also use the Anomaly Advisor to get a more detailed bottom-up view of all anomalies in a highlighted window, or the “AR%” button to surface the highest chart anomaly rates within each menu section. This is just another quick way to view things that might come in useful.

Try it for yourself in our “Machine Learning” demo room.