Netdata Team

June 14, 2023

Release 1.40.0: Dashboard Summary Tiles, Silencing alerts, ML tweaks and more!

Advancing Monitoring with Latest Features and Updates

Another release of the Netdata Monitoring solution is here!

Netdata Growth

🚀 Our community is growing steadily. ❤️ Thank you! Your love and acceptance give us the energy and passion to work harder to make monitoring simpler, more effective and more fun to use.

  • Over 63,000 GitHub Stars ⭐
  • Over 1.5 million online nodes
  • Almost 94 million sessions served
  • Over 600 thousand total nodes in Netdata Cloud
    Wow! Netdata Cloud is about to become the biggest and most scalable monitoring infra ever created!

Let the world know you love Netdata. Give Netdata a ⭐ on GitHub now. Motivate us to keep pushing forward!

Unlimited Docker Hub Pulls!

To help our community use Netdata more broadly, we just signed an agreement with Docker for the purchase of Rate Limit Removal, which will remove all Docker Hub pull limits for the Netdata repos at Docker Hub. We expect this add-on to be applied to our repos in the following few days, so that you will enjoy unlimited Docker Hub pulls of Netdata Docker images for free!

Release Highlights

Dashboard Sections' Summary Tiles

Netdata Cloud dashboards have been improved to provide instant summary tiles for most of their sections. This includes system overview, disks, network interfaces, memory, mysql, postgresql, nginx, apache, and dozens more.

To accomplish this, we extended the query engine of Netdata to support multiple grouping passes, so that queries like “sum metrics by label X, and then average by node” are now possible. At the same time we made room for presenting anomaly rates on them (vertical purple bar on the right) and significantly improved the tile placement algorithm to support multi-line summary headers and precise sizing and positioning, providing a look and feel like this:

image
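
As an illustration of what a two-pass grouping looks like, here is a minimal Python sketch of "sum metrics by label X, and then average by node". It is not the query engine's code: the sample rows, the field layout and the function name are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (node, label_x, value). All names and numbers are invented.
samples = [
    ("node1", "eth0", 10.0),
    ("node1", "eth0", 5.0),
    ("node1", "eth1", 20.0),
    ("node2", "eth0", 30.0),
    ("node2", "eth0", 5.0),
    ("node2", "eth1", 40.0),
]


def sum_by_label_then_average_by_node(rows):
    # Pass 1: sum the values of each label within each node.
    per_node_label = defaultdict(float)
    for node, label, value in rows:
        per_node_label[(node, label)] += value

    # Pass 2: for each label, average the per-node sums across nodes.
    per_label = defaultdict(list)
    for (_node, label), total in per_node_label.items():
        per_label[label].append(total)
    return {label: mean(totals) for label, totals in per_label.items()}


result = sum_by_label_then_average_by_node(samples)
print(result)
```

The point of the second pass is that it operates on the output of the first one, which a single "group by" cannot express.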

The following chart tile types have been added:

  • Donut
  • Gauge
  • Bar
  • Trendline
  • Number
  • Pie chart

To make these tiles more efficient to use, each of them supports the following interactive actions:

  1. Clicking the title of the tile scrolls the dashboard to the data source chart, where you can slice, dice and filter the data on which the tile is based.
  2. When you hover over the tile with your mouse pointer, the NIDL (Nodes, Instances, Dimensions, Labels) framework buttons appear, allowing you to explore and filter the data set, right on the tile.

Some examples that you can see from the Netdata Demo space:

Silencing of Cloud Alert Notifications

Although Netdata Agent alerts support silencing, centrally dispatched alert notifications from Netdata Cloud were missing that feature. Today, we release alert notifications silencing rules for Netdata Cloud!

Silencing rules can be applied to any combination of the following: users, rooms, nodes, host labels, contexts (charts), alert name, alert role. For the matching alerts, silencing can optionally have a starting date and time and/or an ending date and time.
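
To make the matching logic concrete, here is a minimal Python sketch of how such a rule could be evaluated. It is only an illustration under stated assumptions: the `SilencingRule` class, its fields and the alert dictionary are hypothetical, cover only a subset of the criteria, and are not the Netdata Cloud schema or API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SilencingRule:
    # Hypothetical model: None means "match anything" for that field.
    node: Optional[str] = None
    context: Optional[str] = None
    alert_name: Optional[str] = None
    starts_at: Optional[datetime] = None  # optional start of the silencing window
    ends_at: Optional[datetime] = None    # optional end of the silencing window

    def silences(self, alert: dict, now: datetime) -> bool:
        # Outside the optional time window the rule does not apply.
        if self.starts_at is not None and now < self.starts_at:
            return False
        if self.ends_at is not None and now >= self.ends_at:
            return False
        # Every criterion that is set must match the alert.
        criteria = {
            "node": self.node,
            "context": self.context,
            "alert_name": self.alert_name,
        }
        return all(
            wanted is None or alert.get(key) == wanted
            for key, wanted in criteria.items()
        )


# A maintenance window for one node, from 22:00 to 02:00.
rule = SilencingRule(
    node="db1",
    starts_at=datetime(2023, 6, 20, 22, 0),
    ends_at=datetime(2023, 6, 21, 2, 0),
)
alert = {"node": "db1", "context": "disk.space", "alert_name": "disk_space_usage"}

during_window = rule.silences(alert, datetime(2023, 6, 20, 23, 0))
after_window = rule.silences(alert, datetime(2023, 6, 21, 3, 0))
other_node = rule.silences({**alert, "node": "web1"}, datetime(2023, 6, 20, 23, 0))
```

A rule with no start and no end applies immediately and indefinitely, which corresponds to the "applied immediately" case.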

With this feature you can now easily set up silencing rules, which can be applied immediately or on a defined schedule, allowing you to plan for upcoming scheduled maintenance windows - see some examples here.

Read more about Silencing Alert notifications on our documentation.

Machine Learning - Extended Training to 24 Hours

Netdata trains ML models for each metric, using its past data. This allows Netdata to detect anomalous behaviors in metrics, based exclusively on the recent past data of the metric itself.

Before this release, Netdata trained one model for each metric, learning the behavior of each metric during the last 4 hours. In the previous release we introduced persisting these models to disk and loading them back when Netdata restarts.

In this release we change the default ML settings to maintain multiple trained models per metric, covering the behavior of each metric for the last 24 hours. All these models are now consulted automatically in order to decide if a data collection point is anomalous or not.

This has been implemented in a way that avoids introducing additional CPU overhead on Netdata agents. Instead of training one model on 24 hours of data, which would introduce significant query overhead on the server, we train each metric every 3 hours using the last 6 hours of data, and we keep 9 models per metric. The most recent model is consulted first during anomaly detection. Additional models are consulted as long as the previous ones predict an anomaly. Only when all 9 models agree that a data collection point is anomalous do we mark the collected sample as anomalous in the database.
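
The consultation order described above can be sketched in a few lines of Python. This is an illustration only, not the agent's implementation: a "model" here is any callable that returns True when it considers a sample anomalous, `range_model` is a made-up stand-in for a trained model, and returning False when no models exist is our own assumption.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a trained model: returns True if the sample looks anomalous.
Model = Callable[[float], bool]


def is_anomalous(sample: float, models: Sequence[Model]) -> bool:
    """Consult models from most recent to oldest; stop at the first that finds the sample normal."""
    if not models:
        # Assumption for this sketch: with no trained models, nothing is flagged.
        return False
    for model in models:
        if not model(sample):
            # One model recognizes this behavior, so older models are not consulted.
            return False
    # Every model agreed that the sample is anomalous.
    return True


def range_model(low: float, high: float) -> Model:
    # Toy model: anything outside the range seen during "training" is anomalous.
    return lambda value: not (low <= value <= high)


# Most recent model first.
models = [range_model(0, 10), range_model(0, 50), range_model(5, 20)]

seen_earlier_today = is_anomalous(30, models)   # second model has seen values like this
never_seen = is_anomalous(100, models)          # all models agree
usual_value = is_anomalous(5, models)           # first model already says normal
```

Because evaluation stops at the first model that considers the sample normal, the common case of a normal sample costs a single model lookup.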

The impact of these changes is more accurate anomaly detection out of the box, with far fewer false positives.

You can read more about it in this deck presented during a recent office hours (office hours recording).

Rewritten SSL Support for the Agent

The SSL support in the Netdata Agent has been completely rewritten. The new code now reliably supports SSL connections for both the Netdata internal web server and streaming. It is also easier to understand, troubleshoot and expand. At the same time, performance has been improved by removing redundant checks.

During this process a long-standing bug in streaming connection timeouts has been identified and fixed, making streaming reliable and robust overall.

Alerts and Notifications

Mattermost notifications for Business Plan users

To keep building on our set of existing alert notification methods, we added Mattermost as another notification integration option on Netdata Cloud. It provides another reliable way to deliver alerts to your team, helping ensure the continuity and reliability of your services.

Business Plan users can now configure Netdata Cloud to send alert notifications to their team on Mattermost.

Visualizations / Charts and Dashboards

Netdata Functions

This builds on the work done in release v1.38, where we introduced real-time functions that enable you to trigger specific routines to be executed by a given Agent on demand. Our initial function provided detailed information on currently running processes on the node, effectively replacing top and iotop.

We have now added the capability to group your results by specific attributes. For example, on the Processes function you are now able to group the results by: Category, Cmd or User. With this capability you can now get a consolidated view of your reported statistics over any of these attributes.
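
Conceptually, grouping consolidates rows that share the same value of the chosen attribute. A minimal Python sketch, assuming made-up rows and a hypothetical `group_by` helper (this is not the function's actual output format or code):

```python
from collections import defaultdict

# Hypothetical rows, loosely shaped like a process listing; fields and values are invented.
processes = [
    {"Cmd": "nginx", "User": "www-data", "Category": "web", "cpu": 1.5},
    {"Cmd": "nginx", "User": "www-data", "Category": "web", "cpu": 0.5},
    {"Cmd": "postgres", "User": "postgres", "Category": "sql", "cpu": 3.0},
]


def group_by(rows, attribute, metric):
    # Sum the chosen metric over all rows sharing the same attribute value.
    totals = defaultdict(float)
    for row in rows:
        totals[row[attribute]] += row[metric]
    return dict(totals)


cpu_by_cmd = group_by(processes, "Cmd", "cpu")
cpu_by_user = group_by(processes, "User", "cpu")
```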

External plugin integration

The agent core now integrates more reliably with external plugins. Under certain conditions, a failed plugin would not be correctly acknowledged by the agent, resulting in a defunct (i.e. zombie) plugin process. This is now fixed.
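
For readers unfamiliar with defunct processes: on POSIX systems, a child that has exited stays in the process table until its parent collects its exit status. The Python sketch below only illustrates that general mechanism with the standard `subprocess` module; it is not the agent's code, and `run_and_reap` is a name made up for the example.

```python
import subprocess
import sys


def run_and_reap(args):
    """Start a child process and wait for it, so its exit status is collected."""
    child = subprocess.Popen(args)
    # wait() collects the exit status. On POSIX, an exited child that is never
    # waited on lingers as a defunct (zombie) entry until its parent reaps it
    # or the parent itself exits.
    return child.wait()


# A child that fails immediately with exit code 3, standing in for a failed plugin.
exit_code = run_and_reap([sys.executable, "-c", "import sys; sys.exit(3)"])
print(exit_code)
```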

Preliminary steps to split native packages

Starting with this release, our official DEB/RPM packages have been split so that each external data collection plugin is in its own package instead of having everything bundled into a single package. We have previously had our CUPS and FreeIPMI collectors split out like this, but this change extends that to almost all of our external data collectors. This is the first step towards making these external collectors optional on installs that use our native packages, which will in turn allow users to avoid installing things they don’t actually need.

Short-term, these external collectors are listed as required dependencies to ensure that updates work correctly. At some point in the future almost all of them will be changed to be optional dependencies so that users can pick and choose which ones they want installed.

This change also includes a large number of fixes for minor issues in our native packages, including better handling of user accounts and file permissions and more prevalent usage of file capabilities to improve the security of our native packages.

Acknowledgements

We would like to thank our dedicated, talented contributors that make up this amazing community. The time and expertise that you volunteer are essential to our success. We thank you and look forward to continuing to grow together to build a remarkable product.

  • @n0099 for fixing typos in the documentation.
  • @mochaaP for fixing cross-compiling issues.
  • @jmphilippe for making control address configurable in python.d/tor.
  • @TougeAI for documenting the “age” configuration option in python.d/smartd_log.
  • @EricAndrechek for adding support of python-oracledb to python.d/oracledb.

Contributions

Collectors

Improvements

  • Add parent_table label to table/index metrics (go.d/postgres) (#1199, @ilyam8)
  • Make tables and indexes limit configurable (go.d/postgres) (#1200, @ilyam8)
  • Add Hyper-V metrics (go.d/windows) (#1164, @thiagoftsm)
  • Add “maps per core” config option (ebpf.plugin) (#14691, @thiagoftsm)
  • Add plugin that collects metrics from /sys/fs/debugfs (debugfs.plugin) (#15017, @thiagoftsm)
  • Add support of python-oracledb (python.d/oracledb) (#15074, @EricAndrechek)
  • Make control address configurable (python.d/tor) (#15041, @jmphilippe)
  • Make connection protocol configurable (python.d/oracledb) (#15104, @ilyam8)
  • Add availability status chart and alarm (freeipmi.plugin) (#15151, @ilyam8)
  • Improve error messages when legacy code is not installed (ebpf.plugin) (#15146, @thiagoftsm)

Bug fixes

  • Fix handling of newlines in HELP (go.d/prometheus) (#1196, @ilyam8)
  • Fix collection of bind mounts (diskspace.plugin) (#14831, @MrZammler)
  • Fix collection of zero metrics if Zswap is disabled (debugfs.plugin) (#15054, @ilyam8)

Other

  • Document the “age” configuration option (python.d/smartd_log) (#15171, @TougeAI)
  • Send EXIT before exiting (freeipmi.plugin, debugfs.plugin) (#15140, @ilyam8)

Documentation

Packaging / Installation

Streaming

  • Streaming improvements and rewrite of SSL support in Netdata (#15113, @ktsaou)

Health

Exporting

ML

Other Notable Changes

Improvements

Bug fixes

Code organization

Deprecation notice

The following items will be removed in our next minor release (v1.41.0). Patch releases (if any) will not be affected.

Component | Type | Will be replaced by
python.d/nvidia_smi | collector | go.d/nvidia_smi
family attribute | alert configuration and Health API | chart labels attribute (more details on netdata#15030)

When using Netdata Cloud, the agent version required to get the most out of the latest features is one version before the latest stable release. With this release, that becomes v1.39.1, and you’ll be notified and guided to take action in the UI if you are running agents on lower versions.

Check here for details on how to Update Netdata agents.

Netdata Release Meetup

Join the Netdata team on the 19th of June at 16:00 UTC for the Netdata Release Meetup.

Together we’ll cover:

  • Release Highlights.
  • Acknowledgements.
  • Q&A with the community.

RSVP now - we look forward to meeting you.

Support options

As we grow, we stay committed to providing the best support ever seen from an open-source solution. Should you encounter an issue with any of the changes made in this release or any feature in the Netdata Agent, feel free to contact us through one of the following channels:

  • Netdata Learn: Find documentation, guides, and reference material for monitoring and troubleshooting your systems with Netdata.
  • GitHub Issues: Make use of the Netdata repository to report bugs or open a new feature request.
  • GitHub Discussions: Join the conversation around the Netdata development process and be a part of it.
  • Community Forums: Visit the Community Forums and contribute to the collaborative knowledge base.
  • Discord Server: Jump into the Netdata Discord and hang out with like-minded sysadmins, DevOps, SREs, and other troubleshooters. More than 1400 engineers are already using it!

Running survey

Help us make Netdata even greater! We are trying to gather valuable information that is key for us to better position Netdata and ensure we keep bringing more value to you.

We would appreciate it if you could take some time to answer this short survey (4 questions only).