In 2019, the Netdata team already knew that a Netdata Cloud solution in the form of an online platform would greatly complement Netdata’s distributed monitoring by making it much easier to organize large infrastructures and by enabling new ways for teams to collaborate. The old node registry available at the time wasn’t enough for Netdata’s users.
Building an online platform, even one that does not directly process users’ metrics, is challenging, but less so than it was even a few years ago, because the available technology stack has improved greatly.
Choosing a primary message broker
One of the newer options available is VerneMQ. Developed in Erlang, it offers a set of features that other middleware solutions did not have, primarily MQTT 5 (a lightweight, IoT-focused messaging protocol) and user-friendly clustering. These two features were the primary reasons for choosing it as our main message broker. Other solutions fulfilled the rest of our requirements (specifically, being open source and having industry-proven use cases), but only VerneMQ fit the bill on all counts.
Of course, we also had to confirm that VerneMQ would work with the loads we had in mind. In practice, that meant confirming it would not be too expensive to run per user, especially since we are trying to provide a free service. So the team ran extensive load tests on the platform. VerneMQ did really well! Using load tests built with an MQTT benchmarking tool, the team determined that, on our setups, VerneMQ can handle 23,000 messages per second with a maximum latency of 65 ms in stress tests. CPU and RAM utilization was also reasonable in tests with large numbers of idle connections.
How Netdata uses VerneMQ today
So how does Netdata use VerneMQ today? Agents connect over secure WebSockets (WSS) to an HAProxy load balancer, which feeds connections to VerneMQ. VerneMQ acts as our Agent-Cloud Link broker, using MQTT 3.1, with MQTT 5 available for future-proofing.
All the additional middleware for Netdata Cloud is written in Go, using a standard microservices architecture, with a few additional components and databases handling state information. Two Go services consume from VerneMQ, debounce and clean up messages from Netdata’s node agents related to metrics metadata and alarms, and then push those messages into Apache Pulsar. After further processing, state information is finally updated and stored in CockroachDB.
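To make the debouncing step concrete, here is a minimal sketch in Go of one way such a service could collapse bursts of agent updates before forwarding them. It is not Netdata's actual code: the `Message` fields, the `Debouncer` type, and the quiet-window rule are all assumptions made for illustration, and the broker and Pulsar clients are left out entirely.

```go
package main

import (
	"fmt"
	"sort"
)

// Message is a hypothetical agent update; the fields are illustrative only.
type Message struct {
	NodeID  string // which agent sent the update
	Topic   string // e.g. "alarms" or "metadata"
	Payload string
	AtMs    int64 // arrival time in milliseconds
}

// Debouncer keeps only the latest message per (node, topic) key and
// releases it once that key has been quiet for WindowMs milliseconds.
type Debouncer struct {
	WindowMs int64
	pending  map[string]Message
}

// NewDebouncer returns an empty Debouncer with the given quiet window.
func NewDebouncer(windowMs int64) *Debouncer {
	return &Debouncer{WindowMs: windowMs, pending: make(map[string]Message)}
}

// Add records a message, replacing any older pending one with the same key.
func (d *Debouncer) Add(m Message) {
	d.pending[m.NodeID+"/"+m.Topic] = m
}

// Flush returns, and removes, every pending message whose key has been
// quiet for at least WindowMs as of nowMs. Results are sorted by key so
// the output order is deterministic.
func (d *Debouncer) Flush(nowMs int64) []Message {
	var keys []string
	for k, m := range d.pending {
		if nowMs-m.AtMs >= d.WindowMs {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	out := make([]Message, 0, len(keys))
	for _, k := range keys {
		out = append(out, d.pending[k])
		delete(d.pending, k)
	}
	return out
}

func main() {
	d := NewDebouncer(100)
	d.Add(Message{"node-1", "alarms", "cpu WARNING", 0})
	d.Add(Message{"node-1", "alarms", "cpu CRITICAL", 40})
	d.Add(Message{"node-2", "metadata", "charts v2", 90})
	// At t=150 only node-1/alarms has been quiet long enough to forward.
	for _, m := range d.Flush(150) {
		fmt.Println(m.NodeID, m.Topic, m.Payload)
	}
}
```

Passing timestamps in explicitly, rather than reading the clock inside the type, keeps the sketch easy to test. A real service would likely drive `Flush` from a ticker and publish the returned messages downstream.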
Something to consider here is that the link is bidirectional. Since Netdata Cloud does not store machine metrics, the metrics that you see in the Cloud app are requested and returned on demand, almost instantly, and with minimum overhead – thanks to VerneMQ.
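One common way to run on-demand queries over a bidirectional message link is to tag each request with an ID and match the reply against it. The Go sketch below illustrates that general pattern only. The `Request`, `Response`, and `Requester` types are hypothetical, an in-memory channel stands in for the broker, and nothing here is taken from the Agent-Cloud Link protocol itself.

```go
package main

import (
	"fmt"
	"sync"
)

// Request and Response are hypothetical message shapes for this sketch.
type Request struct {
	ID    uint64
	Chart string
}

type Response struct {
	ID     uint64
	Values []float64
}

// Requester sends requests over an outbound channel and matches replies
// to waiting callers by request ID.
type Requester struct {
	mu      sync.Mutex
	nextID  uint64
	pending map[uint64]chan Response
	out     chan<- Request
}

// NewRequester wraps an outbound channel that stands in for the broker.
func NewRequester(out chan<- Request) *Requester {
	return &Requester{pending: make(map[uint64]chan Response), out: out}
}

// Query sends a request and returns a channel that will receive the reply.
func (r *Requester) Query(chart string) <-chan Response {
	r.mu.Lock()
	r.nextID++
	id := r.nextID
	ch := make(chan Response, 1) // buffered so Deliver never blocks
	r.pending[id] = ch
	r.mu.Unlock()
	r.out <- Request{ID: id, Chart: chart}
	return ch
}

// Deliver routes an incoming reply to its waiting caller. It reports
// false when no request with that ID is pending (unknown or duplicate).
func (r *Requester) Deliver(resp Response) bool {
	r.mu.Lock()
	ch, ok := r.pending[resp.ID]
	delete(r.pending, resp.ID)
	r.mu.Unlock()
	if ok {
		ch <- resp
	}
	return ok
}

func main() {
	toAgent := make(chan Request, 1) // in-memory stand-in for the broker
	cloud := NewRequester(toAgent)

	reply := cloud.Query("system.cpu")

	// Play the agent's role: read the request and answer with the same ID.
	req := <-toAgent
	cloud.Deliver(Response{ID: req.ID, Values: []float64{1.5, 2.0}})

	fmt.Println((<-reply).Values)
}
```

Because nothing is stored on the requesting side beyond the pending map, each query costs one outbound and one inbound message, which fits the on-demand model described above.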
To learn more about the projects used here, check out: