Data Collection Strategies for Infrastructure Monitoring – Troubleshooting Specifics

Tailoring Data Collection for Targeted Troubleshooting and Insights

Monitoring and troubleshooting: unfortunately, these terms are still used interchangeably, which can lead to misunderstandings about data collection strategies.

In this article we aim to clarify some important definitions and processes, and to review common data collection strategies for monitoring solutions. We will describe the limitations of each strategy, as well as key benefits that can also serve troubleshooting needs.

IT infrastructure monitoring is a business process of collecting and analyzing data over a period of time to improve business results.

Troubleshooting is a form of problem solving, often applied to repair failed products, services, or processes on a machine or a system. The main purpose of troubleshooting is to identify the source of a problem in order to solve it.

In short, monitoring is used to observe the current/historical state while troubleshooting is employed to isolate the specific cause or causes of the symptom.

The boundary between the definitions of monitoring and troubleshooting is clear; however, in the context of the monitoring solutions currently available on the market for software engineers, SREs, DevOps teams, etc., that boundary has become a bit blurry.

A basic monitoring system can be built on top of “The Three Pillars of Observability” (Sridharan, 2018): logs, metrics, and traces. Together they provide visibility into the health of your systems, the behavior of your services, and even some business metrics, letting you understand the impact of the changes you make or of shifts in user traffic patterns.

The main focus of this article is metrics, rather than logs or traces.

Metrics represent measurements of resource usage or behavior - for example, low-level usage summaries provided by your operating system (CPU load, memory usage, disk space, etc.) or higher-level summaries provided by a specific process currently running on your system.

Many applications and services nowadays provide their own metrics, which can be collected and displayed to the end-user.

Metrics are perfectly suited to building dashboards that display key performance indicators (KPIs) over time, because numbers are a data type that is well suited to transmission, compression, processing, storage, and querying.
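
To make this concrete, here is a minimal sketch of reading a few such low-level OS summaries, assuming the third-party Python package psutil is installed (the chosen fields and the per-second loop are only illustrative):

    # Minimal sketch: sample a few low-level OS metrics, assuming the
    # third-party "psutil" package is installed (pip install psutil).
    import time
    import psutil

    for _ in range(5):                 # a real collector would loop forever
        sample = {
            "timestamp": int(time.time()),
            "cpu_percent": psutil.cpu_percent(interval=None),    # CPU load since the last call
            "memory_percent": psutil.virtual_memory().percent,   # RAM usage, percent
            "disk_used_percent": psutil.disk_usage("/").percent, # root filesystem usage, percent
        }
        print(sample)                  # a real collector would store or ship this point
        time.sleep(1)                  # per-second granularity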

Incident management KPIs

Let’s review a series of indicators designed to help tech companies understand how often incidents occur and how quickly they are acknowledged and resolved; later we will see how these indicators are affected by different data collection strategies.

Mean Time Between Failures (MTBF): MTBF is the average time between repairable failures. The higher the time between failures, the more reliable the system.

Mean Time To Acknowledge (MTTA): MTTA is the average time it takes from when an alert is triggered to when work begins on the issue. This metric is useful for tracking your team’s responsiveness and your alert system’s effectiveness.

Mean Time To Recovery (MTTR): MTTR is the average time it takes to repair a system and return it to a fully functional state. This includes not only repair time, but also testing time and the time spent ensuring that the failure won’t happen again. The lower the time to recovery, the more efficient the troubleshooting process (root cause analysis) and the issue resolution process.
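
As a hedged, back-of-the-envelope illustration of these KPIs, the sketch below computes MTBF, MTTA, and MTTR from a small, made-up incident log (all numbers are hypothetical):

    # Worked example with made-up incidents. Each entry records, in minutes:
    # when the failure started, how long until it was acknowledged, and how
    # long until the service was fully recovered.
    incidents = [
        {"failed_at": 0,    "ack_after": 5,  "recovered_after": 35},
        {"failed_at": 600,  "ack_after": 2,  "recovered_after": 22},
        {"failed_at": 1500, "ack_after": 10, "recovered_after": 70},
    ]

    failure_times = [i["failed_at"] for i in incidents]
    gaps = [b - a for a, b in zip(failure_times, failure_times[1:])]

    mtbf = sum(gaps) / len(gaps)                                          # 750.0 min
    mtta = sum(i["ack_after"] for i in incidents) / len(incidents)        # ~5.7 min
    mttr = sum(i["recovered_after"] for i in incidents) / len(incidents)  # ~42.3 min

    print(f"MTBF: {mtbf:.1f} min, MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")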

Troubleshooting

Intermittent issues within your infrastructure interrupt your flow of work, frustrate users, and can wreak havoc on your business. The higher the MTBF, the longer a system is likely to work before failing.

No system will keep working forever without a breakdown, and understanding why your services may be failing is your first line of defense against the serious consequences of unplanned downtime.

The best approach is to identify the issue and resolve it as soon as possible, before users notice it and make a buzz around it within their company, or worse, in the community or on social networks.

Configuring alerts is a very good way to flag abnormalities, based either on metrics going above or below specified thresholds or on changes in the pattern of the data compared to previous time periods (machine learning (ML)-powered anomaly detection). Alerting helps you reduce MTTA; however, it requires collecting as many metrics as possible so that alerts can be configured for all of them.
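
As a minimal sketch of the threshold-based side of this idea (ML-powered anomaly detection is out of scope here), the check below flags a metric crossing hypothetical warning and critical limits; real monitoring systems express the same logic declaratively in their alert configuration:

    # Minimal sketch of a threshold alert check; the metric name and the
    # 80%/90% limits are hypothetical, not taken from any specific product.
    WARN_THRESHOLD = 80.0
    CRIT_THRESHOLD = 90.0

    def evaluate_alert(metric_name: str, value: float) -> str:
        """Return the alert state for a single sample of a metric."""
        if value >= CRIT_THRESHOLD:
            return f"CRITICAL: {metric_name} = {value:.1f}%"
        if value >= WARN_THRESHOLD:
            return f"WARNING: {metric_name} = {value:.1f}%"
        return f"OK: {metric_name} = {value:.1f}%"

    # Example: disk usage samples collected once per second.
    for value in (75.0, 84.2, 93.7):
        print(evaluate_alert("disk_used_percent", value))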

On top of a reduced MTTA, you also want as low an MTTR as possible, which is why monitoring solutions should not just notify you about an issue, but also help identify its root cause and highlight the affected parts of your infrastructure.

On an individual server, as well as in dynamic environments, services are started, stopped, or moved between nodes at any given time. Therefore, it is important to automatically discover changes in the processes running on a node and to start or stop collecting the relevant metrics at fine granularity (this also improves your MTTA and MTTR).
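
A minimal sketch of such auto-discovery, again assuming the third-party psutil package: it periodically diffs the set of running process names and would start or stop per-process collectors accordingly (the start/stop actions here are just placeholders):

    # Minimal auto-discovery sketch, assuming "psutil" is installed.
    # A real agent would plug actual collectors into the start/stop branches.
    import time
    import psutil

    def running_process_names() -> set:
        return {p.info["name"] for p in psutil.process_iter(["name"])}

    known = running_process_names()
    for _ in range(5):                        # a real agent would loop forever
        time.sleep(1)
        current = running_process_names()
        for name in current - known:
            print(f"{name} appeared -> start collecting its metrics")
        for name in known - current:
            print(f"{name} disappeared -> stop collecting its metrics")
        known = current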

Troubleshooting requires not only an organized and logical approach to eliminate variables and identify causes of problems in a systematic order, but also enriched data, helping you verify your assumptions or drive your investigation process.

The troubleshooting process steps are as follows:

  1. Gather available information.
  2. Describe the problem.
  3. Establish a theory of probable cause.
  4. Test the theory to determine the cause.
  5. Create a plan of action to resolve the problem and test a solution.
  6. Implement the solution.
  7. Test the full system functionality and, if applicable, implement preventive measures.
  8. Document findings, actions, and outcomes.

In many cases the first three steps are the most challenging, so let’s go deeper into steps 1-3 only.

Gather available information

This is the beginning of your investigation process; limited information at this stage can lead to the wrong theory about the probable cause of the issue.

The solution for this challenge is to:

  • Have all possible metrics collected automatically, without manual intervention (unless you want to tune it)
  • Have the highest possible granularity (per second)
  • Have all data automatically visualized (without prior manual configuration)

Describe the problem

The best way to describe the problem is to list the side effects identified from the alert, and also to understand how other parts of your system are affected. For example: an issue with a particular service generates more logs than usual and, as a side effect, exhausts the free space on the attached storage.

Establish a theory of probable cause

Monitoring solutions should not only expose metrics for investigation purposes, but also suggest correlations between them. A good theory should take into account all aspects of the problem, including any anomalies that occur during or before the investigation period. In many cases, alerts are triggered by symptoms rather than by the actual cause of the issue. An extended data retention policy is a good addition to your investigation.
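
As a hedged illustration of what “suggesting correlations” can mean in practice, the sketch below computes a plain Pearson correlation between two made-up metric series over the same time window; real solutions run similar comparisons across many metrics at once:

    # Toy correlation check between two made-up metric series sampled over
    # the same investigation window (all values are illustrative only).
    from statistics import correlation  # available since Python 3.10

    service_log_rate = [120, 150, 400, 900, 1500, 1600]  # log lines per second
    disk_used_pct    = [40,  41,  48,  63,  85,   97]    # % of attached storage used

    r = correlation(service_log_rate, disk_used_pct)
    print(f"Pearson correlation: {r:.2f}")  # a value near 1.0 hints the two move together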

Granularity and retention

Every monitoring solution should provide the current state of the system, but the real power comes from historical data.

A rich history of data can help you understand patterns and trends over time. In an ideal world, all raw metrics data would be stored indefinitely. However, the costs of storing and processing data require applying a data retention policy.

A data retention policy empowers system administrators to maintain compliance and optimize storage; it clarifies what data should be available in the hot storage, archived, or deleted, and what granularity should be used.

An example of a common data retention policy for time series metrics is presented in the following table:

Retention Period    Data Granularity
0 - 1 week          1 minute
1 week - 1 month    5 minutes
1 month - 1 year    1 hour
up to 2 years       1 day

Alternatively, a data retention policy can work with a tiering mechanism (providing multiple tiers of data with different granularity of metrics), as exemplified in the following table:

Tier      Retention Period    Data Granularity
Tier 0    0 - 1 month         1 second
Tier 1    0 - 6 months        1 minute
Tier 2    0 - 3 years         1 hour

In this tiered example, each tier aggregates every 60 points of the previous tier into a single point (60 seconds per minute, 60 minutes per hour).

When calculating the required storage size for metrics, it is important to remember that aggregated tiers usually store several values per point for a single counter, such as the following (see the sketch after this list):

  • The sum of the points aggregated.
  • The min of the points aggregated.
  • The max of the points aggregated.
  • The count of the points aggregated.
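
To make this concrete, here is a small sketch (based on the hypothetical tier layout above) that rolls 60 per-second samples of a single counter into one Tier 1 point holding the sum, min, max, and count:

    # Minimal downsampling sketch: turn 60 per-second samples (Tier 0) into
    # one per-minute point (Tier 1) that keeps sum, min, max, and count, so
    # averages and extremes can still be reconstructed later.
    import random

    tier0 = [random.uniform(0, 100) for _ in range(60)]  # one minute of per-second data

    tier1_point = {
        "sum":   sum(tier0),
        "min":   min(tier0),
        "max":   max(tier0),
        "count": len(tier0),
    }

    average = tier1_point["sum"] / tier1_point["count"]
    print(tier1_point, f"avg={average:.1f}")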

Data collection strategies

MTTA is highly dependent on the data collection strategy.

Data collection isn’t always as straightforward as it might seem. There are plenty of opportunities to stumble in this stage, some of which could affect the accuracy of your metrics or even prevent a timely analysis.

There are a few different data collection strategies currently available on the market.

Let’s focus on the most common, which are as follows:

  • Transfer all the data to a third party - Cloud Monitoring Service Provider (CMSP) 
  • Keep all the data inside your infrastructure - On-Premises Monitoring Solution (OPMS)
  • Hybrid, distributed solution (OPMS + CMSP)

Data Collection Strategy Option 1: Transfer all the data to the third party

A CMSP requires sending all the collected data to the cloud. Users do not need to run any monitoring-specific infrastructure.

In this case, CMSP is following the principle “Fire and Forget.”

Examples: Datadog, New Relic, Dynatrace

Installation and configuration

  1. Install data collector.
  2. Configure data collector for the following: 
    1. Define what metrics you would like to collect.
    2. Specify granularity for each metric.
  3. All collected data will be transferred to the CMSP (see the sketch after this list).
  4. CMSP will store aggregated data on their side.
  5. Based on the pricing plan, a predefined retention policy will be applied.
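
To illustrate this data flow rather than any vendor's actual API, the sketch below shows a collector pushing a sample over the network to a provider; the endpoint, header, and payload shape are purely hypothetical:

    # Hypothetical sketch of a CMSP-style push: every collected sample leaves
    # your network. The URL, API-key header, and payload shape are invented
    # for illustration and do not match any particular vendor's API.
    import json
    import time
    from urllib import request

    CMSP_ENDPOINT = "https://ingest.example-cmsp.com/v1/metrics"  # hypothetical
    API_KEY = "YOUR-API-KEY"                                      # hypothetical

    def ship_sample(metric: str, value: float) -> None:
        payload = json.dumps({"metric": metric, "value": value, "ts": int(time.time())})
        req = request.Request(
            CMSP_ENDPOINT,
            data=payload.encode(),
            headers={"Content-Type": "application/json", "X-Api-Key": API_KEY},
        )
        request.urlopen(req)  # a network hop for every sample (or batch of samples)

    # ship_sample("disk_used_percent", 84.2)  # uncomment against a real endpoint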

Usage Requirements

  1. Data available only via CMSP webapp
  2. You have some predefined dashboards for specific integrations
  3. In order to visualize metrics data, you have to configure the chart by performing the following: 
    1. Select the specific metric. 
    2. Configure the visualization options.

Most common cost structure and limitations

  1. Pricing plan (usually based on number of nodes or number of metrics)
  2. Extra data ingestion (outside of your plan)
  3. Extra data processing (outside of your plan)
  4. Machine learning as an extra or part of the most expensive plan
  5. Limited data retention (restricted by your plan)
  6. Limited number of monitored containers (restricted by your plan)
  7. Limited number of metrics (restricted by your plan)
  8. Limited number of events (restricted by your plan)
  9. Cost of sending email notifications usually included in your plan
  10. Low maintenance cost
  11. High networking cost (data transfer, usually Cloud Service Providers charge for outgoing traffic)
  12. In the end, the most expensive option

Key Benefits

  1. Well-rounded feature set
  2. Ease of use
  3. Extensive number of integrations driven by the CMSP

Data Collection Strategy Option 2: Keep all the data inside your infrastructure

This option is usually available from On-Premises Monitoring Solutions (OPMS), which are mainly open-source based.

OPMS allows you to keep all collected data on premises and have full control of your data. Users have to run and support the monitoring-specific infrastructure.

Examples: Prometheus, Grafana, Zabbix, Dynatrace Managed, Netdata Agent only

Installation and configuration

  1. Install data collector.
  2. Configure data collector for the following:
    1. Define what metrics you would like to collect 
    2. Specify granularity for each metric
  3. Install storage
  4. Configure storage for the following: 
    1. You can keep all collected data within your network 
    2. Flexible retention policy; you can use defaults or define your own.
  5. Configure your Email Service Provider (ESP)
  6. Install a visualization tool:
    • Usually available as part of the chosen OPMS
    • Alternatively, another open-source solution can be used

Usage Requirements

  1. In order to visualize metrics data, you have to configure the chart by performing the following (see the sketch after this list):
    1. Define data source 
    2. Select specific metric 
    3. Configure the visualization options
  2. Support the monitoring infrastructure yourself
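
As an illustration of the first requirement, and assuming Prometheus (from the examples above) is the chosen OPMS running locally on its default port, the sketch below pulls one specific metric from its HTTP query API; the address and metric name are assumptions about your setup:

    # Sketch: query one specific metric from a local Prometheus server via
    # its HTTP API. The address (localhost:9090) and the metric name ("up")
    # are assumptions about your particular setup.
    import json
    from urllib import parse, request

    PROMETHEUS_URL = "http://localhost:9090/api/v1/query"
    params = parse.urlencode({"query": "up"})

    with request.urlopen(f"{PROMETHEUS_URL}?{params}") as resp:
        body = json.load(resp)

    for series in body.get("data", {}).get("result", []):
        print(series["metric"], series["value"])  # label set and [timestamp, value]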

Most common cost structure and limitations

  1. Compute cost based on your usage.
  2. Database cost based on your usage.
  3. High installation cost (time spent by SRE/DevOps to have the solution running).
  4. High maintenance cost.
  5. Cost of sending emails via ESP (Note: this is not required).
  6. Machine learning is usually not available.

Benefits

  1. Monitoring-focused feature set.
  2. Extensive number of integrations driven by the open-source community.
  3. Full management of monitoring cost structure.

Data Collection Strategy Option 3: Hybrid, distributed solution

The third option is a mixed approach that allows you to take advantage of the best of both Options 1 and 2: the extensive feature set of a CMSP, combined with the flexible data retention and low cost of an OPMS.

Due to the distributed nature of this solution, users are able to collect and store data on their premises (in other words: have full control of collected data).

In this scenario, the CMSP plays the role of the orchestrator; as a result, only metadata needs to be shared with the CMSP for request routing purposes.

In this option, the following metadata can be shared:

  • Nodes topology 
  • The list of metrics collected 
  • The retention information for each metric

Example: Netdata

Netdata can be classified as a hybrid solution because it has two components: the open-source Agent and the cloud-based Netdata solution.

Primary responsibilities of the Agent

  • Collect metrics data for the node the Agent is running on. More than 2,000 collectors are currently supported.
  • Store the collected metrics (see the sketch after this list). Various database modes are supported: dbengine, ram, save, map, alloc, none.
  • Store data for other nodes when the Agent plays the role of a “Parent” and collects data from other Agents, called “Children” (streaming and replication).
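
Because the Agent stores its own data, that data can also be queried directly from the node. As a small sketch, assuming a locally running Agent on its default port 19999 and its v1 data API, this pulls the last ten seconds of the system.cpu chart:

    # Sketch: read the last 10 seconds of the "system.cpu" chart straight from
    # a local Netdata Agent, assuming it listens on the default port 19999 and
    # exposes its v1 data API. Adjust the host and chart name for your setup.
    import json
    from urllib import request

    AGENT_URL = "http://localhost:19999/api/v1/data?chart=system.cpu&after=-10&format=json"

    with request.urlopen(AGENT_URL) as resp:
        payload = json.load(resp)

    print(payload.get("labels"))         # typically the chart's dimension names
    for row in payload.get("data", []):  # one row per collected point in the window
        print(row)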

Primary responsibilities of the Netdata Cloud solution

  • Visualize data collected from multiple Agents. Data requests are routed to the specific Agents that hold the data; the routing information is built from the metadata received from the Agents.
  • Provide an infrastructure-level view of the data.
  • Keep alert state changes from all nodes.
  • Dispatch alert notifications.

Installation and configuration

  1. Log in to Netdata.
  2. Install the Agent (it includes data collectors with auto-discovery and storage; the collectors are already preconfigured with 1-second granularity).

Usage Requirements

  1. There is no need to install a visualization tool. Netdata's cloud solution is already there for you.
  2. There is no need to configure charts. Every single metric is already associated with a chart.
  3. You just need to log in to Netdata to see various dashboards (infrastructure Overview, individual Nodes, Alerts, Machine Learning, etc.), as well as the individual charts associated with alerts.

Most common cost structure and limitations

  1. Compute cost based on your usage (inside your infrastructure)
  2. Database cost based on your usage (inside your infrastructure)
  3. Low installation cost (one-line installation command for manual installations or Ansible playbook for automation)
  4. Low maintenance cost (Agent automatically updated)
  5. Netdata will send all emails for free
  6. Machine learning enabled by default on the Agent, visualized for free within Netdata
  7. Free Nodes reachability alerts from Netdata
  8. Stated plainly, this is the cheapest option.

Benefits

  1. Mainly troubleshooting-focused feature set
  2. Ease of installation and maintenance
  3. Extensive number of integrations driven by the open-source community
  4. Data immediately available for querying

Summary

The following summarizes what is important for troubleshooting purposes: 

  • You should be able to collect as many metrics as you want 
  • Metrics should be collected automatically with high granularity (1 sec) 
  • You need to retain as much data as you want at the minimum cost
  • You need to provide the ability to contribute (i.e., create your own collector)
  • You should be able to easily visualize all metrics (no need to configure a chart for every metric)
  • You need fast access to metrics data (data should be available ASAP, ideally within the next second)
  • You should be able to automatically identify anomalies and suggest correlations across all collected metrics

With these in mind, let’s come back to our data collection strategies.

Option 1: Transfer all the data to the third party (CMSP)

This option is good for generic monitoring purposes, but it has limited troubleshooting capabilities due to its data flow design. It is also the most expensive option, leaving you to deal with the following:

  • Manual intervention to enable and configure data collectors
  • High cost of data transfer, processing, and storage, which leads to low data granularity and a limited number of collected metrics
  • Manual chart configuration, which requires prior knowledge of the available metrics
  • Making assumptions based on experience rather than on the available data (you need to know which metric you would like to check)
  • Significant lag before data becomes available for querying (due to the data flow design)

Option 2: Keep all the data inside your infrastructure (OPMS)

This option is cheap, but it is the least helpful for troubleshooting needs. It has the same limitations as Option 1, due to aggregation needs, plus you will be saddled with the following:

  • A lower number of metrics and low granularity are usually the suggested approach
  • A limited number of features is available; for example, an ML-based chart suggestion mechanism will not be there.
  • The burden of complete ownership of the monitoring/troubleshooting infrastructure falls on the user.

Option 3: Hybrid, distributed solution

This option is the best fit for troubleshooting purposes, as it allows you to have the highest granularity, with a significant number of metrics automatically collected for you:

  • Full control of cost 
  • No need to pay for outgoing traffic. Similar to Option 2, data is stored inside your own infrastructure and does not need to be transferred outside of your network.
  • Data immediately available for querying; no need to wait for data transfer and processing.

It is worth paying attention to a free infrastructure monitoring solution that focuses on troubleshooting first: Netdata.

The Netdata Agent is free and open source (licensed under GNU GPL v3).

The Netdata cloud solution is closed-source software; however, it provides a free orchestration service for everyone (only metadata is transferred to the Netdata Hub, not the actual data, which is why the cost of the service is negligible and it can be offered free of charge by Netdata).

In the future, you will be able to get a paid support plan if you would like extra help on top of the free community support. Netdata also plans to offer Managed Data Centralization Points (Netdata Parents that keep not only the metadata but the actual data as well) at additional cost. More details are available here.

On top of the already-described benefits of the hybrid solution, Netdata can automatically show the charts relevant to a highlighted area across all collected metrics (every single metric automatically has a chart representation). Netdata can also show metric anomalies detected by machine learning running on the Netdata Agent (client side, not on the CMSP).

Questions? Ideas? Comments? Learn more or contact us!

Feel free to dive deeper into the Netdata knowledge and community using any of the following resources:

  • Netdata Learn: Find documentation, guides, and reference material for monitoring and troubleshooting your systems with Netdata.
  • Github Issues: Make use of the Netdata repository to report bugs or open a new feature request.
  • Github Discussions: Join the conversation around the Netdata development process and be a part of it.
  • Community Forums: Visit the Community Forums and contribute to the collaborative knowledge base.
  • Discord: Jump into the Netdata Discord and hang out with like-minded sysadmins, DevOps, SREs and other troubleshooters. More than 1100 engineers are already using it!