MongoDB monitoring with Netdata

What is MongoDB?

MongoDB is an open source, cross-platform, document-oriented NoSQL database system. It is designed for scalability and performance and is used to store data in collections of documents. MongoDB supports advanced features such as file storage, full-text search, and data replication. It is commonly used in web applications and cloud computing applications.

With Netdata, DevOps and SRE teams can effectively monitor MongoDB performance and gain insight into their distributed system quickly and easily.

Monitoring MongoDB with Netdata

To monitor MongoDB with Netdata, you need both MongoDB and Netdata installed on your system.

Netdata auto-discovers hundreds of services, and any service it does not find automatically can be configured manually with a simple one-line configuration. For more information on configuring Netdata for MongoDB monitoring, please read the collector documentation.

You should now see the MongoDB section on the Overview tab in Netdata Cloud already populated with charts about all the metrics you care about.

Netdata has a public demo space (no login required) where you can explore different monitoring use-cases and get a feel for Netdata.

What MongoDB metrics are important to monitor?

Operations

Monitoring operations by type will help to identify the types of operations that are causing the most strain on the system and can help to identify any potential areas of optimization. Additionally, it can help to ensure that the system is not overburdened and that the resources are being allocated appropriately. Monitoring this metric will also help to identify any bottlenecks or unexpected spikes in the workload, which can prevent unplanned downtime or performance degradation.
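
As a rough illustration of what an operations-by-type chart is built from, the sketch below turns two samples of cumulative operation counters into per-second rates. It assumes the counters come from the `opcounters` section of MongoDB's `serverStatus` output; the function names and the sample numbers are hypothetical and are not part of Netdata.

```python
from typing import Dict

# Operation types reported in the serverStatus "opcounters" section.
OP_TYPES = ("insert", "query", "update", "delete", "getmore", "command")


def op_rates(prev: Dict[str, int], curr: Dict[str, int], interval_s: float) -> Dict[str, float]:
    """Return operations per second for each type between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for op in OP_TYPES:
        delta = curr.get(op, 0) - prev.get(op, 0)
        # A negative delta means the counters were reset (for example, a restart).
        rates[op] = max(delta, 0) / interval_s
    return rates


def busiest_op(rates: Dict[str, float]) -> str:
    """Return the operation type with the highest rate."""
    return max(rates, key=rates.get)


# Hypothetical samples taken 10 seconds apart.
sample_prev = {"insert": 100, "query": 500, "update": 50, "delete": 5, "getmore": 20, "command": 900}
sample_curr = {"insert": 160, "query": 1100, "update": 80, "delete": 5, "getmore": 30, "command": 1200}

sample_rates = op_rates(sample_prev, sample_curr, 10)
print(sample_rates, busiest_op(sample_rates))
```

Working with rates rather than raw counters is what makes spikes and shifts in the workload mix visible.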

High operation latencies may be caused by a number of factors, such as inadequate hardware resources, inefficient queries, or contention between concurrent operations. Monitoring operation latency lets you spot performance issues quickly and take corrective action before they lead to unexpected downtime or performance degradation.
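
To make the latency figure concrete, here is a minimal sketch of how an average latency per operation could be derived. It assumes two samples of the `opLatencies` section of `serverStatus`, where each of `reads`, `writes`, and `commands` carries a cumulative `latency` (in microseconds) and a cumulative `ops` count; the function name and sample values are hypothetical.

```python
from typing import Dict, Optional


def avg_latency_ms(prev: Dict[str, Dict[str, int]],
                   curr: Dict[str, Dict[str, int]],
                   kind: str) -> Optional[float]:
    """Average latency in milliseconds per operation of the given kind
    ('reads', 'writes' or 'commands') between two opLatencies samples."""
    d_ops = curr[kind]["ops"] - prev[kind]["ops"]
    d_latency_us = curr[kind]["latency"] - prev[kind]["latency"]
    if d_ops <= 0:
        # No operations in the interval (or a counter reset): nothing to average.
        return None
    return d_latency_us / d_ops / 1000.0


# Hypothetical samples: 200 reads took 600,000 microseconds in total.
lat_prev = {"reads": {"latency": 1_000_000, "ops": 1000}}
lat_curr = {"reads": {"latency": 1_600_000, "ops": 1200}}
print(avg_latency_ms(lat_prev, lat_curr, "reads"))
```

Averaging over the interval, rather than over the lifetime of the process, keeps a recent slowdown from being hidden by a long history of fast operations.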

Connections

The connection count provides insight into the traffic coming in to the MongoDB server. It helps identify spikes that could be caused by malicious activity or by a sudden surge in usage from legitimate users, so you can catch bottlenecks before they become problematic and decide when to scale your MongoDB server to accommodate additional traffic.

Connection state is also important to monitor, as it helps in understanding the overall connection health of the server. In particular, by monitoring the distribution of the active, threaded, awaiting_topology_changes, exhaustIsMaster, and exhaustHello states, you can identify connection issues that may be affecting performance. For example, if the active connection count is too high, it may indicate that the server is overburdened and needs to be scaled up. Similarly, if the exhaustHello and exhaustIsMaster counts are too high, it may suggest that the server is receiving too many requests and needs to be optimized.
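
One way to decide whether a connection count is "too high" is to compare it with the number of connections the server can still accept. The sketch below assumes the `connections` section of `serverStatus`, which reports `current` and `available` connections; the function name and the 80% warning threshold are illustrative assumptions, not recommended values.

```python
from typing import Dict


def connection_usage(connections: Dict[str, int], warn_fraction: float = 0.8) -> Dict[str, object]:
    """Summarize how much of the connection limit is in use."""
    current = connections["current"]
    available = connections["available"]
    limit = current + available  # connections in use plus those still free
    used_fraction = current / limit if limit else 0.0
    return {
        "limit": limit,
        "used_fraction": used_fraction,
        "warn": used_fraction >= warn_fraction,  # hypothetical threshold
    }


# Hypothetical sample: 200 connections open, 600 still available.
print(connection_usage({"current": 200, "available": 600}))
```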

Network

Network IO refers to the amount of data being transferred over the network when running MongoDB. This metric should be monitored closely as it can help identify network related performance issues such as slow queries, unresponsive nodes, or excessive network traffic. If the network IO usage is too high, it could lead to poor performance and even outages. Monitoring this metric helps to ensure that MongoDB is performing optimally and that the system resources are used efficiently. Network IO can also help detect potential security issues, such as malicious traffic or attempts to access sensitive data. Monitoring network IO can also help identify any problems with the network configuration or hardware, and prevent them from becoming a major issue.

If the number of requests is too high, it can cause the database to become overloaded and degrade performance. Conversely, if the number of requests is too low, it could indicate that the server isn’t being utilized efficiently. By monitoring the number of requests per second, you can ensure that the server is operating at an optimal level and identify any discrepancies in usage patterns.

For example, if the number of requests suddenly increases, it could be indicative of a problem with the database or a spike in usage. If the number of requests drops suddenly, it could mean that the database is not being utilized or that the application is experiencing an issue. By monitoring network requests, you can quickly identify any issues and take the necessary steps to rectify them.

Memory

Page faults occur when the MongoDB server needs to access parts of data that are not currently in the memory. As the dataset size increases, more page faults will occur unless the working set of data can be kept in memory. Monitoring page faults helps to identify when MongoDB is not able to keep the working set in memory. If the page faults are too high, it may indicate that the dataset size is too large for the available memory, and the server should be upgraded or the data should be partitioned.

In addition, monitoring page faults can help identify if the dataset is not being accessed in an optimal way. If the page faults are too high, it may indicate that the queries are not using the correct indexes or that a sufficient number of indexes are not used to cover the query.

Page faults can also help identify if the MongoDB server is being overloaded. If the page faults are too high and cannot be attributed to dataset size, indexing, or query optimization, it may indicate that the server is not configured correctly or that there are too many connections for the server to handle.

TCMalloc (thread-caching malloc) is the memory allocator that MongoDB uses to allocate memory for its operations. Monitoring TCMalloc usage can help prevent memory-related issues and performance bottlenecks: it shows when more memory is needed, when memory is not being released back to the operating system, and when memory fragmentation is occurring.

For example, if the pageheap_free metric is consistently high while pageheap_unmapped stays low, it could indicate that freed memory is being held by the allocator rather than released back to the operating system, which is often a sign of memory fragmentation. This can cause performance issues and lead to out-of-memory errors.

The total_threaded_cache metric shows how many bytes are held in TCMalloc's per-thread caches. If this metric is consistently high, a significant amount of memory is sitting in those caches, which can add to memory pressure under the workload.

Monitoring the free metric can also help indicate when more memory is needed, as it shows the amount of free memory available to TCMalloc in bytes. If this metric is consistently low, it could indicate that more memory is needed to handle the workload.

The TCMalloc generic metrics report the size of the heap and the bytes currently allocated by the server. By monitoring both values, it is possible to identify memory leaks, fragmentation, and other issues before they become severe, and to take corrective action so that MongoDB performance remains optimal.
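
The relationship between heap size and allocated bytes can be reduced to a couple of ratios. The sketch below assumes the `tcmalloc` section of `serverStatus` as exposed by MongoDB builds that use gperftools TCMalloc, with `generic.heap_size`, `generic.current_allocated_bytes`, `tcmalloc.pageheap_free_bytes`, and `tcmalloc.pageheap_unmapped_bytes`; other builds may expose different fields. The function name and sample values are hypothetical.

```python
from typing import Dict

GIB = 2 ** 30


def heap_summary(tcmalloc: Dict[str, Dict[str, int]]) -> Dict[str, float]:
    """Express allocator state as fractions of the total heap size."""
    generic = tcmalloc["generic"]
    detail = tcmalloc["tcmalloc"]
    heap = generic["heap_size"]
    if heap <= 0:
        raise ValueError("heap_size must be positive")
    return {
        # Share of the heap that the server is actively using.
        "allocated_fraction": generic["current_allocated_bytes"] / heap,
        # Free pages the allocator still holds (not returned to the OS).
        "held_free_fraction": detail["pageheap_free_bytes"] / heap,
        # Pages already released back to the OS.
        "released_fraction": detail["pageheap_unmapped_bytes"] / heap,
    }


# Hypothetical sample: 8 GiB heap, 6 GiB allocated, 1 GiB held free, 0.5 GiB released.
sample = {
    "generic": {"heap_size": 8 * GIB, "current_allocated_bytes": 6 * GIB},
    "tcmalloc": {"pageheap_free_bytes": 1 * GIB, "pageheap_unmapped_bytes": GIB // 2},
}
print(heap_summary(sample))
```

A falling allocated fraction alongside a growing held-free fraction is the kind of pattern that may point to fragmentation rather than a genuine need for more memory.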

Asserts

Asserts are MongoDB’s way of detecting errors within the system. They are essentially checks that the system performs to ensure that data is being handled properly. Asserts can be caused by things such as incorrect data formats, incorrect user input, or a bug in the code. By monitoring the number of asserts encountered, it is possible to identify potential problems before they cause a disruption to service.

For example, if the number of regular asserts increases significantly over a short period of time, it could indicate a bug in the code, or a data format issue. If this is the case, further investigation should be done in order to identify the root cause of the problem and prevent it from causing any service disruptions.

Transactions

Poorly designed or overloaded transactions can cause the server to become unresponsive, resulting in application errors or unavailability. Monitoring the number of current transactions can help to identify potential bottlenecks and allow for adjustments to be made in order to improve performance. It can also help to identify rogue transactions that may be consuming excessive resources. By monitoring current transactions, administrators can make informed decisions about how to optimize their environment and prevent potential issues.

Clients

Monitoring this metric can help to identify when it is time to scale up the server to meet the demands of the workload. It can also help to identify any potential issues that may arise due to an overload of clients. For example, if the number of active clients is too high, it can result in requests timing out and errors being returned. By monitoring this metric, these issues can be identified and addressed quickly.

Locks

Flow control throttles writes on the primary when the secondaries fall behind, so that replication lag stays within bounds. If flow control events happen often, it can indicate that the replica set cannot keep up with the incoming write load and needs to be scaled up. Monitoring flow control events can help prevent performance bottlenecks and latency issues. For example, if your application performs an operation that requires a lot of disk I/O, frequent flow control events can reveal that disk I/O is too slow, so that storage can be upgraded to improve performance.

WiredTiger

WiredTiger is the default storage engine starting in MongoDB 3.2. It is well-suited for most workloads and is recommended for new deployments. WiredTiger provides a document-level concurrency model, checkpointing, and compression, among other features. In MongoDB Enterprise, WiredTiger also supports Encryption at Rest.

The WiredTiger block manager subsystem manages the reading and writing of data from the disk. It is designed to facilitate high performance, economic use of disk space and customizability. A block is a chunk of data that is stored on the disk and operated on as a single unit. Each WiredTiger data file is made up of these blocks.

If the WiredTiger cache is not large enough, it can cause serious performance issues and affect the overall performance of the MongoDB server. Effective monitoring of WiredTiger Capacity can help prevent issues related to insufficient memory, as well as identify any potential bottlenecks or contention in the database. Additionally, monitoring WiredTiger Capacity can help identify trends that can be used to plan for potential capacity changes, such as increasing the amount of RAM available to the MongoDB server.
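
Whether the cache is "large enough" can be judged from how full and how dirty it is. The sketch below assumes the `wiredTiger.cache` section of `serverStatus`, using the statistics named "maximum bytes configured", "bytes currently in the cache", and "tracked dirty bytes in the cache". The function name and the 80% and 5% alert levels are assumptions chosen for illustration only.

```python
from typing import Dict

GIB = 2 ** 30


def cache_pressure(cache: Dict[str, int],
                   fill_warn: float = 0.80,
                   dirty_warn: float = 0.05) -> Dict[str, object]:
    """Return cache fill and dirty ratios relative to the configured maximum."""
    max_bytes = cache["maximum bytes configured"]
    if max_bytes <= 0:
        raise ValueError("maximum bytes configured must be positive")
    fill = cache["bytes currently in the cache"] / max_bytes
    dirty = cache["tracked dirty bytes in the cache"] / max_bytes
    return {
        "fill": fill,
        "dirty": dirty,
        # Illustrative thresholds, not official guidance.
        "under_pressure": fill >= fill_warn or dirty >= dirty_warn,
    }


# Hypothetical sample: 4 GiB cache, 3 GiB in use, 0.25 GiB dirty.
print(cache_pressure({
    "maximum bytes configured": 4 * GIB,
    "bytes currently in the cache": 3 * GIB,
    "tracked dirty bytes in the cache": GIB // 4,
}))
```

Tracking these ratios over time gives the trend data needed to plan a cache or RAM increase before pressure becomes constant.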

WiredTiger cursors are a type of pointer that allows MongoDB to iterate through and read data from collections of documents. Monitoring the number of active cursors is important to ensure that MongoDB can respond quickly to database queries. If the number of active cursors is too high, it can lead to increased memory use and disk I/O, as each open cursor holds resources on the server. It can also lead to higher latency and decreased throughput as cursors compete for those resources.

By monitoring the number of active cursors, you can identify when the database is overburdened and needs additional resources. This can prevent bottlenecks, allowing MongoDB to process queries more quickly and efficiently. Monitoring can also help identify when the database is being overused and needs to be scaled up to handle the load.

WiredTiger Lock Acquisition is an important MongoDB metric to monitor, as it provides insight into the number of locks being acquired by the storage engine. This is especially important in multi-user environments, where multiple users or applications may be attempting to access the same data. Monitoring this metric can also help detect potential concurrency issues, as a sudden spike in lock acquisition can indicate that an operation is taking longer than expected and may be causing contention. Additionally, long-term trends in lock acquisition can indicate increasing contention, which can be addressed proactively by optimizing queries or increasing the number of connections allowed. Monitoring this metric can also help detect potential deadlocks, as a sudden decrease in lock acquisition can indicate that operations are waiting for locks to be released.

If the lock duration is too long, it can indicate that operations are waiting to access or modify data that is locked by another thread. This contention can result in degraded query performance and, in the worst case, in operations that stall.

One way to prevent this issue is to regularly monitor the lock duration on the server. If the lock duration is too long, it can indicate an issue that needs to be addressed. Additionally, increasing the WiredTiger cache size can help reduce the lock duration by allowing more data to be stored in memory, which can reduce the amount of disk I/O required.

Keeping an eye on the number of operations written to the WiredTiger log is important because it indicates how much write activity is reaching disk. If it is increasing excessively, it can suggest an issue with disk space or with the database configuration. For example, rapidly increasing log operations could suggest that the database is not properly indexed or that queries are not optimized, resulting in an increase in disk usage.

Issues that can be prevented through effective monitoring of this metric include identifying slow queries, detecting potential deadlocks, and identifying any transactions that are taking longer than expected to commit.

Database

Monitoring the number of collections in a database can help identify potential issues such as excessive fragmentation and duplication of data, which can lead to slower response times, degraded performance, and higher resource utilization. Additionally, monitoring the number of collections in a database can also help identify and prevent potential security vulnerabilities, as collections with sensitive data should be carefully monitored.

Monitoring the number of indexes in a database can help ensure that the most efficient indexes are being used, which can improve performance and reduce query latency.

Monitoring the number of views in a database can help identify potential issues such as excessive duplication of data, which can lead to slower response times, degraded performance, and higher resource utilization. Additionally, monitoring the number of views can also help identify and prevent potential security vulnerabilities, as views with sensitive data should be carefully monitored.

Monitoring the number of documents in a database can help identify potential issues such as excessive fragmentation and duplication of data, which can lead to slower response times, degraded performance, and higher resource utilization.

Monitoring the storage size of a database can help identify potential issues such as excessive fragmentation and duplication of data, which can lead to slower response times, degraded performance, and higher resource utilization.

Replication

Replication lag is the amount of delay between the primary and secondary nodes in a MongoDB replica set. It is an important metric to monitor because it indicates the health of the replica set and helps ensure that the data is correctly replicated across all nodes. If the lag is too high, it can lead to inconsistencies in data, or even data loss in the event of an outage.

Monitoring replication lag can also help prevent issues such as replica set elections, where the primary node may become unavailable and a new primary must be elected. If the lag is too high, it may take too long for the new primary to catch up and cause a disruption in service.
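
As a sketch of how replication lag can be computed, the example below compares the last applied operation time of each secondary with that of the primary. It assumes a list of member documents shaped like the `members` array returned by the `replSetGetStatus` command, with `name`, `stateStr`, and `optimeDate` fields; the function name and the sample hosts are hypothetical.

```python
from datetime import datetime
from typing import Dict, List


def replication_lag_seconds(members: List[dict]) -> Dict[str, float]:
    """Return the lag of each secondary behind the primary, in seconds."""
    primary = next((m for m in members if m.get("stateStr") == "PRIMARY"), None)
    if primary is None:
        # No primary (for example, during an election): lag cannot be computed.
        return {}
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m.get("stateStr") == "SECONDARY"
    }


# Hypothetical replica set status.
sample_members = [
    {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": datetime(2024, 1, 1, 12, 0, 10)},
    {"name": "db2:27017", "stateStr": "SECONDARY", "optimeDate": datetime(2024, 1, 1, 12, 0, 7)},
    {"name": "db3:27017", "stateStr": "ARBITER"},
]
print(replication_lag_seconds(sample_members))
```

Members that are not secondaries, such as arbiters, hold no data and are skipped.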

Replication heartbeats are messages that members of a MongoDB replica set send to each other to check that every member is reachable and healthy. The latency of these heartbeats is an important metric to monitor as it can help identify problems with replication. High heartbeat latency can signal network or load problems that also slow replication and leave secondaries behind the primary. For example, if a server fails, the remaining members may not have the most up-to-date data, which can lead to data loss. By monitoring replication heartbeat latency, any potential issues with replication can be quickly identified and addressed.

High ping latencies can indicate a problem with the server’s network connection or the replication nodes themselves. If these latencies become too high, it can lead to delays in replication and replication errors. Monitoring Replication Node Ping allows a MongoDB administrator to identify any issues before they become serious problems, so the issues can be addressed quickly and the system can perform optimally.

Shards

Shards are a way to horizontally scale the data stored in MongoDB. By sharding, the data can be spread across multiple servers for better performance, scalability, and redundancy. In order to ensure that the shard is always available and can handle the workload, it is important to monitor the number of nodes within the shard. If the number of nodes is too low, the shard may not be able to handle the required workload and could lead to poor performance or even an outage. If the number of nodes is too high, it could lead to increased overhead and an inefficient use of resources. By monitoring the number of nodes in a shard, any issues with the shard can be identified quickly and addressed before they become a problem. Examples of issues that may be identified through effective monitoring include insufficient or excessive node counts, slow performance, or failed operations.

Shard Commit Types are important metrics to monitor in order to ensure successful write operations on MongoDB. In a sharded cluster, MongoDB splits the data across multiple shards, and each shard may require multiple steps to initiate or complete a write operation. If any of these steps fail, the write operation will fail. Monitoring the distribution of shard commit types allows you to identify any potential issues with write operations and take corrective action before those issues become more serious.

For example, if you notice that the number of failed no_shard_init commits is increasing, it could indicate an issue with the configuration of your sharded cluster. This could lead to reduced performance and possibly data loss. If you monitor this metric regularly, you can quickly identify and address the issue before it becomes a major problem.

Chunks are contiguous ranges of shard key values that MongoDB uses to partition the data of a sharded collection. As a collection grows, MongoDB automatically splits its data into more chunks and migrates them between shards to keep the distribution even. Monitoring the number of chunks helps in understanding how data is spread across the cluster. Too few chunks can make it hard to distribute data evenly across shards, while too many chunks can reduce performance due to the increased overhead of managing them. Additionally, by monitoring the number of chunks you can detect an imbalance in the data distribution, which can indicate fragmentation or a need for rebalancing.
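
A simple way to turn chunk counts into an imbalance signal is to compare the shards that hold the most and the fewest chunks. The sketch below takes a plain mapping of shard name to chunk count; how that mapping is collected is left open, and the function name and sample values are hypothetical. Chunk count is only a rough proxy, since chunks can differ in size.

```python
from typing import Dict


def chunk_imbalance(chunks_per_shard: Dict[str, int]) -> Dict[str, object]:
    """Summarize how evenly chunks are spread across shards."""
    if not chunks_per_shard:
        raise ValueError("at least one shard is required")
    counts = chunks_per_shard.values()
    return {
        "total": sum(counts),
        # Difference between the fullest and the emptiest shard.
        "spread": max(counts) - min(counts),
        "largest_shard": max(chunks_per_shard, key=chunks_per_shard.get),
    }


# Hypothetical cluster with three shards.
print(chunk_imbalance({"shard0": 120, "shard1": 80, "shard2": 100}))
```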
