For each interface, we use SNMP to monitor byte counters; unicast, broadcast, and multicast packet counters; and interface errors.

We monitor other SNMP variables as well, but most of our monitoring looks at these variables on the edge of the network.

If this occurs for a long period of time, you will need to investigate where the bottlenecks are for the graph updates.

To keep up, the write cycle needs to complete in under 200 seconds.

Figure: Graph of MRTG write performance, 2006-12-19

The load average on the machine was about 20, the CPUs were all stuck in iowait, and the disks were all 100% busy.

There are usually about 28-30 MRTG processes running (times 4 forks each, during polling). Each MRTG process's SNMP polling cycle completes in about 30 seconds, or sometimes a bit longer if a device is unreachable or slow to respond.

Figure: Graph of MRTG polling performance, 2006-12-19

So, here's the problem: writing data into RRD files for one polling interval was taking about 350 seconds on average, up to 450 seconds once an hour when the 1 hour consolidation RRA was being written, and up to 500 seconds every other hour when the 2 hour consolidation RRA was being written.
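To make the shortfall concrete, here is a small sketch that compares the write times quoted above against the 200 second target and the 5 minute (300 second) polling interval. The numbers come straight from the text; the names (`overrun`, `observed`) are made up for illustration.

```python
# Figures taken from the text; names are illustrative only.
POLL_INTERVAL_S = 300   # one 5-minute polling interval
WRITE_BUDGET_S = 200    # target for a complete write cycle

observed = {
    "typical interval": 350,
    "hourly (1 hour consolidation RRA)": 450,
    "every other hour (2 hour consolidation RRA)": 500,
}

def overrun(duration_s, limit_s):
    """Seconds by which a write cycle exceeds a limit (0 if it fits)."""
    return max(0, duration_s - limit_s)

for label, duration in observed.items():
    print(f"{label}: {duration}s, "
          f"{overrun(duration, WRITE_BUDGET_S)}s over budget, "
          f"{overrun(duration, POLL_INTERVAL_S)}s over the polling interval")
```

Even the typical case runs past the polling interval itself, so each write cycle is still in progress when the next one is due.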

Furthermore, our sysadmin team informed us that our one host is presenting more transactions per second to the storage system than the rest of the datacenter combined.

A system call trace of a single RRD update looks like this:

    open("file.rrd", O_RDWR) = 4
    read(4, "RRD\0000001\0\0\0\0/%\300\307C \37[\2\0\0\0\10\0\0\0,\1"..., 4096) = 4096
    _llseek(4, 0, [4096], SEEK_CUR) = 0
    _llseek(4, 4096, [4096], SEEK_SET) = 0
    _llseek(4, 4096, [4096], SEEK_SET) = 0
    _llseek(4, -1324, [2772], SEEK_CUR) = 0
    write(4, "\2557Q$

If you are updating an RRA with a consolidating function, add in another seek and then more reading of those data points before updating that particular RRA.
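To get a feel for this access pattern, the sketch below replays the same sequence of calls against a scratch file using Python's `os` module. This is not RRDtool's code. The offsets are copied from the trace, while the file name, the file size, and the 8-byte payload are placeholders (the trace does not show how much data is written).

```python
import os
import tempfile

def replay_update(path, payload=b"\x00" * 8):
    """Replay the open/read/seek/write sequence seen in the trace.

    Returns the file offset at which the write lands.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        os.read(fd, 4096)                        # read the first 4096 bytes
        os.lseek(fd, 0, os.SEEK_CUR)             # query position: 4096
        os.lseek(fd, 4096, os.SEEK_SET)          # absolute seek to 4096
        os.lseek(fd, 4096, os.SEEK_SET)          # same seek again, as in the trace
        pos = os.lseek(fd, -1324, os.SEEK_CUR)   # back up to offset 2772
        os.write(fd, payload)                    # small write (placeholder size)
        return pos
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as scratch_dir:
    path = os.path.join(scratch_dir, "scratch.rrd")
    with open(path, "wb") as f:
        f.write(b"\x00" * 8192)                  # dummy file, not a real RRD
    pos = replay_update(path)
    print(f"write landed at offset {pos}")
```

Each update is a handful of tiny operations on a different file, which is why the disks see it as scattered small I/O rather than a stream.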

Now do this 175,000 times every 5 minutes, and guess what this looks like to the system. This clearly means that the I/O workload we are giving the disks is unreasonable.
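As a rough back-of-the-envelope check, the rate implied by these figures can be worked out as follows. The count of 7 syscalls per update is simply what is visible in the trace excerpt (open, read, four seeks, write), so it is a lower bound: consolidation adds more. It also assumes the updates are spread evenly over the interval.

```python
UPDATES_PER_CYCLE = 175_000
CYCLE_SECONDS = 5 * 60
SYSCALLS_PER_UPDATE = 7   # open, read, 4 x _llseek, write (from the trace excerpt)

def per_second(count, seconds):
    """Average rate, assuming the work is spread evenly over the interval."""
    return count / seconds

updates_per_s = per_second(UPDATES_PER_CYCLE, CYCLE_SECONDS)
syscalls_per_s = updates_per_s * SYSCALLS_PER_UPDATE

print(f"{updates_per_s:.0f} updates/s, at least {syscalls_per_s:.0f} syscalls/s")
# prints: 583 updates/s, at least 4083 syscalls/s
```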

While RRDtool performs very well for small- and medium-sized installations, the RRD update mechanism requires a surprising number of I/O syscalls for each operation.

These syscalls, in turn, result in cache-unfriendly (random-seeming) I/O access, which defeats most OS and hardware caching algorithms.

This KB provides troubleshooting procedures for graphs of known data points on a monitored device that fail to render in Resource Manager version 4.2.x.


