Intro#

Having installed SNMP, it is time to measure something, store the measurements in a database and finally make some pretty graphs of it all. The tool I’m using for all of that is RRDtool.

RRDtool, short for “round robin database tool”, is a database made specifically for storing time series data. The way I understand it, the database is essentially a fixed-length list: when you add an item, the oldest item is removed, so the list always stays the same length. The time series part means that the key is time. You measure something at a given time, store it under that point in time, and the oldest element falls out.
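Once RRDtool is installed (covered below), this is easy to see for yourself. Here is a minimal sketch with a made-up file name and values, using an archive that holds only three rows; the create and update options are explained later in this post:

start=$(( $(date '+%s') / 60 * 60 ))
rrdtool create ring-demo.rrd --start ${start} --step 60 \
        DS:ds0:GAUGE:120:U:U \
        RRA:LAST:0.5:1:3
# Insert four values, one per minute...
for i in 1 2 3 4; do
        rrdtool update ring-demo.rrd $(( start + i * 60 )):${i}
done
# ...and only the last three (2, 3, 4) remain in the file:
rrdtool fetch ring-demo.rrd LAST --start $(( start + 60 )) --end $(( start + 240 ))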

RRDtool also comes with a tool to make graphs of what was stored, so you can easily see the change over time.

You can store whatever measurements you want in an RRDtool database as long as they are at somewhat regular intervals in time. I will be using SNMP to capture CPU load and interface counters (network traffic) every minute, store it and eventually make graphs of it.

Installation#

RRDtool is fortunately available as an OpenBSD package, as can be seen with pkg_info -Q rrdtool:

collectd-rrdtool-5.12.0p3
debug-rrdtool-1.9.0
py3-rrdtool-0.1.16p3
rrdtool-1.9.0

The one we want is the last one, and it is easily installed with pkg_add rrdtool-1.9.0.

Creating a database#

I have to admit I don’t fully understand this part, even though I’ve done a lot of searching, reading, experimenting and copy/pasting to get it working.

RRD for CPU#

The people behind OpenBSD.Amsterdam have written a little about their configuration here, and since I liked it I have tried to replicate it for my server. I will dissect it below, go through it as I understand it, and explain my changes.

To start with, they create the database for CPU measurements like this:

${RRDTOOL} create ${RRDFILES}/${HOST}-cpu.rrd \
        --step 300 \
        DS:ds0:GAUGE:600:U:U \
        RRA:MAX:0.5:1:20000

Since this is run from a script, ${RRDTOOL} is the path to the rrdtool command and ${RRDFILES}/${HOST}-cpu.rrd is the path where the database will be stored; they seem to keep one file per host, each ending in -cpu.rrd.

The next line, --step 300, sets how often measurements will be recorded, in seconds; in other words, once every 5 minutes. I will measure every minute, so my step will be 60.

Next comes DS, or data source. This line contains a lot so I want to break it down further.

DS:ds0:GAUGE:600:U:U

The field ds0 is the name of the data source. It can be anything you want, as long as it is 1-19 characters long and contains only letters, numbers or underscores. This name will be referenced when creating graphs.

The field GAUGE is the data source type. The documentation says this about it:

is for things like temperatures or number of people in a room or the value of a RedHat share.

My interpretation of this is that it is for values that can seemingly change at will, without being affected by previous readings, and so will be stored as-is. Unfortunately that is not the whole picture as there is something called a “consolidation function”, or CF for short, that will be applied when we record data. I’ll get back to that later.

For storing interface counters, or network traffic, I will be using the COUNTER type, but more on that as I get to it.

Next on the data source line is the value 600. This field is called heartbeat, and is the maximum number of seconds that may pass between two updates before the value is considered unknown. In other words, if 600 or more seconds pass (here, twice the step), the value is recorded as unknown. This also means values don’t have to be recorded exactly at step boundaries; there is some leeway, as defined by the heartbeat.
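To see the heartbeat in action, here is a small sketch (file name and values made up): with a step of 60 and a heartbeat of 120, an update arriving 70 seconds after the previous one is still accepted, while a 170-second gap leaves the affected steps unknown:

start=$(( $(date '+%s') / 60 * 60 ))
rrdtool create hb-demo.rrd --start ${start} --step 60 \
        DS:ds0:GAUGE:120:U:U \
        RRA:LAST:0.5:1:10
rrdtool update hb-demo.rrd $(( start + 60 )):10    # on time
rrdtool update hb-demo.rrd $(( start + 130 )):20   # 70 s later, within heartbeat
rrdtool update hb-demo.rrd $(( start + 300 )):30   # 170 s later, heartbeat exceeded
# The steps covered by the long gap come back as unknown (nan):
rrdtool fetch hb-demo.rrd LAST --start ${start} --end $(( start + 300 ))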

The next two fields, both U here, define the minimum and maximum values, if known. If we don’t know what they can be, we set them to U as shown here. I’m not sure what the maximum CPU load can be, but I would expect the minimum to be 0.

The next line defines the RRA, or round robin archive. This will contain the data from each of the defined data sources.

RRA:MAX:0.5:1:20000

The field MAX is the consolidation function being used. Other alternatives are AVERAGE, MIN and LAST. As far as I can tell, the consolidation function only matters when more than one primary data point goes into each archived value, which is controlled by the steps field below; with one step per row, as here, the value is stored as-is no matter which function is used.

The next field, here 0.5, is called the xfiles factor, and defines how much of a consolidated interval may be made up of unknown values before the consolidated value itself is considered unknown. 0.5 means that up to half the values may be unknown. Like the consolidation function, it only comes into play when multiple primary data points are combined into one archived value.

The next field, here 1, is steps, which defines how many primary data points are used to build a consolidated data point. For the consolidation function and the xfiles factor to have any effect, this has to be greater than 1.
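This is easy to verify with a throwaway database (names and values made up): two archives that each consolidate 5 primary data points, one with AVERAGE and one with MAX, produce different rows from the same five updates:

start=$(( $(date '+%s') / 300 * 300 ))
rrdtool create cf-demo.rrd --start ${start} --step 60 \
        DS:ds0:GAUGE:120:U:U \
        RRA:AVERAGE:0.5:5:100 \
        RRA:MAX:0.5:5:100
# Five primary data points: 10, 20, 30, 40, 50
for i in 1 2 3 4 5; do
        rrdtool update cf-demo.rrd $(( start + i * 60 )):$(( i * 10 ))
done
# Each archive consolidates them into one row: 30 for AVERAGE, 50 for MAX
rrdtool fetch cf-demo.rrd AVERAGE --start ${start} --end $(( start + 300 ))
rrdtool fetch cf-demo.rrd MAX --start ${start} --end $(( start + 300 ))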

Then the final field is rows, which defines how many values are kept. In this example we keep 20 000 values, and recording one value every 300 seconds works out to storing data for 69 days, 10 hours and 40 minutes. In other words, the oldest value in the archive will be from that long ago. This is nice if you want to create a graph showing the measurements as they were 2 months ago, but if you don’t need that kind of precision there are other ways of doing this, which I’ll get back to.

RRD for interface counters#

Next up is storing data from our network interface. Here we need two data sources, since we store both incoming and outgoing traffic.

Continuing with the example from before, they show this:

        --step 300 \
        DS:ds0:COUNTER:600:0:1250000000 \
        DS:ds1:COUNTER:600:0:1250000000  \
        RRA:AVERAGE:0.5:1:600 \
        RRA:AVERAGE:0.5:6:700 \
        RRA:AVERAGE:0.5:24:775 \
        RRA:AVERAGE:0.5:288:797 \
        RRA:MAX:0.5:1:600 \
        RRA:MAX:0.5:6:700 \
        RRA:MAX:0.5:24:775 \
        RRA:MAX:0.5:288:797

Here it makes sense to use the COUNTER data type instead of GAUGE. RRDtool has logic built in to handle values that wrap around when using this data type.

The 1250000000 max value confused me at first. It is 1 250 000 000 bytes, which works out to 10 000 000 000 bits, matching the network port speed of 10 Gbit/s. The key is that for a COUNTER data source the stored value is a rate, so the min and max fields apply to the computed rate in bytes per second, not to the raw counter value; any rate above what a 10 Gbit/s port can deliver is discarded as bogus. The raw counters we query with SNMP are either 32- or 64-bit and wrap around at 4 294 967 296 or 18 446 744 073 709 551 616 bytes respectively. That last number is 18 quintillion, or 18 million million million bytes.

In any case, they show how multiple archives can be defined with decreasing precision for older data. The first RRA is similar to the CPU example, but keeps an average of the data for only about 2 days. The next is a little more interesting: it collects 6 values and stores their average as one value. With 6 values at a step of 300 seconds, one value is stored every 30 minutes, and keeping 700 of them covers around 14.5 days. The next keeps one value every 2 hours times 775 rows, which is about 64.5 days, and the last keeps one value per day for 797 days. Then they do the same for the MAX value of each interval.

I wrote all that while trying to make sense of it, and I personally think those numbers are a little arbitrary, so I’ll be using other ones. You are of course very welcome to tell me I haven’t understood this at all and got everything completely wrong. I’d welcome the learning opportunity! Please reach out on Mastodon if I’m mistaken and you can explain how this really works!

My RRD for CPU#

Getting right into it, I defined the database for CPU load like this:

        --step 60 \
        DS:ds0:GAUGE:120:U:U \
        RRA:LAST:0.5:1:1440 \
        RRA:AVERAGE:0.5:5:1440 \
        RRA:AVERAGE:0.5:10:1440 \
        RRA:AVERAGE:0.5:20:1440 \
        RRA:AVERAGE:0.5:30:1440

In short, I aim to insert one value every minute, consider the value unknown if more than two minutes have passed, and keep the following archives:

The actual values are stored for 24 hours, then I keep the 5 minute average for 5 days, 10 minute average for 10 days, 20 minute average for 20 days and 30 minute average for 30 days.

My RRD for interface counters#

This is quite similar to the CPU database, except that I also keep the maximum value recorded:

        --step 60 \
        DS:ds0:COUNTER:120:0:U \
        DS:ds1:COUNTER:120:0:U \
        RRA:LAST:0.5:1:1440 \
        RRA:AVERAGE:0.5:5:1440 \
        RRA:AVERAGE:0.5:10:1440 \
        RRA:AVERAGE:0.5:20:1440 \
        RRA:AVERAGE:0.5:30:1440 \
        RRA:MAX:0.5:1:1440 \
        RRA:MAX:0.5:5:1440 \
        RRA:MAX:0.5:10:1440 \
        RRA:MAX:0.5:20:1440 \
        RRA:MAX:0.5:30:1440

I’ve set the minimum value to 0, but the maximum to unknown, so RRDtool won’t perform any sanity checks on it.

Getting the data#

I wrote about SNMP, the commands to use and the relevant OIDs in my previous post. Please read that for more details.

Since my VPS only has one CPU it is easy to get the data for it:

$ snmp get -A "Dash A pass" -a SHA-256 -l authPriv -u user -X "Dash X pass" localhost hrProcessorLoad.1

If you have more, you have a few options. You can fetch each value individually and combine them all in one graph to see the load across CPUs, or you can combine them into an average. Instead of fetching them one by one, you can list them all at once with the walk parameter as shown in the previous post, and then do some scripting magic to get the number(s) you want.
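For example, one possible approach (a sketch, not something from my scripts) is to walk all the hrProcessorLoad entries and let awk average them, assuming the same output format as the get above:

snmp walk -A "Dash A pass" -a SHA-256 -l authPriv -u user -X "Dash X pass" \
        localhost hrProcessorLoad \
        | awk '{ sum += $4; n++ } END { if (n > 0) print sum / n }'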

The interface counters turned out to be a little more complex. As with the CPU, the exact ID of the interface is needed; it can’t be referenced by name, and I found that an interface can change ID. At least I think the WireGuard interface did, though that may be because I created and destroyed it a lot while testing. It’s probably safe to assume that our main interface, vio0, will always be first, but to make sure I do the following:

runstats() {
  ${SCRIPTS}/get-interface-stats.sh $1 $2
  ${SCRIPTS}/get-interface-stats.sh $3 $4
}

ifids=$(snmp walk -A "${AUTHPASS}" -a SHA-256 -l authPriv -u ${USER} -X "${PRIVPASS}" ${HOST} ifDescr \
        | grep -E 'vio0|wg0' \
        | awk '{print $1" "$4}' \
        | cut -d'.' -f2)
runstats $ifids

I get the list of all interfaces with snmp walk, grep for the interfaces I want, pick out their ID and name in that order, and call the script that records interface stats with that as input.

32 vs 64-bit counters#

Let’s talk about 32 vs 64-bit counters. Both are available to read from for interface counters. The 32-bit counter for incoming traffic is called ifInOctets.1 for interface 1, and the 64-bit counter for the same is called ifHCInOctets.1. The counters for outgoing traffic simply have Out in their name instead of In.

Now, what is the difference between them, and why would we use one over the other? One important aspect is that these counters only ever go up, unless reset by a reboot for instance, and that they wrap around, meaning they start again from 0 when they exceed their maximum value.

To take the 64-bit counter first, it goes up to 18 quintillion and that number is so large it’s unlikely to wrap around in my lifetime. If I got my math right, at a constant 10 Gbit/s it would take over 460 years to exceed its max value.

The 32-bit counter, however, “only” goes up to around 4 GB before it wraps around, and that limit is quite easy to reach. So what happens if we read and store one value around 4 GB, and the next one is 1 GB? Fortunately RRDtool has built-in logic to handle this, and is able to calculate how much the counter increased before it wrapped around. That means it can calculate the correct usage for the interval, instead of suddenly showing a negative number.
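With made-up numbers, the calculation looks something like this: if the previous reading was 4 294 000 000 and the next is 1 000 000, RRDtool assumes a single wrap at 2^32 and computes the real increase, then divides by the elapsed time to get the rate:

$ echo $(( 4294967296 - 4294000000 + 1000000 ))
1967296
$ echo $(( (4294967296 - 4294000000 + 1000000) / 60 ))
32788

That is 1 967 296 bytes moved during the interval, or about 32 788 bytes per second over a 60-second step.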

Wrapping can still be a problem if the intervals are long and/or the interfaces are high speed with a lot of traffic. If the counter wraps around multiple times between one reading and the next, RRDtool has no way of knowing, and will return a number far lower than the actual usage. In that scenario it might be necessary to use the 64-bit counter.

One final thing: if the 64-bit counter solves the quick wrap-around problem, wouldn’t it make sense to just always use it? The problem there is that if the counter is reset, perhaps because of a reboot, RRDtool will see a number lower than the previous one and calculate how much traffic must have passed to cause a wrap-around. That gives an enormous peak on the graph, basically making it unreadable for as long as the peak is within the graph interval.

I see that this can be fixed by setting a max value or by using the data source type DERIVE instead of COUNTER, but that is a little beyond me still.
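From what I read, DERIVE skips the wrap-around correction entirely, so a counter reset simply produces a negative rate, and a minimum of 0 makes RRDtool store that one reading as unknown instead of an enormous spike. A hedged sketch of what the data source lines could look like:

        DS:ds0:DERIVE:120:0:U \
        DS:ds1:DERIVE:120:0:U \

As I understand it, the trade-off is that a genuine 32-bit wrap-around would also be discarded, so this fits best with the 64-bit counters.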

Anyway, it’s something to be aware of.

The scripts#

Putting it all together, here are the scripts I’m currently using to record data. Since SNMP needs certain things like passwords, username and host on every command, I’ve put those into their own file, which I source at the beginning of every script that needs them:

/home/user/rrdtool/env.sh

USER="user"
AUTHPASS="Dash A pass"
PRIVPASS="Dash X pass"
HOST="localhost"
RRDFILES="/home/user/rrdtool/dbfiles"
RRDTOOL="/usr/local/bin/rrdtool"

Then I have a main script which calls the other scripts. This is the one I start from crontab every minute:

/home/user/rrdtool/collect-stats.sh

#!/bin/sh

. /home/user/rrdtool/env.sh

SCRIPTS="/home/user/rrdtool"

runstats() {
  ${SCRIPTS}/get-interface-stats.sh $1 $2
  ${SCRIPTS}/get-interface-stats.sh $3 $4
}

# Get current cpu load
${SCRIPTS}/get-cpu-load.sh

# Get current counters for select interfaces
ifids=$(snmp walk -A "${AUTHPASS}" -a SHA-256 -l authPriv -u ${USER} -X "${PRIVPASS}" ${HOST} ifDescr \
        | grep -E 'vio0|wg0' \
        | awk '{print $1" "$4}' \
        | cut -d'.' -f2)
runstats $ifids

I run the CPU stats first, to avoid it being affected by the rest of the stuff I am doing here, though I haven’t checked if it makes a measurable difference.

/home/user/rrdtool/get-cpu-load.sh

#!/bin/sh

. /home/user/rrdtool/env.sh

RRDCPU="${RRDFILES}/cpu.rrd"

TIMESTAMP=$(date '+%s' | cut -c 1-9)0

# For 1 CPU only
CPULOAD=$(snmp get -A "${AUTHPASS}" -a SHA-256 -l authPriv -u ${USER} -X "${PRIVPASS}" ${HOST} hrProcessorLoad.1 | awk '{print $4}')

if [ ! -f "${RRDCPU}" ]; then
${RRDTOOL} create ${RRDCPU} \
        --step 60 \
        DS:ds0:GAUGE:120:U:U \
        RRA:LAST:0.5:1:1440 \
        RRA:AVERAGE:0.5:5:1440 \
        RRA:AVERAGE:0.5:10:1440 \
        RRA:AVERAGE:0.5:20:1440 \
        RRA:AVERAGE:0.5:30:1440
fi

${RRDTOOL} update ${RRDCPU} ${TIMESTAMP}:${CPULOAD}

I always check if the RRD file exists before trying to update it. If it doesn’t exist, it will be created. It may be a little more work each time, but it means the script will work the first time, and I won’t have to remember to create the database files first. It also means I can just delete the files if I want to start fresh, or move them out of the way to have them archived for instance.

One thing to note here, which I have not mentioned before, is that in addition to aggregating data according to the consolidation function, RRDtool also performs normalization. In other words, if values aren’t stored at the exact times they are expected, RRDtool calculates what they would have been at those times.

For another explanation of this, see this article: Rates, normalizing and consolidating

We can look at the data with the fetch command. This is what it looks like while the database is still empty:

$ rrdtool fetch cpu.rrd LAST
1751219640: nan
1751219700: nan
1751219760: nan
1751219820: nan
1751219880: nan
1751219940: nan
1751220000: nan

Now I can update the database with the value 10 at exactly an expected time like this: rrdtool update cpu.rrd '1751219700:10'

The database now contains this:

$ rrdtool fetch cpu.rrd LAST
1751219640: nan
1751219700: 1.0000000000e+01
1751219760: nan
1751219820: nan
1751219880: nan
1751219940: nan
1751220000: nan

Now observe the results if I update the database 2 seconds after the expected time instead:

$ rrdtool update cpu.rrd '1751219762:20'
$ rrdtool update cpu.rrd '1751219822:10'
$ rrdtool update cpu.rrd '1751219882:30'
$ rrdtool update cpu.rrd '1751219942:10'
$ rrdtool fetch cpu.rrd LAST
1751219640: nan
1751219700: 1.0000000000e+01
1751219760: 2.0000000000e+01
1751219820: 1.0333333333e+01
1751219880: 2.9333333333e+01
1751219940: 1.0666666667e+01
1751220000: nan

The next value, 20, is fine, but look at what happens to the values after that. The next 10 is stored as 10.33, the 30 as 29.33 and the final 10 as 10.67. Each step gets the time-weighted average of the updates covering it; the 1751219820 step, for example, covers 2 seconds of the value 20 and 58 seconds of the value 10, and (2 × 20 + 58 × 10) / 60 ≈ 10.33.

It’s not a huge difference, and it’s not exactly breaking anything; I just want these values stored exactly as I read them. After all, RRDtool can’t really know what happened in between, so it can’t tell whether the value was actually higher or lower at the exact minute than at the moment I made the update.

So to achieve exact values I fudge the timestamp a little with this line:

TIMESTAMP=$(date '+%s' | cut -c 1-9)0

This takes the Unix time in seconds, cuts off the last digit and appends a 0 instead, rounding the timestamp down to the nearest 10 seconds. As long as my script runs within the first ten seconds of the minute, the reading will be logged at exactly the minute.
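If you would rather not rely on the script starting within those ten seconds, shell arithmetic rounds down to the whole minute no matter which second the script starts on; a small alternative sketch:

TIMESTAMP=$(( $(date '+%s') / 60 * 60 ))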

The line that updates the database is this:

${RRDTOOL} update ${RRDCPU} ${TIMESTAMP}:${CPULOAD}

If you don’t care about fudging the time like I do, replace ${TIMESTAMP} with N and RRDtool will use the current time.

Now on to getting the interface counters. This script is called twice, once for the vio0 interface, then again for the wg0 interface.

/home/user/rrdtool/get-interface-stats.sh

#!/bin/sh

if [ -z "$2" ]; then
  echo "Missing input: <interface-id> <interface-name>"
  exit 1
fi

. /home/user/rrdtool/env.sh

INTERFACEID="$1"
INTERFACENAME="$2"
RRDFILE="${RRDFILES}/${INTERFACENAME}.rrd"

IN=$(snmp get -A "${AUTHPASS}" -a SHA-256 -l authPriv -u ${USER} -X "${PRIVPASS}" ${HOST} ifInOctets.${INTERFACEID} | awk '{print $4}')
OUT=$(snmp get -A "${AUTHPASS}" -a SHA-256 -l authPriv -u ${USER} -X "${PRIVPASS}" ${HOST} ifOutOctets.${INTERFACEID} | awk '{print $4}')

if [ ! -f "${RRDFILE}" ]; then
${RRDTOOL} create ${RRDFILE} \
        --step 60 \
        DS:ds0:COUNTER:120:0:U \
        DS:ds1:COUNTER:120:0:U  \
        RRA:LAST:0.5:1:1440 \
        RRA:AVERAGE:0.5:5:1440 \
        RRA:AVERAGE:0.5:10:1440 \
        RRA:AVERAGE:0.5:20:1440 \
        RRA:AVERAGE:0.5:30:1440 \
        RRA:MAX:0.5:1:1440 \
        RRA:MAX:0.5:5:1440 \
        RRA:MAX:0.5:10:1440 \
        RRA:MAX:0.5:20:1440 \
        RRA:MAX:0.5:30:1440
fi

${RRDTOOL} update ${RRDFILE} N:${IN}:${OUT}

Here we update the database at the current time and let RRDtool do its thing, since this counter will always be increasing. This also shows how to insert two values at once: simply append them to the update string, separated by colons.

Now make sure all the scripts are executable and finally add the main script to crontab for the user you want to run it as:

#minute hour    mday    month   wday    [flags] command
*       *       *       *       *       -n /home/user/rrdtool/collect-stats.sh

Graphs#

This post took a lot longer to write than I expected; I started it three days ago, as there was a lot to look into. I had cobbled together some scripts that were working, but to present them here I felt the need to clean them up a little. Then there was a lot of checking to make sure what I wrote wasn’t completely wrong. I think I learned a lot in the process, and I understand RRDtool a little better now.

So the graphs will have to come later; I have to clean up those scripts too, and they will get their own post.