Creating graphs with RRDtool

Finally, we’re at the last stretch of what ended up being a 3-part series on installing, configuring and using SNMP and RRDtool to get some simple graphs for CPU load and network traffic.

I’ll continue dissecting the OpenBSD.Amsterdam setup as it’s what I’ve been doing so far, but now we can start to be a little more creative. With SNMP, the work was in setting it up so we could collect system data from it. We could have set up “traps” to send messages to other systems if certain thresholds were reached, and I might come back to that. Once SNMP was set up with a user, there wasn’t much more to it.

With RRDtool, the first part was about getting the data and storing them in databases. Defining the databases was mainly about how often we’ll collect data, how long we’ll store them for, and a little bit about what type of data we’ll collect; once that was done, that was it. Changing this setup isn’t hard, but it means you can lose some of what has been collected as there isn’t, to the best of my knowledge, a way to change an existing database. Instead, a new one has to be created, and the old data can then be exported from the old database and imported into the new one.

When it comes to creating graphs, which is about reading the data we have stored and creating visual representations of them, we can be more creative. Since we are only reading data, we won’t break anything and it’s easier to experiment.

The scripts

I’ve split up “gathering the data” and “creating the graphs” into separate scripts. This gives me the opportunity to run them at different intervals. While I collect data every minute, I can choose to create the graphs every 5 minutes, every 15 minutes, every hour, and perhaps not at night when I’m likely to be sleeping. Or they can be created every minute, same as the gathering scripts. So many choices.

Like with gathering the data, I have a main script which creates the graphs and it’s this I run from crontab:

/home/user/rrdtool/create-all-graphs.sh

#!/bin/sh

# We're in no hurry, wait for other jobs to finish
sleep 5

SCRIPTS="/home/user/rrdtool"

${SCRIPTS}/system-stats.sh >/dev/null
${SCRIPTS}/create-cpu-graphs.sh >/dev/null
${SCRIPTS}/create-interface-graphs.sh vio0 >/dev/null
${SCRIPTS}/create-interface-graphs.sh wg0 >/dev/null

If you have more interfaces, or your interfaces are named differently, you’ll need to change the input to the create-interface-graphs script. I’ve been thinking about how to handle this more dynamically, but shell scripting isn’t what I’m best at. I often long for a programming language where I can, for instance, iterate through arrays when doing something like this, but that may be due to my limited knowledge of shell scripting. It seems to be possible to write a webserver in awk, so I’m sure it can be solved without resorting to python. But do I want to learn awk, or do I want to learn python?
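A plain sh for loop over a space-separated list may be enough for this. The following is only a minimal sketch: the INTERFACES variable and the list_graph_jobs function are names made up for illustration and are not part of the scripts in this post. It prints the commands instead of running them, so remove the echo to actually create the graphs.

```shell
#!/bin/sh

SCRIPTS="/home/user/rrdtool"
# Assumption: a hand-maintained, space-separated list of interface names.
INTERFACES="vio0 wg0"

# Print one graph command per interface (dry run).
list_graph_jobs() {
    for IFACE in ${INTERFACES}; do
        echo "${SCRIPTS}/create-interface-graphs.sh ${IFACE}"
    done
}

list_graph_jobs
```

Adding an interface then means editing one variable rather than copying a line.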

To continue, the first thing the script does is run “system-stats”, which simply gets the current uptime as well as the OS version and writes them to files. These will be read and used when creating the graphs. I’m using SNMP for this as it seemed like the simplest and most portable way to do it. Uptime could also be found with the uptime command, for instance, but that may require more parsing and scripting. The result from hrSystemUptime.0 isn’t perfect as it will always say “day” and not “days” where applicable, but I’ve deemed it good enough for me. sysDescr.0 could be replaced by uname, but again, it may require more scripting and parsing.

One thing to note is that RRDtool tries to interpret : in values and will crash if something goes wrong. That’s why there is a sed command to change : to \: (escaping), which I hope readers are familiar with.

/home/user/rrdtool/system-stats.sh

#!/bin/sh

. /home/user/rrdtool/env.sh

snmp get -A "${AUTHPASS}" -a SHA-256 -l authPriv -u ${USER} -X "${PRIVPASS}" ${HOST} hrSystemUptime.0 \
  | awk -F"[)]" '{print $2}' \
  | sed 's/:/\\:/g' >${DBFILES}/uptime.txt
snmp get -A "${AUTHPASS}" -a SHA-256 -l authPriv -u ${USER} -X "${PRIVPASS}" ${HOST} sysDescr.0 \
  | awk '{print $6,$7}' >${DBFILES}/version.txt
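To see what the awk and sed steps do without querying a host, the same pipeline can be fed a sample line. This is a sketch under assumptions: the SAMPLE text is made up to resemble an uptime reply, and the real output on your system may be formatted differently. The format_uptime function name is also just for illustration.

```shell
#!/bin/sh

# Assumption: a made-up sample of what the uptime query might print.
SAMPLE='hrSystemUptime.0 = Timeticks: (12345678) 1 day, 10:17:36.78'

# Keep the text after ")" and escape every ":" for RRDtool.
format_uptime() {
    awk -F"[)]" '{print $2}' | sed 's/:/\\:/g'
}

printf '%s\n' "${SAMPLE}" | format_uptime
# prints " 1 day, 10\:17\:36.78" (with the leading space)
```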

We’ve previously used rrdtool create to create the database containing the archives, then we used rrdtool update to insert our measurements into the archives, and now we’re finally going to use rrdtool graph to create some graphs to visualize all we’ve done so far. This is where the fun begins.

This little script will create a graph for CPU usage similar to what OpenBSD.Amsterdam have on their page for their hosts:

/home/user/rrdtool/create-cpu-graphs.sh

#!/bin/sh

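# DBFILES and RRDTOOL are expected to come from env.sh
. /home/user/rrdtool/env.sh

# Short host name used in the graph title
HOST="obsd-web"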

RRDCPU="${DBFILES}/cpu.rrd"

IMAGEDIR="/var/www/htdocs/rrdtool"
IMAGEBASECPU="${IMAGEDIR}/cpu"

UPTIME=$(cat ${DBFILES}/uptime.txt)
VERSION=$(cat ${DBFILES}/version.txt)
NOW=$(date "+%Y-%m-%d %H\:%M\:%S %Z")

# cpu usage last 6 hours
${RRDTOOL} graph ${IMAGEBASECPU}-6h.png \
        --start -21600 \
        --title "${HOST} - ${VERSION} - CPU" \
        --vertical-label "CPU load" \
        --border 0 \
        DEF:CPU=${RRDCPU}:ds0:LAST \
        AREA:CPU#FFCC00 \
        LINE2:CPU#CC0033:"CPU" \
        GPRINT:CPU:MAX:"Max\:%2.2lf %s" \
        GPRINT:CPU:AVERAGE:"Average\:%2.2lf %s" \
        GPRINT:CPU:MIN:"Min\:%2.2lf %s" \
        GPRINT:CPU:LAST:" Current\:%2.2lf %s\n" \
        COMMENT:" \\n" \
        COMMENT:"  Up for ${UPTIME} at ${NOW}"

There is quite a lot going on here. I’ve created an “imagebase” variable which contains the path to where the graph should be written and most of the filename, to make it easy to make graphs for different periods.

--start -21600 is the start point for the graph, 21600 seconds or 6 hours in the past. This gives us a rolling graph, always for the last 6 hours.
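Combining the “imagebase” idea with the start offset, graphs for several periods could be produced from one loop. The sketch below only prints the file name and the --start value for each period. Only the 6 hour period comes from the script above; the 24h and 7d periods and the list_periods name are assumptions for illustration.

```shell
#!/bin/sh

IMAGEBASECPU="/var/www/htdocs/rrdtool/cpu"

# Print "file --start offset" for each period, given as label:seconds pairs.
# Only 6h is from the script above; 24h and 7d are assumed extras.
list_periods() {
    for PERIOD in 6h:$((6 * 3600)) 24h:$((24 * 3600)) 7d:$((7 * 86400)); do
        LABEL=${PERIOD%%:*}      # text before the colon
        OFFSET=${PERIOD##*:}     # seconds after the colon
        echo "${IMAGEBASECPU}-${LABEL}.png --start -${OFFSET}"
    done
}

list_periods
```

The first line printed is /var/www/htdocs/rrdtool/cpu-6h.png --start -21600, matching the values used in the script.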

--title "${HOST} - ${VERSION} - CPU" sets the title on the graph and you can put whatever you want here, really. It is optional and can be omitted. I use the hostname in short form and the OS version, as well as specifying that it is a CPU graph. The last bit may seem redundant, but systems using net-snmp, like FreeBSD, also have access to load averages for the system. I like to have those graphs in the same color, and so I specify which is which in the title.

--vertical-label "CPU load" is an optional, explanatory text, which will be written to the left of the graph. It can be omitted, but I think it looks nicer with it there.

Then we specify that we don’t want a border around the image with --border 0.

This has been mostly about the canvas so far, the place we’ll use to draw our graph, and there’s still a lot more that could have been configured, like its size and colors. We’ll just use the defaults for now, but we’ll at least get back to colors in the more creative section.

The next part is about getting the data we want to draw, and a little bit about how to draw it.

DEF:CPU=${RRDCPU}:ds0:LAST

This command fetches data from the RRD file RRDCPU defined earlier, and stores the value in the variable CPU. It fetches data from the first data source, ds0, using the consolidation function LAST for each value.

Next up comes the part where we decide how to draw things.

AREA:CPU#FFCC00

This specifies that for whatever the value CPU has, we want to fill the area of CPU with the color #FFCC00. In more detail, area means that if the value is 5, everything from 0 to 5 will be filled with the color we specified.

LINE2:CPU#CC0033:"CPU"

This specifies that we will draw a line with a width of 2 for the value in CPU in the color #CC0033. It also specifies that we will write out “CPU” on the graph as the legend, which gives us a box with its color (red) next to it explaining which element it is for. This is not so useful here, but will be when we have multiple elements in the graph, like for incoming and outgoing traffic.

GPRINT:CPU:MAX:"Max\:%2.2lf %s"
GPRINT:CPU:AVERAGE:"Average\:%2.2lf %s"
GPRINT:CPU:MIN:"Min\:%2.2lf %s"
GPRINT:CPU:LAST:" Current\:%2.2lf %s\n"

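This uses the GPRINT function to write text below the graph. Here we write out the MAX, AVERAGE, MIN and LAST values of CPU with two decimals of precision. The trailing %s is replaced by an SI magnitude prefix (such as k or M) when RRDtool scales the value.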
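The numeric part of the format works like printf in C. As a rough illustration, awk’s %f can stand in for RRDtool’s %lf; this sketch leaves out the %s unit part, and the format_value name is made up for the example.

```shell
#!/bin/sh

# Format a number the way "%2.2lf" would: at least 2 characters wide,
# 2 digits after the decimal point. awk's %f stands in for RRDtool's %lf.
format_value() {
    awk -v v="$1" 'BEGIN { printf "%2.2f\n", v }'
}

format_value 0.4567    # prints 0.46
format_value 12        # prints 12.00
```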

Then we add a comment containing server uptime as well as current time. This lets us see if the server has restarted for instance, as well as when the graph was last updated, so we can see if our scripts are running or not.

Now let’s create some graphs for the interface counters, which requires an interface name as input:

/home/user/rrdtool/create-interface-graphs.sh

#!/bin/sh

if [ -z "$1" ]; then
  echo "Missing input: <interface-name>" >&2
  exit 1
fi

# DBFILES and RRDTOOL are expected to come from env.sh
. /home/user/rrdtool/env.sh

INTERFACENAME="$1"
HOST="obsd-web"
RRDFILEIF="${DBFILES}/${INTERFACENAME}.rrd"

IMAGEDIR="/var/www/htdocs/rrdtool"
IMAGEBASEIF="${IMAGEDIR}/${INTERFACENAME}"

UPTIME=$(cat ${DBFILES}/uptime.txt)
VERSION=$(cat ${DBFILES}/version.txt)
NOW=$(date "+%Y-%m-%d %H\:%M\:%S %Z")

${RRDTOOL} graph ${IMAGEBASEIF}-6h.png \
        --start -21600 \
        --title "${HOST} - ${VERSION} - ${INTERFACENAME}" \
        --vertical-label "Bits per Second" \
        --border 0 \
        DEF:IN=${RRDFILEIF}:ds0:AVERAGE \
        DEF:OUT=${RRDFILEIF}:ds1:AVERAGE \
        CDEF:BITS_IN="IN,8,*" \
        CDEF:BITS_OUT="OUT,8,*" \
        AREA:BITS_IN#00FF00:"In " \
        GPRINT:BITS_IN:MAX:"Max\:%5.2lf %s" \
        GPRINT:BITS_IN:AVERAGE:"Avg\:%5.2lf %s" \
        GPRINT:BITS_IN:MIN:"Min\:%5.2lf %s" \
        GPRINT:BITS_IN:LAST:" Current\:%5.2lf %s\n" \
        LINE2:BITS_OUT#0000FF:"Out" \
        GPRINT:BITS_OUT:MAX:"Max\:%5.2lf %s" \
        GPRINT:BITS_OUT:AVERAGE:"Avg\:%5.2lf %s" \
        GPRINT:BITS_OUT:MIN:"Min\:%5.2lf %s" \
        GPRINT:BITS_OUT:LAST:" Current\:%5.2lf %s\n" \
        COMMENT:"  Up for ${UPTIME} at ${NOW}"

A lot is similar, but there are also some new things. I set the interface name in the title so it’s easy to see what interface I’m looking at the stats for.

Then we have two DEF lines, as we now have two data sources, one for incoming traffic and one for outgoing.

Next we have CDEF, which performs calculations on the DEF values and returns the new result. This might look strange or unfamiliar as it uses Reverse Polish notation, but it is quite clever.

DEF:IN=${RRDFILEIF}:ds0:AVERAGE \
DEF:OUT=${RRDFILEIF}:ds1:AVERAGE \
CDEF:BITS_IN="IN,8,*" \
CDEF:BITS_OUT="OUT,8,*" \

To try to explain in full what is going on here, we first define IN with the first data source value from the specified file using the “average” consolidation function. Then we do the same for the second data source, which is outgoing data.

Next we create a new variable called “BITS_IN”, which does something with the “IN” value. If you remember from when we got data with SNMP, the counter was called something with “octets” in it, which usually is a byte. Given there are 8 bits in a byte, we get number of bits by multiplying bytes by 8, and that is what this does: IN,8,*.

RRDtool reads the values from left to right, pushing them onto a stack until it reaches an operator. What happens then depends on the operator. The operator in question here is *, the multiply sign. This operator takes the previous two values off the stack, which are the value of IN and 8, and multiplies them. The result is then put back on the stack, and since there are no more values or operators, we are done and that value is returned.
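The stack handling can be imitated in a few lines of awk. This is a toy evaluator for illustration only, not how RRDtool is implemented: it handles just + - * / on plain numbers, the rpn function name is made up, and 1250 is an invented bytes-per-second value standing in for IN.

```shell
#!/bin/sh

# Evaluate a comma-separated RPN expression with + - * / on a stack.
rpn() {
    printf '%s\n' "$1" | awk -F, '{
        sp = 0
        for (i = 1; i <= NF; i++) {
            if ($i == "+" || $i == "-" || $i == "*" || $i == "/") {
                b = stack[sp--]          # pop right operand
                a = stack[sp--]          # pop left operand
                if ($i == "+") r = a + b
                if ($i == "-") r = a - b
                if ($i == "*") r = a * b
                if ($i == "/") r = a / b
                stack[++sp] = r          # push the result
            } else {
                stack[++sp] = $i         # push a value
            }
        }
        print stack[sp]
    }'
}

rpn "1250,8,*"    # prints 10000
```

Longer expressions work the same way: "3,4,+,2,*" first replaces 3 and 4 with 7, then multiplies 7 by 2.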

Then as you can see, instead of creating a graph based on the “raw” values from DEF, we used the calculated values instead:

AREA:BITS_IN#00FF00:"In "

and

LINE2:BITS_OUT#0000FF:"Out"

Incoming data is colored green and drawn using AREA, while outgoing data is drawn on top of the area using a blue line. Note that order matters here: if everything related to OUT had been listed first and then IN, the blue line would have been drawn over and invisible every time the green incoming area was bigger than the outgoing values.

I think that is it for now. I’ve spent some time cleaning up my scripts, so the next step is publishing them so they can be downloaded and experimented with. I think that makes understanding this a lot easier. I’ve also experimented with different colors and some more advanced calculations, which definitely warrant a post of their own. I also need to figure out how to add images to the blogging solution, so that the actual results can be seen instead of a wall of text with explanations.

Hopefully that won’t take me as long as it took me to get this post out of the way.