Graphing Git repository activity with rrdtool

16th June 2024 from jc's blog

After my previous post on Monitoring with Munin I want to shine some more light on the tool powering Munin, RRDtool. RRDtool is a command-line based time-series database that also allows you to export its data in a highly customizable graphical format. To get an idea of the graphs you can make with it, see the official gallery.

I’m going to showcase how to store and graph git repository changes using rrdtool: to be more specific, we are going to graph the lines of code over time. rrdtool is primarily intended for time-series data that is fed to the tool at regular intervals, but we’re going to mess around a bit so that it works with irregular data.

We will use the Erlang/OTP repository (using the extended history since 1999, see Extending the history of Erlang/OTP) in my examples, but you are free to use whichever repository you prefer.

Creating the round-robin database

To start off we need to create a round-robin database using the rrdtool create command. We need to specify the start time of the database and define data sources.

We will use git log to obtain the Unix timestamp of the first commit and subtract 1 from it, so that the first commit can serve as the first data point (rrdtool only accepts updates that are newer than the start time). This will be used as our round-robin database start time:

FIRST_COMMIT_TS="$(($(git log --reverse --format='%ct' | head -n 1) - 1))"

Let’s define the data source. One RRD can have multiple data sources, which are either fed from the command line or computed based on other values. The format for a GAUGE data source is:

DS:ds-name:GAUGE:heartbeat:min:max

heartbeat is how long we should consider the last update relevant before the value of the data source is assumed to be unknown. min and max, if specified, define the expected range of values for any data supplied; values outside that range are regarded as unknown. We can also use U if we neither know nor care about them. For heartbeat, we will specify a day, and we’ll name our data source lines, so we end up with:

DS:lines:GAUGE:1d:0:U
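To make heartbeat and min a bit more tangible, here is a rough model of that data source in awk. It is a sketch of the rule as described above, not of how rrdtool is actually implemented (rrdtool works on time slots, not on single updates). The classify helper and the sample updates are made up for illustration.

```shell
# Rough model of DS:lines:GAUGE:1d:0:U. This is NOT how rrdtool is
# implemented; classify is a made-up helper for illustration.
# Input: one "timestamp value" pair per line.
classify() {
  awk -v heartbeat=86400 -v min=0 '
    $2 < min                        { print $1, "unknown (below min)"; prev = $1; next }
    NR > 1 && $1 - prev > heartbeat { print $1, "interval since previous update is unknown"; prev = $1; next }
                                    { print $1, "known"; prev = $1 }'
}

# Four toy updates: two regular ones, one negative value, one after a long gap
result="$(printf '%s\n' '1000 10' '44200 12' '90000 -5' '300000 20' | classify)"
printf '%s\n' "$result"
```

The third update is rejected because it is below min, and the fourth arrives more than a day after its predecessor, so the time in between counts as unknown.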

Next up we need to define one or more round robin archives (RRA). Let me quote the manpage to explain what it is:

The purpose of an RRD is to store data in the round robin archives (RRA). An archive consists of a number of data values or statistics for each of the defined data-sources (DS) and is defined with an RRA line.

When data is entered into an RRD, it is first fit into time slots of the length defined with the -s option, thus becoming a primary data point.

The data is also processed with the consolidation function (CF) of the archive. There are several consolidation functions that consolidate primary data points via an aggregate function: AVERAGE, MIN, MAX, LAST.

Okay, so let’s simply use a -s / --step value of 1 day and tell rrdtool to store the last value for every day for the last 30 years. The RRA argument takes the following form:

RRA:{AVERAGE | MIN | MAX | LAST}:xff:steps:rows

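Let’s unpack that:

  - The first field is the consolidation function described above.
  - xff (the xfiles factor) is the fraction of a consolidation interval that may consist of unknown data while the consolidated value is still regarded as known.
  - steps is the number of primary data points that are consolidated into one archive entry.
  - rows is the number of entries the archive keeps.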

Note that steps and rows can also be given as durations. To make it easier for us, let’s just say that we want the archive to reach 30 years back, which is sufficient for now:

rrdtool create activity.rrd \
    --start "$FIRST_COMMIT_TS" \
    --step "1d" \
    DS:lines:GAUGE:1d:0:U \
    RRA:LAST:0.9:1:30y
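As a rough sanity check on what 30y means in rows: as far as I understand the rrdcreate manpage, a duration given for rows is divided by step times steps. The sketch below does that arithmetic in plain shell. The 366 days per year is my assumption about how rrdtool expands the y suffix, so check the manpage if the exact number matters.

```shell
# How many rows does RRA:LAST:0.9:1:30y need with a one-day step?
step=86400          # --step 1d, in seconds
steps=1             # primary data points per archive entry
days_per_year=366   # ASSUMPTION: verify against rrdcreate(1)
duration=$((30 * days_per_year * 86400))
rows=$((duration / (step * steps)))
echo "rows: $rows"
```

With one data source and one archive that is only a few thousand stored values, which is why the resulting file stays small.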

Feeding data

We then want to feed the running line count at each commit timestamp to rrdtool. We can compute it from the per-commit diff stats by piping git log through awk, which also formats it for us. Let’s inspect some git log output first:

$ git log --reverse --shortstat --format='%ct'
1258728880

 7642 files changed, 3306484 insertions(+)
1259149578

 3 files changed, 73 insertions(+), 1 deletion(-)

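Okay, so we know that:

  - a line with a single field is the commit timestamp;
  - the summary line carries the number of insertions in its fourth field;
  - deletions are in the sixth field, or in the fourth if the commit has no insertions.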

We can parse that and feed it into rrdtool update with the following:

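git log --reverse --shortstat --format='%ct' \
  | awk '
      # A line with a single field is a commit timestamp
      $0 != "" && $2 == "" {
        if (ts != "") {
          # Work around commits not necessarily being in ascending time
          if (ts > max) {
            print ts":"count;
            max=ts
          }
        };
        ts=$1
      };
      # Match singular and plural: "1 insertion(+)", "2 deletions(-)"
      $5 ~ /insertion/ {
        count+=$4;
      };
      $5 ~ /deletion/ {
        count-=$4
      };
      $7 ~ /deletion/ {
        count-=$6;
      };
      # The last commit has no following timestamp line to trigger its output
      END {
        if (ts != "" && ts > max) print ts":"count
      }' \
  | xargs --max-args=10000 rrdtool update activity.rrd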

We’ll let that run for a bit. After 38 seconds it’s done.

We now have our datapoints in the (60 kB) RRD file.
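If you want to check the parsing without touching the RRD, the two sample commits from the git log output above can be run through a trimmed-down version of the parser. This is only a sketch: it skips the out-of-order handling, matches both "insertion" and "insertions", and prints the last commit from an END block.

```shell
# The two sample commits from above, run through a trimmed-down parser.
# Nothing is written to the RRD; this only prints "timestamp:running total".
result="$(awk '
  NF == 1          { if (ts != "") print ts ":" count; ts = $1 }
  $5 ~ /insertion/ { count += $4 }
  $5 ~ /deletion/  { count -= $4 }
  $7 ~ /deletion/  { count -= $6 }
  END              { if (ts != "") print ts ":" count }
' <<'EOF'
1258728880

 7642 files changed, 3306484 insertions(+)
1259149578

 3 files changed, 73 insertions(+), 1 deletion(-)
EOF
)"
printf '%s\n' "$result"
```

The second line of output should be the first total plus 73 insertions minus 1 deletion.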

Graphing it

To graph it, we need the start time once again. Luckily, we saved it in $FIRST_COMMIT_TS earlier! Let’s start with a simple graph:

rrdtool graph \
    activity.png \
    --title="Lines of code in erlang/otp" \
    DEF:lines=activity.rrd:lines:AVERAGE \
    --start "$FIRST_COMMIT_TS" \
    'LINE2:lines#0000FF:Lines'

And we get the following:

Hmmm… It’s a start, but something is off.

Fixing the problem

rrdtool expects to be fed data at regular intervals: if more than a heartbeat (one day here) passes between two updates, the time in between is stored as unknown. That is what happens with the pre-2009 history, which consists of earlier open source Erlang releases imported into git rather than a more normal git workflow with regular commits, so the graph shows nothing for that period.

We can fill in the data it wants by updating our awk script a bit:

$0 != "" && $2 == "" {
  if (ts != "") {
    # If we moved forward in the history...
    if (ts > max) {
      # If more than 1 day passed since the last update...
      if (max != "" && (ts - max > (60 * 60 * 24))) {
        # Tell rrdtool how much code was there in the time inbetween.
        # Technically it did not change, but it expects to be fed data,
        # so repeat the total as of the previous update every 12 hours.
        cursor=max+1
        while (cursor < ts) {
          print cursor":"last;
          cursor+=60*60*12
        }
      }
      print ts":"count;
      max=ts
      last=count
    }
  };
  ts=$1
};
# Match singular and plural: "1 insertion(+)", "2 deletions(-)"
$5 ~ /insertion/ {
  count+=$4;
};
$5 ~ /deletion/ {
  count-=$4
};
$7 ~ /deletion/ {
  count-=$6;
};
# The last commit has no following timestamp line to trigger its output
END {
  if (ts != "" && ts > max) print ts":"count
}

and let’s take a look:

Progress!
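To see what this kind of gap filling emits, here is a small stand-alone sketch that carries the last known value forward every 12 hours whenever two points are more than a day apart. fill_gaps is a made-up helper name and the two input points are toy data, not taken from the repository.

```shell
# Forward-fill gaps longer than a day in a "timestamp value" stream.
# fill_gaps is a made-up helper name; the numbers below are toy data.
fill_gaps() {
  awk -v day=86400 '
    NR > 1 && $1 - prev_ts > day {
      for (t = prev_ts + 1; t < $1; t += day / 2)
        print t ":" prev_val          # repeat the last known value
    }
    { print $1 ":" $2; prev_ts = $1; prev_val = $2 }'
}

result="$(printf '%s\n' '0 100' '150000 250' | fill_gaps)"
printf '%s\n' "$result"
```

The two real points are kept as they are, and four filler points with the old value 100 are emitted between them, so rrdtool never has to wait longer than the heartbeat for an update.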

Improving the data

Okay, let’s be honest: we’re not really measuring “lines of code” here, but rather “lines of anything”. Let’s count the lines in Erlang, C and C++ source files instead. For that, we first need to update our rrdtool create command to create one data source per language:

rrdtool create activity.rrd \
    --start "$FIRST_COMMIT_TS" \
    --step "1d" \
    DS:erlang:GAUGE:1d:0:U \
    DS:c:GAUGE:1d:0:U \
    DS:cpp:GAUGE:1d:0:U \
    RRA:LAST:0.9:1:30y

We also change our git log command to

git log --reverse --numstat --format='%ct'

which allows us more machine-friendly parsing by only outputting the commit time and the added and deleted lines per file. Output looks like this:

$ git log --reverse --numstat --format='%ct' | head
943448520

13	0	AUTHORS
286	0	EPLICENSE
198	0	Makefile.in
157	0	README
-	-	bootstrap/bin/start.boot
330	0	bootstrap/bin/start.script
-	-	bootstrap/bin/start_clean.boot
330	0	bootstrap/bin/start_clean.script

We also need to update our awk script to deal with this output format:

BEGIN {
  erlang=0
  c=0
  cpp=0
  last="0:0:0"
}

$0 != "" && $2 == "" {
  if (ts != "") {
    # If we moved forward in the history...
    if (ts > max) {
      # If more than 1 day passed since the last update...
      if (max != "" && (ts - max > (60 * 60 * 24))) {
        # Tell rrdtool how much code was there in the time inbetween.
        # Technically it did not change, but it expects to be fed data,
        # so repeat the totals as of the previous update every 12 hours.
        cursor=max+1
        while (cursor < ts) {
          print cursor ":" last;
          cursor+=60*60*12
        }
      }
      last=erlang ":" c ":" cpp
      print ts ":" last;
      max=ts
    }
  };
  ts=$1
};
# The dot is escaped so that only real file extensions match
$1 != "-" && $0 ~ /\.(erl|hrl|xrl|yrl)$/ {
  erlang+=$1
};
$2 != "-" && $0 ~ /\.(erl|hrl|xrl|yrl)$/ {
  erlang-=$2
};
$1 != "-" && $0 ~ /\.c$/ {
  c+=$1
}
$2 != "-" && $0 ~ /\.c$/ {
  c-=$2
}
$1 != "-" && $0 ~ /\.(cc|cpp|hpp)$/ {
  cpp+=$1
}
$2 != "-" && $0 ~ /\.(cc|cpp|hpp)$/ {
  cpp-=$2
}
# The last commit has no following timestamp line to trigger its output
END {
  if (ts != "" && ts > max) print ts ":" erlang ":" c ":" cpp
}
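The per-extension matching is easy to try out in isolation. The following sketch feeds a handful of invented --numstat lines (the file names are hypothetical) through the same kind of patterns and prints the totals as erlang:c:cpp. Note the escaped dot: an unescaped . matches any character, so /.c$/ would also count a file called doc/magic as C.

```shell
# Classify --numstat lines by file extension and sum added minus deleted.
# The file names are invented; binary files show "-" for both counts.
result="$(printf '%s\t%s\t%s\n' \
    120 0  lib/stdlib/src/lists.erl \
    30  10 erts/emulator/beam/utils.c \
    -   -  bootstrap/bin/start.boot \
    5   0  doc/magic \
    40  0  lib/wx/c_src/wxe_impl.cpp \
  | awk -F '\t' '
      $1 == "-"                   { next }   # binary file, no line counts
      $3 ~ /\.(erl|hrl|xrl|yrl)$/ { erlang += $1 - $2 }
      $3 ~ /\.c$/                 { c      += $1 - $2 }
      $3 ~ /\.(cc|cpp|hpp)$/      { cpp    += $1 - $2 }
      END { printf "%d:%d:%d\n", erlang, c, cpp }')"
printf '%s\n' "$result"
```

Only the .erl, .c and .cpp files contribute; the binary file and doc/magic are ignored.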

Let’s also update the graphing command. First off, we want to draw each separate language in a different color, and we also want to stack them. While we’re at it, let’s make it a bit wider as well.

As a final refinement, let’s also include the current lines of code (at the time of graph generation) in the output. We can do that by instructing the rrdtool graph command to calculate values using VDEF, which takes a name and an expression in reverse Polish notation. Let’s use VDEFs to store the latest values, via the following:

VDEF:erlang_current=erlang,LAST

We can then include that in the graph using GPRINT, which takes a variable name and a format. There is also PRINT, which allows you to print reports on the command line.[1]

Our graph command ends up like this:

rrdtool graph \
    activity.png \
    --title="Lines of code in erlang/otp" \
    --width=600 \
    --start "$FIRST_COMMIT_TS" \
    DEF:erlang=activity.rrd:erlang:AVERAGE \
    DEF:c=activity.rrd:c:AVERAGE \
    DEF:cpp=activity.rrd:cpp:AVERAGE \
    VDEF:erlang_current=erlang,LAST \
    VDEF:c_current=c,LAST \
    VDEF:cpp_current=cpp,LAST \
    'AREA:erlang#a46faa:Erlang:STACK' \
    'GPRINT:erlang_current:currently %lg%s' \
    'AREA:c#545653:C:STACK' \
    'GPRINT:c_current:currently %lg%s' \
    'AREA:cpp#f44262:C++:STACK' \
    'GPRINT:cpp_current:currently %lg%s'

Which results in:

One major caveat: we are counting all lines of the respective files, including comments and blank lines.[2] The actual number of lines of code is a bit lower. For more accuracy, you probably want to use something like cloc instead.
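To get a feeling for how much comments and blank lines matter, a crude filter in awk is enough. This is only a sketch and no replacement for cloc: count_code_lines is a hypothetical helper that skips blank lines and lines starting with % (Erlang comments), and it ignores trailing comments and % characters inside strings.

```shell
# Count lines that are neither blank nor pure "%" comments (Erlang style).
# count_code_lines is a hypothetical helper; treat its result as an estimate.
count_code_lines() {
  awk '!/^[ \t]*(%|$)/ { n++ } END { print n + 0 }'
}

# Five toy lines: two comments, one blank line, two lines of code
result="$(printf '%s\n' \
    '%% A module' \
    '-module(demo).' \
    '' \
    '  % indented comment' \
    'hello() -> ok.  % trailing' \
  | count_code_lines)"
echo "$result"
```

On a real checkout you could pipe the output of something like git ls-files '*.erl' | xargs cat into it, assuming none of the file names contain spaces.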

Conclusion

Once you’ve learnt how to work with it, rrdtool is a fast, lightweight and versatile tool for working with time-series data that can be automated and used in scripts to generate reports and graphs. If you’re looking for a more fully-fledged solution for monitoring, you probably want to use Munin, Cacti or SmokePing instead. For more information, you might want to look at the rrdtutorial, Understanding the basics of rrdtool to create a simple graph, the rrdtool examples, and Creating pretty graphs with RRDTOOL.

If you want to play with the scripts and data above yourself, the complete script can be found here.


  1. Note that if you want to work with the raw contents of the RRD, you may also be interested in rrdtool dump.

  2. The other caveat is that my awk script might not be completely accurate in counting the data. I have cross-checked it with cloc and it mostly matches, and this post is about rrdtool, not awk, but if you want to replicate this, you might want to review it.
