shaving CPU-cycles

16 Feb 2007 I've been reading up on how to benchmark firewalls and netfilter in particular.
At first, I came across RFC 2647: Benchmarking Terminology for Firewall Performance which, after reading through it, did not help me much at all. I already know the thing to watch is the throughput in packets per second...

Next I came across a benchmark performed on Netfilter, NF-HiPAC and ipset. I already saw this one before when I was looking for alternatives to NF-HiPAC. The paper makes it appear as if ipset performs equally well to NF-HiPAC, but it's a scam (a scam I tell you! ;)

IPset is nothing more than a datastructure for storing a bunch of IP addresses in. The netfilter match will then check whether an IP address is in that set or not. In itself, not a bad thing to have since we can easily implement early reject rules with this. But to compare it to NF-HiPAC... The test shows that ipset is so great because... the test seems to have been designed specifically for ipset.

To quote from the paper:


The systems were tested using the equivalent of the following pseudo-rules:

  -m state --state ESTABLISHED,RELATED -j ACCEPT
  < N nonmatching IP address rule >
  -p tcp --dport 80 -m state --state NEW -j ACCEPT

where N started at 16 and doubled until we reached 16384 non-matching rules. (In order to test the worst case, iphash type of set was used in the ipset tests.)

This shows of course nothing useful, except that ipset is good at storing and checking IP-addresses against a pile of other IP-addresses.

So anyway, what did I do ?
I went through my code and located the single function that gets called a lot when doing a query (actually, the *only* function that gets called): tree_query
Since it has only 1 loop to decend down the tree, and not much is done inside the loop except decide which branch to go into next, there was not much to optimise.

Still, I managed to shave off a couple CPU cycles by replacing this line:


  branch = !!(ip & (1 << (31 - currentRoot->cidr)));

with


  branch = (ip & (~CIDR2MASK(currentRoot->cidr) & CIDR2MASK(currentRoot->cidr + 1))) > 0;

currentRoot->cidr is the CIDR netmask number of the node we're currently in. (This means that the first [cidr] amount of bits are constant in the IP address) The next branch to decend into in this binary tree is either left or right. It is decided by the next bit in the IP-address, or the [cidr+1]'th bit. The value of this bit it stored in [branch]

By optimising like this, I get an average of 9.60 seconds instead of 9.67 seconds to classify 40 million IP addresses.

Not that much, but it gives me about 30000 more classifications per second.

[Update 10 minutes later]
Hah! OK, I made a mistake...
I forgot some braces which resulted in wrong code (operator order and that stuff). It seems the new code now takes 9.73 seconds for 40 million classifications, so I guess I'd better stay with the old...

[Update 1 minute later]
I'm on a roll I think...

After thinking about what I was actually trying to do, I replaced the line with this


branch = (ip >> (31 - currentRoot->cidr)) & 1;

The runtime is now 9.38 seconds, which is actually about 0.3 seconds better. Resulting in about 130000 extra classifications per second!

Yay for me !

[Update 5 minutes later]
After some code cleanup, removing unneeded variables and renaming/reusing some, I shaved another 0.1 seconds off for a total runtime of 9.27 seconds on 40 million classifications. 180000 extra classifications per second !

Steven Van Acker

shaving CPU-cycles