I’ve been trying to deep-dive into TCP congestion control algorithms so I can write a kernel module with a dummy implementation of one. Before starting on that, I wanted to be able to compare the existing algorithms. You can see which algorithms are available to choose from in Linux by running:
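A minimal sketch of one way to get that list, assuming the kernel exposes `tcp_available_congestion_control` under `/proc/sys/net/ipv4` (the variable name `avail` is only for illustration). Algorithms built as modules only show up here once the module is loaded.

```shell
# Print the congestion control algorithms currently registered with the kernel,
# one per line. The path is an assumption about a typical Linux system.
avail=/proc/sys/net/ipv4/tcp_available_congestion_control
if [ -r "$avail" ]; then
    tr ' ' '\n' < "$avail"
else
    echo "cannot read $avail (not Linux, or a restricted container?)" >&2
fi
```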
You can switch to a given module by writing its name in the ‘tcp_congestion_control’ file in the same location. For example, switching to TCP Reno (helpful slideset for Reno here) is as easy as:
echo "reno" > /proc/sys/net/ipv4/tcp_congestion_control
Another interesting point: the congestion control algorithm can also be selected in code. A given socket can request its own algorithm (on Linux, through the TCP_CONGESTION socket option) rather than using the system-wide setting. This opens up some interesting possibilities!
One example is TCP-LP (low priority), a congestion control algorithm with a completely different goal from the others: instead of working hard to maintain fairness with other TCP flows, TCP-LP tries to use only the excess bandwidth that is available. The paper is available here. I have not gone through this paper as deeply as some of the other more prominent ones. A socket could specifically opt in to this algorithm; a long-running backup job that shouldn’t disrupt other TCP flows is one example.
Trying to benchmark
I tried to create a benchmark that would generate lots of traffic under each protocol. I’ve been writing these posts with the knowledge I have at day 3, so some of the following testing methodology is flawed — I wasn’t sure what I was doing yet (still not sure). In my head, something like this would do what I needed:
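A rough sketch of such a loop, under several assumptions: it runs as root, iperf 2.x is installed, and an `iperf -s` server is already listening on `$HOST`. The names `run_suite`, `CC` and `HOST` are hypothetical, and this is not necessarily the script used for the results in this post.

```shell
# Sketch only: one iperf run per congestion control algorithm.
CC=/proc/sys/net/ipv4/tcp_congestion_control
HOST=localhost

run_suite() {
    # Arguments: algorithm names, e.g. reno cubic vegas
    for algo in "$@"; do
        # Select the algorithm system-wide; skip names the kernel rejects.
        if ! echo "$algo" > "$CC"; then
            echo "skipping $algo" >&2
            continue
        fi
        # One 10-second run per algorithm, saved as <algo>.txt
        iperf -c "$HOST" -t 10 > "$algo.txt"
    done
}

# Example (as root): run_suite reno cubic vegas
```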
This generates a file for every test. The only similar comparisons I’ve found are here, and that’s on the 2.6.22 kernel. They have a more elaborate setup than I had constructed at this point. I picked up some important adjustments that I hadn’t taken into consideration, like setting tcp_no_metrics_save. Without turning this on, Linux remembers the most recent slow start threshold for a destination, skewing each new test. There are other options listed in the previous link, like sizing the TCP buffers to the bandwidth-delay product.
Testing all of this was like falling down a rabbit hole only to discover more rabbit holes. As I went to calculate the bandwidth-delay product, I realized that calculating it for the loopback interface may not make sense. Then I wondered if my test setup was effective at all. Running the test suite a few times gave varying results: the algorithms finished in a different order each time. I ran the following against the output text files:
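For reference, the bandwidth-delay product is the link bandwidth multiplied by the round-trip time. Here is a small sketch of that arithmetic together with the metrics setting. The helper names `bdp_bytes` and `disable_metrics_cache` are made up for illustration, and the 100 Mbit/s and 50 ms figures are example values, not measurements from my setup.

```shell
# Bandwidth-delay product in bytes, from link speed in Mbit/s and RTT in ms.
# (Mbit/s * 1e6 bits) * (ms / 1e3 s) / 8 bits per byte == mbit * ms * 125
bdp_bytes() {
    echo $(( $1 * $2 * 125 ))
}

# Hypothetical helper: stop the kernel from caching per-destination metrics
# (such as ssthresh) between runs. Needs root, so it is not called here.
disable_metrics_cache() {
    sysctl -w net.ipv4.tcp_no_metrics_save=1
}

bdp_bytes 100 50   # example: 100 Mbit/s link with a 50 ms round trip
```

With those example numbers the result is 625000 bytes, which is the amount of data that has to be in flight to keep such a link full.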
fgrep "[ 3] 0" *txt | sort -gk7
One run of the test suite produced this output (cleaned up for this post):
lp.txt:         4.38 GBytes  3.44 Gbits/sec
vegas.txt:      5.11 GBytes  4.39 Gbits/sec
scalable.txt:   5.30 GBytes  4.56 Gbits/sec
veno.txt:       5.41 GBytes  4.65 Gbits/sec
westwood.txt:   5.50 GBytes  4.72 Gbits/sec
cubic.txt:      5.53 GBytes  4.75 Gbits/sec
illinois.txt:   6.27 GBytes  5.39 Gbits/sec
bic.txt:        6.58 GBytes  5.65 Gbits/sec
yeah.txt:       6.87 GBytes  5.90 Gbits/sec
htcp.txt:       8.65 GBytes  7.43 Gbits/sec
highspeed.txt:  8.80 GBytes  7.56 Gbits/sec
hybla.txt:      9.22 GBytes  7.92 Gbits/sec
The only consistent loser was TCP-LP, which makes sense: it only tries to use excess bandwidth, and there is little of that in an environment where iperf is pushing packets as hard as it can. There are no other services to compete with, so I’m not sure if the variation is due to my lack of TCP buffer size tuning or to some other variable I haven’t spotted. The consistency of TCP-LP gives me some validation that this benchmark makes some sense, even if I can’t yet explain the results for the other algorithms.
I continued this testing by trying to introduce artificial packet loss via tc and netem. These tools are awesome because they let you specify that a given interface should randomly mess things up: delay a percent of packets by 100ms, reorder some packets randomly, etc. I still need to make some additional changes to see where the variation is coming from. Being able to benchmark the existing algorithms reliably is critical to evaluating the naive congestion control algorithm I’d like to try to make.
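As a sketch of what that can look like, the wrappers below attach a netem qdisc to an interface and remove it again. The function names `impair` and `restore`, the default device `lo`, and the specific percentages are assumptions for illustration; the commands need root.

```shell
# Interface to impair; loopback is only an assumed default.
DEV=${DEV:-lo}

impair() {
    # Replace the root qdisc on $DEV with netem, passing the options through.
    tc qdisc replace dev "$DEV" root netem "$@"
}

restore() {
    # Remove the netem qdisc so the interface behaves normally again.
    tc qdisc del dev "$DEV" root
}

# Examples (as root):
#   impair loss 1%                        # drop 1% of packets at random
#   impair delay 100ms 10ms               # 100ms delay with 10ms of jitter
#   impair delay 10ms reorder 25% 50%     # 25% of packets skip the delay
#   restore
```

Using `replace` rather than `add` means the same call works whether or not a netem qdisc is already attached.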
I also tried running against an iperf server on the internet that I controlled, with no difference in stability of results.
In an attempt to better understand where the problem was coming from, I decided that I should look into tcp_probe, a kernel module for getting specific details about every packet coming from a given port. I figured this information would, at the least, give me insight into the different slow-start mechanics of these modules. This also turned into a ball of mud, and might be reserved for an additional part in this series. Stay tuned!
If you haven’t had enough links to tools yet, feel free to let me know the pros and cons of tcptrace.