I have been working as a freelance contractor for about a year now. I quit a 10+ year regular CTO job to dive back into more technical subjects: troubleshooting, coding, building infrastructures, A-Team style. I help companies with complex matters that require experience and rigor.
One of my latest missions was really, really fun to deal with. A rather big company handling secret-level scientific data had an issue with their storage system.
They use InfiniBand as the communication layer of an HPC environment. It was not a problem while the underlying operating system was CentOS 7.1, but since new machines were installed with CentOS 7.7 and up (kernels 3.10.0-1062 and up), whenever they wrote a file less than 701 bytes long, the file would be corrupted.
For the record, and to help follow the debugging session below: the company uses NFS over RDMA, the latter being the technique InfiniBand uses to achieve low latency and great throughput.
Very weird bug, very weird number, very weird corruption. The files were filled with what seemed to be random data, though one could notice a pattern here and there.
So I started my tests using:

$ dd if=/dev/zero of=/mnt/nfsrdma/test bs=1 count=700
$ hexdump /mnt/nfsrdma/test
And here I was contemplating randomly generated files looking like this:
0000000 9dfe a757 0000 0100 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0100
0000020 0000 0100 0000 a401 0000 0100 0000 0000
0000030 0000 0000 0000 0000 0000 bc02 0000 0000
0000040 0000 0002 0000 0000 0000 0000 168c 0083
0000050 4a0f c612 0000 0000 b000 5a5b 1262 1e75
0000060 9d04 bc90 1262 1a75 c233 50e9 1262 1a75
0000070 c233 50e9 0000 bc02 0000 0100 0000 bc02
0000080 0000 0000 000a 5b5a 00b0 0000 0f9d 030c
0000090 0000 2d00 0000 0000 0000 0000 0000 0000
[…]
At first, I naively thought this would probably involve mount options, or possibly some sort of sysctl, and there I was, searching for differences between previous and current values. I think I literally tried every single available option. Little did I know that one of them, while not fixing the issue, was at the core of this bug.
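The comparison step boiled down to diffing full sysctl dumps and effective mount options between a known-good node and a broken one. A rough sketch of it — the hostnames are placeholders, not the client's actual machines:

```shell
# placeholder hostnames; run from any admin box with ssh access to both nodes
ssh good-node   'sysctl -a 2>/dev/null | sort' > good.sysctl
ssh broken-node 'sysctl -a 2>/dev/null | sort' > broken.sysctl
diff -u good.sysctl broken.sysctl

# same idea for the effective NFS mount options
ssh good-node   'nfsstat -m' > good.mounts
ssh broken-node 'nfsstat -m' > broken.mounts
diff -u good.mounts broken.mounts
```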
I began to realize the issue was probably lying in some kernel module, while hoping it was not, knowing what this would involve.
I tried my luck on some classic boards like unix.stackexchange.com and the CentOS forum; both insisted I upgrade to the latest CentOS release, 7.9 (kernel 3.10.0-1160), which I did, and of course the bug was still there.
Next up was trying Mellanox's (now part of Nvidia) official yet open-source drivers. Mellanox is the main company behind InfiniBand. The installation was pretty smooth, to be honest.
I rebooted the test node and tried my dd sequence… It worked! Or so I thought: 700-byte zero-filled files were not corrupted anymore, and here I was, seemingly victorious. Until I tried a 650-byte file.
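Hunting for the corruption threshold is easy to script. A minimal sketch — the size list and the TARGET default are my own choices; on site, TARGET would point at the real /mnt/nfsrdma mount, while on a sane local filesystem every size should come back OK:

```shell
# Scan a range of file sizes and flag any zero-filled file that reads back
# differently. TARGET defaults to a local temp dir so the harness itself can
# be sanity-checked anywhere.
TARGET="${TARGET:-$(mktemp -d)}"
for size in 100 300 650 700 701 1024 2048; do
  dd if=/dev/zero of="$TARGET/test_$size" bs=1 count="$size" 2>/dev/null
  # a zero-filled file must compare equal to the same number of zero bytes
  if head -c "$size" /dev/zero | cmp -s - "$TARGET/test_$size"; then
    echo "size $size: OK"
  else
    echo "size $size: CORRUPTED"
  fi
done
```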
I have to say, this project gave me a lot of false hope.
I knew where the next step would bring me. Kernel module debugging was inevitable at this point.
I summoned both Mellanox and Nvidia on their specific forums and Twitter, but they didn’t bother, as one can imagine.
Before going any further, I wondered if the patterns I saw in some corrupted files would say something about the origin of the bug, so instead of only hexdumping them I ran xxd on them (yes, I could have used hexdump -C), and here was a turning point of the project: the files were not merely corrupted, they were populated with random chunks of the kernel's memory.
This brought the mission to a whole new level: this was not a simple bug but a major security issue. I updated the previously opened topics on the CentOS forum and StackExchange, but also on the Mellanox and Nvidia forums, and still nobody seemed to care enough.
A debugging session was then opened: I fetched the CentOS kernel sources and began to fill the nfsrdma module's source code with printk()s, the kernel's equivalent of libc's printf(), dissecting and bisecting the code against Linux kernel GitHub commits.
Why this module in particular, you may ask? It turns out NFS over TCP over InfiniBand was working without any issue; this bug only occurred when mounting the distant NFS server using the -o rdma option, so it appeared to me the problem probably lay in those lines of code.
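For context, the two mount flavours looked roughly like this. The server name, export path and NFS version are placeholders — only the -o rdma option comes from the actual setup, and 20049 is the conventional NFS/RDMA port:

```shell
# NFS over TCP (over IPoIB): worked fine
mount -t nfs -o vers=3,proto=tcp server:/export /mnt/nfstcp

# NFS over RDMA: triggered the corruption on small files
mount -t nfs -o vers=3,rdma,port=20049 server:/export /mnt/nfsrdma
```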
With the issue posing a serious security risk, I took the chance to contact this driver's main contributor, Chuck Lever, and his input was really valuable. He confirmed my suspicion that the bug was probably inside the nfsrdma module and, moreover, confirmed that the function I had in mind (rpcrdma_inline_fixup(), for the record) was the one to look at.
So I dug into the joys of debugging a failing protocol implementation…
At the beginning of the project, my wife, a former UNIX/Linux sysadmin, told me: “couldn't this be a header-related issue?” and I vaguely replied that, yes, maybe, among 1000 other things, but yes, definitely. While I was slowly giving up hope, I recalled this discussion, because a header mix-up could definitely have an impact on a pointer over some data coming from the card. My eyes were caught by two particular functions, rpcrdma_max_call_header_size() and rpcrdma_max_reply_header_size(), because their return value would be rather easy to change. And so I did, hardcoding a value of 500 and trying the dd and xxd combo again. Bingo: the corruption appeared at another offset. Slowly increasing this value, I finally got a non-corrupted file.
This was obviously not an acceptable fix; I had to understand the root of all this, and why tweaking this header size would “fix” the corruption. The answer was right under my eyes, a few lines after those functions: rpcrdma_results_inline().
This function checks whether the received buffered data is smaller than or equal to a certain value, re_max_inline_recv, which can be tuned via sysctl. If it is, the data is sent in inline mode, i.e. in its entirety, instead of being chunked. And here's the trick: sunrpc.rdma_max_inline_read cannot be set to values lower than 1024 (RPCRDMA_MIN_INLINE), so when rq_rcv_buf.buflen plus the header size is below re_max_inline_recv, no matter what, we jump to the so-called inline mode, whose implementation is clearly buggy.
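To make that decision concrete, here is a toy model — my own sketch, not kernel code, and the header size figure below is invented for the demo. The names mirror the kernel's; the point is that the sysctl clamp guarantees small replies always take the inline path, which is exactly why only small files were corrupted:

```shell
RPCRDMA_MIN_INLINE=1024   # kernel floor for sunrpc.rdma_max_inline_read

# args: reply buffer length, reply header size, tuned max_inline_recv
results_inline() {
  buflen=$1; hdr=$2; max_inline_recv=$3
  # model of the sysctl clamp: values below the floor are raised back to it
  if [ "$max_inline_recv" -lt "$RPCRDMA_MIN_INLINE" ]; then
    max_inline_recv=$RPCRDMA_MIN_INLINE
  fi
  if [ $((buflen + hdr)) -le "$max_inline_recv" ]; then
    echo inline    # the buggy path: every small reply lands here
  else
    echo chunked   # the path my fix forces unconditionally
  fi
}

results_inline 700 100 512    # prints "inline": the clamp makes the limit 1024 anyway
results_inline 4096 100 1024  # prints "chunked": large replies were always safe
```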
The nfsrdma module's main contributor told me that, before, there was no inline mode implemented and all data was always chunked, but at some point some folks added this “feature”, thinking it would improve performance.
At the end of the day, my fix was trivial: in the rpcrdma_results_inline() function, simply always return false, hence forcing “chunked mode” no matter what.
Why didn't I dig deeper into the “inline mode” bug, you may ask? Well, mainly because I was reaching the end of the number of days I had committed to my client, but also because I wasn't brave enough to study the NFS over RDMA RFC.
Of course, I updated every post I had made with this fix, and also opened a bug report on Red Hat's Bugzilla (private for now). Chuck told me to open one in the Linux kernel's Bugzilla too, but they closed it as “fixed” because it was a Red Hat kernel, even though I insisted the bug still exists in recent kernels.
And voilà, I packaged the whole thing using Red Hat's rpm framework with the help of this great template: https://github.com/elrepo/templates/tree/master/el7, wrote complete documentation about the bug, how to fix it and how to package a fixed kmod, and delivered the whole thing to my customer, along with the associated invoice ;)
This was the story of the most annoying mount option ever.