mpe's tech blog

Kernel Monkey

I spent most of last week chasing a bug in the as-yet-unreleased 2.6.11 kernel. I hit it originally while testing some code I've been writing to implement a mem=X boot-time option. After 2-3 hours of running LTP, the box would drop into xmon.

Just for fun, it would rarely crash in the same spot. The only commonality was that we'd generally have some registers full of random bollocks, and on further investigation we'd find a page or two of bollocks as well.

Although we had our suspicions as to which patch might have introduced the bug, we still needed to pin it down. So I found myself running the test on everything from 2.6.10-bk1 to 2.6.11-rc4. I haven't counted, but that's something like 30 different kernels.

I'm sure anyone who's done any sort of decent testing knows all of what I'm about to say, but for me it was new, and so I'm gonna write it down here so Google can keep track of it for me.

  • Compile all your kernels on one box, not one of the boxes you're trying to crash.
  • Make a directory where all your kernels go.
  • Always give a kernel's directory the same name as the kernel itself.
  • If you patch a kernel, change its name, e.g. 2.6.11-rc4-with-bens-fixes
  • Keep a record of which kernel is running on which box; once it has crashed you may not be able to check.
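The naming and record-keeping rules above can be sketched as a small helper. This is only an illustration, not the script I actually used: the function names, the per-box boot directory layout and the log format are all assumptions made up for the example.

```python
import shutil
import time
from pathlib import Path


def patched_name(base, tag):
    """Name for a patched kernel, e.g. 2.6.11-rc4-with-bens-fixes."""
    return f"{base}-with-{tag}"


def stage_kernel(kernels_root, name, box, boot_dir, log_path):
    """Copy kernels_root/<name>/zImage to boot_dir/<box>/zImage and log it.

    The layout (one directory per kernel, named after the kernel, and one
    boot directory per box) is a hypothetical convention for this sketch.
    """
    src = Path(kernels_root) / name / "zImage"
    if not src.is_file():
        raise FileNotFoundError(f"no zImage for kernel {name!r} at {src}")
    dest_dir = Path(boot_dir) / box
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(src, dest_dir / "zImage")
    # Append-only record, kept off the test box so it survives a crash.
    with open(log_path, "a") as log:
        log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {box} {name}\n")


def running_kernel(log_path, box):
    """Return the kernel most recently staged for box, or None."""
    latest = None
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) != 4:
                continue  # ignore blank or malformed lines
            _date, _clock, logged_box, name = fields
            if logged_box == box:
                latest = name
    return latest
```

Because the kernel name is the only thing you type, the directory, the copied image and the log entry can't drift apart.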

Having said that, if you're in xmon you can usually check with:

1:mon> ls linux_banner
linux_banner: c000000000443d20
1:mon> d c000000000443d20
c000000000443d20 4c696e7578207665 7273696f6e20322e  |Linux version 2.|
c000000000443d30 362e31312d726334 2d6d69636861656c  |6.11-rc4-michael|
c000000000443d40 20286d6963686165 6c40737570657265  | (michael@supere|
c000000000443d50 676f292028676363 2076657273696f6e  |go) (gcc version|

Mind you, this bug had a habit of corrupting the page holding the banner, and then you're stuffed.
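If you've captured the console output, the hex columns of a dump like the one above can be turned back into the banner string off-box. A minimal sketch, assuming every dump row looks like the ones shown (an address followed by two 16-digit hex words); the function name is made up for the example:

```python
import re

# One dump row: address, then two 64-bit words in hex (ASCII column ignored).
ROW = re.compile(r"^([0-9a-f]{16})\s+([0-9a-f]{16})\s+([0-9a-f]{16})\b")


def decode_xmon_dump(text):
    """Decode the hex words of an xmon memory dump into a string."""
    raw = bytearray()
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # skip prompts and anything that isn't a dump row
        raw += bytes.fromhex(match.group(2) + match.group(3))
    # Stop at the first NUL; replace undecodable bytes so a corrupted
    # page still prints something rather than raising.
    return raw.split(b"\0", 1)[0].decode("ascii", errors="replace")


DUMP = """\
1:mon> d c000000000443d20
c000000000443d20 4c696e7578207665 7273696f6e20322e  |Linux version 2.|
c000000000443d30 362e31312d726334 2d6d69636861656c  |6.11-rc4-michael|
c000000000443d40 20286d6963686165 6c40737570657265  | (michael@supere|
c000000000443d50 676f292028676363 2076657273696f6e  |go) (gcc version|
"""

banner = decode_xmon_dump(DUMP)
print(banner)
```

The ASCII column already tells you the same thing by eye, of course; decoding the hex is mostly useful when you want a script to pull the kernel name out of a saved console log.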

  • Keep a test matrix. Just keep track of which kernel worked/broke on which machine; it'll keep you sane.
  • It's also handy to record what you expect each kernel to do. Otherwise you might find yourself inappropriately excited when a kernel doesn't crash, i.e. when it doesn't have the suspect code and therefore shouldn't crash.
  • Script it, within reason. You don't want to spend 3 hours testing the wrong kernel 'cause you copied the wrong zImage into /tftpboot or something.
  • If you're applying more than one or two patches you need quilt or something similar, otherwise you will get confused (well, I did!)
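A test matrix with expectations doesn't need to be fancy. Here's a rough sketch of the idea: write down the expected result before each run, fill in the actual result afterwards, and only get excited about the mismatches. The class, the function names and the sample entries are all invented placeholders, not results from the actual bug hunt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    kernel: str
    box: str
    expected: str  # "crash" or "ok", written down before the run
    actual: str    # "crash" or "ok", filled in afterwards


def unexpected(runs):
    """Runs whose outcome differs from what was predicted."""
    return [r for r in runs if r.actual != r.expected]


def format_matrix(runs):
    """Render one line per run, flagging the ones that deserve attention."""
    width = max(len(r.kernel) for r in runs)
    lines = []
    for r in runs:
        flag = "" if r.actual == r.expected else "  <-- unexpected"
        row = (f"{r.kernel:<{width}}  {r.box:<6} "
               f"expected={r.expected:<5} actual={r.actual:<5}{flag}")
        lines.append(row.rstrip())
    return "\n".join(lines)


# Placeholder entries for illustration only.
RUNS = [
    Run("kernel-A", "box1", expected="ok", actual="ok"),
    Run("kernel-B", "box1", expected="crash", actual="crash"),
    Run("kernel-B-with-fix", "box2", expected="ok", actual="crash"),
]

print(format_matrix(RUNS))
surprises = unexpected(RUNS)
```

A kernel that lacks the suspect code and doesn't crash shows up as an unflagged row, which is exactly the non-event it should be.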

2005