
Kernel Live Patching for ppc64le

As part of the 4.6 and 4.7 development cycles, several of us from IBM & SUSE finally got all the pieces lined up to enable kernel live patching for ppc64le.

If you're not familiar with kernel live patching, there's a pretty good write-up at Wikipedia. But the basic idea is that you can, without rebooting, replace one function in the kernel with a new version of that function (or something else entirely). Typically you do this because there is a bug in the original version of the function, and this allows you to fix or mitigate that bug.
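
To make that concrete, below is a minimal sketch of what a live patch module can look like, modelled on samples/livepatch/livepatch-sample.c from the kernel tree. It replaces cmdline_proc_show(), the function behind /proc/cmdline. Note this is only a sketch; the exact API (eg. whether klp_register_patch() must be called before klp_enable_patch()) has varied between kernel versions.

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/seq_file.h>
    #include <linux/livepatch.h>

    /* Replacement for cmdline_proc_show(), which prints /proc/cmdline */
    static int livepatch_cmdline_proc_show(struct seq_file *m, void *v)
    {
        seq_printf(m, "%s\n", "this has been live patched");
        return 0;
    }

    static struct klp_func funcs[] = {
        {
            .old_name = "cmdline_proc_show",
            .new_func = livepatch_cmdline_proc_show,
        }, { }
    };

    static struct klp_object objs[] = {
        {
            /* name being NULL means the patched function is in vmlinux */
            .funcs = funcs,
        }, { }
    };

    static struct klp_patch patch = {
        .mod = THIS_MODULE,
        .objs = objs,
    };

    static int livepatch_init(void)
    {
        return klp_enable_patch(&patch);
    }

    static void livepatch_exit(void)
    {
    }

    module_init(livepatch_init);
    module_exit(livepatch_exit);
    MODULE_LICENSE("GPL");
    MODULE_INFO(livepatch, "Y");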

If you think that sounds like it could be tricky, then you're right.

FTRACE_WITH_REGS

Live patching is implemented on top of ftrace, which is Linux's mechanism for attaching trace functions to other functions. Although powerpc has had ftrace support for ~8 years, we didn't have support for a particular feature of ftrace, called FTRACE_WITH_REGS.

So the first step on the way to live patching support was to implement support for FTRACE_WITH_REGS.

The way ftrace is implemented is that we build the kernel with some special GCC flags. These tell GCC to generate calls to a special function, _mcount, at the beginning of every function. Typically these calls are nop'ed out, but when ftrace is enabled those call sites are patched to call into ftrace and do something interesting.

In order to support FTRACE_WITH_REGS we first needed to tell GCC to use a different calling sequence for the _mcount calls. This is called -mprofile-kernel, and was the brainchild of Anton Blanchard & Alan Modra.

Using -mprofile-kernel tells GCC to use a very minimal calling sequence when calling _mcount. This has two advantages, firstly it imposes very little overhead when the _mcount calls are nop'ed out, and secondly it means the register state is not perturbed before calling ftrace - which is exactly what we need for FTRACE_WITH_REGS.

This is an example of the code generated for a call to _mcount using the standard mcount ABI. As you can see it involves an mflr, several stores and the creation of a stack frame (stdu):

c00000000000b560 <match_dev_by_uuid>:
c00000000000b560:       ed 00 4c 3c     addis   r2,r12,237
c00000000000b564:       a0 d6 42 38     addi    r2,r2,-10592
c00000000000b568:       a6 02 08 7c     mflr    r0
c00000000000b56c:       f0 ff c1 fb     std     r30,-16(r1)
c00000000000b570:       f8 ff e1 fb     std     r31,-8(r1)
c00000000000b574:       10 00 01 f8     std     r0,16(r1)
c00000000000b578:       d1 ff 21 f8     stdu    r1,-48(r1)
c00000000000b57c:       78 1b 7e 7c     mr      r30,r3
c00000000000b580:       78 23 9f 7c     mr      r31,r4
c00000000000b584:       d9 e7 ff 4b     bl      c000000000009d5c <_mcount>
c00000000000b588:       00 00 00 60     nop

Here is an example calling sequence with -mprofile-kernel; the sequence is reduced to just the mflr, std and bl:

c00000000000b710 <match_dev_by_uuid>:
c00000000000b710:    ea 00 4c 3c     addis   r2,r12,234
c00000000000b714:    f0 d4 42 38     addi    r2,r2,-11024
c00000000000b718:    a6 02 08 7c     mflr    r0
c00000000000b71c:    10 00 01 f8     std     r0,16(r1)
c00000000000b720:    3d e6 ff 4b     bl      c000000000009d5c <_mcount>

In addition to telling GCC to use -mprofile-kernel we also needed a new implementation of ftrace_caller(), which is the function called in place of _mcount when ftrace is enabled. Due to the way it's called, ie. during the prolog of another function, ftrace_caller() has to be carefully hand-written in assembler.

To keep things interesting we also had to deal with the fact that old toolchains don't implement -mprofile-kernel correctly, although they do accept the flag. Furthermore there are two variants of the calling sequence: a 4 instruction sequence, and in newer toolchains an optimised 3 instruction sequence.

This work was done largely by Torsten Duwe of SUSE, with some initial work by Vojtech Pavlik, and some assistance from me on the module code.

Understanding the TOC & entry points

The TOC or "Table Of Contents", is an area in memory where a program's global data is placed. When code references a global variable, the compiler generates instructions to load that variable from the TOC relative to the TOC pointer, which is expected to be in r2.
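
As a sketch, assuming a hypothetical global called counter, an access like the one below compiles to a load relative to r2 (the instructions in the comment are approximate):

    long counter;   /* a hypothetical global variable */

    long read_counter(void)
    {
        /*
         * The compiler emits a TOC-relative access, roughly:
         *
         *     addis r9, r2, counter@toc@ha
         *     ld    r3, counter@toc@l(r9)
         *
         * ie. the address is formed as an offset from the TOC pointer
         * in r2.
         */
        return counter;
    }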

On ppc64le, functions typically have two entry points. The first, called the "global entry point", is found at the start of the function's text, and will set up the TOC pointer based on the location of the function, which must be passed in r12. It is the responsibility of the caller to have saved its TOC value somewhere prior to the call.

The second entry point, called the "local entry point", skips over the first two instructions which do the TOC pointer setup, ie. it's at +8 from the start of the function text. In this case the caller needn't save its TOC, because it knows it will not be modified.

An example function, with both global and local entry points:

c000000000d1467c <exit_dns_resolver>:
c000000000d1467c:    19 00 4c 3c     addis   r2,r12,25
c000000000d14680:    84 45 42 38     addi    r2,r2,17796
c000000000d14684:    a6 02 08 7c     mflr    r0

On entry to the global entry point r12 must hold the address of the function, ie. c000000000d1467c. Given that, we can calculate that r2 will have the value c000000000d1467c + 1638400 + 17796 = c000000000ea8c00 (the addis immediate of 25 is shifted left 16 bits, giving 1638400).

If we enter at the local entry point c000000000d14684, then r2 must already hold the value c000000000ea8c00. This then allows code inside the function to use c000000000ea8c00 as the base address of any loads or stores to or from global variables.

The determination about which entry point to use is made at link time. If the linker can determine that the caller and callee share the same TOC, aka. the function call is "local", then it will generate a call to the local entry point. Calls via a function pointer always use the global entry point.
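
Sketching a call site (again, the instruction sequences in the comments are approximate): for an external call the compiler leaves a nop after the bl, and if the linker resolves the call as global it routes the bl through a stub that saves r2 in the stack frame's TOC save slot, then rewrites the nop into a load that restores it:

    extern void func(void);   /* a hypothetical callee in another module */

    void caller(void)
    {
        /*
         * The compiler emits:
         *
         *     bl  func
         *     nop
         *
         * For a global call the linker sends the bl via a stub which
         * saves r2 at 24(r1) (the ELFv2 TOC save slot), and patches
         * the nop to:
         *
         *     ld  r2, 24(r1)
         *
         * which restores the caller's TOC pointer. For a local call
         * the nop is left untouched.
         */
        func();
    }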

The last thing we need to know about the TOC pointer is that the instructions to set up the TOC pointer can be omitted entirely if the function accesses no globals at all, eg:

c00000000065c8e0 <int_to_scsilun>:
c00000000065c8e0:    a6 02 08 7c     mflr    r0
c00000000065c8e4:    10 00 01 f8     std     r0,16(r1)
c00000000065c8e8:    75 d4 9a 4b     bl      c000000000009d5c <_mcount>
c00000000065c8ec:    a6 02 08 7c     mflr    r0

Live patch calling sequence

When you "apply" a live patch, what you're actually doing is loading a module with a new function in it, and diverting calls from the old version of the function to the new version.

To try and get a handle on things, let's give these functions names. We'll call the caller A, this is code somewhere in the kernel or a module which wants to call a function B. But we are live patching B with a new function P (the "patch"). In order to get from B to P we go via the ftrace code, which we'll call F. On the return path we come from P back to F and then to A. eg:

 +---------+           +---------+           +---------+           +---------+
 |         |     +-----+---------+---------->|         |     +---->|         |
 |  A      |     |     |  B      |           |  F      |     |     |  P      |
 |         |     |     |         |           |         +-----+     |         |
 |         +-----+     |         |           |         |<----+     |         |
 |         |<----+     |         |           |         |     |     |         |
 |         |     |     |         |           |         |     |     |         |
 | K / M1  |     |     | K / M2  |     +-----+ Kernel  |     +-----+ Mod3    |
 +---------+     |     +---------+     |     +---------+           +---------+
                 |                     |
                 +---------------------+

The arrow that passes straight through B is meant to indicate that we don't really execute any of the body code in B. However we do execute the prolog and the _mcount calling sequence.

The K / M1 annotation is meant to show that A may be either built-in to the kernel, or a module M1. Similarly B may be either built-in to the kernel, or a module M2, and M1 may or may not equal M2. F is always built-in to the kernel, and P is typically in a new module M3.

Whose TOC?

Looking at our example calling sequence above, A -> B -> F, when we get into F (ie. ftrace_caller()), what value does the TOC pointer have?

We can eliminate one possibility straight away: it's not the TOC pointer for F. Because ftrace_caller() is not a normal function, callers don't set up r12 to hold its entry point, and so it can't do a normal global entry point style TOC setup.

It should be the TOC pointer for B, because the call to _mcount (actually ftrace_caller()) is made after the TOC pointer is set up. But as we discussed above, the TOC pointer setup can be omitted if B doesn't need to use the TOC.

So we don't know whether we have A or B's TOC pointer in r2, and we also don't know whether that is a module's TOC pointer or the kernel's.

Luckily we have one thing in our favour. At all times we have a pointer in r13 which points to a per-cpu structure called the paca which contains the value of the kernel's TOC pointer. This means in ftrace_caller() we can save whatever value was in r2 and load the kernel's TOC pointer into r2, which then allows the ftrace code to access kernel globals.
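
A sketch of the relevant piece, using names from the real code (struct paca_struct is in arch/powerpc/include/asm/paca.h, and PACATOC is the generated asm-offsets name for the kernel_toc field):

    #include <linux/types.h>

    struct paca_struct {
        /* ... many other fields ... */
        u64 kernel_toc;   /* kernel TOC pointer, saved at boot */
        /* ... */
    };

    /* The kernel permanently dedicates r13 to the current CPU's paca */
    register struct paca_struct *local_paca asm("r13");

    /*
     * So from assembler, ftrace_caller() can recover the kernel's TOC
     * pointer with a single load, roughly:
     *
     *     ld r2, PACATOC(r13)
     */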

Local calls that become global

As we described above, the linker decides at link time whether calls to a function are local or global, and calls the appropriate entry point. However live patching breaks this logic, by retrospectively diverting what may have been a local call from A to B into a call to P.

The patch function P is always in a newly loaded module, and in the common case the caller A will not be in that same module. So we must always call the global entry point of P.

If P needs to use its TOC then its global entry point will do the required TOC pointer setup, and TOC accesses in P will work just fine.

So what's the problem? P will do the TOC pointer setup, but it will not save the existing TOC pointer - that is up to the caller A. But A didn't save the TOC, because it knew it was calling a local function which shared the same TOC pointer.

The end result is that P will clobber A's TOC pointer, and on return to A the TOC pointer will be incorrect and the code will either crash or silently corrupt data.

Solution? Save the TOC pointer in F

So the problem we have is that A is not saving its TOC pointer, because it rightly doesn't think it needs to, but then P is clobbering the TOC pointer, because it rightly doesn't think it needs to save it.

Hopefully the obvious solution is that F needs to save A's TOC pointer before calling P. The question is where does it save it?

The obvious answer is "on the stack", because that's normally where functions save values before calling other functions, in fact that's basically the entire purpose of the stack.

This was our initial solution, in F we create a stack frame, save the TOC pointer and LR, then branch and link to P. On the return path P returns to F, which restores LR and the TOC pointer from the stack, deallocates its stack frame, and returns to A.

For the simple case this works fine, and it passed a surprising amount of testing. Luckily Torsten reminded us that it doesn't work if B needs some of its arguments passed on the stack.

For a typical function all the arguments are passed in registers. However in some cases some of the parameters must be passed on the stack, for example if the function takes more than 8 arguments, or is a varargs function. Crucially these parameters will be saved in the caller's stack frame, ie. A's stack frame, and B will know that it can find them there.
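
For example, under the ELFv2 ABI the first eight general purpose integer arguments are passed in r3-r10, so in this sketch (a hypothetical function) the ninth and tenth arguments live in the caller's frame:

    /* A hypothetical function with more than 8 integer arguments */
    long lots_of_args(long a, long b, long c, long d,
                      long e, long f, long g, long h,
                      long i, long j)
    {
        /*
         * a..h arrive in r3..r10, but i and j are loaded from the
         * caller's stack frame, at a fixed offset from our stack
         * pointer - which is exactly what an unexpected extra stack
         * frame breaks.
         */
        return i + j;
    }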

Because P shares the same function prototype as B, it too will expect to find its parameters in A's stack frame. But because of live patching we have inserted a new stack frame in between, ie. F's stack frame.

                                                                +-----------------+
                                                                |                 |
                                                                | A's stack frame |
                                                                |                 |
                                                                |                 |
                                                                | +-------------+ |
                    +-----------------+                         | |             | |
                    |                 |                         | | Parameters  | |
                    | A's stack frame |                         | |             | |
                    |                 |                         | +-------------+ |
                    |                 |                         +-----------------+
                    | +-------------+ |                +------->|                 |
                    | |             | |                |        | F's stack frame |
           +------->| | Parameters  | |                |        |                 |
           |        | |             | |                |        |                 |
           |        | +-------------+ |        Wrong   |        |                 |
  Known    |        +-----------------+                |        +-----------------+
  offset   |        |                 |                |        |                 |
  from SP  |        | B's stack frame |                |        | P's stack frame |
           |        |                 |                |        |                 |
SP ------> +------+ +-----------------+     SP ------> +------+ +-----------------+

We could solve this by forcing all patch functions to be hand-written so that they know about this difference in stack layout. However that would seriously limit the usefulness of live patching, as all patches would have to be hand-written specifically for powerpc.

The solution: You need a stack in your stack

At this point we know F needs to save state between A and P, but it can't save that state on the stack. Where else could we put it?

After considering a few possibilities, all of which were unappealing, I proposed that we create a new "live patching" stack. That is, when F needs to save state before calling P, it allocates a stack frame on the livepatch stack, and stores its state there. It can then call P without perturbing the offset from P's stack frame to the parameters potentially stored in A's stack frame.

Additionally I proposed that we place the live patch stack in the existing stack, along with the existing thread_info:

+-------------------+
|                   |
|   Regular Stack   |  High addresses
|                   |
|         +         |
|         |         |
|         v         |
|                   |
|                   |
|      .......      |
|                   |
|                   |
|         ^         |
|         |         |
|         +         |
|                   |
|  Livepatch Stack  |
|                   |
+-------------------+
|                   |
|    thread info    |  Low addresses
|                   |
+-------------------+

The key point is that the livepatch stack grows upwards from just above the thread_info, whereas the regular stack grows down from the top of the stack.

The advantage of this is there are no extra allocations required, and there are no synchronisation problems, ie. the livepatch stack always exists if the thread exists, and it is not shared between threads.
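
The hook-up is a single new field in the powerpc thread_info, sketched below; the real definition is in arch/powerpc/include/asm/thread_info.h, and it's where the TI_livepatch_sp offset used by the assembler code further down comes from:

    struct thread_info {
        /* ... the existing fields ... */
    #ifdef CONFIG_LIVEPATCH
        unsigned long livepatch_sp;   /* current top of this thread's
                                         livepatch stack */
    #endif
    };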

The one downside is that it consumes stack space, meaning a kernel with a live patch enabled will use more stack space than one without. However in practice the kernel runs with a large margin of free stack space, precisely because running out of stack space is already fatal. Hopefully in the medium term we can improve the kernel's handling of stack exhaustion, and at that point we can revisit separating the livepatch stack from the regular stack.

The final piece of the puzzle is the code that runs in F between A and P. If you've made it this far you should be pretty comfortable just reading the code, reproduced below.

Update: Kamalesh found and fixed a bad bug (written by me) in the original version; the updated assembly listing is below.

livepatch_handler:
    CURRENT_THREAD_INFO(r12, r1)

    /* Allocate 3 x 8 bytes */
    ld  r11, TI_livepatch_sp(r12)
    addi    r11, r11, 24
    std r11, TI_livepatch_sp(r12)

    /* Save toc & real LR on livepatch stack */
    std r2,  -24(r11)
    mflr    r12
    std r12, -16(r11)

    /* Store stack end marker */
    lis     r12, STACK_END_MAGIC@h
    ori     r12, r12, STACK_END_MAGIC@l
    std r12, -8(r11)

    /* Put ctr in r12 for global entry and branch there */
    mfctr   r12
    bctrl

    /*
     * Now we are returning from the patched function to the original
     * caller A. We are free to use r11, r12 and we can use r2 until we
     * restore it.
     */

    CURRENT_THREAD_INFO(r12, r1)

    ld  r11, TI_livepatch_sp(r12)

    /* Check stack marker hasn't been trashed */
    lis     r2,  STACK_END_MAGIC@h
    ori     r2,  r2, STACK_END_MAGIC@l
    ld  r12, -8(r11)
1:  tdne    r12, r2
    EMIT_BUG_ENTRY 1b, __FILE__, __LINE__ - 1, 0

    /* Restore LR & toc from livepatch stack */
    ld  r12, -16(r11)
    mtlr    r12
    ld  r2,  -24(r11)

    /* Pop livepatch stack frame */
    CURRENT_THREAD_INFO(r12, r1)
    subi    r11, r11, 24
    std r11, TI_livepatch_sp(r12)

    /* Return to original caller of live patched function */
    blr

The end result

The ftrace changes were merged into 4.6, and the rest of the live patching support was merged into 4.7. This means in 4.7 you can enable CONFIG_LIVEPATCH and start patching your kernel. It's likely some distros will also backport the changes into older kernels.
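
For reference, the relevant .config fragment looks roughly like this; Kconfig selects the remaining dependencies (such as CONFIG_DYNAMIC_FTRACE_WITH_REGS) once the architecture support is present:

    CONFIG_FUNCTION_TRACER=y
    CONFIG_KALLSYMS_ALL=y
    CONFIG_MODULES=y
    CONFIG_LIVEPATCH=y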

In total the diffstat wasn't all that impressive:

 25 files changed, 778 insertions(+), 140 deletions(-)

But hopefully this article has made some of the complexity of the implementation clear.

Big thanks to everyone who helped write the code and/or did reviews:

  • Torsten Duwe
  • Vojtech Pavlik
  • Petr Mladek
  • Jiri Kosina
  • Miroslav Benes
  • Josh Poimboeuf
  • Balbir Singh
  • Kamalesh Babulal
