As part of the 4.6 and 4.7 development cycles, several of us from IBM & SUSE finally got all the pieces lined up to enable kernel live patching for ppc64le.
If you're not familiar with kernel live patching, there's a pretty good write-up at Wikipedia. But the basic idea is that you can, without rebooting, replace one function in the kernel with a new version of that function (or something else entirely). Typically you do this because there is a bug in the original version of the function, and this allows you to fix or mitigate that bug.
If you think that sounds like it could be tricky, then you're right.
Live patching is implemented on top of ftrace, which is Linux's mechanism for
attaching trace functions to other functions. Although powerpc has had ftrace
support for ~8 years, we didn't have support for a particular feature of ftrace,
FTRACE_WITH_REGS. So the first step on the way to live patching support was to
implement support for it.
The way ftrace is implemented is that we build the kernel with some special GCC
flags. These tell GCC to generate calls to a special function, _mcount, at the
beginning of every function. Typically these calls are nop'ed out, but when
ftrace is enabled those call sites are patched to call into ftrace.
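To make the call-site patching concrete, here is a minimal Python sketch that toggles one 4-byte site between a nop and a `bl _mcount`. The addresses and expected bytes are taken from the disassembly listings in this article; the helper names (`encode_bl`, `patch_site`) and the tiny `text` buffer are hypothetical, and the real kernel uses its own code patching routines rather than anything like this.

```python
import struct

NOP = 0x60000000  # ppc64 nop, shown as "00 00 00 60" in the listings

def encode_bl(site: int, target: int) -> int:
    """Encode a PowerPC `bl` (branch and link) from `site` to `target`."""
    offset = target - site
    # Relative branches carry a signed, word-aligned 26-bit displacement.
    if offset % 4 or not -(1 << 25) <= offset < (1 << 25):
        raise ValueError("target out of range for a relative branch")
    return 0x48000000 | (offset & 0x03FFFFFC) | 1  # opcode 18, LK=1

def patch_site(text: bytearray, base: int, site: int, insn: int) -> None:
    """Overwrite the instruction at address `site` (little endian)."""
    struct.pack_into("<I", text, site - base, insn)

# Hypothetical text region holding a single, initially disabled, call site.
base = 0xc00000000000b584
mcount = 0xc000000000009d5c
text = bytearray(struct.pack("<II", NOP, NOP))

patch_site(text, base, base, encode_bl(base, mcount))  # tracing enabled
enabled = text[:4].hex(" ")
patch_site(text, base, base, NOP)                      # tracing disabled
disabled = text[:4].hex(" ")

print(enabled)   # d9 e7 ff 4b
print(disabled)  # 00 00 00 60
```

The enabled bytes match the `bl c000000000009d5c <_mcount>` instruction in the standard ABI listing, which is a useful sanity check on the encoding.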
In order to support FTRACE_WITH_REGS we first needed to tell GCC to use a
different calling sequence for the _mcount calls. This is called
-mprofile-kernel, and was the brainchild of Anton Blanchard & Alan Modra.
-mprofile-kernel tells GCC to use a very minimal calling sequence when calling
_mcount. This has two advantages: firstly it imposes very little overhead when
the _mcount calls are nop'ed out, and secondly it means the register state is
not perturbed before calling ftrace, which is exactly what we need.
This is an example of the code generated for a call to _mcount using the
standard mcount ABI. As you can see it involves an mflr, several stores and
the creation of a stack frame (the stdu):
c00000000000b560 <match_dev_by_uuid>:
c00000000000b560:  ed 00 4c 3c  addis   r2,r12,237
c00000000000b564:  a0 d6 42 38  addi    r2,r2,-10592
c00000000000b568:  a6 02 08 7c  mflr    r0
c00000000000b56c:  f0 ff c1 fb  std     r30,-16(r1)
c00000000000b570:  f8 ff e1 fb  std     r31,-8(r1)
c00000000000b574:  10 00 01 f8  std     r0,16(r1)
c00000000000b578:  d1 ff 21 f8  stdu    r1,-48(r1)
c00000000000b57c:  78 1b 7e 7c  mr      r30,r3
c00000000000b580:  78 23 9f 7c  mr      r31,r4
c00000000000b584:  d9 e7 ff 4b  bl      c000000000009d5c <_mcount>
c00000000000b588:  00 00 00 60  nop
Here is the same function built with -mprofile-kernel, where the sequence is
reduced to an mflr and a single std before the call:
c00000000000b710 <match_dev_by_uuid>:
c00000000000b710:  ea 00 4c 3c  addis   r2,r12,234
c00000000000b714:  f0 d4 42 38  addi    r2,r2,-11024
c00000000000b718:  a6 02 08 7c  mflr    r0
c00000000000b71c:  10 00 01 f8  std     r0,16(r1)
c00000000000b720:  3d e6 ff 4b  bl      c000000000009d5c <_mcount>
In addition to telling GCC to use -mprofile-kernel we also needed a new
ftrace_caller(), which is the function called in place of _mcount when ftrace
is enabled. Due to the way it's called, ie. during the prolog of another
function, ftrace_caller() has to be carefully hand written in assembler.
To keep things interesting we also had to deal with the fact that old toolchains
don't implement -mprofile-kernel correctly, although they do accept the flag.
Furthermore there is both a 4 instruction sequence and, in newer toolchains, an
optimised 3 instruction sequence.
This work was done largely by Torsten Duwe of SUSE, with some initial work by Vojtech Pavlik, and some assistance from me on the module code.
Understanding the TOC & entry points
The TOC, or "Table Of Contents", is an area in memory where a program's global
data is placed. When code references a global variable, the compiler generates
instructions to load that variable from the TOC relative to the TOC pointer,
which is expected to be in r2.
On ppc64le, functions typically have two entry points. The first, called the
"global entry point", is found at the start of the function's text, and will
set up the TOC pointer based on the address of the function, which must be in
r12. It is the responsibility of the caller to have saved its TOC value
somewhere prior to the call.
The second entry point, called the "local entry point", skips over the first two
instructions which do the TOC pointer setup, ie. it's at +8 from the start of
the function text. In this case the caller needn't save its TOC, because it
knows it will not be modified.
An example function, with both global and local entry points:
c000000000d1467c <exit_dns_resolver>:
c000000000d1467c:  19 00 4c 3c  addis   r2,r12,25
c000000000d14680:  84 45 42 38  addi    r2,r2,17796
c000000000d14684:  a6 02 08 7c  mflr    r0
On entry to the global entry point r12 must hold the address of the function,
c000000000d1467c. Given that, we can calculate that r2 will have the value
c000000000d1467c + 1638400 + 17796 = c000000000ea8c00. The 1638400 is the
addis immediate, 25, shifted left by 16 bits.
If we enter at the local entry point r2 must already hold the value
c000000000ea8c00. This then allows code inside the function to use
c000000000ea8c00 as the base address of any loads or stores to or from the TOC.
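The TOC pointer calculation can be checked with a few lines of Python. This is only a sketch of the arithmetic done by the addis/addi pair; the function names are hypothetical, and the inputs are the immediates and addresses from the listings in this article.

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of `value` as a signed immediate."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def toc_from_global_entry(r12: int, addis_imm: int, addi_imm: int) -> int:
    """Model `addis r2,r12,addis_imm` followed by `addi r2,r2,addi_imm`."""
    r2 = r12 + (sign_extend16(addis_imm) << 16)  # addis shifts left by 16
    r2 += sign_extend16(addi_imm)
    return r2 & 0xFFFFFFFFFFFFFFFF

# exit_dns_resolver: addis r2,r12,25 ; addi r2,r2,17796
toc = toc_from_global_entry(0xc000000000d1467c, 25, 17796)
print(hex(toc))  # 0xc000000000ea8c00
```

Feeding in the -mprofile-kernel listing for match_dev_by_uuid (entry at c00000000000b710, immediates 234 and -11024) gives the same result, as you would expect for two functions that share the kernel's TOC.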
The determination about which entry point to use is made at link time. If the linker can determine that the caller and callee share the same TOC, aka. the function call is "local", then it will generate a call to the local entry point. Calls via a function pointer always use the global entry point.
The last thing we need to know about the TOC pointer is that the instructions to setup the TOC pointer can be omitted entirely, if the function accesses no globals at all, eg:
c00000000065c8e0 <int_to_scsilun>:
c00000000065c8e0:  a6 02 08 7c  mflr    r0
c00000000065c8e4:  10 00 01 f8  std     r0,16(r1)
c00000000065c8e8:  75 d4 9a 4b  bl      c000000000009d5c <_mcount>
c00000000065c8ec:  a6 02 08 7c  mflr    r0
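As an illustration of the difference between the two kinds of function, the following Python sketch inspects the first two instructions of a function and reports where its local entry point would be. The function names are hypothetical and the byte strings are copied from the listings above. Decoding instructions like this is purely for illustration, it is not how the linker or the kernel decides which entry point to use.

```python
import struct

ADDIS_R2_R12 = 0x3C4C0000  # addis r2,r12,<imm> with the immediate masked off
ADDI_R2_R2 = 0x38420000    # addi  r2,r2,<imm>  with the immediate masked off

def has_toc_setup(code: bytes) -> bool:
    """True if the code starts with `addis r2,r12,X; addi r2,r2,Y`."""
    if len(code) < 8:
        return False
    first, second = struct.unpack_from("<II", code)
    return (first & 0xFFFF0000) == ADDIS_R2_R12 and \
           (second & 0xFFFF0000) == ADDI_R2_R2

def local_entry_offset(code: bytes) -> int:
    """Offset of the local entry point from the start of the function."""
    return 8 if has_toc_setup(code) else 0

exit_dns_resolver = bytes.fromhex("19004c3c 84454238 a602087c")
int_to_scsilun = bytes.fromhex("a602087c 100001f8 75d49a4b")

print(local_entry_offset(exit_dns_resolver))  # 8
print(local_entry_offset(int_to_scsilun))     # 0
```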
Live patch calling sequence
When you "apply" a live patch, what you're actually doing is loading a module with a new function in it, and diverting calls from the old version of the function to the new version.
To try and get a handle on things, let's give these functions names. We'll call
the caller A, this is code somewhere in the kernel or a module which wants to
call a function B. But we are live patching B with a new function P (for
"patch"). In order to get from B to P we go via the ftrace code, which we'll
call F. On the return path we come from P back to F and then to A.
+---------+            +---------+            +---------+         +---------+
|         |   +--------+---------+----------->|         |   +---->|         |
|  A      |   |        |  B      |            |  F      |   |     |  P      |
|         |   |        |         |            |         +---+     |         |
|         +---+        |         |            |         |<--+     |         |
|         |<--+        |         |            |         |   |     |         |
|         |   |        |         |            |         |   |     |         |
| K / M1  |   |        | K / M2  |      +-----+ Kernel  |   +-----+ Mod3    |
+---------+   |        +---------+      |     +---------+         +---------+
              |                         |
              +-------------------------+
The arrow that passes straight through B is meant to indicate that we don't
really execute any of the body code in B. However we do execute its prolog,
including the _mcount calling sequence.
The K / M1 annotation is meant to show that A may be either built-in to the
kernel, or a module M1. Similarly B may be either built-in to the kernel, or
a module M2. M1 may or may not equal M2. F is always built-in to the kernel.
P is typically in a new module, Mod3.
Looking at our example calling sequence above, A -> B -> F, when we get to F
(ftrace_caller()), what value does the TOC pointer have?
We can eliminate one possibility straight away: it's not the TOC pointer for F.
That's because ftrace_caller() is not a normal function. Callers don't set up
r12 to hold its entry point, and so it can't do a normal global entry point
style TOC setup.
It should be the TOC pointer for B, because the call to F (ftrace_caller()) is
made after the TOC pointer is set up. But as we discussed above, the TOC
pointer setup can be omitted if B doesn't need to use the TOC.
So we don't know whether we have B's TOC pointer in r2, and we also don't know
whether that is a module's TOC pointer or the kernel's.
Luckily we have one thing in our favour. At all times we have a pointer in r13
which points to a per-cpu structure called the paca, which contains the value
of the kernel's TOC pointer. This means in ftrace_caller() we can save
whatever value was in r2 and load the kernel's TOC pointer into r2. This
then allows the ftrace code to access kernel globals.
Local calls that become global
As we described above, the linker decides at link time whether calls to a
function are local or global, and calls the appropriate entry point. However
live patching breaks this logic, by retrospectively diverting the call to B,
which may have been local, into a call to P.
The patch function P is always in a newly loaded module, and in the common
case the caller A will not be in that same module. So we must always call
the global entry point of P. If P needs to use its TOC then its global entry
point will do the required TOC pointer setup, and TOC accesses in P will work
just fine.
So what's the problem? P will do the TOC pointer setup, but it will not save
the existing TOC pointer, that is up to the caller. But A didn't save the
TOC, because it knew it was calling a local function which shared the same TOC.
The end result is that P will clobber A's TOC pointer, and on return to A
the TOC pointer will be incorrect and the code will either crash or silently
do the wrong thing.
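The clobbering can be shown with a toy model in Python. This is a sketch only: the `Cpu` class and function names are hypothetical, A is assumed to be built-in to the kernel, and the module TOC value is made up for illustration.

```python
KERNEL_TOC = 0xc000000000ea8c00  # the TOC value from the earlier example
MODULE_TOC = 0xd000000000128c00  # hypothetical TOC of the patch module

class Cpu:
    def __init__(self, r2: int):
        self.r2 = r2  # the TOC pointer register

def b_local_entry(cpu: Cpu) -> None:
    # Local entry point: no TOC setup, r2 is left untouched.
    pass

def p_global_entry(cpu: Cpu) -> None:
    # Global entry point: P installs its own TOC and does not save the old one.
    cpu.r2 = MODULE_TOC

def a_calls(cpu: Cpu, callee) -> int:
    # A believes the call is local, so it neither saves nor restores r2.
    callee(cpu)
    return cpu.r2  # the TOC pointer A carries on using after the call

unpatched = a_calls(Cpu(KERNEL_TOC), b_local_entry)
patched = a_calls(Cpu(KERNEL_TOC), p_global_entry)

print(hex(unpatched))  # 0xc000000000ea8c00
print(hex(patched))    # 0xd000000000128c00
```

In the patched case A resumes with the module's TOC pointer in r2, so every subsequent global access in A uses the wrong base address.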
Solution? Save the TOC pointer in F
So the problem we have is that A is not saving its TOC pointer, because it
rightly doesn't think it needs to, but then P is clobbering the TOC pointer,
because it rightly doesn't think it needs to save it.
Hopefully the obvious solution is that F needs to save A's TOC pointer before
calling P. The question is where does it save it?
The obvious answer is "on the stack", because that's normally where functions save values before calling other functions, in fact that's basically the entire purpose of the stack.
This was our initial solution: in F we create a stack frame, save the TOC
pointer and LR, then branch and link to P. On the return path P returns to
F, which restores LR and the TOC pointer from the stack, deallocates its
stack frame, and returns to A.
For the simple case this works fine, and it passed a surprising amount of
testing. Luckily Torsten reminded us that it doesn't work if B needs some of
its arguments passed on the stack.
For a typical function all the arguments are passed in registers. However in
some cases some of the parameters must be passed on the stack, for example if
the function takes more than 8 arguments, or is a varargs function. Crucially
these parameters will be saved in the caller's stack frame, ie. A's stack
frame, and B will know that it can find them there.
Because P shares the same function prototype as B, it too will expect to
find its parameters in A's stack frame. But because of live patching we have
inserted a new stack frame in between, ie. F's stack frame.
              +-----------------+
              |                 |
              | A's stack frame |
              |                 |
              | +-------------+ |                     +-----------------+
              | |             | |                     |                 |
              | | Parameters  | |                     | A's stack frame |
              | |             | |                     |                 |
              | +-------------+ |                     | +-------------+ |
              +-----------------+                     | |             | |
     +------->|                 |            +------->| | Parameters  | |
     |        | F's stack frame |            |        | |             | |
   Wrong      |                 |          Known      | +-------------+ |
   offset     +-----------------+          offset     +-----------------+
   from SP    |                 |          from SP    |                 |
     |        | P's stack frame |            |        | B's stack frame |
     |        |                 |            |        |                 |
SP --+------> +-----------------+       SP --+------> +-----------------+
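The offset problem can be reproduced with a toy model of the stack in Python. All of the sizes and offsets here are hypothetical, chosen only to make the arithmetic visible; they are not the values the ABI uses.

```python
PARAM_OFFSET = 96  # hypothetical offset of the parameter area in A's frame
CALLEE_FRAME = 48  # hypothetical frame size used by both B and P
F_FRAME = 32       # hypothetical frame size for the ftrace code F

def call_chain(frames_between):
    """Return what the callee finds where it expects its stack parameter."""
    memory = {}
    sp = 0x10000                         # A's stack pointer (arbitrary)
    memory[sp + PARAM_OFFSET] = "9th argument"
    for size in frames_between:          # frames pushed between A and callee
        sp -= size
    sp -= CALLEE_FRAME                   # the callee allocates its own frame
    # The callee was compiled assuming A's frame is directly above its own.
    return memory.get(sp + CALLEE_FRAME + PARAM_OFFSET, "garbage")

direct = call_chain([])              # A -> B
via_f_frame = call_chain([F_FRAME])  # A -> F (with its own frame) -> P

print(direct)       # 9th argument
print(via_f_frame)  # garbage
```

With F's frame in the way, P's fixed offset lands 32 bytes short of the real parameter, which is the "wrong offset" case in the diagram.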
We could solve this by forcing all patch functions to be hand-written so that they know about this difference in stack layout. However that would seriously limit the usefulness of live patching, as all patches would have to be hand-written specifically for powerpc.
The solution: You need a stack in your stack
At this point we know F needs to save state across the call to P, but it can't
save that state on the stack. Where else could we put it?
After considering a few possibilities, all of which were unappealing, I proposed
that we create a new "live patching" stack. That is, when F needs to save
state before calling P, it allocates a stack frame on the livepatch stack, and
stores its state there. It can then call P without perturbing the offset from
P's stack frame to the parameters potentially stored in A's stack frame.
Additionally I proposed that we place the live patch stack in the existing
stack, along with the existing thread_info:
+-------------------+
|                   |
|   Regular Stack   |   High addresses
|                   |
|         +         |
|         |         |
|         v         |
|                   |
|      .......      |
|                   |
|         ^         |
|         |         |
|         +         |
|                   |
|  Livepatch Stack  |
|                   |
+-------------------+
|                   |
|    thread info    |   Low addresses
|                   |
+-------------------+
The key point is that the livepatch stack grows upwards from just above the
thread_info, whereas the regular stack grows down from the top of the stack.
The advantage of this is there are no extra allocations required, and there are no synchronisation problems, ie. the livepatch stack always exists if the thread exists, and it is not shared between threads.
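The layout can be modelled in a few lines of Python. This is a sketch under stated assumptions: the class and its methods are hypothetical, the 16KB thread size and 128 byte thread_info size are illustrative values, and the 24 byte frame is the 3 x 8 bytes that the handler code in this article allocates.

```python
THREAD_SIZE = 16 * 1024  # assumed size of the per-thread stack region
THREAD_INFO_SIZE = 128   # hypothetical size of thread_info
FRAME = 24               # 3 x 8 bytes per livepatch stack frame

class ThreadStack:
    def __init__(self, base: int):
        self.sp = base + THREAD_SIZE                 # regular stack, grows down
        self.livepatch_sp = base + THREAD_INFO_SIZE  # livepatch stack, grows up

    def push_regular(self, size: int) -> None:
        self.sp -= size
        self._check()

    def push_livepatch(self) -> None:
        self.livepatch_sp += FRAME
        self._check()

    def pop_livepatch(self) -> None:
        self.livepatch_sp -= FRAME

    def free_bytes(self) -> int:
        return self.sp - self.livepatch_sp

    def _check(self) -> None:
        if self.livepatch_sp > self.sp:
            raise OverflowError("regular and livepatch stacks collided")

stack = ThreadStack(base=0xc000000012340000)
stack.push_regular(112)
before = stack.free_bytes()
stack.push_livepatch()   # entering a live patched function
during = stack.free_bytes()
stack.pop_livepatch()    # returning from it
after = stack.free_bytes()

print(before - during)   # 24
```

Both stacks draw on the same free region, so no extra allocation is needed, and each live patched call that is in progress costs 24 bytes of it.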
The one downside is that it consumes stack space, meaning a kernel with a live patch enabled will use more stack space than one without. However in practice the kernel runs with a large overhead of free stack space, precisely because running out of stack space is already fatal. Hopefully in the medium term we can improve the kernel's handling of stack exhaustion, and at that point we can revisit separating the livepatch stack from the regular stack.
The final piece of the puzzle is the code that runs in F.
If you've made it this far you should be pretty comfortable just reading the
code, reproduced below.
Update: Kamalesh found and fixed a bad bug (written by me) in the original version, updated assembly listing below.
livepatch_handler:
    CURRENT_THREAD_INFO(r12, r1)

    /* Allocate 3 x 8 bytes */
    ld      r11, TI_livepatch_sp(r12)
    addi    r11, r11, 24
    std     r11, TI_livepatch_sp(r12)

    /* Save toc & real LR on livepatch stack */
    std     r2,  -24(r11)
    mflr    r12
    std     r12, -16(r11)

    /* Store stack end marker */
    lis     r12, STACK_END_MAGIC@h
    ori     r12, r12, STACK_END_MAGIC@l
    std     r12, -8(r11)

    /* Put ctr in r12 for global entry and branch there */
    mfctr   r12
    bctrl

    /*
     * Now we are returning from the patched function to the original
     * caller A. We are free to use r11, r12 and we can use r2 until we
     * restore it.
     */

    CURRENT_THREAD_INFO(r12, r1)

    ld      r11, TI_livepatch_sp(r12)

    /* Check stack marker hasn't been trashed */
    lis     r2,  STACK_END_MAGIC@h
    ori     r2,  r2, STACK_END_MAGIC@l
    ld      r12, -8(r11)
1:  tdne    r12, r2
    EMIT_BUG_ENTRY 1b, __FILE__, __LINE__ - 1, 0

    /* Restore LR & toc from livepatch stack */
    ld      r12, -16(r11)
    mtlr    r12
    ld      r2,  -24(r11)

    /* Pop livepatch stack frame */
    CURRENT_THREAD_INFO(r12, r1)
    subi    r11, r11, 24
    std     r11, TI_livepatch_sp(r12)

    /* Return to original caller of live patched function */
    blr
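For readers who prefer a higher level view, here is a rough Python model of the same save, call, check and restore flow. It is a sketch rather than a translation: the `Thread` class, the `regs` dictionary, the example register values and the module TOC are all hypothetical, and a Python list stands in for the memory above thread_info.

```python
STACK_END_MAGIC = 0x57AC6E9D     # the kernel's stack end marker value
KERNEL_TOC = 0xc000000000ea8c00
MODULE_TOC = 0xd000000000128c00  # hypothetical TOC of the patch module

class Thread:
    def __init__(self):
        self.livepatch_stack = []  # stands in for the livepatch stack

def livepatch_handler(thread, regs, patch_fn):
    # Save the caller's TOC and the real LR, then a marker, on the
    # livepatch stack (the 3 x 8 byte frame in the assembly).
    thread.livepatch_stack.extend([regs["r2"], regs["lr"], STACK_END_MAGIC])
    result = patch_fn(regs)  # the bctrl to P's global entry point
    # Back from P: check the marker before trusting the saved values.
    if thread.livepatch_stack.pop() != STACK_END_MAGIC:
        raise RuntimeError("livepatch stack marker trashed")
    regs["lr"] = thread.livepatch_stack.pop()
    regs["r2"] = thread.livepatch_stack.pop()
    return result

seen_toc = []

def patch_function(regs):
    regs["r2"] = MODULE_TOC  # P's global entry sets up its own TOC
    regs["lr"] = 0           # P is free to clobber LR as well
    seen_toc.append(regs["r2"])
    return 42

thread = Thread()
regs = {"r2": KERNEL_TOC, "lr": 0xc000000000123458}
result = livepatch_handler(thread, regs, patch_function)

print(hex(regs["r2"]))  # 0xc000000000ea8c00
```

After the call, r2 and LR hold the caller's values again and the livepatch stack is empty, without F ever having created a frame on the regular stack.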
The end result
The ftrace changes were merged into 4.6, and the rest of the live patching
support was merged into 4.7. This means in 4.7 you can enable CONFIG_LIVEPATCH
and start patching your kernel. It's likely some distros will also backport the
changes into older kernels.
In total the diffstat wasn't all that impressive:
25 files changed, 778 insertions(+), 140 deletions(-)
But hopefully this article has made some of the complexity of the implementation clear.
Big thanks to everyone who helped write the code and/or did reviews:
- Torsten Duwe
- Vojtech Pavlik
- Petr Mladek
- Jiri Kosina
- Miroslav Benes
- Josh Poimboeuf
- Balbir Singh
- Kamalesh Babulal