As part of the 4.6 and 4.7 development cycles, several of us from IBM & SUSE finally got all the pieces lined up to enable kernel live patching for ppc64le.
If you're not familiar with kernel live patching, there's a pretty good write-up at Wikipedia. But the basic idea is that you can, without rebooting, replace one function in the kernel with a new version of that function (or something else entirely). Typically you do this because there is a bug in the original version of the function, and this allows you to fix or mitigate that bug.
If you think that sounds like it could be tricky, then you're right.
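For example, suppose a function contains a simple logic bug. A live patch supplies a corrected replacement and diverts callers to it. A purely hypothetical sketch, not real kernel code:
/* Hypothetical example: a buggy function and its live patch replacement. */
int buggy_max(int a, int b)
{
        return a > b ? b : a;   /* bug: returns the smaller value */
}

int patched_max(int a, int b)
{
        return a > b ? a : b;   /* fixed: once the patch is applied, calls
                                 * to buggy_max() are diverted here, with
                                 * no reboot required */
}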
FTRACE_WITH_REGS
Live patching is implemented on top of ftrace, which is Linux's mechanism for attaching trace functions to other functions. Although powerpc has had ftrace support for ~8 years, we didn't have support for a particular feature of ftrace, called FTRACE_WITH_REGS.
So the first step on the way to live patching support was to implement support for FTRACE_WITH_REGS.
The way ftrace is implemented is that we build the kernel with some special GCC flags. These tell GCC to generate calls to a special function, _mcount, at the beginning of every function. Typically these calls are nop'ed out, but when ftrace is enabled those call sites are patched to call into ftrace and do something interesting.
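As a rough illustration (the function is made up, and the transformation is shown in a comment rather than as real compiler output):
/* example.c: built with -pg (and -mprofile-kernel on ppc64le, see
 * below), GCC emits a call to _mcount() in every function's prolog. */
void _mcount(void);             /* supplied by the kernel, not by us */

long example(long x)
{
        /* Conceptually the compiler turns this function into:
         *
         *      _mcount();      <- nop'ed out at boot, re-patched to call
         *                         into ftrace when tracing is enabled
         *      return x * 2;
         */
        return x * 2;
}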
In order to support FTRACE_WITH_REGS we first needed to tell GCC to use a different calling sequence for the _mcount calls. This is called -mprofile-kernel, and was the brainchild of Anton Blanchard & Alan Modra.
Using -mprofile-kernel tells GCC to use a very minimal calling sequence when calling _mcount. This has two advantages: firstly, it imposes very little overhead when the _mcount calls are nop'ed out, and secondly, it means the register state is not perturbed before calling ftrace - which is exactly what we need for FTRACE_WITH_REGS.
This is an example of the code generated for a call to _mcount using the standard mcount ABI; as you can see it involves an mflr, several stores, and the creation of a stack frame (stdu):
c00000000000b560 <match_dev_by_uuid>:
c00000000000b560: ed 00 4c 3c addis r2,r12,237
c00000000000b564: a0 d6 42 38 addi r2,r2,-10592
c00000000000b568: a6 02 08 7c mflr r0
c00000000000b56c: f0 ff c1 fb std r30,-16(r1)
c00000000000b570: f8 ff e1 fb std r31,-8(r1)
c00000000000b574: 10 00 01 f8 std r0,16(r1)
c00000000000b578: d1 ff 21 f8 stdu r1,-48(r1)
c00000000000b57c: 78 1b 7e 7c mr r30,r3
c00000000000b580: 78 23 9f 7c mr r31,r4
c00000000000b584: d9 e7 ff 4b bl c000000000009d5c <_mcount>
c00000000000b588: 00 00 00 60 nop
Here's an example calling sequence with -mprofile-kernel; the sequence is reduced to just the mflr, std and bl:
c00000000000b710 <match_dev_by_uuid>:
c00000000000b710: ea 00 4c 3c addis r2,r12,234
c00000000000b714: f0 d4 42 38 addi r2,r2,-11024
c00000000000b718: a6 02 08 7c mflr r0
c00000000000b71c: 10 00 01 f8 std r0,16(r1)
c00000000000b720: 3d e6 ff 4b bl c000000000009d5c <_mcount>
In addition to telling GCC to use -mprofile-kernel, we also needed a new implementation of ftrace_caller(), which is the function called in place of _mcount when ftrace is enabled. Due to the way it's called, ie. during the prolog of another function, ftrace_caller() has to be carefully hand-written in assembler.
To keep things interesting we also had to deal with the fact that old toolchains don't implement -mprofile-kernel correctly, although they do accept the flag. Furthermore, older toolchains generate a 4-instruction sequence, while newer ones generate an optimised 3-instruction sequence.
This work was done largely by Torsten Duwe of SUSE, with some initial work by Vojtech Pavlik, and some assistance from me on the module code.
Understanding the TOC & entry points
The TOC, or "Table Of Contents", is an area in memory where a program's global data is placed. When code references a global variable, the compiler generates instructions to load that variable from the TOC, relative to the TOC pointer, which is expected to be in r2.
On ppc64le, functions typically have two entry points. The first, called the "global entry point", is found at the start of the function's text, and will setup the TOC pointer based on the location of the function, which must be passed in r12. It is the responsibility of the caller to have saved its TOC value somewhere prior to the call.
The second entry point, called the "local entry point", skips over the first two instructions which do the TOC pointer setup, ie. it's at +8 from the start of the function text. In this case the caller needn't save its TOC, because it knows it will not be modified.
An example function, with both global and local entry points:
c000000000d1467c <exit_dns_resolver>:
c000000000d1467c: 19 00 4c 3c addis r2,r12,25
c000000000d14680: 84 45 42 38 addi r2,r2,17796
c000000000d14684: a6 02 08 7c mflr r0
On entry to the global entry point r12 must hold the address of the function, ie. c000000000d1467c. Given that, we can calculate that r2 will have the value c000000000d1467c + 1638400 + 17796 = c000000000ea8c00 (the addis adds 25 << 16 = 1638400, and the addi adds 17796).
If we enter at the local entry point, c000000000d14684, then r2 must already hold the value c000000000ea8c00. This then allows code inside the function to use c000000000ea8c00 as the base address of any loads or stores to or from global variables.
The determination about which entry point to use is made at link time. If the linker can determine that the caller and callee share the same TOC, aka. the function call is "local", then it will generate a call to the local entry point. Calls via a function pointer always use the global entry point.
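As an illustration, here's a hedged C sketch of when each entry point gets used; the function names are made up, and we assume both functions end up linked with the same TOC:
static long g;

long callee(long x)
{
        return x + g;           /* needs the TOC to reach 'g' */
}

long caller(long x)
{
        long (*fp)(long) = callee;

        /* Direct call: the linker can see caller and callee share a
         * TOC, so it calls the local entry point and r2 is untouched. */
        long a = callee(x);

        /* Indirect call: the target isn't known until runtime, so the
         * global entry point must be used; the compiler saves the
         * caller's TOC (at r1+24 under the ELFv2 ABI) and restores it
         * after the bctrl. */
        long b = fp(x);

        return a + b;
}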
The last thing we need to know about the TOC pointer is that the instructions to setup the TOC pointer can be omitted entirely, if the function accesses no globals at all, eg:
c00000000065c8e0 <int_to_scsilun>:
c00000000065c8e0: a6 02 08 7c mflr r0
c00000000065c8e4: 10 00 01 f8 std r0,16(r1)
c00000000065c8e8: 75 d4 9a 4b bl c000000000009d5c <_mcount>
c00000000065c8ec: a6 02 08 7c mflr r0
Live patch calling sequence
When you "apply" a live patch, what you're actually doing is loading a module with a new function in it, and diverting calls from the old version of the function to the new version.
To try and get a handle on things, let's give these functions names. We'll call the caller A; this is code somewhere in the kernel or a module which wants to call a function B. But we are live patching B with a new function P (the "patch"). In order to get from B to P we go via the ftrace code, which we'll call F. On the return path we come from P back to F and then to A, eg:
+---------+ +---------+ +---------+ +---------+
| | +-----+---------+---------->| | +---->| |
| A | | | B | | F | | | P |
| | | | | | +-----+ | |
| +-----+ | | | |<----+ | |
| |<----+ | | | | | | |
| | | | | | | | | |
| K / M1 | | | K / M2 | +-----+ Kernel | +-----+ Mod3 |
+---------+ | +---------+ | +---------+ +---------+
| |
+---------------------+
The arrow that passes straight through B is meant to indicate that we don't really execute any of the body code in B. However we do execute the prolog and the _mcount calling sequence.
The K / M1 annotation is meant to show that A may be either built-in to the kernel, or a module M1. Similarly B may be either built-in to the kernel, or a module M2, and M1 may or may not equal M2. F is always built-in to the kernel, and P is typically in a new module M3.
Whose TOC?
Looking at our example calling sequence above, A -> B -> F, when we get into F (ie. ftrace_caller()), what value does the TOC pointer have?
We can eliminate one possibility straight away: it's not the TOC pointer for F. Because ftrace_caller() is not a normal function, callers don't setup r12 to hold its entry point, and so it can't do a normal global entry point style TOC setup.
It should be the TOC pointer for B, because the call to _mcount (actually ftrace_caller()) is made after the TOC pointer is setup. But as we discussed above, the TOC pointer setup can be omitted if B doesn't need to use the TOC.
So we don't know whether we have A's or B's TOC pointer in r2, and we also don't know whether that is a module's TOC pointer or the kernel's.
Luckily we have one thing in our favour. At all times we have a pointer in r13 which points to a per-cpu structure called the paca, which contains the value of the kernel's TOC pointer. This means in ftrace_caller() we can save whatever value was in r2 and load the kernel's TOC pointer into r2, which then allows the ftrace code to access kernel globals.
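In C terms the relevant machinery looks roughly like this. The layout is heavily simplified, but the kernel really does reserve r13 to point at the paca:
/* A heavily simplified view of the paca; the real struct has many
 * more fields. */
struct paca_struct {
        /* ... */
        unsigned long kernel_toc;   /* kernel's TOC pointer, set at boot */
        /* ... */
};

/* r13 is permanently reserved to point at this CPU's paca. */
register struct paca_struct *local_paca asm("r13");

/* So from anywhere in the kernel, a known-good TOC value is just: */
static inline unsigned long kernel_toc(void)
{
        return local_paca->kernel_toc;
}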
Local calls that become global
As we described above, the linker decides at link time whether calls to a function are local or global, and calls the appropriate entry point. However live patching breaks this logic, by retrospectively diverting what may have been a local call from A to B into a call to P.
The patch function P is always in a newly loaded module, and in the common case the caller A will not be in that same module. So we must always call the global entry point of P.
If P needs to use its TOC then its global entry point will do the required TOC pointer setup, and TOC accesses in P will work just fine.
So what's the problem? P will do the TOC pointer setup, but it will not save the existing TOC pointer - that is up to the caller A. But A didn't save the TOC, because it knew it was calling a local function which shared the same TOC pointer.
The end result is that P will clobber A's TOC pointer, and on return to A the TOC pointer will be incorrect, and the code will either crash or silently corrupt data.
Solution? Save the TOC pointer in F
So the problem we have is that A is not saving its TOC pointer, because it rightly doesn't think it needs to, but then P is clobbering the TOC pointer, because it rightly doesn't think it needs to save it.
Hopefully the obvious solution is that F needs to save A's TOC pointer before calling P. The question is where does it save it?
The obvious answer is "on the stack", because that's normally where functions save values before calling other functions; in fact, that's basically the entire purpose of the stack.
This was our initial solution: in F we create a stack frame, save the TOC pointer and LR, then branch and link to P. On the return path P returns to F, which restores LR and the TOC pointer from the stack, deallocates its stack frame, and returns to A.
For the simple case this works fine, and it passed a surprising amount of testing. Luckily Torsten reminded us that it doesn't work if B needs some of its arguments passed on the stack.
For a typical function all the arguments are passed in registers. However in some cases some of the parameters must be passed on the stack, for example if the function takes more than 8 arguments, or is a varargs function. Crucially these parameters will be saved in the caller's stack frame, ie. A's stack frame, and B will know that it can find them there.
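Here's a hypothetical example; under the ELFv2 ABI the first eight integer arguments are passed in r3-r10, so i and j below end up in the caller's stack frame:
/* many_args() is a made-up function with more than 8 arguments. */
long many_args(long a, long b, long c, long d,
               long e, long f, long g, long h,
               long i, long j)
{
        /* a..h arrive in registers r3-r10; i and j were stored by the
         * caller (A) in its own stack frame, and are found at a fixed
         * offset from this function's stack pointer. */
        return a + h + i + j;
}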
Because P shares the same function prototype as B, it too will expect to find its parameters in A's stack frame. But because of live patching we have inserted a new stack frame in between, ie. F's stack frame.
+-----------------+
| |
| A's stack frame |
| |
| |
| +-------------+ |
+-----------------+ | | | |
| | | | Parameters | |
| A's stack frame | | | | |
| | | +-------------+ |
| | +-----------------+
| +-------------+ | +------->| |
| | | | | | F's stack frame |
+------->| | Parameters | | | | |
| | | | | | | |
| | +-------------+ | Wrong | | |
Known | +-----------------+ | +-----------------+
offset | | | | | |
from SP | | B's stack frame | | | P's stack frame |
| | | | | |
SP ------> +------+ +-----------------+ SP ------> +------+ +-----------------+
We could solve this by forcing all patch functions to be hand-written so that they know about this difference in stack layout. However that would seriously limit the usefulness of live patching, as all patches would have to be hand-written specifically for powerpc.
The solution: You need a stack in your stack
At this point we know F needs to save state between A and P, but it can't save that state on the stack. Where else could we put it?
After considering a few possibilities, all of which were unappealing, I proposed that we create a new "live patching" stack. That is, when F needs to save state before calling P, it allocates a stack frame on the livepatch stack, and stores its state there. It can then call P without perturbing the offset from P's stack frame to the parameters potentially stored in A's stack frame.
Additionally I proposed that we place the livepatch stack in the existing stack, along with the existing thread_info:
+-------------------+
| |
| Regular Stack | High addresses
| |
| + |
| | |
| v |
| |
| |
| ....... |
| |
| |
| ^ |
| | |
| + |
| |
| Livepatch Stack |
| |
+-------------------+
| |
| thread info | Low addresses
| |
+-------------------+
The key point is that the livepatch stack grows upwards from just above the thread_info, whereas the regular stack grows down from the top of the stack.
The advantage of this is there are no extra allocations required, and there are no synchronisation problems, ie. the livepatch stack always exists if the thread exists, and it is not shared between threads.
The one downside is that it consumes stack space, meaning a kernel with a live patch enabled will use more stack space than one without. However in practice the kernel runs with a large overhead of free stack space, precisely because running out of stack space is already fatal. Hopefully in the medium term we can improve the kernel's handling of stack exhaustion, and at that point we can revisit separating the livepatch stack from the regular stack.
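Before reading the assembler below, here's the same push/pop discipline sketched in C. The names loosely follow the kernel's (TI_livepatch_sp, STACK_END_MAGIC), but treat this as an illustration of the layout rather than the real implementation:
#define STACK_END_MAGIC 0x57AC6E9DUL  /* same canary value the kernel uses */

/* One livepatch stack frame: 3 x 8 bytes, matching the asm below. */
struct lp_frame {
        unsigned long toc;      /* saved r2, ie. A's TOC pointer */
        unsigned long lr;       /* the real return address back into A */
        unsigned long magic;    /* canary to catch livepatch stack overflow */
};

/* In reality this lives in thread_info (TI_livepatch_sp); it starts
 * just above thread_info and grows upwards. */
static unsigned long livepatch_sp;

static void lp_push(unsigned long toc, unsigned long lr)
{
        struct lp_frame *f = (struct lp_frame *)livepatch_sp;

        f->toc = toc;
        f->lr = lr;
        f->magic = STACK_END_MAGIC;
        livepatch_sp += sizeof(*f);   /* grows up, unlike the regular stack */
}

static void lp_pop(unsigned long *toc, unsigned long *lr)
{
        struct lp_frame *f;

        livepatch_sp -= sizeof(struct lp_frame);
        f = (struct lp_frame *)livepatch_sp;
        if (f->magic != STACK_END_MAGIC)   /* the tdne/BUG in the asm */
                __builtin_trap();
        *toc = f->toc;
        *lr = f->lr;
}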
The final piece of the puzzle is the code that runs in F between A and P. If you've made it this far you should be pretty comfortable just reading the code, reproduced below.
Update: Kamalesh found and fixed a bad bug (written by me) in the original version; the updated assembly listing is below.
livepatch_handler:
CURRENT_THREAD_INFO(r12, r1)
/* Allocate 3 x 8 bytes */
ld r11, TI_livepatch_sp(r12)
addi r11, r11, 24
std r11, TI_livepatch_sp(r12)
/* Save toc & real LR on livepatch stack */
std r2, -24(r11)
mflr r12
std r12, -16(r11)
/* Store stack end marker */
lis r12, STACK_END_MAGIC@h
ori r12, r12, STACK_END_MAGIC@l
std r12, -8(r11)
/* Put ctr in r12 for global entry and branch there */
mfctr r12
bctrl
/*
* Now we are returning from the patched function to the original
* caller A. We are free to use r11, r12 and we can use r2 until we
* restore it.
*/
CURRENT_THREAD_INFO(r12, r1)
ld r11, TI_livepatch_sp(r12)
/* Check stack marker hasn't been trashed */
lis r2, STACK_END_MAGIC@h
ori r2, r2, STACK_END_MAGIC@l
ld r12, -8(r11)
1: tdne r12, r2
EMIT_BUG_ENTRY 1b, __FILE__, __LINE__ - 1, 0
/* Restore LR & toc from livepatch stack */
ld r12, -16(r11)
mtlr r12
ld r2, -24(r11)
/* Pop livepatch stack frame */
CURRENT_THREAD_INFO(r12, r1)
subi r11, r11, 24
std r11, TI_livepatch_sp(r12)
/* Return to original caller of live patched function */
blr
The end result
The ftrace changes were merged into 4.6, and the rest of the live patching
support was merged into 4.7. This means in 4.7 you can enable CONFIG_LIVEPATCH
and start patching your kernel. It's likely some distros will also backport the
changes into older kernels.
In total the diffstat wasn't all that impressive:
25 files changed, 778 insertions(+), 140 deletions(-)
But hopefully this article has made some of the complexity of the implementation clear.
Big thanks to everyone who helped write the code and/or did reviews:
- Torsten Duwe
- Vojtech Pavlik
- Petr Mladek
- Jiri Kosina
- Miroslav Benes
- Josh Poimboeuf
- Balbir Singh
- Kamalesh Babulal