The Linux kernel, like all quality operating systems, strives to be as efficient as possible. This is because the kernel is the master of the machine, and every application can be bogged down if the kernel takes too long to perform a system task. That could be latency in scheduling between tasks, in handling external devices such as the network card or the hard drive, or in any other service that the kernel provides for applications. Thus, the kernel is written to accomplish its work as quickly as possible while remaining maintainable and reliable in its design.

One optimization used in the kernel is a pair of macros called likely() and unlikely(). These macros are hints to the compiler (namely gcc) that tell it the probable path of a conditional statement. They are defined as:

#define likely(x)     __builtin_expect(!!(x), 1)
#define unlikely(x)   __builtin_expect(!!(x), 0)
An example of their use looks like:
ret = some_routine();
if (unlikely(ret < 0))
      return -ERROR;
The function (some_routine) is very likely to not error out. But in the unlikely case that it does, the code returns an error. The unlikely() macro around the condition tells the compiler to optimize the layout on the expectation that the variable ret will almost always be non-negative.
So, why does all this matter? Well, it matters a lot when it comes to laying out the code to make the best use of the CPU cache. When a CPU executes code, the pipeline pulls in memory by cache lines.
Ideally, we want that memory to stay in the cache. The more cache lines the CPU needs, the more often some other cache line is flushed out just to be loaded in again. And that takes a lot of time compared to executing code that is already in the cache.
The official Linux kernel tracer ftrace was able to increase its performance by as much as 20% by adding in proper likely() and unlikely()s to the code. That’s because the compiler was able to keep the logic that is most often used tightly together, and this cut down on the cache lines needed to execute the recording of the ftrace data.

Now you may think that something as simple as “return -ERROR;” would not need such an annotation. But let’s say it was the likely case.

Perhaps we had “return 0”. A function return on an x86 machine in C actually requires quite a bit of work. It must restore saved registers and load a return value before it can exit the function. For example, the return of pick_next_task_fair() (a commonly called scheduler function) looks like this:

  movq $0x0, -0x48(%rbp)
  mov -0x48(%rbp), %rax
  add $0x68, %rsp
  pop %rbx
  pop %r12
  pop %r13
  pop %r14
  pop %r15
  pop %rbp
  retq
The above takes up a total of 27 bytes. With a cache line size of 64 bytes, that return takes up 42% of the cache line.
For the likely case of a simple “if (x) return 0;”, it is best to keep that code right after the compare of x. If it is not the likely case, then it should be moved away from the rest of the function’s code, and the branch that takes it should have to jump to it.
Now you can see that adding likely() and unlikely() macros around branch conditions can improve the performance of the kernel. But then the question arises: how do you really know if your likely() and unlikely() placements are correct? Micro-benchmarks may help, but those rarely demonstrate real-world activity, and with benchmarks of real-world activity it is difficult to pinpoint which branches are being taken or not. This is why I created the likely/unlikely branch profiler.
By enabling CONFIG_PROFILE_ANNOTATED_BRANCHES in the kernel build config file, all the likely() and unlikely() macros will be recorded to see how many times they were correct or not. Then, while the machine is running, examine the tracefs file:
/sys/kernel/debug/tracing/trace_stat/branch_annotated
This will show which branches are correct and which are not. For example, doing this on one of my test machines with a distribution config, but with the profiling enabled, we have:
 Correct  Incorrect   %   Function                        File            Line
 -------  ---------  ---  --------                        ----            ----
       0     766059  100  audit_syscall_exit              audit.h         2608
       0     204147  100  sched_info_switch               stats.h          156
       0     162121  100  sched_info_queued               stats.h          108
       0     106475  100  sched_info_dequeued             stats.h           73
       0      43415  100  pick_next_task_dl               deadline.c      1155
       0       1515  100  pte_alloc_one_map               memory.          2909
       0       1460  100  percpu_up_read_preempt_enable   percpu-rwsem.h    95
       0       1460  100  percpu_down_read_preempt_disab  percpu-rwsem.h    47
The above shows several cases that appear to use the likely() and unlikely() macros 100% incorrectly. Now why is this?
There are several reasons that these are incorrect. One is simply human error. Over the years I have enabled this tracer on two boxes of mine that I consider production machines. One of these machines runs various services: a web server, a mail server, and even my compiling server. The other is the one I work on daily, which has me writing and compiling code, browsing the web, and reading email. After about three weeks I collect all the statistics gathered by this profiler and start analyzing the ones that are most incorrect. One year, after I submitted a patch to turn an unlikely() into a likely(), the author of the original code told me that they had meant to put in “likely” and simply typed “unlikely” by mistake.
One thing to note: do not take the profiler's output alone as a reason to change the code. Before sending any patch to update the code, I always examine why it is wrong. There may be a config option I have enabled that makes a branch always incorrect but that is normally not enabled. One example of this: if I enable CONFIG_SLUB_DEBUG_ON, it turns on debugging of the SLUB memory allocator at boot up. This causes several paths to be hit that would normally not be hit, which makes several likely() and unlikely() branches do the opposite of what they usually do, and the profiler will flag them as such. Never use this profiler by itself as justification to send a patch. Always dig deeper and understand the code, or the history of the code, before submitting a patch.
One unlikely() path that was incorrect was so simply because the code calling that function had changed over time. There was a case in the scheduler where the task being processed was pretty much guaranteed to be sleeping, and the path for the running case was marked unlikely(). But over time, two changes were made to the code calling the function that made it much more likely for the task to be running. The final decision was to simply remove the unlikely() and not replace it with likely(), even though the running case was now the likely one. Because that branch is sensitive to how the function is called, it was best not to commit to a prediction one way or the other.

One time, I reported an incorrect unlikely() without a patch, because the code looked as though that case was supposed to be an exception, yet it was constantly being hit. The author of that code found a bug somewhere else that was causing this path to be taken when it should not have been. The unlikely() stayed, but reporting that it was being hit uncovered a bug elsewhere.

Looking at the cases above, the first being “audit_syscall_exit”: this code gets called at system call exit when there is some “work” to do. This slow path of system call exit happens when system calls are being traced, or when a task is performing a “ptrace” on another task (as gdb does). Since the auditing code is trying to be “nice”, it does not assume it is the fast path, and places an “unlikely()” around the test of whether auditing is enabled or not. The machine running this happens to have SELinux enabled, and that enables the auditing of system call exit.

Here’s one example where a 100% incorrect “unlikely” is the right answer! Note that if I disable auditing in the kernel by passing “audit=0” on the kernel command line, this branch becomes 100% correct.

The sched_info_*() functions are all 100% incorrect unlikely()s, because the code there is similar to the audit code. If scheduling statistics are disabled, then the sched_info_*() calls should all be ignored. But this is under discussion, because when statistics are disabled at build time, the code containing the “unlikely()” is not compiled in at all. Which raises the question: why is it incorrect? When statistics are compiled into the kernel, there is a variable used to enable or disable statistics at run time, and the unlikely() macro is around the test of that variable. The point of discussion is that this variable defaults to enabled when statistics are built in, and can only be disabled by a kernel command line option. There is a good argument to change the unlikely(), but the owner of that code needs to be convinced first.

Using the likely/unlikely profiler is a good way to understand parts of the kernel. Every time I look at an incorrect macro placement that is located in a part of the kernel I’m unfamiliar with, I learn a bit about the surrounding code. That’s one of my motivations to continue to periodically see where likely and unlikely fail to meet expectations.

It’s also a good excuse to keep my “#define if” still in the kernel. But that’s for another topic.