The Linux kernel, like all quality operating systems, strives to be as efficient as possible. The kernel is master of the machine, and every application can be bogged down if the kernel takes too long to perform a system task. That could be latency in scheduling between tasks, in handling external interfaces such as the network card or the hard drive, or in any other service that the kernel provides for applications. Thus, the kernel is coded to accomplish tasks as fast as possible while remaining maintainable and reliable in its design.
    #define likely(x) __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

An example of its use is something like:

    ret = some_routine();
    if (unlikely(ret < 0))
            return -ERROR;
Now you may think that something as simple as “return -ERROR;” would not need such an annotation. But let’s say it was the likely case.
Perhaps we had “return 0”. A function return on an x86 machine in C actually requires quite a bit of work. It must restore saved registers and load a return value before it can exit the function. For example, the return of pick_next_task_fair() (a commonly called scheduler function) has this:
/sys/kernel/debug/tracing/trace_stat/branch_annotated
Correct | Incorrect | % | Function | File | Line |
0 | 766059 | 100 | audit_syscall_exit | audit.h | 2608 |
0 | 204147 | 100 | sched_info_switch | stats.h | 156 |
0 | 162121 | 100 | sched_info_queued | stats.h | 108 |
0 | 106475 | 100 | sched_info_dequeued | stats.h | 73 |
0 | 43415 | 100 | pick_next_task_dl | deadline.c | 1155 |
0 | 1515 | 100 | pte_alloc_one_map | memory. | 2909 |
0 | 1460 | 100 | percpu_up_read_preempt_enable | percpu-rwsem.h | 95 |
0 | 1460 | 100 | percpu_down_read_preempt_disab | percpu-rwsem.h | 47 |
One time, I reported an incorrect unlikely() without a patch, because the code looked like that case was supposed to be an exception, but it was constantly being hit. The author of that code found a bug somewhere that was causing this path to be taken when it should not have been. The unlikely() stayed, but reporting that it was being hit uncovered a bug elsewhere.
Looking at the cases above, the first is “audit_syscall_exit”. This code gets called at system call exit when there’s some “work” to do. This slow path of system call exit is taken when system calls are being traced, or even when a task is performing a “ptrace” on another task (as gdb does). Since the auditing code is trying to be “nice”, it doesn’t assume it is on the fast path, and places an “unlikely()” around the test of whether auditing is enabled. The machine running this happens to have SELinux enabled, and that enables the auditing of system call exit.
Here’s one example where a 100% incorrect “unlikely” is the right answer! Note, if I disable auditing in the kernel by passing in “audit=0” to the kernel command line, this branch becomes 100% correct.
The sched_info_*() functions are all 100% incorrect as well, because the code there is similar to the audit code: if scheduling statistics are disabled, the sched_info_*() calls should all be ignored. But this one is under discussion, because when statistics are disabled at build time, the code that has the “unlikely()” is not compiled in at all. Which raises the question, why is this incorrect? When statistics are compiled into the kernel, a variable enables or disables them at run time, and the unlikely() macro is around the test of that variable. The point of contention is that the variable is enabled by default when statistics are enabled at build time, and can only be turned off with a kernel command line option. There is a good argument to change the unlikely(), but the owner of that code needs to be convinced first.
Using the likely/unlikely profiler is a good way to understand parts of the kernel. Every time I look at an incorrect macro placement that is located in a part of the kernel I’m unfamiliar with, I learn a bit about the surrounding code. That’s one of my motivations to continue to periodically see where likely and unlikely fail to meet expectations.
It’s also a good excuse to keep my “#define if” still in the kernel. But that’s for another topic.