Community

The Road to Real Time Linux

The term “real time” has many connotations. A simple search on Google for “real time” gives everything from Real Time Twitter to “HBO: Real Time with Bill Maher.” You may have used it for real-time updates on a package coming to your home via FedEx or UPS. All of which has to do something about actual time for real, but generally speaking, the term is quite ambiguous. When it comes to operating systems, the term has a much narrower meaning, but still people argue about what exactly it means.

I like to refer to “Real Time Operating Systems” (RTOS) as more about determinism than anything to do with actual time. Timing is still very much important, but being able to know how the system will behave and base algorithms on that behavior which will give consistent results, is much more important. And that is the essence of an RTOS, its deterministic behavior.

One place you hear about RTOS being used is in safety critical applications, such as, software that runs airplane engines, medical devices, industrial robotics, etc. Here, “determinism” is not just about the design of the software, but the actual quality itself.

When I worked at Lockheed Martin, I worked on code that was over 10 years old, because of all the quality assurance that had to be done on it. The code required mathematical proofs to show that it was correct. Anyone with a computer science degree knows that correctness algorithms are NP complete and become practically impossible to prove as the size of the code increases. This explains why a “hard” RTOS is very small in size and has limited functionality.

The first attempts at bringing Real Time to Linux consisted of a micro-kernel (something very similar to a hypervisor), which would run Linux as an application to execute the non critical parts of the system. The actual RT programs would interact directly with this micro kernel to achieve its hard real time response. The idea was that the micro-kernel being very small could possibly achieve the quality that was required in the field. But this had its own problems. It required that it must be ported to each board and device that in ran on. It also meant that any application that required real time access, must have a separate set of system calls to communicate to this micro-kernel and could not run on native Linux.

Ingo Molnar and Thomas Gleixner felt that they could achieve a hard RTOS design with the native Linux kernel. My former employer TimeSys had already a working native Linux RTOS of their own. When I left them, I joined Ingo and Thomas in their effort to create one that could one day be merged into the Linux kernel itself. Although the Linux kernel is much too big to be mathematically proven correct, that didn’t mean that the kernel couldn’t be changed to have a hard Real Time design. That is, all design decisions are made to make sure the system behaves deterministicly, but treat all bugs like any other operating system.

Try the best to be bug free, but don’t worry about proving that the system actually is.

Now some critics of this approach to bring Real Time to the Linux kernel would call it “Soft Real Time.” Soft Real Time has a completely different meaning. It means that the average response time and execution is deterministic, but the system wont fail on a few outliers. A good example of a Soft Real Time environment is video display.

Having 30 frames a second allows someone to watch a video nicely. If one or two frames are missed, the user will most likely never notice. The system only fails if there is a large number of missed frames. Hence, this would be consider a case to use a Soft Real Time system. The difference between the work we were doing to make Linux into an RTOS vs having it be a Soft RTOS is that missed deadlines or dropped frames would be considered a bug. We treated such cases as a system failure which required a fix. If it happened due to a design, then the design needed to be updated. Hence, the term Hard Real Time design.

The work to make Linux into a hard Real Time designed kernel has been going on for over a decade now. It’s the largest out of tree patch series that has been constantly maintained for this long of a period. People always ask when will it finally be merged into the Linux proper. The answer to that is surprising, it already has. It’s been merged multiple times. But it is going in very slowly. A deterministic operating system has lots of requirements. Luckily, these requirements are also very useful for any general purpose operating system. Here’s an example list of items that have been merged into the upstream Linux kernel that were a direct result of the Real Time kernel work:

  • The mutex implementation
  • Lockdep: The deadlock detector analyzer
  • High resolution timers: Before this, timers would only go off at “ticks”
  • The generic interrupt handler code: Before this, every architectre basically cut and pasted another architecture’s work. This left lots of duplicate code, and worse yet, duplicate bugs. A bug fixed in one architecture would still exist in many others.
  • Interrupt threads: Used to interrupt the CPU to let it know that you hit a key on the keyboard or that a packet came in on the network card, etc. Stops the kernel asynchronously to do some given work for a device. Some devices such as hard drives can take a large amount of time to handle.Converting them to threads that can schedule allows user tasks to continue without much jitter. This gives a smoother user experience on the desktop, as well.
  • Priority inheritance: This requires a blog in itself.
  • Tracing – ftrace and by extension perf were efforts by the Real Time developers to find origins of latency.
  • Updates to the Real Time scheduler
  • Simple wait queues: These are light weight and less expensive to execute. Not to mention, they do not hold onto locks for a non-deterministic amount of time.
  • “NO HZ FULL” or tickless execution: This actually never existed in the RT patch set, but since the necessary changes were already in mainline the RT developers decided to work on this feature directly on the mainline Linux kernel.
  • SCHED_DEADLINE: This scheduling class, like NO HZ FULL, was also added directly into mainline Linux, but it would never have been added without the coming of full RT.

This doesn’t include countless number of bugs that were fixed because of the RT patch. RT kernels can cause more race conditions to trigger due to their higher preemptive nature. The RT patch was able to trigger bugs that would only happen on large CPU count machines. The RT patch helped make Linux work on those machines before they even existed.

Today, mainline Linux is much more suitable to work as a soft real time system. With some of the hard real time design added to the kernel, tasks can now make the majority of their deadlines, with much fewer outliers.

Of course, there’s still work to be done. That really comes down to two more features. One is called local locks. These will be used to replace places that disable preemption. Disabling preemption in the kernel is a black box.

Adding local locks will add annotation to these locations and give developers an idea to why preemption was disabled.

The other change that is needed is the big one. That is the conversion of spinning locks into sleeping locks (or mutex). This is the key to allow high priority tasks to preempt lower priority tasks and have quick reaction times. When this gets into the kernel, then we can say that Linux kernel itself can be a hard real time designed operating system.