Motivated by the response to the first driver, I decided to write another driver. This time, to access information like the Raspberry Pi’s board revision, MAC address, and temperature. Which, naturally, we will get from the Pi’s GPU.
This is all completely normal and there is absolutely nothing to worry about.
Out In The Wild
Over the past week or so, quite a few people have installed the resulting pimon driver, which you can download from my GitHub: https://github.com/thebel1/thpimon.
The output of the Python utility used to interface with the driver:
There are now a few great blog articles on how to use the driver to monitor your Pi’s temperature, and even load the data into Grafana:
Starting with The VideoCore Processor
The Raspberry Pi is a board with some connectors and IO pins soldered onto it, and a System-on-Chip (SoC) that forms the brains of the operation. The SoC model used in the Pi 4B is Broadcom’s BCM2711. The SoC contains an ARM Cortex-A72 CPU and (among other things) the VideoCore VI graphics processor. The VideoCore processor is responsible not only for graphics, but also for providing information about the SoC and board.
I’m not sure why it was done this way. Presumably the functionality was already present in the VideoCore, and adding it separately might have incurred unreasonable engineering costs for such a low-cost board.
Documentation, Documentation, Documentation
Or the lack thereof. While the datasheets for previous iterations of the VideoCore processor are floating around the internet, a datasheet for the Pi 4B’s VideoCore VI is still missing. Luckily, there is some documentation on the Raspberry Pi’s firmware wiki on how to communicate with the processor.
The basic process for communicating with the VideoCore is:
- write a buffer to some location in memory
- write the address of that buffer into a hardware register
- wait for confirmation that the request has been processed
- access the same buffer to acquire the response
Simple enough, really.
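To make the four steps concrete, here is a minimal sketch in Python. It does not touch real hardware: ToyMailbox is a made-up, in-memory stand-in for the VideoCore, and the names STATUS_EMPTY, write_register, read_register and mailbox_call are assumptions for illustration only. The toy "firmware" simply adds one to the request word.

```python
import struct

STATUS_EMPTY = 0x40000000  # assumed "nothing to read yet" bit of this toy's status word


class ToyMailbox:
    """In-memory stand-in for the VideoCore side; not real hardware access."""

    def __init__(self, memory):
        self.memory = memory  # bytearray playing the role of physical memory
        self.done = None      # address of a processed request, if any

    def write_register(self, address):
        # The toy "firmware" reads the request word and overwrites it in place.
        (request,) = struct.unpack_from("<I", self.memory, address)
        struct.pack_into("<I", self.memory, address, request + 1)
        self.done = address

    def status(self):
        return 0 if self.done is not None else STATUS_EMPTY

    def read_register(self):
        address, self.done = self.done, None
        return address


def mailbox_call(mailbox, memory, address, request):
    struct.pack_into("<I", memory, address, request)         # 1. write the buffer
    mailbox.write_register(address)                          # 2. hand over its address
    while mailbox.status() & STATUS_EMPTY:                   # 3. wait for confirmation
        pass
    if mailbox.read_register() != address:
        raise RuntimeError("reply does not belong to our buffer")
    (response,) = struct.unpack_from("<I", memory, address)  # 4. read the same buffer
    return response


memory = bytearray(64)
print(mailbox_call(ToyMailbox(memory), memory, 16, 41))  # prints 42
```

The point is only the shape of the exchange: the request and the response live in the same buffer, and the register carries nothing but the buffer's address.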
This mode of communication is called a mailbox interface, since it works like a mailbox. Well, sort of. Perhaps more like a pigeonhole. “Pigeonhole” doesn’t roll off the tongue quite as easily, though.
There are a few questions we need to answer:
- Where do we allocate the buffer?
- How do we build a request?
- How do we check whether it’s okay to read back the response?
- What sort of information can we access using this interface?
I was easily able to find the answer to the last question here. In theory, the same page also answers the second question. Frankly, I found the explanation rather confusing. Luckily, there are existing GitHub projects that contain code that communicates with the VideoCore, such as the Raspberry Pi UEFI project. The same project also shows how to tell when the response is ready to be read back.
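For the second question, here is a rough sketch of how a request buffer can be laid out. It is my reading of the property tag layout on the firmware wiki, not code from the driver: a size word, a request code, one or more tags, and a zero end tag, all as little-endian 32-bit words. The helper names build_request and parse_temperature are made up, and the sketch assumes the get-temperature tag (0x00030006) with a value in thousandths of a degree Celsius.

```python
import struct

TAG_GET_TEMPERATURE = 0x00030006  # tag ID as listed on the firmware wiki
REQUEST_CODE = 0x00000000         # second header word of a request
RESPONSE_OK = 0x80000000          # second header word of a successful response


def build_request(tag, value_size, value=b""):
    """Pack one property tag into a mailbox buffer (hypothetical helper)."""
    padded = (value_size + 3) & ~3                  # value buffer padded to 32 bits
    body = value.ljust(padded, b"\x00")
    tag_bytes = struct.pack("<III", tag, padded, 0) + body
    total = 8 + len(tag_bytes) + 4                  # header + tag + end tag
    return struct.pack("<II", total, REQUEST_CODE) + tag_bytes + struct.pack("<I", 0)


def parse_temperature(buffer):
    """Return degrees Celsius from a completed get-temperature buffer."""
    _, code = struct.unpack_from("<II", buffer, 0)
    if code != RESPONSE_OK:
        raise ValueError(f"firmware reported error code {code:#x}")
    tag, _, tag_code = struct.unpack_from("<III", buffer, 8)
    if tag != TAG_GET_TEMPERATURE or not tag_code & 0x80000000:
        raise ValueError("buffer does not hold a temperature response")
    _, millidegrees = struct.unpack_from("<II", buffer, 20)
    return millidegrees / 1000.0


# The request's value buffer holds the temperature ID (0) and room for the result.
request = build_request(TAG_GET_TEMPERATURE, 8, struct.pack("<I", 0))
print(len(request), request.hex())
```

In the real driver this buffer would sit in DMA-capable memory and be filled in by the VideoCore; here nothing is sent anywhere, so parse_temperature only makes sense on a buffer that already contains a response.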
So far, so good.
Direct Memory Access
The question remains: where do we allocate the buffer so that the VideoCore can access it? The answer is slightly trickier. The VideoCore uses Direct Memory Access (DMA) to access the buffer, meaning it circumvents the CPU. This comes with some constraints on the addresses the VideoCore supports. One of them is that the address needs to fit in 32 bits and be 16-byte aligned. The first part means the buffer must be located within the first 4GB of physical memory (2^32 bytes = 4GB). The second part, the 16-byte alignment, implies that the lower 4 bits of the address are zero. This is relevant, since those bits are used to store the VideoCore channel. Each channel provides access to different data about the Raspberry Pi.
There’s another constraint as well, which is that, based on my research, the VideoCore may only be able to access addresses under 1GB. Also, the memory needs to be physically contiguous, since the VideoCore will be accessing physical memory, not virtual memory.
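As a small illustration of the address constraints, the sketch below checks a physical address and folds the channel into its lower 4 bits. The function names mailbox_word and split_word are hypothetical, channel 8 is the property channel according to the firmware wiki, and the 1GB limit is the conservative assumption from my own research rather than a documented figure.

```python
PROPERTY_CHANNEL = 8  # property tags, ARM to VideoCore, per the firmware wiki
ONE_GIB = 1 << 30     # conservative upper bound assumed in this post


def mailbox_word(phys_addr, channel, limit=ONE_GIB):
    """Combine a buffer address and a channel into one 32-bit register value."""
    if phys_addr % 16:
        raise ValueError("buffer must be 16-byte aligned")
    if not 0 <= phys_addr < limit:
        raise ValueError("buffer lies outside the range assumed to be reachable")
    if not 0 <= channel <= 0xF:
        raise ValueError("channel must fit in 4 bits")
    return phys_addr | channel  # the low 4 bits are free thanks to the alignment


def split_word(word):
    """Recover (address, channel) from a value read back from the mailbox."""
    return word & ~0xF, word & 0xF


word = mailbox_word(0x3A000010, PROPERTY_CHANNEL)
print(hex(word), split_word(word))
```

Splitting the value read back from the mailbox the same way is what lets us check that a reply belongs to our buffer and our channel.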
Okay, so how do we meet these constraints?
The answer is to create a heap for use with DMA that matches the above constraints. Simple, right? Well, sort of.
Creating a DMA Heap
We can create a heap that satisfies the contiguity constraint and effectively satisfies the address range constraint:
How does the above ensure that the DMA buffer will be below 1GB? It doesn’t. However, what I’ve found out during testing is that the heap will be allocated beneath 1GB anyway. Not a great solution, but as long as this is the first driver loaded with this restriction, we should be fine.
To increase the probability that we won’t end up allocating over the 1GB threshold, I re-use the same allocation for each mailbox buffer:
As for the comment, I haven’t found documentation on whether the VideoCore is a bus master or not. I’m presuming it is, since I didn’t have to explicitly program the Pi’s DMA controller to communicate with the VideoCore. The BCM2711 datasheet was not super helpful in this area either. The bottom line is that I’m looking for a way to make this more robust in case other drivers that require sub-1GB space, e.g. for a DMA bounce buffer, are installed later on. From a practical perspective, this may not be strictly necessary, since the DMA aperture for non-bus masters can be moved to an arbitrary location up to 16GB.
Barriers and Cache Maintenance
One of the features that makes the Arm architecture distinct from x86 is its approach to memory ordering. While x86 employs a strongly ordered model, Arm is weakly ordered. This means that for interactions with peripherals, one must employ memory barriers to enforce memory ordering at relevant locations in the code. As per the BCM2711 datasheet, section 1.3, this should occur when switching between peripherals. Based on my research and some unreliability in my own driver, barriers should be placed around every read or write to ensure we are interacting with the correct version of the data.
Furthermore, the DMA access by the VideoCore is not cache-coherent. This means that the CPU’s caches are not updated when the VideoCore writes to memory, so we may get stale data when we read back the response to our mailbox request. As a result, we need to invalidate the cache for the buffer before we read the response from it.
An example of a barrier before and after a write to the mailbox buffer as well as a cache invalidation:
With all of the above in place, accessing the mailbox queue works like a charm. I’ve tested it myself tens of thousands of times and, so far, it has worked well out in the wild.
Say Cheese: Take a Screenshot
Since the VideoCore is a GPU, surely it’s possible to copy the framebuffer and save it somewhere. Well, it turns out, this does work. In fact, the screenshot of the Python utility’s output in the previous section was taken using the screenshot utility.
Acquiring the frame buffer works in much the same way as reading the temperature: using the VideoCore’s mailbox interface. Copying the frame buffer wasn’t difficult, and, in fact, neither was converting it into a bitmap. Thankfully, the bitmap file format specification is ancient and well documented. So, really, all I had to do was allocate a buffer of the correct size and shovel data into it. Once that was done, I could copy it back into user space so that Python could write it to disk.
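To give an idea of how little the bitmap format asks for, here is a sketch of the conversion in Python. It is not the driver's code (the driver builds the bitmap in C before copying it to user space). The function name framebuffer_to_bmp is hypothetical, and the sketch assumes a top-down frame buffer with no padding between rows and pixels already in the byte order BMP expects.

```python
import struct


def framebuffer_to_bmp(pixels, width, height, bytes_per_pixel=4):
    """Wrap raw top-down pixel rows in a minimal uncompressed BMP file."""
    stride = width * bytes_per_pixel
    if len(pixels) != stride * height:
        raise ValueError("pixel data does not match the given dimensions")
    padded = (stride + 3) & ~3  # BMP rows are padded to a multiple of 4 bytes
    rows = [
        pixels[y * stride:(y + 1) * stride].ljust(padded, b"\x00")
        for y in range(height)
    ]
    image = b"".join(reversed(rows))  # BMP stores the bottom row first
    offset = 14 + 40                  # file header + info header
    file_header = struct.pack("<2sIHHI", b"BM", offset + len(image), 0, 0, offset)
    info_header = struct.pack(
        "<IiiHHIIiiII",
        40, width, height,       # header size, dimensions in pixels
        1, bytes_per_pixel * 8,  # colour planes, bits per pixel
        0, len(image),           # no compression, size of the pixel data
        2835, 2835,              # nominal resolution in pixels per metre
        0, 0,                    # no palette
    )
    return file_header + info_header + image


# A 2x2 dummy frame stands in for data copied out of the driver.
bmp = framebuffer_to_bmp(bytes(range(16)), width=2, height=2)
print(len(bmp), bmp[:2])
```

Writing the returned bytes to a file opened in binary mode is all that is left to do on the Python side.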
Here’s a screenshot of my Raspberry Pi’s screen taken using the Python utility:
You can download the VIB from the fbuf branch and play with it on your own Pi: https://github.com/thebel1/thpimon/tree/fbuf