The ultimate goal in user interface (UI) remoting is to make the remoted end-user experience as close as possible to local application execution. This is a challenging goal that becomes increasingly feasible as connection round-trip time (RTT) drops below 50 milliseconds. In addition, there is still much room for innovation in how to efficiently determine changed pixels on a server; encode, transport and present those pixels on the user device; and obtain user input in response.
VMware Blast is the VMware UI remoting technology in VMware Horizon. Blast uses standardized encoding schemes, including JPG/PNG and H.264 for pixel encoding, and Opus for audio. Unlike proprietary encoding schemes, these standard formats are supported natively, hence efficiently, in browsers and mobile devices.
Blast-JPG/PNG shipped in the fall of 2013 in support of browser clients and in early 2015 in support of Linux virtual machines. Blast-H.264 shipped in March 2016 with Horizon 7, as Blast Extreme, with feature and performance parity with PCoIP. Much has been written about Blast Extreme since. Here, we provide background and more in-depth technical details.
Good Fit for UI Encoding
In early 2011, a small team at VMware started developing the technology that became Blast. Early on, we established that H.264 is an effective encoding scheme for UI remoting, at a time when there was broad skepticism about the applicability of video technology to this use case. Our early experiments suggested otherwise; for example, a video from February 2011 demonstrated that H.264 is indeed a good fit for UI encoding, including for high-frequency text regions.
H.264 is an advanced compression scheme that is widely deployed. It has been in the works since the early 2000s, was standardized in 2003 and has a mature ecosystem of robust hardware and software encoders and decoders that have been through years of refinement and optimization and are broadly available.
In contrast to proprietary encoding schemes, leveraging a mature standard like H.264 has a long list of benefits.
- For example, hardware H.264 decoders are ubiquitous in phones, tablets, PCs, TVs, etc. Blast-H.264 leverages hardware decode for substantial reduction in both CPU load and battery drain, thus providing a significant benefit for mobile devices.
- Also, H.264 decoders are a standard component of systems on chip (SoCs), which paves the way for low-cost (below $100) Horizon Blast thin clients. VMware recently announced more than 70 Blast-certified clients at VMworld 2016.
- Furthermore, Blast-H.264 leverages hardware encode on NVIDIA GPUs for reduced server-side CPU load, and has the potential to do so on other GPUs. For example, even integrated GPUs from Intel come with dedicated H.264 encoders.
H.264 compression includes 1) advanced single-frame compression mechanisms, 2) facilities to reuse pixels from previous frames and 3) cumulative pixel compositing techniques for progressive refinement. Together, these give it a significant compression advantage over the encoding schemes in Horizon 6.
Figure 2 shows how the macroblocks (MBs) are encoded as a result of a user moving the Calculator app by a few pixels. The yellow MBs are unchanged from the previous frame (so they are skipped), the red MBs are intra-frame (i.e., they reference previously encoded regions of the frame being encoded) and the blue MBs are inter-frame (i.e., they reference a region in the previous frame). The encoder chooses intra- or inter-frame coding through “rate-distortion” optimization, where “rate” relates to the number of bits used and “distortion” relates to pixel quality.
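The mode decision described above can be sketched as minimizing a cost that combines distortion and rate. This is a minimal illustration of the common Lagrangian formulation (cost = distortion + lambda * bits); the names `Candidate`, `rd_cost`, `choose_mode` and `lam`, and all the numbers, are hypothetical and are not taken from the encoder Blast uses.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    mode: str          # "intra" or "inter" (hypothetical labels)
    distortion: float  # e.g. sum of squared errors against the source pixels
    bits: int          # estimated bits needed to code the MB in this mode


def rd_cost(c: Candidate, lam: float) -> float:
    # Lagrangian rate-distortion cost: lower is better.
    return c.distortion + lam * c.bits


def choose_mode(candidates, lam):
    # Pick the candidate with the smallest combined cost.
    return min(candidates, key=lambda c: rd_cost(c, lam))


# Made-up numbers for one macroblock.
candidates = [
    Candidate("intra", distortion=40.0, bits=220),
    Candidate("inter", distortion=55.0, bits=60),
]

best_low = choose_mode(candidates, lam=0.05)   # bits are cheap: quality wins
best_high = choose_mode(candidates, lam=1.0)   # bits are expensive: rate wins
print(best_low.mode, best_high.mode)
```

With a small `lam` the lower-distortion candidate wins; with a large `lam` the cheaper candidate wins, which is the balancing act the text refers to.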
In addition, H.264 is throughput-efficient due to its differential encoding scheme, where a frame is encoded as a temporal/spatial prediction plus a residual MB (RMB); see Figure 3. An encoded MB consists of a reference to a carefully selected prior pixel region (either intra- or inter-frame), followed by an encoding of the difference from the desired pixels as an RMB. The pixel region is selected to minimize the size of the RMB. Furthermore, Blast uses RMBs for progressive enhancement of pixel regions according to bandwidth availability.
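As a rough sketch of the prediction-plus-residual idea, the following toy code picks the reference block that leaves the smallest residual and shows that prediction plus residual reconstructs the target. It works on tiny 2x2 integer blocks and omits the transform, quantization and entropy coding that a real H.264 encoder applies to the residual; all function names and reference labels are hypothetical.

```python
def residual(target, prediction):
    # Per-pixel difference between the desired block and the prediction.
    return [[t - p for t, p in zip(trow, prow)]
            for trow, prow in zip(target, prediction)]


def reconstruct(prediction, res):
    # Decoder side: prediction + residual gives back the target.
    return [[p + r for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, res)]


def energy(res):
    # Sum of absolute differences, a simple proxy for residual size.
    return sum(abs(v) for row in res for v in row)


def best_reference(target, references):
    # Choose the reference (intra or inter) that minimizes the residual.
    name = min(references, key=lambda n: energy(residual(target, references[n])))
    return name, residual(target, references[name])


target = [[10, 12], [11, 13]]
references = {
    "intra_left": [[9, 9], [9, 9]],      # hypothetical intra prediction
    "inter_prev": [[10, 12], [11, 12]],  # hypothetical region of prior frame
}

name, res = best_reference(target, references)
print(name, res)
```

A perfect match leaves an all-zero residual, which costs almost nothing to encode.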
Our experiments encoding video content with Blast-H.264 vs. Horizon 6 showed that the former is about three times more efficient in throughput. Hence, if the connection is bandwidth bound, the user gets about three times the frame rate. Alternatively, if the connection is wideband (practically unlimited), Blast-H.264 delivers the same frame rate at about a third of the throughput; see Figure 4.
UI-Specific Encode Optimizations
For the mechanics of H.264 encoding, Blast uses a highly optimized and mature off-the-shelf solution; however, it adds a layer of optimization that is unique to UIs. The sequence of frames generated by a UI is quite different from typical video. Our early investigations indicated that exploiting these unique aspects of a UI offers significant opportunities to reduce the encoding cost and improve the quality of the encoded stream, as we discuss below. These optimizations target lower CPU utilization, throughput and UI latency. Also, to keep latency low, Blast uses only P frames that reference the prior frame. While this discussion focuses on H.264, these optimizations apply to other video technologies because they are built on constructs similar to those in H.264.
UIs exhibit high frame coherence (i.e., only a small portion of pixels varies from one frame to the next). For example, text input, scrolling and pull-down menus all result in localized pixel changes. So, the large portions of the screen that remain unchanged do not need to be reexamined by the encoder. For a given frame, Blast rapidly determines changed regions and plumbs the resulting change-map information into the encoder. Typical H.264 encoders do not take advantage of this optimization, since in common video content most pixels change in every frame.
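To make the change-map idea concrete, here is a minimal sketch that marks which 16x16 macroblocks differ between two frames. It uses a brute-force pixel comparison purely for illustration; the text does not say how Blast determines changed regions, and the frame representation (a list of rows of integer pixels) and the function name `change_map` are assumptions.

```python
MB = 16  # macroblock edge length in pixels


def change_map(prev, curr, mb=MB):
    """Return one boolean per macroblock: True if any pixel in it changed."""
    height, width = len(curr), len(curr[0])
    rows = (height + mb - 1) // mb
    cols = (width + mb - 1) // mb
    changed = [[False] * cols for _ in range(rows)]
    for y in range(height):
        if prev[y] == curr[y]:
            continue  # fast path: the whole scanline is identical
        for bx in range(cols):
            x0 = bx * mb
            if prev[y][x0:x0 + mb] != curr[y][x0:x0 + mb]:
                changed[y // mb][bx] = True
    return changed


# A 32x32 frame in which a single pixel changes.
prev = [[0] * 32 for _ in range(32)]
curr = [row[:] for row in prev]
curr[20][5] = 255

cmap = change_map(prev, curr)
print(cmap)
```

An encoder that receives such a map can mark unchanged macroblocks as skipped without examining them.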
H.264 encoding is dominated by motion estimation: the search through previously encoded pixels for a region that best predicts the MB being encoded. Guided by user input events (e.g., scroll wheel or arrow key events), Blast uses efficient bit-blit detection algorithms to identify the common case of large displaced rectangles of pixels (e.g., a scroll or a window move); see Figure 5. From a bit-blit determination, the motion vectors (represented as red lines) of a large number of MBs are deduced. Bit-blit detection yields MBs with a 100-percent match (i.e., with empty RMBs). Also, because motion vectors are differentially encoded and are identical across the blitted region, they are represented efficiently. This yields significant efficiencies in both processing and compression.
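The following sketch shows one simple way a vertical scroll could be detected and turned into motion vectors. It is an illustration under stated assumptions, not the algorithm Blast uses: rows are compared as opaque values, the candidate offsets stand in for hints derived from input events, and the names `detect_vertical_scroll`, `motion_vectors_for_blit` and `min_match` are hypothetical.

```python
def detect_vertical_scroll(prev_rows, curr_rows, candidate_offsets, min_match=0.9):
    """Return dy such that curr_rows[y] == prev_rows[y + dy] for most rows, else None."""
    height = len(curr_rows)
    best = None
    for dy in candidate_offsets:
        if dy == 0 or abs(dy) >= height:
            continue
        # Rows for which both y and y + dy fall inside the frame.
        ys = range(max(0, -dy), min(height, height - dy))
        matches = sum(1 for y in ys if curr_rows[y] == prev_rows[y + dy])
        ratio = matches / len(ys)
        if ratio >= min_match and (best is None or ratio > best[1]):
            best = (dy, ratio)
    return None if best is None else best[0]


def motion_vectors_for_blit(dy, mb_rows, mb_cols):
    # Every macroblock in the blitted region shares the same vector (0, dy).
    return [[(0, dy)] * mb_cols for _ in range(mb_rows)]


# Content scrolled by three rows; three new rows appear at the bottom.
prev_rows = ["line%d" % i for i in range(100)]
curr_rows = prev_rows[3:] + ["new0", "new1", "new2"]

dy = detect_vertical_scroll(prev_rows, curr_rows, candidate_offsets=[1, 2, 3, -1, -2, -3])
vectors = motion_vectors_for_blit(dy, mb_rows=2, mb_cols=3)
print(dy, vectors[0])
```

Because every vector in the region is identical, a differential encoding of the vectors is nearly free, and exact matches leave nothing to encode as a residual.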
Another important aspect of video encoding for UI remoting is that the encoding happens in real time. It targets a dynamic connection with the end-user device, with the challenge and the goal of lowering UI latency. The Blast encoder has many provisions for adapting to this situation. One of the most important decisions the encoder makes is how frequently to generate frames, so as not to overload the connection. Blast monitors the connection carefully, and has extensive code to model its state in order to emit frames just as the network and client device are ready to receive them. Generating frames faster than the network can transmit them, or faster than the client can consume them, results in image data going stale in connection queues.
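A very reduced model of this pacing decision is sketched below: it tracks how many bits are still queued on the connection and allows a new frame only once the estimated queueing delay falls under a threshold. The class `FramePacer`, its parameters and the fixed-bandwidth assumption are all hypothetical simplifications; Blast's actual connection model is described only as extensive.

```python
class FramePacer:
    def __init__(self, bandwidth_bps, max_queue_delay_s=0.05):
        self.bandwidth_bps = bandwidth_bps        # assumed known and constant here
        self.max_queue_delay_s = max_queue_delay_s
        self.queued_bits = 0.0
        self.last_time = 0.0

    def _drain(self, now):
        # Bits leave the queue at the link rate as time passes.
        elapsed = now - self.last_time
        self.queued_bits = max(0.0, self.queued_bits - elapsed * self.bandwidth_bps)
        self.last_time = now

    def can_send(self, now):
        # Emit a frame only if it would not sit long behind queued data.
        self._drain(now)
        return self.queued_bits / self.bandwidth_bps <= self.max_queue_delay_s

    def on_frame_sent(self, now, frame_bits):
        self._drain(now)
        self.queued_bits += frame_bits


pacer = FramePacer(bandwidth_bps=1_000_000)
ready_at_start = pacer.can_send(0.0)
pacer.on_frame_sent(0.0, frame_bits=100_000)   # 100 ms worth of data at 1 Mbps
ready_too_soon = pacer.can_send(0.01)          # about 90 ms still queued
ready_later = pacer.can_send(0.06)             # about 40 ms still queued
print(ready_at_start, ready_too_soon, ready_later)
```

Skipping a frame while the queue is full means the next frame that is sent reflects the latest screen content rather than stale data.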
This approach yields the notion of a “bit budget,” which represents the available capacity of the connection. In addition, Blast weights the bit budget across the different regions of the screen, depending on content (e.g., high-frequency regions like text carry higher weight).
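One way to picture bit budget weighting is a proportional split of the per-frame budget by area times content weight. The function `allocate_bit_budget`, the region names and the weights below are illustrative assumptions; the text does not specify how Blast computes or applies its weights.

```python
def allocate_bit_budget(total_bits, regions):
    """Split total_bits across regions in proportion to area * weight.

    regions maps a region name to (area_in_macroblocks, weight).
    """
    weighted = {name: area * weight for name, (area, weight) in regions.items()}
    total_weight = sum(weighted.values())
    if total_weight == 0:
        return {name: 0 for name in regions}
    return {name: int(total_bits * w / total_weight) for name, w in weighted.items()}


# Hypothetical frame: a small text region weighted 3x against a larger photo.
regions = {
    "text": (100, 3.0),
    "photo": (300, 1.0),
}
budget = allocate_bit_budget(120_000, regions)
bits_per_mb = {name: budget[name] / regions[name][0] for name in regions}
print(budget, bits_per_mb)
```

In this example the text region receives three times as many bits per macroblock as the photo region, even though both receive the same total.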
Furthermore, in bandwidth-restricted situations, Blast uses an elaborate technique for progressive refinement of pixel regions using RMBs. It uses age maps and knowledge of temporal masking to guide it on how to best prioritize the available bandwidth, with the goal of maximizing interactivity and minimizing the perception of intermediate image distortion.
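The prioritization described here can be sketched with an age map that records how many frames each region has been stable. Regions that changed very recently are left alone, on the assumption that temporal masking hides their distortion, and the longest-stable regions are refined first within the available bits. The data layout, the `min_age` threshold and the greedy selection are hypothetical and far simpler than what the text calls an elaborate technique.

```python
def pick_refinements(regions, budget_bits, min_age=2):
    """Choose which regions to refine this frame.

    Each region is a dict with keys: id, age (frames since last change),
    quality (0.0 to 1.0) and refine_cost_bits.
    """
    candidates = [r for r in regions if r["quality"] < 1.0 and r["age"] >= min_age]
    # Longest-stable first; among equals, the lowest quality first.
    candidates.sort(key=lambda r: (-r["age"], r["quality"]))
    chosen = []
    for r in candidates:
        if r["refine_cost_bits"] <= budget_bits:
            chosen.append(r["id"])
            budget_bits -= r["refine_cost_bits"]
    return chosen


regions = [
    {"id": "menu", "age": 10, "quality": 0.5, "refine_cost_bits": 4000},
    {"id": "video", "age": 0, "quality": 0.4, "refine_cost_bits": 3000},
    {"id": "text", "age": 5, "quality": 0.7, "refine_cost_bits": 5000},
    {"id": "icon", "age": 30, "quality": 1.0, "refine_cost_bits": 1000},
]

tight = pick_refinements(regions, budget_bits=8000)
roomy = pick_refinements(regions, budget_bits=10000)
print(tight, roomy)
```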
HTML5 Browser Clients
Blast uses two important HTML5 features: the <canvas> tag and WebSockets. The <canvas> tag provides a pixel-addressable graphics context, which can be used for drawing graphical primitives and compositing images. WebSockets provide a persistent, asynchronous network connection between the server and the browser, similar to a regular TCP connection. A critical feature of WebSockets is fully asynchronous communication (i.e., the ability to send data in either direction without a prior request). In the case of UI remoting, this allows interactive frame rates on high-latency connections.
Blast uses virtual channels to multiplex multiple data streams at different priorities between server and client. Clearly, time-sensitive data, such as pixel information and audio, is prioritized higher than bulk data, such as print and USB redirection.
To be perceptually acceptable, audio needs to be continuous, without breaks or stutter. This is a different scheduling requirement from that of pixel updates, where it is okay to drop frames and the main goal is to minimize latency. Audio data is transmitted at a higher priority than video, and video updates are time-sliced so that audio is serviced at the required temporal granularity.
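The prioritization and time-slicing described above can be sketched with a priority queue in which large payloads are split into chunks, so a newly arrived audio packet can be sent between two chunks of a pixel update. The channel names, priority values, chunk size and the `ChannelMux` class are hypothetical and are not Blast's virtual channel implementation.

```python
import heapq
import itertools

# Lower number means higher priority (illustrative values).
PRIORITY = {"audio": 0, "pixels": 1, "usb": 2, "print": 2}


class ChannelMux:
    def __init__(self, max_chunk=4):
        self._heap = []
        self._seq = itertools.count()  # keeps FIFO order within a priority
        self.max_chunk = max_chunk

    def submit(self, channel, payload):
        # Slice the payload so higher-priority data can interleave.
        for i in range(0, len(payload), self.max_chunk):
            chunk = payload[i:i + self.max_chunk]
            heapq.heappush(self._heap, (PRIORITY[channel], next(self._seq), channel, chunk))

    def next_chunk(self):
        if not self._heap:
            return None
        _, _, channel, chunk = heapq.heappop(self._heap)
        return channel, chunk


mux = ChannelMux(max_chunk=4)
mux.submit("print", b"PPPP")
mux.submit("pixels", b"VVVVVVVV")

sent = [mux.next_chunk()]          # first slice of the pixel update
mux.submit("audio", b"AA")         # audio arrives mid-update
sent.append(mux.next_chunk())      # audio goes ahead of the remaining pixels
sent.append(mux.next_chunk())
sent.append(mux.next_chunk())
print(sent)
```

Smaller chunks give audio finer-grained access to the connection at the cost of more per-chunk overhead.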
VMware will continue to focus innovation efforts on how to run an end-user application in the cloud and deliver a snappy UI on a variety of devices of the user’s choosing. Getting the remoted UI user experience close to that of local execution poses key challenges, but it also creates significant opportunities for moving more end-user workloads to the cloud and for realizing the benefits that come with cloud computing.
Keith Whitwell and Sandro Moiron, VMware End-User Computing, contributed to this blog.