X performance is extremely difficult to quantify, and there is a large amount of work to be done in this space. The following are some collected notes on what needs doing.
Discussion and questions about work in these areas should be held on the xorg-devel mailing list.
Transport Performance
In General
X is a fairly compact protocol, with the exception of image transport, which is done uncompressed. The dominating factor in many aspects of X performance is transport latency: how long it takes for a request and its response to make a round trip. This is exacerbated by the design of Xlib, which is effectively synchronous for most of its operation.
XCB is an effort to rearchitect the network layer of X to hide latency by providing an asynchronous interface to the protocol. It does not make the network any faster, but it does allow the application to do more work in parallel with the network. This requires modification of the application to make much of a difference, but can be dramatically faster when effective. xlsatoms became about 50x faster in some modes when converted to XCB; most real applications will see less benefit than this.
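The difference between waiting on every reply and overlapping the round trips can be illustrated with a back-of-the-envelope model. This is only a sketch, not XCB or Xlib code: the function names, the 1 ms round trip, and the 10 µs cost to issue a request are assumptions chosen for illustration.

```python
def synchronous_time(n_requests, round_trip, issue_cost):
    """Total time when each request waits for its reply before the next is sent."""
    return n_requests * (issue_cost + round_trip)


def pipelined_time(n_requests, round_trip, issue_cost):
    """Total time when all requests are issued first and replies are collected later.

    The round trips overlap, so only one full round trip is paid after the
    last request has been issued.
    """
    if n_requests == 0:
        return 0.0
    return n_requests * issue_cost + round_trip


if __name__ == "__main__":
    # Assumed numbers: 200 requests, 1 ms round trip, 10 us to issue a request.
    n, rtt, issue = 200, 1e-3, 10e-6
    print("synchronous:", synchronous_time(n, rtt, issue))
    print("pipelined:  ", pipelined_time(n, rtt, issue))
```

With these assumed numbers the synchronous case takes roughly 0.2 s and the pipelined case roughly 0.003 s. The gain shrinks quickly once the application needs each reply before it can form the next request, which is why real applications usually see less benefit.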
Over the Network
While XCB addresses the latency requirements of X, there is no great solution to the bandwidth requirements. Several good (and not so good) solutions do exist.
- The SSH suite allows for tunnelling X sessions over a secure channel, which can also be compressed. This is adequate for technical users, but it is suboptimal for terminal servers and novice users.
- The LBX protocol is generally inadequate for modern X usage. See the linked paper for details. LBX is no longer built by default in 7.1.
- Xpra acts as a compositing window manager to forward individual windows from a virtual server to the client that connects to it (over sockets, TCP or SSH).
- NX is a software compression suite based on the earlier DXPC protocol. It also leverages SSH to provide security services. There are open issues regarding integrating it into the base suite, licensing among them: much of NX is LGPL, but version 4 is now closed-source.
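To get a feel for why bandwidth matters when images are sent uncompressed, the arithmetic can be sketched as follows. The image size, pixel depth, and link speed are assumptions for illustration only, and the model ignores protocol overhead and latency.

```python
def image_bytes(width, height, bytes_per_pixel=4):
    """Size of an uncompressed image in bytes (no padding assumed)."""
    return width * height * bytes_per_pixel


def transfer_seconds(n_bytes, link_bits_per_second):
    """Time to push n_bytes over a link, ignoring latency and overhead."""
    return n_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    # Assumed example: a 1920x1080 image at 4 bytes per pixel over 100 Mbit/s.
    size = image_bytes(1920, 1080)
    print(size, "bytes")
    print(transfer_seconds(size, 100e6), "seconds")
```

Under these assumptions a single full-screen image is about 8.3 MB and takes roughly two thirds of a second to send, which is why compression or per-window forwarding is attractive on slow links.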
On the Local Machine
Several commercial X servers use shared memory transports on the local machine to improve performance. Rik Faith researched shared memory transports for the DRI project several years ago. The conclusion then was that it would improve performance for those operations where the time to render the request was dominated by the transport latency, and even then by less than 10%.
This may not be true anymore. The balance of the typical machine's memory architecture has shifted, and many operating systems provide advanced high performance synchronization primitives (like futexes on Linux) that may address some of the sync overhead he experienced. This would be an excellent research project.
Driver Performance
2D Rendering
The Important Bits
Most of the core X protocol's rendering routines simply do not get used very often. This is not so much because they are slow, but because they aren't useful on the modern desktop. Empirically, better than 90% of the drawing operations that X sees today are solid fills, blits, and Render operations. Accelerating 2D operation outside this set is important for legacy applications, but in many cases only needs to be "good enough".
XAA
Once upon a time there was an acceleration architecture called XAA. It was fine for 2000, but was largely inadequate by 2005.
- XAA went to great lengths to accelerate operations like patterned fills and Bresenham lines, which are rarely used.
- XAA's support for accelerating the Render extension was poor, because the design of XAA's memory manager only allowed for offscreen pixmaps in exactly the format of the displayed screen. (There was also a special case for 8-bit alpha sources, i.e., fonts.)
- Render acceleration in XAA only worked with the source image in host memory and the destination in card memory; to be truly performant it needs to be able to handle the case where both source and destination are in card memory.
- XAA would attempt to upload pixmaps to card memory optimistically, which made the above point worse. This was disabled in 2008 to make xcompmgr performance vaguely tolerable, and provide more texture memory to DRI1-style 3D drivers for GL compositors.
XAA was removed from the X.org sample server in 2012. It was replaced by...
EXA/UXA
EXA started from the lessons learned from KAA, the kdrive acceleration architecture. The basic theme of KAA was to accelerate what a modern session actually used: solid fills, scrolls, and fonts. EXA (dating to 2005) is essentially a port of KAA to the xfree86 server design, and has since been enhanced to support kernel-side memory management, pixmap migration, and a few other things. UXA is a variation on the EXA theme that assumes a unified memory architecture and kernel memory management support.
The ExaStatus page contains the current driver support status.
EXA and its derivatives work well enough, but their implementations in drivers typically work by explicitly building GPU-specific rendering command sequences. To address this, we wrote...
Glamor
Glamor is an acceleration architecture that implements X rendering in terms of OpenGL. This allows the X server to leverage the GL driver already available in Mesa, meaning we only need to write the acceleration code once, and enable each new GPU's acceleration only in Mesa instead of also in the X driver.
This section, and glamor itself, are quite promising, but under construction. For more details see GlamorPerformance.
Framebuffer Layout
Most modern graphics cards can be run in either linear or tiled framebuffer modes. Linear modes are simple: you start in the top-left corner and move to the bottom-right, going all the way across a single row before changing rows. In tiled modes the framebuffer is broken up into a series of small tiles, usually 8x8 or so, and memory is laid out such that the first 64 pixels belong to the first tile, then the next 64 to the second tile, etc. You can think of a linear framebuffer as a tiled framebuffer where each tile is 1x1.
Tiled framebuffers have a performance benefit because they better model the layout of objects on the screen. They give better locality of reference because each tile is packed tightly in memory, where in a linear framebuffer you might have to skip a thousand pixels ahead to get to the same horizontal offset one line down. Since your spatial locality is better with a tiled framebuffer, your working set fits in your cache better.
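The two layouts described above can be sketched as address calculations. This is a simplified model: it assumes square tiles stored one after another in row-major order and a width that is a multiple of the tile size. Real hardware tiling formats vary and are often more elaborate, and the function names are illustrative only.

```python
def linear_offset(x, y, width):
    """Pixel index of (x, y) in a linear framebuffer that is `width` pixels wide."""
    return y * width + x


def tiled_offset(x, y, width, tile=8):
    """Pixel index of (x, y) in a framebuffer built from tile x tile blocks.

    Assumes `width` is a multiple of `tile`, tiles are stored left to right
    and top to bottom, and pixels inside a tile are stored row by row.
    """
    tiles_per_row = width // tile
    tile_index = (y // tile) * tiles_per_row + (x // tile)
    within_tile = (y % tile) * tile + (x % tile)
    return tile_index * tile * tile + within_tile


if __name__ == "__main__":
    width = 1024
    # Distance in memory between a pixel and the one directly below it.
    print(linear_offset(0, 1, width) - linear_offset(0, 0, width))
    print(tiled_offset(0, 1, width) - tiled_offset(0, 0, width))
```

In this model, stepping one line down costs a jump of a full row (1024 pixels here) in the linear layout but only 8 pixels inside a tile, which is the locality benefit described above. Setting `tile=1` reduces `tiled_offset` to `linear_offset`.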
Despite this, X's framebuffer core uses linear access, even if the framebuffer appears to be tiled from the GPU's perspective. There may be a performance benefit to making the system framebuffer shadow match the GPU's tile layout. (The wfb software renderer is designed to allow this, but no (open) driver is seriously using it at the moment.) On the other hand, there may not be: the CPU overhead of compensating for the tile layout might outweigh any cache locality benefit.
Framebuffer Access
In general, framebuffer reads absolutely kill performance; we try to do as much work as possible in the write direction only. When CPU readbacks are unavoidable, it is usually more performant to transfer large blocks of data in and out of framebuffer memory rather than operating on single pixels at a time.
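The block-transfer pattern can be sketched with a toy model. `FakeFramebuffer` is a hypothetical stand-in that simply counts transfers; it does not model real bus or cache behaviour, and the inversion operation is an arbitrary example of work that needs a readback.

```python
class FakeFramebuffer:
    """Hypothetical stand-in for framebuffer memory that counts transfers."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        self.data = bytearray(width * height)  # one byte per pixel
        self.reads = 0
        self.writes = 0

    def read(self, offset, length):
        """Copy `length` bytes out of the framebuffer (one transfer)."""
        self.reads += 1
        return bytearray(self.data[offset:offset + length])

    def write(self, offset, block):
        """Copy `block` into the framebuffer (one transfer)."""
        self.writes += 1
        self.data[offset:offset + len(block)] = block


def invert_per_pixel(fb):
    """Read, modify, and write back one pixel at a time."""
    for i in range(fb.width * fb.height):
        pixel = fb.read(i, 1)
        pixel[0] ^= 0xFF
        fb.write(i, pixel)


def invert_per_row(fb):
    """Pull a whole row into host memory, modify it there, write it back."""
    for y in range(fb.height):
        row = fb.read(y * fb.width, fb.width)
        for i in range(len(row)):
            row[i] ^= 0xFF
        fb.write(y * fb.width, row)
```

Both functions produce the same pixels, but the per-row version touches framebuffer memory once per row in each direction instead of once per pixel. How much that saves on real hardware depends on the cost of each transfer, which this sketch does not attempt to measure.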
Thrashing can occur when mixing operations that the hardware can accelerate with ops it can't. It remains an open question as to how to best deal with this. EXA and glamor take the attitude that the card can accelerate pretty much anything you throw at it, which seems pretty reasonable.
Algorithmic Issues
EXA's Render acceleration is adequate, but lacks support for a few things. External alpha is basically unaccelerated.
Trapezoid rasterisation in Render is not hardware accelerated. It almost certainly can't be done on fixed-function GL hardware. The software implementation has been reasonably well tuned, but could probably be better.
3D Rendering
DRI Drivers
TODO: Fill me in.
Mesa Core
The observation about tiling for 2D also applies to Mesa's software rasteriser, although by the time you're doing 3D in software you're already in a world of hurt.
Interactive Performance
Prior to 1.19, the X server was single threaded (in 1.19 and later, input runs on a thread instead of from the SIGIO handler). Any operation in the server that takes a significant amount of time to complete will make the server feel laggy. This is common for the Mesa software renderer and the software Render code, but any part of the server could trigger this in theory. We should work to maintain fast execution of all code paths, as the X protocol is fairly hostile to multithreaded implementations. The one exception is the Xinerama case, where we could reasonably want one thread per GPU.
One of the worst performance issues X has is making opaque resizes fast. Since the window manager is in a separate process from the application, there are two round trip cycles involved, which makes the latency issues described above worse. There are several possibilities for working around this. One is to move responsibility for window decorations into the client; gtk3 implements this. Another would be to load some portion of the window manager in-process with the X server.
Perceptual Performance
Most X drivers do not synchronize their drawing to the vertical retrace signal from the monitor. (To be fair, very few windowing systems do this consistently, even Mac OS X.) This leads to a tearing appearance on some drawing operations, which looks slow. If the vertical retrace signal could be exposed through the SYNC extension, applications could defer their rendering slightly and reduce or eliminate tearing. Doing so requires support in each driver, as well as a little support code in the server itself.
The un-Composited model of X operation requires many round trip operations to redraw areas when they are exposed (window move, etc.). Using a trivial compositor is almost always a more pleasant experience than uncomposited X.
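The "defer slightly" idea comes down to simple arithmetic once an application knows when the last retrace happened and how long a refresh period is. The sketch below assumes those two values are available from somewhere; how they would be obtained (for example through the SYNC extension) is exactly the missing piece described above, and the function name is hypothetical.

```python
def delay_until_next_retrace(now_us, last_retrace_us, period_us):
    """Microseconds to wait from `now_us` until the next vertical retrace.

    All arguments are integer microseconds. If `now_us` falls exactly on a
    retrace, this returns a full period, i.e. it waits for the following one.
    """
    elapsed = (now_us - last_retrace_us) % period_us
    return period_us - elapsed


if __name__ == "__main__":
    # Assumed example: a display refreshing every 16667 us (about 60 Hz),
    # with the last retrace observed at t = 0 and the current time 5000 us.
    print(delay_until_next_retrace(5000, 0, 16667))
```

An application using this would sleep for the returned delay (or schedule its drawing for that moment) so that its rendering lands just after a retrace rather than in the middle of scanout.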