This page documents improvements to the i965 driver that we would like to make in the future (time permitting). To see things that we don't intend to fix (e.g. known hardware bugs), see I965Errata.

gen6 (Sandy Bridge) and newer

Use SSA form for the scalar backend (hard)

At some point, we want to use SSA form for the scalar backend. Some thoughts on that have been collected at I965ScalarSSA.

Improve performance of ARB_shader_atomic_counters

In the fragment shader, if all channels performing an atomic add target the same address, then doing a single atomic add of the number of active channels and deriving each channel's result from the returned value should be more efficient than asking the hardware to perform each channel's atomic operation individually (even though there is only the one SEND instruction for the atomic operation either way).
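The idea can be sketched in plain C. This is an illustrative model only, not driver code: the atomic is simulated, and all names here are made up for the example.

```c
#include <stdint.h>

static uint32_t counter;          /* stands in for the atomic counter buffer word */

static uint32_t atomic_add(uint32_t *addr, uint32_t v)
{
   uint32_t old = *addr;          /* real hardware would do this atomically */
   *addr += v;
   return old;
}

/* Naive path: one atomic operation per active channel. */
static void per_channel(uint32_t exec_mask, uint32_t results[8])
{
   for (int ch = 0; ch < 8; ch++)
      if (exec_mask & (1u << ch))
         results[ch] = atomic_add(&counter, 1);
}

/* Coalesced path: a single atomic add of the active-channel count; channel
 * i's result is the returned base plus the number of active channels
 * below i. */
static void coalesced(uint32_t exec_mask, uint32_t results[8])
{
   uint32_t n = 0;
   for (int ch = 0; ch < 8; ch++)
      if (exec_mask & (1u << ch))
         n++;

   uint32_t base = atomic_add(&counter, n);
   uint32_t before = 0;
   for (int ch = 0; ch < 8; ch++)
      if (exec_mask & (1u << ch))
         results[ch] = base + before++;
}
```

Both paths hand out the same per-channel values and leave the counter in the same state, but the coalesced path issues one atomic instead of up to eight.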

Improve code generation for if statement conditions (easy)

Something like "if (a == b && c == d)" produces:

cmp.e.f0(8)     g32<1>D         g6<8,8,1>F      g31<8,8,1>F     { align1 WE_normal 1Q };
cmp.e.f0(8)     g34<1>D         g5<8,8,1>F      g33<8,8,1>F     { align1 WE_normal 1Q };
and(8)          g35<1>D         g32<8,8,1>D     g34<8,8,1>D     { align1 WE_normal 1Q };
and.nz.f0(8)    null            g35<8,8,1>D     1D              { align1 WE_normal 1Q };
(+f0) if(8) 0 0                 null            0x00000000UD    { align1 WE_normal 1Q switch };

when it would be better to produce something like:

cmp.e.f0(8)           g32<1>D         g6<8,8,1>F      g31<8,8,1>F     { align1 WE_normal 1Q };
(+f0) cmp.e.f0(8)     g34<1>D         g5<8,8,1>F      g33<8,8,1>F     { align1 WE_normal 1Q };
(+f0) if(8) 0 0                       null            0x00000000UD    { align1 WE_normal 1Q switch };

Recognize DPH patterns and generate them (moderate).

Right now we have ir_binop_dot, which is translated to a DP2, DP3, or DP4 instruction in brw_vec4_visitor.cpp. But if you've got a DP4 where one of the channels is 1.0 (or, equivalently, a DP3 plus adding one channel of one of the two vectors to the result), then we could avoid loading the immediate 1.0 by using the DPH opcode.

While it may not be a complete solution, you can probably get some positive results in shader-db by making a new ir_binop_dph ir_expression operation (see src/glsl/README and git log on ir.h to find some other examples of what goes with new expression operations), recognizing the pattern in ir_algebraic.cpp (since that's the easiest place we have for doing changes like this), then adding brw_DPH to the generator.

Return-from-main using HALT (easy)

Right now when there's a "return" in the main() function, we lower all later assignments to be conditional moves. But, using the HALT instruction we can tell the hardware to stop execution for some channels until a certain IP is reached. We use this for discards to have subspans stop executing once they're discarded (for efficiency), and we could do basically the same thing on a channel-wise basis for return-from-main. Take a look at FS_OPCODE_DISCARD_JUMP, FS_OPCODE_PLACEHOLDER_HALT, and patch_discard_jumps_to_fb_writes().

Loop invariant code motion (hard)

When there's a for loop like

for (int i = 0; i < max; i++) {
    result += texture2D(sampler0, offsets[i]) * texture2D(sampler1, vec2(0.0));
}

It would be nice to recognize that texture2D(sampler1, vec2(0.0)) doesn't depend on the loop iteration, and pull it outside of the loop. This is a standard compiler optimization that we lack.
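The transformation itself is easy to demonstrate in plain C. Here `expensive_lookup()` stands in for the loop-invariant `texture2D(sampler1, vec2(0.0))` call; the names and the instrumentation are made up for the example:

```c
static int lookups;                 /* counts calls to the "texture fetch" */

static float expensive_lookup(float x)
{
   lookups++;
   return x * 2.0f;                 /* stand-in for texture2D(sampler1, ...) */
}

/* Before LICM: the invariant lookup re-executes every iteration. */
static float sum_naive(const float *offsets, int max)
{
   float result = 0.0f;
   for (int i = 0; i < max; i++)
      result += offsets[i] * expensive_lookup(1.0f);
   return result;
}

/* After LICM: the invariant lookup is hoisted and executes once. */
static float sum_hoisted(const float *offsets, int max)
{
   float invariant = expensive_lookup(1.0f);
   float result = 0.0f;
   for (int i = 0; i < max; i++)
      result += offsets[i] * invariant;
   return result;
}
```

Both forms compute the same result, but the hoisted form performs one lookup instead of `max` of them, which for a texture fetch is a substantial saving.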

Hardware conditional rendering on gen7+ (easy).

We don't know of any apps using conditional rendering, but this could be a win for any that do. Check for the "Predicate Enable" bit in the 3DPRIMITIVE command docs, and replace the _mesa_check_conditional_render() usage with setting up the predicate and flipping that 3DPRIMITIVE bit.

32-wide dispatch for regular fragment shaders (hard)

This might involve double-emitting each operation at the LIR level, or might involve making 32-wide (4-register instead of 2-register) vgrfs. Both seem like big changes.

Experiment with VFCOMP_NO_SRC in vertex fetch.

Right now, if a VS takes two vec2 inputs (say, a 2D position and 2D texcoord), it will get put in the VUE as two vec4s, each formatted as (x, y, 0.0, 1.0).

The VUE could be shrunk if we could notice that and pack the two into one VUE slot, using VFCOMP_NO_SRC in the VERTEX_ELEMENT_STATE to write half of the VUE slot with each vec2 input. This is assuming VFCOMP_NO_SRC works like we might hope it does (the "no holes" comments are concerning).

[Note that on ILK+ the destination offset in the VUE is no longer controllable, so only things which can share a VERTEX_ELEMENT_STATE can be packed. -chrisf]

Use vertex/fragment shaders in meta.c (easy)

This is partially done now, but using fragment and vertex shaders for metaops lets us push/pop less state and reduces the cost to mesa and the 965 driver of computing the resulting state updates.

Full accelerated glBitmap() support. (moderate)

You'd take the bitmap and upload it to a texture and put it in a spare surface slot in the brw_wm_surface_state.c-related code. Use meta.c to generate a primitive covering the area to be rasterized by glBitmap(). Set a flag in the driver across the meta.c calls that we're doing bitmap, then in brw_fs_visitor.c when the flag is set you'd prepend the shader with a texture sample from the bitmap and a discard.

Full accelerated glDrawPixels() support (moderate).

Like the glBitmap() above, except you're replacing the incoming color instead of doing a discard.

Full accelerated glAccum() support. (easy)

Using FBOs in meta.c, this should be fairly easy, except that we don't have tests.

Full accelerated glRenderMode(GL_SELECT) support (moderate).

This seems doable using meta.c and FBO rendering to pick the result out.

Full accelerated glRenderMode(GL_FEEDBACK) support. (hard)

This would involve transform feedback in some way.

Trim down memory allocation.

Right now running a minimal shader program takes up 24MB of memory. There's a big 16MB allocation for swrast spans, then some more 1MB or so allocations for TNL, and 1.5MB for the register allocator, then a bunch of noise.

On a core/GLES3 context, we skip the swrast and tnl allocations, but most apps aren't core apps. If we could delay the swrast/tnl allocations until needed, that would save people a ton of memory. The bitmap/drawpixels/rendermode tasks above are motivated by making it possible to not initialize swrast at all.

Pre-gen6 (Iron Lake and older)

Port fast color clears from gen6+ to gen5/4 (moderate).

While it's a very minor win itself, this would be the first step in getting blorp ported so that we can do efficient glBlitFramebuffer() and glCopyTexSubImage() on older gens.

Port GL_ARB_blend_func_extended to gen4/5 (easy).

While we don't use it in our 2D driver or cairo-gl yet (no glamor support), it should be a significant win when we do.

Port GL_EXT_transform_feedback to gen4/5 (hard).

You have to run the geometry shader and have it write the vertices out to a surface, like the gen6 code does. [You also have to do the accounting yourself, as the SVBI hardware support only exists on gen6+ -chrisf]

Port HiZ support to gen5 (hard).

It should be worth a 10-20% performance boost on most apps, but expect a lot of work in getting piglit fbo-depthstencil tests working.

Use transposed URB reads on g4x (moderate).

This would cut the URB space needed between the SF and WM stages, allowing more concurrency. See the g45-transposed-read branch of ~anholt/mesa.

Port ARB_uniform_buffer_objects to gen4/5 (moderate)

The gen6 code generation may just work on gen4/5, but there will probably be a little bit of work to get the brw_wm_surface_state.c code updated.

Port ARB_texture_buffer_object/ARB_texture_buffer_object_rgb32/ARB_texture_buffer_range (easy)

The gen6 code generation may just work on gen4/5, but there will probably be a little bit of work to get the brw_wm_surface_state.c code updated.