This page is a collection of thoughts on how to best do SSA form in the scalar backend IR on i965. There are a variety of different issues discussed here. I've tried my best to put them into a reasonable order.

General Considerations

There are a number of general things that will probably need to change in order to nicely handle SSA form.

  1. Register numbers. Right now, we "allocate" an integer name for each register and treat each value as if it comes out of a magical variable-length register file. For SSA values, we will probably want to do something similar to NIR where we just point back to the instruction that generates the value.

Validation

One of the nicest things in all of NIR is the validator. A validation layer cannot check that the shader is correct, but it can check that it is in the right form. Having a good validator has been a massive help in NIR and we really should repeat it in the backend.

Registers

Most of the interesting bits of SSA have to do with how you handle registers/values.

Intel GEN registers

On Intel graphics, we 128 registers each of which holds 32 bytes. These 32 bytes can be carved into 8 floats, 16 half-floats, 32 1-byte values, or 4 doubles. Since we are working on a SIMD architecture, each register corresponds to a single float value in the source language and the 8 float values in the register correspond to the 8 SIMD channels. Our architecture can also execute in SIMD16 and SIMD32 mode where you need 2 or 4 registers respectively for a single float value.

There are some cases (such as messages) where an instruction reads up to 14 consecutive registers. Therefore, we can't just allocate registers one-at-a-time. We can also indirectly index registers. This allows us to implement arrays by allocating a block of contiguous registers.

Our hardware also has a register regioning system that allows you to access up to two physical registers with a single instruction (so you can get 8 doubles or 16 floats).

More information about register regioning can be found in any Intel graphics PRM. The one for Ivy Bridge is pretty easy to navigate.

Register sizes

Every register has a concept of size. Hardware registers (physical or otherwise) and messages will be specified in terms of a number of physical registers. Other registers have a three-dimensional size; a number of bits, a width, and a length. The number of bits is fairly obvious: 32 for float/int, 16 for hf, 64 for double, etc. The width is the number of SIMD channels stored in the register.

Execution masks and interference

Most values, in particular anything in a GLSL variable, exist in a per-SIMD-channel way. These registers can be treated with the normal SSA-based interference rules. Consider the following example:

float a = 1.0;
if (foo) {
    b = 0.5;
    use(b);
} else {
    use(a);
}

If foo has different values for different SIMD channels, then the only channels of b that are set to 0.5 and used will be the channels where foo is true. Similarly, the only channels of a that will be used are the channels where foo is false. Therefore, we can freely assign a and b to the same register without any harm.

The situation is quite different if the assignment to b happens to all channels regardless of the execution mask. In this case, assigning a and b to the same register will cause the values we want in a to be squashed by the assignment to be so they have to be in separate registers.

For the sake of clearer discussion, will use the term SIMD-local to refer to a register that is only used in a per-SIMD-channel way and SIMD-global to refer to anything else.

Bit sizes and exec offsets.

These issues become even more interesting when you throw in quarter control and non-32-bit types. With these added complexities, the execution masks do not always apply to the same 32-bit slice of any given register. Instead, it may apply to 16-bit or 64-bit slices. For alternate quarter controls, you can have the second half of the mask apply to the given set of 32-bit slices instead of the first.

Register files

In the current backend IR's, we have a variety of different register files that indicate different things. This has worked pretty well, and we probably want to do the same type of thing in a new IR. I think we want the following register files:

  • ATTR: Push attribute registers. These registers will be pushed by the vertex fetching hardware and are considered to be "defined" at the start of the program. These are per-SIMD-channel and as such, have a width equal to the dispatch width.

  • UNIFORM: Push uniform registers. These registers store uniform push constants that are loaded by the hardware for use and are considered to be "defined" at the start of the program. These are uniform values and as such always have a width of 1.

  • HW_REG: An actual physical hardware register. These are needed for a variety of setup tasks. Usually, we only use g0 and g1, but there are other times when we need a particular physical register. Hardware registers don't really have a width per se but we do need some way of specifying what hardware registers get touched. Hardware registers will allow full register regioning. Reads/writes to physcal hardware registers cannot be reordered.

  • Allocated HW_REG: An allocated hardware register is similar to a hardware register in that it allows multiple writes, partial writes, full register regioning, and are SIMD-global. The primary difference between allocated and physical hardware registers is that allocated hardware registers get their actual register number assigned by the allocator. In order to allow for SSA-based allocation, any given allocated hardware register may only be used in a single basic block. (It will be considered to be defined at the first use/def and all other use/defs will be considered uses.) We may implement this by simply using a register number of -1 to indicate that it should be allocated.

  • MRF: Message registers. These are only ever defined by a LOAD_PAYLOAD type operation and used by send messages. On SNB-, these will be allocated out of a special pool of 16 registers (24 on SNB?). On IVB+, these will be allocated out of the same pool as the rest of the registers. Messages are special in that they are partially SIMD-local and partially SIMD-global. How will we handle this in the allocator? I have no idea.

  • IMM: Immediate values. We could do what NIR does and handle immediates by using a constant load instruction. This may be a good idea, but it doesn't map very well to the hardware. It's probably better to use a special register file/type for these.

  • FLAG: Boolean flag values. These store boolean-only SSA values and get allocated into one of the available flag registers. Since we have very few flag registers, we may have to "spill" a flag value to a regular SSA value by inserting a move-to-flag shortly before the uses of the flag value. However, we do want the ability to track flag values explicitly rather than the current "writes flag" boolean. There is a tricky bit here because the CMP instruction always writes the flag whether you want it to or not so we can't quite use SSA interference for this.

  • SSA: These are the regular values. They have a number of bits, a width (1, 8, 16, or 32), a quarter (first half, second quarter, etc.) and a length. In general, the width will be 1 for values that are uniform and 8, 16, or 32 for non-uniform values. Non-uniform value with a width less than the dispatch width will also have a specified quarter to denote which SIMD channels it corresponds to. Unlike hardware registers, partial writes and interesting regions are not allowed on SSA values. Also, SSA values have exactly one definition and this definition must dominate its entire live-range.

One more we may want, but I'm not sure what to do with it:

  • ACCUM: We also have an accumulator register. Some instructions just write to this register by default. If we use the accumulator value, we need to be very careful with re-ordering of instructions. I don't know if we want to track this as an SSA value or not, but we do need to track it.

Register Allocation

One of the big advantages to SSA in the backend IR is the ability to use SSA for doing register allocation. More than the allocation itself SSA-based allocation schemes gives you the ability to know the exact register pressure at any point in the program and make decisions based on it. The benefits of this apply to many areas of the backend compiler such as value numbering, scheduling, spilling, and many more. If we can, we really want an SSA-based scheme.

SSA-based allocation

The problem here is that most SSA-based register allocation schemes are fairly thoroughly CPU-focused. In our setting, there are some additional constraints:

  1. Register sizes. In the usual CPU setting, each register is a single atomic entity. In the GPU setting, however, we have to deal with allocating blocks of contiguous registers. This means that we have to solve a packing problem.

  2. SIMD masking. In the usual CPU setting, if two registers don't interfere in the SSA sense of interference, you can assign them to the same register. In the SIMD setting, SIMD-global registers interfere with more than just what the dominance tree tells you. Also, registers with different bit-sizes and different quarters can interfere even if they don't interfere in the usual SSA sense.

Connor and I (Jason) have discussed this quite a bit and have come up with two different schemes that seem to at least start to address these problems. However, the problem is very far from being solved.

In the mean time...

Before we actually solve the SSA-based register allocation problem, we can probably do better than we are today by using a hybrid approach. This would be to combine both SSA-based interference with the interval-based interference we use today. One way to do this would be as follows:

bool
regs_interfere(reg *a, reg *b)
{
    if (regs_ssa_interfere(a, b))
        return true;

    if (!regs_interval_interfere(a, b))
        return false;

    return a->width != b->width ||
           a->bits != b->bits ||
           a->quarter != b->quarter ||
           a->simd_global || b->simd_global;
}

This would get us an interference graph with fewer edges than the purely interval-based approach but would also allow us to avoid the sharp edges given by working on a SIMD architecture. We would then use the same register allocator that we have today.

The problem with this approach is, of course, that we wouldn't get the nice register pressure properties of real SSA-based register allocation. However, it may save us a few registers and gain us some SIMD16 programs.