Dependent Instructions
A vertex shader program can contain any number of instructions up to a certain limit (128 instructions for VS1.1, higher for later versions). Now 2 instructions can be “dependent†or “independent†of each other. Let me explain this with a simplified example:
MUL R2, R0, R1 ADD R3, R2, C1 |
The above program multiplies 2 vectors, then adds a constant to the result of the multiply (ignore the existence of a MAD instruction). This is an example of 2 dependent instructions the ADD is dependent on the result of the MUL since the output of the first instruction is immediately used by the second instruction. This can cause a pipeline stall. Assume the following Vertex Shader pipeline:
Data Fetch => Arithm. 1 => Arithm. 2 => Arithm. 3 => Data Store |
On the first clock the data fetch is executed, during the following 3 clocks the pipelined arithmetic unit does its work, during the fifth clock the result is written to the destination register. Now remember with a pipelined structure the different stages work on different data. So during clock 1 the data fetch for the first instruction is done, during the second clock the pipeline wants to fetch the data for the second instruction… and that’s where things go wrong! The data fetch includes a register that is not valid yet since the previous instruction – which is still in the pipeline – still has to update it with the result. If the hardware handles this incorrectly the wrong data is fetched and the result of your vertex shader will be wrong, if the data dependency is detected and handled correctly the pipeline fetch has to stall until the 6th clock when the data has been written to the register… that would be 4 wasted cycles (3 because of the arithmetic unit and 1 because of the Data Store. Latency suddenly turned into something really evil. The stall time can be reduced by 1 clock by providing a quick loop back which links the arithmetic output directly to the input; this direct path saves the stall due to the Data Store. The stall is now 3 clocks equal to the latency of the arithmetic unit.
Does this now mean that the single stage complex arithmetic solution is actually better than the pipelined solution? Ideally we want the pipelined one since it allows high clock speeds, but how can we solve the nasty side effect of latency?
Hiding Latency
Latency becomes a problem when the output of one instruction is immediately used by the next instruction, with other words there is instruction dependency. The problem can be hidden by making sure that dependencies are separated by non-dependent instructions. For example:
A=B+C Z=D+E F=G+E H=D+G I=A+C |
The first and the last instruction are dependent, since the last instruction uses the result of the first instruction. But because 3 other instructions are placed between the dependent instructions the latency of the first instruction is effectively hidden. Those 3 instructions create the time needed for A to finds it way through the pipeline so that the input for the last instruction is immediately available. A smart compiler that is aware of the latency of the hardware design can do a smart kind of instruction re-ordering where instructions, whenever possible, are executed in a different order as originally planned to hide latency and increase efficiency.
Sounds like a good solution but… it’s not guaranteed to work in all cases. Is the compiler good enough to spot these problems and solve them? And in some cases it might simply be impossible to change the instruction order or there might not be enough instructions to be moved. Obviously this is an area where driver improvements over time can make a difference, if a piece of hardware suddenly becomes much faster in a shader benchmark then it most quite possibly means that the compiler was improved to handle the latencies in a better, more clever, way.
Ideally we want a solution that “always†works and does not require excessive amounts of compiler work (less work in the driver means a reduced risk of problems and bugs!). To solve latency we need to remove the dependency, now how can we make sure that the instructions pushed sequentially into the pipeline are “never†dependent on each other?
The answer is again SIMD… same instruction but different data… execute the same instruction but on completely different data, completely different data means it can never clash due to a data dependency. And for vertex shading different data translates to a “different†vertex. Rather than execute the program sequentially for a single vertex, and then go and do the same for the next vertex, why not execute the same program instruction for various vertices sequentially, like this:
Instr. 1 => Vertex 1 Instr. 1 => Vertex 2 Instr. 1 => Vertex 3 Instr. 1 => Vertex 4 Instr. 2 => Vertex 1 Instr. 2 => Vertex 2 Instr. 2 => Vertex 3 Instr. 2 => Vertex 4 ... Instr. n => Vertex 1 Instr. n => Vertex 2 Instr. n => Vertex 3 Instr. n => Vertex 4 |
This structure solves dependency; if instruction 1 and 2 are dependent the latency is no problem since the latency is hidden by working on data that is never dependent, data for different vertices.
While this solution is ideal for hiding the latency problems it comes with a large cost. The vertex shader, in the example above, works on 4 different vertices at the same time. This means that the shader has to keep track of the data flow of 4 vertices rather than just one. This means that the non-constant register count has to be multiplied by a factor 4. After all the shader now needs not one single R0 register but 4 of them, one for each of the 4 vertices in flight. The same is true for input and output registers, 4 different vertices means 4 sets of inputs and outputs. All of this translates into more silicon space, something that’s expensive. So again a hardware developer needs to weight the advantages against the disadvantages and decide: more silicon space and higher speed, or less silicon space – lower chip price – and also a bit less performance… again these different options result in vastly different Vertex Shader implementations and performance characteristics.