"Constant" Branching and Loops
I assume that everyone is aware of what branching (if, else, etc.) and loops (for, while, etc.) are. This first category makes its decisions based on constants, and the advantage of constants is that they hold the same value for a large set of vertices. This means that the behaviour of the Vertex Shader is identical for that whole set of vertices, which works very well with the SIMD nature of the Vertex Shader. Branching and loops can quite simply be implemented using instructions that conditionally change the program counter. Since the values used for branching and looping are constant over a large number of vertices, there is no problem with multiple vertices being processed by the same pipeline to hide the latency impact.
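As a purely illustrative sketch (the struct and field names below are invented, not any actual shader API), here is roughly what constant-based looping amounts to in C: the loop count comes from a constant register that holds the same value for every vertex in the draw call, so the whole batch walks the loop the same number of times.

    /* Hypothetical sketch only: "num_lights" and the structs are made up. */
    #define MAX_LIGHTS 8

    typedef struct { float position[4]; float colour[4]; } Vertex;

    typedef struct {
        int   num_lights;                  /* constant register: same value  */
        float light_colour[MAX_LIGHTS][4]; /* for every vertex in the batch  */
    } Constants;

    void shade_vertex(Vertex *v, const Constants *c)
    {
        /* Constant-based loop: every vertex runs exactly c->num_lights
         * iterations, so one program counter can serve the whole batch.    */
        for (int i = 0; i < c->num_lights; i++)
            for (int k = 0; k < 4; k++)
                v->colour[k] += c->light_colour[i][k];
    }

Things get tricky when the Vertex Shader has to allow…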
Dynamic Branching and Loops
This second category makes its branching/looping decisions based on the contents of temporary registers, and these registers can contain completely different values for each vertex. For example, we can decide to execute a different number of loop iterations based on the position of a vertex; since the position of each vertex we process is different, the number of loop iterations executed for each vertex can also differ. This kind of behaviour breaks SIMD: remember that SIMD stands for Single Instruction Multiple Data, and while we still have multiple data elements (vertices), we no longer necessarily have the same Single Instruction. Supporting dynamic branching and looping means that ideally every vertex requires its own program counter, since different vertices execute different paths through the program (more or fewer loop iterations, branches taken or not). For example, take the following program:
    Loop (X)
      A
    End Loop
    If (Y)
      B
    Else
      C
    End If
    D
Depending on the values of X and Y, this program can result in different code being executed. Some of the possibilities are:
    AAABD
    AAAACD
    ABD
    CD
    …
The basic issue is that each vertex can execute different instructions (code block B or C) and a different number of instructions (block A can be executed zero, one, two… N times).
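To make this concrete, here is a hypothetical C sketch of such a shader (the position-based iteration count and the threshold are invented purely for illustration): the loop count X and the condition Y are both derived from the vertex’s own data, held in temporary registers, so they can differ for every vertex in the batch.

    /* Hypothetical sketch only: the iteration count and the test are invented. */
    typedef struct { float position[4]; float colour[4]; } Vertex;

    void shade_vertex_dynamic(Vertex *v)
    {
        /* X: loop count derived from per-vertex data (a temporary register) */
        int x = (int)(v->position[2] * 4.0f);

        for (int i = 0; i < x; i++) {
            /* ... code block A ... */
        }

        /* Y: condition also derived from per-vertex data */
        if (v->position[1] > 0.0f) {
            /* ... code block B ... */
        } else {
            /* ... code block C ... */
        }

        /* ... code block D: always executed ... */
    }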
This behaviour causes problems with the SIMD structure of the Vertex Shader that we have built. The ideal solution is to break away from SIMD and allow every vertex processed to do its own thing - but the cost associated with this is huge.
First of all we’d need multiple program counters, multiple instruction decoders and multiple read ports into instruction memory, but all of this pales in comparison with the complexity of the control logic. The first vertex processed might be complex and require hundreds of instructions, while the following vertices might be quite simple and require only tens of instructions. This results in a complete imbalance in the pipeline structure, and keeping track of which vertex is complete and which is still being processed becomes a nightmare… remember that at the back-end data is fed into primitive building, and if the first vertex takes a long time all the other vertices have to wait until that first vertex is done and its triangle can be built. While not impossible to support, the complexity is huge and would translate into considerably more silicon being required.
So is there no other solution that supports dynamic branching but maintains the fairly simple control logic and structure of the SIMD model? To avoid complicating the control logic we have to remain in SIMD mode, meaning that every vertex in the pipeline processes the same instruction stream. Due to the dynamic branching this means that some vertices will be pushed through instructions that are not valid for them, and those instructions have to be disabled for those vertices. Assume that to absorb latency we push 4 vertices into the processing pipeline. These 4 vertices might have to execute the following instructions:
    Vertex 1: AAABD
    Vertex 2: ABD
    Vertex 3: AACD
    Vertex 4: BD
Since the design is SIMD, the same instructions are applied to all 4 vertices. Code block A has to be executed 3 times to allow Vertex 1 to be processed correctly, even though Vertices 2, 3 and 4 do not require this. Code blocks B and C both have to be executed since each is selected by at least one of the vertices being processed. Code block D is unconditional and thus has to be executed once for every vertex. This means the following code path is executed for all 4 vertices:
    AAABCD
To make sure each vertex produces the correct result some of these instruction blocks have to be enabled (E) / disabled (D):
              AAABCD
    Vertex 1: EEEEDE
    Vertex 2: EDDEDE
    Vertex 3: EEDDEE
    Vertex 4: DDDEDE
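One way to picture this masking in software – a minimal sketch, not a description of how real hardware is built – is a loop in which all four lanes step through the common path AAABCD while a per-lane enable bit, taken from the table above, decides whether each instruction’s result is actually committed:

    #include <stdio.h>

    #define LANES 4

    int main(void)
    {
        const char *path = "AAABCD";          /* common path for the batch   */

        /* Per-lane enable masks, one row per vertex (from the table above):
           '1' = enabled (E), '0' = disabled (D).                            */
        const char *mask[LANES] = {
            "111101",   /* Vertex 1: AAABD */
            "100101",   /* Vertex 2: ABD   */
            "110011",   /* Vertex 3: AACD  */
            "000101",   /* Vertex 4: BD    */
        };

        float reg[LANES] = { 0 };             /* one result register per lane */

        for (int step = 0; path[step] != '\0'; step++) {
            char op = path[step];             /* fetched and decoded once...  */
            for (int lane = 0; lane < LANES; lane++)
                if (mask[lane][step] == '1')  /* ...committed only on enabled */
                    reg[lane] += (float)op;   /* lanes (stand-in for real work) */
        }

        for (int lane = 0; lane < LANES; lane++)
            printf("Vertex %d result: %.0f\n", lane + 1, reg[lane]);
        return 0;
    }

The key point is that each instruction is fetched and decoded once per step regardless of how many lanes actually need it; the steps where a lane sits disabled are the wasted cycles.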
It’s pretty clear that this mechanism is fairly inefficient if the vertices processed in the same batch follow very different code paths. The advantage of this technique is its lower complexity, reduced silicon area and chip cost, but it comes with potentially low efficiency, meaning that cycles are "wasted" executing instructions that are not needed.
Usually vertices follow similar code paths, so things will usually not be quite as bad as illustrated here – but developers should be aware of the potential performance impact of using dynamic branching. Where possible, static/constant-based branching should be used.
Another thing to keep in mind is that SIMD is usually extended across multiple shader pipelines: not only do you have a number of vertices being processed by the same pipeline at the same time to hide latency, you may also have 4 of these pipelines hooked into the same SIMD model, which increases the dependencies and further reduces efficiency when executing dynamic branching.
It’s essential that developers are aware of the potentially huge impact on performance when using dynamic branching. Whenever possible use static branching – only resort to dynamic branching if the branch is truly dynamic and different for almost every vertex processed.