Instruction Storage, Fetch and Decode

The instruction storage might at first glance look obvious and trivial to implement, but even for this part of the Vertex Shader there can be variations that result in performance differences.

As explained in a previous section, the instruction size depends on the functionality supported and the number of registers that can be addressed. Older Vertex Shader specs and hardware support less functionality and fewer registers, and as such the instruction size is small compared to hardware that supports the latest Vertex Shader specs. This difference in instruction size has an impact on the upload speed of programs. A simple program can be supported on both older and newer hardware, but due to the difference in instruction size there will be differences in upload time. Basically, newer hardware might end up slower at loading a new Vertex Shader program than older hardware because of the extra bits dragged along to expose new functionality and extra registers.
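As a rough illustration of this effect, the sketch below compares upload times for the same program at two instruction widths. All of the numbers (widths, bus bandwidth) are invented for the example and do not describe any real part:

```python
# Hypothetical illustration: a wider instruction word increases the
# upload time for the very same program. All numbers are made up.

def upload_time_us(instruction_count, bits_per_instruction, bus_bits_per_us):
    """Time (in microseconds) to upload a shader over a bus of the
    given bandwidth, assuming the transfer is bandwidth-limited."""
    total_bits = instruction_count * bits_per_instruction
    return total_bits / bus_bits_per_us

# The same 20-instruction program on older vs. newer hardware:
old = upload_time_us(20, 96, 1024)   # older spec: narrower instruction word
new = upload_time_us(20, 128, 1024)  # newer spec: extra bits for new features
assert new > old  # newer hardware loads the identical program more slowly
```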

Another factor to look at is efficient use of the instruction buffer. On new hardware this buffer stores up to 256 instructions, but if the average Vertex Shader only uses 20 instructions then a lot of valuable silicon space is being wasted. To avoid this waste and to speed up performance, an improved implementation might allow storage of multiple Vertex Shader programs. Rather than one big buffer containing just a single program, the buffer can be split into sections, each containing a Vertex Shader program. This can be achieved quite simply by allowing the driver to set different start and end values for the program counter, e.g. Shader 1 would go from instruction 1 to 23, Shader 2 from 24 to 52, Shader 3 from 53 to … and so on. The advantage of this technique is that shaders can be uploaded once and switching between them becomes quite cheap. The most basic implementation of this principle would allow double buffering of the instructions: one buffer contains the Vertex Shader currently in use, and the second buffer can be used to upload the next shader. So while one set of vertices is being processed, the shader for the next batch of vertices already gets uploaded to the hardware, effectively hiding the time needed for the upload and thus avoiding a nasty hardware stall and lost cycles. The most advanced form would allow X-level buffering, where X Vertex Shaders are stored in the instruction memory while one is active and another is being uploaded.
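The partitioning idea above can be sketched as a small model in which the driver assigns each resident shader a start/end range in the shared buffer, and a shader switch merely re-points the program-counter bounds. The class name, buffer size and allocation scheme are all hypothetical:

```python
# Sketch of a partitioned instruction buffer (hypothetical model):
# several shaders live side by side in one on-chip buffer, and the
# driver records a (start, end) program-counter range for each.

class InstructionBuffer:
    def __init__(self, size=256):
        self.slots = [None] * size
        self.programs = {}          # shader id -> (start, end) range
        self.next_free = 0

    def upload(self, shader_id, instructions):
        """Place a shader in the next free section of the buffer."""
        start = self.next_free
        end = start + len(instructions) - 1
        if end >= len(self.slots):
            raise MemoryError("instruction buffer full")
        self.slots[start:end + 1] = instructions
        self.programs[shader_id] = (start, end)
        self.next_free = end + 1
        return start, end

    def select(self, shader_id):
        """Cheap switch: just load new program-counter bounds."""
        return self.programs[shader_id]

buf = InstructionBuffer()
buf.upload("shader1", ["mul", "add", "dp4"])
buf.upload("shader2", ["mov", "rsq"])
assert buf.select("shader2") == (3, 4)  # switching costs no re-upload
```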

In summary, I have already proposed three different approaches to the instruction memory. The trivial implementation has one buffer storing one program, which results in stalls when the hardware switches from one program to another. A more advanced double-buffered solution allows two sections, one holding the active shader and one ready to receive the shader for the next set of vertices; for very large shaders the hardware can merge the two buffers to form one huge program area. And finally a fully flexible solution with a programmable start and end address of the shader in on-chip instruction memory – obviously jump and loop targets have to be adapted automatically to the variable start address of the Vertex Shader program.
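The automatic adaptation of jump and loop targets mentioned above can be illustrated with a simple relocation pass. The encoding is hypothetical: branch targets are assumed to be stored program-relative and are rebased to absolute instruction-memory addresses when the shader is placed at its start address:

```python
# Hypothetical relocation pass: when a shader is placed at a variable
# start address, program-relative jump/loop targets are rebased to
# absolute addresses in the on-chip instruction memory.

def relocate(program, base):
    """Return a copy of the program with branch targets rebased."""
    relocated = []
    for op, *args in program:
        if op in ("jmp", "loop"):              # target is program-relative
            relocated.append((op, args[0] + base))
        else:                                   # ordinary ALU instruction
            relocated.append((op, *args))
    return relocated

prog = [("mul", "r0"), ("jmp", 0), ("add", "r1")]
# Placed at address 53 (as in the Shader 3 example above):
assert relocate(prog, 53) == [("mul", "r0"), ("jmp", 53), ("add", "r1")]
```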

Ultimate flexibility would be achieved if the hardware supported loading sections of programs on the fly, effectively transforming the instruction memory into a cache that can be filled (using prediction) with instructions as needed. Such a cache-based structure would effectively allow infinitely long shader programs, but at the risk of reduced performance since, as indicated before, the load bandwidth of shader programs can be quite high. Then again, when you are executing thousands of instructions per vertex, upload speed might not be the ultimate concern.
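A toy model of such a cache-style instruction fetch might look as follows. The line size, capacity and miss penalty are invented numbers, and prediction, prefetching and line eviction are left out for brevity:

```python
# Toy model of cache-style instruction fetch (all parameters invented):
# the on-chip memory holds lines of the program, and a miss stalls the
# pipeline while the line is filled from external memory.

class InstructionCache:
    def __init__(self, line_size=4, miss_penalty=20):
        self.line_size = line_size
        self.miss_penalty = miss_penalty
        self.resident = set()       # line numbers currently on chip
        self.cycles = 0

    def fetch(self, pc):
        line = pc // self.line_size
        if line not in self.resident:
            self.cycles += self.miss_penalty   # stall for the line fill
            self.resident.add(line)
        self.cycles += 1                        # issue cycle for the fetch

cache = InstructionCache()
for pc in range(16):       # straight-line run through 16 instructions
    cache.fetch(pc)
# 16 issue cycles + 4 line fills of 20 cycles each = 96 cycles
assert cache.cycles == 96
```

Note how the fill penalty dominates for short, straight-line programs; this is the bandwidth risk mentioned above, and it is exactly what prefetching would try to hide.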

Data Fetch and Store

Each instruction can have up to 3 source registers, which means up to 3 values have to be fetched for each instruction. Memory and buffers come at different costs: obviously the bigger the buffer, the more expensive it is in silicon area, but the number of read and write ports also has an impact on cost. A single-ported memory allows one location in the buffer to be accessed at a time, dual-ported memory allows 2 locations, triple-ported memory allows 3 locations, etc. As indicated, each instruction can have up to 3 source registers and one destination register, and all of them can address the same buffer or different buffers. The most heavily used is the temporary register area: since it allows both reads and writes, there is a good chance that 3 reads and 1 write are required at the same time. These demands result in a quite expensive buffer structure, and some implementations might want to reduce the cost at the price of reduced performance. Most instructions do not use 3 source registers, so why not design for the average case with 2 source registers rather than the worst case with 3? Implement 2 read ports and 1 write port, and if 3 reads are required simply stall for a single cycle so the extra operand can be fetched. For real budget solutions one could even consider a single read port and a single write port, thus stalling the whole pipeline for instructions with 2 or 3 source arguments.
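The trade-off between read ports and stalls can be captured in a tiny cost model. It assumes, hypothetically, that each cycle can satisfy at most `read_ports` operand reads and that any excess simply costs another cycle:

```python
# Rough cost model (hypothetical numbers): cycles needed to issue one
# instruction as a function of how many read ports the temporary
# register file has.

def issue_cycles(source_reads, read_ports):
    """Cycles to issue one instruction: at least one cycle, plus a
    stall cycle for every extra group of reads beyond the ports."""
    return max(1, -(-source_reads // read_ports))   # ceiling division

# A MAD-style instruction with 3 source registers:
assert issue_cycles(3, read_ports=3) == 1  # full 3-read design: no stall
assert issue_cycles(3, read_ports=2) == 2  # average-case design: 1 stall
assert issue_cycles(3, read_ports=1) == 3  # budget design: 2 stalls
```

Under this model a workload dominated by 2-source instructions loses almost nothing on the 2-read-port design, which is precisely the average-case argument made above.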

The performance of the data fetch and data store is directly linked to how much silicon area is available for the design -- can the hardware vendor afford to implement a complex buffer structure capable of 3 reads and 1 write at the same time, thus maximising performance, or is the hardware optimised for the average case, resulting in the occasional stall when more reads are required than the hardware supports? Again, different hardware, different implementations…