Complex Operations
When I give you a piece of paper and ask you to add two numbers or multiply them then most of you should be able to come up with the answer pretty quickly. Now if I ask you to find the answer to “1/34.687†or even “1/SQRT(2.5486)†them most of you will be kicking and screaming for a calculator. As said before if something is difficult for a human it usually is just as tricky for a bit of silicon. Now the Vertex Shader supports “RCP†(Reciprocal – also known as “inverseâ€) and “RSQ†(Reciprocal Square Root – also known as the inverse of the square root) so how can our Vertex Shader implementation calculate these quickly and accurately?
The answer is to use a Newton-Raphson approximation. I am not going to explain in detail what this beast exactly is, suffices to know that it is a step-by-step approximation system to find the approximate answer for a numerical problem. In lay mans terms you make an initial guess and by repeatedly doing some maths your guess gets closer and closer to the actual answer.
The Newton-Raphson approximation formulas are listed below:
1/A | Xn+1 = 2*Xn-A*Xn*Xn |
1/SQRT(A) | Xn+1 = 1.5*Xn-0.5*A*Xn*Xn*Xn |
In this table Xn is the initial approximation (guess) and Xn+1 is the improved approximation. Meaning that if you repeatedly apply this formula, to an initial guessed start value, you should end up with a value that is pretty close to the real answer. The Newton-Raphson method approximately doubles the number of significant digits for each iteration if the initial guess is close to the actual answer. So if we want to make optimal use of this technique and avoid long costly looping (each loop would be a clock cycle) we need initial start values that are close to the correct answer. The way to do this is to have an on-chip table of start values, obviously the more values in this table the quicker the answer can be found (less loops needed to have sufficiently accurate result) but at the cost of silicon space. The ultimate version of this relies almost purely on a look-up table with basic interpolation, this method is very fast and trivial to implement but a fairly large look-up table can be needed. Obviously for some functions there is plenty of table re-use possible (think cos/sin where the same table values can be re-used but with different signs).
So again different hardware vendors might have different implementations one might opt to have a smaller table and do more mathematics to find the answer, another might have a bigger table and but need less maths (usually means more performance). The usual trade-off between silicon space and speed and obviously great care has to be taken to avoid mistakes and guarantee enough bits of precision.
The same principle of tables and approximations can also be used to handle other complex instructions like Sine, Cosine, Log, Pow, etc. These instructions are most likely handled by a special function block which sole purpose is this kind of table based approximations.
Function Calls
A function call for the Vertex Shader is essentially a technique to use the same bit of code multiple times, so you save instruction space. It works like this:
... Instruction X CALL Normalise Instruction Y ... Normalise Maths to do normalise ... RET |
The CALL instruction will cause the program counter value to be pushed onto the stack, the stack being a buffer of a limited depth. The program counter is then loaded with the instruction address indicated by the “label†that is called. By changing the program counter different instructions will be fetched and executed, in this example the instructions performing a normalisation. At the end of the function is the RET command, this command takes the last value of the stack and loads it into the program counter, with other words the execution continues right after the CALL command (instruction Y). Multiple nested calls results in multiple values on the stack. If the number of nested calls is larger than the depth of the stack buffer things will go wrong, you can’t store the return addresses correctly and things go horribly wrong: stack overflow.
As usual there are other ways to handle this, one option to improve speed is to actually inline the function calls, this means that instead of changing the program counter the instructions of the function call replace the call instruction, so you get this:
... Instruction X Maths to do normalise ... Instruction Y ... |
This technique costs instructions slots if the normalise function is called multiple times, but it saves the clock cycles needed for the CALL and RET. So again the Vertex Shader compiler in the driver will play an important roll: if enough instruction space is available inline the function calls thus not wasting clocks on doing program counter changes.