Introduction
As I explained before, adding T-Buffer effects and moving from 16-bit to 32-bit color is not free. The T-Buffer effect requires (at least) 4 sub-samples per pixel, so we need 4 times as much output as we have today, and moving from 16-bit to 32-bit output also increases the bandwidth needs (like a higher quality mode for the machine). Let me illustrate this last increase with some numbers:
Today we render in full 16-bit mode. This means we use 16-bit textures, a 16-bit Z-buffer and a 16-bit frame buffer (read/write and RAMDAC access). In full 32-bit mode we use 32-bit textures, a 32-bit Z-buffer and a 32-bit frame buffer, so basically all our information goes from 16-bit to 32-bit. In our machine analogy this would be similar to the machine using twice as much raw material and producing final products that are twice the size of the original ones. So all transport doubles; it's pretty obvious that this can't happen without problems unless you increase the bandwidth.
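To put a rough number on it, here is a small back-of-the-envelope sketch. The four accesses per pixel (one texel fetch, one Z read, one Z write, one color write) are my own simplifying assumption, not a description of any particular chip, and I ignore filtering and overdraw:

```c
#include <stdio.h>

int main(void)
{
    /* Assumption: one texel fetch, one Z read, one Z write and one
       color write per rendered pixel. Real chips need more (filtering,
       overdraw), so treat this as an illustration only. */
    int accesses_per_pixel = 4;

    int bytes16 = accesses_per_pixel * 2;   /* 16-bit mode: 2 bytes per access */
    int bytes32 = accesses_per_pixel * 4;   /* 32-bit mode: 4 bytes per access */

    printf("16-bit mode: %d bytes of traffic per pixel\n", bytes16);
    printf("32-bit mode: %d bytes of traffic per pixel\n", bytes32);
    printf("increase: %.1fx\n", (double)bytes32 / bytes16);
    return 0;
}
```

Whatever the exact per-pixel figure is on real hardware, the ratio stays the same: going full 32-bit doubles the traffic.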
Now, some of you might say: but my xxxx-card does 16-bit and 32-bit at the same frame rate. Yes, this might be true, but that's because there are other limitations. Assume your machine is rather slow, so slow that the conveyor belt sometimes runs empty. This means that if you increase quality the flow increases, but there is still space left on the belt, so there is no slowdown. In 3D chips the main cause of these empty belts is the main CPU: the CPU is not fast enough at sending commands to the machine (something like "make a blue object a red object"). If the machine doesn't get commands, it can't produce; if it doesn't produce, the belt is empty. This problem is known as being CPU limited. The problem we described is being bandwidth limited, and you can almost always see it at higher resolutions (1024x768 and above). Right now there is no product that can render 1600x1200 in 32-bit at the same speed as 1600x1200 in 16-bit, simply because the bandwidth is too low (the belt isn't fast enough). And I mean FULL 32-bit, not just a 32-bit frame buffer.
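If you like numbers, here is a small sketch of the difference between the two limits. Every figure in it (memory bandwidth, average overdraw, the CPU-side frame rate cap) is made up purely for the sake of illustration and does not describe any real card:

```c
#include <stdio.h>

/* Illustration only: max frame rate the memory bus allows =
   bandwidth / (pixels * bytes-per-pixel * overdraw). */
static double mem_fps(double bw, int w, int h, int bytes_per_pixel, double overdraw)
{
    return bw / ((double)w * h * bytes_per_pixel * overdraw);
}

int main(void)
{
    double bw       = 2.6e9;  /* hypothetical memory bandwidth: 2.6 GB/s */
    double overdraw = 3.0;    /* hypothetical average overdraw           */
    double cpu_fps  = 60.0;   /* hypothetical CPU-side frame rate cap    */

    int res[][2] = { {1024, 768}, {1600, 1200} };
    for (int i = 0; i < 2; i++) {
        int w = res[i][0], h = res[i][1];
        /* ~8 bytes of traffic per drawn pixel in 16-bit mode, ~16 in
           32-bit mode, as in the earlier sketch. */
        double fps16 = mem_fps(bw, w, h, 8,  overdraw);
        double fps32 = mem_fps(bw, w, h, 16, overdraw);
        printf("%dx%d 16-bit: memory allows %5.1f fps, CPU allows %.0f -> %s limited\n",
               w, h, fps16, cpu_fps, fps16 < cpu_fps ? "bandwidth" : "CPU");
        printf("%dx%d 32-bit: memory allows %5.1f fps, CPU allows %.0f -> %s limited\n",
               w, h, fps32, cpu_fps, fps32 < cpu_fps ? "bandwidth" : "CPU");
    }
    return 0;
}
```

With these made-up numbers the card is CPU limited at 1024x768, so 16-bit and 32-bit run at the same speed; at 1600x1200 the memory bus becomes the bottleneck and 32-bit drops to roughly half the frame rate of 16-bit. That is exactly the pattern described above.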
So now 3dfx needs to come up with a solution for the bandwidth problem: they need 4 times the output (and thus 4 times the input), and they need to move from 16-bit to 32-bit, which again doubles the problem. The naive increase in bandwidth is thus a factor of 8!
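Spelled out as a trivial calculation (worst case, before any re-use of data is taken into account):

```c
#include <stdio.h>

int main(void)
{
    /* Naive worst case: every sub-sample is stored and moved separately,
       and every value doubles in size going from 16-bit to 32-bit. */
    int subsamples   = 4;   /* T-Buffer sub-samples per pixel       */
    int depth_factor = 2;   /* 16-bit -> 32-bit doubles each access */

    printf("naive bandwidth factor: %dx\n", subsamples * depth_factor);   /* 8x */
    return 0;
}
```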
Now, truly increasing the bandwidth by a factor of 8, so running the belt 8 times faster or making it 8 times wider, is not necessary because of the re-use of data in the buffers. Exactly how much more bandwidth we need depends on the efficiency of that re-use. Guessing this re-use is very difficult since it depends on the game/application. Because of this, companies like 3dfx run simulations. During these simulations (where they run actual games on simulated hardware) they can see exactly how efficient the re-use of data is. Based on these tests they can decide how much bandwidth they really need. Naturally, they can't test every game, so there is always the risk of underestimating the need.

A good example of such a failure is the TNT1/2 running the original Unreal. Unreal was programmed in a way that clashed with the re-use system of the TNT1/2, resulting in very low performance. The re-use system, or cache management, decides what data should be kept in the local buffer and what data should be replaced with new data transported in from the large storage. What happened was that 3dfx managed to get a lot of re-use, and thus no bandwidth problems, while NVIDIA's system failed to get good re-use of buffer data, meaning that the bandwidth fell short and the processor had to wait, giving the well-known poor frame rate numbers... Yet today Unreal runs fine on TNT1/2 boards. This is not due to new caching algorithms or driver/hardware changes; it's because Unreal was changed to be more NVIDIA-cache-friendly. This kind of problem can occur with any cache algorithm: breaking cache performance is very easy, so game coders have to pay a lot of attention to make sure that caching doesn't break... Epic learned this the hard way.
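To show how fragile re-use can be, here is a toy cache simulation. It is not how the TNT or the Voodoo hardware actually manages its buffers; the cache size, line size and access patterns are all invented. The only point is that the same amount of data, fetched in a different order, gives a completely different hit rate:

```c
#include <stdio.h>

/* Toy direct-mapped texel cache: 64 lines of 8 texels each.
   Purely illustrative; no real chip is being modelled here. */
#define LINES       64
#define LINE_TEXELS  8

static int tags[LINES];
static int hits, misses;

static void reset(void)
{
    for (int i = 0; i < LINES; i++) tags[i] = -1;
    hits = misses = 0;
}

static void fetch(int texel_addr)
{
    int block = texel_addr / LINE_TEXELS;   /* which group of 8 texels      */
    int line  = block % LINES;              /* which cache line it maps to  */
    if (tags[line] == block) hits++;
    else { tags[line] = block; misses++; }  /* evict whatever was there     */
}

int main(void)
{
    const int W = 256, H = 256;             /* texture size in texels */

    /* "Friendly" pattern: walk the texture row by row. Neighbouring
       texels share a cache line, so most fetches hit. */
    reset();
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            fetch(y * W + x);
    printf("row-major walk:    %6d hits, %6d misses (%.0f%% hit rate)\n",
           hits, misses, 100.0 * hits / (hits + misses));

    /* "Unfriendly" pattern: walk column by column. Data gets evicted
       before it is ever reused. */
    reset();
    for (int x = 0; x < W; x++)
        for (int y = 0; y < H; y++)
            fetch(y * W + x);
    printf("column-major walk: %6d hits, %6d misses (%.0f%% hit rate)\n",
           hits, misses, 100.0 * hits / (hits + misses));
    return 0;
}
```

In the row-major walk almost 90% of the fetches come straight out of the little cache; in the column-major walk every single fetch has to go back to the large storage. Same texture, same amount of work, wildly different bandwidth cost; that, roughly, is the difference between a cache-friendly game and a cache-hostile one.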