NVIDIA GT200 GPU and Architecture Analysis
Monday 16th June 2008, 02:15:00 PM, written by Arun
It's finally here: GT200 is the chip NVIDIA is fielding to dethrone its 18-month-old champion, the G80. We won't look into real-world performance just yet, but we've got our usual brand of architecture analysis up right away, with more coming in the next few days and weeks.


I seem to remember reading some comparisons of bandwidth/pin where Rambus held a very substantial advantage with their next-generation technology (that I wouldn't expect to be eliminated completely via GDDR5).
I did, and I'll have to humbly disagree. A couple of points:
- Triangle setup can already be done in the shader core in ATTILA, and the 'shader' they use for it is public in one of their papers; it's nothing extraordinary.
- It's not like the entire shader core would have to be able to handle triangle setup; even just one or two clusters (or one SM per cluster) would already be very fast.
- As for texture filtering, it's not really about making things faster as much as it is about improving programmability. It could even be half-speed for a given data size and that'd be more than good enough.
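To give a feel for why setup in the shader core is "nothing extraordinary", here is a minimal sketch of edge-equation triangle setup in Python. It is not the ATTILA shader; the function names (`edge`, `triangle_setup`, `covers`) and the 2D screen-space vertex format are assumptions made for illustration only.

```python
def edge(ax, ay, bx, by):
    """Coefficients (A, B, C) of E(x, y) = A*x + B*y + C for the edge a -> b.
    E is zero on the edge and changes sign across it."""
    return (ay - by, bx - ax, ax * by - ay * bx)


def triangle_setup(v0, v1, v2):
    """Hypothetical setup step: three edge equations plus twice the signed area.
    Vertices are (x, y) tuples in screen space (an assumption of this sketch)."""
    (x0, y0), (x1, y1), (x2, y2) = v0, v1, v2
    area2 = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)
    edges = [
        edge(x0, y0, x1, y1),
        edge(x1, y1, x2, y2),
        edge(x2, y2, x0, y0),
    ]
    return edges, area2


def covers(edges, area2, x, y):
    """True if (x, y) lies inside or on the triangle, for either winding order."""
    if area2 == 0:
        return False  # degenerate triangle, nothing to rasterize
    sign = 1 if area2 > 0 else -1
    return all(sign * (a * x + b * y + c) >= 0 for a, b, c in edges)
```

The whole thing is a handful of multiplies and subtracts per triangle, which is the kind of work a shader ALU handles without trouble. Real setup also computes attribute gradients and handles fill rules and clipping, which this sketch leaves out.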
Arun, you need to get a handle on the size of arithmetic logic. Remember that R420, at 160M transistors, has the same setup rate, HiZ rejection rate, and scanline rasterization rate as RV770. You don't need full FP32 arithmetic for most of the operations. You don't need access to a shader program, or to pick values from 64K of register space.
The tough part is not the computational resources. It's managing the data flow. You have a list of primitives and they index into a post-transform cache. If you want to do multiple triangles per clock, you need multiple ports for this cache. You need multiple ports in the HiZ RAM, and you have to worry about updating some tiles twice per clock. Handling 16 different 4x4 tiles per clock is much easier than 32 tiles that aren't necessarily unique. You have to worry about the first triangle's Z/stencil affecting the second. There are probably a ton of other issues that I can't think of.
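The "tiles that aren't necessarily unique" problem can be sketched in software. The snippet below is only an illustrative model, not how any actual hardware does it: it conservatively lists the 4x4 tiles covered by each triangle's bounding box and checks whether two triangles could be processed in the same clock without touching the same HiZ tile. The names `tiles_touched` and `can_issue_together`, and the assumption of non-negative integer pixel coordinates, are mine.

```python
TILE = 4  # 4x4 pixel tiles, as in the post above


def tiles_touched(tri):
    """Conservative set of (tile_x, tile_y) covered by the triangle's bounding box.
    `tri` is three (x, y) tuples with non-negative integer pixel coordinates
    (an assumption of this sketch)."""
    xs = [x for x, _ in tri]
    ys = [y for _, y in tri]
    tx0, tx1 = min(xs) // TILE, max(xs) // TILE
    ty0, ty1 = min(ys) // TILE, max(ys) // TILE
    return {(tx, ty) for tx in range(tx0, tx1 + 1) for ty in range(ty0, ty1 + 1)}


def can_issue_together(tri_a, tri_b):
    """True if the two triangles share no tile, so the second cannot depend on
    the first triangle's Z/stencil or HiZ update within the same clock."""
    return tiles_touched(tri_a).isdisjoint(tiles_touched(tri_b))
```

When the check fails, the second triangle has to see the first one's results in the shared tiles, and the hardware must either serialize or resolve the dependency in order. That is the data-flow cost being described, and it has nothing to do with ALU count.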
Die cost isn't the issue here. It's just a matter of tiptoeing through the minefield of parallelizing a part of the graphics pipeline that has always been serial. There are so many edge cases that cause incorrect output if you aren't perfect.