A couple of days ago, HardOCP discovered that in some of the T&L tests in 3DMark 2000, a GeForce can be beaten by high-end Pentium IIIs and Athlons. Honestly, this is something most people already knew but were hesitant to make public. The figures HardOCP reported are repeated below:


                               Hardware T&L       Intel Pentium III 742 MHz   AMD Athlon 700 MHz (*)
High Polygon Count (1 Light)   4444 KTriangles/s  6752 KTriangles/s           3969 KTriangles/s
High Polygon Count (4 Lights)  3164 KTriangles/s  5453 KTriangles/s           3317 KTriangles/s
High Polygon Count (8 Lights)  1711 KTriangles/s  4300 KTriangles/s           2653 KTriangles/s

(*) Using a TNT2 Ultra

Now, what you notice is that both the PIII and the Athlon beat the hardware T&L of NVIDIA's GeForce256: the Athlon with a TNT2 Ultra only at 4 and 8 lights, and the PIII with a GeForce (hardware T&L disabled) at all settings. HardOCP's reaction was: "This is not right. How can a custom hardware implementation be slower than a software implementation running on a general-purpose CPU?" So they wrote a questioning article, which can be found here. Naturally, they asked both NVIDIA and MadOnion for feedback on this at first sight weird situation. NVIDIA promptly responded. But before I discuss their explanation, let me explain just what 3DMark 2000 is doing.

The Benchmark Scene

The High Polygon Count scene contains a number of torus-shaped objects, which are translated, rotated, scaled, and lit. The translating, rotating, and scaling is what the T part of T&L does: Transformations. The lighting is what the L part of T&L does: determining the influence of lights on the object's appearance.

Now a torus is a very interesting object, since it can be generated quite efficiently. It's theoretically possible to create a torus from one long strip (strips are series of triangles linked together; each new triangle added to a strip shares two vertices with the previous triangle). Backface culling (removing the invisible triangles on the back side of an object) will split this long strip into many smaller strips, but it is nevertheless very easy to create optimal vertex lists for a torus object, and vertex lists are required for optimal T&L performance. The nice thing is that the scene contains several torus objects; the even nicer thing is that all these objects have the same basic shape, so you can re-use the same vertex data for every torus object. The only differences between the various torus objects are their orientation, location in space, and size. Orientation and location changes are handled by the T part of the hardware, and size is just scaling, so the T&L unit can do all of this in hardware. Essentially, the whole scene can be created from one single object definition. In order to create this scene, you would do something like this:




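The original listing appears to be missing here, so the following is a hedged sketch of what such a scene loop might look like: one shared torus definition, drawn once per instance with only the transform changing. All names (Instance, set_transform, draw_indexed, the counts) are hypothetical, not taken from 3DMark 2000's actual source.

```c
#include <stddef.h>

/* Illustrative tessellation and instance counts, not 3DMark's real ones. */
enum { RINGS = 32, SEGMENTS = 32, NUM_TORI = 6 };

/* Per-instance data: only the transform differs between tori. */
typedef struct { float tx, ty, tz, angle, scale; } Instance;

/* A torus tessellated into RINGS x SEGMENTS quads, two triangles each. */
static int triangles_per_torus(void) { return RINGS * SEGMENTS * 2; }

/* Pseudo draw loop: one shared object definition, one draw per instance.
   Returns the total triangle count submitted for the scene. */
static int render_scene(const Instance *inst, int n) {
    int tris = 0;
    for (int i = 0; i < n; ++i) {
        /* set_transform(inst[i]);  -- hypothetical: T&L hardware applies
                                       translation, rotation, and scale   */
        /* draw_indexed(torus_vertices, torus_indices);
                                    -- hypothetical: same vertex data
                                       re-used for every torus            */
        (void)inst;
        tris += triangles_per_torus();
    }
    return tris;
}
```

Because the vertex data never changes, it can live in the accelerator's local memory, and each instance costs only a new transform plus a draw call.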
It looks like the scene is ideal for testing T&L throughput since this single object definition can be stored in the local memory of hardware supporting T&L.

Now, a big misconception is that people think this is a benchmark for testing in-game performance. It's pretty obvious that this scene is not at all representative of a game. A normal game is not built from one single object repeated several times. A real game has numerous different objects, some much more dynamic (more than just translating, rotating, and scaling) than others. If this scene is not representative of a game, then what is it? Well, it is a peak throughput test, something that measures the maximum T&L performance of a system. So it is essentially a synthetic test, just like a fill-rate test is. The fill-rate test in 3DMark likewise contains a scene that is not at all representative of a game; it is a scene that lets you measure the maximum peak fill-rate of an accelerator.