Having seen what the Wildcat's SuperScene FSAA was capable of, I quizzed 3Dlabs' Director of Product Marketing, Jeff Little, to see if he could give us a little more insight and fill in some of the things that 3Dlabs' SuperScene description page leaves out.

My questions are in blue, and Jeff's replies in italics.


nVIDIA defines the size of their Multisample buffer as follows:

Video_memory = sizeof(front_buffer) + sizeof (back_buffer) + num_samples * (sizeof(front_buffer) + sizeof(Z_buffer))

This basically means that the multisample buffer is a multiple of the size of the front buffer, i.e. if the number of FSAA samples were 4 (4X AA) and the target display resolution was 800x600 then their multisample buffer would use 4 times the RAM of the front or back buffer (i.e. rendering 1600x1200 pixels). The 3Dlabs SuperScene documentation states that: "The multisample buffer is the same size as the screen and contains multisample pixels just as the image contains image pixels" - which indicates that the multisample buffer is the same size as the target display resolution, but how can this be, considering that potentially 16 times the number of samples are being stored?
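As a rough illustration of the formula quoted above, the sketch below plugs in the 800x600, 4X example. The function name and the 4 bytes per pixel assumed for both colour and Z are assumptions made for the example; the formula itself does not state buffer depths.

```python
def nvidia_video_memory(width, height, num_samples,
                        color_bytes=4, z_bytes=4):
    """Evaluate the quoted formula, returning (total, multisample) in bytes.

    color_bytes and z_bytes are assumptions (32-bit colour, 32-bit Z);
    the quoted formula does not specify them.
    """
    pixels = width * height
    front_buffer = pixels * color_bytes
    back_buffer = pixels * color_bytes
    z_buffer = pixels * z_bytes
    # num_samples * (sizeof(front_buffer) + sizeof(Z_buffer))
    multisample = num_samples * (front_buffer + z_buffer)
    return front_buffer + back_buffer + multisample, multisample


total, multisample = nvidia_video_memory(800, 600, num_samples=4)
print(f"multisample buffer: {multisample / 2**20:.1f} MiB")
print(f"total video memory: {total / 2**20:.1f} MiB")
```

Under these assumed depths the multisample term dominates the total, which is the cost the question is driving at.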

The general idea is that multisample memory is divided into two memory spaces:  pre-allocated and dynamic.

Some details (for background understanding): Pre-allocated Memory. Wildcat pre-allocates a certain amount of memory corresponding to each pixel in the multisample buffer.  We can pre-allocate 2, 4, 8, or 16 slots per pixel.  Of course, the amount of pre-allocated memory increases as more and more slots are pre-allocated.

A slot is enough storage to hold the color, depth, and stencil information, and its size is independent of the number of samples actually taken.  That is, we can pre-allocate 2 slots per pixel for either 4 samples or 16 samples, and it would take the same amount of space either way.

Studies have shown that typically, the majority of pixels are covered by a small number of fragments.  So why force all pixels to have the worst-case sample storage allocated for them when many pixels only require 1 - 2 fragments?  (If two triangles cover a pixel, then that pixel will require two slots of multisample storage, provided each triangle hits some samples within the pixel.)

So, in this sense, the multisample storage requirements and performance are roughly independent of the number of samples actually taken.  However, the blend and stencil modes may change this and cause more slots to be created.

Dynamic Memory. This is the amount of memory we use where it's needed most.  For a pixel that needs 7 slots (because we're doing 8- or 16-sample multisampling) and for which only 2 slots were pre-allocated, 6 slots from dynamic memory will be requested, because slots are allocated and freed in pairs.  Typically, we only need 10% extra memory to handle all of the dynamic memory - for example, if a 1280 x 1024 window with 2 pre-allocated samples corresponds to 1280 x 1024 x 43 = 54 Mbytes, I think roughly 5.4 Mbytes would be allocated for dynamic memory.

By default we pre-allocate 2 slots per pixel.  In the driver applet, we allow control of how much dynamic memory to allocate.  So, the user may make tradeoffs between memory usage and scene quality.  However, scene quality typically isn't an issue.
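The slot arithmetic in the answer above can be sketched in a few lines. This is only an illustration of the numbers Jeff quotes, not a description of how the hardware or driver does it: the function names are hypothetical, and the 43 bytes per pixel is taken as given from his example without knowing how it breaks down.

```python
def dynamic_slots_requested(slots_needed, preallocated=2, pair=2):
    """Slots requested from dynamic memory for one pixel.

    Anything beyond the pre-allocated slots is rounded up to a whole
    number of pairs, since slots are allocated and freed in pairs.
    """
    extra = max(0, slots_needed - preallocated)
    return -(-extra // pair) * pair  # ceiling division, then scale back up


def preallocated_mbytes(width, height, bytes_per_pixel=43):
    """Pre-allocated multisample memory in Mbytes (2**20 bytes).

    bytes_per_pixel=43 is the figure quoted in the interview for
    2 pre-allocated slots; its breakdown is not given there.
    """
    return width * height * bytes_per_pixel / 2**20


print(dynamic_slots_requested(7))         # 5 extra slots, rounded up to pairs
prealloc = preallocated_mbytes(1280, 1024)
print(f"pre-allocated: {prealloc:.1f} Mbytes")
print(f"dynamic (10%): {0.10 * prealloc:.1f} Mbytes")
```

A pixel that fits in its pre-allocated slots requests nothing from the dynamic pool, which is why the pool can be so much smaller than the pre-allocated space.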

The comment in the SuperScene AA documentation, "Pseudo-random noise generator", suggests that Wildcat doesn't actually employ truly random stochastic FSAA; is this the case? How does this operate - for example, ATi have a system that is (or at least should be) able to apply a sample location map from a limited predefined set, selected per pixel depending on various parameters for that pixel; is this similar to the system that you use?

I'm not familiar with ATI's technique, but from the supplied description, it sounds like we're able to do a similar thing.

A note of caution:  If you take different sample patterns from one pixel to the next, that can introduce a shimmering effect along edges unless adequate post-sample filtering is done.

The documentation also states: "Fine tuning of where sample points are located"; what does this mean?

Also from there we say "Each multisample pixel is effectively divided into a 16 by 16 grid from which 2, 4, 8, or 16 samples are taken."

Various tables stored in Wildcat define where these samples are and how sample locations may be perturbed from one pixel to the next.  The sample pattern may be defined as a regular grid or a stochastic pattern per pixel. Millions of different sample patterns are possible.  Currently, all multisample pixels share the same sample pattern.

By the way, this paper needs to be updated as we now only support either 8 or 16 samples - there was no real improvement in performance by using 2 or 4 samples, so we removed the options.  Why offer lower quality FSAA if it doesn't buy you a noticeable performance boost?
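To make the 16 by 16 grid from the answer above more concrete, here is a minimal sketch of picking 8 or 16 sample positions within one pixel, either on a regular layout or pseudo-randomly. The contents of Wildcat's actual tables are not described in the interview, so the layouts, function names and seeding here are purely illustrative assumptions.

```python
import random

GRID = 16  # each multisample pixel is divided into a 16 by 16 grid


def regular_pattern(num_samples):
    """Evenly spaced sample positions (assumed layout, not Wildcat's)."""
    cols, rows = {8: (4, 2), 16: (4, 4)}[num_samples]
    # Place each sample at the centre of its cell in a cols x rows layout.
    xs = [(2 * i + 1) * GRID // (2 * cols) for i in range(cols)]
    ys = [(2 * j + 1) * GRID // (2 * rows) for j in range(rows)]
    return [(x, y) for y in ys for x in xs]


def stochastic_pattern(num_samples, seed=0):
    """Distinct pseudo-random grid positions; the seed makes it repeatable."""
    rng = random.Random(seed)
    cells = [(x, y) for y in range(GRID) for x in range(GRID)]
    return rng.sample(cells, num_samples)


# One pattern shared by every pixel, as Jeff says is currently the case.
shared = stochastic_pattern(16, seed=42)
print(regular_pattern(8))
print(shared)
```

Using a fixed seed mirrors the "pseudo-random" wording in the documentation: the pattern looks irregular but is the same every time it is generated.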

What is the configuration of the FSAA sampling on Wildcat? For instance, NVIDIA employs multiple Z check units per pixel pipe, resulting in virtually fill-rate-free multisample FSAA (only an extra cycle is required per colour change on edges); does Wildcat feature 16 Z units such that all 16 samples can be output in one cycle, or does it take each sample individually, requiring 16 cycles per multisampled pixel?

We have some tricks we apply here, but I'd rather not talk about that - we need to keep some of our magic secret!


Although Jeff states that Wildcat presently uses the same sample pattern for all pixels, if the output of Henrick's AA testing program is correct then it would appear that stochastic sampling is used in some cases, presumably if there is insufficient information to determine the most appropriate sampling pattern.

Also, we should note that Wildcat's AA basically only stores the extra sample information for the pixels that require it since, under multisampling, pixels internal to a polygon will display no difference with AA enabled or disabled. This can bring big advantages in terms of memory requirements and, potentially, performance as well. However, even with Wildcat's mechanism it still behaves exactly the same way as a normal multisample-buffer implementation.