Here's a "from the geeks wet dreams file" question: Is there any chance that there might be an interface to allow third parties (and knowledgeable enthusiasts) to write their own CFAA algorithms and run/implement them on their R6xx? Even distribute them for others to run?

It's certainly possible, and in DX10.1 applications will have access to the fragment data themselves, so they can certainly do it, and do it quickly. But the issue here is access to the compression data. I'm not sure we'd expose that, so we would be limited to offering people DX10.1-style functionality, which could be done. Again, it would be a question of making tools available and supporting users. Perhaps through CTM or OGL this will be possible too.
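To give a feel for what "access to the fragment data" lets an application do, here is a minimal CPU-side sketch of a weighted custom resolve over the samples of one pixel. The sample layout and weights are invented for illustration and are not AMD's CFAA filters; in DX10.1 the same idea would run in a pixel shader reading a multisampled render target.

```cpp
// Minimal sketch of a custom AA resolve done in software.
// The weights are arbitrary demo values, not AMD's CFAA kernels.
#include <array>
#include <cstdio>

struct Color { float r, g, b; };

// Weighted resolve of the 4 samples belonging to one pixel.
Color resolve4x(const std::array<Color, 4>& samples,
                const std::array<float, 4>& weights)
{
    Color out{0.f, 0.f, 0.f};
    float total = 0.f;
    for (int i = 0; i < 4; ++i) {
        out.r += samples[i].r * weights[i];
        out.g += samples[i].g * weights[i];
        out.b += samples[i].b * weights[i];
        total += weights[i];
    }
    out.r /= total; out.g /= total; out.b /= total;
    return out;
}

int main()
{
    std::array<Color, 4> samples{{{1,0,0},{0,1,0},{0,0,1},{1,1,1}}};
    std::array<float, 4> weights{0.3f, 0.3f, 0.2f, 0.2f}; // arbitrary demo weights
    Color c = resolve4x(samples, weights);
    std::printf("resolved: %.2f %.2f %.2f\n", c.r, c.g, c.b);
    return 0;
}
```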

Horrific geeks that we are, here's another -- we really loved previous Radeon flagship microscope die shots. Any chance of getting one for R600?

I got one of our engineers to generate a microscope picture of the die, where you can see the top layer of metal and the RDL layer. There's a funny story behind this. This engineer, Pierre, actually has a $15k microscope and photo setup at his house, and he uses it to take microscopic pictures of bugs and such; now and then we ask him to take pictures of dies too. He took this one with a polarizer to emphasize the metal layers. Amazing guy.

[Microscope shot of the R600 die; click for a bigger version]

You've added a programmable tessellation unit, outside the current DX10 spec. As we're sure you understand, going outside the spec is always a risk that sometimes pays off and sometimes doesn't. What did you see as making it worth the risk here?

Well, it was already present in Xbox 360, and it's likely to stick around for the foreseeable future. On top of that, our solution uses only a small amount of die area while being very flexible, so it seemed a no-brainer to add it in. In fact, the SW support was a bigger deal than including it in the HW! But the benefits it brings on the 360, and the ease of porting 360 games to a PC product that also supports it, were the deal clincher!

"Improved handling of problematic texture filtering cases" says one point of an AMD slide. What does that mean, exactly, and where would it be noticeable?

There are a couple of cases, but the main one has to do with min/mag filtering. One of the things we offered in R5xx was that an application could select a different type of filtering for minification and magnification; our competition used the same type for both. This was a caps bit in DX9. The problem was that many apps set minification to aniso and then left magnification at linear (i.e. untouched), for example. This worked fine on competing solutions (since they only kept aniso), but it actually looked worse on ours, since we ended up using worse-looking filters which were also less cache efficient (and sometimes lower performance). This affected some of the R5xx texture quality reviews; in fact, it was the most common application issue we saw on R5xx with respect to filtering quality (particularly frustrating since we offered the apps more flexibility, and we were punished for it!). We fixed that for R6xx, and I think some sort of SW fix is going to migrate back to R5xx.
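For context, this is roughly what the application pattern described above looks like in Direct3D 9: check the filtering caps, request anisotropic minification, and leave magnification at linear. This is only a sketch of the API calls involved; device creation and texture setup are omitted, and the helper function name is ours.

```cpp
// Sketch of the D3D9 pattern described above: anisotropic minification
// with magnification left at linear filtering. Device creation omitted.
#include <d3d9.h>

void SetSplitMinMagFiltering(IDirect3DDevice9* device, DWORD stage)
{
    D3DCAPS9 caps;
    device->GetDeviceCaps(&caps);

    // Caps bit advertising anisotropic filtering for minification.
    if (caps.TextureFilterCaps & D3DPTFILTERCAPS_MINFANISOTROPIC) {
        device->SetSamplerState(stage, D3DSAMP_MINFILTER, D3DTEXF_ANISOTROPIC);
        device->SetSamplerState(stage, D3DSAMP_MAXANISOTROPY, 16);
    }
    // Magnification left at linear -- the common case the answer describes,
    // which R5xx honoured literally while competing hardware effectively
    // used aniso for both.
    device->SetSamplerState(stage, D3DSAMP_MAGFILTER, D3DTEXF_LINEAR);
    device->SetSamplerState(stage, D3DSAMP_MIPFILTER, D3DTEXF_LINEAR);
}
```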

Another AMD slide bullet point notes "geometry shader performance shown to be up to 50x faster than competing implementations". Can you talk a bit more about what in the silicon makes that kind of performance delta vs the competition possible?

The number was simply measured, running an application we wrote. But running some of the Microsoft SDK samples (the vines DX10 demo comes to mind) shows similar deltas. It appears, though I hate speculating about others' HW, that when doing data expansion their GS execution loses a lot of its parallelism; my guess would be that that is their way of dealing with the expansion. If you have hundreds of threads, each running dozens of elements in parallel, and each element generating, say, hundreds of bytes of data, you can quickly grow to MBs of data, and keeping that on chip is not easily feasible. We overflow into DRAM but maintain full capability and performance, and with our memory bandwidth that is a good thing. Our competition appears to reduce the data to a trickle to keep it on chip. But it could simply be a driver bug ;-)
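To make the back-of-the-envelope numbers concrete, here is a quick calculation with purely illustrative figures (not measured R600 or competitor numbers) showing how quickly GS amplification outgrows any plausible on-chip buffer:

```cpp
// Back-of-the-envelope GS amplification estimate. All figures below are
// assumptions chosen to match the "hundreds of threads, dozens of elements,
// hundreds of bytes" description above, not real hardware numbers.
#include <cstdio>

int main()
{
    const long long threads           = 256;  // assumed threads in flight
    const long long elementsPerThread = 48;   // assumed elements per thread
    const long long bytesPerElement   = 256;  // assumed GS output per element

    const long long totalBytes = threads * elementsPerThread * bytesPerElement;
    std::printf("worst-case GS output in flight: %lld bytes (%.1f MB)\n",
                totalBytes, totalBytes / (1024.0 * 1024.0));
    // 256 * 48 * 256 = 3,145,728 bytes = 3.0 MB -- far more than typical
    // on-chip buffering, hence the spill to DRAM described above.
    return 0;
}
```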