Do you feel that, because shared memory is such a big part of how G8x works and works so well for CUDA, you'll keep that basic architectural trait and its implementation going forward with new generations?

From a compatibility standpoint, that's a capability, so we either need to keep providing it or find a way to emulate it. But I think it's a really valuable part of the programming model; the shared memory brings a lot of extra benefit, even if you just use it as an extra set of registers, because registers are very valuable.
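As a rough illustration of that point (a minimal sketch, not code from the interview, with hypothetical names), here's a CUDA kernel that stages values in shared memory so a whole thread block can reduce them without repeated trips to device memory:

    // Minimal sketch: using __shared__ memory as fast per-block scratch space,
    // here for a block-wide sum reduction. Assumes a launch with 256 threads per block.
    __global__ void blockSum(const float *in, float *out, int n)
    {
        __shared__ float scratch[256];          // one slot per thread in the block

        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        scratch[tid] = (i < n) ? in[i] : 0.0f;  // stage the value in shared memory
        __syncthreads();

        // Tree reduction entirely in shared memory -- no further device-memory traffic
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                scratch[tid] += scratch[tid + stride];
            __syncthreads();
        }

        if (tid == 0)
            out[blockIdx.x] = scratch[0];       // one write back to global memory per block
    }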

As far as CUDA goes, if it sees you're not using all of the shared memory, can it use what you don't use as GPRs, or do you never want to take away the programmer's expectation that it'll be there at a certain size?

I think that's an interesting idea. Right now the compiler would have to do that: recognise you're not using it and allocate it for registers. I don't know how hard that is, but it's certainly not impossible. That's a good idea actually, and I'll pass it along to the compiler team.

In CUDA, was it important for you to expose the memory hierarchy in the programming model, where you say "here's your shared memory, here's constant memory, here's global memory", and so on, and they're all separate pools and very visible that way to the programmer? Was that a conscious decision?

Yes, but there's some debate about it, since as you know we don't always expose the details of the architecture if we don't have to.

Because it makes it easier for the other guy to figure it out and learn how you do things?

Right, but in this case we're moving from a graphics market to a processor market, where you need to give your programmers pretty good visibility into what's happening. Because the different areas of memory available have different performance characteristics, the programmer has to understand how they work. So say you have a region of memory that you'll only ever read from; well, that maps to constant memory. And you might have this other huge array that you read and write; well, that maps to global memory. Depending on where you put things, it can make a big difference. Remember the MRI demo yesterday? I think they got almost another factor of 10 increase just by using the right type of memory for the right thing. They'd put everything in global memory at the beginning, and it was fast, it wasn't bad, but by making use of the memories in the right way they got another big factor.
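To make that mapping concrete, here's a minimal, hypothetical sketch (not the MRI code) of putting small read-only data in constant memory and the large read/write array in global memory:

    // Small, read-only coefficient set -> constant memory (cached, broadcast to threads).
    __constant__ float coeffs[64];

    // Large, read/write array -> global (device) memory.
    __global__ void apply(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = data[i] * coeffs[i % 64];
    }

    // Host side: copy the coefficients once with
    //   cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));
    // then launch apply<<<grid, block>>>(devData, n);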

So is that one of the keys to getting the best performance out of the chip: being intelligent about the memories available and how you use them?

I think there are a lot of things, and that's one of them. The other big thing is structuring your program so that you have the right balance of computation and memory access. You do things differently than you would on the CPU, where compute can be more expensive than using memory; on the G80, compute is cheap but using device memory is expensive.
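A hypothetical sketch of that trade-off (illustrative only, not from the interview): rather than fetching a precomputed value from a table in device memory, it can be cheaper on the GPU to just recompute it.

    // CPU-style: look the value up in a big precomputed table held in device memory.
    __global__ void scaleLookup(float *data, const float *table, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= table[i];                // extra global-memory read per element
    }

    // GPU-style: recompute the value on the fly -- arithmetic is cheap
    // relative to a device-memory access.
    __global__ void scaleCompute(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= __sinf(i * 0.001f);      // recomputed per thread, no extra memory traffic
    }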