NVIDIA Fermi: new GPU architecture, starting with GF100

Wednesday 30th September 2009, 11:43:00 PM, written by Rys

At their GPU Technology Conference earlier this evening, NVIDIA announced their next-generation graphics architecture, codenamed Fermi. Graphics doesn't appear to be the primary focus of the first implementation of Fermi, though, with GF100 going for everyone else's jugular in the general-purpose GPU compute industry.

With a brand new shader core, each of Fermi's compute clusters comprises a single shader multiprocessor (SM) this time. Each SM can dual-issue two independent instructions per clock to two different warps, with each instruction executed across two clocks by a 16-way SIMD block capable of single-precision FMAs at full rate and double-precision FMAs at half rate.
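
The issue arithmetic above can be sketched as a quick back-of-envelope calculation. The warp size of 32 threads is an assumption taken from CUDA convention and is not stated in the article; the function and constant names are purely illustrative, and nothing here models SM count or clock speeds, which were not disclosed.

```python
# Back-of-envelope per-SM throughput, using only figures from the article
# plus one labeled assumption (the warp size).
WARP_SIZE = 32    # assumption: CUDA warp width, not stated in the article
SIMD_WIDTH = 16   # lanes per SIMD block (from the article)
ISSUE_SLOTS = 2   # independent instructions issued to two different warps

def clocks_per_warp_instruction(warp_size=WARP_SIZE, simd_width=SIMD_WIDTH):
    """Clocks needed to push one warp through one SIMD block."""
    return -(-warp_size // simd_width)  # ceiling division

def fmas_per_clock(rate, issue_slots=ISSUE_SLOTS, simd_width=SIMD_WIDTH):
    """FMAs retired per clock per SM; rate is 1.0 for SP, 0.5 for DP."""
    return issue_slots * simd_width * rate

sp_fmas = fmas_per_clock(rate=1.0)  # single precision at full rate
dp_fmas = fmas_per_clock(rate=0.5)  # double precision at half rate

print(clocks_per_warp_instruction(), sp_fmas, dp_fmas)
```

Under those assumptions a 32-thread warp occupies a 16-wide block for two clocks, which matches the "across two clocks" wording, and an SM would retire 32 SP or 16 DP FMAs per clock.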

The memory hierarchy is new, with a coherent, unified L2 cache serving all SMs with no partitions. A new unified memory space allows each SM to talk not just to its own local registers and shared memory, but to L2 and beyond, all the way out into system memory (up to 1 TiB, backed by a hardware TLB).

Various other compute-friendly facets of performance are improved versus their last Tesla architecture chips, of which GT200/T10 is the pinnacle at the moment. Atomic instruction throughput is up, everything is backed by ECC, and the hardware can sustain peak SP and DP FMA instruction throughput.

We've got a short look at GF100 in the forums, including speculation on some of the graphics features, pending a proper look at things. Our friends at The Tech Report and Real World Tech also have pieces on the architecture, by virtue of early briefings.



Latest Thread Comments (3800 total)
Posted by pcchen on Wednesday, 03-Feb-10 18:36:33 UTC
Quoting FenderBender
Are the memory controllers independent and can access any desired bit of the device memory, or does each one handle a bank of memory and the controller has to queue up requests for each needed bank?
A memory controller can access only those memory devices which are attached to it. Generally there is some sort of "spreading" algorithm to avoid congestion on a single memory controller.
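
As a rough illustration of the "spreading" idea, the sketch below interleaves addresses across controllers with a simple modulo. The count of six controllers comes from a question elsewhere in this thread; the 256-byte interleave granularity and all function names are hypothetical, and the actual mapping used by any real GPU is not described here and may well be hashed instead.

```python
from collections import Counter

NUM_CONTROLLERS = 6     # figure quoted elsewhere in the thread
INTERLEAVE_BYTES = 256  # hypothetical interleave granularity

def controller_for(address, interleave=INTERLEAVE_BYTES, n=NUM_CONTROLLERS):
    """Pick the controller that owns the chunk containing this address."""
    return (address // interleave) % n

def controller_load(addresses):
    """Count how many accesses land on each controller."""
    return Counter(controller_for(a) for a in addresses)

# Sequential 64-byte accesses spread evenly over all six controllers.
linear = controller_load(range(0, 6 * 256 * 4, 64))

# A stride of 6 * 256 bytes defeats plain modulo interleaving:
# every access lands on controller 0.
strided = controller_load(range(0, 8 * 1536, 1536))

print(dict(linear), dict(strided))
```

The strided case shows the kind of congestion a smarter (for example, hashed) mapping would try to avoid.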

Posted by MfA on Wednesday, 03-Feb-10 19:29:45 UTC
Quoting FenderBender
So if Fermi has 6 64 bit memory controllers, is the smallest read transaction a mere 64 bits?
Burst length is 8, so 64 bytes (you can do a burst of 4, but that's only to reduce power consumption ... it still takes 8 cycles till the next burst).
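
The arithmetic behind that answer can be sketched in a few lines. The bus width, burst length and controller count are the figures quoted in this exchange; the function name is illustrative only.

```python
def burst_bytes(bus_width_bits, burst_length):
    """Bytes moved by one burst on a controller of the given width."""
    return bus_width_bits * burst_length // 8

# One 64-bit controller with a burst length of 8.
min_transaction = burst_bytes(64, 8)

# A burst of 4 moves half the data, but per the post above it still
# occupies the same 8-cycle slot, so it saves power rather than time.
short_burst = burst_bytes(64, 4)

# Six 64-bit controllers side by side.
total_bus_width_bits = 6 * 64

print(min_transaction, short_burst, total_bus_width_bits)
```

So the smallest full transaction per controller is 64 bytes rather than 64 bits, and six such controllers add up to a 384-bit aggregate bus.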

Posted by PSU-failure on Thursday, 04-Feb-10 12:59:28 UTC
Quoting trinibwoy
GPU memory buses have been built from multiple narrower and independent memory controllers since 2001 :)
Even back in 1998 (Matrox G200), in fact, and even ATI released something like this with R100 with the exact same goal as nVidia with the GeForce 3.

And we don't know everything, perhaps that was the case of other processors before and we didn't hear it back then because there wasn't anything to hide by promoting it.

I wonder if it's still efficient with billion-transistor chips, though; a write buffer, a decoupled RAM controller and some scheduling could improve performance. It's not as if a few kB of SRAM still requires substantial die area.

Posted by Fusion on Friday, 05-Feb-10 04:39:18 UTC
Is Fermi's architecture considered von Neumann or Harvard? And what about the RV870?

Posted by pcchen on Friday, 05-Feb-10 04:51:58 UTC
Quoting Fusion
Is Fermi's architecture considered von Neumann or Harvard?

And what about the RV870?
I think on current GPUs, code and data share the same physical memory (i.e. the video memory). However, programs running on the GPU don't have the ability to access the code memory (e.g. a program can't "generate" another program on the GPU). So in this sense it behaves more like Harvard rather than von Neumann.

Posted by 3dcgi on Saturday, 06-Feb-10 06:31:16 UTC
ATI chips from the past few years have had instruction caches that are distinct from data caches. There's no reason code must be in video memory. It could be loaded straight from system (CPU) memory if the latency can be tolerated. I would guess Nvidia is the same.

Posted by OlegSH on Saturday, 06-Feb-10 10:36:26 UTC
Quoting 3dcgi
ATI chips from the past few years have had instruction caches that are distinct from data caches. There's no reason code must be in video memory. It could be loaded straight from system (CPU) memory if the latency can be tolerated. I would guess Nvidia is the same.
NV has had instruction caches since NV40, driven by SM3.0's demand for a huge maximum instruction count per shader. These are very small caches that can effectively cache only a small area of local GPU memory (a buffer of instructions and constants), so I don't see an opportunity to cache system memory with them.

Posted by MfA on Tuesday, 09-Feb-10 20:28:16 UTC
Quoting MfA
Why don't they just let the Intel chipset handle all the display tasks but intercept 3D rendering calls to render frames on the discrete GPU? Then you can just blit the framebuffer into the integrated graphics chip's memory when doing 3D, and when not doing 3D turn off the discrete graphics chip without a care in the world.
And that's exactly what they did :)

Posted by DavidGraham on Tuesday, 09-Feb-10 21:10:41 UTC
Quoting MfA
And that's exactly what they did :)
Nice prediction... assuming, of course, that you were talking about Optimus!

Posted by Joshua Luna on Wednesday, 10-Feb-10 02:51:43 UTC
Are there any good overviews of the architecture and potential performance yet? I just recently read about the quasi 4 triangle/clock arrangement and the monster (?) tessellation performance. I haven't been following Fermi much due to the vaporware/high-pitched FUD, so please excuse my disconnect; it sounds like Fermi has some neat tricks up its sleeve. Maybe NV has something for SLI as well? (I must admit I am excited about their laptop dock with the Gateway, I hope that catches on!)


