Intel Larrabee @ SIGGRAPH 2008

Monday 02nd June 2008, 09:25:00 AM, written by Arun

Starting in August, part of the shroud of mystery around Larrabee is going to dissipate: A paper called 'Larrabee: A Many-Core x86 Architecture for Visual Computing' will be presented at SIGGRAPH by its authors, which include Doug Carmean, Tom Forsyth, Michael Abrash, Pat Hanrahan and many others.

The paper's abstract describes Larrabee as using 'multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as fixed-function co-processors. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads and greatly increases the flexibility and programmability of the architecture as compared to standard GPUs.'

There's nothing revolutionary there, or anything we didn't already know, but we'll definitely be looking forward to this. No promises that we'll be at SIGGRAPH this year, but it's still relatively likely; plus, this likely won't be the only event where Intel presents Larrabee this year. It's worth pointing out that Larrabee will be competing head-on against NVIDIA's and AMD's DX11 GPUs, not their current ones; sadly, it seems unlikely that either company will be willing to disclose anything substantial about their next-generation architectures until well into 2009.

[Thanks to nAo for the tip!]




Latest Thread Comments (544 total)
Posted by 3dilettante on Tuesday, 25-Nov-08 14:56:37 UTC
Quoting Jawed
Isn't it normal for any kind of bus to run over logic - with only repeater logic consuming area in "islands"?
Cell's EIB doesn't route over the local stores. The EIB has a fair amount of dedicated logic sitting right in the center of the die.

The coherent bus in Larrabee might have made the case for distributing it amongst the caches.

Quote
What would happen as process scalings kick in? Would L2 shrink more rapidly than the bus?

That's what I'm curious about.
SRAM compacts pretty well with process. Logic less so. Interconnect beyond the lowest levels scales more slowly, and the higher layers are at higher geometries.
It might depend on just where the bus is running.

There might be a design-specific inflection point where the work in compacting all the signal lines balances with the challenges of running it at speed versus the space savings of keeping the L2 physically small and the desire to have more L2.

Posted by Jawed on Tuesday, 25-Nov-08 15:06:56 UTC
Quoting 3dilettante
Cell's EIB doesn't route over the local stores. The EIB has a fair amount of dedicated logic sitting right in the center of the die.
And probably "saves space" by having EIB control logic passed over by interconnects. All I'm saying is that in terms of the die as a whole, "space saving" by laying a bus over logic is normal. Now it might be that laying a bus over RAM is the easiest configuration. Don't know, I suppose it's a question of the impact of repeater logic islands on L2 latency (due to increasing the radius of L2). Jawed

Posted by 3dilettante on Tuesday, 25-Nov-08 16:11:51 UTC
Quoting Jawed
And probably "saves space" by having EIB control logic passed over by interconnects.
I wouldn't expect them to route completely around the EIB logic, since the EIB logic deals with the interconnect directly. I don't have a high res shot of Cell's EIB section, but it looks like part of the path the signals go through is in the logic block that takes up die space.

Routing the interconnect over the dedicated EIB logic is a bit different than lofting it over non-dedicated silicon.
Perhaps Cell also routes some of its bus over other silicon, I haven't seen a diagram for that.

Quote

All I'm saying is that in terms of the die as a whole, "space saving", by laying a bus over logic is normal.
But it's also not required. The fact that IBM aggregated the EIB's logic in one place instead of distributing it indicates there are other considerations.

Quote
Now it might be that laying a bus over RAM is the easiest configuration. Don't know, I suppose it's a question of the impact of repeater logic islands on L2 latency (due to increasing the radius of L2).
Given the likely size of the L2 caches and the relatively simple ring bus scheme, I'm not sure there would be enough to be prohibitive.
My question is what happens when the SRAMs shrink, and if the ring bus will scale accordingly or if Intel will relax a bit and allow the cache capacity to go up to take pressure off of the ring bus designers.

Posted by Jawed on Tuesday, 25-Nov-08 17:51:04 UTC
Quoting 3dilettante
I wouldn't expect them to route completely around the EIB logic, since the EIB logic deals with the interconnect directly. I don't have a high res shot of Cell's EIB section, but it looks like part of the path the signals go through is in the logic block that takes up die space.
I've got a high res shot of Cell, but I don't know how it would help you to discern anything about the routing of the bus lines...
Quote
Routing the interconnect over the dedicated EIB logic is a bit different than lofting it over non-dedicated silicon. Perhaps Cell also routes some of its bus over other silicon, I haven't seen a diagram for that.
A cache is also "non-dedicated silicon", so the question of whether any "area saving" applies is moot.
Quote
But it's also not required.
I'm merely saying that it seems to be normal to route an interconnect over un-related logic.
Quote
My question is what happens when the SRAMs shrink, and if the ring bus will scale accordingly or if Intel will relax a bit and allow the cache capacity to go up to take pressure off of the ring bus designers.
If the cache increases in capacity then that affects latency. Clearly, it's too early to tell how sensitive to cache latency Larrabee will be. Arguably, as the number of cores rises, any slight increase in L2 latency will be overwhelmed by ring-induced cache-coherency latency, and other scaling factors. So maybe it doesn't matter so much? In the end it seems extremely unlikely to me that the bulk of the ring bus in Larrabee is formally restricted to the area of the die occupied by L2 - I don't think there's much value in assigning a 1:1 scaling question-mark over these two components (RAM + bus) of this subsystem. If the bus overspills the RAM in future it will lower the density of the logic it flies over - but this is just another version of the scaling problem for physical I/O, where analogue stuff scales poorly. Jawed

Posted by 3dilettante on Tuesday, 25-Nov-08 18:09:15 UTC
Quoting Jawed
A cache is also "non-dedicated silicon", so the question of whether any "area saving" applies is moot.
That was my point. Cell has a contiguous area of the die devoted to the ring bus and its logic. I'm not privy to the details of the design, but I have a hard time accepting it is wholly made up of repeater blocks.

Quote
If the cache increases in capacity then that affects latency.
The dominant factor for that is area, and latency roughly scales with the square root of the physical area of the cache.
If the SRAMs shrank, we'd expect better latency.
If the cache capacity were expanded to give roughly equivalent area, we'd have the same latency with more capacity.
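To put rough numbers on that rule of thumb, here is a minimal Python sketch. It assumes only the relationship stated above (wire-dominated latency proportional to the square root of cache area); the function names and the 2x density gain are hypothetical and chosen purely for illustration, not taken from any Intel data.

```python
import math


def area_ratio_after_shrink(capacity_ratio: float, density_gain: float) -> float:
    """New cache area relative to the old one.

    capacity_ratio: new capacity / old capacity.
    density_gain:   how many times denser the SRAM is on the new process
                    (hypothetical input, not a measured figure).
    """
    if capacity_ratio <= 0 or density_gain <= 0:
        raise ValueError("ratios must be positive")
    return capacity_ratio / density_gain


def relative_latency(area_ratio: float) -> float:
    """Rule of thumb only: latency scales with sqrt(area)."""
    return math.sqrt(area_ratio)


# Case 1: same capacity, SRAM twice as dense -> half the area.
shrunk = relative_latency(area_ratio_after_shrink(1.0, 2.0))

# Case 2: capacity doubled to fill the original area.
same_area = relative_latency(area_ratio_after_shrink(2.0, 2.0))

print(f"same capacity, half the area: {shrunk:.3f}x latency")
print(f"double capacity, same area:   {same_area:.3f}x latency")
```

Under these assumptions the first case gives about 0.71x the original latency and the second gives 1.0x, which is the trade-off described above. The sketch ignores every fixed cost of a real cache access.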

Quote
In the end it seems extremely unlikely to me that the bulk of the ring bus in Larrabee is formally restricted to the area of the die occupied by L2 - I don't think there's much value in assigning a 1:1 scaling question-mark over these two components (RAM + bus) of this subsystem.
Perhaps I'm reading too much into the part of the slides that said that the ring bus is physically layered on top of the L2.

Quote
If the bus overspills the RAM in future it will lower the density of the logic it flies over
It might require the redesign or rerouting of all the logic it flies over, possibly at the expense of poorer density in logic that already scales worse than SRAM.
Depending on how large an L2 tile is compared to its directly linked compute core, the penalty may be worse if the logic expands.

The SRAMs might not require too many additional layers for their signalling, the more complex logic of the cores might have uses for the interconnect at the altitude of the ring bus, plus whatever margin of safety is needed to keep both layers from interfering with one another.

Posted by Jawed on Tuesday, 25-Nov-08 19:46:35 UTC
Quoting 3dilettante
That was my point. Cell has a contiguous area of the die devoted to the ring bus and its logic. I'm not privy to the details of the design, but I have a hard time accepting it is wholly made up of repeater blocks.
It's a centrally managed bus, so it's definitely more than repeaters. http://www.ibm.com/developerworks/power/library/pa-expert9/
Quote
The dominant factor for that is area, and latency roughly scales with the square root of the physical area of the cache. If the SRAMs shrank, we'd expect better latency. If the cache capacity were expanded to give roughly equivalent area, we'd have the same latency with more capacity.
Did Core 2's L2 latency improve 65nm->45nm? http://www.extremetech.com/article2/0,2845,2208245,00.asp
Quoting Steve Fischer, Lead Architect on Penryn
The latency for accessing the L2 cache increased by 1 core clock cycle (from 14 to 15 clocks) due to the increase in size.
Though those L2s are so big in comparison with what we're talking about in Larrabee (or what's in Nehalem). Nehalem's 256KB L2 is 2 cycles faster than Conroe's 4MB L2, not much of an improvement considering it's 1/16th the size and on a smaller process. Obviously fiddly comparing these as other parameters have been adjusted at the same time.
Quote
It might require the redesign or rerouting of all the logic it flies over, possibly at the expense of poorer density in logic that already scales worse than SRAM. Depending on how large an L2 tile is compared to its directly linked compute core, the penalty may be worse if the logic expands. The SRAMs might not require too many additional layers for their signalling, the more complex logic of the cores might have uses for the interconnect at the altitude of the ring bus, plus whatever margin of safety is needed to keep both layers from interfering with one another.
Agreed with all that. I just suspect it's not a binary design decision, whether interconnects fly over non-interconnect logic. If you look at the Cell die shot the 2MB of SPE LS covers considerably more area than the EIB. I reckon EIB is 17% of the area of this 2MB of memory. This could indicate that the interconnects consume a tiny proportion of the area of L2 in Larrabee, particularly as the ring bus in Larrabee almost has "no protocol" so has little control logic associated with it. But, obviously, we can't see the ring interconnect fabric itself on Cell, so who knows, maybe it covers 8x the area of the EIB logic :???: Overall it seems the scaling question isn't a big deal. Famous last words. Jawed

Posted by 3dilettante on Tuesday, 25-Nov-08 21:02:48 UTC
Quoting Jawed
Did Core 2's L2 latency improve 65nm->45nm?


Though those L2s are so big in comparison with what we're talking about in Larrabee (or what's in Nehalem). Nehalem's 256KB L2 is 2 cycles faster than Conroe's 4MB L2, not much of an improvement considering it's 1/16th the size and on a smaller process. Obviously fiddly comparing these as other parameters have been adjusted at the same time.
I admit the rule of thumb is very simplistic, correlating the cross section of a cache with its response time, neglecting fixed costs of implementation, tag checking, and access order. The base assumption is that when all else is equal, the time it takes for a signal to reach the core from the furthest part of the cache sets a floor value for the cache's response time.

Penryn's cache is 25% smaller in area, but it is not correspondingly faster.

There are a number of possible reasons: Penryn's cache is nearly as long as Conroe's, its sleep transistors add latency, the fraction of the access time taken up by the L1 miss and L2 tag checks is not reduced, associativity was increased by 50%, and Penryn's pipeline targeted a higher clock rate.
Even if the wall-clock time for signals crossing the cache were reduced, the other fixed costs would not go away and the differing cycle target only roughly maps to wall-clock time.
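A toy Python model of that argument, splitting the access into a fixed part (miss detection, tag check and so on) and a wire part that follows the square-root rule of thumb. Every number below is a made-up placeholder, not a measured Conroe or Penryn figure, and the function name is hypothetical.

```python
import math


def l2_latency_cycles(fixed_ns: float, wire_ns: float,
                      area_ratio: float, clock_ghz: float) -> int:
    """Toy model: total time = fixed cost + wire delay scaled by sqrt(area).

    The result is rounded up to whole core clocks, since the cache
    answers on a cycle boundary.
    """
    wire = wire_ns * math.sqrt(area_ratio)
    return math.ceil((fixed_ns + wire) * clock_ghz)


# Hypothetical baseline: 3.0 ns fixed, 1.5 ns of wire delay, 3.0 GHz core.
before = l2_latency_cycles(fixed_ns=3.0, wire_ns=1.5, area_ratio=1.0, clock_ghz=3.0)

# Hypothetical successor: 25% less area, fixed costs unchanged, 3.2 GHz target.
after = l2_latency_cycles(fixed_ns=3.0, wire_ns=1.5, area_ratio=0.75, clock_ghz=3.2)

print(before, after)  # 14 14
```

With these placeholder values the smaller cache saves about 0.2 ns of wire delay, but the unchanged fixed cost and the faster clock leave the cycle count where it was. Real designs also change associativity, power gating and pipeline depth, none of which this sketch models.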

Larrabee's more modest clocks might give it more slack to play with when it comes to fiddling with cache capacity.

Posted by Squilliam on Wednesday, 26-Nov-08 13:09:54 UTC
How significant are Intel's claims about off-chip bandwidth? Do they have any significant bearing on GPGPU applications, or are they just relevant to computer graphics and, more importantly, console applications, which are limited to using much smaller buses than their computer counterparts?

Posted by bowman on Saturday, 29-Nov-08 01:03:11 UTC
Up until now everyone, including me, has assumed that the on-package IGP on low-end Nehalems will be based on the current IGP architecture. Then I randomly stumbled over this tidbit from April 2007:

Quote
# The integrated graphics will be DirectX 10 and offer GPGPU functions if the software used is able to address it -- Intel is currently working to make the software available.
# *Intel is basing the graphics core on a derivative of Intel Architecture (IA) that it uses on its CPUs* since the general purpose processing now done on a unified shader GPU core is very similar to that of a CPU. Intel says it’s capable of not only DX10 but OpenGL and GPGPU as well, although performance information isn’t yet available, but expect it to be still quite basic compared to a discrete solution.
http://www.bit-tech.net/news/2007/04/17/further_details_on_nehalem_idf_spring_2007/1

Eh? Obviously it'll be a Larrabee derivative of some sort; they're not going to design two x86 graphics architectures, are they? So, one 16-wide Larrabee core then?

Posted by MfA on Saturday, 29-Nov-08 10:57:39 UTC
The evidence is a bit thin; it could just be a misunderstanding. Any direct quotes or material from Intel?

