Intel Larrabee @ SIGGRAPH 2008

Monday 02nd June 2008, 09:25:00 AM, written by Arun

Starting in August, part of the shroud of mystery around Larrabee is going to dissipate: A paper called 'Larrabee: A Many-Core x86 Architecture for Visual Computing' will be presented at SIGGRAPH by its authors, which include Doug Carmean, Tom Forsyth, Michael Abrash, Pat Hanrahan and many others.

The paper's abstract describes Larrabee as using 'multiple in-order x86 CPU cores that are augmented by a wide vector processor unit, as well as fixed-function co-processors. This provides dramatically higher performance per watt and per unit of area than out-of-order CPUs on highly parallel workloads and greatly increases the flexibility and programmability of the architecture as compared to standard GPUs.'

Nothing there is revolutionary or new to us, but we'll definitely be looking forward to this. There's no promise that I/we will go to SIGGRAPH this year, but it's still relatively likely; plus, this likely won't be the only event where Intel presents Larrabee this year. It's worth pointing out that Larrabee will be competing head-on against NVIDIA and AMD's DX11 GPUs, not their current ones; sadly, it seems unlikely that either company will be willing to disclose anything substantial about their next-generation architectures until well into 2009.

[Thanks to nAo for the tip!]




Latest Thread Comments (531 total)
Posted by Gubbi on Friday, 22-Aug-08 12:13:23 UTC
Quoting aaronspink
The whole local store architecture of CELL is one of its biggest drawbacks. In general local stores significantly complicate programming as well.
It's certainly different.

If one looks at an SPE with 'normal' programming model glasses, an SPE is just a processor with a two-level register file, the local store forming the second level, and no cache. I'd suspect a future CELL would do well to have a big fat shared cache before the memory interface. I still don't think it beats processors with normal cache hierarchies.

Cheers

Posted by 3dilettante on Friday, 22-Aug-08 14:37:34 UTC
Quoting crystall
Well, the whole description of their software renderer screams "there's no free lunch".
The buzz prior to the release of details screamed something a little different.
I believe the memory model is a good one, but the naive parroting of sound bites about it was nearly as silly as the "OMG realtimeratracer!!!" fluff articles floating around the net.

Posted by nAo on Friday, 22-Aug-08 14:52:33 UTC
Quoting Gubbi
If one looks at an SPE with 'normal' programming model glasses, an SPE is just a processor with a two-level register file, the local store forming the second level, and no cache. I'd suspect a future CELL would do well to have a big fat shared cache before the memory interface. I still don't think it beats processors with normal cache hierarchies.
SPU patents describe such a cache, though none of the CELL variations out there have one, afaik.
On the other hand, with or without a cache, the programming model would still be the same, unless the SPU ISA gets extended.

Posted by Andrew Lauritzen on Friday, 22-Aug-08 18:05:53 UTC
Quoting TimothyFarrar
How about we take this to a practical example, say optimally sorting 16 million objects (say {sort key, object id}). Does the cache advantage (Larrabee) vs banking+local store (NVidia) argument still apply?
That's an extremely difficult question to answer in any sort of precise way you know ;) Not only does it depend entirely on the algorithm used (and indeed you might use a different algorithm for it on different architectures), but if you take it all the way to database-land where only IO matters (which is not going to be an unreasonable model in the future I think), the complexities are entirely dependent on your ability to touch non-local data as infrequently as possible. In this land, large caches/local stores win hands-down as increasing block sizes reduces the number of "passes" (in GPU parlance) over the data set.

The concept here is similar to many algorithms actually: you want to extract the "minimum" amount of parallelism out of your problem to keep all of the cores busy in general, and run the serial algorithm on the in-core data set. This is precisely how reductions, scans, segmented scans, sorts and many other primitives are implemented on parallel processors, and that's entirely what you're doing when you use CUDA's local store in most cases. It's just that the programming model has you looking out from inside some nested loops so it doesn't make it clear that you're really just writing SIMD code over a block of data in local store.

So while multi-banked memories like CUDA's local store are useful, so are bigger caches, more ALUs and any number of other things that you could spend the transistors on :).
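
To make the block-size argument concrete, here is a minimal Python sketch; it is an illustration only, not code from the post or from any real GPU or Larrabee implementation. It assumes a simple model in which fixed-size blocks are sorted serially "in-core" and the sorted runs are then merged, with each merge pass touching the whole data set once. The names `merge_passes` and `blocked_sort`, the `fan_in` parameter, and the block sizes are all hypothetical.

```python
import heapq
import math


def merge_passes(n_items, block_items, fan_in=2):
    """Count full passes over the data needed to merge the sorted blocks.

    Assumes each pass merges groups of `fan_in` sorted runs into one run.
    """
    runs = math.ceil(n_items / block_items)
    passes = 0
    while runs > 1:
        runs = math.ceil(runs / fan_in)
        passes += 1
    return passes


def blocked_sort(items, block_items):
    """Sort each block serially (the in-core step), then merge the runs.

    heapq.merge does one k-way merge here rather than repeated two-way
    passes; the result is the same, only the pass structure differs.
    """
    runs = [sorted(items[i:i + block_items])
            for i in range(0, len(items), block_items)]
    return list(heapq.merge(*runs))
```

Under these assumptions, 16M records with a 4096-record block need `merge_passes(2**24, 4096)` = 12 two-way passes, and doubling the block to 8192 records brings that down to 11. The {sort key, object id} records from the question can be passed to `blocked_sort` as tuples.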

Posted by TimothyFarrar on Friday, 22-Aug-08 21:42:27 UTC
Quoting Andrew Lauritzen
The concept here is similar to many algorithms actually: you want to extract the "minimum" amount of parallelism out of your problem to keep all of the cores busy in general, and run the serial algorithm on the in-core data set. This is precisely how reductions, scans, segmented scans, sorts and many other primitives are implemented on parallel processors, and that's entirely what you're doing when you use CUDA's local store in most cases. It's just that the programming model has you looking out from inside some nested loops so it doesn't make it clear that you're really just writing SIMD code over a block of data in local store.

So while multi-banked memories like CUDA's local store are useful, so are bigger caches, more ALUs and any number of other things that you could spend the transistors on :).
In the terms you have described, it seems to me that, for similar ALU capacity between Larrabee and, say, an NVidia-style GPU, Larrabee might only have a 1.5x to 2.0x size advantage in the in-core data set, and be at a disadvantage in terms of utilization of out-of-core data bandwidth. So I'm still skeptical of the idea that Larrabee is going to be a huge win. For the problems I would like to use Larrabee to solve, it still seems as if out-of-core bandwidth utilization is more important.

Perhaps I'm way off base here, but taking my other simple example, I don't really expect a huge win for Larrabee in overall time for sorting 16M elements on shipping hardware with similar ALU capacity.

However, there is one area in which it seems Larrabee might have quite an advantage, and that is general scatter where the scatter has some locality.

NVidia and ATI GPUs have a readable write-combined cache for each ROP/OM unit, right? It seemed as if NVidia at one time had plans to expose this surface cache in CUDA (.surf in the PTX spec), but that never materialized (and neither did programmable blending, for which it might have been used). If the Larrabee model (SIMD + scatter/gather + R/W caching) ends up being the best thing since sliced bread, it seems like other GPUs adopting an accessible surface cache might be a good evolution (to give bandwidth reduction on scatter). Of course latency could be far better on Larrabee, but the overall memory bandwidth reduction could be similar.
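
As a rough illustration of why a read/write cache could cut bandwidth for scatter with locality, here is a toy Python model. It is a sketch under loud assumptions, not a description of any real ROP cache, Larrabee cache, or CUDA feature: the function `line_writes`, the 64-byte line size, the LRU policy, and the rule that an uncached scatter costs one full line write are all hypothetical.

```python
from collections import OrderedDict


def line_writes(addresses, line_size=64, cache_lines=0):
    """Count line-sized writes to memory for a stream of scattered byte addresses.

    cache_lines == 0 models no write combining: each scatter is assumed to
    cost one line write. Otherwise a small LRU cache absorbs repeated hits
    to a line, which is written out only on eviction or at the final flush.
    """
    if cache_lines == 0:
        return len(addresses)
    cache = OrderedDict()  # line index -> None, ordered by recency of use
    writes = 0
    for addr in addresses:
        line = addr // line_size
        if line in cache:
            cache.move_to_end(line)  # hit: combined into the cached line
        else:
            if len(cache) == cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
                writes += 1
            cache[line] = None
    return writes + len(cache)  # flush the lines still held in the cache
```

In this model the scatter stream `[0, 4, 8, 64, 68, 0]` costs 6 line writes without a cache and 2 with a two-line cache, while a stream with no locality, such as `[0, 64, 128, 0]` against the same two-line cache, still costs 4.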

Posted by Davros on Sunday, 24-Aug-08 20:16:10 UTC
In case anybody hasn't read this:
http://www.pcpro.co.uk/news/220947/nvision-larrabee-like-a-gpu-from-2006.html

Posted by nAo on Sunday, 24-Aug-08 21:07:54 UTC
It might be a case of history repeating. David Kirk gave a similar interview 3 or 4 years ago (we simulated a unified architecture, etc.) and we know how that ended. Perhaps we should expect them to release something very close to Larrabee in 2010 ;)

Posted by Andrew Lauritzen on Monday, 25-Aug-08 01:16:55 UTC
Quoting Article Linked Above
"As [blogger and CPU architect] Peter Glaskowsky said, the 'large' Larrabee in 2010 will have roughly the same performance as a 2006 GPU from Nvidia or ATI."
... enough said about the intelligence of that article. ;)

Posted by Scali on Wednesday, 27-Aug-08 10:37:15 UTC
Some things that jumped out at me in that article:

Quote
"They've put out a certain amount of technical disclosure in the past five weeks," he noted, "but although they make Larrabee sound like it's a fundamentally better approach, it isn't. They don't tell you the assumptions they made. They talk about scaling, but they disregard memory bandwidth. They make it sound good, but we say, you neglected half a dozen things."
I think this is a pretty weird statement to make. The whole point of the information that Intel released was that they could scale better BECAUSE they reduced bandwidth requirements, among other things.

Quote
"Every GPU we make, we always consider this type of design, we do a reasoned analysis, and we always conclude no. That's why we haven't built that type of machine."
This might go for nVidia, but Intel works within a different environment. For example, nVidia didn't decide to design the G80 10 years ago. The time wasn't right. nVidia didn't have the expertise yet, there was no major API that would require it (nor would the rest of the PC be up to the task of driving it), and it would be impossible to manufacture with the state of chip manufacturing at the time.

Intel is ahead in chip manufacturing, and Intel has expertise in areas that nVidia doesn't have yet (and vice versa). So it might not be the right design for nVidia at this point. But it could be for Intel. It could also be the right design for nVidia some time in the future.

Quote
"ATI did not spend on things like PhysX and CUDA. But we believe that people value things beyond graphics. If you compare only on graphics, that's a relative disadvantage to us, but the notion of what you measure a GPU on will change and evolve," he argued
This is something that I've been saying as well. CUDA is the real power of G80 and beyond, and we have yet to see what AMD's answer to that will be.

Posted by Jawed on Thursday, 28-Aug-08 17:24:47 UTC
Apparently an EETimes-Asia article includes this snippet:
Quote
Ct was initially geared toward Intel’s general purposed Nehalem quad core chips, but is now up and running on its prototype 16-core Larrabee graphics processors.
But you have to be registered to read it. I came across it here: http://insidehpc.com/2008/08/27/two-paths-to-multicore-intel-and-ms-talk-about-their-tech-at-idf/

Prototype 16-core Larrabee, eh?...

Jawed


