22-Sep-2012, 02:27   #1
Priyadarshi
Junior Member

Join Date: Sep 2012
Location: Chapel-Hill, NC
Posts: 17
Heterogeneous computing using IVB?

Hello,

Reposting my thread from the Intel OpenCL forums here:

I have been testing Intel's OpenCL SDK for heterogeneous computing with the HD 2500 iGPU. I ran a few benchmarks to measure the memory bandwidth of both the CPU and iGPU devices. Here are the results:
---------------------------------------------------------------------------------------------------------------------------------
1. Memory Read [Single] : All threads read from a single physical address.
CPU - 70 GB/s; iGPU - ~5 GB/s

2. Memory Read [Linear] : Threads read from sequential memory addresses according to their thread id
CPU - 50 GB/s; iGPU - 5.8 GB/s

3. Memory Read [Uncached] : The reads are offset so that cache thrashing is maximized
CPU - 5.8 GB/s; iGPU - 4.5 GB/s

4. Memory Write [Linear] : Threads write to sequential memory addresses
CPU - 60 GB/s; iGPU - 1.3 GB/s
---------------------------------------------------------------------------------------------------------------------------------
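
For anyone who wants to reproduce these patterns, here is a minimal Python sketch of how each test might map a work-item id to the element index it touches, and how a GB/s figure can be derived from bytes moved and elapsed time. The actual kernels were not posted, so the function names, the stride value, and the column-order mapping used for the "uncached" case are all assumptions made for illustration.

```python
def single_index(gid, n):
    """Memory Read [Single]: every work item reads the same element."""
    return 0


def linear_index(gid, n):
    """Memory Read/Write [Linear]: work item gid touches element gid."""
    return gid % n


def strided_index(gid, n, stride=1024):
    """Memory Read [Uncached] (assumed mapping): neighbouring work items
    land 'stride' elements apart, so they do not share cache lines.
    Assumes n is a multiple of stride; the mapping then visits every
    element exactly once (a column-order walk over a rows x stride grid)."""
    rows = n // stride
    return (gid % rows) * stride + (gid // rows) % stride


def bandwidth_gb_per_s(bytes_moved, seconds):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9
```

With `n = 16` and `stride = 4`, the first work items in the strided case touch elements 0, 4, 8, 12, 1, and so on, whereas the linear case touches 0, 1, 2, 3. In a real kernel the stride would be chosen relative to the cache line size of the device under test.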

Using a vec4 datatype on the CPU gives the maximum bandwidth, which is what the optimization guide recommends too. On the GPU, however, I get the same bandwidth for all datatypes. I have a few questions:

a) How is the iGPU's shader core (EU) laid out? I know that it has 4 ALUs, but do they work on different threads (an OpenCL thread, i.e. a work item) or on only one thread, like the VLIW4 unit in previous AMD GPUs?

b) Why is the iGPU's access to global memory crippled compared to the CPU's? OK, the CPU has big caches, but doesn't IVB have an L1, L2, L3 hierarchy too? This is nearly equal to PCIe transfer speeds, in which case I have much better options for CPU+GPU compute.
Btw, I also tested its bandwidth to the per-compute-unit shared memory (part of the L3 cache) and got around 20 GB/s. This seems okay.

c) What is the best way to share data between the CPU and GPU that gives the maximum memory bandwidth?
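
As a rough sanity check on the PCIe comparison in (b), the theoretical one-direction bandwidth of an x16 link can be worked out from the per-lane transfer rate and the line encoding. This is only a sketch using the standard PCIe 2.0 and 3.0 figures; the function name is illustrative, and real transfers come in below these peaks.

```python
def pcie_gb_per_s(gt_per_s, payload_bits, total_bits, lanes=16):
    """Peak one-direction bandwidth of a PCIe link in GB/s.

    gt_per_s     -- raw transfer rate per lane in GT/s
    payload_bits -- data bits per encoded symbol group
    total_bits   -- total bits per encoded symbol group
    """
    return gt_per_s * payload_bits / total_bits / 8 * lanes


pcie2_x16 = pcie_gb_per_s(5.0, 8, 10)     # PCIe 2.0: 5 GT/s, 8b/10b encoding
pcie3_x16 = pcie_gb_per_s(8.0, 128, 130)  # PCIe 3.0: 8 GT/s, 128b/130b encoding

print(f"PCIe 2.0 x16 peak: {pcie2_x16:.2f} GB/s")
print(f"PCIe 3.0 x16 peak: {pcie3_x16:.2f} GB/s")
```

The peaks come out at 8 GB/s for PCIe 2.0 x16 and roughly 15.75 GB/s for PCIe 3.0 x16, so the 4.5 to 5.8 GB/s iGPU read figures above sit below even the PCIe 2.0 peak.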

25-Sep-2012, 23:14   #2
Priyadarshi
Junior Member

Join Date: Sep 2012
Location: Chapel-Hill, NC
Posts: 17

Seems like no one is interested in using Intel iGPUs for compute. Come faster Haswell!

27-Sep-2012, 08:20   #3
Dade
Member

Join Date: Dec 2009
Posts: 171

Quote:
Originally Posted by Priyadarshi
Seems like no one is interested in using Intel iGPUs for compute. Come faster Haswell!

Hehe, running OpenCL applications on the current generation of Intel GPUs looks more like a technical exercise than a performance win (when you can use any of the discrete AMD GPUs and run a zillion times faster).

IVB and AMD APUs should really shine on hybrid applications (i.e. mixed CPU/GPU tasks). However, they don't seem to offer some of the expected benefits (e.g. very low overhead for exchanging data between CPU and GPU, low-latency kernel execution).

I have yet to hear a concrete success story (at least in my field of interest). It looks like discrete GPUs still have a huge advantage even though they talk to the CPU via a slow, high-latency PCIe bus.

27-Sep-2012, 10:46   #4
sebbbi
Member

Join Date: Nov 2007
Posts: 941

Quote:
Originally Posted by Priyadarshi
Seems like no one is interested in using Intel iGPUs for compute. Come faster Haswell!

I find integrated GPUs very interesting, as most current CPUs have integrated graphics: Intel Sandy Bridge, Ivy Bridge, lower-end Pentiums/Celerons based on Sandy/Ivy, AMD Llano, Trinity, Bobcat, and the forthcoming Haswell and Steamroller (etc.). Basically every gaming PC has a discrete graphics card, and thus the integrated GPU could be fully used for GPGPU. Integrated GPUs are better suited for (non-graphics related) GPGPU than discrete ones, because they share memory (and even caches) with the CPU. There's no need to transfer data/commands over PCIe. Using a discrete GPU for both compute and graphics rendering causes pretty big latencies for compute tasks (as both tasks share a ring buffer on most APIs/hardware). An integrated GPU provides a separate ring buffer solely for GPGPU compute tasks (thus reducing the compute latency).

Currently Sandy Bridge-E and Bulldozer (AMD FX) are the only CPUs without integrated GPUs, and neither has sold that much (for desktops). It's entirely possible to release a game that is designed to use the integrated GPU for GPGPU. 500+ GFLOP/s solely for low-latency GPGPU could, for example, be used to improve physics simulation dramatically.

27-Sep-2012, 17:52   #5
3dilettante
Senior Member

Join Date: Sep 2003
Location: Well within 3d
Posts: 4,102

The use of specialized compute resources may not really catch on until the regular cores and specialized silicon have the tools and architecture to freely (transparently?) move tasks between them, and when the chips are architected such that an APU chip without a GPU is considered as functional as a CPU with the FPU broken.
There are still too many SKUs with the IGP flipped off, and too many gotchas and kludges as of yet.

AMD might reach this point simply because its CPUs won't be able to stand on their own. GCN appears to be moving to the point that the FP resources can serve multiple masters. Maybe there could be a point where the GPU portion is indeed inactive, but the SIMDs are not.
__________________
Dreaming of a .065 micron etch-a-sketch.

27-Sep-2012, 19:33   #6
Priyadarshi
Junior Member

Join Date: Sep 2012
Location: Chapel-Hill, NC
Posts: 17

Quote:
Originally Posted by Dade
IVB and AMD APUs should really shine on hybrid applications (i.e. mixed CPU/GPU tasks). However, they don't seem to offer some of the expected benefits (e.g. very low overhead for exchanging data between CPU and GPU, low-latency kernel execution).

Exactly! I guess I had high expectations for memory bandwidth, but it seems it will take a few years to get truly unified and fast memory. AMD is committed to HSA and might get there faster, but Intel also seems to be moving fast enough!

Quote:
Originally Posted by Dade
I have yet to hear a concrete success story (at least in my field of interest). It looks like discrete GPUs still have a huge advantage even though they talk to the CPU via a slow, high-latency PCIe bus.

If a success story means faster performance than a discrete GPU, then it might never happen. Discrete GPUs have access to a much larger power budget, so they can afford to throw more and more flops at a problem, and memory bandwidth scales much more slowly for APU memory controllers than for GDDR. It is (and will be) an unfair comparison. I am currently looking into libraries like Embree/LuxMark and at using the complete APU to get better performance than the CPU alone.

Quote:
Originally Posted by sebbbi
Basically every gaming PC has a discrete graphics card, and thus the integrated GPU could be fully used for GPGPU. Integrated GPUs are better suited for (non-graphics related) GPGPU than discrete ones, because they share memory (and even caches) with the CPU.

True. The way I look at it is that instead of a CPU, we now have an APU, which would help in applications like fast BVH building for collision detection and ray tracing. I really don't like the purely visual physics (PhysX) that doesn't interact much with the gameplay.

Quote:
Originally Posted by 3dilettante
The use of specialized compute resources may not really catch on until the regular cores and specialized silicon have the tools and architecture to freely (transparently?) move tasks between them, and when the chips are architected such that an APU chip without a GPU is considered as functional as a CPU with the FPU broken.

I don't think transparency is the barrier to achieving good performance from an APU; of course, it would be nice to have it sometime in the future. OpenCL provides a pretty neat programming interface for heterogeneous devices: we can write applications today that should scale well with improvements in memory bandwidth.

06-Oct-2012, 14:14   #7
CarstenS
Senior Member

Join Date: May 2002
Location: Germany
Posts: 2,842

Quote:
Originally Posted by Priyadarshi
I have been testing Intel's OpenCL SDK for heterogeneous computing with the HD 2500 iGPU.

Which version of the SDK? AFAIK it is still not very performance-optimized (using the AMD APP SDK on the Intel x86 cores is much faster!), and somebody mentioned he got a twofold speed increase from using the 2013 SDK that is currently in beta.
__________________
English is not my native tongue. Before flaming please consider the possibility that I did not mean to say what you might have read from my posts.
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!

23-Oct-2012, 04:11   #8
Priyadarshi
Junior Member

Join Date: Sep 2012
Location: Chapel-Hill, NC
Posts: 17

I tried the 2013 SDK and the twofold speed increase IS there, but only for the CPU device.
The GPU gives the same results as with the 2012 SDK. I tested performance with LuxMark.

23-Oct-2012, 10:14   #9
Dade
Member

Join Date: Dec 2009
Posts: 171

Quote:
Originally Posted by Priyadarshi
I tried the 2013 SDK and the twofold speed increase IS there, but only for the CPU device.
The GPU gives the same results as with the 2012 SDK. I tested performance with LuxMark.

The old Intel CPU device was slightly slower than the AMD OpenCL CPU device. If the new one is two times faster, that is quite an impressive result.