#1 |
|
Junior Member
Join Date: Sep 2012
Location: Chapel-Hill, NC
Posts: 17
|
Hello,
Reposting my thread from the Intel OpenCL forums here: I have been testing Intel's OpenCL SDK for heterogeneous computing with the HD2500 iGPU. I ran a few benchmarks to test the memory bandwidth of both the CPU and iGPU devices. Here are the results:

---------------------------------------------------------
1. Memory Read [Single]: All threads read from a single physical address.
   CPU - 70 GB/s; iGPU - ~5 GB/s
2. Memory Read [Linear]: Threads read data from sequential memory addresses according to their thread id.
   CPU - 50 GB/s; iGPU - 5.8 GB/s
3. Memory Read [Uncached]: The reads are offset so that cache thrashing is maximized.
   CPU - 5.8 GB/s; iGPU - 4.5 GB/s
4. Memory Write [Linear]: Threads write to sequential memory addresses.
   CPU - 60 GB/s; iGPU - 1.3 GB/s
---------------------------------------------------------

Using the vec4 datatype on the CPU gives the maximum bandwidth, which is what the optimization guide recommends too. On the GPU, however, I get the same bandwidth for all datatypes.

A few questions I have:

a) How is the iGPU's shader core (EU) laid out? I know that it has 4 ALUs, but do they work on different threads (OpenCL threads, i.e. work-items) or only on one thread, like the VLIW4 unit in previous AMD GPUs?

b) Why is the iGPU's access to global memory crippled compared to the CPU's? OK, the CPU has big caches, but doesn't IVB have an L1, L2, L3 hierarchy too? This is nearly equal to PCIe transfer speeds; in that case I have much better options for CPU+GPU compute. Btw, I also tested its bandwidth to the per-compute-unit shared memory (part of the L3 cache) and got around 20 GB/s. This seems okay.

c) What is the best way to share data between the CPU and GPU that gives the maximum memory bandwidth?
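For readers who want to picture the read patterns in the benchmark list above, here is a minimal Python sketch of the element index each work-item might touch and of the GB/s arithmetic. It is not the poster's benchmark code, which was not posted. The function names, the 4-byte element size, the 64-byte cache line and the stride of 16 are all assumptions made for illustration.

```python
# Hypothetical sketch of the access patterns; not the original benchmark.

def single_index(gid: int) -> int:
    """Memory Read [Single]: every work-item reads element 0."""
    return 0

def linear_index(gid: int) -> int:
    """Memory Read [Linear] / Memory Write [Linear]: work-item gid touches element gid."""
    return gid

def strided_index(gid: int, stride: int, n: int) -> int:
    """Memory Read [Uncached]: neighbouring work-items land `stride` elements apart, wrapping at n."""
    return (gid * stride) % n

def cache_lines_touched(indices, elem_bytes: int = 4, line_bytes: int = 64) -> int:
    """Count distinct cache lines hit (assumed 4-byte elements, 64-byte lines)."""
    return len({(i * elem_bytes) // line_bytes for i in indices})

def bandwidth_gb_s(bytes_moved: int, seconds: float) -> float:
    """Bandwidth in GB/s (1 GB = 1e9 bytes) from bytes moved and elapsed time."""
    return bytes_moved / seconds / 1e9

if __name__ == "__main__":
    gids = range(16)  # 16 neighbouring work-items
    print(cache_lines_touched([single_index(g) for g in gids]))
    print(cache_lines_touched([linear_index(g) for g in gids]))
    print(cache_lines_touched([strided_index(g, 16, 1024) for g in gids]))
```

Under these assumed sizes, the strided pattern spreads 16 neighbouring work-items over 16 separate cache lines, while the linear pattern keeps them within one. That difference is the kind of effect the [Uncached] test is meant to expose.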
#2 |
|
Junior Member
Join Date: Sep 2012
Location: Chapel-Hill, NC
Posts: 17
|
Seems like no one is interested in using Intel iGPUs for compute. Come faster Haswell!
#3 | |
|
Member
Join Date: Dec 2009
Posts: 171
|
Quote:
IVB and AMD APUs should really shine in hybrid applications (i.e. mixed CPU/GPU tasks). However, they don't seem to offer some of the expected benefits (i.e. very low overhead for exchanging data between CPU and GPU, low-latency kernel execution, etc.). I have yet to hear of a concrete success story (at least in my field of interest). It looks like discrete GPUs still have a huge advantage, even though they talk to the CPU via a slow, high-latency PCIe bus.
#4 | |
|
Member
Join Date: Nov 2007
Posts: 945
|
Quote:
Currently Sandy Bridge-E and Bulldozer (AMD FX) are the only CPUs without integrated GPUs, and neither has sold that much (for desktops). It's entirely possible to release a game that is designed to use the integrated GPU for GPGPU. 500+ GFLOP/s solely for low latency GPGPU could for example be used to improve physics simulation dramatically. |
#5 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,121
|
The use of specialized compute resources may not really catch on until the regular cores and specialized silicon have the tools and architecture to freely (transparently?) move tasks between them, and when the chips are architected such that an APU chip without a GPU is considered as functional as a CPU with the FPU broken.
There are too many SKUs with the IGP flipped off and still too many gotchas and kludges as of yet. AMD might reach this point simply because its CPUs won't be able to stand on their own. GCN appears to be moving to the point that the FP resources can serve multiple masters. Maybe there could be a point where the GPU portion is indeed inactive, but the SIMDs are not.
__________________
Dreaming of a .065 micron etch-a-sketch. |
#6 | ||||
|
Junior Member
Join Date: Sep 2012
Location: Chapel-Hill, NC
Posts: 17
|
Quote:
Quote:
Quote:
Quote:
|
||||
#7 |
|
Senior Member
|
Which version of the SDK? AFAIK it is still far from performance-optimized (using the AMD APP SDK on the Intel x86 cores is much faster!), and somebody mentioned getting a twofold speed increase from the 2013 SDK, which is currently in beta.
__________________
English is not my native tongue. Before flaming, please consider the possibility that I did not mean to say what you might have read from my posts.
Work | Recreation
Warning! This posting may contain unhealthy doses of gross humor, sarcastic remarks and exaggeration!
#8 |
|
Junior Member
Join Date: Sep 2012
Location: Chapel-Hill, NC
Posts: 17
|
I tried the SDK 2013 version and the two-fold speed increase IS there but only for the CPU device.
The GPU gives the same results as the 2012 SDK. Tested performance with LuxMark. |
#9 |
|
Member
Join Date: Dec 2009
Posts: 171
|
The old Intel CPU device was slightly slower than the AMD OpenCL CPU device. If the new one is two times faster, that is quite an impressive result.