#76 | |
|
Member
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
|
[offtopic]
Quote:
But this is a bad strategy.
A) What is usually needed is
1) One to a few very powerful CPU cores, for code which does not parallelize well
2) A large number of weak cores, for massively parallel code
Very little code needs something like 6-12 relatively powerful cores. This is either too few or not enough.
B) Transistors are getting "almost free", so sacrificing single-thread performance to save transistors is just a bad tradeoff, especially when those saved transistors are only used to put more "semi-powerful" cores on the chip. Sacrificing single-thread IPC might however be reasonable in cases where it allows higher total performance through higher clock speed, or considerable power savings.
Single-thread performance is still very important, and now that single-thread improvements are harder to get, it just means CPU developers should concentrate more on it, not give up. I think Intel has the better strategy here. They are concentrating more on single-thread performance, and using SMT to also get improvements with multiple threads.
AMD's strategy with Fusion would also be very good, if they had just executed it correctly and designed a proper "high performance for single-thread" CPU core for it, instead of having to use either 4 outdated cores or 4 new semi-powerful "mini-cores".
Intel is also bringing the "many weak cores" into play with the Larrabee/Knights line; I wonder when they will release a single chip with both high-end x86 cores and Larrabee cores. [/offtopic] |
|
|
|
|
|
|
#77 |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
A microcoded gather has a lot of rather thorny problems that aren't fun to implement, verify, or validate on a CPU. The coded sequence solution provides a defined, bounded operation that can be implemented and verified much more easily.
The problem really is one of bounds. Consider that each entry in the gather could potentially point to a different PTE and that PTE may or may not be loaded into the TLBs or even the cache. And I cannot recall off the top of my head whether the PTEs can themselves be virtually/indirectly allocated, etc. So for 1 gather you could be looking at potentially 16+ TLB fills + page faults + memory accesses, etc. We're talking upwards of thousands of cycles in what would be, in the microcoded case, an atomic operation that has significant implications up and down the architecture and validation stack. If you look at errata for various processors, you are likely to find many entries associated with long, complicated atomic memory operations. By implementing it as a load->mask->fill->update instruction, the side effects are significantly restricted and the performance difference should be minimal in a modern core.
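To put a rough number on the "16+" figure, here is a minimal sketch that counts how many distinct pages a single gather can touch. The 4 KiB page size, the 4-byte element size and the function name `pages_touched` are assumptions made for the illustration, not something stated in the thread.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def pages_touched(base, indices, elem_size=4):
    """Count distinct pages covered by the elements of one gather.

    base      -- byte address of the gather base (hypothetical)
    indices   -- per-lane element indices
    elem_size -- bytes per element (assumed 4)
    """
    pages = set()
    for index in indices:
        addr = base + index * elem_size
        first_page = addr // PAGE_SIZE
        # An unaligned element can straddle a page boundary.
        last_page = (addr + elem_size - 1) // PAGE_SIZE
        pages.update(range(first_page, last_page + 1))
    return len(pages)


# Best case: 16 adjacent elements share one page.
dense = pages_touched(0, range(16))
# Bad case: every lane lands on its own page.
sparse = pages_touched(0, [lane * 1024 for lane in range(16)])
# Worse: an unaligned base makes every element straddle two pages.
straddling = pages_touched(4094, [lane * 2048 for lane in range(16)])
```

Each page counted here is a potential TLB miss, and in the worst case a page fault, which is why one 16-lane gather can already exceed 16 translations when elements straddle page boundaries.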
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
#78 |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
And as a follow up, doing it as a looped instruction sequence:
A: GATHER Y,X+([IMM]/W),Z
B: BNZ Z, A
has the benefit of allowing things like offloading to micro engines in the future if desired while not requiring it at the start. It is perfectly possible in the future to stick a microengine off an L1 or L2 and pass the instruction through to the microengine to do the whole gather. Given the common use cases for gather, having a small microengine with 16-64 (1-4 element vector aka RGBA/XYZ/XY/etc) cachelines would enable very fast gather generation.
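A toy software model of that two-instruction loop may make the control flow easier to follow. Everything here is an assumption for illustration: `gather_step` stands in for one execution of the GATHER instruction, `max_elems` is a hypothetical cap on how many lanes one execution completes, and the mask list plays the role of Z.

```python
def gather_step(dest, memory, indices, mask, max_elems=1):
    """One execution of the (hypothetical) partial GATHER.

    Loads at most max_elems pending lanes, clears their mask bits,
    and returns True if any lane is still pending.
    """
    done = 0
    for lane in range(len(mask)):
        if done == max_elems:
            break
        if mask[lane]:
            dest[lane] = memory[indices[lane]]
            mask[lane] = False  # lane completed, never redone
            done += 1
    return any(mask)


def gather(memory, indices, mask, max_elems=1):
    """Model of 'A: GATHER ...; B: BNZ Z, A'."""
    dest = [None] * len(indices)
    mask = list(mask)  # work on a copy, like a scratch mask register
    steps = 0
    while True:
        steps += 1                                    # A: GATHER
        if not gather_step(dest, memory, indices, mask, max_elems):
            break                                     # B: BNZ falls through
    return dest, steps


memory = [10, 20, 30, 40, 50]
result, steps = gather(memory, [4, 0, 2, 2], [True, True, False, True])
```

A faster implementation (or the microengine mentioned above) would correspond to a larger `max_elems`: the same loop still works, it just exits after fewer iterations.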
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
#79 | ||
|
Ohio frog
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172
|
Quote:
CMT did not deliver on this premise of "more cores within the same silicon and power budget". Trinity modules are smaller than 2 Star cores and more featured, but not by enough to make a significant difference (I made a rough measurement). If not for the better power management features in power-constrained environments and the use of new instructions, the Star cores are still better.
Imo it's the contrary: from an "industrial/production" pov the module approach actually offers less granularity than individual cores. Back in the day AMD could sell 1- and 3-core variations; they no longer can.
On the IGP integration side of the equation Intel is sadly ahead of AMD. AMD seems completely focused on fixing their modules, while the "uncore" progresses at a really low speed, if at all. And Anandtech seems to confirm that fast memory will make it into Haswell. AMD may lose here too. On the compute side of things Intel's IGP was already arguably better. AMD is mostly putting together its CPU parts and GPU parts, whereas Intel develops its APU as a whole. Maybe if AMD were not putting all their efforts into fixing their modules' performance they could do better here too.
In the meantime, a quick list of what they have postponed fixing:
The L3 (won't be done before at least 2014)
FP/SIMD performance (won't be done before at least 2014)
Support for AVX2 (won't be done before at least 2014)
Single-thread performance should catch up with the prior architecture maybe in 2013 with Steamroller.
Overall I fail to see how AMD could be in a worse situation if they had passed on CMT. They had proven solutions in front of them with SMT and the cache hierarchies of CPUs like Nehalem and Power7. They decided to come up with their own take and for me it failed. They should have reached the bitter and difficult conclusion as soon as BD launched (or not long after engineering samples were out) to push BD out (or scrap it) and start something new.
A 3-issue std CPU core which would include all the refinements they included in BD and then PD. I fail to understand how such a CPU would not completely outperform their previous architectures, and as such it would be closer to Intel's offerings. Such a CPU might have ended up bigger than both a Star core and half a BD/PD module, but by how much? I suspect not by much, not even enough to significantly change their costs. It may also be a bit more power hungry, but it might allow for better power management and turbo. You have more granularity: you could change clock speed and clock gate on a per-core basis vs a per-module basis (that's for the coarse-grained level). If they didn't/couldn't copy IBM's or Intel's approaches to the cache hierarchy, they might have come up with something akin to Jaguar, which looks saner. I can't see (or understand) why AMD, which is still doing great things (maybe while beating a dead horse...), could not successfully engineer something like that. At least they could fight Intel dual cores with tri-cores instead of quad-cores (better usage of salvage parts) and have a chance to actually look good. They are lagging Intel more and more.
All this sounds a bit like angst, but I believe that AMD can do so much better. The sad thing is imho that by 2014, when or if most of the CMT approach's pitfalls have been fixed (while still not bridging the gap with Intel, rather the contrary), and depending on the success of Windows 8 RT, they might be threatened by ARM64 CPUs. ARM is already further along the APU road than AMD, with its Mali/A15 CPU. They are going to end up between a rock and a hard place.
Quote:
EDIT: Oops, sorry, I just realized that we are indeed in the wrong thread to discuss this matter; sorry for the OT.
__________________
What's trying to be a bunch of presentations PS360 youtube channel Sebbbi about virtual texturing Tuned EADGCF and liking it :) Last edited by liolio; 13-Sep-2012 at 19:31. |
||
|
|
|
|
|
#80 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
|
|
|
|
|
|
|
#81 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,544
|
Quote:
The load, mask, loop-until-mask=0 sequence guarantees forward progress, doesn't require any internal state to be saved on preemption, and doesn't throw away work done by subsequent instructions already executed thanks to the OOO machinery. The OOO machinery can also overlap multiple independent gather loops. And as Aaron points out, there is nothing preventing Intel from recognizing the load/mask/branch idiom in future implementations and speeding it up. Cheers
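The forward-progress argument can be illustrated with a small simulation. This is a toy model under loud assumptions: a 4-element "page", a `PageFault` exception standing in for the hardware fault, and a handler that simply marks the page present. None of these names come from the thread. The point it tries to show is that the mask is the only state carried across an interruption, and that lanes already completed are never reloaded.

```python
PAGE_SIZE = 4  # elements per page; tiny on purpose (toy assumption)


class PageFault(Exception):
    def __init__(self, page):
        super().__init__(f"page {page} not present")
        self.page = page


def masked_gather_pass(dest, memory, present, indices, mask, stats):
    """One attempt at the gather; may be cut short by a fault."""
    for lane, pending in enumerate(mask):
        if not pending:
            continue  # finished on an earlier attempt, skip it
        page = indices[lane] // PAGE_SIZE
        if page not in present:
            raise PageFault(page)  # interrupted; the mask keeps progress
        dest[lane] = memory[indices[lane]]
        mask[lane] = False  # committed only after the load succeeded
        stats["loads"] += 1


def run_until_done(memory, indices):
    present = set()  # pages currently mapped (hypothetical OS state)
    dest = [None] * len(indices)
    mask = [True] * len(indices)
    stats = {"loads": 0, "faults": 0}
    while any(mask):  # the branch-on-mask loop
        try:
            masked_gather_pass(dest, memory, present, indices, mask, stats)
        except PageFault as fault:
            present.add(fault.page)  # "handler" maps the page, then retry
            stats["faults"] += 1
    return dest, stats


memory = list(range(100, 116))
values, stats = run_until_done(memory, [0, 5, 1, 13])
```

In this run each fault costs one retry, yet the number of successful loads equals the number of lanes, so no completed work is repeated.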
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
|
#82 | ||
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
Quote:
And do you also not think adding a branch-on-vector-mask instruction is a pretty big shift? |
||
|
|
|
|
|
#83 | |
|
Senior Member
|
Quote:
All the PTE/TLB related issues could happen just as easily with a load->mask->fill->update instruction. |
|
|
|
|
|
|
#84 | |
|
Senior Member
Join Date: Jun 2003
Posts: 2,570
|
Quote:
__________________
Aaron Spink speaking for myself inc. |
|
|
|
|
|
|
#85 |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
But AVX2 already specifies a gather with an arbitrary vector mask to merge the output. Are you now proposing a whole new set of instructions to manipulate and branch on these status registers? You want to turn it into LRBNI full stop?
|
|
|
|
|
|
#86 |
|
Senior Member
|
I think some kind of unification there is on the way.
|
|
|
|
|
|
#87 | |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,136
|
Quote:
Bulldozer was often described as a server chip. That makes sense; only, the power use is too high, leading to clocks that are too low, and on the desktop it faces Sandy Bridge, which humiliates it. The logical conclusion of your post is that we need a chip with a few powerful cores and many weak ones. This is not easy; even Intel CPU + Knights/Xeon Phi isn't quite there because they run two different x86 instruction sets. The strong+weak mix now exists in the Tegra 3 and the yet-to-launch Cortex A15 + Cortex A7 designs, but it's there to save milliwatts, not for performance. It also helps that the device ships with a custom Linux kernel with an adequate scheduler.
Workstations happily use up your 6 or 12 or more cores or threads, because that's what is easily available now. In fact, if the Piledriver incremental improvement is good enough I think AMD could pit dual-socket desktop boards against single-socket 2011. It's versatile: the same hardware can be used to run 40 Linux VMs, or to do video rendering or something.
"Weak cores" solutions have to get better (the Xeon Phi is a significant milestone and we'll see what Nvidia Maxwell is up to, as well as AMD Steamroller and post-Steamroller). But the software aspect is crucial (an 8-core, 12-core or 16-core machine has the advantage of running regular software, not specially crafted software).
I wonder about Intel's next crazy CPU with many powerful cores, the EX variants. They now seem to skip the generations that bring a new architecture but not a new process. So we have Westmere-EX, Ivy Bridge-EX and I suppose Broadwell-EX after that. Last edited by Blazkowicz; 15-Sep-2012 at 16:07. |
|
|
|
|
|
|
#88 |
|
Member
Join Date: Jan 2008
Posts: 460
|
Impressive specs on show... doubling of a lot of units, but why are sites saying the only performance doubling is the iGPU, plus a marked increase in idle power savings... while the CPU side is expected to see no more than 10% gains over Ivy Bridge on average?? It is a 95W part vs 77W... have we come to a point where clock speed is a limiting factor...? AFAIK Intel has not gone over 4GHz on Turbo for their quad-core parts... for a long time....
From LGA1366 to LGA1155... we saw the average clock speed go from 2.66/3.2GHz to 3.4/3.8GHz; can Haswell be limited by clock speed? The 22nm Ivy Bridge refresh did not clock much higher (by which I mean extreme overclocking - 5GHz) than Sandy even after delidding and replacing the heat paste... |
|
|
|
|
|
#89 | |
|
Member
Join Date: Aug 2011
Posts: 366
|
Quote:
Also, many of the Haswell changes scream "reduction in possible clock speed" to me. Even if it won't clock lower than Ivy, it will certainly clock lower than it would without those changes. Basically, Intel went: "Okay, let's make every pipeline stage 10% longer. What can you do with that?" |
|
|
|
|
|
|
#90 |
|
Member
Join Date: Jan 2008
Posts: 460
|
Why does Intel want to stop the clock?? Is this really, really the end of the MHz....? Everyone loved Sandy Bridge, and a lot more after seeing how weak the Ivy Bridge refresh was, but LGA1366/1156 to LGA1155 is really down to how fast Sandy Bridge can clock up.. if Haswell desktop quad-core parts don't follow that kind of clock-up... what is Intel's game plan for desktop users.... how long will it take x86 software to catch up with Haswell's doubled SIMD units?
|
|
|
|
|
|
#91 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
When you refer to "everyone" and "desktop users" I get the distinct impression that you're actually referring to overclockers, which are still somewhat of a niche, especially when only the highest ends of the product lines can really overclock to begin with. At stock speeds IB is an obvious win, nobody would say that it makes SB look better. That all said, we really don't know if Haswell's changes are forcing clock time to go up, especially vs SB which was on an older process. It's a given that some stages will take longer to check dependencies of and dispatch to those additional ports, but the clock time is only as fast as the slowest pipeline stage. And while CPU designers will do their best to make the stages run as close to the same speed as possible I doubt they really get it perfect, so who knows if there wasn't another slower stage that gave them this headroom. Furthermore, you'd expect IB would reduce cycle time requirements vs SB if only slightly, yet the overclocking potential was less most likely due to power density issues. So they may have had headroom that wasn't even accessible and at that point there's no reason not to trade that for perf/W. And Haswell may be better optimized for the process with regards to power distribution. So I wouldn't say that it's a given that it'll hit peak clocks below what SB could. |
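The "slowest pipeline stage" point can be sketched numerically. The stage delays below are made-up numbers chosen only for the illustration, and `max_clock_ghz` and `lengthen` are hypothetical helper names. The idea is that lengthening a stage which is not the critical one leaves the achievable clock unchanged.

```python
def max_clock_ghz(stage_delays_ns):
    """Peak clock in GHz: the cycle time is set by the slowest stage."""
    return 1.0 / max(stage_delays_ns)


def lengthen(stage_delays_ns, stage, factor):
    """Return a copy with one stage's delay multiplied by factor."""
    delays = list(stage_delays_ns)
    delays[stage] *= factor
    return delays


# Hypothetical per-stage delays in nanoseconds; stage 1 is the critical one.
baseline = [0.20, 0.25, 0.22]

base_clock = max_clock_ghz(baseline)
# 10% more work in a stage that had slack: the clock does not move.
slack_used = max_clock_ghz(lengthen(baseline, 0, 1.1))
# 10% more work in the critical stage: the clock drops.
critical_hit = max_clock_ghz(lengthen(baseline, 1, 1.1))
```

Whether Haswell's extra ports land in a stage with slack or in the critical one is exactly what cannot be known from outside, which is the uncertainty the post describes.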
|
|
|
|
|
|
#92 |
|
Member
|
Well, AMD had been the ones shouting about the increased use of CAD-created layouts. I think anyway, with process tech advancing it's inevitable that automation will be used more and more everywhere.
Maybe the relative lack of aggressive frequency increases is also an indication that Intel now uses a bit less custom logic? |
|
|
|
|
|
#93 | |
|
Member
Join Date: Jan 2006
Location: France
Posts: 197
|
Quote:
__________________
- I'm french. Sorry if you don't understand what i say - |
|
|
|
|
|
|
#94 |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,136
|
I don't remember any benchmark using the special abilities of Bulldozer (FMA4).
Software may be slow moving, but maybe the authors don't care unless the new stuff is on Intel CPUs. |
|
|
|
|
|
#95 |
|
Member
Join Date: Jan 2008
Posts: 460
|
Do you think Haswell's doubled SIMD units will rock with PC games..? Well.. I think an i5 4570K will clock between 3.6GHz and 4GHz and the i7 4770K will be between 3.8GHz and 4.2GHz... both capable of running stock 1866MHz RAM... I just pulled these numbers out of thin air... but at these kinds of clocks, where will Haswell stand?? Guesses, gentlemen?
It is smaller than the jump from LGA1366/1156 to Sandy Bridge... |
|
|
|
|
|
#96 |
|
Senior Member
|
There was a slide back then from the official BD presentation, showing rather impressive results from some OCL kernel with FMA4 support, but nothing more to date.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
#97 |
|
Member
Join Date: Oct 2003
Posts: 320
|
TDP for IVB is 77W and Haswell is 95W. Presumably they're talking about the top end desktop parts, and those don't necessarily have the highest powered GPUs attached so part of that ~18W increase has to go into the CPU. The revisions to the architecture itself, though sizable, can't be consuming all that power, so I'm expecting top clocks to be higher as well.
|
|
|
|
|
|
#98 | ||
|
Member
Join Date: Aug 2011
Posts: 366
|
Quote:
Quote:
For existing software, this will be the biggest gain in IPC since Core -> Core 2. However, I expect the IPC gain to be compensated by lower clocks. Last edited by tunafish; 16-Sep-2012 at 16:31. |
||
|
|
|
|
|
#99 | |
|
Member
Join Date: Jan 2008
Posts: 460
|
Quote:
77W to 95W TDP should account for some more clocks.... maybe 3.6GHz to 4GHz for the 4770K.... and the 4570K should take the speed of the present 3770K... 3.5GHz to 3.9GHz.. kind of sucky if that is all... Could we be waiting for Haswell-E parts with the 8-core SKU.. I don't understand why Intel doesn't want to go higher on the 95W Haswell quad desktop SKU.. why do we need double the iGPU performance on my gaming PC? I find it ironic that doing so will actually have a hand in killing off the desktop market.... it is like irony fox..../cries that profits from desktop are shrinking.../designs your next CPU around perf/watt and portability.. |
|
|
|
|
|
|
#100 |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,136
|
VRMs are moved onto the CPU package, I believe this accounts for most of the TDP increase.
This also means a piece of crap motherboard with a non-K high end CPU gets (even more) attractive. Expect creative chipset segmentation and annoying marketing of IGP tiers (well we've had these things in place already) |
|
|
|
|