#1251
Senior Member
Join Date: Oct 2002
Posts: 2,433

Quote:
So I really wouldn't understand it if AMD simply doubled the cache size (well, it doesn't say by how much the size was increased either, but that's the only sensible number I can think of).
Quote:

#1252
Member
Join Date: Apr 2007
Location: Australia
Posts: 644

Having more decode throughput will also increase pressure on the L1D. So what would be an ideal target: 32KB per core, or 48/64KB?
I am surprised there are no 256-bit FP ALUs.

#1253
Member
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 631

This is expected to be the real Fusion architecture, in which the CPU and GPU can share computational resources. Any news on this?

#1254
Ohio frog
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172

I have one question: since AMD has already split the instruction decoder, how much of a rework would it be for them to completely split the module (and thus give up on CMT)?
They could still use most of the innards of BD/PD/SR, right? I wonder because with Steamroller they won't address the issue with the L3, and they won't dual-thread the FP/SIMD scheduler either. The native SIMD width still lags its Intel counterpart, and Haswell will make things worse. The L2 is still suboptimal and slow. For AMD the L3 must not be a priority, as I guess they have already acknowledged that winning big contracts in the server realm with their current products is unlikely.
I wonder if they could go further. They have already said that they no longer want to fight Intel head to head (they can't anyway; it's unclear whether they have a choice, but that's not the point). As it seems they are no longer in a position to go for the high end (high-performance and server parts), would it be that terrible for them to deliver a real mid-range CPU (looking at the whole scale from embedded to server parts)? To give an idea, something like: "OK, we fight the Core i3 (dual core) with quad cores, but our quad core is actually not twice as big as Intel's dual core".
So I wonder if they could split the BD module and at the same time redesign the cache hierarchy with four cores in mind, a bit like the Jaguar cores that are supposed to scale up to 4 cores. That should be their "module". The cache hierarchy would look pretty much like the one in Jaguar: a shared L2 (bigger though; looking at Intel's i3 and i5, 4MB would do fine). We still don't have data, but I would not be surprised if the L2 in Jaguar (it would not be running at half speed) offers better overall characteristics than the one in the BD module. Just make a bigger one with a more robust L2 interface; they already have something to build upon (Jaguar). Starting at 4 cores sounds sane. They may move forward later on (once everything is OK, for example when the CPUs are no longer sucking bandwidth from the memory controllers through a straw).
So half a BD module would be pretty tiny (especially once they have two decoders; there are already two L1 data caches). I think that for a while they should not try to match Intel with regard to SIMD width. They should do as the Jaguar cores do and run instructions on 8-wide vectors at half speed. A lot of code and legacy apps won't use that for quite a while. (Either way, it's unclear whether even Excavator will fix that vs Intel's offering, not taking Haswell into account.) I feel that with Bulldozer, Piledriver and Steamroller they will already have made plenty of improvements over their Phenom II. In a sane set-up (with regard to the cache hierarchy and memory), I believe those cores would prove to be better than they look while stomping on each other's feet within a module.
__________________
What's trying to be a bunch of presentations PS360 youtube channel Sebbbi about virtual texturing Tuned EADGCF and liking it :) |

#1255
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553

Quote:
Separating everything else that is shared would be a ton more work, because those parts are big and deeply buffered in comparison to the decoder, which alternated between cores every cycle. Imagine what would go into duplicating the instruction cache and big fetch buffers, or the FPU with its huge execution window and triple issue with two FMA pipelines. They'd have to seriously rebalance everything to fit a similar transistor budget.
Quote:
What they need is a better L1D cache, but astonishingly there's no indication of that! Or have two L1.5D caches sitting between the L1D and the L2? Like around 64 or 128KB each, with latency in between that of the L1D and the L2? I dunno. But you can't just say they should take the L2 from Jaguar, or anything else for that matter, because until Jaguar is designed for > 4.2GHz speeds it isn't going to work.

#1256
Ohio frog
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172

Quote:
Indeed, you would have to scale down almost everything, basically rebuilding from scratch. It's still disheartening because, with all the improvements AMD made across the board, had they passed on CMT they might have had something pretty sexy from scratch. Going with something more standard, they might also have had more time to scale up something more akin to the Jaguar cache hierarchy. It's sad that they are stuck with that and it won't fly anytime soon (I mean, it's not like Steamroller is going to ship tomorrow or even the day after, nor do I expect it to look that sexy vs Haswell).

#1257
Member
Join Date: Oct 2003
Posts: 320

My guess is that Haswell's core will not be a dramatic change pipeline- and structure-wise over the SB core, though the new FMAC instructions will give a big FP boost. It seems more focused on the un-core, with its new cache structure, which is rumored to have four levels and be accessible to its GPU as well. I doubt Intel would risk or want to take the time to significantly overhaul the structure of the core with so many un-core modifications on deck.
Also, Intel alternates between its Oregon and Israeli design teams, which seem to have recently been trading off on tweaking un-core and core respectively, and I think their Oregon team is up. Their last CPU was Nehalem, and they left the Core2 guts designed by the Israeli team intact, focusing on overhauling the un-core by adding the on-die memory subsystem. The Israeli team then did a major overhaul of the core with Sandy Bridge, and my guess based on their history is that the Oregon team will largely design around the pipeline flow structure present in SB. Rumor has it that Haswell will have ~10% better performance at the same clocks over Ivy Bridge; since AMD claims 10-15% over each of its next few iterations, and I expect Steamroller to be closer to 15%, AMD might make up a small bit of lost ground in the next round. Excavator will give AMD some extra die space (even at the same process) to play with from the size benefits of automation on their FPU that they claimed, so they might be able to add some goodies like higher-associativity caches the round after that.
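To put a rough number on "a small bit of lost ground", here is a back-of-the-envelope sketch. It assumes per-generation gains simply compound, and it uses the rumored ~10% for Haswell and the 10-15% range AMD claims. The starting gap (`START_GAP`) is a purely hypothetical parameter for illustration, not a measured figure.

```python
def gap_after(initial_gap, leader_gain, chaser_gain, generations=1):
    """Leader/chaser performance ratio after per-generation gains compound."""
    ratio = initial_gap
    for _ in range(generations):
        ratio *= (1.0 + leader_gain) / (1.0 + chaser_gain)
    return ratio


if __name__ == "__main__":
    START_GAP = 1.5  # hypothetical: leader starts 1.5x ahead at equal clocks
    for chaser_gain in (0.10, 0.15):
        gap = gap_after(START_GAP, leader_gain=0.10, chaser_gain=chaser_gain)
        print(f"chaser +{chaser_gain:.0%} vs leader +10%: gap {gap:.3f}x")
```

Under these assumptions a 15% step against a 10% step shrinks the ratio by only about 4% per round, which is why "a small bit" seems like the right wording.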

#1258
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,070

Haswell should bring an L1 that can support AVX without the bandwidth constraint experienced with SB and IB. That's on top of promoting integer SIMD to 256 bits, adding gather, and adding FMA.
Steamroller promises to make its FPU a little smaller. For concurrent workloads, the lock elision and transactional memory support provide an opportunity for Haswell to scale its performance in heavily multithreaded integer workloads. AMD thus far has promised that it is not going to improve its cache or memory architecture much.
The increase in L1 Icache size is a change whose magnitude is not yet given. I'm not sure how many workloads were complaining about the 64KB on Bulldozer, however. Doubling the decoders sounds like an increase in the width of the instruction prefetch, or in the number of prefetches, is in order. The problem with expanding the L1 size is that the aliasing problem would worsen.
Idle thought: they could increase the associativity and cache block size to chip away at the index bits. Perhaps split the L1 internally with a pseudo-associative cache? That would be two 32-64KB halves/banks of 4 ways and 128-byte blocks. The associativity and block length would take down 2 bits of aliasing, then a rule that synonyms on the last bit are split between the halves. It sounds complicated, but I'm not certain a 128KB, 64-byte-line, 2-way cache is going to do much for this architecture. The decoupled prediction and fetch logic would help buffer the hiccups inherent to having larger lines and the variability of having a pseudo-associative cache.
The plus side to having two decoders, if they are fed, is that Steamroller might be able to brute-force through one known soft spot for Sandy Bridge, where its fetch bandwidth becomes constraining if the uop cache misses. The core that's consuming this instruction bandwidth really isn't that big a bruiser, though. The loop buffers and expanded predictor for Steamroller would take on significance because it sounds like, no matter what, the front end of the pipeline is going to be longer, and anything that keeps the rest of the core from feeling it is an improvement.
Sandy Bridge probably has a decently long pipeline, part of which is mitigated by the uop cache. The uop cache was hard to engineer, so that probably means AMD can't manage it yet. Loop buffers in the tradition of Nehalem would be consistent with AMD's multiyear lag behind the leader.
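For anyone who wants to play with the index-bit arithmetic behind the aliasing concern, here is a minimal sketch. It assumes a virtually indexed cache and 4KB pages (both assumptions, as are the helper names), and counts how many index bits sit above the page offset for a few of the configurations mentioned in this post.

```python
PAGE_BYTES = 4096  # assumption: 4KB pages


def log2_exact(n):
    """log2 of a power of two, as an int."""
    if n <= 0 or n & (n - 1):
        raise ValueError(f"{n} is not a power of two")
    return n.bit_length() - 1


def alias_bits(cache_bytes, ways, page_bytes=PAGE_BYTES):
    """Index bits lying above the page offset (0 if one way fits in a page)."""
    way_bytes = cache_bytes // ways
    return max(0, log2_exact(way_bytes) - log2_exact(page_bytes))


def set_index_bits(cache_bytes, ways, line_bytes):
    """Bits needed to select a set."""
    return log2_exact(cache_bytes // (ways * line_bytes))


if __name__ == "__main__":
    KB = 1024
    configs = [
        ("64KB, 2-way, 64B lines", 64 * KB, 2, 64),
        ("128KB, 2-way, 64B lines", 128 * KB, 2, 64),
        ("64KB half, 4-way, 128B lines", 64 * KB, 4, 128),
        ("32KB half, 4-way, 128B lines", 32 * KB, 4, 128),
    ]
    for name, size, ways, line in configs:
        print(f"{name}: {set_index_bits(size, ways, line)} set-index bits, "
              f"{alias_bits(size, ways)} above the page offset")
```

In this simplified model the number of bits above the page offset depends only on the bytes per way (capacity divided by associativity). A longer line reduces the set-index bits but widens the line offset by the same amount, so treat the output as a starting point for the discussion rather than a statement about the real design.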
__________________
Dreaming of a .065 micron etch-a-sketch. |

#1259
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553

Quote:
My guess is that fetch still wouldn't be a bottleneck that often, if it can really sustain 32B/cycle (for some reason Agner Fog's tests show it as much less; maybe they've fixed this?). That would give an average of 16B/cycle/core, and between core switching and really deep buffers you'd think it could maintain this. So it could eat quite a few larger instructions so long as it eventually balances out with smaller ones. And most of the bigger instructions would be executed on the shared FlexFP, which would probably be execution limited before being fetch limited. Of course there's still a fair bit of waste in fetch bandwidth due to branches entering after the start of, and exiting before the end of, 32B blocks.
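A rough way to see what 16B/cycle/core buys: divide the per-thread fetch bandwidth by the decode rate to get the average instruction length that fetch can sustain. This is only a sketch. The 32B/cycle figure is the one discussed above, while the decode rates passed in (4 per cycle per thread, or 2 per cycle per thread for a decoder that alternates between cores) are assumptions for illustration, not confirmed specifications.

```python
def sustainable_avg_length(fetch_bytes_per_cycle, active_threads,
                           decodes_per_cycle_per_thread):
    """Average instruction length (bytes) at which fetch just keeps decode fed."""
    per_thread_fetch = fetch_bytes_per_cycle / active_threads
    return per_thread_fetch / decodes_per_cycle_per_thread


if __name__ == "__main__":
    # Two threads sharing 32B/cycle, each decoding an assumed 4 instructions/cycle.
    print(sustainable_avg_length(32, 2, 4))  # 4.0 bytes per instruction
    # Same fetch, but an assumed average of 2 decodes/cycle/thread.
    print(sustainable_avg_length(32, 2, 2))  # 8.0 bytes per instruction
    # One thread alone on the module, assumed 4 decodes/cycle.
    print(sustainable_avg_length(32, 1, 4))  # 8.0 bytes per instruction
```

If the average instruction is shorter than the result, decode is the limit under these assumptions; if longer, fetch is. It ignores the partial-block waste from branches mentioned above.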

#1260
Senior Member

The 16B instruction fetch is not that much of an issue for Intel. Since Nehalem, there's an additional 4-entry buffer for storing fetched bytes from the i-cache that is apparently sufficient to keep the decoders busy. On the other hand, the 32B fetch in AMD's K10 was obviously overkill, but that doesn't mean it shouldn't be carried forward into an architecture that will finally benefit from it.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |

#1261
Member
Join Date: Oct 2003
Posts: 320

Quote:
For Haswell, the new FMAC and the associated framework to keep it fed could be a huge boost. AMD does seem to be rebalancing its CPU cores toward relatively more integer than floating-point performance, but it's not fabbing a massive GPU onto the same die for no reason. I think they want people to make use of the GPU when really heavy streams of floating-point calculations arise; the existing FPUs are sufficient to address legacy code and any sporadic floating-point math that might come up.
Quote:
Separating the usual ICache into 2 higher-associativity but smaller pieces seems like it might entail more complication in the Ifetcher, or even a split, which might be going far enough to defeat the point of the cluster-based approach. It also probably does less for single-threaded performance and power savings than a uop cache, since the latter's contents are much "closer" pipeline-wise to the execution units than the Icache. From Agner Fog's tests, having only 2-way associativity in its ICache hurts BD, especially since it's servicing two threads; it sounds like addressing the poor associativity would be a better first step than splitting it up. Whatever AMD did, reducing L1 misses by 30% is a lot...
Last edited by Raqia; 31-Aug-2012 at 04:08.

#1262
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,070

Quote:
The single-threaded case wouldn't change as much, but the dual-threaded case would give an advantage relative to an SB that misses the uop cache. It sounds brute-force and rather aggressive, so I may have been charitable in entertaining the thought. However, if the decoders are truly duplicated without degrading them, then the aggregate throughput in the case of doubles and microcode would be significantly better, since BD can't do two doubles at once and microcode blocks the front end for the other thread.

#1263
Member
Join Date: Apr 2007
Location: Australia
Posts: 644

From a time-to-market, and therefore effort, perspective, wouldn't reusing the existing decoder be quicker than designing a "new" 2-wide decoder? Would a 4-wide decoder make any meaningful difference to "regular consumer code"?

#1264
Member
Join Date: Oct 2003
Posts: 320

My guess is that the new split decoders might have something to do with the uop cache that was mentioned. The caches are meant to improve single-threaded performance by caching the output of the decode stage rather than its input, and their contents are tied to each core separately, so I'm guessing they saw a benefit to its implementation in separating the decoders.
The total lack of mention also implies that there isn't going to be any AVX2 support until Excavator. It does seem like they revised BD to support AVX at the last minute and it wasn't an ideal implementation; hopefully their implementation of scatter/gather in AVX2 will be up to snuff. Wild guess, but maybe they'll eventually alias some of the GPU's units for the CPU's FPU needs. I'm not sure if that's realistic or how the context switching would work...

#1265
Member
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264

Quote:
So the decoders will be either 3- or 4-way. 3 might be the best compromise.

#1266
Member
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264

Quote:
Better branch prediction will also decrease demand, as fewer instructions will be fetched. With a single decoder, if the buffer between ifetch and decode got full, ifetch had to wait/stall. And when the buffer held quite a few instructions and there was a branch misprediction, all of those instructions for the thread had to be flushed. With Steamroller there are on average fewer instructions waiting to be decoded, so when a branch misprediction occurs, fewer instructions are flushed. So with more decoders, the total number of instructions that have to be fetched is actually lower.
Last edited by hkultala; 01-Sep-2012 at 08:42.

#1267
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553

Quote:
But no, I didn't mean that more fetch bandwidth would be needed to get the same amount of work done, if that's what you thought I was saying.
Quote:
Quote:
Quote:
If you want to look at it as energy wasted instead of time, it probably wastes less if the data never got to leave the fetch buffers. You could say that wider decoders mean that fetch buffers don't need to be as large, but since there are already separate ones for each thread, you'd lose in the single-threaded case by making them smaller, since the single-threaded decode bandwidth isn't increasing. You could say the same thing for the post-decode buffer, depending on how robust it is: if it were a real cache it'd be worth relying on, but a loop buffer either works or doesn't, and pretty easily becomes completely useless if not executing a small enough loop. So I don't know if AMD will want to rely on it to guarantee a performance baseline.

#1268
Ohio frog
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172

There are quite a few sharp minds here, so I will dare a few questions.
AMD went for CMT on the premise that it would deliver 80% of the performance of 2 cores for 50% of the cost. I haven't made measurements (and comparing a Bulldozer module to previous AMD cores may not be an optimal comparison), but looking at a Trinity die and then at Llano seems to tell another story. AMD might improve the performance of their modules with Steamroller and then Excavator, but I would not bet on a significant reduction in the size of the module (their high-density libraries may do that, but that also applies to their other processor lines).
So what do you think of CMT? The premise was +80% performance for 50% more silicon, and it looks like what AMD will pull off is +100% performance for a 100% increase in silicon (which is a bit moot). Do you think the approach is a failure? If so, would you expect them to abandon it for their next brand-new architecture? Or do you think that if CMT were pushed further it could actually get closer to its premise? By pushed further I mean a module consisting of more than 2 cores.
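To make the comparison explicit, here is the premise and the feared outcome expressed as throughput per unit of silicon, normalized to a single core. This is only a sketch of the arithmetic in this post; the function name is made up for illustration.

```python
def perf_per_area(relative_perf, relative_area):
    """Throughput per unit area, both normalized to a single core (1.0, 1.0)."""
    return relative_perf / relative_area


if __name__ == "__main__":
    cases = {
        "single core": (1.0, 1.0),
        "CMT premise (+80% perf, +50% area)": (1.8, 1.5),
        "feared outcome (+100% perf, +100% area)": (2.0, 2.0),
    }
    for name, (perf, area) in cases.items():
        print(f"{name}: {perf_per_area(perf, area):.2f}")
```

The premise works out to about 1.2x the throughput per area of a single core, while +100% for +100% is exactly 1.0x, i.e. no better than simply placing two conventional cores side by side.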

#1269
Heteroscedasticitate
Join Date: Mar 2005
Posts: 2,354

Quote:
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do. |

#1270
Ohio frog
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172

Quote:
I made some quick (and rough) measurements of Star cores and Piledriver modules. I found that a PD module is ~93% of the area two Star cores would cover. That is without L2; the L2 sizes are a wash. By eye I would say that there is a bit less "glue" in a 2-module part than in a 4-core part, which may push the advantage further in favor of BD/PD. It ain't that bad: in power-constrained environments at least, Trinity offers anywhere between 110% and 120% of the performance of Llano. There are cases where Llano wins, but also cases where Trinity wins by a greater margin. So let's say CMT is a good idea: do you think that after Excavator AMD could go further and increase the number of cores within a module (like having 4 cores in a module)? Or are there constraints on how big they can scale the front end?
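Plugging those rough numbers together gives a crude feel for the combined effect. The sketch below divides the 110-120% performance range by the ~93% area figure. Note the loud assumption: it mixes chip-level performance (Trinity vs Llano) with core-level area (PD module vs two Star cores), so it is an order-of-magnitude illustration at best, and the names are made up.

```python
PD_MODULE_AREA_VS_TWO_STARS = 0.93   # rough measurement quoted above
TRINITY_PERF_VS_LLANO = (1.10, 1.20)  # power-constrained range quoted above


def perf_per_area_range(perf_range, relative_area):
    """Low and high performance-per-area ratios for a given area ratio."""
    low, high = perf_range
    return low / relative_area, high / relative_area


if __name__ == "__main__":
    low, high = perf_per_area_range(TRINITY_PERF_VS_LLANO,
                                    PD_MODULE_AREA_VS_TWO_STARS)
    print(f"perf per area vs two Star cores: {low:.2f}x to {high:.2f}x")
```

That lands at roughly 1.18x to 1.29x under these assumptions, in the same neighborhood as the original CMT pitch, though clocks, process and the rest of the chip are not separated out.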

#1271
Senior Member

AMD already refers to the new Jaguar quad-core architecture, together with its shared L2, as a "Compute Unit". If they push Jaguar a bit beyond the pure mobile concept, it would fit very well in a server/WS envelope, with some sort of scalable interconnect and cache/memory infrastructure.

#1272
Member
Join Date: Oct 2003
Posts: 320

An interesting hearsay update about AMD's CPU development:
VR-Zone Article
The rumored gains are nice, but AMD's main problem has always been execution. Keller being back in charge is a very hopeful change though.
Last edited by Raqia; 07-Sep-2012 at 15:20.

Tags: amd, blewdozer, oh well, patents