Old 29-Aug-2012, 02:58   #1251
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,433
Default

Quote:
Originally Posted by fellix View Post
Well, there are basically two ways to reduce the cache miss-rate: higher associativity or larger size. While the first option sharply diminishes its effect after 4-ways, the size can be bumped up as long as it fits the die-size constraints and it doesn't impact the target access latency.
Yes, but certain forms of cache aliasing cannot be fixed with increased cache size at all. In fact, Bulldozer's problem with the Linux kernel, ASLR and shared libraries would get (very slightly) worse, not better, with doubled cache size (because now you'd need bits 12-15 to stay the same, not just bits 12-14).
So I really wouldn't understand if AMD simply doubled the cache size (well, it doesn't say how much the size was increased either, but that's the only sensible number I can think of).
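To make the bit arithmetic concrete, here is a minimal Python sketch. It assumes a conventionally indexed cache (plain, non-hashed set index), 4KB pages, and the 64KB, 2-way, 64-byte-line L1 instruction cache geometry discussed in this thread. The function name is made up for illustration and nothing here models AMD's actual hardware.

```python
def index_bits_above_page_offset(cache_bytes, ways, line_bytes, page_bytes=4096):
    """Address bits used for set selection that lie above the page offset.

    Assumes power-of-two sizes and a plain (non-hashed) set index.
    """
    sets = cache_bytes // (ways * line_bytes)
    line_bits = line_bytes.bit_length() - 1   # log2(line size)
    set_bits = sets.bit_length() - 1          # log2(number of sets)
    page_bits = page_bytes.bit_length() - 1   # log2(page size)
    first = max(page_bits, line_bits)         # lowest index bit outside the page offset
    return list(range(first, line_bits + set_bits))


# Current geometry (assumed): 64KB, 2-way, 64-byte lines
print(index_bits_above_page_offset(64 * 1024, 2, 64))    # [12, 13, 14]
# Hypothetical doubling with everything else unchanged
print(index_bits_above_page_offset(128 * 1024, 2, 64))   # [12, 13, 14, 15]
```

Under these assumptions, doubling the size while keeping 2 ways adds bit 15 to the set of bits that must match, which is the "very slightly worse" case described above.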
Quote:
The L1d cache issues are no less critical in this case. Bulldozer is hampered with higher miss-rate in the L1d than in K10 and I hope Steamroller alleviates this in some manner.
L1D improvements are notably absent in that presentation (apart from store-load forwarding if you want to count that in there). But I guess there's always hope...
Old 29-Aug-2012, 06:48   #1252
itsmydamnation
Member
 
Join Date: Apr 2007
Location: Australia
Posts: 644
Default

Also, having more decode throughput will increase pressure on the L1D as well. So what would be an ideal target: 32KB a core, or 48/64?

I am surprised at no 256-bit FP ALUs.
Old 29-Aug-2012, 14:06   #1253
fehu
Member
 
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 631
Default

This is expected to be the real Fusion architecture, in which the CPU and GPU can share computational resources. Any news on this?
Old 29-Aug-2012, 19:57   #1254
liolio
Ohio frog
 
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172
Default

I have one question: now that AMD has split the instruction decoder, how much of a rework would it be for them to completely split the module (and thus give up on CMT)?

They could still use most of the innards of BD/PD/SR, right?

I wonder because with Steamroller they won't address the issue with the L3, they won't dual-thread the FP/SIMD scheduler either, and the native SIMD width still lags its Intel counterpart; Haswell will make things worse. The L2 is still suboptimal and slow.

For AMD the L3 must not be a priority, as I guess they have already acknowledged that winning big contracts in the server realm with their current products is unlikely.
I wonder if they could go further. They have already said that they no longer want to fight Intel head to head (they can't anyway; it's unclear if they have a choice, but that's not the point).

As it seems that they are no longer in a position to go for the high end (high-performance and server parts), would it be that terrible for them to deliver a real mid-range CPU (looking at the whole scale from embedded to server parts)? To give an idea, something like "OK, we fight the Core i3 (dual core) with quad cores, but actually our quad core is not twice as big as Intel's dual core".

So I wonder if they could split the BD module and at the same time redesign the cache hierarchy with four cores in mind. A bit like the Jaguar cores, which are supposed to scale up to 4 cores. That should be their "module". The cache hierarchy would look pretty much like the one in Jaguar, with a shared L2 (bigger though; looking at Intel's i3 and i5, 4MB would do fine). We still don't have data, but I would not be surprised if the L2 in Jaguar (it would not be running at half speed) offers better overall characteristics than the one in the BD module. Just make a bigger one with a more robust L2 interface; they already have something to build upon (Jaguar). Starting at 4 cores sounds sane. They may move forward later on (once everything is OK, for example when the CPUs are no longer sucking bandwidth from the memory controllers through a straw).

So half a BD module would be pretty tiny (especially once they have two decoders; there are already two L1 data caches). I think that for a while they should not try to match Intel with regard to SIMD width. They should do as the Jaguar cores do and run instructions on 8-wide vectors at half speed. A lot of code and legacy apps won't use that for quite a while. (Either way, it's unclear whether even Excavator will fix that versus Intel's offerings, not even taking Haswell into account.)

I feel that with Bulldozer, Piledriver and Steamroller they will already have made plenty of improvements over their Phenom II.
In a sane setup (with regard to the cache hierarchy and memory), I believe those cores would prove that they are better than they look when stomping on each other's feet within a module.
Old 29-Aug-2012, 20:13   #1255
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
Default

Quote:
Originally Posted by liolio View Post
I have one question: now that AMD has split the instruction decoder, how much of a rework would it be for them to completely split the module (and thus give up on CMT)?
Why should they? The shared decoder was probably the biggest bottleneck for module sharing code. The post-decode buffer will also alleviate fetch bandwidth contention, although that shouldn't be nearly as big of an issue.

Separating everything else shared would be a ton more work, because they're big and deeply buffered in comparison to the decoder which alternated between cores every cycle. Imagine what would go into duplicating the instruction cache and big fetch buffers, or the FPU with huge execution window and triple-issue with two FMA pipelines.. They'd have to seriously rebalance everything to fit a similar transistor budget.

Quote:
Originally Posted by liolio View Post
I wonder because with Steamroller they won't address the issue with the L3, they won't dual-thread the FP/SIMD scheduler either, and the native SIMD width still lags its Intel counterpart; Haswell will make things worse. The L2 is still suboptimal and slow.
But duplicating/splitting the stuff that's still shared doesn't change any of that.. well maybe dedicated L2 caches could be faster, I don't know.

What they need is a better L1D cache but astonishingly there's no indication on that! Or have two L1.5D caches sitting between the L1D and the L2? Like at around 64 or 128KB each, with latency in between the L1D and L2? I dunno..

But you can't just say they should take the L2 from Jaguar, or anything else for that matter, because until Jaguar is designed for > 4.2GHz speeds it isn't going to work.
Old 29-Aug-2012, 20:40   #1256
liolio
Ohio frog
 
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172
Default

Quote:
Originally Posted by Exophase View Post
Why should they? The shared decoder was probably the biggest bottleneck for module sharing code. The post-decode buffer will also alleviate fetch bandwidth contention, although that shouldn't be nearly as big of an issue.

Separating everything else shared would be a ton more work, because they're big and deeply buffered in comparison to the decoder which alternated between cores every cycle. Imagine what would go into duplicating the instruction cache and big fetch buffers, or the FPU with huge execution window and triple-issue with two FMA pipelines.. They'd have to seriously rebalance everything to fit a similar transistor budget.
Oops, my bad, I was confused. I'm so eager to see that thing come together that I forgot about the fact that a lot of the front end is sized for the two "cores" (it's also amortized over two cores).
Indeed, you would have to scale down almost everything, basically rebuilding from scratch.

It's still disheartening, because with all the improvements AMD made across the board, had they passed on CMT they might have had something pretty sexy from scratch.
Going with something more standard, they might also have had more time to scale up something more akin to the Jaguar cache hierarchy. It's sad that they are stuck with this and that it won't fly anytime soon (I mean, it's not like Steamroller is shipping tomorrow or even the day after, nor do I expect it to look that sexy versus Haswell).
Old 29-Aug-2012, 22:42   #1257
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Default

My guess is that Haswell's core will not be a dramatic change pipeline and structure-wise over the SB core, though the new FMAC instructions will give a big FP boost. It seems more focused on the un-core with its new cache structure which is rumored to have four levels and be accessible to its GPU as well. I doubt Intel would risk or want to take the time to significantly overhaul the structure of the core with so many un-core modifications on deck.

Also, Intel alternates between its Oregon and Israeli design teams, which seem to have recently been trading off on tweaking un-core and core respectively, and I think their Oregon team is up. Their last CPU was Nehalem and they left the Core2 guts designed by the Israeli team intact, focusing on overhauling the un-core by adding the on-die memory subsystem. The Israeli team then did a major overhaul of the core with SandyBridge, and my guess based on their history is that the Oregon team will largely design around the pipeline flow structure present in SB.

Rumor has it that Haswell will have ~10% better performance at the same clocks over IvyBridge; since AMD claims 10-15% over each of its next few iterations, and I expect Steamroller to be closer to 15%, AMD might make up a small bit of lost ground in the next round. Excavator will give AMD some extra die-space (even at the same process) to play with from the size benefits of automation on their FPU that they claimed, so they might be able to add some goodies like higher associativity caches the round after that.
Old 30-Aug-2012, 04:40   #1258
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,070
Default

Haswell should bring an L1 that can support AVX without the bandwidth constraint experienced with SB and IB. That's on top of promoting integer SIMD to 256 bits, adding gather, and adding FMA.
Steamroller promises to make its FPU a little smaller.
For concurrent workloads, the lock elision and transactional memory support provide an opportunity for Haswell to scale its performance in heavily multithreaded integer.
AMD thus far has promised that it is not going to improve its cache or memory architecture much.

The increase in L1 Icache size is a change whose magnitude is not yet given. I'm not sure how many workloads were complaining about the 64KB on Bulldozer, however.
Double the decoders sounds like an increase in the width of the instruction prefetch or the number of them is in order.
The problem with expanding the L1 size is that the aliasing problem would worsen.

Idle thought:
They could increase the associativity and cache block size to chip away at the index bits.
Perhaps split the L1 internally with a pseudo-associative cache? That would be two 32-64KB halves/banks of 4-ways and 128 byte blocks. The associativity and block length would take down 2 bits of aliasing, then a rule that synonyms on the last bit are split between the halves.
It sounds complicated, but I'm not certain a 128KB 64-byte line 2-way cache is going to do much for this architecture.
The decoupled prediction and fetch logic would help buffer the hiccups inherent to having larger lines and the variability of having a pseudo-associative cache.
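As a rough way to compare the geometries floated in this idle thought, the following Python sketch counts how many set-index bits fall above a 4KB page offset. It assumes a plain, non-hashed index over power-of-two sizes; the function name and the list of configurations are illustrative only, and a real pseudo-associative design could index quite differently.

```python
def aliasing_bit_count(cache_bytes, ways, line_bytes, page_bytes=4096):
    """Count set-index bits that sit above the page offset (simplified model)."""
    way_bytes = cache_bytes // ways
    top = way_bytes.bit_length() - 1                      # first bit above the index
    low = max(line_bytes.bit_length() - 1,                # lowest index bit
              page_bytes.bit_length() - 1)                # lowest bit outside the page
    return max(0, top - low)


KB = 1024
configs = [
    ("64KB, 2-way, 64B lines", 64 * KB, 2, 64),
    ("128KB, 2-way, 64B lines", 128 * KB, 2, 64),
    ("128KB, 4-way, 128B lines", 128 * KB, 4, 128),
    ("64KB half, 4-way, 128B lines", 64 * KB, 4, 128),
    ("32KB half, 4-way, 128B lines", 32 * KB, 4, 128),
]
for label, size, ways, line in configs:
    print(f"{label}: {aliasing_bit_count(size, ways, line)} bit(s)")
```

In this simplified model the count depends only on the capacity of one way: a longer line trims the index at its low end (inside the page offset), while higher associativity or smaller banks trim it at the high end. Whether the proposed split would behave this way depends on indexing details that are not public.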

The plus side to having two decoders, if they are fed, is that Steamroller might be able to brute force through one known soft spot for Sandy Bridge, where its fetch bandwidth becomes constraining if the uop cache misses. The core that's consuming this instruction bandwidth really isn't that big a bruiser, though.

The loop buffers and expanded predictor for Steamroller would take on significance because it sounds like no matter what the front end of the pipeline is going to be longer, and anything that keeps the rest of the core from feeling it is an improvement. Sandy Bridge probably has a decently long pipeline, part of which is mitigated by the uop cache. The uop cache was hard to engineer, so that probably means AMD can't manage it yet. Loop buffers in the tradition of Nehalem would be consistent with AMD's multiyear lag behind the leader.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 30-Aug-2012, 07:22   #1259
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
Default

Quote:
Originally Posted by 3dilettante View Post
The plus side to having two decoders, if they are fed, is that Steamroller might be able to brute force through one known soft spot for Sandy Bridge, where its fetch bandwidth becomes constraining if the uop cache misses. The core that's consuming this instruction bandwidth really isn't that big a bruiser, though.
Doubling the decoders doesn't reduce fetch stalls, in fact the opposite problem occurs since the demand on fetch is now higher. The post-decode buffer, on the other hand, does decrease demand.

My guess is that the fetch still wouldn't be a bottleneck that often, if it can really sustain 32B/cycle (for some reason, Agner Fog's tests show it as much less, maybe they've fixed this?).. this would give an average of 16B/cycle/core, and between core switching plus really deep buffers you'd think it could maintain this. So it could eat quite a few larger instructions so long as it eventually balances out with smaller ones. And most of the bigger instructions would be executed on the shared FlexFP, which would probably be execution limited before fetch limited. Of course there's still a fair bit of waste in fetch bandwidth due to branches entering after the start of and exiting before the end of 32B blocks.
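The fetch budget above can be put into numbers with a small Python sketch. It assumes the 32 bytes/cycle figure is actually sustained and split evenly between two active cores; the per-core decode widths of 3 and 4 are assumptions taken from speculation in this thread, not confirmed figures, and alignment waste and buffering are ignored.

```python
def bytes_per_core(fetch_bytes_per_cycle, active_cores):
    """Average fetch bandwidth each core sees when the fetcher is shared evenly."""
    return fetch_bytes_per_cycle / active_cores


def avg_length_budget(fetch_bytes_per_cycle, active_cores, decode_width):
    """Largest average instruction length (bytes) that keeps a decoder of the
    given width fully fed, ignoring alignment waste and buffering."""
    return bytes_per_core(fetch_bytes_per_cycle, active_cores) / decode_width


print(bytes_per_core(32, 2))                       # 16.0 bytes/cycle/core
for width in (3, 4):                               # hypothetical decode widths
    print(width, round(avg_length_budget(32, 2, width), 2))
```

With these assumptions a 4-wide decoder stays fed as long as instructions average 4 bytes or less, and a 3-wide one tolerates roughly 5.33 bytes, which is the sense in which larger instructions have to balance out with smaller ones.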
Old 30-Aug-2012, 15:12   #1260
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,816
Default

The 16B instruction fetch is not that much of an issue for Intel. Since Nehalem, there's an additional 4-entry buffer for storing fetched bytes from the i-cache that apparently is sufficient to keep the decoders busy. On the other hand, the 32B fetch in AMD's K10 was obviously overkill, but that doesn't mean it shouldn't be carried forward to an architecture that will finally benefit from it.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
Old 31-Aug-2012, 03:58   #1261
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Default

Quote:
Originally Posted by 3dilettante View Post
Haswell should bring an L1 that can support AVX without the bandwidth constraint experienced with SB and IB. That's on top of promoting integer SIMD to 256 bits, adding gather, and adding FMA.
Steamroller promises to make its FPU a little smaller.
For concurrent workloads, the lock elision and transactional memory support provide an opportunity for Haswell to scale its performance in heavily multithreaded integer.
AMD thus far has promised that it is not going to improve its cache or memory architecture much.
The transactional memory instructions are a big change, but these instructions could probably be added to cores without affecting the portions of the pipeline aimed at single threaded performance improvements. They could affect how Hyperthreading is implemented, and I eagerly await more details.

For Haswell, the new FMAC and the associated framework to keep it fed could be a huge boost. AMD does seem to be rebalancing its CPU Cores toward relatively more integer to floating-point performance, but it's not fabbing a massive GPU onto the same die for no reason. I think they want people to make use of the GPU when really heavy streams of floating-point calculations arise; the existing FPUs are sufficient to address legacy code and any sporadic floating point math that might come up.

Quote:
The increase in L1 Icache size is a change whose magnitude is not yet given. I'm not sure how many workloads were complaining about the 64KB on Bulldozer, however.
Double the decoders sounds like an increase in the width of the instruction prefetch or the number of them is in order.
The problem with expanding the L1 size is that the aliasing problem would worsen.

Idle thought:
They could increase the associativity and cache block size to chip away at the index bits.
Perhaps split the L1 internally with a pseudo-associative cache? That would be two 32-64KB halves/banks of 4-ways and 128 byte blocks. The associativity and block length would take down 2 bits of aliasing, then a rule that synonyms on the last bit are split between the halves.
It sounds complicated, but I'm not certain a 128KB 64-byte line 2-way cache is going to do much for this architecture.
The decoupled prediction and fetch logic would help buffer the hiccups inherent to having larger lines and the variability of having a pseudo-associative cache.

The plus side to having two decoders, if they are fed, is that Steamroller might be able to brute force through one known soft spot for Sandy Bridge, where its fetch bandwidth becomes constraining if the uop cache misses. The core that's consuming this instruction bandwidth really isn't that big a bruiser, though.

The loop buffers and expanded predictor for Steamroller would take on significance because it sounds like no matter what the front end of the pipeline is going to be longer, and anything that keeps the rest of the core from feeling it is an improvement. Sandy Bridge probably has a decently long pipeline, part of which is mitigated by the uop cache. The uop cache was hard to engineer, so that probably means AMD can't manage it yet. Loop buffers in the tradition of Nehalem would be consistent with AMD's multiyear lag behind the leader.
It does sound like AMD is steering slightly away from its cluster-based approach, mainly to give each core more single-threaded performance. Giving a decoded-uop cache to each core could take a significant chunk of silicon per core to implement, and sounds like it would naturally lead to splitting the decoder in two to better service each core's uop cache individually.

Separating the usual ICache into 2 higher-associativity but smaller pieces seems like it might entail more complication in the Ifetcher or even a split, which might be going far enough to defeat the point of the cluster-based approach. It also probably does less for single-threaded performance and power savings than a uop cache, since the latter's contents are much "closer" pipeline-wise to the execution units than the Icache. From Agner Fog's tests, having only 2-way associativity in its ICache hurts BD, especially since it's servicing two threads; it sounds like addressing the poor associativity would be a better first step than splitting it up. Whatever AMD did, reducing L1 misses by 30% is a lot...

Last edited by Raqia; 31-Aug-2012 at 04:08.
Old 31-Aug-2012, 04:27   #1262
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,070
Default

Quote:
Originally Posted by Exophase View Post
Doubling the decoders doesn't reduce fetch stalls, in fact the opposite problem occurs since the demand on fetch is now higher. The post-decode buffer, on the other hand, does decrease demand.
The proviso about feeding the decoders is basically a "what-if" where AMD doubles down on the decode duplication and gives them more raw bandwidth than what is currently available.
The single-threaded case wouldn't change as much, but the dual threaded case would give an advantage relative to an SB that misses the uop cache.
It sounds brute-force and rather aggressive, so I may have been charitable for entertaining the thought.

However, if the decoders are truly duplicated without degrading them, then the aggregate throughput in the case of doubles and microcode would be significantly better, since BD can't do two doubles at once and microcode blocks the front end for the other thread.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 31-Aug-2012, 05:01   #1263
itsmydamnation
Member
 
Join Date: Apr 2007
Location: Australia
Posts: 644
Default

From a time-to-market and therefore effort perspective, wouldn't reusing the existing decoder be quicker than designing a "new" 2-wide decoder? Would a 4-wide decoder make any meaningful difference to "regular consumer code"?
Old 01-Sep-2012, 03:13   #1264
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Default

My guess is that the new split decoders might have something to do with the uop cache that was mentioned. The caches are meant to improve single-threaded performance by caching the results of the decode stage rather than what comes before it, and the contents are tied to each core separately, so I'm guessing they saw a benefit to its implementation in separating the decoders.

The total lack of mention also implies that there isn't going to be any AVX2 support until Excavator. It does seem like they revised BD to support AVX at the last minute and it wasn't an ideal implementation, so hopefully their implementation of gather in AVX2 is up to snuff. Wild guess, but maybe they'll eventually alias some of the GPU's units for the CPU's FPU needs; not sure if that's realistic or how the context switching would work...
Old 01-Sep-2012, 08:08   #1265
hkultala
Member
 
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
Default

Quote:
Originally Posted by itsmydamnation View Post
From a time-to-market and therefore effort perspective, wouldn't reusing the existing decoder be quicker than designing a "new" 2-wide decoder? Would a 4-wide decoder make any meaningful difference to "regular consumer code"?
They are not going to use a 2-wide decoder; it would decrease single-thread performance too much, and would usually not be any better than the single 4-wide decoder for 2 threads.

So the decoders will be either 3- or 4-way. 3 might be the best compromise.
Old 01-Sep-2012, 08:26   #1266
hkultala
Member
 
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
Default

Quote:
Originally Posted by Exophase View Post
Doubling the decoders doesn't reduce fetch stalls, in fact the opposite problem occurs since the demand on fetch is now higher. The post-decode buffer, on the other hand, does decrease demand.
Demand is the number of instructions that need to be fetched to execute the program. Doubling the decoders doesn't really increase it. The decoders are not asking for more instructions. The fetcher gives the decoders a code stream, and they decode what they get.

Better branch prediction will also decrease demand, as fewer instructions will be fetched.

With a single decoder, if the buffer between ifetch and decode got full, ifetch had to wait/stall.
And when the buffer held quite a lot of instructions and there was a branch misprediction, all of these instructions for the thread had to be flushed out.

With Steamroller there are on average fewer instructions waiting to be decoded, so when a branch misprediction occurs, fewer instructions are flushed. So with more decoders, the total number of instructions that have to be fetched is actually lower.

Last edited by hkultala; 01-Sep-2012 at 08:42.
Old 01-Sep-2012, 09:51   #1267
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
Default

Quote:
Originally Posted by hkultala View Post
Demand is the number of instructions that need to be fetched to execute the program. Doubling the decoders doesn't really increase it. The decoders are not asking for more instructions. The fetcher gives the decoders a code stream, and they decode what they get.
This is just a matter of semantics.. nothing increases the amount of instructions that "need" to be fetched, but increasing decoder width (or increasing execution width, and so on) can increase the amount of fetch bandwidth it could utilize. In other words, it could move a bottleneck onto the fetch units more often.

But no, I didn't mean that more fetch bandwidth would be needed to get the same amount of work done, if that's what you thought I was saying.

Quote:
Originally Posted by hkultala View Post
Better branch prediction will also decrease demand, as fewer instructions will be fetched.
But the cost of a branch mispredict is pretty much uniform across the pipeline, so the relative demand stays the same..

Quote:
Originally Posted by hkultala View Post
With a single decoder, if the buffer between ifetch and decode got full, ifetch had to wait/stall.
Well yeah. And when the execution units can't find enough to execute from its OoO window the decode units stall, and so on. There's no question that more decoder width is better if the bandwidth wasn't good enough (that's kind of a trivial statement), the question is how often was there not enough bandwidth. But I don't think anyone is really questioning that in the dual-threaded scenario the decoders were a bottleneck some of the time.

Quote:
Originally Posted by hkultala View Post
And when the buffer held quite a lot of instructions and there was a branch misprediction, all of these instructions for the thread had to be flushed out.

With Steamroller there are on average fewer instructions waiting to be decoded, so when a branch misprediction occurs, fewer instructions are flushed. So with more decoders, the total number of instructions that have to be fetched is actually lower.
It doesn't matter if a fetched instruction was waiting to be decoded or if it was further along in the pipeline.. a branch misprediction must flush all instructions that were ever fetched after that branch, regardless of whether or not they've been decoded. So the branch misprediction penalty is not reduced by having more decode/execution/whatever resources. And you still have to fetch the same amount to get back to an equivalent amount of work done.
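A toy Python model can illustrate the point about flushing. It assumes fetch never stalls, buffers are unbounded, and the mispredicted branch resolves after a fixed number of cycles; all three are simplifications made for this sketch, and the function name and numbers are invented. In particular, a full fetch buffer stalling the fetcher is not modeled.

```python
def wrong_path_split(fetch_per_cycle, decode_per_cycle, resolve_cycles):
    """Split wrong-path instructions into (still in fetch buffer, already decoded).

    Toy assumptions: fetch never stalls, buffers are unbounded, and the
    mispredicted branch resolves after a fixed number of cycles.
    """
    fetched = fetch_per_cycle * resolve_cycles
    decoded = min(fetched, decode_per_cycle * resolve_cycles)
    return fetched - decoded, decoded


# Same fetch rate and resolution time, two hypothetical decode widths
for decode_width in (2, 4):
    waiting, decoded = wrong_path_split(8, decode_width, 10)
    print(decode_width, waiting, decoded, waiting + decoded)
```

In this model a wider decoder only changes where the wrong-path instructions sit when the flush happens; the total that was fetched and must be discarded is the same in both cases.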

If you want to look at it as energy wasted instead of time it probably wastes less if the data never got to leave the fetch buffers.

You could say that wider decoders mean that fetch buffers don't need to be as large, but since there are already separate ones for each thread you'd lose in the single-threaded case by making them smaller, since the single-threaded decode bandwidth isn't increasing. You could say the same thing for the post-decode buffer, depending on how robust it is: if it were a real cache it'd be worth relying on, but a loop buffer either works or doesn't and pretty easily becomes completely useless if not executing a small enough loop.. so I don't know if AMD will want to rely on it to guarantee a performance baseline.
Old 03-Sep-2012, 06:34   #1268
liolio
Ohio frog
 
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172
Default

There are quite a few sharp minds here, so I will dare to ask a few questions.
AMD went for CMT on the premise that it would deliver 80% of the performance of 2 cores for 50% of the cost. I haven't made measurements (and comparing a Bulldozer module to previous AMD cores may not be an optimal comparison), but looking at a Trinity die and then at Llano seems to tell another story.

AMD might improve the performance of their modules with Steamroller and then Excavator, but I would not bet on a significant reduction in the size of the module (their high-density libraries may do that, but that also applies to their other processor lines).

So what do you think of CMT?
The premise was +80% performance for 50% more silicon, and it looks like what AMD will pull off is +100% performance for a 100% increase in silicon (which makes it a bit moot).
Do you think the approach is a failure? Would you expect them to abandon it for their next brand-new architecture?
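The trade-off in those two framings can be compared with a quick performance-per-area calculation. This Python sketch normalizes a single conventional core to 1.0 performance and 1.0 area; the scenario numbers simply restate the percentages quoted above and are not measurements.

```python
def perf_per_area(perf, area):
    """Throughput per unit of silicon, relative to one baseline core."""
    return perf / area


# (performance, area), both relative to a single conventional core
scenarios = {
    "single core (baseline)": (1.0, 1.0),
    "two full cores": (2.0, 2.0),
    "CMT premise (+80% perf, +50% area)": (1.8, 1.5),
    "feared outcome (+100% perf, +100% area)": (2.0, 2.0),
}
for name, (perf, area) in scenarios.items():
    print(f"{name}: {perf_per_area(perf, area):.2f}")
```

On these numbers the premise works out to about 1.2x the throughput per unit area of a plain core, while the feared outcome is indistinguishable from simply building two full cores.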

Do you think that if CMT were to be pushed further it could actually get closer to its premise? By pushed further I mean a module consisting of more than 2 cores.
Old 03-Sep-2012, 09:53   #1269
AlexV
Heteroscedasticitate
 
Join Date: Mar 2005
Posts: 2,354
Default

Quote:
Originally Posted by liolio View Post
AMD went for CMT on the premise that it would deliver 80% of the performance of 2 cores for 50% of the cost. I haven't made measurements (and comparing a Bulldozer module to previous AMD cores may not be an optimal comparison), but looking at a Trinity die and then at Llano seems to tell another story.
This comparison would make sense if there existed a full, old-school, BD based dual-core, with no shared elements. Versus K8L they added quite a few things, so it's not necessarily surprising that the module looks fat compared to that.
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do.
Old 04-Sep-2012, 01:21   #1270
liolio
Ohio frog
 
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172
Default

Quote:
Originally Posted by AlexV View Post
This comparison would make sense if there existed a full, old-school, BD based dual-core, with no shared elements. Versus K8L they added quite a few things, so it's not necessarily surprising that the module looks fat compared to that.
Agreed.
I made some quick (and rough) measurements of Star cores and Piledriver modules.
I found that a PD module is ~93% of the area two Star cores would cover. That is without L2; the L2 sizes are a wash. By eye, I would say that there is a bit less "glue" in a 2-module part than in a 4-core part, which may push the advantage further in favor of BD/PD.

It ain't that bad: in a power-constrained environment at least, Trinity offers anywhere between 110% and 120% of the performance of Llano. There are cases where Llano wins, but there are cases where Trinity wins by a greater margin.

So let's say CMT is a good idea. Do you think that after Excavator AMD could go further and increase the number of cores within a module (like having 4 cores in a module)?
Or are they constrained in how big they can scale the front end?
Old 04-Sep-2012, 02:34   #1271
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,816
Default

AMD already refers to the new Jaguar quad-core architecture as a "Compute Unit", together with a shared L2. If they push Jaguar a bit beyond the pure mobile concept, it would fit very well in a server/WS envelope, with some sort of scalable interconnect and cache/memory infrastructure.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
Old 07-Sep-2012, 15:07   #1272
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Default

An interesting hearsay update about AMD's CPU development:

VR-Zone Article

The rumored gains are nice, but AMD's main problem has always been execution. Keller being back in charge is a very hopeful change though.

Last edited by Raqia; 07-Sep-2012 at 15:20.