Old 02-Sep-2012, 04:00   #26
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,434
Default

Quote:
Originally Posted by itsmydamnation View Post
Bobcat does quite well in Integer performance despite both K8 and K10 having 3 Int ALU's that can do any instruction they like while bobcat has 2 INT ALU's and only one has Mul.
K8 and K10 couldn't quite do all instructions in all 3 pipes (but most of them - something bobcat mostly retained, the int pipes are still quite symmetric, just one less pipe). Most notably they certainly could only handle mul in one pipe too (in fact there's not a single x86 cpu out there which has more than one multiplier in the int domain).

As for simd I highly doubt Jaguar will achieve the performance of K10 clock per clock (even if the distribution of ops to pipes was very dumb). That is, if you run the same binary code at least - those new instructions it supports could definitely help quite a bit in some cases.
mczak is offline   Reply With Quote
Old 02-Sep-2012, 15:42   #27
AlexV
Heteroscedasticitate
 
Join Date: Mar 2005
Posts: 2,354
Default

Quote:
Originally Posted by 3dilettante View Post
I'm curious what those more familiar with what goes into making processors think about how web sites are latching onto AMD's use of automated layout tools for its CPU cores. What are the new features of this, other than the fact that AMD of all companies is using them so extensively?

That "amoeba-like" arrangement of logic is something some may still recall from Intel's Prescott.
The tools have doubtless advanced quite a bit since then.
This is anecdotal, but just the other day somebody was telling me about a competition between hand layout and automated tools run at a rather big shop in the business, with the automated tools showing favourable results by a significant margin. Now, these guys weren't Intel, albeit quite huge in their own right, and the measurements were only for some blocks; it's an open question how things end up when you have to do global as opposed to local optimization.

The Intel mention is relevant because in my opinion when people are whining about the use of auto tools they miss the fact that auto tools are likely to be worse than hand layout done by very capable, large, well funded teams...which is not necessarily the case for anybody but Intel these days. So yeah, Intel's teams will probably do better overall with hand-layout...but that does not automatically mean that handwork is good for everyone, IMHO. AMD using automated seems very reasonable given their context / state.
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do.
AlexV is offline   Reply With Quote
Old 02-Sep-2012, 18:22   #28
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Default

Quote:
Originally Posted by AlexV View Post
This is anecdotal, but just the other day somebody was telling me about a competition between hand layout and automated tools run at a rather big shop in the business, with the automated tools showing favourable results by a significant margin. Now, these guys weren't Intel, albeit quite huge in their own right, and the measurements were only for some blocks; it's an open question how things end up when you have to do global as opposed to local optimization.

The Intel mention is relevant because in my opinion when people are whining about the use of auto tools they miss the fact that auto tools are likely to be worse than hand layout done by very capable, large, well funded teams...which is not necessarily the case for anybody but Intel these days. So yeah, Intel's teams will probably do better overall with hand-layout...but that does not automatically mean that handwork is good for everyone, IMHO. AMD using automated seems very reasonable given their context / state.
Hand laid designs do incorporate a certain level of regularity that in some instances only facilitates human-level organization and understanding and isn't necessary for achieving the smallest size or best performance, and I could easily see it being detrimental in some cases. I'm sure that mindlessly following some rules and brute-force permuting within the space of all permissible layouts would net you something alien looking but better than what a human team could design. (Of course, that's never going to finish before the heat death of the Universe on any realistic transistor count...)

A much simplified but similar kind of problem is http://en.wikipedia.org/wiki/Circle_packing_in_a_square. The best of the known packings aren't necessarily what people would come up with in a reasonable amount of time, even after a lot of head scratching:

http://hydra.nat.uni-magdeburg.de/packing/csq/d5.html
http://hydra.nat.uni-magdeburg.de/packing/csq/d64.html
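
To put a number on the gap between a "tidy" layout and a better one, here is a small Python sketch for the five-circle case only. The helper name `circle_radius` and the two layouts are my own illustration, not something taken from the linked pages; it uses the standard conversion from points with minimum spacing d in a unit square to equal circles of radius d / (2 * (1 + d)) packed in that square.

```python
import itertools
import math

def circle_radius(points):
    """Radius of equal circles in a unit square whose centres follow `points`.

    `points` are layout positions in [0, 1]^2. The layout is shrunk so every
    circle stays inside the square, which gives r = d / (2 * (1 + d)) for a
    minimum pairwise distance d.
    """
    d = min(math.dist(p, q) for p, q in itertools.combinations(points, 2))
    return d / (2 * (1 + d))

# "Regular" layout: five positions taken from a 3 x 2 grid.
grid = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (0.0, 1.0), (0.5, 1.0)]
# Four corners plus the centre, the known best layout for five circles.
corners_centre = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]

print(f"grid layout:      r = {circle_radius(grid):.4f}")
print(f"corners + centre: r = {circle_radius(corners_centre):.4f}")
```

The grid gives r of about 0.167 and the corners-plus-centre layout about 0.207. Five circles is still easy to eyeball; the point is that the gap is already this large in a case a human can solve, and the larger packings are where intuition stops helping.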

Circuit designers also operate on the level of logical blocks rather than spatial packings, so you could imagine that there's a lot of efficiency to be gained from automation when a lot of transistors are involved. Automation has some drawbacks now, but it's entirely possible that one day, extra computing resources and refined heuristics will make automated layout better in every way than hand drawn designs.

Last edited by Raqia; 03-Sep-2012 at 00:13.
Raqia is offline   Reply With Quote
Old 03-Sep-2012, 11:08   #29
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938
Default

Quote:
Originally Posted by itsmydamnation View Post
Bobcat does quite well in Integer performance despite both K8 and K10 having 3 Int ALU's that can do any instruction they like while bobcat has 2 INT ALU's
In their Bulldozer slides AMD revealed that 3 integer ALUs in their previous architectures was not a good way to spend transistors. They couldn't extract enough ILP from the majority of code to keep all 3 integer pipelines filled. The performance gain of the third integer ALU was marginal, so it was removed from Bulldozer. Bobcat is just utilizing the same principles as Bulldozer here (remove underutilized hardware).

On the other hand, Intel has hyperthreading, so they can better fill the execution pipelines of their CPU even in code that doesn't have sufficient amount of ILP.
sebbbi is offline   Reply With Quote
Old 03-Sep-2012, 11:43   #30
itsmydamnation
Member
 
Join Date: Apr 2007
Location: Australia
Posts: 645
Default

Are you sure that was ALU? I'm pretty sure that was in relation to the 3rd AGU, but they kept it there just to keep everything symmetrical.
itsmydamnation is offline   Reply With Quote
Old 03-Sep-2012, 12:16   #31
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,544
Default

Quote:
Originally Posted by itsmydamnation View Post
Are you sure that was ALU? I'm pretty sure that was in relation to the 3rd AGU, but they kept it there just to keep everything symmetrical.
I think so too. The three symmetric units could execute a macro-op each, but if you actually threw three instructions at it with memory operands it would stall on D$ accesses.

Cheers
__________________
I'm pink, therefore I'm spam
Gubbi is offline   Reply With Quote
Old 03-Sep-2012, 17:48   #32
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,434
Default

I think there were 2 issues at hand:
a) in K8/K10 design ALU/AGUs are paired hence you need 3 AGUs if you have 3 ALUs even if you can only ever perform 2 loads per clock, so the third AGU is a bit pointless (not quite 100% though since it can perform a LEA which doesn't require a memory load).
b) it is quite difficult to actually find 3 independent instructions you could execute simultaneously. To increase probability of this you need larger ROBs etc., so the overall power efficiency will decrease. And for the cases where you actually could extract 3 independent instructions you'd need a fatter decoder for BD to be really useful I guess.
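
A toy way to see point b): the Python sketch below schedules a list of single-cycle instructions onto `width` ALUs with an idealised, unlimited scheduling window. The function `cycles_needed` and both instruction streams are made up for illustration; it is not a model of any real core (no decode limits, no load latency, no finite ROB).

```python
def cycles_needed(deps, width):
    """Cycles to run 1-cycle-latency instructions on `width` ALUs.

    deps[i] lists the earlier instructions whose results instruction i needs.
    Each cycle, up to `width` instructions whose inputs were produced in an
    earlier cycle are issued, oldest first.
    """
    done_at = {}  # instruction index -> cycle in which it executed
    cycle = 0
    while len(done_at) < len(deps):
        cycle += 1
        issued = 0
        for i, needs in enumerate(deps):
            if i in done_at or issued == width:
                continue
            # A missing producer defaults to the current cycle, i.e. "not ready".
            if all(done_at.get(j, cycle) < cycle for j in needs):
                done_at[i] = cycle
                issued += 1
    return cycle

# Two interleaved dependency chains: only two instructions are ever ready.
two_chains = [[i - 2] if i >= 2 else [] for i in range(12)]
# Twelve fully independent instructions.
independent = [[] for _ in range(12)]

for w in (1, 2, 3):
    print(w, "ALUs:", cycles_needed(two_chains, w), "cycles (two chains),",
          cycles_needed(independent, w), "cycles (independent)")
```

With two chains the third ALU buys nothing (12, 6 and 6 cycles for 1, 2 and 3 ALUs), while the fully independent stream keeps scaling (12, 6, 4). Real code sits somewhere in between, and as noted above, pulling the independent work into view costs ROB entries and decode width.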

Last edited by mczak; 03-Sep-2012 at 17:53.
mczak is offline   Reply With Quote
Old 05-Sep-2012, 21:51   #33
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
Default

The third AGU was present because it spent a tiny amount of area and transistors to simplify the job of the scheduler. While only a few scenarios could make use of a third AGU, keeping the pipelines mostly symmetrical meant picking the right lane for a macro-op was simpler.

An ALU or AGU in isolation would not be a significant bloat to Bulldozer in terms of area or transistor count.
The high-clock philosophy and the need to get all the register accesses and forwarding for an additional ALU and AGU sound like a big motivator for lopping off the extra pair.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 06-Sep-2012, 15:52   #34
iwod
Member
 
Join Date: Jun 2004
Posts: 168
Default

1C of Jaguar is only 3.1 square mm? A quad core with SRAM and a Very Good Radeon Graphics would be what? Less than 40 square mm?

To me the only good thing about Atom was that it sped up the NAS market transition. It performs poorly in netbooks and light-usage desktops (although it sold quite well).

Bobcat was great, but it was late. And if Jaguar allows that kind of improvement while bringing in quad core and better graphics, why would one spend money on the desktop APU Trinity? To me both are aimed at markets suited to lightweight workloads, video viewing and Internet browsing.

And which one will be more profitable?
iwod is offline   Reply With Quote
Old 06-Sep-2012, 20:32   #35
hkultala
Member
 
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
Default

Quote:
Originally Posted by iwod View Post
And if Jaguar allows that kind of improvement while bringing in quad core and better graphics, why would one spend money on the desktop APU Trinity? To me both are aimed at markets suited to lightweight workloads, video viewing and Internet browsing.

And which one will be more profitable?
Trinity has MUCH better single-thread performance than Jaguar. Everything that is not heavily threaded will work much faster on Trinity.
hkultala is offline   Reply With Quote
Old 06-Sep-2012, 22:47   #36
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,434
Default

Quote:
Originally Posted by iwod View Post
1C of Jaguar is only 3.1 square mm? A quad core with SRAM and a Very Good Radeon Graphics would be what? Less than 40 square mm?
Not sure what you mean with "Very Good Radeon Graphics" but your number is way too small.
4 cores may be only 12 mm^2, double it to include L2. You could then probably fit that into 40mm² with the required i/o (64bit ddr3 and some more) but then you'd have no graphics at all.
A scaled-down Cape Verde (let's say 4 CUs) most likely adds another 50mm² on its own.
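
The back-of-the-envelope sum, spelled out as a few lines of Python. All inputs are the rough guesses from this post and the 3.1 mm² figure quoted above; the variable names are just labels for this sketch, not measured die data.

```python
CORE_MM2 = 3.1            # Jaguar core area quoted in the thread
CORES = 4
CPU_PLUS_IO_MM2 = 40.0    # guess: cores + L2 + I/O, no graphics at all
GPU_MM2 = 50.0            # guess: scaled-down Cape Verde with 4 CUs

cores_mm2 = CORE_MM2 * CORES               # about 12 mm^2
cores_l2_mm2 = 2 * cores_mm2               # "double it to include L2"
io_mm2 = CPU_PLUS_IO_MM2 - cores_l2_mm2    # left over for DDR3 and other I/O
total_mm2 = CPU_PLUS_IO_MM2 + GPU_MM2

print(f"cores: {cores_mm2:.1f} mm^2, cores + L2: {cores_l2_mm2:.1f} mm^2")
print(f"I/O budget: {io_mm2:.1f} mm^2, total with GPU: {total_mm2:.1f} mm^2")
```

That lands at roughly 90 mm² for the whole chip, so more than twice the 40 mm² asked about.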
mczak is offline   Reply With Quote
Old 07-Sep-2012, 00:06   #37
Blazkowicz
Senior Member
 
Join Date: Dec 2004
Location: Toulouse
Posts: 4,136
Default

Quote:
Originally Posted by hkultala View Post
Trinity has MUCH better single-thread performance than Jaguar. Everything that is not heavily threaded will work much faster on Trinity.
yes, if you want a cheap desktop there's the single module trinity (or the celeron, which is quite a problem for AMD given how actually fast it is, it also has the industry's only credible open source linux drivers)

the A6 5400K variant is even unlocked so you can flip a BIOS option and clock it to 4.5GHz or something.
you may sadly benefit from this for your "Internet Browsing", because web pages are pigs (maybe you have to clock to 5GHz for a "turning pages" html5 reader to be smooth)

don't forget the 4GB memory if you do the folly of running firefox and chrome at the same time (or even chrome alone). with 2GB I got so much swap that the USB mouse cursor would freeze for five seconds
Blazkowicz is offline   Reply With Quote
Old 08-Sep-2012, 22:21   #38
swaaye
Entirely Suboptimal
 
Join Date: Mar 2003
Location: WI, USA
Posts: 6,845
Default

Really, I miss ugly '90s web sites that ran fast on a Pentium 90. It has been nice having the modern browser war though, with all of the clamoring for improved performance being beneficial (and free!) for everyone.
swaaye is offline   Reply With Quote
Old 09-Sep-2012, 15:58   #39
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Default

Quote:
Originally Posted by swaaye View Post
Really, I miss ugly '90s web sites that ran fast on a Pentium 90. It has been nice having the modern browser war though, with all of the clamoring for improved performance being beneficial (and free!) for everyone.
Ah, the days when most sites had "Under-construction" and some gif of a guy working w/ a jack hammer. Plus a "you are visitor #: XXX" counter.
Raqia is offline   Reply With Quote
Old 10-Sep-2012, 04:57   #40
bearmoo
Junior Member
 
Join Date: Sep 2007
Posts: 55
Default

Quote:
Originally Posted by mczak View Post
A scaled-down Cape Verde (let's say 4 CUs) most likely adds another 50mm² on its own.
4 CUs would be really nice. I remember single channel memory configuration like Brazos being mentioned for the Jaguar derived APUs. I wonder if it's enough to feed all the cores. Also, don't forget to add the die size for the integrated south bridge.
bearmoo is offline   Reply With Quote
Old 10-Sep-2012, 12:42   #41
fehu
Member
 
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 633
Default

jaguar has on chip south bridge?

edit
ehm... ok, I googled...
fehu is offline   Reply With Quote
Old 11-Sep-2012, 04:39   #42
hkultala
Member
 
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
Default

Quote:
Originally Posted by fehu View Post
jaguar has on chip south bridge?

edit
ehm... ok, I googled...
Jaguar is a core, not a chip.

But AFAIK those jaguar-based chips will have it.
hkultala is offline   Reply With Quote
Old 19-Sep-2012, 14:43   #43
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938
Default

Quote:
Originally Posted by Rangers View Post
I actually looked into Jaguar a little more for the first time. It's really puny, based on it being perhaps a 15% faster Bobcat.
I would say it's somewhere in the same ballpark as Intel's Atom.
Jaguar is 15% faster in integer code compared to Bobcat, but it has also doubled the peak flops compared to Bobcat (double wide SIMD units and doubled SIMD bandwidth per core). Bobcat was already beating Atom in benchmarks, so Jaguar should be over 2x faster than Atom in flops-heavy workloads. And on top of that, it also doubles the core count from Bobcat (there are no 4 core Atoms currently on the market to challenge Jaguar either).
Quote:
Originally Posted by hkultala View Post
Trinity has MUCH better single-thread performance than Jaguar. Everything that is not heavily threaded will work much faster on Trinity.
If you compare a Jaguar core to a Piledriver core (Trinity), Jaguar doesn't look that bad really.

Based on information from these sources:
- http://semiaccurate.com/2012/08/28/a...e-jaguar-core/
- http://www.anandtech.com/show/6201/a...architecture/2
- Agner Fog's microarchitecture.pdf

We can gather following comparison results:
- Both are modern x86/x64 out of order cores (with register renaming, efficient store forwarding, etc goodies)
- Both support the newest instruction sets (BMI, AVX, F16C, etc).
- Both can execute 2 integer (ALU) operations per cycle.
- Both have throughput of 8 (vector) flops per cycle per core (Jaguar = 128b add + 128b mul, Piledriver = 128b mad, assuming of course that the other core uses half of the shared FPU resources).
- Both split 256 bit AVX instructions to two 128 bit operations.
- Jaguar cores have their own 2-way 32 KB L1i caches. Two Piledriver cores share a 2-way 64 KB L1i cache. Sharing a 2-way cache between 2 cores is bad for performance, so Jaguar seems to win this one.
- Piledriver core has tiny 16 KB L1 data cache, while Jaguar core has larger 32 KB L1 data cache. Jaguar wins again.
- Piledriver has a shared 16 way 2 MB of L2 cache for a pair of cores (4 MB for four cores in total). Jaguar has a shared 16 way 2 MB of L2 cache for four cores. Piledriver is better here.
- Jaguar has shorter pipelines than Piledriver. This should improve branching performance and help Jaguar to keep its pipelines filled. But Piledriver has more complex branch prediction and lots of other IPC improving features.
- According to Agner Fog's analysis Bulldozer has significantly more bottleneck cases than Bobcat. Both Piledriver and Jaguar improved IPC of their predecessors, and likely shifted the bottlenecks a bit. But it's still unlikely that Jaguar has significantly more bottlenecks than Piledriver.

--> The IPC of Piledriver and Jaguar CPU cores should be pretty close.
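
For the flops line in the list above, the counting can be written out as a short Python sketch. `flops_per_cycle` is a made-up helper, the numbers are single precision, and the Piledriver figure assumes the shared FPU is split evenly so each core effectively gets one 128-bit pipe.

```python
def flops_per_cycle(pipes):
    """Peak single-precision flops per cycle for a list of SIMD pipes.

    Each pipe is (width_bits, flops_per_lane): 1 for a plain add or mul,
    2 for a fused multiply-add.
    """
    return sum((width // 32) * per_lane for width, per_lane in pipes)

jaguar = flops_per_cycle([(128, 1), (128, 1)])    # 128b add + 128b mul
piledriver_fma = flops_per_cycle([(128, 2)])      # one 128b FMA pipe per core
piledriver_plain = flops_per_cycle([(128, 1)])    # same pipe, mul or add only

print("Jaguar:", jaguar)
print("Piledriver with FMA:", piledriver_fma)
print("Piledriver without FMA:", piledriver_plain)
```

Under these assumptions both cores peak at 8 flops per cycle, but the Piledriver number depends on the code actually using FMA.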

Of course Piledriver has much much higher clock ceiling (for desktop use). However 17W Piledriver ULV clocks shouldn't be that much higher than comparable Jaguar clocks. 17W ULV Sandy Bridges were clocked at 1.6-1.8 GHz, and Jaguar cores should be slightly above that (1.1 * 1.65 GHz = 1.815 GHz). I don't personally expect 17W ULV Trinity (two module, four cores) to hit much higher clocks than that (turbo might of course reach 2 GHz+ just like it does on Sandy/Ivy Bridge).

Could you explain the reasons why Trinity/Piledriver has "MUCH better" single-threaded performance than Jaguar? I am not a hardware engineer, so I have likely missed some fine details.

Last edited by sebbbi; 19-Sep-2012 at 15:02.
sebbbi is offline   Reply With Quote
Old 19-Sep-2012, 16:29   #44
Lightman
Member
 
Join Date: Jun 2008
Location: Torquay, UK
Posts: 910
Default

I wonder if AMD can customize L2 cache clock depending on target power/performance. Bobcat had 1/2 speed L2 and I'm under the impression Jaguar for tablets and nettops will continue that trend.
At least some reports are suggesting AMD can run L2 at full clock which for console design is very desired.
Lightman is offline   Reply With Quote
Old 19-Sep-2012, 18:46   #45
liolio
Ohio frog
 
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172
Default

Quote:
Originally Posted by Lightman View Post
I wonder if AMD can customize L2 cache clock depending on target power/performance. Bobcat had 1/2 speed L2 and I'm under the impression Jaguar for tablets and nettops will continue that trend.
At least some reports are suggesting AMD can run L2 at full clock which for console design is very desired.
Well AMD has stated that the L2 works at half speed (not the bus interface though).
They didn't talk about it, but I guess they could implement something akin to the feature that is supposed to make it into Steamroller: powering down the unused part of the L2.
liolio is offline   Reply With Quote
Old 19-Sep-2012, 19:31   #46
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
Default

Quote:
Originally Posted by sebbbi View Post
--> The IPC of Piledriver and Jaguar CPU cores should be pretty close.
You left out that Bobcat can only decode/issue 2 instructions per cycle, while PD can decode 4 and issue 4 to an integer core and/or 4 to the FPU. And Bobcat can only do 1 load + 1 store per cycle, while PD can do 2 loads (I don't think it can actually do 2 stores though). These are not small differences - mainly, being able to support a load/store or two in conjunction with two ALU/branch/multiply/etc is a big deal, especially for x86. Even in FPU heavy code it's nice to be able to issue at least one integer instruction in addition to two FP ops for flow control/pointer arithmetic/etc. And PD's FPU is more flexible even w/o FMA code because it can do either 2 FMULs or 2 FADDs per cycle instead of just one of each.

PD also undoubtedly has much bigger OoO resources, and probably better load/store disambiguation.

As far as L1D is concerned, Jaguar does have the bigger cache but loses in associativity (2-way vs 4-way) which is a liability on some workloads. And from test numbers I've seen its L2 is not just lower bandwidth but at least as high latency.
Exophase is offline   Reply With Quote
Old 20-Sep-2012, 01:01   #47
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,434
Default

Quote:
Originally Posted by Exophase View Post
You left out that Bobcat can only decode/issue 2 instructions per cycle, while PD can decode 4 and issue 4 to an integer core and/or 4 to the FPU.
That is true but comparing 4 Jaguar cores vs. 2 Piledriver modules the overall decode throughput is indeed the same. Of course if it is running only one thread (per module) then it should be better on Piledriver, OTOH if you run into instructions (when there are two threads per module) which need the microcode decoder Jaguar might be better (as it won't block the other thread).

Quote:
And Bobcat can only do 1 load + 1 store per cycle, while PD can do 2 loads (I don't think it can actually do 2 stores though). These are not small differences - mainly, being able to support a load/store or two in conjunction with two ALU/branch/multiply/etc is a big deal, especially for x86.
PD can do either two loads or one load + one store per cycle (the cache has only two ports), whereas Jaguar is limited to one load + one store. This is not really that much of a difference but yes PD is better. I don't think though it's that much of an issue, essentially that's the same capability for Jaguar as intel had up to Nehalem (whereas SNB/IVB now are more like PD in that regard, either two loads or one load + one store).
Quote:
Even in FPU heavy code it's nice to be able to issue at least one integer instruction in addition to two FP ops for flow control/pointer arithmetic/etc. And PD's FPU is more flexible even w/o FMA code because it can do either 2 FMULs or 2 FADDs per cycle instead of just one of each.
I thought Bobcat (and Jaguar) could issue more than 2 uops per clock as well (up to 6?, each 2 of integer, load/store, simd), as long as there are enough ops in the queues (obviously the decoder couldn't feed that). Maybe it can only retire 2 per clock though, K8 had serious restrictions there as well. I might be totally wrong here.
In any case if you argue with PD's FPU then again it is 2 FPU for BD vs. 4 for Jaguar so it's unclear if that's always a win for PD (interestingly, the benchmarks suggest that having only one FPU does not limit performance that much for multithreaded code even in fp-heavy code - maybe due to the typical quite long latencies of these instructions utilization might not be very high typically for the single-threaded case).
Quote:
PD also undoubtedly has much bigger OoO resources, and probably better load/store disambiguation.
No doubt it has more OoO resources, but even Bobcat is quite respectable (e.g. int PRF size is 64 for Bobcat and 96 for BD, not sure if it's the same for PD), though yeah, Bobcat has very few entries in the int/address/simd schedulers (but Jaguar should increase that). Bobcat's Load/Store unit is also quite robust (some amd paper stated it's more advanced than what any other amd cpu had at that time, so probably better than what K10 had).
Quote:
As far as L1D is concerned, Jaguar does have the bigger cache but loses in associativity (2-way vs 4-way) which is a liability on some workloads.
In contrast though, Bobcat doesn't suffer in some workloads due to L1 write-through cache like BD does.
I'm not sure what's better on average, 16KB/4-way vs. 32KB/2-way. I'd call it a draw.
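
To make the 16KB/4-way vs. 32KB/2-way trade-off concrete, here is a toy LRU cache model in Python. It assumes 64-byte lines and true LRU replacement, and the two address traces are contrived to favour one side each; `misses` is a made-up helper, and none of this models the write-through behaviour or anything else specific to BD or Bobcat.

```python
from collections import OrderedDict

LINE = 64  # bytes per cache line (assumed)

def misses(size_kb, ways, addresses):
    """Count misses for an LRU set-associative cache on an address trace."""
    sets = size_kb * 1024 // (LINE * ways)
    cache = [OrderedDict() for _ in range(sets)]
    count = 0
    for addr in addresses:
        line = addr // LINE
        entries = cache[line % sets]
        if line in entries:
            entries.move_to_end(line)        # refresh LRU position on a hit
        else:
            count += 1
            if len(entries) == ways:
                entries.popitem(last=False)  # evict the least recently used line
            entries[line] = True
    return count

# Three buffers 16 KB apart, touched round-robin: they collide in one set.
conflict = [base * 16 * 1024 for _ in range(100) for base in range(3)]
# Ten sequential sweeps over a 24 KB working set: a pure capacity test.
sweep = [a for _ in range(10) for a in range(0, 24 * 1024, LINE)]

for name, trace in (("conflict", conflict), ("sweep", sweep)):
    print(name, "16KB/4-way:", misses(16, 4, trace),
          "32KB/2-way:", misses(32, 2, trace))
```

On the conflict trace the 4-way cache misses 3 times and the 2-way cache misses on all 300 accesses; on the sweep it flips, with 3840 misses for the 16 KB cache against 384 for the 32 KB one. Which pattern dominates depends entirely on the workload, so a draw seems like a fair call.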
Quote:
And from test numbers I've seen its L2 is not just lower bandwidth but at least as high latency.
AMD stated 17 cycles for L2 for Bobcat (but Jaguar could be different) and 18-20 for BD (though the latter may include the L1 latency, not sure about the former - in any case the numbers don't look too different).

In short imho Jaguar doesn't look that much worse than Piledriver as far as IPC is concerned.

Last edited by mczak; 20-Sep-2012 at 01:08.
mczak is offline   Reply With Quote
Old 20-Sep-2012, 03:47   #48
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938
Default

Already discussed on the last page, but here's a short recap.

Bobcat vs K10 (Athlon II X4 630 downclocked to 1.6 GHz):
http://www.xtremesystems.org/forums/...t-vs-K10-vs-K8

In generic integer calculations Bobcat has 5% slower IPC than similarly clocked K10. The slightly improved Stars core (Llano) has a few percents higher IPC than K10. Bulldozer has slightly worse IPC than Stars and Piledriver has pretty much equal IPC to Stars.

AMD promised a +15% IPC increase from Bobcat->Jaguar. If these predictions hold any water, the IPC of Jaguar might even be higher than Bulldozer's, and be very comparable to Piledriver's.
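
The arithmetic behind that, as a tiny Python sketch. It simply chains the two percentages quoted above (Bobcat 5% behind K10 in integer code, Jaguar a promised 15% ahead of Bobcat) and assumes they compose multiplicatively, which is a simplification.

```python
k10 = 1.00
bobcat = k10 * (1 - 0.05)      # "5% slower IPC than similarly clocked K10"
jaguar = bobcat * (1 + 0.15)   # AMD's promised +15% over Bobcat

print(f"Bobcat vs K10: {bobcat:.4f}")
print(f"Jaguar vs K10: {jaguar:.4f}")
```

That puts Jaguar at about 1.09x K10 in integer IPC, if the 15% claim holds and applies to the same kind of code.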

The worst case of K10 against Bobcat was SIMD floating point performance (K10 has on average 50.19% better performance per clock). Jaguar doubles the SIMD float (and integer) performance (64 bit -> 128 bit SIMD) and introduces all the new AVX instructions. I don't think we have to worry about SIMD performance anymore. It should now be comparable to BD/PD.
Quote:
Originally Posted by Exophase View Post
You left out that Bobcat can only decode/issue 2 instructions per cycle, while PD can decode 4 and issue 4 to an integer core and/or 4 to the FPU.
BD/PD decoder is shared between two cores. A module can decode 4 uops per cycle, but the decoder is time sliced (every other cycle) between two cores.

Quote from Agner Fog's analysis:
"The decoders can handle four instructions per clock cycle. Instructions that belong to different cores cannot be decoded in the same clock cycle. When both cores are active, the decoders serve each core every second clock cycle, so that the maximum decode rate is two instructions per clock cycle per core."

So Jaguar and BD/PD cores can decode an equal amount of instructions per cycle.
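
A minimal Python sketch of the time slicing described in that quote. `decoded_per_core` is a made-up helper that assumes the decoder always delivers its full width to whichever core it is serving, so it gives an upper bound, not a measurement.

```python
def decoded_per_core(active_cores, cycles, width=4):
    """Instructions decoded per core by one shared, time-sliced decoder.

    The decoder serves a single core per cycle and rotates among the
    active cores.
    """
    counts = {core: 0 for core in active_cores}
    for cycle in range(cycles):
        core = active_cores[cycle % len(active_cores)]
        counts[core] += width
    return counts

both = decoded_per_core(["core0", "core1"], 1000)
alone = decoded_per_core(["core0"], 1000)

print("both cores active:", {c: n / 1000 for c, n in both.items()})
print("one core active:  ", {c: n / 1000 for c, n in alone.items()})
```

With both cores busy each gets 2 instructions per cycle, the same as a Jaguar core; with only one core active the full 4 per cycle is available to it.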

Also BD/PD have an additional stall case, because the decoder is shared between two cores (Agner Fog):
"Instructions that generate more than two macro-ops are using microcode. The decoders cannot do anything else while microcode is generated. This means that both cores in a compute unit can stop decoding for several clock cycles after meeting an instruction that generates more than two macro-ops"
Quote:
Originally Posted by Exophase View Post
And PD's FPU is more flexible even w/o FMA code because it can do either 2 FMULs or 2 FADDs per cycle instead of just one of each.
Again, PD's FPU is shared between two cores. If one core does 2 FMUL/FADD/FMA per cycle, the other core can do nothing. If resources are evenly split, both Jaguar and PD have equal flops throughput. However, PD needs FMA code to reach its peak, while Jaguar can do that on old code (with separate muls and adds). So Jaguar should have better peak performance in current/legacy code (FMA3/FMA4 usage in applications/games is still very low, partly because there are two implementations that are not compatible).
Quote:
Originally Posted by Exophase View Post
PD also undoubtedly has much bigger OoO resources, and probably better load/store disambiguation.
Yes, that's likely true; however, AMD improved OoO execution on Jaguar as well. According to (http://semiaccurate.com/2012/08/29/a...n-amds-jaguar/) the scheduler can handle more entries and the core has larger reorder buffers.
Quote:
Originally Posted by liolio View Post
Well AMD has stated that the L2 works at half speed (not the bus interface though).
Can you quote where you got this information? According to semiaccurate.com "the caches can run at half clock to save power when needed". If I understood that correctly, the caches can be dynamically down clocked to save power (when CPU load is light).
sebbbi is offline   Reply With Quote
Old 20-Sep-2012, 05:01   #49
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
Default

Quote:
Originally Posted by mczak
That is true but comparing 4 Jaguar cores vs. 2 Piledriver modules the overall decode throughput is indeed the same. Of course if it is running only one thread (per module) then it should be better on Piledriver, OTOH if you run into instructions (when there are two threads per module) which need the microcode decoder Jaguar might be better (as it won't block the other thread).
Even if you really are running four fully loaded threads - which you quite often are not, but still want strong performance on one or two of them, especially if your clocks are low - that's not a fair comparison because PD will be able to achieve better flexibility by alternating between threads as it decodes. This is relevant because it means that when one core can't utilize the extra decode bandwidth (due to the OoO window being backed up) the other core gets better decode bandwidth. Assuming that that's how PD works, anyway.

Quote:
Originally Posted by mczak View Post
I thought Bobcat (and Jaguar) could issue more than 2 uops per clock as well (up to 6?, each 2 of integer, load/store, simd), as long as there are enough ops in the queues (obviously the decoder couldn't feed that). Maybe it can only retire 2 per clock though, K8 had serious restrictions there as well. I might be totally wrong here.
I'm referring to issue the way Intel, AMD, ARM, and as far as I know most of the industry does: how many instructions can enter the reorder buffers per cycle, not how many instructions can move from the reorder buffers to the execution units. The latter is instead called "dispatch" by those parties, and it's a very important distinction since x86 uarchs often issue macro-ops/fused uops but dispatch uops. So the issue rate more closely matches real instructions.

Quote:
Originally Posted by sebbbi
Already discussed on the last page, but here's a short recap.

Bobcat vs K10 (Athlon II X4 630 downclocked to 1.6 GHz):
http://www.xtremesystems.org/forums/...t-vs-K10-vs-K8

In generic integer calculations Bobcat has 5% slower IPC than similarly clocked K10. The slightly improved Stars core (Llano) has a few percents higher IPC than K10. Bulldozer has slightly worse IPC than Stars and Piledriver has pretty much equal IPC to Stars.
The comparison that forum poster makes is poor. The benchmarks are not well chosen and his classifications aren't even correct. A few synthetics like Sandra, 3DMark, and "speedtraq" don't make for a comprehensive comparison in CPU performance.

This belief that Jaguar can surpass Llano's IPC is sorely lacking in any kind of architectural justification.
Exophase is offline   Reply With Quote
Old 20-Sep-2012, 06:24   #50
liolio
Ohio frog
 
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172
Default

Quote:
Originally Posted by sebbbi View Post
Can you quote where you got this information? According to semiaccurate.com "the caches can run at half clock to save power when needed". If I understood that correctly, the caches can be dynamically down clocked to save power (when CPU load is light).
I read it on Hardware.fr
I guess they back their claim on that slide:


I'm not sure myself about what they mean by L2D.
Could it be that the data in the L2 data bank are only powered when there is a cache hit?
I don't understand, to tell the truth.
liolio is offline   Reply With Quote

Tags
amd, bobcat, jaguar, x86, zacate, zatoichi
