#26 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,434
|
Quote:
As for simd I highly doubt Jaguar will achieve the performance of K10 clock per clock (even if the distribution of ops to pipes was very dumb). That is, if you run the same binary code at least - those new instructions it supports could definitely help quite a bit in some cases. |
|
|
|
|
|
|
#27 | |
|
Heteroscedasticitate
Join Date: Mar 2005
Posts: 2,354
|
Quote:
The Intel mention is relevant because, in my opinion, when people whine about the use of auto tools they miss the fact that auto tools are likely to be worse than hand layout done by very capable, large, well-funded teams...which is not necessarily the case for anybody but Intel these days. So yeah, Intel's teams will probably do better overall with hand layout...but that does not automatically mean that handwork is good for everyone, IMHO. AMD using automated layout seems very reasonable given their context / state.
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do. |
|
|
|
|
|
|
#28 | |
|
Member
Join Date: Oct 2003
Posts: 320
|
Quote:
A much simplified but similar kind of problem is http://en.wikipedia.org/wiki/Circle_packing_in_a_square. The best of the known packings aren't necessarily what people would come up with in a reasonable amount of time, even after a lot of head scratching:
http://hydra.nat.uni-magdeburg.de/packing/csq/d5.html
http://hydra.nat.uni-magdeburg.de/packing/csq/d64.html
Circuit designers also operate on the level of logical blocks rather than spatial packings, so you could imagine that there's a lot of efficiency to be gained from automation when a lot of transistors are involved. Automation has some drawbacks now, but it's entirely possible that one day extra computing resources and refined heuristics will make automated layout better in every way than hand-drawn designs.
Last edited by Raqia; 03-Sep-2012 at 00:13. |
|
|
|
|
|
|
#29 | |
|
Member
Join Date: Nov 2007
Posts: 938
|
Quote:
On the other hand, Intel has hyperthreading, so they can better fill the execution pipelines of their CPU even in code that doesn't have sufficient amount of ILP. |
|
|
|
|
|
|
#30 |
|
Member
Join Date: Apr 2007
Location: Australia
Posts: 645
|
Are you sure that was the ALU? I'm pretty sure that was in relation to the 3rd AGU, but they kept it there just to keep everything symmetrical.
|
|
|
|
|
|
#31 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,544
|
Quote:
Cheers
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
|
#32 |
|
Senior Member
Join Date: Oct 2002
Posts: 2,434
|
I think there were 2 issues at hand:
a) In the K8/K10 design ALUs and AGUs are paired, hence you need 3 AGUs if you have 3 ALUs even if you can only ever perform 2 loads per clock, so the third AGU is a bit pointless (not quite 100% though, since it can perform a LEA, which doesn't require a memory load).
b) It is quite difficult to actually find 3 independent instructions you could execute simultaneously. To increase the probability of this you need larger ROBs etc., so the overall power efficiency will decrease. And for the cases where you actually could extract 3 independent instructions you'd need a fatter decoder for BD to be really useful, I guess.
Last edited by mczak; 03-Sep-2012 at 17:53. |
|
|
|
|
|
#33 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,071
|
The third AGU was present because it spent a tiny amount of area and transistors to simplify the job of the scheduler. While only a few scenarios could make use of a third AGU, keeping the pipelines mostly symmetrical meant picking the right lane for a macro-op was simpler.
An ALU or AGU in isolation would not be a significant bloat to Bulldozer in terms of area or transistor count. The high-clock philosophy and the need to get all the register accesses and forwarding for an additional ALU and AGU sound like a big motivator for lopping off the extra pair.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#34 |
|
Member
Join Date: Jun 2004
Posts: 168
|
1C of Jaguar is only 3.1 square mm? A quad core with SRAM and a Very Good Radeon Graphics would be what? Less than 40 square mm?
To me the only good thing about Atom was that it sped up the NAS market transition. It performed poorly in netbooks and light-usage desktops (although it sold quite well). Bobcat was great, but it was late. And if Jaguar allows that kind of improvement, while bringing in quad core and better graphics, why would one spend money on the desktop APU Trinity? To me both are aimed at markets suited to lightweight workloads, video viewing and Internet browsing. And which one will be more profitable? |
|
|
|
|
|
#35 | |
|
Member
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
|
Quote:
|
|
|
|
|
|
|
#36 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,434
|
Quote:
4 cores may be only 12 mm^2, double it to include L2. You could then probably fit that into 40mm² with the required i/o (64bit ddr3 and some more) but then you'd have no graphics at all. A scaled-down Cape Verde (let's say 4 CUs) most likely adds another 50mm² on its own. |
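The back-of-the-envelope area estimate in this post can be written out explicitly. This is only a minimal sketch: it uses the 3.1 mm² per-core figure quoted earlier in the thread and the poster's rough guesses for L2, I/O and the GPU. The function name and all parameter defaults are illustrative assumptions, not AMD figures.

```python
def estimate_die_area_mm2(cores=4, core_mm2=3.1, l2_factor=2.0,
                          cpu_with_io_mm2=40.0, gpu_mm2=50.0):
    """Rough APU die-area estimate following the reasoning in the post.

    All defaults are guesses taken from the thread, not measured values.
    """
    cores_only = cores * core_mm2              # about 12 mm^2 for four cores
    cores_plus_l2 = cores_only * l2_factor     # "double it to include L2"
    # The post guesses CPU + L2 + I/O fits in ~40 mm^2, so whatever is
    # left over after cores and L2 is the implied I/O budget.
    io_budget = cpu_with_io_mm2 - cores_plus_l2
    total = cpu_with_io_mm2 + gpu_mm2          # add a scaled-down GPU on top
    return {
        "cores_only": cores_only,
        "cores_plus_l2": cores_plus_l2,
        "io_budget": io_budget,
        "total": total,
    }


estimate = estimate_die_area_mm2()
for name, area in estimate.items():
    print(f"{name}: {area:.1f} mm^2")
```

With these guesses the total lands well above the 40 mm² hoped for earlier in the thread, which is the point the post is making: the GPU, not the CPU cores, dominates the budget.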
|
|
|
|
|
|
#37 | |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,136
|
Quote:
The A6 5400K variant is even unlocked, so you can flip a BIOS option and clock it to 4.5GHz or something. You may sadly benefit from this for your "Internet Browsing", because web pages are pigs (maybe you have to clock to 5GHz for a "turning pages" HTML5 reader to be smooth). Don't forget the 4GB of memory if you commit the folly of running Firefox and Chrome at the same time (or even Chrome alone). With 2GB I got so much swapping that the USB mouse cursor would freeze for five seconds. |
|
|
|
|
|
|
#38 |
|
Entirely Suboptimal
Join Date: Mar 2003
Location: WI, USA
Posts: 6,845
|
Really, I miss ugly '90s web sites that ran fast on a Pentium 90. It has been nice having the modern browser war though, with all of the clamoring for improved performance being beneficial (and free!) for everyone.
|
|
|
|
|
|
#39 |
|
Member
Join Date: Oct 2003
Posts: 320
|
Ah, the days when most sites had "Under-construction" and some gif of a guy working w/ a jack hammer. Plus a "you are visitor #: XXX" counter.
|
|
|
|
|
|
#40 |
|
Junior Member
Join Date: Sep 2007
Posts: 55
|
4 CUs would be really nice. I remember single channel memory configuration like Brazos being mentioned for the Jaguar derived APUs. I wonder if it's enough to feed all the cores. Also, don't forget to add the die size for the integrated south bridge.
|
|
|
|
|
|
#41 |
|
Member
Join Date: Nov 2006
Location: Somewhere over the ocean
Posts: 633
|
jaguar has on chip south bridge?
edit ehm... ok, I googled... |
|
|
|
|
|
#42 |
|
Member
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
|
|
|
|
|
|
|
#43 | ||
|
Member
Join Date: Nov 2007
Posts: 938
|
Quote:
Quote:
Based on information from these sources:
- http://semiaccurate.com/2012/08/28/a...e-jaguar-core/
- http://www.anandtech.com/show/6201/a...architecture/2
- Agner Fog's microarchitecture.pdf
We can gather the following comparison results:
- Both are modern x86/x64 out-of-order cores (with register renaming, efficient store forwarding, and similar goodies).
- Both support the newest instruction sets (BMI, AVX, F16C, etc).
- Both can execute 2 integer (ALU) operations per cycle.
- Both have a throughput of 8 (vector) flops per cycle per core (Jaguar = 128b add + 128b mul, Piledriver = 128b mad, assuming of course that the other core uses half of the shared FPU resources).
- Both split 256 bit AVX instructions into two 128 bit operations.
- Jaguar cores have their own 2-way 32 KB L1i caches. Two Piledriver cores share a 2-way 64 KB L1i cache. Sharing a 2-way cache between 2 cores is bad for performance, so Jaguar seems to win this one.
- The Piledriver core has a tiny 16 KB L1 data cache, while the Jaguar core has a larger 32 KB L1 data cache. Jaguar wins again.
- Piledriver has a shared 16-way 2 MB L2 cache for a pair of cores (4 MB for four cores in total). Jaguar has a shared 16-way 2 MB L2 cache for four cores. Piledriver is better here.
- Jaguar has shorter pipelines than Piledriver. This should improve branching performance and help Jaguar keep its pipelines filled. But Piledriver has more complex branch prediction and lots of other IPC-improving features.
- According to Agner Fog's analysis Bulldozer has significantly more bottleneck cases than Bobcat. Both Piledriver and Jaguar improved the IPC of their predecessors, and likely shifted the bottlenecks a bit. But it's still unlikely that Jaguar has significantly more bottlenecks than Piledriver.
--> The IPC of Piledriver and Jaguar CPU cores should be pretty close. Of course Piledriver has a much, much higher clock ceiling (for desktop use). However, 17W Piledriver ULV clocks shouldn't be that much higher than comparable Jaguar clocks.
17W ULV Sandy Bridges were clocked at 1.6-1.8 GHz, and Jaguar cores should be slightly above that (1.1 * 1.65 GHz = 1.815 GHz). I don't personally expect 17W ULV Trinity (two modules, four cores) to hit much higher clocks than that (turbo might of course reach 2 GHz+ just like it does on Sandy/Ivy Bridge).
Could you explain the reasons why Trinity/Piledriver has "MUCH better" single-threaded performance than Jaguar? I am not a hardware engineer, so I have likely missed some fine details.
Last edited by sebbbi; 19-Sep-2012 at 15:02. |
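The "8 flops per cycle per core" comparison and the clock guess in this post are easy to check by hand. The sketch below assumes single-precision (32-bit) elements, which is what makes a 128-bit vector come out at 4 lanes; that assumption, and every name in the code, is illustrative rather than taken from any vendor document.

```python
def flops_per_cycle(vector_bits, element_bits, flops_per_lane_by_pipe):
    """Peak flops per cycle for a set of vector pipes.

    flops_per_lane_by_pipe lists, per pipe, how many flops one lane
    retires each cycle: 1 for an add or a mul, 2 for a fused multiply-add.
    """
    lanes = vector_bits // element_bits
    return sum(lanes * flops for flops in flops_per_lane_by_pipe)


def peak_gflops(flops_per_clk, clock_ghz):
    """Theoretical peak GFLOPS for one core at the given clock."""
    return flops_per_clk * clock_ghz


# Jaguar: one 128-bit add pipe plus one 128-bit mul pipe.
JAGUAR_FLOPS = flops_per_cycle(128, 32, [1, 1])
# Piledriver: one 128-bit multiply-add per core, i.e. half of the shared FPU.
PILEDRIVER_FLOPS = flops_per_cycle(128, 32, [2])
# Clock estimate reproduced from the post: 1.1 * 1.65 GHz.
JAGUAR_CLOCK_GHZ = 1.1 * 1.65

print(JAGUAR_FLOPS, PILEDRIVER_FLOPS)
print(f"{JAGUAR_CLOCK_GHZ:.3f} GHz, "
      f"{peak_gflops(JAGUAR_FLOPS, JAGUAR_CLOCK_GHZ):.2f} peak GFLOPS per core")
```

Both cores come out equal on paper, but the Jaguar number needs a balanced mix of adds and multiplies, while the Piledriver number needs code that can actually use the fused operation and an idle sibling core is not required. Peak figures say nothing about how often either condition holds.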
||
|
|
|
|
|
#44 |
|
Member
Join Date: Jun 2008
Location: Torquay, UK
Posts: 910
|
I wonder if AMD can customize the L2 cache clock depending on the target power/performance. Bobcat had a 1/2 speed L2 and I'm under the impression that Jaguar for tablets and nettops will continue that trend.
At least some reports suggest AMD can run the L2 at full clock, which is very desirable for a console design. |
|
|
|
|
|
#45 | |
|
Ohio frog
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172
|
Quote:
They didn't speak about it, but I guess they could implement something akin to the feature that is supposed to make it into Steamroller: cutting power to the unused part of the L2.
__________________
What's trying to be a bunch of presentations PS360 youtube channel Sebbbi about virtual texturing Tuned EADGCF and liking it :) |
|
|
|
|
|
|
#46 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
PD also undoubtedly has much bigger OoO resources, and probably better load/store disambiguation. As far as L1D is concerned, Jaguar does have the bigger cache but loses in associativity (2-way vs 4-way), which is a liability on some workloads. And from the test numbers I've seen its L2 is not just lower bandwidth but at least as high latency. |
|
|
|
|
|
|
#47 | ||||||
|
Senior Member
Join Date: Oct 2002
Posts: 2,434
|
Quote:
Quote:
Quote:
In any case if you argue with PD's FPU then again it is 2 FPU for BD vs. 4 for Jaguar so it's unclear if that's always a win for PD (interestingly, the benchmarks suggest that having only one FPU does not limit performance that much for multithreaded code even in fp-heavy code - maybe due to the typical quite long latencies of these instructions utilization might not be very high typically for the single-threaded case). Quote:
Quote:
I'm not sure what's better on average, 16KB/4-way vs. 32KB/2-way. I'd call it a draw Quote:
In short imho Jaguar doesn't look that much worse than Piledriver as far as IPC is concerned. Last edited by mczak; 20-Sep-2012 at 01:08. |
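The "16KB/4-way vs. 32KB/2-way is a draw" point can be illustrated with a toy cache model. This is a minimal sketch, assuming 64-byte lines and true LRU replacement; neither assumption comes from the thread, and real L1D caches have prefetchers and other details this ignores.

```python
from collections import OrderedDict


def simulate_misses(addresses, size_bytes, ways, line_bytes=64):
    """Count misses for a set-associative cache with LRU replacement.

    line_bytes=64 and strict LRU are assumptions made for illustration.
    """
    num_sets = size_bytes // (ways * line_bytes)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        cache_set = sets[line % num_sets]
        if line in cache_set:
            cache_set.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(cache_set) >= ways:
                cache_set.popitem(last=False)  # evict least recently used
            cache_set[line] = True
    return misses


KB = 1024
# Pattern A: three hot lines 16 KB apart, which map to one set in both caches.
conflict = [0, 16 * KB, 32 * KB] * 10
# Pattern B: a 24 KB array swept front to back ten times, one access per line.
sweep = list(range(0, 24 * KB, 64)) * 10

for label, pattern in (("conflict", conflict), ("sweep", sweep)):
    small_4way = simulate_misses(pattern, 16 * KB, ways=4)
    big_2way = simulate_misses(pattern, 32 * KB, ways=2)
    print(f"{label}: 16KB/4-way misses={small_4way}, 32KB/2-way misses={big_2way}")
```

Under these assumptions the 4-way cache wins clearly on the conflict pattern and the larger 2-way cache wins clearly on the sweep, so which design is "better on average" depends entirely on the mix of access patterns in the workload.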
||||||
|
|
|
|
|
#48 | |||
|
Member
Join Date: Nov 2007
Posts: 938
|
Already discussed on the last page, but a short recap.
Bobcat vs K10 (Athlon II X4 630 downclocked to 1.6GHz): http://www.xtremesystems.org/forums/...t-vs-K10-vs-K8
In generic integer calculations Bobcat has 5% lower IPC than a similarly clocked K10. The slightly improved Stars core (Llano) has a few percent higher IPC than K10. Bulldozer has slightly worse IPC than Stars, and Piledriver has pretty much equal IPC to Stars. AMD promised a +15% IPC increase from Bobcat->Jaguar. If these predictions hold any water, the IPC of Jaguar might even be higher than Bulldozer's, and be very comparable to Piledriver's.
The worst case for Bobcat against K10 was SIMD floating point performance (K10 has on average 50.19% better performance per clock). Jaguar doubles the SIMD float (and integer) performance (64 bit -> 128 bit SIMD) and introduces all the new AVX instructions. I don't think we have to worry about SIMD performance anymore. It should now be comparable to BD/PD. Quote:
Quote from Agner Fog's analysis: "The decoders can handle four instructions per clock cycle. Instructions that belong to different cores cannot be decoded in the same clock cycle. When both cores are active, the decoders serve each core every second clock cycle, so that the maximum decode rate is two instructions per clock cycle per core." So Jaguar and BD/PD cores can decode an equal amount of instructions per cycle. Also BD/PD have an additional stall case, because the decoder is shared between two cores (Agner Fog): "Instructions that generate more than two macro-ops are using microcode. The decoders cannot do anything else while microcode is generated. This means that both cores in a compute unit can stop decoding for several clock cycles after meeting an instruction that generates more than two macro-ops" Quote:
Quote:
Can you quote where you got this information? According to semiaccurate.com "the caches can run at half clock to save power when needed". If I understood that correctly, the caches can be dynamically downclocked to save power (when CPU load is light). |
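The IPC chain and the decode-rate argument in this post are simple enough to compute. A minimal sketch, using only the percentages quoted in the post (Bobcat 5% below K10, a promised +15% from Bobcat to Jaguar) and the decoder description quoted from Agner Fog; the function and variable names are illustrative, and a marketing average says nothing about any particular workload.

```python
def chain_ipc(base_ipc, *relative_changes):
    """Apply successive relative IPC changes, e.g. -0.05 then +0.15."""
    ipc = base_ipc
    for change in relative_changes:
        ipc *= 1.0 + change
    return ipc


def decode_rate_per_core(decode_width, active_cores):
    """Average decode rate per core when one decoder alternates between cores."""
    return decode_width / active_cores


# Normalize K10 to 1.0: Bobcat is 5% lower, Jaguar is promised +15% over Bobcat.
bobcat_vs_k10 = chain_ipc(1.0, -0.05)
jaguar_vs_k10 = chain_ipc(1.0, -0.05, +0.15)
print(f"Bobcat {bobcat_vs_k10:.4f}x K10, Jaguar {jaguar_vs_k10:.4f}x K10")

# BD/PD module: a 4-wide decoder serving each active core every second cycle.
print(decode_rate_per_core(4, active_cores=2), "instructions per clock per core")
```

On these numbers Jaguar would land roughly 9% above K10 in generic integer code, which is where the "comparable to Piledriver" claim comes from. The estimate is only as good as the promised +15%.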
|||
|
|
|
|
|
#49 | |||
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
Quote:
Quote:
This belief that Jaguar can surpass Llano's IPC is sorely lacking in any kind of architectural justification. |
|||
|
|
|
|
|
#50 | |
|
Ohio frog
Join Date: Jun 2005
Location: Ohio, USA
Posts: 4,172
|
Quote:
I guess they back their claim with that slide: ![]() I'm not sure myself what they mean by L2D. Could it be that the data banks of the L2 are only powered when there is a cache hit? To tell the truth, I don't understand.
__________________
What's trying to be a bunch of presentations PS360 youtube channel Sebbbi about virtual texturing Tuned EADGCF and liking it :) |
|
|
|
|
| Tags |
| amd, bobcat, jaguar, x86, zacate, zatoichi |
|