Old 23-Sep-2012, 22:28   #76
hkultala
Member
 
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
Default

Quote:
Originally Posted by sebbbi View Post
Basically my Jaguar vs Trinity/PD investigation began, because I wanted to get deeper insight how a 17W (2 module, 4 core) low clocked Trinity would compare to the new Jaguar core based APU (both have four cores, similar clock rate, similar TDP and can sustain 2 uops/cycle/core). After the trade show event (in January), there has been zero news about the 17W Trinity, and it has been eight months since. I am just wondering if AMD are going to replace it with a Jaguar based APU. A 1.815 GHz (=1.65*1.1) Jaguar based APU should be very close in performance compared to the 17W Trinity running at the rumored 1.5-1.6 GHz clocks. That's why IPC comparisons make sense. I am just trying to figure out how they would compare in a TDP constrained setting.
Then you should start doing IPC comparisons that make sense instead of senseless "peak execution throughput" comparisons.
hkultala is offline   Reply With Quote
Old 23-Sep-2012, 23:23   #77
itsmydamnation
Member
 
Join Date: Apr 2007
Location: Australia
Posts: 645
Default

The other thing about Bulldozer is that the L2 isn't that slow if you consider what it handles: its function is closer to the SB/IB L3 than their L2, and the latencies are closer as well. The issue is the small L1 and/or the lack of an intermediate cache rather than the L2 itself.
itsmydamnation is offline   Reply With Quote
Old 24-Sep-2012, 10:03   #78
sebbbi
Member
 
Join Date: Nov 2007
Posts: 941
Default

Quote:
Originally Posted by itsmydamnation View Post
The other thing about Bulldozer is that the L2 isn't that slow if you consider what it handles: its function is closer to the SB/IB L3 than their L2, and the latencies are closer as well. The issue is the small L1 and/or the lack of an intermediate cache rather than the L2 itself.
Exactly. AMD's L2 caches (in BD/PD/Jaguar/Bobcat/etc.) are larger and slower than Intel's L2 caches. They sit somewhere between Intel's L2 and L3 caches (in both size and latency). Because there's nothing in between L1D<->L2, the size and associativity of L1D is very important for AMD architectures. This is unfortunately an area where BD/PD are lacking (tiny 16 kB 4-way L1) compared to other AMD designs (Bobcat/Jaguar/Stars/Phenom all have bigger L1D caches with better associativity).
Quote:
Originally Posted by hkultala View Post
But while all operations of one thread are sitting in their reservation stations waiting for their data, the fpu could be used to execute operations from another thread, so bulldozer's shared fpu and intel's hyperthreading both work very well for this.
Agreed. This is the biggest advantage BD/PD has over all other AMD architectures (including Stars, Bobcat and Jaguar). When running generic object oriented code (with unpredictable access patterns) BD/PD should have a nice advantage (because of all the pipeline and cache stalls that can be filled).

How much this advantage affects your code base is another debate. As a console programmer, I naturally have a completely different view of this issue than many PC (or server) software programmers. In games there's a huge amount of (mainly vectorized) batch processing happening every frame (viewport culling, matrix multiplies, particle processing, etc.). All this batch processing can take 50%-80% of your frame time (depending on game type of course). This kind of code is often highly optimized and uses (often manual) data prefetching heavily, and thus doesn't hit cache (or pipeline) stalls that much. This is also something that is visible in BD benchmarks. It fares well in a lot of PC software, but not so well in games ported from consoles.
Quote:
Originally Posted by hkultala View Post
in most algorithms like >90% of the executed FP operations really are fma operations, so fma can really give a huge boost.
I have to disagree with this one. I have never seen heavy inner loops with more than 75% of MAD/FMA instructions. Even the simplest pure SOA-style dot product loop has 3xFMA+1xMUL per four dot products (75% FMA). Our inner view culling loop (mainly 5 dot products per viewport) has some vector float compares and splats in addition to those (around 50% FMA). 4x4 matrix multiply is another FMA heavy operation, but that is 16 splat, 4 mul, 12 FMA (37.5% FMA). If you consider the fact that splats go to MMX pipeline on BD, the percentage of FMA entering float vector pipeline is again 75% (common for pure dot product based operations). It's very hard to create a function with more than 75% FMA.
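
For anyone who wants to check the percentages above, here is a minimal sketch in Python. It only counts instructions using the mixes quoted in this post; the function name `fma_share` is made up for illustration and nothing here measures real hardware.

```python
from fractions import Fraction


def fma_share(fma, other):
    """Return the fraction of instructions that are FMA.

    fma   -- number of FMA instructions in the loop body
    other -- number of all remaining instructions (MUL, splat, ...)
    """
    return Fraction(fma, fma + other)


# Instruction mixes exactly as quoted in the post (hypothetical variable names).
dot_product = fma_share(fma=3, other=1)        # 3 FMA + 1 MUL
matrix_all = fma_share(fma=12, other=16 + 4)   # 12 FMA, 16 splat, 4 MUL
matrix_fp_pipe = fma_share(fma=12, other=4)    # splats left out (they go to the MMX pipe on BD)

for name, share in [
    ("SOA dot product", dot_product),
    ("4x4 matrix multiply, all instructions", matrix_all),
    ("4x4 matrix multiply, float vector pipe only", matrix_fp_pipe),
]:
    print(f"{name}: {float(share):.1%}")
```

The three shares come out as 3/4, 3/8 and 3/4, matching the 75%, 37.5% and 75% figures in the text.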
sebbbi is offline   Reply With Quote
Old 24-Sep-2012, 14:26   #79
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Default

Quote:
Originally Posted by sebbbi
Because there's nothing in between L1D<->L2, the size and associativity of L1D is very important for AMD architectures. This is unfortunately an area where BD/PD are lacking (tiny 16 kB 4-way L1) compared to other AMD designs (Bobcat/Jaguar/Stars/Phenom all have bigger L1D caches with better associativity).
Well, for BD actually there's a write coalescing cache in between L1D and L2, but it's too small and strictly profiled to cover all the issues.
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic.
Microsoft: Russia -- Big and bloated.
Linux: EU -- Diverse and broke.
fellix is offline   Reply With Quote
Old 24-Sep-2012, 15:01   #80
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,436
Default

Quote:
Originally Posted by sebbbi View Post
Because there's nothing in between L1D<->L2, the size and associativity of L1D is very important for AMD architectures. This is unfortunately an area where BD/PD are lacking (tiny 16 kB 4-way L1) compared to other AMD designs (Bobcat/Jaguar/Stars/Phenom all have bigger L1D caches with better associativity).
Stars/Phenom l1d do not have better associativity - quite the contrary it's only 2-way (which does seem low indeed) but of course they are much bigger (64kB). And they were exclusive and write-back, of course.
mczak is offline   Reply With Quote
Old 24-Sep-2012, 21:19   #81
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Default

Quote:
Originally Posted by mczak View Post
Stars/Phenom l1d do not have better associativity - quite the contrary it's only 2-way (which does seem low indeed) but of course they are much bigger (64kB). And they were exclusive and write-back, of course.
K8/K10 also had dedicated ECC protection for the L1D, whereas BD omitted this feature, so now on every single-bit error the corresponding cache line must be reloaded from the L2.
fellix is offline   Reply With Quote
Old 24-Sep-2012, 21:40   #82
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,436
Default

Quote:
Originally Posted by fellix View Post
K8/K10 also had dedicated ECC protection for the L1D, whereas BD omitted this feature, so now on every single-bit error the corresponding cache line must be reloaded from the L2.
Unless you're living very close to the sun I don't think you'd notice it's slower due to that happening once every other year...
mczak is offline   Reply With Quote
Old 24-Sep-2012, 21:58   #83
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,556
Default

Quote:
Originally Posted by mczak View Post
Unless you're living very close to the sun I don't think you'd notice it's slower due to that happening once every other year...
I don't think it's so low. The susceptibility goes up as voltage and feature size go down. Check out the soft-error rate models used in this publication: http://www.cse.psu.edu/~mdl/paper/lin-islped04.pdf

For "normal" voltage they had 1 every 10 million cycles, and that was way back on 130nm. So I'd expect the number to be in the seconds, not years.
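
As a rough sanity check, the "1 every 10 million cycles" figure can be turned into wall-clock time by dividing by the clock rate. The sketch below takes that figure purely at face value; the clock speeds are assumptions picked for illustration, and the model rate may well not match what a shipping part sees in the field.

```python
# Figure quoted above: one soft error per 10 million cycles (taken at face value).
CYCLES_PER_ERROR = 10_000_000


def seconds_between_errors(clock_hz, cycles_per_error=CYCLES_PER_ERROR):
    """Mean time between errors in seconds for a core running at clock_hz."""
    return cycles_per_error / clock_hz


# Assumed example clocks; none of these come from the paper.
for ghz in (1.0, 2.0, 4.0):
    interval = seconds_between_errors(ghz * 1e9)
    print(f"{ghz:.1f} GHz: one error every {interval * 1000:.1f} ms")
```

With that rate the interval at GHz clocks is far below a year, which is the only point being made here.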

Nonetheless, write-through caches have much lower error rates.
Exophase is offline   Reply With Quote
Old 25-Sep-2012, 08:47   #84
hkultala
Member
 
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
Default

Quote:
Originally Posted by sebbbi View Post
I have to disagree with this one. I have never seen heavy inner loops with more than 75% of MAD/FMA instructions. Even the simplest pure SOA-style dot product loop has 3xFMA+1xMUL per four dot products (75% FMA). Our inner view culling loop (mainly 5 dot products per viewport) has some vector float compares and splats in addition to those (around 50% FMA). 4x4 matrix multiply is another FMA heavy operation, but that is 16 splat, 4 mul, 12 FMA (37.5% FMA). If you consider the fact that splats go to MMX pipeline on BD, the percentage of FMA entering float vector pipeline is again 75% (common for pure dot product based operations). It's very hard to create a function with more than 75% FMA.
OK, you have small vectors in 3D graphics. I was thinking more in the direction of signal processing and scientific workloads.

But even in that 4-wide case where there are 3 FMA + 1 MUL, 2 FMA units can give a throughput of 1 iteration/2 cycles(*), but one adder + one multiplier can only give a throughput of one iteration/4 cycles.
So FMA still gives twice the throughput.

(*) (assuming we can parallelize the code so that the serialization of the FMAs does not become a bottleneck, for example by running "multiple totally independent work items" in parallel in the same SIMD lane)
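
The cycle counts above can be reproduced with a simple issue-slot model, sketched here in Python. It assumes every unit is fully pipelined and accepts one operation per cycle, and that there are enough independent work items to hide dependency latency (the footnote's assumption). The function names are made up for illustration.

```python
from math import ceil


def cycles_with_fma_units(n_fma, n_mul, n_units):
    """Cycles per iteration when every FMA or MUL takes one slot on an FMA unit."""
    return ceil((n_fma + n_mul) / n_units)


def cycles_with_split_units(n_fma, n_mul, n_adders, n_multipliers):
    """Cycles per iteration without FMA: each FMA becomes one MUL plus one ADD."""
    muls = n_fma + n_mul   # multiplies from the FMAs plus the plain MUL
    adds = n_fma           # one add per former FMA
    return max(ceil(muls / n_multipliers), ceil(adds / n_adders))


# The 3 FMA + 1 MUL iteration discussed above.
fma_cycles = cycles_with_fma_units(n_fma=3, n_mul=1, n_units=2)
split_cycles = cycles_with_split_units(n_fma=3, n_mul=1, n_adders=1, n_multipliers=1)

print(f"2 FMA units: {fma_cycles} cycles/iteration")
print(f"1 adder + 1 multiplier: {split_cycles} cycles/iteration")
```

In the split case the multiplier is the bottleneck (4 multiplies against 3 adds), which is where the 4 cycles per iteration come from.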

Last edited by hkultala; 25-Sep-2012 at 09:18.
hkultala is offline   Reply With Quote
Old 25-Sep-2012, 14:59   #85
sebbbi
Member
 
Join Date: Nov 2007
Posts: 941
Default

I am intrigued by the fact that AMD chose to give the (low-end) Bobcat an L1D cache twice as large and twice as associative as Bulldozer's. Did they have to compromise the cache design in Bulldozer to reach high clocks? Current max BD turbo is at 4.5 GHz (and without heat limits BD can overclock up to 8 GHz with LN2). They seem to have lots of extra clock headroom that they cannot use (even on desktops) because of TDP/heat constraints.

Intel seems to be using its extra clock headroom to improve IPC (clocks haven't improved lately but IPC has). The same seems to be true for Haswell. This kind of development seems to better fit the idea of using downclocked high-end parts in 10W/17W ultraportables. BD/PD at 1.6 GHz (AMD A8-4555M) isn't in any way an optimal use of hardware (it has so many needless extra transistors dedicated to reaching higher clocks).
Quote:
Originally Posted by hkultala View Post
But even in that 4-wide case where there are 3 FMA + 1 MUL, 2 FMA units can give a throughput of 1 iteration/2 cycles(*), but one adder + one multiplier can only give a throughput of one iteration/4 cycles. So FMA still gives twice the throughput.
Yes. But BD has only two FMA units per module (2 cores), while Bobcat/Jaguar have two adders and two multipliers. So both reach the same throughput.
sebbbi is offline   Reply With Quote
Old 25-Sep-2012, 20:21   #86
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,556
Default

Quote:
Originally Posted by sebbbi View Post
I am intrigued by the fact that AMD chose to give the (low-end) Bobcat an L1D cache twice as large and twice as associative as Bulldozer's. Did they have to compromise the cache design in Bulldozer to reach high clocks? Current max BD turbo is at 4.5 GHz (and without heat limits BD can overclock up to 8 GHz with LN2). They seem to have lots of extra clock headroom that they cannot use (even on desktops) because of TDP/heat constraints.
That has to be the case; BD's L1D cache is really small by all comparable standards. Even most ARM chips have been using 32KB L1 for a while now, although Cortex-A15 is going to 2-way associativity, regrettably. Still, you can see that Bobcat's decisions are not out of place; rather, it's BD that looks odd.

I'm guessing AMD chose the quite wide associativity in Bobcat's L1D at least partially so they could use VIPT w/o aliasing problems. The L1I could be PIPT or maybe they don't mind aliasing flushes as much there (at least they didn't on BD)
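
To make the VIPT point concrete: a virtually indexed cache avoids aliasing when one way (cache size divided by associativity) is no larger than a page, because the index bits then all fall inside the page offset. Below is a small Python sketch of that check using the L1D sizes mentioned in this thread (Bobcat taken as 32 kB 8-way, i.e. twice BD's size and associativity). The 4 kB page size and the function names are assumptions for illustration.

```python
PAGE_SIZE = 4 * 1024  # assumption: 4 kB base pages


def way_size(cache_bytes, ways):
    """Bytes covered by a single way of the cache."""
    return cache_bytes // ways


def vipt_alias_free(cache_bytes, ways, page_size=PAGE_SIZE):
    """True if index + offset bits fit within the page offset (no aliasing)."""
    return way_size(cache_bytes, ways) <= page_size


# L1D configurations as stated in the thread.
l1d_configs = {
    "Bobcat (32 kB, 8-way)": (32 * 1024, 8),
    "Bulldozer (16 kB, 4-way)": (16 * 1024, 4),
    "Stars/Phenom (64 kB, 2-way)": (64 * 1024, 2),
}

for name, (size, ways) in l1d_configs.items():
    kb_per_way = way_size(size, ways) // 1024
    print(f"{name}: {kb_per_way} kB per way, alias-free: {vipt_alias_free(size, ways)}")
```

Under these assumptions the 8-way 32 kB and 4-way 16 kB configurations both land at 4 kB per way, while 64 kB 2-way gives 32 kB per way and would need some other way to handle aliases.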
Exophase is offline   Reply With Quote
Old 26-Sep-2012, 00:32   #87
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,436
Default

Quote:
Originally Posted by Exophase View Post
That has to be the case; BD's L1D cache is really small by all comparable standards. Even most ARM chips have been using 32KB L1 for a while now, although Cortex-A15 is going to 2-way associativity, regrettably. Still, you can see that Bobcat's decisions are not out of place; rather, it's BD that looks odd.
Yeah, the only other "modern-ish" x86 CPU featuring such a small L1D cache is the P4. Actually, up to Northwood it was only 8kB (4-way), but Prescott bumped it to 16kB (8-way).
I dunno though, sacrificing cache size/associativity so you could reach higher clock speeds which you actually can't reach in practice anyway sounds like a colossal mistake to me. The P4 at least could actually hit higher clock speeds even in practice (not that it helped it, mind you, but the small L1D cache was probably the smallest of its problems).
Though Atom isn't that far ahead of BD there with its 24kB/6-way L1D cache.

Quote:
I'm guessing AMD chose the quite wide associativity in Bobcat's L1D at least partially so they could use VIPT w/o aliasing problems. The L1I could be PIPT or maybe they don't mind aliasing flushes as much there (at least they didn't on BD)
That Bobcat paper mentions L1 ITLB and cache are accessed in parallel which would imply virtual indexing. It also mentions though the itlb isn't actually accessed if it's the same page as previous fetch hence it shouldn't really matter for performance.
mczak is offline   Reply With Quote
Old 26-Sep-2012, 01:47   #88
I.S.T.
Senior Member
 
Join Date: Feb 2004
Posts: 2,442
Default

Forgive me, but what do VIPT and PIPT stand for? Once I get what they stand for, I can just look it up, so y'all don't need to explain the whole thing.

Thanks in advance.
I.S.T. is offline   Reply With Quote
Old 26-Sep-2012, 03:41   #89
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,102
Default

Virtually Indexed Physically Tagged
Physically Indexed Physically Tagged
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 26-Sep-2012, 22:47   #90
I.S.T.
Senior Member
 
Join Date: Feb 2004
Posts: 2,442
Default

Thank you!
I.S.T. is offline   Reply With Quote
Old 28-Sep-2012, 23:06   #91
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,436
Default

Quote:
Originally Posted by sebbbi View Post
Basically my Jaguar vs Trinity/PD investigation began, because I wanted to get deeper insight how a 17W (2 module, 4 core) low clocked Trinity would compare to the new Jaguar core based APU (both have four cores, similar clock rate, similar TDP and can sustain 2 uops/cycle/core). After the trade show event (in January), there has been zero news about the 17W Trinity, and it has been eight months since. I am just wondering if AMD are going to replace it with a Jaguar based APU. A 1.815 GHz (=1.65*1.1) Jaguar based APU should be very close in performance compared to the 17W Trinity running at the rumored 1.5-1.6 GHz clocks. That's why IPC comparisons make sense. I am just trying to figure out how they would compare in a TDP constrained setting.
FWIW, it looks like the 2-module ULV Trinity part (A8-4555M) has been released. Unlike the 1-module version (A6-4455M, released ages ago), it didn't quite make it to 17W; instead it's now a 19W part. Clocks are 1.6GHz/2.4GHz, so the turbo clock should be higher than Jaguar's, but I don't know how often it's actually able to clock up that much. In any case the clocks are quite a bit lower compared to the 25W part (A10-4655M, 2GHz/2.8GHz), though a 1.8GHz quad-core Kabini might also need 25W. I couldn't find information about the A8-4555M GPU other than that it's called 7600G (it could be either a 4 SIMD part or a 6 SIMD part with low clocks). In any case a 4 CU GCN part should look quite favorable compared to that, though Trinity ULV should still be faster there because of dual-channel memory.
But probably a better comparison for quad-core Kabini would be against ULV 2-module Kaveri.
mczak is offline   Reply With Quote
Old 19-Feb-2013, 18:34   #92
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Default

AMD "Jaguar" details from ISSCC'13
fellix is offline   Reply With Quote
Old 19-Feb-2013, 19:07   #93
Albuquerque
Red-headed step child
 
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,084
Default

Damn! That's pretty swanky...

I'm looking to buy a long-lifed Win8 Pro dockable tablet for my wife this year; I'd love to get one of those Jaguar cores over the Atom options that would otherwise fit the bill.
__________________
"...twisting my words"
Quote:
Originally Posted by _xxx_ 1/25 View Post
Get some supplies <...> Within the next couple of months, you'll need it.
Quote:
Originally Posted by _xxx_ 6/9 View Post
And riots are about to begin too.
Quote:
Originally Posted by _xxx_8/5 View Post
food shortages and huge price jumps I predicted recently are becoming very real now.
Quote:
Originally Posted by _xxx_ View Post
If it turns out I was wrong, I'll admit being stupid
Albuquerque is offline   Reply With Quote
Old 19-Feb-2013, 19:24   #94
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,102
Default

The L2 interface takes up about as much space as an entire core.
There's a whitepaper out there talking about the power optimization tech used for Jaguar, and the addition of that interface added a large amount of active flops to the design relative to other components.

I wonder what other scalability measures it has besides allowing for a shared 4-bank L2.
3dilettante is offline   Reply With Quote
Old 19-Feb-2013, 20:18   #95
Albuquerque
Red-headed step child
 
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,084
Default

I think I found the whitepaper you are referring to over on Calypto. I'll give it a read later tonight...
Albuquerque is offline   Reply With Quote
Old 19-Feb-2013, 20:27   #96
fellix
Senior Member
 
Join Date: Dec 2004
Location: Varna, Bulgaria
Posts: 2,819
Default

Quote:
Originally Posted by 3dilettante View Post
The L2 interface takes up about as much space as an entire core.
It's more like a fifth "dedicated" core among the rest, with a lot of active logic and power control functions besides facilitating the interface arbitration.

Well, the bus interface unit in a BD module is not small either.
fellix is offline   Reply With Quote
Old 20-Feb-2013, 03:39   #97
I.S.T.
Senior Member
 
Join Date: Feb 2004
Posts: 2,442
Default

When is Jaguar due out? I don't recall any news about that...
I.S.T. is offline   Reply With Quote
Old 20-Feb-2013, 04:21   #98
AlexV
Heteroscedasticitate
 
Join Date: Mar 2005
Posts: 2,354
Default

As far as I know, and subject to change, we'll probably see Kabini (desktop Jag) around May-ish. Lisa Su had a tentative roadmap in her CES2013 talk. Take with adequate grain of salt though, AMD's roadmaps are fluid.
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do.
AlexV is online now   Reply With Quote
Old 20-Feb-2013, 13:00   #99
I.S.T.
Senior Member
 
Join Date: Feb 2004
Posts: 2,442
Default

Quote:
Originally Posted by AlexV View Post
Take with adequate grain of salt though, AMD's roadmaps are fluid.
To say the least.

Thanks, AlexV.
I.S.T. is offline   Reply With Quote
Old 20-Feb-2013, 21:51   #100
Alexko
Senior Member
 
Join Date: Aug 2009
Posts: 2,023
Default

Quote:
Originally Posted by Albuquerque View Post
Damn! That's pretty swanky...

I'm looking to buy a long-lifed Win8 Pro dockable tablet for my wife this year; I'd love to get one of those Jaguar cores over the Atom options that would otherwise fit the bill.
There are rumors that the next MS Surface Pro will be Kabini-based, in addition to a higher-end Haswell model if I recall correctly. For whatever that's worth.
__________________
"Well, you mentioned Disneyland, I thought of this porn site, and then bam! A blue Hulk." —The Creature
My (currently dormant) blog: Teχlog
Alexko is offline   Reply With Quote


Tags
amd, bobcat, jaguar, x86, zacate, zatoichi
