|
|
#76 | |
|
Member
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
|
Quote:
|
|
|
|
|
|
|
#77 |
|
Member
Join Date: Apr 2007
Location: Australia
Posts: 647
|
The other thing about Bulldozer is that the L2 isn't that slow if you consider what it handles: its function is closer to the SB/IB L3 than to their L2, and the latencies are closer as well. The issue is the small L1 and/or the lack of an intermediate cache rather than the L2 itself.
|
|
|
|
|
|
#78 | ||
|
Member
Join Date: Nov 2007
Posts: 945
|
Quote:
Quote:
How much this advantage affects your code base is another debate. As a console programmer, I naturally have a completely different view of this issue than many PC (or server) software programmers. In games there's a huge amount of (mainly vectorized) batch processing happening every frame (viewport culling, matrix multiplies, particle processing, etc.). All this batch processing can take 50%-80% of your frame time (depending on game type, of course). This kind of code is often highly optimized and uses (often manual) data prefetching heavily, and thus doesn't hit cache (or pipeline) stalls that much. This is also visible in BD benchmarks: it fares well in a lot of PC-based software, but not so well in games ported from consoles.
I have to disagree with this one. I have never seen heavy inner loops with more than 75% MAD/FMA instructions. Even the simplest pure SOA-style dot product loop has 3 FMA + 1 MUL per four dot products (75% FMA). Our inner view culling loop (mainly 5 dot products per viewport) has some vector float compares and splats in addition to those (around 50% FMA). A 4x4 matrix multiply is another FMA-heavy operation, but that is 16 splats, 4 MULs and 12 FMAs (37.5% FMA). If you consider that splats go to the MMX pipeline on BD, the share of FMA entering the float vector pipeline is again 75% (common for pure dot product based operations). It's very hard to create a function with more than 75% FMA. |
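The instruction-mix percentages quoted in this post can be checked with a few lines of arithmetic. The sketch below is only an illustration: `soa_dot4` and `fma_share` are hypothetical helper names, the Python lists stand in for 4-wide SIMD registers, and plain Python does not fuse the multiply and add the way a real FMA instruction does.

```python
def soa_dot4(a, b):
    """Compute four 4-component dot products at once in SoA layout.

    a and b are dicts with keys 'x', 'y', 'z', 'w'. Each value is a list of
    four floats, i.e. one "register" holding the same component of four
    different vectors. Returns (results, op_counts).
    """
    ops = {"mul": 0, "fma": 0}
    # One vector MUL: r = ax * bx
    r = [p * q for p, q in zip(a["x"], b["x"])]
    ops["mul"] += 1
    # Three vector FMAs: r = a? * b? + r (emulated, not actually fused)
    for c in ("y", "z", "w"):
        r = [p * q + acc for p, q, acc in zip(a[c], b[c], r)]
        ops["fma"] += 1
    return r, ops


def fma_share(fma, mul=0, other=0):
    """Fraction of FMA instructions in an instruction mix."""
    return fma / (fma + mul + other)


a = {"x": [1.0, 0.0, 2.0, 1.0], "y": [0.0, 1.0, 2.0, 1.0],
     "z": [0.0, 0.0, 2.0, 1.0], "w": [0.0, 0.0, 2.0, 1.0]}
b = {"x": [1.0, 5.0, 1.0, 1.0], "y": [2.0, 6.0, 1.0, 2.0],
     "z": [3.0, 7.0, 1.0, 3.0], "w": [4.0, 8.0, 1.0, 4.0]}

dots, ops = soa_dot4(a, b)
dot_share = fma_share(ops["fma"], ops["mul"])          # dot product loop
mat4_share_all = fma_share(fma=12, mul=4, other=16)    # 4x4 multiply, splats counted
mat4_share_fp = fma_share(fma=12, mul=4)               # 4x4 multiply, splats excluded
```

With the counts from the post, the dot product loop and the splat-free matrix multiply both come out at 75% FMA, and the full matrix multiply at 37.5%.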
||
|
|
|
|
|
#79 | |
|
Senior Member
|
Quote:
__________________
Apple: China -- Brutal leadership done right.
Google: United States -- Somewhat democratic. Microsoft: Russia -- Big and bloated. Linux: EU -- Diverse and broke. |
|
|
|
|
|
|
#80 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,439
|
Quote:
|
|
|
|
|
|
|
#81 |
|
Senior Member
|
K8/K10 also had dedicated ECC protection for the L1D, whereas BD omitted this feature, so now on every single-bit error the corresponding cache line must be reloaded from the L2.
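As a rough illustration of why a cache without ECC has to refetch a line on a single-bit error: a plain parity bit can detect one flipped bit but cannot tell which bit flipped, so the only recovery is to reload a clean copy from the next level. Parity is an assumption made for this sketch (the post only says that dedicated ECC was dropped), as is the premise that the L2 always holds a valid copy, which requires a write-through L1. All names are hypothetical.

```python
def parity(word: int) -> int:
    """Even-parity bit: 1 if the number of set bits in word is odd, else 0."""
    return bin(word).count("1") & 1


def read_with_parity(l1_word: int, stored_parity: int, l2_word: int):
    """Return (value, reloaded).

    If the stored parity still matches, the L1 copy is used. Otherwise the
    error is detected but not correctable, so the L2 copy is used instead.
    """
    if parity(l1_word) == stored_parity:
        return l1_word, False
    return l2_word, True


clean = 0b10110010                 # data as originally written
stored = parity(clean)             # parity bit stored alongside the data
corrupted = clean ^ (1 << 5)       # a single bit flips in L1

value, reloaded = read_with_parity(corrupted, stored, l2_word=clean)
```

An ECC code such as SECDED would instead correct the single-bit error in place, without the round trip to the L2.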
|
|
|
|
|
#82 |
|
Senior Member
Join Date: Oct 2002
Posts: 2,439
|
Unless you're living very close to the sun I don't think you'd notice it's slower due to that happening once every other year...
|
|
|
|
|
|
#83 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,579
|
Quote:
For "normal" voltage they had 1 every 10 million cycles, and that was way back on 130nm. So I'd expect the number to be in the seconds, not years. Nonetheless, write-through caches have much lower error rates. |
|
|
|
|
|
|
#84 | |
|
Member
Join Date: May 2002
Location: Herwood, Tampere, Finland
Posts: 264
|
Quote:
But even in those 4-wide cases where there are 3 FMA + 1 MUL, 2 FMA units can give a throughput of 1 iteration per 2 cycles (*), but one adder + one multiplier can only give a throughput of one iteration per 4 cycles. So FMA still gives twice the throughput.
(*) Assuming we can parallelize the code so that the serialization of the FMAs does not become a bottleneck, for example by running "multiple totally independent work items" in parallel in the same SIMD lane.
Last edited by hkultala; 25-Sep-2012 at 09:18. |
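The throughput argument above can be written down as a simple issue-slot count. This is a minimal sketch under the same assumption as the footnote (enough independent work that latency and dependency chains never stall issue); the function names are hypothetical and the model ignores everything except how many operations each unit must issue.

```python
from math import ceil


def cycles_with_fma_units(fma, mul, add, fma_units):
    """Cycles per iteration when every unit can issue one FMA, MUL or ADD per cycle."""
    return ceil((fma + mul + add) / fma_units)


def cycles_with_split_units(fma, mul, add, multipliers, adders):
    """Cycles per iteration without FMA hardware.

    Each FMA is split into one MUL plus one ADD; the busier unit type
    determines the throughput.
    """
    mul_ops = fma + mul
    add_ops = fma + add
    return max(ceil(mul_ops / multipliers), ceil(add_ops / adders))


# One iteration of the 4-wide dot product loop: 3 FMA + 1 MUL
fused = cycles_with_fma_units(fma=3, mul=1, add=0, fma_units=2)
split = cycles_with_split_units(fma=3, mul=1, add=0, multipliers=1, adders=1)
speedup = split / fused
```

In the split case the multiplier has to issue 4 operations while the adder issues only 3, so the multiplier is the bottleneck.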
|
|
|
|
|
|
#85 |
|
Member
Join Date: Nov 2007
Posts: 945
|
I am intrigued by the fact that AMD chose to implement L1D caches twice as large and twice as associative in the (low-end) Bobcat as they did in Bulldozer. Did they have to compromise the cache design in Bulldozer to reach high clocks? Current max BD turbo is at 4.5 GHz (and without heat limits BD can overclock up to 8 GHz with LN2). They seem to have lots of extra clock headroom that they cannot use (even on desktops) because of TDP/heat constraints.
Intel seems to be using its extra clock headroom to improve IPC (clocks haven't improved lately, but IPC has). The same seems to be true for Haswell. This kind of development seems to fit better with the idea of using downclocked high-end parts in 10W/17W ultraportables. BD/PD at 1.6 GHz (AMD A8-4555M) isn't in any way an optimal use of hardware (it has so many needless extra transistors dedicated to reaching higher clocks).
Yes. But BD has only two FMA units per module (2 cores), while Bobcat/Jaguar have two adders and two multipliers. So both reach the same throughput. |
|
|
|
|
|
#86 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,579
|
Quote:
I'm guessing AMD chose the quite wide associativity in Bobcat's L1D at least partially so they could use VIPT w/o aliasing problems. The L1I could be PIPT or maybe they don't mind aliasing flushes as much there (at least they didn't on BD) |
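The link between associativity and VIPT aliasing comes down to one comparison: if a single way of the cache is no larger than a page, all index bits fall inside the page offset, so virtual and physical indexing pick the same set. A minimal sketch follows, assuming 4 KiB pages; the cache geometries are commonly cited figures rather than numbers taken from this post, and the function names are hypothetical.

```python
PAGE_BYTES = 4096  # assumed 4 KiB pages


def way_bytes(cache_bytes, ways):
    """Size of one way, i.e. the address range covered by index + line offset bits."""
    return cache_bytes // ways


def vipt_alias_free(cache_bytes, ways, page_bytes=PAGE_BYTES):
    """True if a VIPT cache of this geometry cannot alias.

    When one way fits in a page, the index bits are untranslated page-offset
    bits, so two virtual addresses of the same physical line map to the same set.
    """
    return way_bytes(cache_bytes, ways) <= page_bytes


# (capacity in bytes, associativity): illustrative, commonly cited geometries
caches = {
    "Bobcat L1D (32 KiB, 8-way)": (32 * 1024, 8),
    "Bulldozer L1D (16 KiB, 4-way)": (16 * 1024, 4),
    "Bulldozer L1I (64 KiB, 2-way)": (64 * 1024, 2),
}

results = {name: vipt_alias_free(size, ways) for name, (size, ways) in caches.items()}
```

Under these assumptions both L1D designs have 4 KiB ways and are alias-free, while a 64 KiB 2-way cache has 32 KiB ways and needs some other mechanism to deal with aliases.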
|
|
|
|
|
|
#87 | ||
|
Senior Member
Join Date: Oct 2002
Posts: 2,439
|
Quote:
I dunno, though: sacrificing cache size/associativity so you can reach higher clock speeds which you actually can't reach in practice anyway sounds like a colossal mistake to me. The P4 at least could actually hit higher clock speeds in practice (not that it helped, mind you, but the small L1D cache was probably the smallest of its problems). Though Atom isn't that far ahead of BD there with its 24kB/6-way L1D cache.
Quote:
|
||
|
|
|
|
|
#88 |
|
Senior Member
Join Date: Feb 2004
Posts: 2,447
|
Forgive me, but what do VIPT and PIPT stand for? Once I get what they stand for, I can just look it up, so y'all don't need to explain the whole thing.
Thanks in advance. |
|
|
|
|
|
#89 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,136
|
Virtually Indexed Physically Tagged
Physically Indexed Physically Tagged
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
#90 |
|
Senior Member
Join Date: Feb 2004
Posts: 2,447
|
Thank you!
|
|
|
|
|
|
#91 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,439
|
Quote:
But probably a better comparison for quad-core Kabini would be against a ULV 2-module Kaveri. |
|
|
|
|
|
|
#92 |
|
Senior Member
|
|
|
|
|
|
#93 |
|
Red-headed step child
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,084
|
Damn! That's pretty swanky...
I'm looking to buy a long-lifed Win8 Pro dockable tablet for my wife this year; I'd love to get one of those Jaguar cores over the Atom options that would otherwise fit the bill.
__________________
"...twisting my words" |
|
|
|
|
|
#94 |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,136
|
The L2 interface takes up about as much space as an entire core.
There's a whitepaper out there talking about the power optimization tech used for Jaguar, and the addition of that interface added a large amount of active flops to the design relative to other components. I wonder what other scalability measures it has besides allowing for a shared 4-bank L2.
|
|
|
|
|
#95 |
|
Red-headed step child
Join Date: Jun 2004
Location: Guess ;)
Posts: 3,084
|
I think I found the whitepaper you are referring to over on Calypto. I'll give it a read later tonight...
|
|
|
|
|
#96 | |
|
Senior Member
|
Quote:
Well, the bus interface unit in a BD module is not small either.
|
|
|
|
|
|
#97 |
|
Senior Member
Join Date: Feb 2004
Posts: 2,447
|
When is Jaguar due out? I don't recall any news about that...
|
|
|
|
|
|
#98 |
|
Heteroscedasticitate
Join Date: Mar 2005
Posts: 2,354
|
As far as I know, and subject to change, we'll probably see Kabini (desktop Jag) around May-ish. Lisa Su had a tentative roadmap in her CES 2013 talk. Take it with an adequate grain of salt though; AMD's roadmaps are fluid.
__________________
Donald Knuth: Science is what we understand well enough to explain to a computer. Art is everything else we do. |
|
|
|
|
|
#99 |
|
Senior Member
Join Date: Feb 2004
Posts: 2,447
|
|
|
|
|
|
|
#100 |
|
Senior Member
|
There are rumors that the next MS Surface Pro will be Kabini-based, in addition to a higher-end Haswell model if I recall correctly. For whatever that's worth.
__________________
"Well, you mentioned Disneyland, I thought of this porn site, and then bam! A blue Hulk." —The Creature My (currently dormant) blog: Teχlog |
|
|
|
| Tags |
| amd, bobcat, jaguar, x86, zacate, zatoichi |