Old 11-Sep-2012, 23:22   #51
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,442
Default

Quote:
Originally Posted by 3dilettante View Post
It's kind of interesting to see a core that could for some reason generate 3 store addresses a cycle. Perhaps the extra port is to keep the store address out of the way of the load calculations as much as possible.
Maybe a copy&paste error from SNB/IVB? Since there's now a dedicated store AGU, and only one store data port, it seems like there would be no reason at all to use a shared load/store AGU for calculating the store address. In some way that would be more like Nehalem/Westmere, which also had separate load and store AGUs (but of course just one of each).
But OMG this thing is a beast. AMD thought there's not much point in having a 3rd INT ALU, and Intel now has 4...
I wonder about the front-end though, no mention of any improvements there. Is that really good enough to feed that monster back-end?
mczak is offline   Reply With Quote
Old 11-Sep-2012, 23:45   #52
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Default

PDFs for presentations today:

https://intel.activeevents.com/sf12/...ler/catalog.do

click the top link if you're like me and not registered. Do a search for "Haswell." The presentations ARCS001 and SPCS001 are worth a look.
Raqia is offline   Reply With Quote
Old 12-Sep-2012, 00:10   #53
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Default

ARCS001 on page 12 states that the extra AGU for store alleviates pressure on ports 2 & 3 for loads. I guess that's something they identified from their simulations as a bottleneck, maybe for hyperthreading? They've also directly addressed many bottlenecks that Agner Fog identified on page 174 of his guide.
Raqia is offline   Reply With Quote
Old 12-Sep-2012, 00:33   #54
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,442
Default

Hmm interesting that both port 0 and port 1 can do FMA and port 1 now can do fp mul too but port 0 can't do fp add. Any ideas why that would be?
mczak is offline   Reply With Quote
Old 12-Sep-2012, 00:57   #55
Raqia
Member
 
Join Date: Oct 2003
Posts: 320
Default

Maybe in common legacy workloads, FP Adds coincide w/ branches, shifts, and divides.

Edit: I guess more importantly FP Adds coincide w/ FP Mul. The whole point of FMAC is to increase efficiency on that particular combination of instructions, and you still need to take into account legacy code.

Last edited by Raqia; 12-Sep-2012 at 01:11.
Raqia is offline   Reply With Quote
Old 12-Sep-2012, 01:03   #56
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 5,034
Default

No word on the rumored on-package graphics S/DRAM module?
__________________
"If I were a science teacher and a student said the Universe is 6000 years old, I would mark that answer as wrong (why? Because it is)."
-Phil Plait
Grall is offline   Reply With Quote
Old 12-Sep-2012, 02:33   #57
tunafish
Member
 
Join Date: Aug 2011
Posts: 371
Default

Quote:
Originally Posted by Grall View Post
Well, how would it know this RAM is closer...?
Presumably, the BIOS would give hints that good operating systems would record and make use of.

Quote:
Besides, from what I understand no desktop OS can optimize RAM during runtime; once something's loaded somewhere it pretty much stays there.
This is incorrect. All software uses only virtual addresses, so the OS is free to relocate the physical pages anywhere it wants. The oldest example is of course swapping to disk, but modern Linux can do things like migrating memory to a closer NUMA node, or migrating pages to merge many small pages into a few 2MB ones. AFAIK, Windows does nothing of the sort.
tunafish is offline   Reply With Quote
Old 12-Sep-2012, 03:31   #58
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by Grall View Post
Well, how would it know this RAM is closer...? Besides, from what I understand no desktop OS can optimize RAM during runtime; once something's loaded somewhere it pretty much stays there.
The same way the driver knows GPU memory is closer to the GPU. The driver will allocate render targets in that memory and, when it's full, kick them back to CPU RAM.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
rpg.314 is offline   Reply With Quote
Old 12-Sep-2012, 04:12   #59
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,141
Default

Quote:
Originally Posted by sebbbi View Post
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true

... Intel has added two extra ports, but none of them does load related things. And "no changes to key pipelines" either. No mention about other load related improvements either. So my conclusion is that gather likely takes several cycles to complete (even without cache misses).
Port 6 and 7 provide integer capability and branching, while also keeping the vector pipes unencumbered.
An internal gather loop could utilize the extra integer operand access and branch capability of the extra ports without simultaneously blocking the vector pipes that would make use of a gather instruction. The store AGU all by itself seems unbalanced, unless it's sharing that port with something they've chosen not to discuss yet, maybe the specialized hardware that would scan a gather index register and detect how many belong to the same cache line.
Port 5 has vector shuffles, which might include that permute unit that both gather and vector work would like. It's not zero-sum because a gather would provide data from memory in the desired arrangement.

The data given so far makes Haswell sound much more interesting than Steamroller, although there's always the chance that more is to come on the latter's account. The breadth of the engineering effort for this architecture visually dwarfs the competition. The promotion of integer vector instructions to 256-bit is going to put some serious hurt on one of the few areas BD was not outclassed in.
__________________
Dreaming of a .065 micron etch-a-sketch.
3dilettante is offline   Reply With Quote
Old 12-Sep-2012, 04:16   #60
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by sebbbi View Post
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true

... Intel has added two extra ports, but none of them does load related things. And "no changes to key pipelines" either. No mention about other load related improvements either. So my conclusion is that gather likely takes several cycles to complete (even without cache misses).
A microcoded sequence could still be faster.
rpg.314 is offline   Reply With Quote
Old 12-Sep-2012, 11:10   #61
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 5,034
Default

Quote:
Originally Posted by tunafish View Post
This is incorrect. All software uses only virtual addresses, so the OS is free to relocate the physical pages anywhere it wants.
It was my understanding that these virtual addresses were transformed into actual addresses once the program code was loaded somewhere into RAM by the operating system and would thus be unable to move, but if that's not true then it's pretty cool.

Quote:
modern Linux can do things like migrating memory to a closer NUMA node, or migrating pages to merge many small pages into a few 2MB ones.
That sounds extremely useful actually. Considering the number of people experiencing crashes in certain misbehaved software like Bethesda's Skyrim due to memory fragmentation, it would have been nice if Windows had been able to do this as well. Any idea why Microsoft hasn't bothered to pursue this path? Perhaps they're too content with their current market dominance... *shrug*
Grall is offline   Reply With Quote
Old 12-Sep-2012, 11:11   #62
sebbbi
Member
 
Join Date: Nov 2007
Posts: 947
Default

Quote:
Originally Posted by mczak View Post
But OMG this thing is a beast. AMD thought there's not much point in having a 3rd INT ALU, and Intel now has 4...
Intel has hyperthreading, so they can feed the ALUs from two instruction streams. Four ALUs is overkill for ILP alone, but add the TLP from two threads to the mix, and the situation becomes very different. As long as other parts of the chip are not a bottleneck, hyperthreading (two threads on a single core) should have performance closer to two separate (2 ALU) cores (in ALU tasks). It's not looking good for AMD.
Quote:
Originally Posted by rpg.314 View Post
A microcoded sequence could still be faster.
It would likely be at least slightly faster. If nothing else is improved, at least gather takes only one slot in (x86) L1 instruction cache (but several slots in uop cache), and they can choose the optimal uop sequence for the processor (x86 compilers are too general purpose for this task). But that's a somewhat pessimistic view, I must admit. Maybe I have spent too much time evading stuff like microcoded imul and sraw (variable shifts) in console programming.
Quote:
Originally Posted by 3dilettante View Post
Port 6 and 7 provide integer capability and branching, while also keeping the vector pipes unencumbered.
An internal gather loop could utilize the extra integer operand access and branch capability of the extra ports without simultaneously blocking the vector pipes that would make use of a gather instruction. The store AGU all by itself seems unbalanced, unless it's sharing that port with something they've chosen not to discuss yet, maybe the specialized hardware that would scan a gather index register and detect how many belong to the same cache line.
Port 5 has vector shuffles, which might include that permute unit that both gather and vector work would like. It's not zero-sum because a gather would provide data from memory in the desired arrangement.
Good points. That would (also) be a good use for the extra ALU/branch ports. If your algorithm is vector math heavy (almost no ALU ops), and doesn't include too many gathers, the CPU should be able to mask out ("co-issue") all the microcoded ALU ops from the gather. But for algorithms that already have interleaved ALU and vector ops, this technique would make the ALU a bottleneck. It would also prevent one thread (HT) from running ALU-heavy code while the other runs vector-heavy code... but of course the current ways of doing gather manually are even worse. And the fourth ALU helps in both cases.
Quote:
Originally Posted by 3dilettante View Post
The breadth of the engineering effort for this architecture visually dwarfs the competition. The promotion of integer vector instructions to 256-bit is going to put some serious hurt on one of the few areas BD was not outclassed in.
The fourth ALU should improve integer performance as well (especially with hyperthreading). It seems that they have made some very good architectural choices that fit together very well. Previously (a year ago) I thought that gather would be one of the key new features of this architecture, but Haswell has so much more than that to offer. I can't wait to do some performance tests with transactional memory. Assuming it's fully L1 based, a transaction cannot access more than 32KB of memory (minus hyperthreading, minus cache aliasing = around 10KB to be safe). But that's more than enough for games, as game access patterns are usually cache line optimized and limited in scope. Enterprise software, however, might need more than Haswell's L1 has to offer for its transactions.
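
To make the "32KB minus aliasing" estimate concrete, here is a rough sketch of the capacity limit. It assumes the transaction's footprint really is tracked in an L1 with 64-byte lines, 8 ways and 64 sets (32KB total); that geometry, the halving of the ways to stand in for a second hyperthread, and the name fits_in_l1 are all assumptions for illustration, not anything Intel has stated about TSX.

```python
from collections import Counter

LINE = 64                          # bytes per cache line (assumed)
WAYS = 8                           # assumed associativity
SETS = 32 * 1024 // (LINE * WAYS)  # 64 sets for a 32KB cache


def fits_in_l1(addresses, ways=WAYS):
    """True if every touched line could be resident in the L1 at once."""
    lines = {addr // LINE for addr in addresses}
    per_set = Counter(line % SETS for line in lines)
    return all(count <= ways for count in per_set.values())


# A perfectly spread 32KB footprint fits exactly.
sequential_ok = fits_in_l1(range(0, 32 * 1024, LINE))

# Nine lines at a 4KB stride all alias to set 0, so under 1KB overflows.
strided_ok = fits_in_l1([i * 4096 for i in range(9)])

# Toy model of sharing with a second thread: only half the ways available.
half_ok = fits_in_l1(range(0, 16 * 1024, LINE), ways=WAYS // 2)
half_overflow = fits_in_l1(range(0, 20 * 1024, LINE), ways=WAYS // 2)
```

Under these assumptions the usable size depends far more on how the addresses alias than on the nominal 32KB, which is why a conservative figure like 10KB seems sensible.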
sebbbi is offline   Reply With Quote
Old 12-Sep-2012, 11:11   #63
Gubbi
Senior Member
 
Join Date: Feb 2002
Posts: 2,576
Default

Quote:
Originally Posted by sebbbi View Post
Hopefully they spill the beans soon. I just hope that you don't have to code a loop for it (like you must for Knights Corner). The worst case is that it's just a long microcoded sequence, but that wouldn't make much sense. I am keeping my pessimistic view until Intel proves me otherwise. Efficient gather is almost too good to be true
It adds a lot of complexity to support single instruction gather. A gather instruction could generate a multitude of addresses that all cause an MMU page walk, the accesses themselves would require a multi-ported cache to be efficient, and each access could potentially cause a full cache miss.

Worst case, a single gather instruction could take several thousand cycles to complete. So you either make it interruptible or suffer intolerable interrupt latency. The former means you need to save partial state of registers (*ugh*); the latter is just not acceptable.

Mind you, a fair fraction of accesses are likely to miss caches and with 4 to 8 cores on a chip it'll be relatively easy to saturate the main memory interface anyway.

So you end up spending a lot of complexity and power on something that might not add a whole lot of performance in the end.

Cheers
__________________
I'm pink, therefore I'm spam
Gubbi is online now   Reply With Quote
Old 12-Sep-2012, 11:15   #64
sebbbi
Member
 
Join Date: Nov 2007
Posts: 947
Default

Quote:
Originally Posted by Grall View Post
Considering the number of people experiencing crashes in certain misbehaved software like Bethesda's Skyrim due to memory fragmentation, it would have been nice if Windows had been able to do this as well. Any idea why Microsoft hasn't bothered to pursue this path?
Skyrim is a 32 bit executable, and has 32 bit pointers. It runs out of virtual address space, so no reordering can help it. 64 bit pointers allow a 64 bit virtual address space. With 64 bit pointers you pretty much never run out of virtual address space.
sebbbi is offline   Reply With Quote
Old 12-Sep-2012, 11:18   #65
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 5,034
Default

Quote:
Originally Posted by sebbbi View Post
It runs out of virtual address space, so no reordering can help it.
Yeah, I know it's 32-bit, but if Windows supported reordering maybe a large enough chunk of contiguous memory could be presented to the game.
Grall is offline   Reply With Quote
Old 12-Sep-2012, 11:23   #66
sebbbi
Member
 
Join Date: Nov 2007
Posts: 947
Default

Quote:
Originally Posted by Grall View Post
Yeah, I know it's 32-bit, but if Windows supported reordering maybe a large enough chunk of contiguous memory could be presented to the game.
No operating system can reorder your software's own virtual address space. It can only reorder the physical data in memory, and update the page tables accordingly. If you have only 32 bit pointers in your game and you do a lot of dynamic memory allocation, you will eventually run out of contiguous memory blocks (in the 32 bit virtual address space), and there's nothing an OS can do to help you.
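
A toy illustration of that point, with made-up numbers: the address space below has almost 4GB free in total, yet no single free run is big enough for a 64MB request. The function name largest_free_run and the allocation pattern are hypothetical, and the sketch assumes the live allocations don't overlap.

```python
def largest_free_run(size, allocations):
    """Return (total_free, largest_contiguous_free) for an address space.

    `allocations` is a list of non-overlapping (start, length) ranges in use.
    """
    total_free = size
    largest = 0
    cursor = 0
    for start, length in sorted(allocations):
        largest = max(largest, start - cursor)  # gap before this allocation
        total_free -= length
        cursor = start + length
    largest = max(largest, size - cursor)       # gap after the last one
    return total_free, largest


MiB = 1 << 20
SPACE = 1 << 32  # 4 GiB of 32 bit virtual address space

# Hypothetical worst case: one live 1 MiB allocation every 64 MiB.
live = [(i * 64 * MiB, 1 * MiB) for i in range(64)]
total_free, largest = largest_free_run(SPACE, live)
can_allocate_64mib = largest >= 64 * MiB
```

Moving physical pages around changes none of these numbers, because the gaps are in the virtual addresses the program itself holds.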
sebbbi is offline   Reply With Quote
Old 12-Sep-2012, 11:51   #67
tunafish
Member
 
Join Date: Aug 2011
Posts: 371
Default

Quote:
Originally Posted by Grall View Post
It was my understanding that these virtual addresses were transformed into actual addresses once the program code was loaded somewhere into RAM by the operating system and would thus be unable to move, but if that's not true then it's pretty cool.
No, that's what page tables and TLBs are for. Basically, when you issue a load, the first thing that happens is that the CPU looks up the address you gave it in the TLB. If found, it takes the physical address stored in the TLB, and uses it instead. If not found, it fires up the page walker, and walks the page tables (an in-memory data structure) to find the correct physical address (and stores it in the TLB). If still not found, it interrupts into the OS and lets it handle it.

So the address translation is entirely dynamic and run-time. It's how processes are separated on multi-tasking operating systems -- your address 0x4000 can point to something completely different than my 0x4000, and the privileged operating system structures are not found in either of our address spaces.
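
The lookup order described above, as a very simplified sketch. ToyMMU, its flat single-level page table and the frame numbers are invented for illustration; a real x86 page walk is multi-level and done in hardware.

```python
PAGE_SIZE = 4096  # 4 KiB pages, the smallest x86 granularity


class PageFault(Exception):
    """Raised when no mapping exists; a real CPU would trap into the OS."""


class ToyMMU:
    def __init__(self):
        self.page_table = {}  # virtual page number -> physical frame number
        self.tlb = {}         # cache of recently used translations

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.tlb:              # TLB miss: walk the page table
            if vpn not in self.page_table:
                raise PageFault(hex(vaddr))  # still not found: OS handles it
            self.tlb[vpn] = self.page_table[vpn]
        return self.tlb[vpn] * PAGE_SIZE + offset

    def migrate(self, vpn, new_frame):
        """OS moves the physical page; the virtual address is unchanged."""
        self.page_table[vpn] = new_frame
        self.tlb.pop(vpn, None)              # drop the stale TLB entry


mmu = ToyMMU()
mmu.page_table[4] = 100                # virtual page 4 lives in frame 100
before = mmu.translate(0x4000 + 0x10)
mmu.migrate(4, 7)                      # e.g. moved to a closer NUMA node
after = mmu.translate(0x4000 + 0x10)   # same virtual address, new location
```

The program keeps using 0x4010 the whole time; only the physical address behind it changes.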

You can actually do all kinds of neat things with page tables. For example, Azul Systems uses them for non-blocking GC. Basically, they give you a cheap(ish) hook you can invoke on any memory access to a given page (on x86, that's 4K/2M/1G granularity).


Quote:
That sounds extremely useful actually. Considering the number of people experiencing crashes in certain misbehaved software like Bethesda's Skyrim due to memory fragmentation
This wouldn't actually help. The things that get fragmented are the 32-bit virtual addresses -- the physical pieces of ram can be moved about at will, but the 32-bit addresses cannot change, simply because the OS would then have to fix up every address in the program, and it cannot know what is an address and what is an unfortunately chosen integer.

Quote:
it would have been nice if Windows had been able to do this as well. Any idea why Microsoft hasn't bothered to pursue this path? Perhaps they're too content with their current market dominance... *shrug*
They just can't keep up. In the internals, modern Linux is now about a decade ahead of Win8, and the difference is growing, not decreasing.
tunafish is offline   Reply With Quote
Old 12-Sep-2012, 16:31   #68
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,580
Default

Quote:
Originally Posted by 3dilettante View Post
Intel's improved fetch bandwidth and widened the back end to handle two branches per cycle.
It doesn't mention predicting two branches per cycle, though.
Unless they decoupled the branch predictor from the rest of the frontend like AMD did, I don't think they'd really even be able to predict multiple taken branches in one cycle. One block is loaded from fetch/uop cache and for that you can only make use of one BTB hit. No later instructions in the block would apply.

You could benefit from being able to predict multiple untaken branches in a block (up to the end or first taken branch). It may already do this. I know the BTB supports up to 4 branches per fetch block in SB; the prediction resolution before lookup may be capable of predicting all four in parallel.

Quote:
Originally Posted by mczak View Post
Hmm interesting that both port 0 and port 1 can do FMA and port 1 now can do fp mul too but port 0 can't do fp add. Any ideas why that would be?
My guess would be this: on SB/IB, FADD and FMUL latency is only 3 cycles but on Haswell FMA latency is 5 cycles which is substantially higher. David Kanter has remarked that Intel engineers found Bulldozer's 5-6 cycle FMA latency to be a weakness, so I don't think they'd be happy with 5 cycles for FADD and FMUL. So I'm guessing they did what they could to bypass the FMA unit to reduce latency for FADD and FMUL: you can get a multiply result early and start an add late. And for the early multiply result the rest of the FMA is a don't care, if it runs at all, but for the early add you have to feed it a 0 to start with. So it may be that a fast FADD is more complex to support than a fast FMUL and therefore they only have one.

Quote:
Originally Posted by sebbbi View Post
It would likely be at least slightly faster. If nothing else is improved, at least gather takes only one slot in (x86) L1 instruction cache (but several slots in uop cache), and they can choose the optimal uop sequence for the processor (x86 compilers are too general purpose for this task).
In SB/IB the uop cache doesn't store more than the first few uops from a microcode sequence. Were you thinking that Haswell would expand entire microcode routines into the uop cache? I'm not sure they'd do this because it'd complicate the mapping between uop cache and L1 instruction cache and it'd also open up the potential for uop cache thrashing with lots of microcode instructions which would all have to be inlined into the cache to get proper performance.

Without such a mechanism Haswell would need to have much faster microcode ROM throughput to maintain a fast microcoded gather. Historically it has only been one uop per cycle, and the decoders can't provide anything while it runs. This might be enough for the gather itself (depending on what microcode is available), but it stalls everything else. It's hard to imagine Intel investing in either the ability to dispatch from both the microcode and uop cache/decoders simultaneously or a wide microcode ROM that can feed several uops per cycle, but I really wouldn't know what they do and don't find practical here.

Barring that I'd expect the gather to be done by an independent hardware state machine, regardless of whether or not it can service multiple loads per cycle. Even if it's stuck at one load per cycle it'll still be a lot better than the current alternative.

Quote:
Originally Posted by Gubbi View Post
It adds a lot of complexity to support single instruction gather. A gather instruction could generate a multitude of addresses that all cause an MMU page walk, the accesses themselves would require a multi-ported cache to be efficient, and each access could potentially cause a full cache miss.

Worst case, a single gather instruction could take several thousand cycles to complete. So you either make it interruptible or suffer intolerable interrupt latency. The former means you need to save partial state of registers (*ugh*); the latter is just not acceptable.

Mind you, a fair fraction of accesses are likely to miss caches and with 4 to 8 cores on a chip it'll be relatively easy to saturate the main memory interface anyway.

So you end up spending a lot of complexity and power on something that might not add a whole lot of performance in the end.

Cheers
It's not pretty, but the gather could be replayed in its entirety upon any fault whatsoever, and if you so desire, after an interrupt. All that's required is that the cache and TLB have at least as many ways as there are elements in the vector, because otherwise you could get an infinite loop where the later fields keep evicting the former ones. This shouldn't be a problem for Haswell. In the normal case, the cost of a cache miss is big compared to the cost of redoing the earlier loads which are now in cache. An interrupt can evict the stuff that was gathered from the cache, but that's not that much worse than it evicting any of the rest of the program's working set (not to mention, having to save/restore registers). And interrupts aren't really frequent enough for this to be a concern.

This may be why AVX2 has no scatter instruction. Replaying a half-done scatter has more consequences.

No one would expect a gather instruction that's single cycle if everything's in cache. The reasonable highest end expectation is a gather that can load multiple elements if they're all in the same cache line, like Larrabee can. Tons of code would benefit from this. A lot of useful gathers can even have multiple fields going to the exact same address.
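
To put a number on the "same cache line" case, this sketch counts how many distinct lines a gather would touch for a given index vector. The 64-byte line size, the 4-byte element scale and the index vectors are assumptions picked for illustration, and lines_touched is a made-up helper rather than anything the hardware exposes.

```python
LINE_BYTES = 64  # assumed cache line size


def lines_touched(base, indices, scale=4):
    """Return the distinct cache lines a gather of `indices` would touch."""
    return sorted({(base + i * scale) // LINE_BYTES for i in indices})


# Eight 4-byte elements per gather; hypothetical index vectors.
same_line = lines_touched(0, [0, 3, 5, 7, 9, 12, 14, 15])
scattered = lines_touched(0, [0, 16, 32, 48, 64, 80, 96, 112])
broadcast = lines_touched(0, [5] * 8)
```

A gather that services one line per cycle would need one access for the first and last vectors and eight for the scattered one, so the win depends entirely on locality.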

Last edited by Exophase; 12-Sep-2012 at 16:44.
Exophase is offline   Reply With Quote
Old 12-Sep-2012, 16:34   #69
Grall
Invisible Member
 
Join Date: Apr 2002
Location: La-la land
Posts: 5,034
Default

Sebbbi, Tuna; thanks. Your posts are very educational, as always.
Grall is offline   Reply With Quote
Old 12-Sep-2012, 16:55   #70
mczak
Senior Member
 
Join Date: Oct 2002
Posts: 2,442
Default

Quote:
Originally Posted by Exophase View Post
My guess would be this: on SB/IB, FADD and FMUL latency is only 3 cycles but on Haswell FMA latency is 5 cycles which is substantially higher. David Kanter has remarked that Intel engineers found Bulldozer's 5-6 cycle FMA latency to be a weakness, so I don't think they'd be happy with 5 cycles for FADD and FMUL. So I'm guessing they did what they could to bypass the FMA unit to reduce latency for FADD and FMUL: you can get a multiply result early and start an add late. And for the early multiply result the rest of the FMA is a don't care, if it runs at all, but for the early add you have to feed it a 0 to start with. So it may be that a fast FADD is more complex to support than a fast FMUL and therefore they only have one.
Hmm that makes sense, though you're wrong about the latencies. Only fadd is 3 cycles on SNB/IVB/HSW; fmul is 5 cycles, same as fma. So maybe fadd indeed has some special path to get latency down to 3, whereas fmul can just use the ordinary fma path. This is indeed different from AMD, which has had the same latency for fmul and fadd (and now fma) for ages (K8/K10 had latency 4, BD latency 5-6).
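
Taking those latencies at face value, a quick back-of-the-envelope for dependent chains shows why a fast fadd path would be worth keeping. The cycle counts are just the ones quoted in this thread, not measurements, and chain_latency is a throwaway helper.

```python
LATENCY = {"fadd": 3, "fmul": 5, "fma": 5}  # cycles, as quoted in this thread


def chain_latency(ops):
    """Latency of a chain where each op depends on the previous result."""
    return sum(LATENCY[op] for op in ops)


separate = chain_latency(["fmul", "fadd"])  # a*b, then + c
fused = chain_latency(["fma"])              # a*b + c in one instruction

# Four dependent adds (e.g. a running sum) on the fast path...
sum_fast = chain_latency(["fadd"] * 4)
# ...versus the same adds if they had to go through the 5-cycle fma path.
sum_via_fma = chain_latency(["fma"] * 4)
```

So FMA saves latency on multiply-then-add, but pushing plain adds through the FMA pipe would make legacy add chains noticeably slower.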
mczak is offline   Reply With Quote
Old 12-Sep-2012, 17:04   #71
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,580
Default

You're right, my mistake. All the more reason why it only supports one FADD, though. It's possible that it's implemented with an entirely separate unit.

Having a big difference in latency between FADD and FMUL is actually kind of surprising. The significand multiplication itself must be eating a lot of that, because you'd expect the normalization to be more expensive with the add.
Exophase is offline   Reply With Quote
Old 12-Sep-2012, 17:53   #72
tunafish
Member
 
Join Date: Aug 2011
Posts: 371
Default

I'd give another vote to "FADD is probably its own dedicated unit". Both because FADD units are much cheaper than multiply ones, and because scheduling instructions gets a lot harder in cases where you can stuff things into the middle of a pipeline.

Also, the slides for ARCS004 and 005 are now up. TSX works on L1 (only), and gather zeroes out elements in the mask register when it successfully fetches them -- this way, after any fault the gather can just be restarted and it keeps all the work it has already done.
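
A small model of that restart behaviour, as I read the slides: each mask element is cleared the moment its load completes, so re-running the instruction after a fault only touches what is still missing. The dict standing in for memory, the Fault exception and the function signature are all invented for the sketch.

```python
class Fault(Exception):
    """Stands in for a page fault on one element of the gather."""


def gather(memory, base, indices, mask, dest):
    """Load dest[i] = memory[base + indices[i]] for every set mask element.

    A mask element is cleared as soon as its data arrives, so a restart
    after a fault skips the elements that were already loaded.
    Returns the number of loads performed on this attempt.
    """
    loads = 0
    for i, idx in enumerate(indices):
        if not mask[i]:
            continue
        addr = base + idx
        if addr not in memory:
            raise Fault(addr)   # dest and mask keep their partial state
        dest[i] = memory[addr]
        mask[i] = False
        loads += 1
    return loads


memory = {0: 10, 1: 11, 2: 12}  # address 3 is not mapped yet
indices = [0, 1, 3, 2]
mask = [True] * 4
dest = [None] * 4

try:
    gather(memory, 0, indices, mask, dest)
except Fault as fault:
    memory[fault.args[0]] = 13  # the "OS" maps the missing page in
redone = gather(memory, 0, indices, mask, dest)  # restart the instruction
```

The second attempt performs only the two loads that had not completed, which is what makes the instruction interruptible without saving any hidden partial state.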
tunafish is offline   Reply With Quote
Old 13-Sep-2012, 05:13   #73
liolio
French frog
 
Join Date: Jun 2005
Location: France
Posts: 4,172
Default

So sadly it's almost doubly confirmed that the next Core i3 4xxx will embarrass the upcoming quad-core (2-module) Steamrollers for all personal use.
AMD really sounds like it's in a tough spot... especially as Haswell might be released before Steamroller.

The scary part is that, whereas the Jaguar cores look nice, we have no release date and Intel has a counter... If Atom uses the same power-saving techniques as Haswell, and taking into account how long Intel must have been working on that one, I'm scared that the "really nice Jaguar core" may look like a toy next to Intel's offering.
liolio is offline   Reply With Quote
Old 13-Sep-2012, 05:34   #74
Alexko
Senior Member
 
Join Date: Aug 2009
Posts: 2,027
Default

Quote:
Originally Posted by liolio View Post
So sadly it's almost doubly confirmed that the next Core i3 4xxx will embarrass the upcoming quad-core (2-module) Steamrollers for all personal use.
AMD really sounds like it's in a tough spot... especially as Haswell might be released before Steamroller.
I think the aim of Bulldozer, and therefore Steamroller was never really to compete with Intel on a core-for-core basis, but to allow for more cores within the same transistor and power budget. This is precisely what AMD does with the FX lineup—well, technically they use more transistors and power, but still—and on the A lineup, that is for APUs, they choose to spend the extra transistors and power on the GPU.

Steamroller should continue that trend and, frankly, if Kaveri really does deliver a 30% performance improvement over Trinity, it should be more than enough for most people, so spending extra transistors and power on the GPU seems like the right thing to do.

Quote:
Originally Posted by liolio View Post
The scary part is that, whereas the Jaguar cores look nice, we have no release date and Intel has a counter... If Atom uses the same power-saving techniques as Haswell, and taking into account how long Intel must have been working on that one, I'm scared that the "really nice Jaguar core" may look like a toy next to Intel's offering.
We'll see. I think the new Atom is supposed to be OoO—which would make it an Atom only in name—but it's still targeted at phones while Jaguar isn't meant to go any lower than tablets. Different power targets usually mean different performance targets too. I wouldn't write AMD off just yet.
__________________
"Well, you mentioned Disneyland, I thought of this porn site, and then bam! A blue Hulk." —The Creature
My (currently dormant) blog: Teχlog
Alexko is online now   Reply With Quote
Old 13-Sep-2012, 06:03   #75
DavidC
Member
 
Join Date: Sep 2006
Posts: 273
Default

Quote:
Originally Posted by liolio View Post
now the gpu in compute operation has 0.5TB/s of bandwidth to the last level of cache; the thing could literally fly.
Unless it was specified, I think the cache means the GPU dedicated L3 cache, rather than LLC. That makes a big difference.
DavidC is offline   Reply With Quote
