#126 | |||
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
Besides that, Cortex-A9 implementations do tend to have relatively high L2 latency and relatively high main memory latency, so there's plenty of room for improvement; the former can actually be delivered by ARM since the L2 is tightly coupled with the CPUs again. What really confuses me is how you can make this statement while simultaneously saying A6's CPU is higher performing - does only it get to magically make latency go away? Quote:
Quote:
Everything you said about OoO applies to scalar VFP in Cortex-A9 vs Cortex-A15 just as much as it applies to NEON. The word isn't back yet, but it's also possible that there are two "real work" VFP pipes (i.e., 2x scalar FMADDs). |
|||
|
|
|
|
|
#127 | ||
|
Senior Member
Join Date: Feb 2002
Posts: 2,543
|
Quote:
The A15 has 3-wide decode and retirement and a 40+ entry reorder window. None of the material I've seen is more detailed than "40+" ROB entries, and none says how many dispatch ports or execution units it has. The A15 is designed for higher operating frequency. Higher operating frequency generally increases latency to the memory subsystem when measured in cycles. If ARM combats this with a new cache architecture, fine, but it still means that the number of ROB entries per unit of peak instruction throughput per cycle stays roughly the same, so don't expect more than a 50% IPC increase. Quote:
We don't know anything about the A6 other than that it has a kickass memory subsystem. Where does the performance come from? Is it 4-wide? Does it have a multi-ported D$? How big is the reorder buffer? Does it have memory disambiguation? Cheers
__________________
I'm pink, therefore I'm spam |
||
|
|
|
|
|
#128 | ||||
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
This is the best Cortex-A9 reference I've seen: http://www.docstoc.com/docs/73399229...roarchitecture When they say "3+1" dispatch, all the diagrams suggest that refers either to the third port being capable of going to LS vs NEON/VFP, or to the separate branch resolution. It's not a real quad dispatch either way. There's no official documentation on the issue queue, but the diagram draws 6 squares, so the best guess is that it's 6 wide. Everything else about it suggests a unified scheduler. Given that ARM themselves say that 8 scheduler slots were pushing the upper limit of feasibility in their design constraints for Cortex-A15, it'd be awfully strange if Cortex-A9 had 24, although I suppose it's possible given that they were designed by two totally different teams. Quote:
A15 has 8 issue queues (to each execution pipeline) in 5 clusters, each with 8 slots. That's 64 entries total. It can dispatch to each of the 8 pipelines each cycle. The pipelines are 2x simple ALU, 1x branch, 1x MUL, 1x load, 1x store, and 2x NEON/VFP. Note that the ALUs bring back parallel shift + op execution, which was moved to separate stages in A9. But there's way more to the comparison than just execution window, execution width, and latency to the memory subsystem. I don't think I really need to start listing things. Quote:
Quote:
Anyway, back to the original claim - regardless of what you think the maximum improvement Cortex-A15 can bring is, why would you think Dhrystone would be what's representative of the upper limit? Dhrystone is relatively static, predictable, small, and the test is designed so that you can spend a lot of the time in hand-tuned ASM. In other words, an easy problem. A lot of the hardware in Cortex-A15, quite possibly the majority of it, is designed for problems harder than Dhrystone. Last edited by Exophase; 10-Oct-2012 at 21:27. |
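For anyone who wants to check the issue-queue numbers in the post above, here is a minimal sketch of the arithmetic. It only uses the figures quoted there (8 pipelines, one 8-slot queue each); the dictionary keys and the helper name `queue_capacity` are made up for illustration and are not taken from any ARM document.

```python
# Figures as quoted in the post: one issue queue per execution pipeline,
# 8 slots per queue. The grouping below is by pipeline type, not by cluster.
SLOTS_PER_QUEUE = 8

PIPELINES = {
    "simple ALU": 2,
    "branch": 1,
    "MUL": 1,
    "load": 1,
    "store": 1,
    "NEON/VFP": 2,
}


def queue_capacity(pipelines, slots_per_queue):
    """Return the total issue-queue slots available to each pipeline type."""
    return {name: count * slots_per_queue for name, count in pipelines.items()}


capacity = queue_capacity(PIPELINES, SLOTS_PER_QUEUE)
total_pipelines = sum(PIPELINES.values())
total_slots = sum(capacity.values())

for name, slots in capacity.items():
    print(f"{name:>10}: {slots} slots")
print(f"pipelines: {total_pipelines}, total slots: {total_slots}")
```

Under those figures the two simple ALU queues hold 16 entries between them and the whole machine holds 64, which matches the total given in the post.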
||||
|
|
|
|
|
#129 | |||||
|
Senior Member
Join Date: Feb 2002
Posts: 2,543
|
Quote:
Dispatch: Page 6 here Although the diagram is confusing, it does say up to FOUR dispatches per cycle. Quote:
When an instruction is renamed, it is allocated an entry in the commit queue. The only time I've seen the size of the commit queue mentioned was in comp.arch on usenet two years ago, where the number 40 was mentioned. Quote:
Quote:
Quote:
The A15 can execute 50% more instructions per cycle. That also implies that the latency of a memory operation grows by 50% when measured in instructions, even if the number of cycles stays the same. In order to get a perfect 50% speedup you'd need to reduce main memory latency to 66% of what it is now. Can the A15 do that? Possibly; the tests I've seen of A9 show a 200ns main memory latency, so there is certainly room for improvement. Also, datapaths are twice as wide, so that'll buy you a lot on throughput workloads (FP and media). The extra bandwidth can also be used for more aggressive prefetch, where you effectively trade bandwidth for lower latency. Cheers
__________________
I'm pink, therefore I'm spam Last edited by Gubbi; 11-Oct-2012 at 10:55. |
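The latency arithmetic in the post above can be written out as a rough sketch. The 200ns figure and the 50% throughput increase come from the post; the 1.0 GHz clock and the baseline IPC of 1.0 are placeholder assumptions chosen only to make the numbers easy to read, and the function names are hypothetical. The model treats every miss as fully exposed, so it is the simple framing from the post, not a prediction of real performance.

```python
def latency_in_instruction_slots(latency_ns, clock_ghz, ipc):
    """Instruction slots that pass while one memory access is outstanding."""
    cycles = latency_ns * clock_ghz  # ns * (cycles per ns)
    return cycles * ipc


def latency_for_same_slot_cost(latency_ns, ipc_old, ipc_new):
    """Latency the faster core needs so a miss costs the same number of slots."""
    return latency_ns * ipc_old / ipc_new


A9_LATENCY_NS = 200.0  # main memory latency quoted in the post
CLOCK_GHZ = 1.0        # assumption: same clock for both cores
IPC_A9 = 1.0           # assumption: normalized baseline
IPC_A15 = 1.5          # "50% more instructions per cycle"

slots_a9 = latency_in_instruction_slots(A9_LATENCY_NS, CLOCK_GHZ, IPC_A9)
slots_a15 = latency_in_instruction_slots(A9_LATENCY_NS, CLOCK_GHZ, IPC_A15)
target_ns = latency_for_same_slot_cost(A9_LATENCY_NS, IPC_A9, IPC_A15)

print(f"slots lost per miss, baseline core: {slots_a9:.0f}")
print(f"slots lost per miss, 1.5x core:     {slots_a15:.0f}")
print(f"latency needed to break even:       {target_ns:.1f} ns "
      f"({target_ns / A9_LATENCY_NS:.1%} of the original)")
```

With these inputs the same 200ns miss costs 300 instruction slots instead of 200, and the break-even latency is about 133ns, i.e. roughly two thirds of the original, which is where the "66%" comes from.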
|||||
|
|
|
|
|
#130 |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,129
|
Dhrystone? Here's an extremely old benchmark that not only can be abused with compiler optimisation (thanks, Wikipedia) but will also typically fit entirely in L1. Nowadays mobile CPUs have become like the PCs of the past 15 years, with a hierarchy of L1, L2 and memory with a huge relative latency, so you're not testing real performance and don't even have an excuse for it.
|
|
|
|
|
|
#131 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,543
|
Quote:
Cheers
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
|
#132 | |||||||
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,553
|
Quote:
The document I linked is much more detailed than yours, and makes it pretty clear it doesn't have any true capability to dispatch four things in one cycle. The comment is probably counting folded branch resolution as dispatch, which is fair in the sense that it correlates to an instruction that was decoded and issued, but still not what most would consider true dispatch. But this is really nit-picking over details. Quote:
Of course, since they don't have a unified scheduler, you generally won't come that close to actually utilizing the full reordering capacity; in general it'll probably be < 40 instructions. Quote:
Quote:
Did you even read the document I linked? It's far more detailed than any Cortex-A9 document out there! It's also more detailed than most descriptions Intel or AMD has given for their CPUs. You can find some more information in the publicly visible TRM (like various buffer/cache sizes/associativities). Quote:
Quote:
Taking all that and putting it into a simplistic equation saying that it must need 66% lower MAIN memory latency to achieve 50% better perf/clock on average is a total farce. I don't know what you're doing here. You find out the performance by benchmarking it, but right now the best thing to go on is ARM's claim that it'll get 50% better performance. Quote:
You'll also find that despite some SoCs having main memory latencies over 50% better than others, they don't usually get a huge boost in performance. Cortex-A15 is less sensitive to main memory latency than Cortex-A9 (I'm not claiming how much, but it's definitely less). Do I have to explain why? I feel like you're not listening to me. ARM claimed 40% higher Dhrystone scores at the same MHz. They claimed 50% higher integer performance in general, again at the same MHz. The latter was not about Dhrystone. They haven't explained it further, but some other charts imply this number is from SPEC. |
|||||||
|
|
|
|
|
#133 |
|
PM
Join Date: Dec 2002
Posts: 1,370
|
Eh, actual A-15 hardware will be out soon enough... I am content to wait for real world results.
__________________
// |
|
|
|
|
|
#134 | |||||
|
Senior Member
Join Date: Feb 2002
Posts: 2,543
|
Quote:
Quote:
Quote:
Without a global scheduler the OOO capabilities are much more limited than an equivalent x86 implementation. A simple integer-rich workload with a few loads missing D$ sprinkled in could effectively limit the number of instructions in flight to the size of the int issue queues: 16 entries. AFAICT, if you're right, the only way to get anywhere near the maximum number of instructions in flight is FP/NEON code. There is always a surprising amount of integer chores in FP codes, and that way most of the issue queues could be filled (or at least see some action). Quote:
Quote:
The commit queue looks like a data-full ROB, but it claims to be a PRF OOO implementation. The OOO capabilities look to be ample, except that they are limited by the issue queue sizes. BTW, this is off topic for this thread; move it? Cheers
__________________
I'm pink, therefore I'm spam Last edited by Gubbi; 12-Oct-2012 at 13:23. |
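The point above about split issue queues capping the number of instructions in flight can be illustrated with a toy model. This is only a sketch under strong assumptions: dispatch is in order, nothing issues while the window fills (the worst case, e.g. everything waiting on a missed load), and dispatch stops as soon as the next instruction's queue is full. The per-type capacities reuse the 8-slots-per-pipeline figures quoted earlier in the thread; the function name and the instruction streams are invented, and the commit queue limit is ignored.

```python
from itertools import cycle, islice

# Slots per pipeline type, from the figures quoted earlier in the thread
# (8 slots per queue, two ALU queues, two NEON/VFP queues).
CAPACITY = {
    "simple ALU": 16,
    "branch": 8,
    "MUL": 8,
    "load": 8,
    "store": 8,
    "NEON/VFP": 16,
}


def max_in_flight(stream, capacity):
    """Count in-order dispatches until the next instruction's queue is full.

    Assumes no instruction leaves its queue while the window is filling.
    """
    occupancy = {name: 0 for name in capacity}
    dispatched = 0
    for kind in stream:
        if occupancy[kind] >= capacity[kind]:
            break  # in-order dispatch stalls here
        occupancy[kind] += 1
        dispatched += 1
    return dispatched


total_slots = sum(CAPACITY.values())

# Hypothetical streams: pure integer code vs. a mix that touches every queue.
integer_only = islice(cycle(["simple ALU"]), total_slots)
balanced_mix = islice(
    cycle(["simple ALU", "NEON/VFP", "load", "simple ALU",
           "NEON/VFP", "store", "branch", "MUL"]),
    total_slots,
)

int_window = max_in_flight(integer_only, CAPACITY)
mix_window = max_in_flight(balanced_mix, CAPACITY)
print(f"integer-only stream: {int_window} in flight")
print(f"balanced mix:        {mix_window} in flight")
```

In this model the integer-only stream stalls at 16 instructions, while the artificially balanced mix fills all 64 slots. Real code sits somewhere in between, and the commit queue would cap the window before 64 anyway.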
|||||
|
|
|
|
|
#135 |
|
Member
Join Date: Sep 2010
Posts: 996
|
Intel to Merge Xeon and Itanium in 2015-2017
Ivy Bridge (Core i3/i5/i7) debuted in 2012
Haswell (Core i3/i5/i7) will debut in early 2013
Ivy Bridge-EP (Xeon E3/E5) should arrive in mid-2013
Ivy Bridge-E (Core i7) debuts in late 2013
Ivy Bridge-EX for critical servers (Xeon E7) debuts in late 2013
Broadwell (Core i3/i5/i7) should ship in early 2014
Haswell-EP (Xeon E3, E5) should ship by mid 2014
Haswell-E (Core i7) debuts in late 2014
Haswell-EX (Xeon E7) is planned for late 2014
Broadwell-EP (Xeon E3 / E5) is planned for mid 2015
Broadwell-E (Core i7) arrives in late 2015
Broadwell-EX (Xeon E7) is planned for late 2016
The new socket could be the one you already know - according to some sources, Intel plans to re-wire the LGA-2011 for Haswell/Broadwell, making it incompatible with Sandy Bridge/Ivy Bridge-based products. The rewiring isn't being done to support new architectures, but rather to provide more power - according to documents we saw, Intel plans to introduce 150W and up to 180W parts when the Haswell and Broadwell architectures enter the cut-throat server business.
Hmm, sounds very nice. |
|
|
|
|
|
#136 |
|
Senior Member
Join Date: Feb 2004
Posts: 2,439
|
"Merging them" is highly inaccurate. Merging the support system (socket, perhaps chipset, etc.) is accurate. We won't be seeing Itaniums in our PCs, and for good reason.
|
|
|
|
|
|
#137 |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,129
|
It could allow an x86 in one socket and an Itanium in another, assuming you would want to do that.
|
|
|
|
|
|
#138 | |
|
Invisible Member
Join Date: Apr 2002
Location: La-la land
Posts: 4,982
|
Quote:
So, the day Intel finally pulls the plug on Itanium, customers could drop in x86 chips there instead.
__________________
"If I were a science teacher and a student said the Universe is 6000 years old, I would mark that answer as wrong (why? Because it is)." -Phil Plait |
|
|
|
|
|
|
#139 |
|
Senior Member
Join Date: Oct 2002
Posts: 2,433
|
Frankly, I don't know why it took Intel so long. Back in 2006, roadmaps suggested that Xeons and Itanics would use the same chipsets in the future and that ultimately boards could support both chips (I dunno what happened with the "same chipsets", but up to now at least the sockets obviously ended up different). Remember, QuickPath was initially known as CSI ("Common System Interface").
|
|
|
|
|
|
#140 |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,129
|
Intel has always done the bare minimum regarding socket compatibility. They had three generations of Socket 370 and four of Socket 775; each time the motherboards were backwards compatible but never forward compatible (millions of computers are stuck with a Pentium 4 and can't take a Core 2 Celeron).
Or there's Socket 1156 and 1155, where everyone has already forgotten what the new socket brought to the table. Intel is opportunistic: they won't care about breaking compatibility if that means the CPU will use 1% less power or something. They are also good at pushing a new platform in the distribution channels. They care more about deadlines and such. |
|
|
|
|
|
#141 | |
|
Invisible Member
Join Date: Apr 2002
Location: La-la land
Posts: 4,982
|
Quote:
This may change in the future as stationary computers are being increasingly encroached upon by mobile platforms. CPU sockets may in fact not even survive the end of this decade.
__________________
"If I were a science teacher and a student said the Universe is 6000 years old, I would mark that answer as wrong (why? Because it is)." -Phil Plait |
|
|
|
|
|
|
#142 |
|
Senior Member
Join Date: Feb 2004
Posts: 2,439
|
|
|
|
|
|
|
#143 | |
|
Member
Join Date: Mar 2009
Posts: 160
|
Quote:
but, correct me if I'm wrong, they are saying that what is today the "PCH" is going to be on the same package as the CPU? |
|
|
|
|
|
|
#144 |
|
Member
Join Date: Aug 2011
Posts: 366
|
|
|
|
|
|
|
#145 |
|
Member
Join Date: Sep 2005
Posts: 206
|
"Intel’s Haswell CPU Microarchitecture" by David Kanter
Intel has indeed pushed the "mass-market state of the art" forward on many fronts at once. It would be truly sad if it turns out that mass-market needs peak with dual-core consumption devices, which would rule out future Haswell-like big jumps. |
|
|
|
|
|
#146 |
|
Invisible Member
Join Date: Apr 2002
Location: La-la land
Posts: 4,982
|
Let's be realistic - heavy computing capability in a CPU is only necessary for those who actually do heavy computing. It's like expecting everyone to buy cars that can pull off competitive times at a drag racing strip - unrealistic! Not all that sad, really. It's simply reality.
__________________
"If I were a science teacher and a student said the Universe is 6000 years old, I would mark that answer as wrong (why? Because it is)." -Phil Plait |
|
|
|
|
|
#147 | |
|
Member
Join Date: Sep 2005
Posts: 206
|
Quote:
|
|
|
|
|
|
|
#148 |
|
Invisible Member
Join Date: Apr 2002
Location: La-la land
Posts: 4,982
|
Yeah, because it made sense from many perspectives to have it work this way, but with Moore's law finally starting to hit the ceiling things are changing - and x86 CPUs are so much more powerful than what the average guy needs anyway it's silly.
When shrinking nodes don't bring any appreciable savings in cost per transistor anymore there's little room to improve performance anyway.
__________________
"If I were a science teacher and a student said the Universe is 6000 years old, I would mark that answer as wrong (why? Because it is)." -Phil Plait |
|
|
|
|
|
#149 | |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,129
|
Quote:
Nowadays there are only SPARC supercomputers, POWER7 minicomputers and Z mainframes competing with the desktop PC |
|
|
|
|
|
|
#150 | |
|
Member
Join Date: Jul 2003
Posts: 323
|
http://www.xbitlabs.com/news/cpu/dis...Regulator.html
Quote:
|
|
|
|
|
|