#51 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,442
|
Quote:
But OMG this thing is a beast. AMD thought there wasn't much point in having a 3rd INT ALU, and Intel now has 4... I wonder about the front end though; there's no mention of any improvements there. Is it really good enough to feed that monster back end? |
|
|
|
|
|
|
#52 |
|
Member
Join Date: Oct 2003
Posts: 320
|
PDFs for presentations today:
https://intel.activeevents.com/sf12/...ler/catalog.do (click the top link if you're like me and not registered). Do a search for "Haswell." The presentations ARCS001 and SPCS001 are worth a look. |
|
|
|
|
|
#53 |
|
Member
Join Date: Oct 2003
Posts: 320
|
ARCS001 on page 12 states that the extra AGU for store alleviates pressure on ports 2 & 3 for loads. I guess that's something they identified from their simulations as a bottleneck, maybe for hyperthreading? They've also directly addressed many bottlenecks that Agner Fog identified on page 174 of his guide.
|
|
|
|
|
|
#54 |
|
Senior Member
Join Date: Oct 2002
Posts: 2,442
|
Hmm, interesting that both port 0 and port 1 can do FMA, and port 1 can now do FP mul too, but port 0 can't do FP add. Any ideas why that would be?
|
|
|
|
|
|
#55 |
|
Member
Join Date: Oct 2003
Posts: 320
|
Maybe in common legacy workloads, FP Adds coincide w/ branches, shifts, and divides.
Edit: I guess more importantly FP Adds coincide w/ FP Mul. The whole point of FMAC is to increase efficiency on that particular combination of instructions, and you still need to take into account legacy code.
Last edited by Raqia; 12-Sep-2012 at 01:11. |
|
|
|
|
|
#56 |
|
Invisible Member
Join Date: Apr 2002
Location: La-la land
Posts: 5,034
|
No word on the rumored on-package graphics S/DRAM module?
__________________
"If I were a science teacher and a student said the Universe is 6000 years old, I would mark that answer as wrong (why? Because it is)." -Phil Plait |
|
|
|
|
|
#57 | |
|
Member
Join Date: Aug 2011
Posts: 371
|
Presumably, the BIOS would give hints that good operating systems would record and make use of.
Quote:
|
|
|
|
|
|
|
#58 |
|
Senior Member
|
Just the way a driver knows GPU memory is closer to the GPU. The driver will allocate render targets in that memory and, when it's full, kick them back to CPU RAM.
|
|
|
|
|
|
#59 | |
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,141
|
Quote:
An internal gather loop could utilize the extra integer operand access and branch capability of the extra ports without simultaneously blocking the vector pipes that would make use of a gather instruction. The store AGU all by itself seems unbalanced, unless it's sharing that port with something they've chosen not to discuss yet, maybe the specialized hardware that would scan a gather index register and detect how many indices belong to the same cache line. Port 5 has vector shuffles, which might include that permute unit that both gather and vector work would like. It's not zero-sum because a gather would provide data from memory in the desired arrangement.
The data given so far makes Haswell sound much more interesting than Steamroller, although there's always the chance that more is to come on the latter's account. The breadth of the engineering effort for this architecture visually dwarfs the competition. The promotion of integer vector instructions to 256-bit is going to put some serious hurt on one of the few areas BD was not outclassed in.
__________________
Dreaming of a .065 micron etch-a-sketch. |
|
|
|
|
|
|
#60 | |
|
Senior Member
|
Quote:
|
|
|
|
|
|
|
#61 | ||
|
Invisible Member
Join Date: Apr 2002
Location: La-la land
Posts: 5,034
|
Quote:
Quote:
||
|
|
|
|
|
#62 | ||
|
Member
Join Date: Nov 2007
Posts: 947
|
Quote:
It would likely be at least slightly faster. If nothing else is improved, at least gather takes only one slot in the (x86) L1 instruction cache (but several slots in the uop cache), and they can choose the optimal uop sequence for the processor (x86 compilers are too general purpose for this task). But that's a bit of a pessimistic view, I must admit. Maybe I have spent too much time evading stuff like microcoded imul and sraw (variable shifts) in console programming.
Quote:
The fourth ALU should improve integer performance as well (especially with hyperthreading). It seems that they have made some very good architectural choices that fit together very well. Previously (a year ago) I thought that gather would be one of the key new features of this architecture, but Haswell has so much more than that to offer.
I can't wait to do some performance tests with transactional memory. Assuming it's fully L1 based, a transaction cannot access more than 32KB of memory (minus hyperthreading, minus cache aliasing = around 10KB to be sure). But that's more than enough for games, as game access patterns are usually cache line optimized and limited in scope. Enterprise software, however, might need more than Haswell's L1 has to offer for its transactions. |
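A rough way to see why the usable transaction footprint ends up well below 32KB is to count how many cache lines fit before one set runs out of ways. This is only a sketch: it assumes a 32KB, 8-way L1D with 64-byte lines and that a transaction aborts as soon as any set overflows. The function name and the halving of ways under hyperthreading are illustrative assumptions, not documented behaviour.

```python
LINE_BYTES = 64
WAYS = 8
L1_BYTES = 32 * 1024
NUM_SETS = L1_BYTES // (LINE_BYTES * WAYS)  # 64 sets under the assumptions above


def lines_before_overflow(addresses, ways=WAYS):
    """Count distinct cache lines accepted before some set would need more than `ways` lines."""
    per_set = {}   # set index -> number of lines currently held
    seen = set()   # cache lines already tracked
    for addr in addresses:
        line = addr // LINE_BYTES
        if line in seen:
            continue
        set_index = line % NUM_SETS
        if per_set.get(set_index, 0) == ways:
            break  # this set is full: the hypothetical transaction would abort here
        per_set[set_index] = per_set.get(set_index, 0) + 1
        seen.add(line)
    return len(seen)


sequential = range(0, 64 * 1024, LINE_BYTES)   # perfectly packed accesses
page_stride = range(0, 64 * 4096, 4096)        # every access aliases to set 0

print(lines_before_overflow(sequential))                  # 512 lines = the full 32KB
print(lines_before_overflow(page_stride))                 # 8 lines = 512 bytes
print(lines_before_overflow(sequential, ways=WAYS // 2))  # 256 lines if ways are split in half
```

Real access patterns fall somewhere between the packed and the aliased case, which is why a conservative budget far below 32KB is a sensible rule of thumb.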
||
|
|
|
|
|
#63 | |
|
Senior Member
Join Date: Feb 2002
Posts: 2,576
|
Quote:
Worst case, a single gather instruction could take several thousand cycles to complete. So you either make it interruptible or suffer intolerable interrupt latency. The former means you need to save partial register state (*ugh*); the latter is just not acceptable. Mind you, a fair fraction of accesses are likely to miss the caches, and with 4 to 8 cores on a chip it'll be relatively easy to saturate the main memory interface anyway. So you end up spending a lot of complexity and power on something that might not add a whole lot of performance in the end.
Cheers
__________________
I'm pink, therefore I'm spam |
|
|
|
|
|
|
#64 | |
|
Member
Join Date: Nov 2007
Posts: 947
|
Quote:
|
|
|
|
|
|
|
#65 |
|
Invisible Member
Join Date: Apr 2002
Location: La-la land
Posts: 5,034
|
Yeah, I know it's 32-bit, but if Windows supported reordering, maybe a large enough chunk of contiguous memory could be presented to the game.
|
|
|
|
|
#66 |
|
Member
Join Date: Nov 2007
Posts: 947
|
No operating system can reorder your software's own virtual address space. It can only reorder the physical data in memory and update the virtual address tables accordingly. If you have only 32-bit pointers in your game and you do a lot of dynamic memory allocation, you will eventually run out of contiguous memory blocks (in the 32-bit virtual address space), and there's nothing an OS can do to help you.
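A toy model makes the fragmentation point concrete: plenty of address space can be free in total while no single free run is large enough for a big allocation. This is a minimal sketch; the 1024-unit address space and the helper name `free_runs` are made up for illustration.

```python
def free_runs(occupied):
    """Return the lengths of contiguous free runs in a list of booleans (True = in use)."""
    runs, current = [], 0
    for used in occupied:
        if used:
            if current:
                runs.append(current)
            current = 0
        else:
            current += 1
    if current:
        runs.append(current)
    return runs


SPACE = 1024                   # toy virtual address space, in allocation units
occupied = [True] * SPACE      # fill the space with 1-unit allocations...
for i in range(0, SPACE, 2):   # ...then free every other one
    occupied[i] = False

runs = free_runs(occupied)
print(sum(runs))   # 512 units free in total
print(max(runs))   # but the largest contiguous hole is 1 unit
```

Because the program holds raw pointers to the surviving allocations, nothing outside the program can slide them together to merge the holes.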
|
|
|
|
|
|
#67 | |||
|
Member
Join Date: Aug 2011
Posts: 371
|
Quote:
So the address translation is entirely dynamic and done at run time. It's how processes are separated on multi-tasking operating systems -- your address 0x4000 can point to something completely different than my 0x4000, and the privileged operating system structures are not found in either of our address spaces. You can actually do all kinds of neat things with page tables. For example, Azul Systems uses them for non-blocking (pauseless) GC. Basically, they give you a cheap(ish) hook you can invoke on any memory access to a given page (on x86, that's 4K/2M/1G granularity).
Quote:
Quote:
|
|||
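The per-process translation described in the post above can be sketched with a dictionary standing in for a page table. This is a simplified model assuming 4K pages and a single-level table; the class and method names are hypothetical, and real x86 page tables are multi-level structures walked by hardware.

```python
PAGE_SIZE = 4096


class AddressSpace:
    """Toy per-process page table: virtual page number -> physical frame number."""

    def __init__(self):
        self.page_table = {}

    def map(self, vaddr, frame):
        self.page_table[vaddr // PAGE_SIZE] = frame

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.page_table:
            # Unmapped page: this is the hook where an OS (or a GC) gets control.
            raise LookupError(f"page fault at {vaddr:#x}")
        return self.page_table[vpn] * PAGE_SIZE + offset


mine, yours = AddressSpace(), AddressSpace()
mine.map(0x4000, frame=7)
yours.map(0x4000, frame=42)

print(hex(mine.translate(0x4010)))   # 0x7010
print(hex(yours.translate(0x4010)))  # 0x2a010

# The OS can move the physical data and update the table;
# the virtual address the program sees does not change.
mine.map(0x4000, frame=99)
print(hex(mine.translate(0x4010)))   # 0x63010
```

Note what the model cannot do: it can change which frame backs virtual page 4, but it cannot move the program's data to a different virtual address without the program's pointers going stale.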
|
|
|
|
|
#68 | ||||
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,580
|
Quote:
You could benefit from being able to predict multiple untaken branches in a block (up to the end or first taken branch). It may already do this. I know the BTB supports up to 4 branches per fetch block in SB; the prediction resolution before lookup may be capable of predicting all four in parallel.
Quote:
Quote:
Without such a mechanism Haswell would need to have much faster microcode ROM throughput to maintain a fast microcoded gather. Historically it has only been one uop per cycle, and the decoders can't provide anything while it runs. This might be enough for the gather itself (depending on what microcode is available), but it stalls everything else. It's hard to imagine Intel investing in either the ability to dispatch from both the microcode and uop cache/decoders simultaneously or a wide microcode ROM that can feed several uops per cycle, but I really wouldn't know what they do and don't find practical here. Barring that, I'd expect the gather to be done by an independent hardware state machine, regardless of whether or not it can service multiple loads per cycle. Even if it's stuck at one load per cycle it'll still be a lot better than the current alternative.
Quote:
This may be why AVX2 has no scatter instruction. Replaying a half-done scatter has more consequences.
No one would expect a gather instruction that's single cycle if everything's in cache. The reasonable highest end expectation is a gather that can load multiple elements if they're all in the same cache line, like Larrabee can. Tons of code would benefit from this. A lot of useful gathers can even have multiple fields going to the exact same address.
Last edited by Exophase; 12-Sep-2012 at 16:44. |
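To illustrate the "multiple elements per cache line" idea, here is a small sketch that groups the lanes of a gather by the cache line each element lands in. It assumes 64-byte lines and 4-byte elements; the function name is hypothetical and this says nothing about how Haswell or Larrabee actually implement it.

```python
LINE_BYTES = 64


def cache_line_groups(base, indices, scale=4):
    """Group gather lanes by the cache line their element address falls in."""
    groups = {}
    for lane, idx in enumerate(indices):
        line = (base + idx * scale) // LINE_BYTES
        groups.setdefault(line, []).append(lane)
    return groups


indices = [0, 1, 2, 3, 16, 17, 100, 100]   # last two lanes hit the same address
groups = cache_line_groups(base=0, indices=indices)
print(groups)        # {0: [0, 1, 2, 3], 1: [4, 5], 6: [6, 7]}
print(len(groups))   # 3 cache line accesses instead of 8 separate loads
```

A gather unit that can serve every lane in a group from one line access would need three accesses for this index vector, where a one-load-per-element implementation needs eight.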
||||
|
|
|
|
|
#69 |
|
Invisible Member
Join Date: Apr 2002
Location: La-la land
Posts: 5,034
|
Sebbbi, Tuna; thanks. Your posts are very educational, as always.
|
|
|
|
|
#70 | |
|
Senior Member
Join Date: Oct 2002
Posts: 2,442
|
Quote:
|
|
|
|
|
|
|
#71 |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,580
|
You're right, my mistake. All the more reason why it only supports one FADD, though. It's possible that it's implemented with an entirely separate unit.
Having a big difference in latency between FADD and FMUL is actually kind of surprising; the significand multiplication itself must be eating a lot of that, because you'd expect the normalization to be more expensive with the add. |
|
|
|
|
|
#72 |
|
Member
Join Date: Aug 2011
Posts: 371
|
I'd give another vote to "FADD is probably its own dedicated unit". Both because FADD units are much cheaper than multiply ones, and because scheduling instructions gets a lot harder in cases where you can stuff things into the middle of a pipeline.
Also, the slides for ARCS004 and 005 are now up. TSX works on L1 (only), and gather zeroes out elements in the mask register when it successfully fetches them -- this way, after any fault the gather can just be restarted and it keeps all the work it has already done. |
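The restartable behaviour described above can be modelled in a few lines: each lane clears its mask bit once its element arrives, so re-running the same instruction after a fault only touches the lanes that are still pending. This is a minimal sketch with a dictionary standing in for memory; `PageFault` and `gather` are hypothetical names, and the real instruction keeps its mask in a vector register rather than a Python list.

```python
class PageFault(Exception):
    """Stands in for a fault on an element that is not currently mapped."""


def gather(memory, indices, mask, dest):
    """Load memory[indices[i]] into dest[i] for each set mask bit, clearing the bit on success."""
    for lane, idx in enumerate(indices):
        if not mask[lane]:
            continue                 # lane already completed on an earlier attempt
        if idx not in memory:
            raise PageFault(idx)     # partial progress stays in dest and mask
        dest[lane] = memory[idx]
        mask[lane] = False


memory = {0: 10, 1: 11, 3: 13}       # index 2 is "paged out"
indices = [0, 1, 2, 3]
mask = [True] * 4
dest = [0] * 4

try:
    gather(memory, indices, mask, dest)
except PageFault:
    print(mask)                      # [False, False, True, True]
    memory[2] = 12                   # the "OS" services the fault
    gather(memory, indices, mask, dest)   # restart: only lanes 2 and 3 are loaded

print(dest)                          # [10, 11, 12, 13]
print(mask)                          # [False, False, False, False]
```

No extra architectural state is needed to resume, since the destination and mask registers already record how far the gather got.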
|
|
|
|
|
#73 |
|
French frog
Join Date: Jun 2005
Location: France
Posts: 4,172
|
So sadly it's all but confirmed twice over that the next Core i3 4xxx will embarrass the upcoming quad-core (2-module) Steamrollers for all personal use.
AMD sounds like it's really in a tough spot... especially as Haswell might be released before Steamroller. The scary part is that, while the Jaguar cores look nice, we have no release date and Intel has a counter... If Atom uses the same power-saving techniques as Haswell, and taking into account how long Intel must have been working on that one, I'm scared that the "really nice Jaguar core" may look like a toy next to Intel's offering.
__________________
What's trying to be a bunch of presentations PS360 youtube channel Sebbbi about virtual texturing Tuned EADGCF and liking it :) |
|
|
|
|
|
#74 | ||
|
Senior Member
|
Quote:
Steamroller should continue that trend and, frankly, if Kaveri really does deliver a 30% performance improvement over Trinity, it should be more than enough for most people, so spending extra transistors and power on the GPU seems like the right thing to do.
Quote:
__________________
"Well, you mentioned Disneyland, I thought of this porn site, and then bam! A blue Hulk." —The Creature My (currently dormant) blog: Teχlog |
||
|
|
|
|
|
#75 |
|
Member
Join Date: Sep 2006
Posts: 273
|
|
|
|
|