Old 31-Oct-2012, 18:54   #51
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,495

Quote:
Originally Posted by Osamar View Post
Sorry for the semi-off topic.

Could it be possible to "patch" Outcast with an improved voxel renderer for higher resolution on a modern CPU, or even with a GPU voxel renderer?

Just, mind flying.
Last time I ran Outcast there was a 1024x768 patch.
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™
Old 01-Nov-2012, 13:46   #52
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by 3dilettante View Post
Whatever future date this is, trends within the foreseeable future do not point to this being feasible for consumer-level real-time graphics.
Some trends do, some trends don't. Let's try to avoid cherry picking...
Quote:
If Intel's near-threshold computing works out, it pushes for a stronger bifurcation between the throughput-oriented logic and the high-speed logic.
That would mean GeForces and Radeons will use such a design in the foreseeable future? I don't see that "trend" happening. In fact current GPUs have similar or higher voltages than CPUs.

It's a very interesting technology for lowering the power consumption during standby operation of mobile devices, but as soon as you want some real work to be done the voltage and frequency have to swing up to deliver cost-effective performance. Note that Intel already adopted an 8T SRAM design, precisely to allow lowering the voltage when parked. This kind of technology doesn't apply any more or any less to the GPU or CPU.
Quote:
As aggravating as the GPU on chips like Sandy Bridge and the like may be to some, it simply provides more utility for the user than a few more high-speed cores they won't use, and it does so within a much more confined power and die budget.
Multi-core adoption has been slow due to the programming challenges involved and due to the chicken-or-egg issue between developers and consumers. These issues are slowly but surely getting resolved. Intel's TSX technology greatly simplifies multi-threaded development and lowers the synchronization overhead. Meanwhile quad-core is becoming mainstream, so it becomes interesting for developers to invest in it, and the resulting increase in multi-threaded software will be an incentive for consumers to buy more cores. It's still early days though. We haven't witnessed any paradigm shift in software development, where multi-core design becomes as natural as, say, object-oriented design. But it's bound to come. There's still plenty of untapped task parallelism in generic software. And even graphics engines are becoming multi-threaded.

Also note that clock frequency and voltage can be regulated on a core by core basis. When the workload is SIMD-intensive, the operating parameters can be adjusted. And again, executing very wide vector operations on less wide execution units reduces the switching activity elsewhere. So there's no reason why such a CPU wouldn't be able to be as power efficient as a GPU (and I do mean a contemporary GPU, the convergence is still happening from both ends). Die budget isn't an issue either. Between Yonah and Haswell there will be an eightfold increase in floating-point throughput per core. That's not costing an eightfold increase in transistor budget. Extending the SIMD units further will also be relatively cheap.
Quote:
There is some possibility for a Larrabee-type solution of small throughput cores being put on the same die as powerful OoO heavies, which has been on Intel's to-do list for probably going on a decade now. It seems like it almost made it at some point before delays or the competitiveness of the IGP won out. Even with this, the likely demands on the silicon and how Intel has accepted the use of specialized logic when it is needed point to the idea of a dozen or so OoO core monster chip as the future of rendering not being likely for the next 2-3 silicon nodes. We are running out of those, by the way.
The cores have to be homogeneous: a set of scalar integer units and a set of wide vector units, fed off the same instruction stream. It may pose some hardware challenges, but it's the only thing software developers will adopt. HSA doesn't stand a chance against the programmer-friendliness of AVX2 and its successors.

That said, looking at the very rapid evolution from fixed-function to programmable to unified that happened to the GPU, I don't think we'll run out of silicon nodes before the CPU and GPU will unify. Note that if the iGPU was ditched and replaced with CPU cores, a mainstream Haswell chip could already achieve close to 1 TFLOP. It won't happen in the next few years, but we're certainly closer to unified graphics than you think.
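The throughput figures above can be checked with back-of-the-envelope arithmetic. The sketch below relies on per-cycle single-precision numbers that are not stated in the post and are assumptions here: 4 flops/cycle per core for Yonah and 32 flops/cycle per core for Haswell (two 8-lane FMA ports). The 8-core, 3.5 GHz chip is purely hypothetical.

```python
# Back-of-the-envelope peak single-precision throughput.
# All per-core figures below are assumptions, not taken from the post.

def peak_gflops(cores: int, flops_per_cycle: int, ghz: float) -> float:
    """Peak GFLOPS = cores * flops per cycle per core * clock in GHz."""
    return cores * flops_per_cycle * ghz

YONAH_FLOPS_PER_CYCLE = 4      # assumed: 2-lane add + 2-lane mul per cycle
HASWELL_FLOPS_PER_CYCLE = 32   # assumed: 2 FMA ports * 8 lanes * 2 flops

per_core_ratio = HASWELL_FLOPS_PER_CYCLE // YONAH_FLOPS_PER_CYCLE

# Hypothetical chip: the iGPU area is spent on extra cores instead.
hypothetical = peak_gflops(cores=8, flops_per_cycle=HASWELL_FLOPS_PER_CYCLE, ghz=3.5)

print(per_core_ratio)   # 8
print(hypothetical)     # 896.0
```

Under these assumptions the per-core ratio comes out at eight, and the hypothetical part lands a little under 0.9 TFLOPS of peak throughput, which is a theoretical ceiling rather than a sustained rate.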
Old 01-Nov-2012, 15:33   #53
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,495

Quote:
Originally Posted by Nick View Post
It won't happen in the next few years,
Hang on, you said earlier AVX2 will kill discrete GPUs
and AVX2 is coming next year
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™
Old 01-Nov-2012, 15:48   #54
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,074

Quote:
Originally Posted by Nick View Post
Some trends do, some trends don't. Let's try to avoid cherry picking...
A large number of trends are pointing against consumer-level 24-core OoO chips.
The ROI for such chips for the consumer market is uncompelling; aside from niche cases, the overlap between the applications users have and those that need that many cores is software that is often a solution searching for a problem; those chips don't work for more power-constrained environments in a market where a big chunk of the new revenue is power-constrained; and an extra generic core on top of many provides no significant utility, "blue crystals", or strong enough means to encourage upgrades.

Quote:
That would mean GeForces and Radeons will use such a design in the foreseeable future? I don't see that "trend" happening. In fact current GPUs have similar or higher voltages than CPUs.
Their operating parameters are far closer to what is necessary for near-threshold operation than a CPU clocked at 4 GHz, particularly mobile GPUs.
One of the demonstration circuits Intel had for the technology besides a 10-100 MHz Pentium was a lighting accelerator for a low-voltage graphics solution.

Quote:
It's a very interesting technology for lowering the power consumption during standby operation of mobile devices, but as soon as you want some real work to be done the voltage and frequency have to swing up to deliver cost-effective performance.
It's not just for standby operation. It is meant to provide performance per watt that is potentially, but only theoretically at this point, an order of magnitude better than what is possible with silicon that cannot sustain regular operation much below 1V. If applied to specialized or fixed-function hardware, it would be hardware that would be lost in the noise of a single active CPU core's ramping up and down.
For serial performance, it appears to be a non-starter because ticking along anywhere near desktop CPU speeds is physically penalized or impossible because of the process and design features of the tech. It's much more interesting for specialized or throughput-oriented uses, hence the emphasis on mobile graphics and HPC.
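A rough way to see why lower voltage favors throughput hardware is the textbook dynamic-power approximation P ≈ C·V²·f, under which switching energy per operation scales with V². The sketch below assumes that approximation, ignores leakage and the frequency lost near threshold, and uses illustrative voltages that are not Intel figures.

```python
# Textbook CMOS dynamic power approximation: P ~ C * V^2 * f.
# Leakage and near-threshold frequency loss are ignored (assumption).

def relative_energy_per_op(v: float, v_ref: float = 1.0) -> float:
    """Switching energy per operation relative to running at v_ref volts."""
    return (v / v_ref) ** 2

for volts in (1.0, 0.7, 0.5, 0.35):   # illustrative voltages only
    ratio = relative_energy_per_op(volts)
    print(f"{volts:.2f} V -> {ratio:.2f}x energy per operation")
```

With these assumptions, dropping from 1 V to roughly 0.32 V gives about a tenfold reduction in switching energy, which is the scale of gain under discussion. Real designs give some of that back to leakage and to the much lower clock.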

Quote:
Multi-core adoption has been slow due to the programming challenges involved and due to the chicken-or-egg issue between developers and consumers.
And a lack of compelling need.
If you need the mass throughput of a multicore, Intel wants you to get a SB-E or somesuch. Failing that, there are cloud servers using Xeons with many cores.
Consumers on average have dropped back to the dual to quad core transition with the proliferation of mobile media consumption devices with rapidly advancing GPUs.
We can do core upon core upon core, with the biggest challenge being that, beyond the low single digits, core count is a Do Not Care.


Quote:
Also note that clock frequency and voltage can be regulated on a core by core basis. When the workload is SIMD-intensive, the operating parameters can be adjusted. And again, executing very wide vector operations on less wide execution units reduces the switching activity elsewhere. So there's no reason why such a CPU wouldn't be able to be as power efficient as a GPU (and I do mean a contemporary GPU, the convergence is still happening from both ends).
Intel's mobile graphics solutions, whether IGP or Larrabee-ish cores, will, if its tech works, never clock anywhere near as high as minimally acceptable for the primary CPU. Their active power while working--not idle--could be 10-100x lower.
This is very interesting for actual workloads people care about.
__________________
Dreaming of a .065 micron etch-a-sketch.

Last edited by 3dilettante; 01-Nov-2012 at 16:03. Reason: punctuation
Old 01-Nov-2012, 18:59   #55
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by 3dilettante View Post
A large number of trends are pointing against consumer-level 24-core OoO chips.
The ROI for such chips for the consumer market is uncompelling; aside from niche cases, the overlap between the applications users have and those that need that many cores is software that is a solution searching for a problem; those chips don't work for more power-constrained environments in a market where a big chunk of the new revenue is power-constrained; and an extra generic core provides no "blue crystals" or strong enough means to encourage upgrades.
I never said 24-core. Also note that discrete graphics cards are slowly becoming a niche market. People who do require such performance level are also likely in the same market for a 24-core CPU several years from now. So let's be very clear about what segment and time frame we're talking about. It's obvious that the CPU will unify with the iGPU first, before core counts go up. Something like an 8-core successor to Haswell with no iGPU can have plenty of generic computing power for mainstream graphics needs and many other purposes (including new ones).
Quote:
Their operating parameters are far closer to what is necessary for near-threshold operation than a CPU clocked at 4 GHz, particularly mobile GPUs.
Please don't compare a mobile GPU against a desktop CPU. Many discrete graphics cards are bigger power hogs than the CPU, even at 4 GHz. Mobile Haswell CPUs will consume as low as 10 Watt (and that's CPU+GPU).

So exactly what operating parameters do you believe to be "far" closer to what is necessary for near-threshold operation on a GPU versus a CPU? Peak clock frequency is affected by pipeline length but otherwise it seems to me that a CPU is just as close to being able to operate at near-threshold voltage as a GPU.
Quote:
One of the demonstration circuits Intel had for the technology besides a 10-100 MHz Pentium was a lighting accelerator for a low-voltage graphics solution.
Actually that Pentium (a 4 stage architecture) was able to run at up to 915 MHz at 1.2 Volt, and the logic side was still operational at 0.28 Volt. So I don't see any reason to assume that a GPU would be "far" closer to NTV operation than any CPU. The required design changes are the same for both.
Quote:
It's not just for standby operation. It is meant to provide performance per watt that is potentially, but only theoretically at this point, an order of magnitude better than what is possible with silicon that cannot sustain regular operation much below 1V.
Yes, but striving for this optimal performance/Watt completely obliterates performance/dollar. Hence outside of ultra-low performance niche devices that need to run on harvested energy, the only practical use is for standby operation, still requiring it to be able to run at a relatively high frequency during peak usage, to be commercially viable.
Quote:
And a lack of compelling need.
If you need the mass throughput of a multicore, Intel wants you to get a SB-E or somesuch. Failing that, there are cloud servers using Xeons with many cores.
Consumers on average have dropped back to the dual to quad core transition with the proliferation of mobile media consumption devices with rapidly advancing GPUs.
We can do core upon core upon core, with the biggest challenge being that, beyond the low single digits, core count is a Do Not Care.
No. This is exactly the chicken-and-egg issue I mentioned. Back when 640 kB was enough for everyone, there was no "compelling need" for a mobile phone capable of running Angry Birds. You don't miss what you never had. Likewise, today there appears to be a low demand for more cores, but that's only because of a lack of software, which is in turn caused by the huge challenges of multi-core development. It's not due to a lack of task parallelism, nor a lack of desire for higher performance itself. People still want CPUs with higher single-threaded performance. TSX will no doubt be a game-changer for multi-core by simplifying things for developers and making it more efficient at the same time.
Quote:
Intel's mobile graphics solutions, whether IGP or Larrabee-ish cores, will, if its tech works, never clock anywhere near as high as minimally acceptable for the primary CPU. Their active power while working--not idle--could be 10-100x lower.
This is very interesting for actual workloads people care about.
Haswell consumes 10x less power at low frequency and voltage. So like I said, the operating parameters of future CPUs with very wide SIMD units could be adjusted to the workload on a core-by-core basis. So you'll get the benefits of homogeneous computing, with the performance of heterogeneous computing.
Old 01-Nov-2012, 20:03   #56
keldor314
Junior Member
 
Join Date: Feb 2010
Posts: 90

That CPUs are better than GPUs at complex code patterns is somewhat of a misconception. If anything, GPUs can perform better here.

The big reason is that additional parallelism gives the processor much more flexibility to schedule instructions. Whereas a CPU must resort to complex OOO schemes to avoid stalls in the pipeline, which are less efficient with complex code, a GPU simply schedules an instruction from another thread, which, provided there are enough threads ready to run, completely masks a stall.
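The latency-hiding argument can be put into a toy model. The sketch below assumes an idealized machine in which every thread issues for a fixed number of cycles and then waits a fixed number of cycles. The function names and the 4-cycle / 36-cycle figures are hypothetical and only illustrate the trend.

```python
# Idealized latency-hiding model (an assumption, not a measurement):
# each thread issues for `work` cycles, then waits `latency` cycles.

def alu_utilization(threads: int, work: int, latency: int) -> float:
    """Fraction of cycles in which some thread has an instruction to issue."""
    return min(1.0, threads * work / (work + latency))

def threads_to_hide(work: int, latency: int) -> int:
    """Smallest thread count that keeps the pipeline busy every cycle."""
    return -(-(work + latency) // work)   # ceiling division

print(alu_utilization(threads=1, work=4, latency=36))    # 0.1
print(alu_utilization(threads=10, work=4, latency=36))   # 1.0
print(threads_to_hide(work=4, latency=36))               # 10
```

In this model a stall is fully masked only once the thread count reaches `threads_to_hide`. Below that, utilization falls off linearly with the number of ready threads.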

Consider a branch with nasty, data-dependent, behavior. In this case, the CPU can only guess which way the branch will go, meaning there's a high chance for a mispredict, which costs perhaps 15 cycles x 6 way superscalar = 90 instructions. A GPU would just schedule another thread, and not even attempt to predict the branch, meaning that as long as you can somehow keep the SIMD lanes coherent (which is sometimes actually possible in this case!), you take no branch penalty. This is even more troublesome in a difficult to decode architecture like x86, where it may take several cycles to decode the instruction to the point where you even know that it's a branch in the first place, meaning you can sometimes mispredict on static branches, or even non-branches!
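The cost arithmetic in this paragraph can be written out as follows. The 15-cycle penalty and 6-wide issue are the post's own figures, while the misprediction rates passed in are hypothetical inputs.

```python
# Issue slots lost to branch mispredictions, using the post's figures
# (15-cycle penalty, 6-wide issue). The miss rates are hypothetical.

def lost_slots_per_mispredict(penalty_cycles: int, issue_width: int) -> int:
    """Issue slots thrown away by a single mispredicted branch."""
    return penalty_cycles * issue_width

def expected_lost_slots(miss_rate: float, penalty_cycles: int, issue_width: int) -> float:
    """Average issue slots lost per executed branch."""
    return miss_rate * lost_slots_per_mispredict(penalty_cycles, issue_width)

print(lost_slots_per_mispredict(15, 6))   # 90
print(expected_lost_slots(0.5, 15, 6))    # 45.0 (coin-flip branch)
print(expected_lost_slots(0.02, 15, 6))   # a well-predicted branch
```

The average cost depends entirely on the miss rate, so the 90-slot figure is the worst case per branch and not a typical one.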

Another tough issue with superscalar architectures is that you can very easily run out of instruction level parallelism. Worse, the amount of silicon it takes to achieve ILP of width N is of complexity order O(N^2), meaning you'll fall further and further behind a simple multithreaded architecture, which has complexity closer to O(N) or perhaps O(N log N).
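To get a feel for how quickly those cost models diverge, the sketch below tabulates N, N log2 N and N^2 for a few widths. Constant factors are ignored, so the output shows relative growth only and says nothing about actual transistor counts.

```python
import math

# Relative growth of the three cost models named in the post.
# Constant factors are ignored, so only the trend is meaningful.

def growth(n: int) -> dict:
    """Cost of width n under linear, N log N and quadratic scaling."""
    return {"N": n, "N log2 N": n * math.log2(n), "N^2": n * n}

for width in (2, 4, 8, 16):
    print(growth(width))
```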

One big argument for CPUs is that GPUs rely on SIMD instructions, which break down in highly irregular control flow situations. Keep in mind that CPUs also rely on SIMD, though not as wide (4 or 8 in the case of AVX, vs. 32 or 64 for Nvidia and ATI, respectively). Thing is, even with only 1 live SIMD lane, on something like a GTX 680, this still translates to 48 operations per clock (8 SMX x 4 instructions/SMX x 2 wide superscalar, with 6 ALU banks per SMX), which is still very near to CPU scalar performance in the best case (6 cores x 6 wide superscalar = 36 operations/clock, though at a higher clock frequency). Keep in mind that this assumes that the CPU is able to avoid stalls, as well as fill in all of the instruction slots every cycle, which is rather difficult, especially for complex code, meaning that the CPU numbers are inflated. The GPU is much harder to stall, due to the aggressive hyperthreading, and code that actually fragments control flow to this degree is really quite rare outside of pathological cases, so it's likely to perform much better than these numbers suggest.
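The per-clock figures in this paragraph can be reproduced with a few lines. The sketch follows one reading of the parenthetical, namely that each SMX can dispatch 4 x 2 instructions per clock but has only 6 ALU banks to feed, so the banks are the limit. That reading is an assumption, and the function and parameter names are hypothetical.

```python
# Per-clock issue arithmetic with only one live SIMD lane on the GPU.
# Assumed reading of the post: dispatch is capped by the ALU bank count.

def gpu_ops_per_clock(smx: int, schedulers: int, dispatch: int, alu_banks: int) -> int:
    """Scalar operations per clock across all SMX units."""
    per_smx = min(schedulers * dispatch, alu_banks)
    return smx * per_smx

def cpu_ops_per_clock(cores: int, issue_width: int) -> int:
    """Best-case scalar operations per clock with every slot filled."""
    return cores * issue_width

print(gpu_ops_per_clock(smx=8, schedulers=4, dispatch=2, alu_banks=6))  # 48
print(cpu_ops_per_clock(cores=6, issue_width=6))                        # 36
```

Both numbers are per clock, so the CPU's higher clock frequency still has to be factored in before comparing them directly.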

As for programmability, the SIMT model is far easier to use than the explicit SIMD model CPUs use. As for efficiency, remember that you can always force the SIMT model to work as an explicit SIMD model in software, but you cannot necessarily do the same in reverse, since SIMT requires full lane-level predication (including predicated branches).
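As an illustration of what lane-level predication means in practice, the sketch below emulates a per-lane branch the way an explicit-SIMD program has to: both sides are evaluated for all lanes and a mask picks the result. Plain Python lists stand in for vector registers, and the helper names are hypothetical.

```python
# Emulating a per-lane branch with a mask: both sides are computed for
# every lane, then the mask selects. Lists stand in for vector registers.

def simd_select(mask, if_true, if_false):
    """Per-lane select: take if_true where mask is set, else if_false."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]

def abs_or_double(lanes):
    """Per lane: negative values are negated, others are doubled."""
    mask = [x < 0 for x in lanes]          # lane-level predicate
    negated = [-x for x in lanes]          # "then" side, all lanes
    doubled = [2 * x for x in lanes]       # "else" side, all lanes
    return simd_select(mask, negated, doubled)

print(abs_or_double([-3, 1, -2, 4]))   # [3, 2, 2, 8]
```

In a SIMT model the programmer would simply write the `if` per thread and the hardware would maintain the mask. Here the mask handling is explicit, which is the usability gap the paragraph describes.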

The bottom line is that for any problem where it is possible to break it down into thousands of threads, a GPU will almost always outperform a CPU, and the margin by which it does will tend to grow. Since pretty much any rendering algorithm operates on millions of independent pixels (with multiple independent samples for each one!), GPUs will pretty much always trump CPUs for rendering.

GPUs have changed a lot in the last 5 years, to the point that it's very hard to tell the difference between a GPU and a CPU at a functional level. The difference is in the microarchitecture, where GPUs are optimized for running many simultaneous threads, while CPUs are optimized for running only a few.
Old 01-Nov-2012, 20:58   #57
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,074

Quote:
Originally Posted by Nick View Post
I never said 24-core.
My initial statement wasn't in response to a claim you made. If a one-sentence reply is all that is made to contradict my claim, I don't find it unreasonable to assume you are taking the same position.

Quote:
Also note that discrete graphics cards are slowly becoming a niche market.
Aside from certain advantages such as PCB-mounted high speed memory, the discrete/integrated dichotomy is almost orthogonal. With the likely advance in memory stacking and 2.5D/3D integration coming at some point in the future, the GPU add-in board could go away. My argument on the divergent needs of silicon targeting one workload or the other would not change based on location.

Quote:
People who do require such performance level are also likely in the same market for a 24-core CPU several years from now. So let's be very clear about what segment and time frame we're talking about. It's obvious that the CPU will unify with the iGPU first, before core counts go up. Something like an 8-core successor to Haswell with no iGPU can have plenty of generic computing power for mainstream graphics needs and many other purposes (including new ones).
I was specific on the segment given: consumer-level real time graphics.
The time frame I gave for foreseeable could have been tightened down to indicate that it was based on data that could be pulled from publicly available process roadmaps from the major foundries and Intel, along with any product roadmaps--although this detail peters out earlier. That's about the 14nm node or equivalent for the foundries, maybe one more for Intel.

A user looking to replace their Radeon 7970 or GeForce 680 on a quad or hex core system is not going to be in the same market for a 24-core CPU.
I'm not sure what you mean by a Haswell successor. Haswell will most likely have SKUs that can max out consumer TDPs with 4 cores, and its immediate successor isn't going to make six times as many cores more palatable. Go much further and it's less likely to be a successor than a replacement.


Quote:
Please don't compare a mobile GPU against a desktop CPU. Many discrete graphics cards are bigger power hogs than the CPU, even at 4 GHz. Mobile Haswell CPUs will consume as low as 10 Watt (and that's CPU+GPU).
That's a specious argument based on a superficial sampling of top-end GPUs not designed for NTV operation. Those GPUs burn power, but it is expected that they can accomplish far more graphically than the 4 GHz CPU, and they do.
To then compare a discrete desktop product to an ultraportable platform is pointless, and ignores that Haswell's portable variant actually has more GPU as a fraction of area specifically because having more low-clocked silicon saves power.

Quote:
So exactly what operating parameters do you believe to be "far" closer to what is necessary for near-threshold operation on a GPU versus a CPU?
Their clock speeds are far lower and architecturally they tend to favor simpler pipelines and an economy in logic implementation. Their processing engines are closer to an original Pentium than a Haswell. Some mobile GPUs can operate in the hundreds of MHz, which is much closer than a multi-GHz processor to the low ceiling NTV puts on switching speeds.
NTV adds area and complexity costs, and it becomes a negative once it approaches regular speeds and voltages.

Quote:
Peak clock frequency is affected by pipeline length but otherwise it seems to me that a CPU is just as close to being able to operate at near-threshold voltage as a GPU.
The switching-speed ceiling allowed by NTV would require a very long pipeline, assuming that an acceptable FO4 per stage is reachable with a pipeline specified to run NTV and at 4 GHz.

Quote:
Actually that Pentium (a 4 stage architecture) was able to run at up to 915 MHz at 1.2 Volt, and the logic side was still operational at 0.28 Volt. So I don't see any reason to assume that a GPU would be "far" closer to NTV operation than any CPU. The required design changes are the same for both.
Its power efficiency curve is not as interesting at 1.2V, and we see the frequency curve just about stall above .8V. It's a small and ancient core that's burning power at the upper end of its range that can be matched by more modern designs with more performance.
Configuring the chip for NTV requires trade-offs against high-speed operation, and forcing it to those speeds actually makes it less efficient or less manufacturable.

Quote:
Yes, but striving for this optimal performance/Watt completely obliterates performance/dollar.
The glut is in transistor counts and the sheer number of die the industry can produce to service a slower-growing global demand. Intel's already idling fabs at 22nm due to softness in demand. There is more flexibility in terms of transistor count and area, but very little for power going forward.

Quote:
Hence outside of ultra-low performance niche devices that need to run on harvested energy, the only practical use is for standby operation, still requiring it to be able to run at a relatively high frequency during peak usage, to be commercially viable.
Intel's not getting funding from the US government on NTV vector permute engines for the sake of harvested energy computing. The power constraints for HPC at the exascale level are immense. Haswell's low-wattage variant is pushing further towards broad areas of low-speed logic as a power/performance tradeoff.

Quote:
No. This is exactly the chicken-and-egg issue I mentioned. Back when 640 kB was enough for everyone, there was no "compelling need" for a mobile phone capable of running Angry Birds.
Back then mobile phones were bricks, and a desktop tower couldn't have run Angry Birds.
There was no compelling need for the physically impossible, or at least no more than any other thing requiring unicorns.

Quote:
You don't miss what you never had. Likewise, today there appears to be a low demand for more cores, but that's only because of a lack of software, which is in turn caused by the huge challenges of multi-core development. It's not due to a lack of task parallelism, nor a lack of desire for higher performance itself. People still want CPUs with higher single-threaded performance. TSX will no doubt be a game-changer for multi-core by simplifying things for developers and making it more efficient at the same time.
There is a fundamental shift in the dynamics of the market, from the outset of the IBM-compatible era.
Until recently, the PC had the anomalous benefit of being a business, media, and personal use portal. It was an open and fragmented era where creative, commercial, and individual use flexibility and capabilities were satisfied and funded by the same pool of silicon and the same pool of dollars.
This is not the same era.
The drivers for creative computing or scientific computing are no longer the same as consumer computing, or the same as business computing or enterprise system computing.

It used to be that engineering and revenue went into and came from this one big pool where all stakeholders could benefit from the PC chip as a disruptive technology.
If any sector stagnated, there were other needs or other customers who wanted more, and their contribution pushed the whole forward. The marginal utility of the next big thing drove rapid upgrade cycles across the whole domain.

The market trends now are for a fragmentation of a mature platform, one that is no longer disruptive but mundane and plodding.
For various reasons, we see spending going away from the single clunky box or merchant chip that does everything inconveniently for the consumer.
The consumer market is at least in part regressing, because silicon integration has advanced so far that people now have portable devices that can do just enough of the job of that clunky box that does everything, just not very prettily. The new platform is an inflexible portal for consumption, locked down, and hostile to creating content or processing it. It doesn't need to last, and it is better the more disposable it becomes.
Their money is not going to bring about a need for 24-core PC chips. Their devices do not necessarily want cloud servers running on those either. The supercomputers want more than those chips can provide.
There is still a need for pushing the envelope here, but it is not universally beneficial, so it is not going to be the product priced for the consumer.



Quote:
Haswell consumes 10x less power at low frequency and voltage.
Will you be able to test at some point in the future what FPS is achievable for some games using Swiftshader for a 10W Haswell chip?
You can then run the same games with the GPU on.
Log battery life.

Quote:
So like I said, the operating parameters of future CPUs with very wide SIMD units could be adjusted to the workload on a core-by-core basis. So you'll get the benefits of homogeneous computing, with the performance of heterogeneous computing.
It's not enough for those interested in NTV, particularly since so much of Haswell's output will rely on binning to get the cream of the crop. NTV is meant for even lower power consumption with better throughput per Watt, and it is meant to do so consistently.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 01-Nov-2012, 22:33   #58
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,495

Quote:
Originally Posted by 3dilettante View Post
Will you be able to test at some point in the future what FPS is achievable for some games using Swiftshader for a 10W Haswell chip?
Not sure about the 10 W, but I can take a very, very (cue a few more very's) rough guess for a 4.8 GHz Haswell (if such a thing were ever to exist) running UT2004 at 1680x1050, max details, on a small 2-player level: 15-20 fps.
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™
Old 02-Nov-2012, 01:27   #59
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by Davros View Post
Hang on, you said earlier AVX2 will kill discrete GPUs
and AVX2 is coming next year
I never said that.
Old 02-Nov-2012, 04:04   #60
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783

Quote:
Originally Posted by keldor314 View Post
That CPUs are better than GPUs at complex code patterns is somewhat of a misconception. If anything, GPUs can perform better here.

The big reason is that additional parallelism gives the processor much more flexibility to schedule instructions. Whereas a CPU must resort to complex OOO schemes to avoid stalls in the pipeline, which are less efficient with complex code, a GPU simply schedules an instruction from another thread, which, provided there are enough threads ready to run, completely masks a stall.
If that was the silver bullet, all we'd need is more than 2-way SMT on a CPU with very wide SIMD units.

But it's not that simple, and GPUs do stall. Complex code uses a big working set, and if your register file and/or caches are too small to hold that working set, then you use up precious bandwidth just to pull in frequently used data. Once you're out of bandwidth, and this can happen at many levels, the GPU stalls.

The CPU's out-of-order execution enables it to keep running just one or two threads per core, thereby maximizing access locality and ensuring high cache hit rates, making it inherently bandwidth efficient at every level. Computing power is getting cheaper, while bandwidth is getting harder to scale, so inevitably the GPU has to learn tricks from the CPU. In fact this has already been going on for years; they try to keep the thread count low by decreasing the back-to-back execution latency. Eventually they'll want to bypass the results back to the top of the execution units, and it's a relatively small step from that to not always schedule instructions from different threads, but to also schedule independent instructions from the same thread.
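The working-set argument can be sketched as a simple capacity check: the more threads are resident, the smaller each thread's share of on-chip storage. The storage size, working-set size and thread counts below are hypothetical, chosen only to show the trend, and the even split between threads is a simplifying assumption.

```python
# On-chip storage available to each resident thread, assuming an even
# split. All sizes and thread counts here are hypothetical.

def bytes_per_thread(storage_bytes: int, resident_threads: int) -> float:
    """Each thread's share of registers/cache under an even split."""
    return storage_bytes / resident_threads

def fits_on_chip(working_set: int, storage_bytes: int, resident_threads: int) -> bool:
    """True if each thread's working set fits in its share of storage."""
    return working_set <= bytes_per_thread(storage_bytes, resident_threads)

KIB = 1024
print(fits_on_chip(4 * KIB, 256 * KIB, resident_threads=2))     # True
print(fits_on_chip(4 * KIB, 256 * KIB, resident_threads=2048))  # False
```

Once the check fails, frequently used data has to be refetched from further away, which is where the bandwidth pressure described above comes from.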
Quote:
Consider a branch with nasty, data-dependent, behavior. In this case, the CPU can only guess which way the branch will go, meaning there's a high chance for a mispredict, which costs perhaps 15 cycles x 6 way superscalar = 90 instructions.
Actually the misprediction rate of a modern CPU is incredibly low. And graphics is far more regular than the average code.
Quote:
A GPU would just schedule another thread, and not even attempt to predict the branch...
Which only works when you have enough threads to schedule between. Due to high memory access latencies, having many data-dependent branches will stall the GPU.

Branch prediction is a necessity to keep the thread count low. Sooner or later GPUs will need it. Stalling all the time because your data is far away, is worse than mispredicting every now and then.
Quote:
Another tough issue with superscalar architectures is that you can very easily run out of instruction level parallelism.
Not really. First of all, Sandy Bridge has an out-of-order window of 168 instructions. There's not a lot of ILP to miss. In fact Hyper-Threading only offers at most 30% speedup. If ILP was a major issue you'd expect that to be significantly more.

And again, the GPU's urge to switch to another thread to maximize ALU utilization can work against itself. The register file and cache contention create a bottleneck that is far worse than missing a bit of ILP.
Quote:
Worse, the amount of silicon it takes to achieve ILP of width N is of complexity order O(N^2), meaning you'll fall further and further behind a simple multithreaded architecture, which has complexity closer to O(N) or perhaps O(N log N).
It's really not that complex. The NetBurst architecture was theoretically capable of executing four arithmetic operations per clock cycle. It had all the logic for it, 12 years ago. The effective IPC was far lower though, but this had several reasons beyond how wide the execution core was. Core 2 was less wide, but IPC went up, and it didn't cost O(N^2) in transistor budget. Haswell has four arithmetic execution ports again, but this isn't all that complex by today's standards. There's very little logic that has to scale by O(n^2) to make that happen.
Quote:
As for programmability, the SIMT model is far easier to use than the explicit SIMD model CPUs use. As for efficiency, remember that you can always force the SIMT model to work as an explicit SIMD model in software, but you cannot necessarily do the same in reverse, since SIMT requires full lane level predication (including predicated branches).
The CPU supports any programming model, including SIMT with predication.
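As a rough illustration of what "SIMT with predication" means when done in software, the sketch below emulates a per-lane if/else over an 8-lane vector with an explicit mask. The lane width, the function name `simt_abs_or_double` and the example kernel are all made up for illustration; plain Python lists stand in for SIMD registers, so this shows the control-flow transformation only, not real performance.

```python
LANES = 8  # assumed vector width (e.g. eight 32-bit lanes in a 256-bit register)


def simt_abs_or_double(values):
    """Emulate 'if (x < 0) y = -x; else y = 2 * x;' for every lane at once."""
    assert len(values) == LANES
    mask = [x < 0 for x in values]          # per-lane predicate
    then_result = [-x for x in values]      # 'then' side, computed for all lanes
    else_result = [2 * x for x in values]   # 'else' side, computed for all lanes
    # Blend: each lane keeps the result of the side its predicate selected.
    return [t if m else e for m, t, e in zip(mask, then_result, else_result)]


print(simt_abs_or_double([-1, 2, -3, 4, 0, -6, 7, 8]))
```

Both sides are executed for every lane, which is the cost of predication; an implementation can skip a side entirely when the mask is all-true or all-false.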
Quote:
The bottom line is that for any problem where it is possible to break it down into thousands of threads, a GPU will almost always outperform a CPU, and the margin by which it does will tend to grow. Since pretty much any rendering algorithm operates on millions of independent pixels (with multiple independent samples for each one!), GPUs will pretty much always trump CPUs for rendering.
You're making (false) assumptions here. Aside from the lack of gather support, which will soon be fixed, the CPU is behind on the GPU due to difference in SIMD width. That can be fixed too while still calling it a CPU.
Quote:
GPUs have changed a lot in the last 5 years, to the point that it's very hard to tell the difference between a GPU and a CPU at a functional level. The difference is in the microarchitecture, where GPUs are optimized for running many simultaneous threads, while CPUs are optimized for running only a few.
Yes, many differences between the CPU and GPU have disappeared one by one. But the convergence is still ongoing, and it's happening from both ends, so it's only a matter of years before those last few differences fade as well.
Old 02-Nov-2012, 04:15   #61
willardjuice
super willyjuice
 
Join Date: May 2005
Location: Astoria, NY
Posts: 986
Default

Quote:
Originally Posted by Nick
I never said that.
[edit] I mean I agree you didn't say "kill discrete gpus", but I believe Davros was purposely over exaggerating. I think the point is that it seems you're implying that after haswell, CPUs would be relatively competitive with their iGPU counterparts. I'm not sure if I'm buying that quite yet.

Quote:
Originally Posted by Grall
A wide, massively parallel GPU absolutely crushes - CRUSHES - any reasonable number of traditional CPUs in performance. Trying to run any modern game using software CPU rendering is going to be pitifully slow, absolutely unplayable and just pathetic.
Quote:
Originally Posted by Nick
The main reason for that is a lack of gather support. But that's exactly the main feature of AVX2, which Intel will introduce in its Haswell CPUs in 2013.
Old 02-Nov-2012, 06:13   #62
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by 3dilettante View Post
A user looking to replace their Radeon 7970 or GeForce 680 on a quad or hex core system is not going to be in the same market for a 24-core CPU.
I agree. But that wasn't the time-frame I imagined. Sooner or later there will be a consumer market for a 24-core CPU. A bit of a high-end niche market, just like the one discrete GPUs will be in, but it's coming. And if such a 24-core CPU comes with two 1024-bit SIMD units per core, we're looking at about 12 TFLOPS of computing power. It will simply exhaust the memory bandwidth. So there's no point in having an iGPU on the same die. It would just add cost and be more limited.
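The 12 TFLOPS figure can be reproduced with simple arithmetic, under assumptions the post does not state explicitly: 32-bit floats, a fused multiply-add counted as two operations per lane per cycle, and a clock of about 4 GHz. The function name and parameters are illustrative.

```python
def peak_tflops(cores, simd_units_per_core, simd_bits, clock_ghz,
                element_bits=32, flops_per_lane=2):
    """Theoretical peak in TFLOPS; flops_per_lane=2 assumes one FMA per lane per cycle."""
    lanes = simd_bits // element_bits
    flops_per_cycle = cores * simd_units_per_core * lanes * flops_per_lane
    return flops_per_cycle * clock_ghz / 1000.0  # GFLOPS -> TFLOPS


print(peak_tflops(24, 2, 1024, 4.0))
```

With those assumptions the result is a little over 12 TFLOPS; a lower clock or no FMA would scale the number down proportionally.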
Quote:
Their clock speeds are far lower and architecturally they tend to favor simpler pipelines and an economy in logic implementation. Their processing engines are closer to an original Pentium than a Haswell. Some mobile GPUs can operate in the hundreds of MHz, which is much closer than a multi-GHz processor to the low ceiling NTV puts on switching speeds.
That really doesn't matter to achieving NTV operation. A 4 GHz CPU and 1 GHz GPU at the same voltage will simply retain that relative frequency difference when made suitable for NTV operation. Getting that 4-stage Pentium to run at 915 MHz at 1.2 Volt, while also having been altered for NTV operation, makes it a pretty high frequency.

If you're still not convinced: A 5.3GHz 8T-SRAM with Operation Down to 0.41V in 65nm CMOS. A lower starting frequency doesn't make the GPU any more amenable for NTV operation.
Quote:
NTV adds area and complexity costs, and it becomes a negative once it approaches regular speeds and voltages.

The switch speeds ceiling allowed by NTV would require a very long pipeline, assuming that an acceptable FO4 per stage is reachable with a pipeline specified to run NTV and at 4 GHz.
There's plenty of clock headroom for CPUs nowadays. Making them capable of operating at NTV only takes a few percent in die area (note that Intel has already been using 8T SRAM since Nehalem), and the timing impact is easily absorbed by the next process node shrink. It could make sense for mobile chips sooner rather than later.
Quote:
Back then mobile phones were bricks, and a desktop tower couldn't have run Angry Birds.
There was no compelling need for the physically impossible, or at least no more than any other thing requiring unicorns.
Yes, but you're missing the point. There is a very real demand for mobile phones capable of running Angry Birds, today. That "compelling need" was born out of having the actual software, which was only made possible by having the hardware capable of running it. Something that was the subject of unicorn tales tens of years ago. There's no compelling need for a CPU with significantly more cores today, because there is no software for it, which in turn is the fault of not having (affordable) hardware that is easy to develop for. Consumers don't care about software that doesn't exist yet. But that can change. Given how much easier multi-core development will become with TSX technology, it is a certainty that it will eventually increase the demand for more cores.
Quote:
There is a fundamental shift in the dynamics of the market, from the outset of the IBM-compatible era.
Until recently, the PC had the anomalous benefit of being a business, media, and personal use portal. It was an open and fragmented era where creative, commercial, and individual use flexibility and capabilities were satisfied and funded by the same pool of silicon and the same pool of dollars.
This is not the same era.
The drivers for creative computing or scientific computing are no longer the same as consumer computing, or the same as business computing or enterprise system computing.
And yet not so different CPU architectures are used in each of these markets. Computing is computing. You can find a mix of ILP/TLP/DLP workloads in any code.

Note that mobile phone CPUs start to use out-of-order execution and some are quad-core too. Technology like AVX2 also makes sense from a performance/Watt perspective. The HPC market has a need for it too. So despite all this diversification in devices, the drivers for the computing technology are still the same.
Quote:
Will you be able to test at some point in the future what FPS is achievable for some games using Swiftshader for a 10W Haswell chip?
You can then run the same games with the GPU on.
Log battery life.
Performance/Watt has consistently been catching up with the GPU in the past decade. Given that Haswell adds gather support and FMA, it should considerably reduce the remaining gap. I don't see how you could expect this to support your arguments.
Old 02-Nov-2012, 07:12   #63
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by willardjuice View Post
I mean I agree you didn't say "kill discrete gpus", but I believe Davros was purposely over exaggerating. I think the point is that it seems you're implying that after haswell, CPUs would be relatively competitive with their iGPU counterparts. I'm not sure if I'm buying that quite yet.
I'm only implying that Haswell will mark an historic inflection point. Since AVX1 was practically irrelevant, we'll observe a fourfold increase in floating-point computing power, as well as the addition of gather support and vector-vector shift to truly enable the wide 'vertical' SIMD programming model that has made GPUs so successful. Haswell also doubles the L/S and cache bandwidth and even adds another execution port to avoid contention on the vector ports. They could have spread these features over multiple generations, but they're introducing them all at once. A fair amount of scalar code will become vectorizable and run up to eight times faster while benefiting from AVX2's homogeneous nature. So it's a definite game-changer and the GPGPU computing days in the consumer market are clearly numbered.
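For readers unfamiliar with the term, a gather loads one element per SIMD lane from an arbitrary index, which is what lets a table lookup inside a loop be vectorized. The sketch below only emulates the semantics in plain Python; the names are hypothetical and this is not the AVX2 intrinsic API. The eight lanes correspond to 256 bits divided by 32-bit elements, which is where the "up to eight times" figure comes from.

```python
VECTOR_BITS = 256
ELEMENT_BITS = 32
LANES = VECTOR_BITS // ELEMENT_BITS  # 8 lanes of 32-bit data


def gather(table, indices):
    """One emulated gather: lane i reads table[indices[i]]."""
    assert len(indices) == LANES
    return [table[i] for i in indices]


def lookup_vectorized(table, indices):
    """Process an index list LANES elements at a time (length must be a multiple of LANES)."""
    out = []
    for start in range(0, len(indices), LANES):
        out.extend(gather(table, indices[start:start + LANES]))
    return out


squares = [i * i for i in range(16)]
print(gather(squares, [3, 0, 15, 7, 7, 1, 2, 9]))
```

Without a gather instruction, each of those eight loads has to be issued and inserted into the vector separately, which is why lookups have traditionally blocked vectorization on the CPU.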

CPUs will indeed also become "relatively" competitive with their iGPU counterpart. But not yet in every single aspect or for every model. It makes a lot of sense to not take any risks and keep the iGPU for a few more generations. All I'm saying is that AVX2 should be an eye-opener for what the possibilities of a unified architecture will be.
Old 02-Nov-2012, 17:00   #64
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,074
Default

Quote:
Originally Posted by Nick View Post
I agree. But that wasn't the time-frame I imagined. Sooner or later there will be a consumer market for a 24-core CPU. A bit of a high-end niche market, just like the one discrete GPUs will be in, but it's coming.
The discrete market is likely mostly if not totally going away. Some of the notable niches that might persist longest are not priced at a consumer level and would only exist for some esoteric legacy or professional feature reason. Sprouting a market for chips priced at extreme edition or higher prices is not changing the picture for the vast majority of the market.

Quote:
And if such a 24-core CPU comes with two 1024-bit SIMD units per core, we're looking at about 12 TFLOPS of computing power. It will simply exhaust the memory bandwidth. So there's no point in having an iGPU on the same die. It would just add cost and be more limited.
What number of OoO cores with 2 1024-bit vector units are you saying will provide equivalent performance to what level or size of IGP?
The IGP or throughput portion of the die will be optimized for lower leakage and dynamic power. Haswell is slightly compromising some of its CPU performance by decoupling the L3 cache from the CPU domain, in order to allow scenarios that rely predominantly on the IGP to permit it to be active while the whole CPU section is dropped to an even lower power state. For power savings alone in a consumer or media box scenario, dedicating a significant area to specialized low-speed silicon and allowing only 4 OoO cores to gate off is already compelling.

If 20 of those overbuilt cores are only really needed to substitute for an on-die GPU or throughput section, the vast majority of the market is going to question the need for the 20 cores.

Your bandwidth figures may not be entirely accurate if integration continues apace.

Quote:
That really doesn't matter to achieving NTV operation. A 4 GHz CPU and 1 GHz GPU at the same voltage will simply retain that relative frequency difference when made suitable for NTV operation. Getting that 4-stage Pentium to run at 915 MHz at 1.2 Volt, while also having been altered for NTV operation, makes it a pretty high frequency.
Are you arguing that the NTV P4 should be able to hit 12 GHz?
Maybe it's really not that simple since nothing does, and there's a non-linear amount of effort to get up to that level of speed.
The last Pentium at the .25 micron node could hit 300 MHz with a TDP of 7-8 Watts with a core with 4.5 million transistors, with 1.8-2V.
The first Pentium 4 at ~4 GHz was at .09microns, had a TDP of 115 Watts, and 125 million transistors and 1.25/1.4V.

I dispute the claim that there is a simplistic linear translation in clock speed when faced with increased delay at voltages above near threshold and the increased power cost related to driving larger gates, more complex circuits, and actual changes to the physical and chemical processes that have been mooted for really ramping down NTV power consumption. It's especially critical since a lot less can be hidden in .25 ns versus 1 ns, and honestly the NTV chip's really interesting around 2-3ns per cycle.

Note that the NTV Pentium had about 25% more transistors, consistent with certain circuit design tradeoffs to counter variability and increase reliability at very low voltage. Reducing leakage and variability at .6V means adding changes that drive up the power cost of operating at regular speeds.

Nonetheless, even if we assume something that is that dubious, the P4 core is almost 28 times larger than the most recent non-experimental Pentium core. Granted, that is mostly cache. Stripping it down to core and L1 might put the P4 core at 30-40 million, so just shy of ten times the size.
Add on 25% and the P4 is getting near 50 million transistors.
What is interesting in a P4's performance with a TDP of 7 watts at 32nm?

It's not even interesting compared to cores running at normal voltages.
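The transistor ratios used in the argument above can be checked directly from the counts quoted in this post. The 40 million figure is the upper end of the post's own 30-40 million estimate for a P4 core plus L1; the variable names are illustrative.

```python
PENTIUM_TRANSISTORS = 4.5e6   # last Pentium at .25 micron, from the post
P4_TRANSISTORS = 125e6        # ~4 GHz Pentium 4 at .09 micron, from the post
P4_CORE_ESTIMATE = 40e6       # upper end of the post's 30-40 million estimate
NTV_OVERHEAD = 0.25           # ~25% more transistors, as for the NTV Pentium

full_chip_ratio = P4_TRANSISTORS / PENTIUM_TRANSISTORS
core_only_ratio = P4_CORE_ESTIMATE / PENTIUM_TRANSISTORS
ntv_p4_core = P4_CORE_ESTIMATE * (1 + NTV_OVERHEAD)

print(round(full_chip_ratio, 1), round(core_only_ratio, 1), ntv_p4_core / 1e6)
```

This gives roughly 28x for the full chip, just under 9x for the stripped-down core, and about 50 million transistors once the NTV overhead is added, matching the figures in the text.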

Quote:
A lower starting frequency doesn't make the GPU any more amenable for NTV operation.
A 4GHz core is not interesting until it gets into the GHz range. GPU pipelines are interesting earlier and their comparatively simplistic pipelines are not as difficult to wrangle as high-speed superscalar OoO engines.
There are plenty of non GHz cores that get equal or better IPC to aggressive cores, they just can't hit high clocks. Without the high clocks, they need to reorder even less since the biggest driver is memory latency relative to the cores.


Quote:
There's plenty of clock headroom for CPUs nowadays. Making them capable of operating at NTV only takes a few percent in die area (note that Intel has already been using 8T SRAM since Nehalem), and the timing impact is easily absorbed by the next process node shrink. It could make sense for mobile chips sooner rather than later.
NTV requires 10T SRAM, so 20% more area for memory, and circuit additions and gate expansion that can add 10% and 5% to logic, at least.
There are >6 million transistor cores that get around 700mW, because that's not the realm that the NTV Pentium is interesting at. The interesting point is 500MHz and below, where the circuits can take their time and a high-speed complex core is pointless fluff.

Quote:
Yes, but you're missing the point. There is a very real demand for mobile phones capable of running Angry Birds, today. That "compelling need" was born out of having the actual software, which was only made possible by having the hardware capable of running it. Something that was the subject of unicorn tales tens of years ago. There's no compelling need for a CPU with significantly more cores today, because there is no software for it, which in turn is the fault of not having (affordable) hardware that is easy to develop for.
What do consumers need it for? Not media creators, not enterprise admins, not engineers, not scientists, but consumers of media and entertainment? Their needs don't match the rest anymore, so their money doesn't go there. Consumers have regressed in their needs in terms of the throughput and power they require and become critical in their demand for battery life.
There's no compelling need for a CPU with 24 cores in the consumer market because they don't do anything that challenging and a CPU with 24 cores is orders of magnitude beyond acceptable in terms of power consumption.
A CPU with 24 cores also currently consumes an order of magnitude beyond acceptable for the next big thing in HPC.

Quote:
And yet not so different CPU architectures are used in each of these markets. Computing is computing. You can find a mix of ILP/TLP/DLP workloads in any code.
The differentiators are the custom IP blocks, DRM and security, interconnect, and media/encryption offload capabilities. On top of this are non-technical questions like what media empire the chip is attached to. Even within this, there are now mobile and non-mobile cores that differ massively.

Quote:
Note that mobile phone CPUs start to use out-of-order execution and some are quad-core too.
The number of different architectures and substantially different dies is much larger.
You are not getting crossover from an Apple SOC with ARM cores, a GPU, and scads of hardware decoders for a consumer device and a 24 core Haswell that fails to supplant a GPU in a gaming desktop.

Quote:
Performance/Watt has consistently been catching up with the GPU in the past decade. Given that Haswell adds gather support and FMA, it should considerably reduce the remaining gap. I don't see how you could expect this to support your arguments.
Which is why Haswell's GPU is even bigger than before. It's even bigger than big for its low-power version.
Intel is very, very interested in continuing voltage scaling way below what is possible with standard circuits. It's not interesting or actively detrimental for workloads where a core's utility is heavily predicated on its straightline speed, but intensely interesting for workloads that are embarrassingly parallel. Push NTV up to where it becomes interesting in serial processing, and it loses to standard cores.

The goal for exascale computing has already extrapolated on the trends for best-case GPU evolution, and it was found wanting. The CPU trendline was even lower.
The desire for power efficiency is going beyond merely different designs on relatively comparable silicon, and demanding an actual specialization in the silicon--if it remains silicon at all in the long term.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 02-Nov-2012, 18:38   #65
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,495
Default

Quote:
Originally Posted by willardjuice View Post
but I believe Davros was purposely over exaggerating.
Actually I got it wrong: he said it would be the death of GPGPU, which isn't quite the same thing as the death of discrete GPUs.

ps: Nick, there's another thread on B3D (formally) saying it's voxels that will be the death of the GPU. What are your thoughts on that?
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™
Old 02-Nov-2012, 21:24   #66
keldor314
Junior Member
 
Join Date: Feb 2010
Posts: 90
Default

Quote:
Originally Posted by Nick View Post
If that was the silver bullet, all we'd need is more than 2-way SMT on a CPU with very wide SIMD units.

But it's not that simple, and GPUs do stall. Complex code uses a big working set, and if your register file and/or caches are too small to hold that working set, then you use up precious bandwidth just to pull in frequently used data. Once you're out of bandwidth, and this can happen at many levels, the GPU stalls.
Hence the use for a medium size cache. But where does the 32MB cache help when you only have 8 or so active threads?

In any case, a high end GPU currently has close to 200 GB/s memory bandwidth, but a 6-core i7 tops out around 40 GB/s. Did I mention that the GPU costs around $600, but the 6-core i7, once you throw in the motherboard and memory, is north of $1000? There are various reasons for this (and I don't really know what they are), but do you really think that Intel would choose *not* to make a 200 GB/s workstation if it were possible?

Quote:
The CPU's out-of-order execution enables it to keep running just one or two threads per core, thereby maximizing access localities and ensuring high cache hit rates, making it inherently bandwidth efficient at every level. Computing power is getting cheaper, while bandwidth is getting harder to scale, so inevitably the GPU has to learn tricks from the CPU. In fact this has already been going on for years; they try to keep the thread count low by decreasing the back-to-back execution latency. Eventually they'll want to bypass the results back to the top of the execution units, and it's a relatively small step from that to not always schedule instructions from different threads, but to also schedule independent instructions from the same thread.
You need the same amount of total bandwidth whether you do it multithreaded or single threaded. The only reason the GPU will run out of bandwidth where the CPU won't is that the GPU is computing the results too fast.

Bypassing is very expensive, especially when you have superscalar, SIMD, or long pipelines. It's usually cheaper to just double the size of the register file and be done with it, assuming you can afford to lose binary compatibility.

The main problem though is that memory, even with caches, is very high latency to access. Latency of 10/100/1000 cycles for L1/L2/RAM is fairly typical. Any time you hit a dependent memory pattern (hello there, linked list!), performance will tank unless you have a large number of other instructions in between. You can either schedule in more threads in hardware, or interleave work items in software. But once you have to interleave, you've defeated the entire point of most of the serial performance hardware on a CPU.
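A toy cost model makes the linked-list point concrete. It assumes every node access misses to a level with a fixed latency (the 10/100/1000 cycle figures above), that each hop depends on the previous one, and that hops from independent lists overlap perfectly up to the number of lists kept in flight. Real hardware is messier; the function name and the model are illustrative only.

```python
import math


def chase_cycles(nodes_per_list, latency_cycles, num_lists=1, max_in_flight=1):
    """Cycles to walk num_lists independent linked lists of equal length.

    Hops within one list are serialized; up to max_in_flight lists progress in
    parallel (extra hardware threads, or work items interleaved in software).
    """
    rounds = math.ceil(num_lists / max_in_flight)
    return rounds * nodes_per_list * latency_cycles


print(chase_cycles(1000, 1000))                                # one list out of RAM
print(chase_cycles(1000, 1000, num_lists=8, max_in_flight=8))  # eight lists, interleaved
print(chase_cycles(1000, 1000, num_lists=8, max_in_flight=1))  # eight lists, one at a time
```

In this model, walking eight lists interleaved costs the same as walking one, while walking them back to back costs eight times as much, which is the case for covering latency with more independent work rather than faster serial execution.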

For graphics, there are actually algorithms that rely on having per pixel linked lists. Various order-independent transparency methods come to mind. Other algorithms with heavy chained memory dependency include relief mapping and ray-tracing.

Quote:
Actually the misprediction rate of a modern CPU is incredibly low. And graphics is far more regular than the average code.
There are cases where branches are pretty much impossible to predict. Think data-dependent switch statements, for instance. There's just not enough information available to do better than a guess. As for graphics, many complex algorithms are highly data-dependent. For instance, in ray-tracing, traversing the acceleration structure is highly data-dependent, both in branches and memory access. This makes branch prediction impossible, as well as preventing OOO from helping, not to mention destroying a large degree of ILP. High performance CPU ray-tracers actually operate on packets of rays, in very much of a GPU style execution pattern, but then why waste resources on the OOO and BP logic?

Graphics is also unfriendly to caches, since you tend to have an access once, use once pattern. There's short term cache coherence, such as between nearby pixels sampling the same texel, but that can be covered by a very small L1 cache. A larger, higher level cache isn't likely to help much more until the point where it becomes large enough to contain the entire scene, which is to say, the entire size of main memory, since we try to use as high resolution textures as will fit. This pattern also holds for many types of HPC problems. For instance, pretty much any sort of physical simulation looks at a small set of nearby elements, but operates over a massive data-set (often so large that it doesn't even fit in RAM...).
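The "access once, use once" argument can be checked with a small LRU cache simulation. This is a sketch under simplifying assumptions: a fully associative cache, one address per texel, and a synthetic access stream in which each texel is sampled by four neighbouring pixels and then not touched again until the next frame. `lru_hit_rate` is a made-up helper, not part of any library.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()
    hits = 0
    for addr in accesses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(accesses)


# 10,000 texels, each sampled 4 times in a row, repeated for two frames.
frame = [texel for texel in range(10_000) for _ in range(4)]
two_frames = frame * 2

for capacity in (1, 1_000, 10_000):
    print(capacity, lru_hit_rate(two_frames, capacity))
```

In this synthetic stream a one-entry cache already captures all of the short-term reuse, and growing it a thousandfold gains nothing; the hit rate only improves once the cache holds the whole frame's data.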

Quote:
Which only works when you have enough threads to schedule between. Due to high memory access latencies, having many data-dependent branches will stall the GPU.
Data-dependent branches will *always* stall the CPU, since by definition, they can't be predicted. The GPU at least has a good chance of scheduling something else to cover this.

Quote:
Branch prediction is a necessity to keep the thread count low. Sooner or later GPUs will need it. Stalling all the time because your data is far away, is worse than mispredicting every now and then.

Not really. First of all, Sandy Bridge has an out-of-order window of 168 instructions. There's not a lot of ILP to miss. In fact Hyper-Threading only offers at most 30% speedup. If ILP was a major issue you'd expect that to be significantly more.

And again, the GPU's urge to switch to another thread to maximize ALU utilization can work against itself. The register file and cache contention create a bottleneck that is far worse than missing a bit of ILP.
ILP doesn't always exist. Any time you have difficult data-dependence, whether it comes in the form of unpredictable branches or dependent memory accesses, ILP goes out the window, no matter what fancy OOO you throw at it.

It's always better to switch to another thread than to branch predict, provided you have enough resources, since with BP, you run the risk of mispredicting and throwing away work.

Quote:
It's really not that complex. The NetBurst architecture was theoretically capable of executing four arithmetic operations per clock cycle. It had all the logic for it, 12 years ago. The effective IPC was far lower though, but this had several reasons beyond how wide the execution core was. Core 2 was less wide, but IPC went up, and it didn't cost O(N^2) in transistor budget. Haswell has four arithmetic execution ports again, but this isn't all that complex by today's standards. There's very little logic that has to scale by O(n^2) to make that happen.
That's kinda my point. In the amount of die space it takes for a single i7 core, you could fit *four* P4 cores. Not very good scaling.

Quote:
The CPU supports any programming model, including SIMT with predication.
Where are the predicated branches? You really need these to do SIMT with any sort of complex control flow.

Incidentally, predication hardware picks fights with OOO and bypassing.

Quote:

You're making (false) assumptions here. Aside from the lack of gather support, which will soon be fixed, the CPU is behind on the GPU due to difference in SIMD width. That can be fixed too while still calling it a CPU.

Yes, many differences between the CPU and GPU have disappeared one by one. But the convergence is still ongoing, and it's happening from both ends, so it's only a matter of years before those last few differences fade as well.
CPUs sacrifice general purpose use for serial performance, GPUs sacrifice it for parallel performance. The real question is which one future workloads will emphasize.

Last edited by keldor314; 03-Nov-2012 at 03:21.
Old 04-Nov-2012, 17:07   #67
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by 3dilettante View Post
The discrete market is likely mostly if not totally going away.
Agreed.
Quote:
What number of OoO cores with 2 1024-bit vector units are you saying will provide equivalent performance to what level or size of IGP?
I'll let you know once we get there.
Quote:
The IGP or throughput portion of the die will be optimized for lower leakage and dynamic power. Haswell is slightly compromising some of its CPU performance by decoupling the L3 cache from the CPU domain, in order to allow scenarios that rely predominantly on the IGP to permit it to be active while the whole CPU section is dropped to an even lower power state. For power savings alone in a consumer or media box scenario, dedicating a significant area to specialized low-speed silicon and allowing only 4 OoO cores to gate off is already compelling.
CPU cores are also becoming more optimized for lower leakage and dynamic power with every generation. And as I said before, when running a workload with heavy SIMD usage, they could be clocked lower and have their voltage adjusted. So instead of having a "throughput portion of the die", every core can become throughput optimized. Of course you lose a bit of efficiency by making it capable of achieving a high frequency, but that's a small price to pay for the added programmability and increased bandwidth efficiency. Just think of what was sacrificed to unify vertex and pixel processing, and what was gained instead.
Quote:
Your bandwidth figures may not be entirely accurate if integration continues apace.
Bandwidth will definitely (have to) go up, but the pin count is limited and high frequency doesn't help power consumption. So the architecture itself has to become more frugal with bandwidth. CPUs are very bandwidth efficient thanks to running few threads.
Quote:
What is interesting in a P4's performance with a TDP of 7 watts at 32nm?
You're attacking a straw man. I never mentioned P4 in the context of NTV.
Quote:
NTV requires 10T SRAM, so 20% more area for memory...
The paper I linked earlier presents 8T SRAM capable of operating at 0.41 Volt and reaching 5.3 GHz, while Intel's NTV Pentium used 10T SRAM that required at least 0.55 Volt and only reached 915 MHz. There's even a 6T design that operates at 0.2 Volt. That said, more transistors are needed at smaller process nodes. But that means the density is still increasing. The exact cell size depends on transistor type and layout though, so you can't put a percentage on it based on transistor count alone.

Anyhow, I still don't see any reason to assume that GPUs are significantly more suitable or will significantly benefit more from NTV design.
Quote:
What do consumers need it for? Not media creators, not enterprise admins, not engineers, not scientists, but consumers of media and entertainment? Their needs don't match the rest anymore, so their money doesn't go there. Consumers have regressed in their needs in terms of the throughput and power they require and become critical in their demand for battery life.
There's no compelling need for a CPU with 24 cores in the consumer market because they don't do anything that challenging and a CPU with 24 cores is orders of magnitude beyond acceptable in terms of power consumption.
All the same things were said about 1 GHz processors. Yet today they're unacceptably slow.

Intel has been its own competition for the last 6+ years. How do they get people to buy their hardware? By producing ever faster processors. Note that this strategy has worked really well for them. So there must be demand for higher performance. Software becomes ever more demanding, people want faster response times, and consumers don't always just consume. There's probably a million more reasons. Note also that Haswell adds another scalar integer execution port, a third AGU, doubles the cache bandwidth, extends integer AVX instructions to 256-bit, adds FMA support, adds gather support, adds hardware transactional memory and lock elision, etc. If there was little or no demand for higher performance, why add these performance features, all at once? They must realize something you don't. They create demand, by creating powerful hardware that is easy to develop software for. It's obvious that AVX2 is more developer-friendly than heterogeneous computing, and TSX's only purpose is to enable scaling to higher core counts. So sooner or later there will be demand for 24-core CPUs, when Intel makes it so.

Also note that if consumers demand ever better battery life, then why are mobile phones aggressively increasing their CPU performance, while battery life only increases slightly? Apparently their needs in terms of performance haven't regressed much at all.
Quote:
The differentiators are the custom IP blocks, DRM and security, interconnect, and media/encryption offload capabilities. On top of this are non-technical questions like what media empire the chip is attached to. Even within this, there are now mobile and non-mobile cores that differ massively.
There's convergence happening there as well. It's quickly becoming cheaper to pick an off-the-shelf SoC with a few too many features, than to design your own with just the right IP blocks (which quickly becomes outdated). It's really no coincidence that Intel announced they will be using a tick-tock development cycle for their mobile chips, and move to 14 nm as soon as possible. It's plain obvious they want to deliver a set of chips that will cover the needs of 90% of the mobile computing market a couple of years from now. This market has been in a lot of flux lately, but it's now becoming clear what the needs will be. So those "differentiators" become standard features.

Anyway, to get back to my original point: despite the fact that there are now many more form factors than the desktop PC, they all still strive for higher performance.
Quote:
Which is why Haswell's GPU is even bigger than before. It's even bigger than big for its low-power version.
Yes, GPUs keep getting bigger, but you completely fail to acknowledge that the CPU is catching up faster. Fourfold throughput per core, plus gather, in just a few years' time and with negligible transistor count increase. The GPUs can't outrun Moore's Law any more. They can only get bigger and more expensive. The large GT3 will come with a hefty price tag, in part also due to the necessity to add a chunk of eDRAM. It has the potential to wipe out the mid-end discrete mobile GPU market, but AVX2 will wipe out pretty much the entire consumer GPGPU market. So Intel's push for higher-end graphics market share doesn't change the balance. There's still GT2 and GT1, while the CPU cores make massive progress on all fronts.
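As a rough illustration of where a "fourfold throughput per core" figure can come from, here is a back-of-the-envelope sketch of peak single-precision FLOPs per cycle per core. The port configurations are assumptions chosen for illustration (two 128-bit units for an SSE-era core, two 256-bit units for AVX, two 256-bit FMA units for Haswell); the function name is hypothetical and the numbers are peak figures, not measured throughput.

```python
def peak_sp_flops_per_cycle(vector_bits, units, flops_per_lane):
    """Peak single-precision FLOPs per cycle for one core.

    vector_bits    -- SIMD register width in bits
    units          -- number of vector arithmetic units that can issue per cycle
    flops_per_lane -- 1 for a plain add or multiply, 2 for a fused multiply-add
    """
    lanes = vector_bits // 32  # 32-bit floats per register
    return lanes * units * flops_per_lane

# Assumed configurations, for illustration only.
sse = peak_sp_flops_per_cycle(128, units=2, flops_per_lane=1)       # add port + mul port
avx = peak_sp_flops_per_cycle(256, units=2, flops_per_lane=1)       # wider registers
avx2_fma = peak_sp_flops_per_cycle(256, units=2, flops_per_lane=2)  # two FMA units

print(sse, avx, avx2_fma)
print("ratio vs SSE:", avx2_fma / sse)
```

Under these assumptions, doubling the vector width and then adding FMA each contribute a factor of two, which is where the factor of four comes from.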
Quote:
Intel is very, very interested in continuing voltage scaling way below what is possible with standard circuits. It's not interesting or actively detrimental for workloads where a core's utility is heavily predicated on its straightline speed, but intensely interesting for workloads that are embarrassingly parallel.
Again that's just a hollow claim. It is no more economical to sacrifice GPU frequency than CPU frequency. Intel's graphics cores operate at the same voltage as the CPU cores. And there's no point in having one use NTV technology before the other.
Old 04-Nov-2012, 17:59   #68
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by Davros View Post
nick theres another thread on b3d (formally) saying its voxels that will be the death of the GPU what are your thoughts on that ?
First of all, I wouldn't make it all dramatic by calling it the "death" of the GPU. AVX2 is bringing a significant amount of GPU technology into the CPU cores, and there's bound to be much more to come. When the CPU and GPU unify, it won't be the death of the GPU any more than the death of the CPU as we know them. I'd like to think of it as the birth of something far superior to both.

That said, I don't think voxels will be the single motivation behind the unification of CPU and GPU technology. Polygon rasterization is here to stay for a very long time, if not indefinitely. But eventually we'll have TFLOPS of generic computing power, and people will no doubt use it for innovative graphics purposes, and other high throughput applications.
Old 05-Nov-2012, 02:06   #69
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by keldor314 View Post
Hence the use for a medium size cache. But where does the 32MB cache help when you only have 8 or so active threads?
What 32 MB processor running only 8 threads are you talking about exactly?
Quote:
In any case, a high end GPU currently has close to 200 GB/s memory bandwidth, but a 6-core i7 tops around 40 GB/s.
Nobody's arguing that. But the integrated graphics market is growing at the expense of the discrete market. So you have to take into account that future architectures will have to deal with the CPU socket's bandwidth limitations. It will grow with DDR4 and such, but transistor density grows faster, while at the same time the compute density increases with wider SIMD units, FMA, and gather.
Quote:
Did I mention that the GPU costs around $600, but the 6-core i7, once you throw in the motherboard and memory, is north of $1000?
That's not even remotely a fair comparison. First of all, a 6-core i7 is in the same market as a professional graphics card of over $2000. They're both overpriced because that's what that market segment will currently pay for them, not because that's what they have to cost. So let's not compare overpriced parts to make conclusions about the technology.

Secondly, the GPU is completely helpless without the CPU. It needs it to run the graphics driver, the application, and the operating system. It also dearly needs the motherboard and system RAM for the same reasons. And you also have to take into account that the CPU can do way more than that. So you can't just compare them without properly taking that into account.
Quote:
You need the same amount of total bandwidth whether you do it multithreaded or single threaded.
No. Using more threads causes cache contention, which means there will be more cache misses, which means it takes more bandwidth to fill them.
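The contention effect can be sketched with a toy model: a single shared LRU cache, and threads that each loop over their own private working set with their accesses interleaved round-robin. Everything here (the function name, the cache size, the working-set sizes, the access pattern) is a hypothetical illustration, not a model of any real processor.

```python
from collections import OrderedDict

def count_misses(num_threads, lines_per_thread, cache_lines, passes=10):
    """Count misses in a shared LRU cache under round-robin thread interleaving."""
    cache = OrderedDict()  # keys are (thread, line); insertion order tracks recency
    misses = 0
    for _ in range(passes):
        for line in range(lines_per_thread):
            for thread in range(num_threads):  # one access per thread, in turn
                key = (thread, line)
                if key in cache:
                    cache.move_to_end(key)         # hit: mark as most recently used
                else:
                    misses += 1                    # miss: line must be fetched
                    cache[key] = True
                    if len(cache) > cache_lines:
                        cache.popitem(last=False)  # evict least recently used line
    return misses

for threads in (2, 4, 8):
    accesses = threads * 64 * 10
    m = count_misses(threads, lines_per_thread=64, cache_lines=256)
    print(threads, "threads:", m, "misses out of", accesses, "accesses")
```

In this toy setup the miss count stays at the cold-miss minimum while the combined working set fits in the cache, and then every access misses once the threads together exceed it. Real caches and access patterns are far less extreme, but the direction of the effect is the same.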
Quote:
Bypassing is very expensive, especially when you have superscalar, SIMD, or long pipelines. It's usually cheaper to just double the size of the register file and be done with it, assuming you can afford to lose binary compatibility.
Actually current GPU architectures have something similar to a bypass network, for operand staging purposes. This is very valuable since it can deal with register file bank conflicts, and it keeps register files smaller by reducing latency and therefore the thread count. It's only a matter of time before GPUs will use a true bypass network. It's really not that expensive, and well worth the improvement in register file size per thread, cache coherency, and bandwidth savings.
Quote:
The main problem though is that memory, even with caches, is very high latency to access. Latency of 10/100/1000 cycles for L1/L2/RAM is fairly typical.
Actually for Sandy Bridge it's 4/12/40/215 for L1/L2/L3/RAM. And that's worst case, fully random. And of course prefetching brings down the average latency considerably.
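To put those numbers in perspective, here is a small sketch of how the average access latency follows from the per-level latencies and the fraction of accesses served by each level. The latencies are the Sandy Bridge figures quoted above; the two access distributions are made-up values for illustration, not measurements.

```python
# Load latencies in cycles, as quoted above for Sandy Bridge.
LATENCY = {"L1": 4, "L2": 12, "L3": 40, "RAM": 215}

def average_latency(served_by):
    """Weighted average latency; served_by maps level -> fraction of accesses served there."""
    assert abs(sum(served_by.values()) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(LATENCY[level] * fraction for level, fraction in served_by.items())

# Hypothetical distributions, chosen only to show the spread.
cache_friendly = {"L1": 0.90, "L2": 0.06, "L3": 0.03, "RAM": 0.01}
cache_hostile = {"L1": 0.50, "L2": 0.10, "L3": 0.10, "RAM": 0.30}

print(average_latency(cache_friendly))
print(average_latency(cache_hostile))
```

With the first distribution the average stays in single digits of cycles, while the second is dominated by the RAM term, which is the case prefetching and low thread counts try to avoid.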
Quote:
Any time you hit a dependent memory pattern (hello there, linked list!), performance will tank unless you have a large number of other instructions in between.
That greatly depends on the size of the linked list. Anything used in graphics will most likely fit in the lower cache levels and thus can easily be scheduled around. And then there's still 2-way SMT for the tougher cases.
Quote:
You can either schedule in more threads in hardware, or interleave work items in software. But once you have to interleave, you've defeated the entire point of most of the serial performance hardware on a CPU.
I wouldn't put it that way. Future architectures will have to deal with various workloads. With a CPU-like architecture, high serial performance is the default, and you can hide latency with software pipelining. With a GPU-like architecture, you don't get that choice. There are numerous OpenCL workloads where the CPU runs circles around the iGPU, and that's before AVX2.
Quote:
There are cases where branches are pretty much impossible to predict. Think data-dependent switch statements, for instance. There's just not enough information available to do better than a guess.
Sure, but these are extremely rare, especially in graphics, and again there's still 2-way SMT.

Also, an unpredictable branch means a highly divergent one, which is not wide-SIMD friendly. Some things in computing are just hard no matter what architecture you use. But as I said before, mispredicting a branch once in a while is not as bad as running too many threads and having low cache hit rates and thus hitting the bandwidth wall.
Quote:
As for graphics, many complex algorithms are highly data-dependent. For instance, in ray-tracing, traversing the acceleration structure is highly data-dependent, both in branches and memory access. This makes branch prediction impossible, as well as preventing OOO from helping, not to mention destroying a large degree of ILP. High performance CPU ray-tracers actually operate on packets of rays, in very much of a GPU style execution pattern, but then why waste resources on the OOO and BP logic?
Because a bit of everything ends up being more efficient. There are no silver bullets. SMT is useful, but only in moderation, and out-of-order execution and branch prediction help keep the thread count low. That way much of the data needed by a few threads can reside in the caches, and this ends up being faster than accessing RAM every few instructions, for which there is either insufficient bandwidth or insufficient storage to cover the latency.
Quote:
Graphics is also unfriendly to caches, since you tend to have an access once, use once pattern.
Actually tile based rendering saves a lot of bandwidth by 'caching' the frame buffer. Texture accesses have also evolved from being purely geometric surface parameters, to a wide variety of more generic data lookups, some of which are nicely cacheable.
Quote:
A larger, higher level cache isn't likely to help much more until the point where it becomes large enough to contain the entire scene, which is to say, the entire size of main memory, since we try to use as high resolution textures as will fit.
It really doesn't have to be that large. Mipmapping limits the amount of data that is accessed, which is about one unique texel per pixel, per lookup.
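The "one unique texel per pixel, per lookup" rule of thumb gives a quick estimate of the texture data touched per frame. The sketch below assumes a 1920x1080 render target and 4 bytes per texel, and it ignores overdraw, filtering across mip levels, and texture compression; the function name and all parameter values are illustrative assumptions.

```python
def texel_working_set_bytes(width, height, bytes_per_texel=4, lookups_per_pixel=1):
    """Rough per-frame texture working set under the one-texel-per-pixel rule of thumb."""
    return width * height * bytes_per_texel * lookups_per_pixel

one_lookup = texel_working_set_bytes(1920, 1080)
four_lookups = texel_working_set_bytes(1920, 1080, lookups_per_pixel=4)

print(one_lookup / 2**20, "MiB for one lookup per pixel")
print(four_lookups / 2**20, "MiB for four lookups per pixel")
```

Under these assumptions the per-frame footprint is in the range of megabytes to a few tens of megabytes, nowhere near the full size of the textures resident in memory.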
Quote:
This pattern also holds for many types of HPC problems. For instance, pretty much any sort of physical simulation looks at a small set of nearby elements, but operates over a massive data-set (often so large that it doesn't even fit in RAM...).
Actually there are cache-aware algorithms for things like partial differential equations.
Quote:
It's always better to switch to another thread than to branch predict, provided you have enough resources, since with BP, you run the risk of mispredicting and throwing away work.
No. Switching threads will evict cache lines, which can be much worse than throwing away the work of a rare mispredicted branch.
Quote:
That's kinda my point. In the amount of die space it takes for a single i7 core, you could fit *four* P4 cores. Not very good scaling.
With Haswell, the peak floating-point throughput per core will be eight times higher than a P4 core. Effective throughput will be even higher due to a higher average IPC.
Quote:
Where are the predicated branches? You really need these to do SIMT with any sort of complex control flow.
No you don't.
Quote:
CPUs sacrifice general purpose use for serial performance, GPUs sacrifice it for parallel performance. The real question is which one future workloads will emphasize.
Neither and both. When you have to categorize workloads to run them on one or the other, you inherently lose performance. Not to mention it's hell to have to deal with many different configurations. So the solution is to have a single unified homogeneous architecture which can dynamically adapt to any workload. AVX2 is a significant step in that direction.
Old 05-Nov-2012, 06:05   #70
3dilettante
Senior Member
 
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,074
Default

Quote:
Originally Posted by Nick View Post
I'll let you know once we get there.
Do you feel it is within the next 3 or so process nodes?
I guess I'll try to bookmark this thread for whenever that is.

Quote:
CPU cores are also becoming more optimized for lower leakage and dynamic power with every generation.

And as I said before, when running a workload with heavy SIMD usage, they could be clocked lower and have their voltage adjusted.
This has been factored into the extrapolations made by HPC projects, Intel, and most everyone else.
That existing curve for regular cores with standard circuit and gate choices has been judged insufficient. Even specialized hardware was put against the demanded power/performance and found insufficient.
The big quadratic or cubic factor in power is voltage, and to really get the advance needed, Intel is actively exploring a method that allows circuits to function close to the point that silicon becomes fundamentally unreliable, and the choices it made to counteract the challenges at that level differ from what it does trying to get silicon to twitch at 4 GHz.
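The "quadratic or cubic" remark follows from the usual first-order model of dynamic power, P = C * V^2 * f: quadratic in voltage at a fixed clock, and roughly cubic if the clock is lowered in proportion to the voltage. The sketch below uses that simplified model only; it ignores leakage and the fact that frequency falls off much faster than linearly close to the threshold voltage, and the function names are hypothetical.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order dynamic (switching) power: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

def relative_power(voltage_scale, frequency_tracks_voltage=True):
    """Power relative to the baseline when voltage is scaled by voltage_scale.

    If frequency_tracks_voltage is True, frequency is assumed to scale
    linearly with voltage (cubic behaviour); otherwise frequency is held
    constant (quadratic behaviour).
    """
    frequency_scale = voltage_scale if frequency_tracks_voltage else 1.0
    return dynamic_power(1.0, voltage_scale, frequency_scale)

print(relative_power(0.5, frequency_tracks_voltage=False))  # quadratic case
print(relative_power(0.5))                                  # cubic case
```

In this model, halving the voltage cuts power to a quarter at constant clock and to an eighth if the clock halves as well, which is why voltage is the lever everyone wants to pull.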


The same core does not look like it is going to be able to drop to near-threshold if it is also expected to run normally. The logic needs to be refactored, and the gates themselves are pushed to allow for decent performance at very low voltage. Some of those choices, such as increasing the transistors per pipe stage and choosing to replace high Vt gates with nominal threshold devices, hurt clock speed and hurt known leakage-control measures that are important to cores that run at >1V.

Quote:
So instead of having a "throughput portion of the die", every core can become throughput optimized.
My point is that this is being strongly hinted by ongoing research as being akin to "having your cake and eating it too".

Quote:
Bandwidth will definitely (have to) go up, but the pin count is limited and high frequency doesn't help power consumption.
2.5D integration or 3D integration change the dynamic by removing pins and going to traces/vias. This also has the upshot of dropping the power per IO by an order of magnitude and the area cost of the PHYs. 2.5D at least seems to be coming down the pipe some time in the near to medium term, with Intel likely introducing a low-power Haswell variant with a memory on interposer design.

Quote:
You're attacking a straw man. I never mentioned P4 in the context of NVT.
Relative to the last Pentium sold, the P4 is the first core to get near 4GHz. Your argument was that the GHz at a given voltage point would be proportionate between the low-clocked core and a multi-GHz one. The upper range of the short-pipeline Pentium at 1.2V was brought up to indicate that there was a vast amount of headroom.
I'm stating with an example that a core that is already up in the stratosphere doesn't have the ability to sit in place and maybe clock higher, and that it's likely not interesting in the range it has to drop to.

Quote:
The paper I linked earlier presents 8T SRAM capable of operating at 0.41 Volt and reaching 5.3 GHz, while Intel's NTV Pentium used 10T SRAM that required at least 0.55 Volt and only reached 915 MHz.
Intel disclosed a whole processor at NTV and analyzed the decisions made to produce a functioning and reliable pipeline. SRAM arrays or ALUs have been downvolted and upclocked in isolation.
Did Intel explicitly state that the SRAM could not scale below .55V? Is it possible that lower voltages are not mentioned because when integrated into a processor this is counterproductive?

Quote:
Anyhow, I still don't see any reason to assume that GPUs are significantly more suitable or will significantly benefit more from NTV design.
There are GPUs that operate at sub-GHz speeds and are considered adequate or interesting for their workload. Their pipelines are simple, and some are nearly as short as that of the NTV Pentium.
They have leeway to give up at least some straightline speed for an EP workload.

A high-performance core that exists to provide top of the line straightline performance stops being interesting for the workload if it gives up straightline speed.

Quote:
All the same things were said about 1 GHz processors. Yet today they're unacceptably slow.

Intel has been its own competition for the last 6+ years. How do they get people to buy their hardware? By producing ever faster processors. Note that this strategy has worked really well for them. So there must be demand for higher performance.
They're still pretty common in consumer mobile devices. Even the ones that clock a bit higher are pretty weak. They get the cash, though.


Performance is not universally demanded. What Intel provided earlier was greater utility for the customer. Back when silicon scaling didn't drop off a cliff and we had kilobytes of storage, greater performance had an immediate increase in the utility that Intel's product had over preceding designs for pretty much every portion of its market and every market it expanded into.

Utility is what drives purchases, or planned obsolescence and "blue crystals".
The GPU in the latest Intel chips provided far more utility than another 4-6 CPU cores for consumers and corporate procurement. Highly performant cores that outstripped consumer and business demand have flatlined the utility argument for PC refreshes, and in mature markets the upgrade cycle for the generic programmable CPU has grown longer and longer.

A chip with special checkbox features, features that become obsolete, or have proprietary advantage with little technical justification is preferable in certain markets.


Quote:
Software becomes ever more demanding, people want faster response times, and consumers don't always just consume.
If you lump in the sales of smartphones, tablets, and small-form computers, the consumer market's dollars per unit of CPU performance has taken a sharp hit.
In terms of core count, mobile users grudgingly went above two cores to 4-6, then dropped back down to 1-2, and are now sort of trying to justify going to mainstream 4 cores by 20nm.

I think that the time period around 20nm was predicted to have mainstream octocores or somesuch a year or so ago.

Quote:
If there was little or no demand for higher performance, why add these performance features, all at once? They must realize something you don't.
They added these features because they can, and must for markets that still care about performance. Given the lead-time on architectural decisions, it's also the case that some of these choices were baked in before a lot of market movement shifted in a different direction.
For markets that aren't entirely hostile to a core with the cost or power consumption of a Haswell device, it saves on engineering to use a core in multiple products in various markets.

In markets that don't meet the pricing and power consumption of Haswell, Intel has built a specialized line of cores.
To counter, if Intel cared so much about these features, why does it fuse off so many features in half its SKUs?
Depending on the entry in the decoder wheel, you can get a core with or without multithreading, with or without encryption acceleration, with or without graphics, with or without virtualization, and with or without a large number of cores.
Surely if it cared about generating demand for 24-core chips it would just sell 24 core chips.

Or maybe it's about the marketing of a product and segmentation to generate the needed margins, and 24-core chips are not a promising direction.



Quote:
They create demand, by creating powerful hardware that is easy to develop software for. It's obvious that AVX2 is more developer-friendly than heterogeneous computing, and TSX's only purpose is to enable scaling to higher core counts. So sooner or later there will be demand for 24-core CPUs, when Intel makes it so.

Also note that if consumers demand ever better battery life, then why are mobile phones aggressively increasing their CPU performance, while battery life only increases slightly? Apparently their needs in terms of performance haven't regressed much at all.
There is demand for 24-core CPUs now, in certain specific areas.
Let me know when we get there for the consumer market.

I would suggest comparing a phone's processor cores to an Intel CPU from seven years ago and explaining why there wasn't a regression.
The GPU portion might have a better argument.


Quote:
There's convergence happening there as well. It's quickly becoming cheaper to pick an off-the-shelf SoC with a few too many features, than to design your own with just the right IP blocks (which quickly becomes outdated).
The part in parentheses is not always a bad thing for those selling the kit.

Quote:
Anyway, to get back to my original point: despite the fact that there are now many more form factors than the desktop PC, they all still strive for higher performance.
They moved the goalposts.
I can go back to elementary school and strive for graduation again, too.
Maybe they're striving for something, but performance is not the primary driver.

Quote:
Yes, GPUs keep getting bigger, but you completely fail to acknowledge that the CPU is catching up faster. Fourfold throughput per core, plus gather, in just a few years' time and with negligible transistor count increase.
Negligible by whose count?
Does it help that there's been negligible decrease in TDP at the same time?

Quote:
The GPUs can't outrun Moore's Law any more. They can only get bigger and more expensive. The large GT3 will come with a hefty price tag, in part also due to the necessity to add a chunk of eDRAM.

Again that's just a hollow claim. It is no more economical to sacrifice GPU frequency than CPU frequency. Intel's graphics cores operate at the same voltage as the CPU cores. And there's no point in having one use NTV technology before the other.
I'm not sure it's going to be eDRAM, and the reasons for it are more long-term than just graphics.
It does help that graphics is a workload that still may garner the margins necessary to offset the risks and high costs of the initial effort.
More general use of the tech is not ready for prime time, and it's one of the few use cases where consumers may be able to notice the difference in terms of experience and battery life versus a solution that does not have it.

A GPU at 300-700 MHz is acceptable, and the NTV Pentium works in that range. There's decent demand for GPUs with that speed range right now for mainstream laptops, value desktops, and mobile and embedded.
__________________
Dreaming of a .065 micron etch-a-sketch.
Old 05-Nov-2012, 07:27   #71
Blazkowicz
Senior Member
 
Join Date: Dec 2004
Location: Toulouse
Posts: 4,141
Default

Quote:
Originally Posted by Nick View Post
That's not even remotely a fair comparison. First of all, a 6-core i7 is in the same market as a professional graphics card of over $2000. They're both overpriced because that's what that market segment will currently pay for them, not because that's what they have to cost. So let's not compare overpriced parts to make conclusions about the technology.
Wow, a 6-core i7 + mobo is twice the cost of a 4-core i7 + mobo, but it's still something an individual can work out. It would be about the $1000 mark with 16GB, a 350W PSU, a low-end vid card and one small fixed drive. (hmm, I believe that would be good for content creation, add more storage as needed but that's not the topic?)

That 4-core i7 is overpriced too, because disabling the SMT to make an i5 from it is dubious. I'm sure nearly every CPU is overpriced. It's just not a 5x markup for less performance, as with professional cards.

Last edited by Blazkowicz; 05-Nov-2012 at 07:34.
Old 08-Nov-2012, 03:45   #72
keldor314
Junior Member
 
Join Date: Feb 2010
Posts: 90
Default

One thing to note is that most games are currently optimized for XBOX 360 and PS3 level hardware. Keep in mind that those are nearly a decade old! If you went back the same amount of time from there, you'd find the SNES just being replaced by the N64, and the first Playstation! Quake, acclaimed as the first game with true 3D graphics, with triangles and stuff, had just appeared! So it's not surprising that even a very low end system can handle things well enough, hence the rise of integrated graphics. Now, when the next XBOX and the PS4 finally appear, things are going to change.

Imagine, a current high end CPU is nearly as powerful as a GPU from 7 years ago. Let the GPU world tremble.

Last edited by keldor314; 08-Nov-2012 at 03:54.
Old 08-Nov-2012, 10:59   #73
Davros
Darlek ******
 
Join Date: Jun 2004
Posts: 9,495
Default

Maybe on paper, but look at some Radeon 9800 Pro UT2004 benchmarks
http://forums.epicgames.com/threads/...004-Benchmarks

an i7 running swiftshader wouldn't come close
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™
Old 08-Nov-2012, 14:39   #74
Nick
Senior Member
 
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
Default

Quote:
Originally Posted by 3dilettante View Post
Do you feel it is within the next 3 or so process nodes?
I guess I'll try to bookmark this thread for whenever that is.
Why this obsession with trying to put a date on it? It's not as if everything becomes meaningless if the steps that will lead to the unification of CPU and GPU happen a day later than I anticipate.

I can tell you though that I was quite surprised that Intel will introduce 256-bit integer operations, FMA, and gather support, all in one generation and as early as next year. And it was yet another pleasant surprise that Haswell will have two FMA units, with no latency penalty, a third AGU, and a fourth integer execution port to offload the vector ports. And let's not forget TSX. I expected these features to be introduced over the course of several architecture generations. So things are actually happening much faster than anticipated.

There's still a long way to go, but at least the discussions are no longer stuck on fundamental things like the feasibility of gather support or half-decent floating-point throughput. The convergence is strong, and we're starting to get to the finer details of discussing what it would take to unify them.
Quote:
This has been factored into the extrapolations made by HPC projects, Intel, and most everyone else.
I was mainly talking about the desktop market, where unification is the most meaningful due to diverse workloads. Diverging the discussion to the HPC market doesn't help you make your point though. Intel isn't about to gift wrap the HPC market and hand it to the GPU manufacturers. Many of the AVX extensions were inspired by the needs of the HPC market. Their NTV research, funded by a US government grant, also concentrates on HPC. So that's quite a bold claim you're making, and I can't find any evidence of it. Where's the GPU manufacturer's NTV research?

Also note that the top supercomputer to date, the IBM Sequoia, is a GPU-less design. It is more power efficient than the top CPU-GPU design, which takes spot number five. It's clear that it won't take a lot of GPU technology to be integrated into the CPU cores, to keep it that way. Heterogeneous computing was a short-lived hype.
Quote:
My point is that this is being strongly hinted by ongoing research as being akin to "having your cake and eating it too".
Could you please point me to this ongoing research? I have yet to come across reputable research which suggests combining traditional CPU cores with very wide SIMD units and using core-by-core dynamic frequency/voltage regulation based on the workload type.

By the way, note that Intel is already dynamically switching between using only the lower 128-bit execution units for AVX (issuing them in two cycles), or the full 256-bit. When not executing 256-bit AVX code, the core can put the upper 128-bit lane to sleep. As far as I know they're not yet adjusting the frequency or voltage, but it's obviously a first step towards cores that adjust to the workload.
Quote:
2.5D integration or 3D integration change the dynamic by removing pins and going to traces/vias. This also has the upshot of dropping the power per IO by an order of magnitude and the area cost of the PHYs. 2.5D at least seems to be coming down the pipe some time in the near to medium term, with Intel likely introducing a low-power Haswell variant with a memory on interposer design.
Sure, but it's only a one-off solution to avoid hitting the memory wall. It's essentially just another cache level. Which is a reasonably cost effective way to scale another one or two process nodes, but it's not a permanent solution. It's quite ironic that GPU manufacturers used to laugh at how much die space CPU designs were spending on cache memory. Now they need massive caches themselves, which don't come for free. Again it's all just inevitable convergence. Logic gets cheaper faster than bandwidth. So you can spend those transistors by making things more programmable, more CPU-like, and thus also making it more bandwidth efficient.
Quote:
Relative to the last Pentium sold, the P4 is the first core to get near 4GHz. Your argument was that the GHz at a given voltage point would be proportionate between the low-clocked core and a multi-GHz one. The upper range of the short-pipeline Pentium at 1.2V was brought up to indicate that there was a vast amount of headroom.
I'm stating with an example that a core that is already up in the stratosphere doesn't have the ability to sit in place and maybe clock higher, and that it's likely not interesting in the range it has to drop to.
I said there's plenty of headroom to achieve 4 GHz as well as NTV operation for current architectures, with the next process node. The NetBurst architecture isn't interesting by today's metrics, so why expect it to become interesting with NTV technology? Again, that's just a straw man you created. You're not any closer to proving that NTV is going to be any more valuable to GPU architectures than CPU architectures.
Quote:
Intel disclosed a whole processor at NTV and analyzed the decisions made to produce a functioning and reliable pipeline. SRAM arrays or ALUs have been downvolted and upclocked in isolation.
Did Intel explicitly state that the SRAM could not scale below .55V? Is it possible that lower voltages are not mentioned because when integrated into a processor this is counterproductive?
SRAM samples are pretty much on the mark for what to expect in production. The difference has to be due to higher susceptibility to process variation at smaller process nodes. This is especially true for SRAM, which has a highly custom layout and pushes the boundaries of the design rules to achieve maximum density (which directly relates to cost). So it all depends on what balance they want between power consumption, reliability, and cost. And since this is an early prototype, they may simply have aimed too high or too low on a few parameters, depending on your point of view.

My point was that you don't necessarily need 10T SRAM to achieve NTV operation, or achieve any benefit at all. It's a pretty wide design space. You don't necessarily need NTV in the strict sense to vastly improve performance/Watt. The NTV Pentium has its optimal performance/Watt at 0.45 Volt for logic, which isn't all that far from the 0.55 Volt required to keep the SRAM reliable, but most importantly it's not exactly near the threshold voltage. Pushing for full NTV operation isn't worth the drop in absolute performance for any consumer market any time soon. But Intel has hugely benefited from its 8T SRAM design for years now, despite not counting as NTV design. So I expect we'll simply ease into it over the course of many years. But there are no signs of GPUs adopting this technology any faster than CPUs.
Quote:
There are GPUs that operate at sub-GHz speeds and are considered adequate or interesting for their workload. Their pipelines are simple, and some are nearly as short as that of the NTV Pentium.
They have leeway to give up at least some straightline speed for an EP workload.

A high-performance core that exists to provide top of the line straightline performance stops being interesting for the workload if it gives up straightline speed.
You're starting to cherry-pick again. "There are" low frequency CPUs too that are considered adequate for their job. But let's compare things in the same market segment. Even my desktop's i7-2600 operates between 3.8 GHz and 1.6 GHz. I consider that quite a bit of "leeway" to give up straight-line speed. I doubt the iGPU's frequency delta is much different, and it probably gets adjusted under similar circumstances. Intel's 10-core chips also operate at only about 2 GHz. Same architecture as the ones aiming at 4 GHz.
Quote:
Performance is not universally demanded.
Yes it is. Supply creates its own demand. Your next PC, tablet, laptop, workstation or phone will be more powerful than your previous one, simply because there will be faster hardware that is cheaper. It's more of a 'push' demand than a 'pull' demand, but it's demand nonetheless. Anyone offering slower products than the competition will not get that demand. So it's easy to conclude there is demand for higher performance.
Quote:
Back when silicon scaling didn't drop off a cliff and we had kilobytes of storage, greater performance had an immediate increase in the utility that Intel's product had over preceding designs for pretty much every portion of its market and every market it expanded into.
First of all, silicon scaling isn't dropping off a cliff. Secondly, I don't think that the utility of higher performance was significantly better in the past. It's not like anyone was interested in playing Pong at a million FPS, or wanted missile trajectories to be computed faster than they could be printed. Of course in hindsight they probably wanted to play Battlefield 3, and have missiles seek their target autonomously. But hindsight is 20-20, and they just didn't have the software or the peripheral hardware to make good use of faster processors at that time. Yet it still evolved into the pocket supercomputers we have today. Smartphones only came to be when Apple combined a high-accuracy capacitive touch screen with an innovative and intuitive user interface. The race for higher mobile CPU performance has been on ever since. But these things are all interdependent. You have to put the foot that's behind in front of the other to move forward.

The only major glitch in this cycle is that currently multi-threading is uninviting to the average developer. It's way too much effort for too little gain. We need a change in hardware support for synchronization primitives, a change in programming methodology, and a change in market adoption. All of these things take time. It will be 8 years between the first commodity dual-core processors, and the ability to atomically transfer more than one word of information between cores in an efficient manner. It's insane how much has already been achieved without that fundamental capability, which illustrates there's no lack of trying. Likewise, we've slowly but surely seen more than two cores become the standard, increasing the incentive for developers to take advantage of it and stay ahead/on par with the competition. But like I said before, the biggest change is yet to come, with new programming paradigms and a matching tool chain. This can easily take several decades. But rest assured that our grandchildren will look at Battlefield 3 the same way we look at Pong, and they won't have to deal with the limitations of heterogeneous computing.
Quote:
I think that the time period around 20nm was predicted to have mainstream octocores or somesuch a year or so ago.
Yes, and AMD dove into that headfirst. They strived for high core count (TLP) with Bulldozer, at the expense of ILP and DLP. It should be ILP > DLP > TLP, in that order, to be programmer-friendly and maximize real-world performance. The average IPC (cf. ILP) has doubled since the early quad-cores, so there's no need for an octa-core just yet, but you do get that level of theoretical performance. Likewise, the theoretical floating-point vector performance (cf. DLP) per core will have quadrupled with Haswell, augmented by a frequency increase, IPC increase, and parallel memory gather performance. So they've created something vastly better than a straightforward octa-core. Increasing the core count is a last resort, when they're out of other ideas. But they're already preparing to make that more scalable with TSX.
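The "quadrupled" figure for per-core vector throughput can be checked with simple arithmetic. This is a simplified sketch, not a benchmark: it counts peak single-precision FLOPs per cycle per core as lanes × execution units × FLOPs per lane, and the function name and parameters are made up for illustration. It assumes the commonly cited configurations of one add plus one multiply unit for 128-bit SSE and 256-bit AVX, and two fused multiply-add units for Haswell.

```python
# Peak single-precision FLOPs per cycle per core under a simplified model.
# Real throughput depends on memory, port contention and instruction mix.

def peak_flops_per_cycle(vector_bits: int, units: int,
                         flops_per_lane: int, element_bits: int = 32) -> int:
    """lanes * execution units * FLOPs each unit performs per lane per cycle."""
    lanes = vector_bits // element_bits
    return lanes * units * flops_per_lane

# 128-bit SSE: one add unit + one multiply unit, 1 FLOP per lane each.
sse = peak_flops_per_cycle(128, units=2, flops_per_lane=1)
# 256-bit AVX: same two units, twice the width.
avx = peak_flops_per_cycle(256, units=2, flops_per_lane=1)
# Haswell: two FMA units, each doing a multiply and an add (2 FLOPs) per lane.
avx2_fma = peak_flops_per_cycle(256, units=2, flops_per_lane=2)

print(f"SSE: {sse}, AVX: {avx}, AVX2+FMA: {avx2_fma}")
print(f"Haswell vs SSE-era core: {avx2_fma // sse}x per cycle")
```

That per-cycle factor comes before any frequency or IPC gains, and it applies per core, so it multiplies with the core count rather than replacing it.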
Quote:
They added these features because they can, and must for markets that still care about performance. Given the lead-time on architectural decisions, it's also the case that some of these choices were baked in before a lot of market movement shifted in a different direction.
That doesn't explain why they'd add it all at once. Given their dominance in "markets that still care about performance", whatever you mean by that, they could have leaned back and introduced these features over multiple chip generations. The reason they didn't do that, must be because they realize they have to push the envelope of technology to spark continuous demand. AVX2 and TSX will not be relevant to a large portion of consumers next year, but they will be hugely relevant several years later, so they have to make Haswell attractive today by increasing the IPC as well and by lowering the power consumption. Also, I doubt any shift in market movement makes a difference. AVX2 is all about increased performance/Watt for parallel workloads, and this is just as important, if not more so, for where the market is heading. So unlike how you seem to portray it, I sincerely doubt they regret adding any of these features.
Quote:
The part in parentheses is not always a bad thing for those selling the kit.
Short term, maybe, but not in the long term. Planned obsolescence of this kind only works out when there's little competition. There is fierce competition right now, so consumers figure out which products stay relevant the longest and try to steer clear of the rest, even though those might be cheaper. Just look at Apple's reputation and their price tag. There's still money to be saved by designing a custom SoC, but not for long. As we run out of checkbox features, only huge companies like Apple can afford to create something custom (and not for the purpose of removing features but rather to up the ante even more), while everyone else has to pick a cheap full-featured off-the-shelf part to have any consumer acceptance. I'm pretty sure the latter is what Intel is aiming for with its mobile tick-tock strategy and push for 14 nm.
Quote:
Negligible by whose count?
Does it help that there's been negligible decrease in TDP at the same time?
Negligible by the GPU manufacturer's count. Intel will have vastly increased the effective vector throughput per CPU core, relative to their area. GPU manufacturers can only dream about such an increase now. They've maximized the compute density for years, and they even overshot things by rarely achieving peak throughput (which I realize is a valid tactic, but they've really reached the end of it now). So I will conclude it once more: the CPU is catching up. As for TDP, Haswell is setting a new low and this progress isn't negligible. The fact that desktop CPUs may not lower TDP is irrelevant. A desktop GPU that doesn't consume the accepted TDP for its class (which also takes the thermal solution cost and noise production into account), is leaving performance on the table. So why expect anything different from desktop CPUs?
Quote:
I'm not sure it's going to be eDRAM, and the reasons for it are more long-term than just graphics.
It does help that graphics is a workload that still may garner the margins necessary to offset the risks and high costs of the initial effort.
Call it what you want, it's a big cache intended to overcome the bandwidth issue, and it's not free. It's just cheaper than the alternative of increasing pin count or frequency, and more power efficient. I assume that's what you were referring to, so I fully agree on that aspect. And I also agree GT3 is a field experiment, offset by high margins. But my point was that it represents yet more convergence between the CPU and GPU. They now both need large caches. It's inevitable that eventually other CPU-like technology will (have to) find its way to the GPU.
Quote:
A GPU at 300-700 MHz is acceptable, and the NTV Pentium works in that range. There's decent demand for GPUs with that speed range right now for mainstream laptops, value desktops, and mobile and embedded.
Yes, but that's a drop in frequency versus the desktop GPU. It doesn't mean the desktop GPU will adopt NTV technology any time soon and drop its frequency to that range. You'd need a humongous die size to make up for that. Like I said, that is not economical. If NTV is adopted now, it's to extend things into a new ultra-low power market, or to lower idle power. The lowering of the nominal voltage will be far more gradual to keep an interesting balance. And it's not looking like CPUs are left out of any of the advantages due to a higher nominal frequency.
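To put a number on the "humongous die size" argument, here is a back-of-the-envelope sketch. It assumes throughput scales linearly with execution units × clock frequency and that die area grows roughly with the unit count. The baseline of 1000 units at 1000 MHz is hypothetical; only the 300-700 MHz range comes from the discussion above.

```python
import math

def units_needed(baseline_units: int, baseline_mhz: float, target_mhz: float) -> int:
    """Execution units required at target_mhz to match baseline throughput,
    assuming throughput = units * frequency (no other bottlenecks)."""
    return math.ceil(baseline_units * baseline_mhz / target_mhz)

BASELINE_UNITS = 1000  # hypothetical desktop GPU unit count
BASELINE_MHZ = 1000    # hypothetical desktop GPU clock

for mhz in (700, 500, 300):
    units = units_needed(BASELINE_UNITS, BASELINE_MHZ, mhz)
    print(f"{mhz} MHz: {units} units ({units / BASELINE_UNITS:.2f}x the baseline)")
```

At the bottom of that range the part needs more than three times the units for the same throughput, which is the cost that has to be weighed against the power saved.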
Old 08-Nov-2012, 19:36   #75
Sulik
Junior Member
 
Join Date: May 2007
Location: Vault 13
Posts: 11
Default

We've pretty much hit the wall in terms of serial execution performance. So the only room to grow is through parallel execution: more cores and wider SIMDs. Future CPUs will have to adopt GPU-like features, and GPUs will have to adopt CPU-like features (a more complex cache hierarchy to avoid going off-chip).

It's neither the death of the GPU nor the death of the CPU. CPUs will continue to be more efficient for serial processing and GPUs will continue to be more efficient for embarrassingly parallel problems.

As always, the most efficient solution will be dedicated, non-programmable silicon if a particular problem warrants the cost (which today is predominantly the case for things like video codecs and crypto functionality).

Tags
3d rendering, software based rendering, the future of 3d
