|
|
#51 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,495
|
Last time I ran Outcast there was a 1024x768 patch.
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#52 | ||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
It's a very interesting technology for lowering the power consumption during standby operation of mobile devices, but as soon as you want some real work to be done the voltage and frequency have to swing up to deliver cost-effective performance. Note that Intel already adopted an 8T SRAM design, precisely to allow lowering the voltage when parked. This kind of technology doesn't apply any more or any less to the GPU or CPU. Quote:
Also note that clock frequency and voltage can be regulated on a core by core basis. When the workload is SIMD-intensive, the operating parameters can be adjusted. And again, executing very wide vector operations on less wide execution units reduces the switching activity elsewhere. So there's no reason why such a CPU wouldn't be able to be as power efficient as a GPU (and I do mean a contemporary GPU, the convergence is still happening from both ends). Die budget isn't an issue either. Between Yonah and Haswell there will be an eightfold increase in floating-point throughput per core. That's not costing an eightfold increase in transistor budget. Extending the SIMD units further will also be relatively cheap. Quote:
That said, looking at the very rapid evolution from fixed-function to programmable to unified that happened to the GPU, I don't think we'll run out of silicon nodes before the CPU and GPU unify. Note that if the iGPU was ditched and replaced with CPU cores, a mainstream Haswell chip could already achieve close to 1 TFLOPS. It won't happen in the next few years, but we're certainly closer to unified graphics than you think. |
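A rough sketch of the arithmetic behind the eightfold and near-1-TFLOPS figures, for anyone who wants to check it. The per-cycle widths are assumptions about the two cores (Yonah issuing one 64-bit-wide SSE add and one multiply per cycle, Haswell issuing two 256-bit FMAs per cycle), and the 8-core, 3.5 GHz part is a purely hypothetical chip with the iGPU area spent on cores, not an announced product:

```python
# Back-of-the-envelope peak single-precision throughput.
# Every width, core count and clock below is an assumption for illustration.

def flops_per_cycle(units, lanes, ops_per_lane):
    """Peak FLOPs one core can retire per cycle."""
    return units * lanes * ops_per_lane

def peak_gflops(cores, ghz, per_cycle):
    """Peak GFLOPS for a chip with identical cores."""
    return cores * ghz * per_cycle

# Assumed: Yonah has one add and one multiply unit, each 2 floats wide.
yonah = flops_per_cycle(units=2, lanes=2, ops_per_lane=1)
# Assumed: Haswell has two FMA units, each 8 floats wide, 2 ops per FMA.
haswell = flops_per_cycle(units=2, lanes=8, ops_per_lane=2)

per_core_gain = haswell // yonah

# Hypothetical chip: 8 Haswell-class cores at 3.5 GHz, no iGPU.
hypothetical_chip = peak_gflops(cores=8, ghz=3.5, per_cycle=haswell)

print(per_core_gain, hypothetical_chip)
```

Under those assumptions the per-core gain is 8x and the hypothetical chip peaks a little under 900 GFLOPS, which is what "close to 1 TFLOPS" refers to. Sustained throughput would of course be lower than peak.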
||||
|
|
|
|
|
#53 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,495
|
Hang on, you said earlier that AVX2 will kill discrete GPUs,
and AVX2 is coming next year.
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#54 | |||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,074
|
Quote:
The ROI for such chips in the consumer market is uncompelling. The overlap between the applications users actually have and those that need that many cores is, aside from niche cases, software that is often a solution searching for a problem. Those chips don't work in power-constrained environments, in a market where a big chunk of the new revenue is power-constrained, and an extra generic core on top of many provides no significant utility, no "blue crystals", and no strong enough means to encourage upgrades. Quote:
One of the demonstration circuits Intel had for the technology besides a 10-100 MHz Pentium was a lighting accelerator for a low-voltage graphics solution. Quote:
For serial performance, it appears to be a non-starter because ticking along anywhere near desktop CPU speeds is physically penalized or impossible because of the process and design features of the tech. It's much more interesting for specialized or throughput-oriented uses, hence the emphasis on mobile graphics and HPC. Quote:
If you need the mass throughput of a multicore, Intel wants you to get a SB-E or somesuch. Failing that, there are cloud servers using Xeons with many cores. Consumers on average have dropped back to the dual to quad core transition with the proliferation of mobile media consumption devices with rapidly advancing GPUs. We can do core upon core upon core, with the biggest challenge being that, beyond the low single digits, core count is a Do Not Care. Quote:
This is very interesting for actual workloads people care about.
__________________
Dreaming of a .065 micron etch-a-sketch. Last edited by 3dilettante; 01-Nov-2012 at 16:03. Reason: punctuation |
|||||
|
|
|
|
|
#55 | ||||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
So exactly what operating parameters do you believe to be "far" closer to what is necessary for near-threshold operation on a GPU versus a CPU? Peak clock frequency is affected by pipeline length but otherwise it seems to me that a CPU is just as close to being able to operate at near-threshold voltage as a GPU. Quote:
Quote:
Quote:
Quote:
|
||||||
|
|
|
|
|
#56 |
|
Junior Member
Join Date: Feb 2010
Posts: 90
|
That CPUs are better than GPUs at complex code patterns is somewhat of a misconception. If anything, GPUs can perform better here.
The big reason is that additional parallelism gives the processor much more flexibility to schedule instructions. Whereas a CPU must resort to complex OOO schemes to avoid stalls in the pipeline, which are less efficient with complex code, a GPU simply schedules an instruction from another thread, which, provided there are enough threads ready to run, completely masks a stall. Consider a branch with nasty, data-dependent behavior. In this case, the CPU can only guess which way the branch will go, meaning there's a high chance of a mispredict, which costs perhaps 15 cycles x 6-way superscalar = 90 instruction slots. A GPU would just schedule another thread, and not even attempt to predict the branch, meaning that as long as you can somehow keep the SIMD lanes coherent (which is sometimes actually possible in this case!), you take no branch penalty. This is even more troublesome in a difficult-to-decode architecture like x86, where it may take several cycles to decode the instruction to the point where you even know that it's a branch in the first place, meaning you can sometimes mispredict on static branches, or even non-branches! Another tough issue with superscalar architectures is that you can very easily run out of instruction level parallelism. Worse, the amount of silicon it takes to achieve ILP of width N is of complexity order O(N^2), meaning you'll fall further and further behind a simple multithreaded architecture, which has complexity closer to O(N) or perhaps O(N log N). One big argument for CPUs is that GPUs rely on SIMD instructions, which break down in highly irregular control flow situations. Keep in mind that CPUs also rely on SIMD, though not as wide (4 or 8 in the case of AVX, vs. 32 or 64 for Nvidia and ATI, respectively).
Thing is, even with only 1 live SIMD lane, on something like a GTX 680, this still translates to 48 operations per clock (8 SMX, each issuing up to 4 instructions x 2-wide superscalar but limited by its 6 ALU banks, so 8 x 6 = 48), which is still very near to CPU scalar performance in the best case (6 cores x 6-wide superscalar = 36 operations/clock, though at a higher clock frequency). Keep in mind that this assumes that the CPU is able to avoid stalls, as well as fill in all of the instruction slots every cycle, which is rather difficult, especially for complex code, meaning that the CPU numbers are inflated. The GPU is much harder to stall, due to the aggressive hyperthreading, and code that actually fragments control flow to this degree is really quite rare outside of pathological cases, so it's likely to perform much better than these numbers suggest. As for programmability, the SIMT model is far easier to use than the explicit SIMD model CPUs use. As for efficiency, remember that you can always force the SIMT model to work as an explicit SIMD model in software, but you cannot necessarily do the same in reverse, since SIMT requires full lane level predication (including predicated branches). The bottom line is that for any problem where it is possible to break it down into thousands of threads, a GPU will almost always outperform a CPU, and the margin by which it does will tend to grow. Since pretty much any rendering algorithm operates on millions of independent pixels (with multiple independent samples for each one!), GPUs will pretty much always trump CPUs for rendering. GPUs have changed a lot in the last 5 years, to the point that it's very hard to tell the difference between a GPU and a CPU at a functional level. The difference is in the microarchitecture, where GPUs are optimized for running many simultaneous threads, while CPUs are optimized for running only a few. |
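The issue-slot numbers in this post can be reproduced with a small sketch. The function names are made up for illustration, and treating the 6 ALU banks as a cap on the 4 x 2 issue rate of each SMX is one reading of the figures, not something confirmed by a vendor document:

```python
# Issue-slot arithmetic for the CPU vs GPU comparison; illustrative only.

def mispredict_cost(penalty_cycles, issue_width):
    """Instruction slots thrown away by one branch mispredict."""
    return penalty_cycles * issue_width

def gpu_scalar_ops_per_clock(smx, schedulers, issue_per_scheduler, alu_banks):
    """Worst case with a single live SIMD lane: each issued ALU
    instruction does one useful operation. The issue rate per SMX is
    assumed to be capped by the number of ALU banks."""
    return smx * min(schedulers * issue_per_scheduler, alu_banks)

def cpu_scalar_ops_per_clock(cores, issue_width):
    """Best case: every issue slot of every core is filled each cycle."""
    return cores * issue_width

lost_slots = mispredict_cost(penalty_cycles=15, issue_width=6)
gpu_ops = gpu_scalar_ops_per_clock(smx=8, schedulers=4,
                                   issue_per_scheduler=2, alu_banks=6)
cpu_ops = cpu_scalar_ops_per_clock(cores=6, issue_width=6)

print(lost_slots, gpu_ops, cpu_ops)
```

Note that the two per-clock figures are not directly comparable in time, since the CPU runs at a higher clock frequency, as the post already points out.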
|
|
|
|
|
#57 | ||||||||||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,074
|
My initial statement wasn't in response to a claim you made. If a one-sentence reply is all that is made to contradict my claim, I don't find it unreasonable to assume you are taking the same position.
Quote:
Quote:
The time frame I gave for foreseeable could have been tightened down to indicate that it was based on data that could be pulled from publicly available process roadmaps from the major foundries and Intel, along with any product roadmaps--although this detail peters out earlier. That's about the 14nm node or equivalent for the foundries, maybe one more for Intel. A user looking to replace their Radeon 7970 or GeForce 680 on a quad or hex core system is not going to be in the same market for a 24-core CPU. I'm not sure what you mean by a Haswell successor. Haswell will most likely have SKUs that can max out consumer TDPs with 4 cores, and its immediate successor isn't going to make six times as many cores more palatable. Go much further and it's less likely to be a successor than a replacement. Quote:
To then compare a discrete desktop product to an ultraportable platform is pointless, and ignores that Haswell's portable variant actually has more GPU as a fraction of area specifically because having more low-clocked silicon saves power. Quote:
NTV adds area and complexity costs, and it becomes a negative once it approaches regular speeds and voltages. Quote:
Quote:
Configuring the chip for NTV requires trade-offs against high-speed operation, and forcing it to those speeds actually makes it less efficient or less manufacturable. Quote:
Quote:
Quote:
There was no compelling need for the physically impossible, or at least no more than any other thing requiring unicorns. Quote:
Until recently, the PC had the anomalous benefit of being a business, media, and personal use portal. It was an open and fragmented era where creative, commercial, and individual use flexibility and capabilities were satisfied and funded by the same pool of silicon and the same pool of dollars. This is not the same era. The drivers for creative computing or scientific computing are no longer the same as consumer computing, or the same as business computing or enterprise system computing. It used to be that engineering and revenue went into and came from this one big pool where all stakeholders could benefit from the PC chip as a disruptive technology. If any sector stagnated, there were other needs or other customers who wanted more, and their contribution pushed the whole forward. The marginal utility of the next big thing drove rapid upgrade cycles across the whole domain. The market trends now are for a fragmentation of a mature platform, one that is no longer disruptive but mundane and plodding. For various reasons, we see spending going away from the single clunky box or merchant chip that does everything inconveniently for the consumer. The consumer market is at least in part regressing, because silicon integration has advanced so far that people now have portable devices that can do just enough of the job of that clunky box that does everything, just not very prettily. The new platform is an inflexible portal for consumption, locked down, and hostile to creating content or processing it. It doesn't need to last, and it is better the more disposable it becomes. Their money is not going to bring about a need for 24-core PC chips. Their devices do not necessarily want cloud servers running on those either. The supercomputers want more than those chips can provide. There is still a need for pushing the envelope here, but it is not universally beneficial, so it is not going to be the product priced for the consumer. Quote:
You can then run the same games with the GPU on. Log battery life. Quote:
__________________
Dreaming of a .065 micron etch-a-sketch. |
||||||||||||
|
|
|
|
|
#58 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,495
|
Not sure about the 10 W, but I can take a very, very (cue a few more very's) rough guess for a 4.8 GHz Haswell (if such a thing was to ever exist) running UT2004 at 1680x1050, max details, on a small 2-player level: 15-20 fps.
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#59 |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
|
|
|
|
|
|
#60 | ||||||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
But it's not that simple, and GPUs do stall. Complex code uses a big working set, and if your register file and/or caches are too small to hold that working set, then you use up precious bandwidth just to pull in frequently used data. Once you're out of bandwidth, and this can happen at many levels, the GPU stalls. The CPU's out-of-order execution enables it to keep running just one or two threads per core, thereby maximizing access localities and ensuring high cache hit rates, making it inherently bandwidth efficient at every level. Computing power is getting cheaper, while bandwidth is getting harder to scale, so inevitably the GPU has to learn tricks from the CPU. In fact this has already been going on for years; they try to keep the thread count low by decreasing the back-to-back execution latency. Eventually they'll want to bypass the results back to the top of the execution units, and it's a relatively small step from that to not always schedule instructions from different threads, but to also schedule independent instructions from the same thread. Quote:
Quote:
Branch prediction is a necessity to keep the thread count low. Sooner or later GPUs will need it. Stalling all the time because your data is far away, is worse than mispredicting every now and then. Quote:
And again, the GPU's urge to switch to another thread to maximize ALU utilization can work against itself. The register file and cache contention create a bottleneck that is far worse than missing a bit of ILP. Quote:
Quote:
Quote:
Quote:
|
||||||||
|
|
|
|
|
#61 | |||
|
super willyjuice
Join Date: May 2005
Location: Astoria, NY
Posts: 986
|
Quote:
Quote:
Quote:
|
|||
|
|
|
|
|
#62 | ||||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
If you're still not convinced: A 5.3GHz 8T-SRAM with Operation Down to 0.41V in 65nm CMOS. A lower starting frequency doesn't make the GPU any more amenable for NTV operation. Quote:
Quote:
Quote:
Note that mobile phone CPUs start to use out-of-order execution and some are quad-core too. Technology like AVX2 also makes sense from a performance/Watt perspective. The HPC market has a need for it too. So despite all this diversification in devices, the drivers for the computing technology are still the same. Quote:
|
||||||
|
|
|
|
|
#63 | |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
CPUs will indeed also become "relatively" competitive with their iGPU counterpart. But not yet in every single aspect or for every model. It makes a lot of sense to not take any risks and keep the iGPU for a few more generations. All I'm saying is that AVX2 should be an eye-opener for what the possibilities of a unified architecture will be. |
|
|
|
|
|
|
#64 | |||||||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,074
|
Quote:
Quote:
The IGP or throughput portion of the die will be optimized for lower leakage and dynamic power. Haswell is slightly compromising some of its CPU performance by decoupling the L3 cache from the CPU domain, in order to allow scenarios that rely predominantly on the IGP to permit it to be active while the whole CPU section is dropped to an even lower power state. For power savings alone in a consumer or media box scenario, dedicating a significant area to specialized low-speed silicon and allowing only 4 OoO cores to gate off is already compelling. If 20 of those overbuilt cores are only really needed to substitute for an on-die GPU or throughput section, the vast majority of the market is going to question the need for the 20 cores. Your bandwidth figures may not be entirely accurate if integration continues apace. Quote:
Maybe it's really not that simple, since nothing is, and there's a non-linear amount of effort to get up to that level of speed. The last Pentium at the .25 micron node could hit 300 MHz with a TDP of 7-8 Watts with a core with 4.5 million transistors, with 1.8-2V. The first Pentium 4 at ~4 GHz was at .09 microns, had a TDP of 115 Watts, 125 million transistors, and 1.25/1.4V. I dispute the claim that there is a simplistic linear translation in clock speed when faced with increased delay at voltages above near threshold and the increased power cost related to driving larger gates, more complex circuits, and actual changes to the physical and chemical processes that have been mooted for really ramping down NTV power consumption. It's especially critical since a lot less can be hidden in .25 ns versus 1 ns, and honestly the NTV chip's really interesting around 2-3 ns per cycle. Note that the NTV Pentium had about 25% more transistors, consistent with certain circuit design tradeoffs to counter variability and increase reliability at very low voltage. Reducing leakage and variability at .6V means adding changes that drive up the power cost of operating at regular speeds. Nonetheless, even if we assume something that is that dubious, the P4 core is almost 28 times larger than the most recent non-experimental Pentium core. Granted, that is mostly cache. Stripping it down to core and L1 might put the P4 core at 30-40 million, so just shy of ten times the size. Add on 25% and the P4 is getting near 50 million transistors. What is interesting in a P4's performance with a TDP of 7 watts at 32nm? It's not even interesting compared to cores running at normal voltages. Quote:
There are plenty of non GHz cores that get equal or better IPC to aggressive cores, they just can't hit high clocks. Without the high clocks, they need to reorder even less since the biggest driver is memory latency relative to the cores. Quote:
There are >6 million transistor cores that get around 700mW, because that's not the realm that the NTV Pentium is interesting at. The interesting point is 500MHz and below, where the circuits can take their time and a high-speed complex core is pointless fluff. Quote:
There's no compelling need for a CPU with 24 cores in the consumer market because they don't do anything that challenging and a CPU with 24 cores is orders of magnitude beyond acceptable in terms of power consumption. A CPU with 24 cores also currently consumes an order of magnitude beyond acceptable for the next big thing in HPC. Quote:
Quote:
You are not getting crossover from an Apple SOC with ARM cores, a GPU, and scads of hardware decoders for a consumer device and a 24 core Haswell that fails to supplant a GPU in a gaming desktop. Quote:
Intel is very, very interested in continuing voltage scaling way below what is possible with standard circuits. It's not interesting or actively detrimental for workloads where a core's utility is heavily predicated on its straightline speed, but intensely interesting for workloads that are embarrassingly parallel. Push NTV up to where it becomes interesting in serial processing, and it loses to standard cores. The goal for exascale computing has already extrapolated on the trends for best-case GPU evolution, and it was found wanting. The CPU trendline was even lower. The desire for power efficiency is going beyond merely different designs on relatively comparable silicon, and demanding an actual specialization in the silicon--if it remains silicon at all in the long term.
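For anyone checking the Pentium versus Pentium 4 figures earlier in this post, here is a short sketch of the arithmetic. It only restates the numbers given in the post; the variable names are illustrative, and the 30-40 million core estimate and the 25% NTV overhead are the post's own rough guesses rather than measured values:

```python
# Transistor-count and cycle-time arithmetic for the NTV comparison.

def cycle_time_ns(freq_ghz):
    """Clock period in nanoseconds for a frequency in GHz."""
    return 1.0 / freq_ghz

def freq_mhz(cycle_ns):
    """Clock frequency in MHz for a period in nanoseconds."""
    return 1000.0 / cycle_ns

PENTIUM_TRANSISTORS = 4.5e6    # last .25 micron Pentium core
P4_TRANSISTORS = 125e6         # ~4 GHz Pentium 4, including cache
P4_CORE_ESTIMATE = (30e6, 40e6)  # rough guess for core plus L1 only
NTV_OVERHEAD = 1.25            # about 25% more transistors for NTV

full_ratio = P4_TRANSISTORS / PENTIUM_TRANSISTORS
core_ratios = tuple(t / PENTIUM_TRANSISTORS for t in P4_CORE_ESTIMATE)
ntv_p4_upper = P4_CORE_ESTIMATE[1] * NTV_OVERHEAD

print(round(full_ratio, 1))                    # "almost 28 times larger"
print([round(r, 1) for r in core_ratios])      # "just shy of ten times"
print(ntv_p4_upper)                            # "getting near 50 million"
print(cycle_time_ns(4.0), cycle_time_ns(1.0))  # .25 ns versus 1 ns
print(freq_mhz(2.0), round(freq_mhz(3.0)))     # the 2-3 ns sweet spot
```

The last line shows why the interesting NTV range ends up at roughly 330 to 500 MHz, far from where a core built for 4 GHz operates.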
__________________
Dreaming of a .065 micron etch-a-sketch. |
|||||||||
|
|
|
|
|
#65 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,495
|
Actually I got it wrong: he said it would be the death of GPGPU, which isn't quite the same thing as the death of discrete GPUs.
PS: Nick, there's another thread on B3D (formally) saying it's voxels that will be the death of the GPU. What are your thoughts on that?
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#66 | ||||||||
|
Junior Member
Join Date: Feb 2010
Posts: 90
|
Quote:
In any case, a high end GPU currently has close to 200 GB/s memory bandwidth, but a 6-core i7 tops out around 40 GB/s. Did I mention that the GPU costs around $600, but the 6-core i7, once you throw in the motherboard and memory, is north of $1000? There are various reasons for this (and I don't really know what they are), but do you really think that Intel would choose *not* to make a 200 GB/s workstation if it were possible? Quote:
Bypassing is very expensive, especially when you have superscalar, SIMD, or long pipelines. It's usually cheaper to just double the size of the register file and be done with it, assuming you can afford to lose binary compatibility. The main problem though is that memory, even with caches, is very high latency to access. Latency of 10/100/1000 cycles for L1/L2/RAM is fairly typical. Any time you hit a dependent memory pattern (hello there, linked list!), performance will tank unless you have a large number of other instructions in between. You can either schedule in more threads in hardware, or interleave work items in software. But once you have to interleave, you've defeated the entire point of most of the serial performance hardware on a CPU. For graphics, there are actually algorithms that rely on having per pixel linked lists. Various order-independent transparency methods come to mind. Other algorithms with heavy chained memory dependency include relief mapping and ray-tracing. Quote:
Graphics is also unfriendly to caches, since you tend to have an access once, use once pattern. There's short term cache coherence, such as between nearby pixels sampling the same texel, but that can be covered by a very small L1 cache. A larger, higher level cache isn't likely to help much more until the point where it becomes large enough to contain the entire scene, which is to say, the entire size of main memory, since we try to use as high resolution textures as will fit. This pattern also holds for many types of HPC problems. For instance, pretty much any sort of physical simulation looks at a small set of nearby elements, but operates over a massive data-set (often so large that it doesn't even fit in RAM...). Quote:
Quote:
It's always better to switch to another thread than to branch predict, provided you have enough resources, since with BP, you run the risk of mispredicting and throwing away work. Quote:
Quote:
Incidentally, predication hardware picks fights with OOO and bypassing. Quote:
Last edited by keldor314; 03-Nov-2012 at 03:21. |
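To put rough numbers on the dependent-load argument in this post, here is a deliberately simple latency-hiding model. It assumes one load can issue per cycle and that each chain (a thread, or a software-interleaved work item) must wait the full latency before issuing its next dependent load; real hardware has more moving parts, and the function name is made up for illustration:

```python
# Toy model of hiding memory latency by interleaving independent chains
# of dependent loads (think: walking several linked lists at once).

LATENCY_CYCLES = {"L1": 10, "L2": 100, "RAM": 1000}  # figures from the post

def cycles_per_load(latency, independent_chains):
    """Amortized cycles per load when `independent_chains` pointer-chasing
    streams are interleaved. Assumes at most one load issues per cycle."""
    return max(1.0, latency / independent_chains)

for level, latency in LATENCY_CYCLES.items():
    alone = cycles_per_load(latency, 1)         # a single linked list
    with_64 = cycles_per_load(latency, 64)      # 64 threads in flight
    hidden_at = latency                         # chains needed to reach 1 cycle/load
    print(level, alone, with_64, hidden_at)
```

In this model a single chain that misses to RAM pays the full 1000 cycles per node, while 64 chains in flight bring that down to about 16 cycles per load, whether the interleaving comes from hardware threads or from software.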
||||||||
|
|
|
|
|
#67 | ||||||||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Anyhow, I still don't see any reason to assume that GPUs are significantly more suitable or will significantly benefit more from NTV design. Quote:
Intel has been its own competition for the last 6+ years. How do they get people to buy their hardware? By producing ever faster processors. Note that this strategy has worked really well for them. So there must be demand for higher performance. Software becomes ever more demanding, people want faster response times, and consumers don't always just consume. There's probably a million more reasons. Note also that Haswell adds another scalar integer execution port, a third AGU, doubles the cache bandwidth, extends integer AVX instructions to 256-bit, adds FMA support, adds gather support, adds hardware transactional memory and lock elision, etc. If there was little or no demand for higher performance, why add these performance features, all at once? They must realize something you don't. They create demand, by creating powerful hardware that is easy to develop software for. It's obvious that AVX2 is more developer-friendly than heterogeneous computing, and TSX's only purpose is to enable scaling to higher core counts. So sooner or later there will be demand for 24-core CPUs, when Intel makes it so. Also note that if consumers demand ever better battery life, then why are mobile phones aggressively increasing their CPU performance, while battery life only increases slightly? Apparently their needs in terms of performance haven't regressed much at all. Quote:
Anyway, to get back to my original point: despite the fact that there are now many more form factors than the desktop PC, they all still strive for higher performance. Quote:
Quote:
|
||||||||||
|
|
|
|
|
#68 | |
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
That said, I don't think voxels will be the single motivation behind the unification of CPU and GPU technology. Polygon rasterization is here to stay for a very long time, if not indefinitely. But eventually we'll have TFLOPS of generic computing power, and people will no doubt use it for innovative graphics purposes, and other high throughput applications. |
|
|
|
|
|
|
#69 | |||||||||||||||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
Quote:
Quote:
Secondly, the GPU is completely helpless without the CPU. It needs it to run the graphics driver, the application, and the operating system. It also dearly needs the motherboard and system RAM for the same reasons. And you also have to take into account that the CPU can do way more than that. So you can't just compare them without properly taking that into account. Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Also, an unpredictable branch means a highly divergent one, which is not wide-SIMD friendly. Some things in computing are just hard no matter what architecture you use. But as I said before, mispredicting a branch once in a while is less bad than running too many threads and having low cache hit rates and thus hitting the bandwidth wall. Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
|
|||||||||||||||||
|
|
|
|
|
#70 | ||||||||||||||
|
Senior Member
Join Date: Sep 2003
Location: Well within 3d
Posts: 4,074
|
Do you feel it is within the next 3 or so process nodes?
I guess I'll try to bookmark this thread for whenever that is. Quote:
That existing curve for regular cores with standard circuit and gate choices has been judged insufficient. Even specialized hardware was put against the demanded power/performance and found insufficient. The big quadratic or cubic factor in power is voltage, and to really get the advance needed, Intel is actively exploring a method that allows circuits to function close to the point that silicon becomes fundamentally unreliable, and the choices it made to counteract the challenges at that level differ from what it does trying to get silicon to twitch at 4 GHz. The same core does not look like it is going to be able to drop to near-threshold if it is also expected to run normally. The logic needs to be refactored, and the gates themselves are pushed to allow for decent performance at very low voltage. Some of those choices, such as increasing the transistors per pipe stage and choosing to replace high Vt gates with nominal threshold devices, hurt clock speed and hurt known leakage-control measures that are important to cores that run at >1V. Quote:
Quote:
Quote:
I'm stating with an example that a core that is already up in the stratosphere doesn't have the ability to sit in place and maybe clock higher, and that it's likely not interesting in the range it has to drop to. Quote:
Did Intel explicitly state that the SRAM could not scale below .55V? Is it possible that lower voltages are not mentioned because when integrated into a processor this is counterproductive? Quote:
They have leeway to give up at least some straightline speed for an EP workload. A high-performance core that exists to provide top of the line straightline performance stops being interesting for the workload if it gives up straightline speed. Quote:
Performance is not universally demanded. What Intel provided earlier was greater utility for the customer. Back when silicon scaling didn't drop off a cliff and we had kilobytes of storage, greater performance had an immediate increase in the utility that Intel's product had over preceding designs for pretty much every portion of its market and every market it expanded into. Utility is what drives purchases, or planned obsolescence and "blue crystals". The GPU in the latest Intel chips provided far more utility than another 4-6 CPU cores for consumers and corporate procurement. Highly performant cores that outstripped consumer and business demand have flatlined the utility argument for PC refreshes, and in mature markets the upgrade cycle for the generic programmable CPU has grown longer and longer. A chip with special checkbox features, features that become obsolete, or have proprietary advantage with little technical justification is preferable in certain markets. Quote:
In terms of core count, mobile users have dropped from 4-6 cores after grudgingly going above two back down to 1-2 and sort of trying to justify going to mainstream 4 cores by 20nm. I think that the time period around 20nm was predicted to have mainstream octocores or somesuch a year or so ago. Quote:
For markets that aren't entirely hostile to a core with the cost or power consumption of a Haswell device, it saves on engineering to use a core in multiple products in various markets. In markets that don't meet the pricing and power consumption of Haswell, Intel has built a specialized line of cores. To counter, if Intel cared so much about these features, why does it fuse off so many features in half its SKUs? Depending on the entry in the decoder wheel, you can get a core with or without multithreading, with or without encryption acceleration, with or without graphics, with or without virtualization, and with or without a large number of cores. Surely if it cared about generating demand for 24-core chips it would just sell 24-core chips. Or maybe it's about the marketing of a product and segmentation to generate the needed margins, and 24-core chips are not a promising direction. Quote:
Let me know when we get there for the consumer market. I would suggest comparing a phone's processor cores to an Intel CPU from seven years ago and explaining why there wasn't a regression. The GPU portion might have a better argument. Quote:
Quote:
I can go back to elementary school and strive for graduation again, too. Maybe they're striving for something, but performance is not the primary driver. Quote:
Does it help that there's been negligible decrease in TDP at the same time? Quote:
It does help that graphics is a workload that still may garner the margins necessary to offset the risks and high costs of the initial effort. More general use of the tech is not ready for prime time, and it's one of the few use cases where consumers may be able to notice the difference in terms of experience and battery life versus a solution that does not have it. A GPU at 300-700 MHz is acceptable, and the NTV Pentium works in that range. There's decent demand for GPUs with that speed range right now for mainstream laptops, value desktops, and mobile and embedded.
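The "quadratic or cubic factor" mentioned earlier in this post follows from the usual first-order model of dynamic power, P ~ C x V^2 x f. A minimal sketch, assuming switched capacitance stays constant, leakage is ignored, and (for the cubic case) achievable frequency scales roughly linearly with voltage; the 1.1 V reference point is a hypothetical stand-in for a core running at ">1V":

```python
# First-order dynamic power scaling: P ~ C * V^2 * f.
# Leakage and threshold effects are ignored, so this is only a rough guide.

def relative_dynamic_power(v, v_ref, freq_scales_with_voltage=True):
    """Dynamic power at voltage `v` relative to `v_ref`.

    Quadratic if frequency is held fixed, cubic if frequency is assumed
    to drop in proportion to voltage."""
    ratio = v / v_ref
    return ratio ** 3 if freq_scales_with_voltage else ratio ** 2

V_REF = 1.1   # hypothetical nominal voltage
V_LOW = 0.55  # the SRAM voltage discussed in the thread

fixed_freq = relative_dynamic_power(V_LOW, V_REF, freq_scales_with_voltage=False)
scaled_freq = relative_dynamic_power(V_LOW, V_REF)

print(fixed_freq, scaled_freq)
```

Halving the voltage cuts dynamic power to a quarter at the same clock, or to an eighth if the clock drops with it. Near threshold the frequency falls off faster than linearly, which is part of why straightline speed suffers there.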
__________________
Dreaming of a .065 micron etch-a-sketch. |
||||||||||||||
|
|
|
|
|
#71 | |
|
Senior Member
Join Date: Dec 2004
Location: Toulouse
Posts: 4,141
|
Quote:
That 4-core i7 is overpriced too, because disabling the SMT to make an i5 from it is dubious. I'm sure nearly every CPU is overpriced. This is just not a 5x markup for less performance, as with professional cards. Last edited by Blazkowicz; 05-Nov-2012 at 07:34. |
|
|
|
|
|
|
#72 |
|
Junior Member
Join Date: Feb 2010
Posts: 90
|
One thing to note is that most games are currently optimized for XBOX 360 and PS3 level hardware. Keep in mind that those are nearly a decade old! If you went back the same amount of time from there, you'd find the SNES just being replaced by the N64 and the first PlayStation! Quake, acclaimed as the first game with true 3D graphics, with triangles and stuff, had just appeared! So it's not surprising that even a very low-end system can handle things well enough, hence the rise of integrated graphics. Now, when the next XBOX and the PS4 finally appear, things are going to change.
Imagine, a current high end CPU is nearly as powerful as a GPU from 7 years ago. Let the GPU world tremble. Last edited by keldor314; 08-Nov-2012 at 03:54. |
|
|
|
|
|
#73 |
|
Darlek ******
Join Date: Jun 2004
Posts: 9,495
|
Maybe on paper, but look at some Radeon 9800 Pro UT2k4 benchmarks:
http://forums.epicgames.com/threads/...004-Benchmarks
An i7 running SwiftShader wouldn't come close.
__________________
Guardian of the Most holy Two Terabytes of Gaming Goodness™ |
|
|
|
|
|
#74 | |||||||||||||||
|
Senior Member
Join Date: Jan 2003
Location: Ottawa, Ontario
Posts: 1,783
|
Quote:
I can tell you though that I was quite surprised that Intel will introduce 256-bit integer operations, FMA, and gather support, all in one generation and as early as next year. And it was yet another pleasant surprise that Haswell will have two FMA units, with no latency penalty, a third AGU, and a fourth integer execution port to offload the vector ports. And let's not forget TSX. I expected these features to be introduced over the course of several architecture generations. So things are actually happening much faster than anticipated. There's still a long way to go, but at least the discussions are no longer stuck on fundamental things like the feasibility of gather support or half-decent floating-point throughput. The convergence is strong, and we're starting to get to the finer details of discussing what it would take to unify them. Quote:
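To put a rough number on what two 256-bit FMA units mean per core, here is a back-of-the-envelope sketch. The lane count follows from 256-bit vectors of 32-bit floats, and an FMA counts as two floating-point operations per lane. The function names, the core count, and the clock speed in the usage note are illustrative assumptions, not figures from the post, and the result is a theoretical peak, not a measured number.

```python
def peak_sp_flops_per_cycle(vector_bits=256, element_bits=32, fma_units=2):
    """Theoretical single-precision FLOPs per cycle for one core."""
    lanes = vector_bits // element_bits   # 256 / 32 = 8 float lanes
    flops_per_lane = 2                    # one FMA = multiply + add
    return lanes * flops_per_lane * fma_units


def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS; cores and clock_ghz are assumed inputs."""
    return cores * clock_ghz * flops_per_cycle


per_core = peak_sp_flops_per_cycle()
print(per_core)
# Hypothetical 4-core part at 3.0 GHz, chosen only for illustration.
print(peak_gflops(4, 3.0, per_core))
```

With those assumed inputs each core peaks at 32 single-precision FLOPs per cycle. Real workloads land well below that because of memory bandwidth and instruction mix.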
Also note that the top supercomputer to date, the IBM Sequoia, is a GPU-less design. It is more power efficient than the top CPU-GPU design, which takes spot number five. It's clear that it won't take a lot of GPU technology to be integrated into the CPU cores, to keep it that way. Heterogeneous computing was a short-lived hype. Quote:
By the way, note that Intel is already dynamically switching between using only the lower 128-bit execution units for AVX (issuing them in two cycles), or the full 256-bit. When not executing 256-bit AVX code, the core can put the upper 128-bit lane to sleep. As far as I know they're not yet adjusting the frequency or voltage, but it's obviously a first step towards cores that adjust to the workload. Quote:
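The idea of running a 256-bit operation on 128-bit execution units in two passes can be pictured with a toy model. This is only a software illustration of the concept, assuming 8 float lanes per 256-bit vector and 4 lanes per 128-bit unit; the function names are made up for the example and it says nothing about how the hardware actually schedules the halves.

```python
def add_256_full(a, b):
    """All 8 lanes handled in a single pass (full-width unit)."""
    return [x + y for x, y in zip(a, b)]


def add_256_split(a, b, unit_lanes=4):
    """Same operation on a narrower unit: one pass per 128-bit half.

    Returns the result and the number of passes it took.
    """
    result = []
    passes = 0
    for start in range(0, len(a), unit_lanes):
        lo, hi = start, start + unit_lanes
        result.extend(x + y for x, y in zip(a[lo:hi], b[lo:hi]))
        passes += 1
    return result, passes


a = [float(i) for i in range(8)]
b = [10.0] * 8
split_result, passes = add_256_split(a, b)
print(split_result == add_256_full(a, b), passes)
```

The result is identical either way. The narrow path just takes two passes instead of one, which is the trade that lets the upper lane sleep when it isn't needed.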
Quote:
Quote:
My point was that you don't necessarily need 10T SRAM to achieve NTV operation, or achieve any benefit at all. It's a pretty wide design space. You don't necessarily need NTV in the strict sense to vastly improve performance/Watt. The NTV Pentium has its optimal performance/Watt at 0.45 Volt for logic, which isn't all that far from the 0.55 Volt required to keep the SRAM reliable, but most importantly it's not exactly near the threshold voltage. Pushing for full NTV operation isn't worth the drop in absolute performance for any consumer market any time soon. But Intel has hugely benefited from its 8T SRAM design for years now, despite not counting as NTV design. So I expect we'll simply ease into it over the course of many years. But there are no signs of GPUs adopting this technology any faster than CPUs. Quote:
Quote:
Quote:
The only major glitch in this cycle is that currently multi-threading is uninviting to the average developer. It's way too much effort for too little gain. We need a change in hardware support for synchronization primitives, a change in programming methodology, and a change in market adoption. All of these things take time. It will be 8 years between the first commodity dual-core processors, and the ability to atomically transfer more than one word of information between cores in an efficient manner. It's insane how much has already been achieved without that fundamental capability, which illustrates there's no lack of trying. Likewise, we've slowly but surely seen more than two cores become the standard, increasing the incentive for developers to take advantage of it and stay ahead/on par with the competition. But like I said before, the biggest change is yet to come, with new programming paradigms and a matching tool chain. This can easily take several decades. But rest assured that our grandchildren will look at Battlefield 3 the same way we look at Pong, and they won't have to deal with the limitations of heterogeneous computing. Quote:
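For anyone wondering what "atomically transfer more than one word" means in practice, here is a minimal sketch. It uses an ordinary lock as a stand-in, so it shows the all-or-nothing visibility you want, not the hardware mechanism; transactional memory would aim for the same guarantee without the explicit lock. PairBox, publish, snapshot and count_torn_reads are hypothetical names invented for the example.

```python
import threading


class PairBox:
    """Two words that must always be seen together (invariant: b == -a)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._a = 0
        self._b = 0

    def publish(self, value):
        # Both words change inside one critical section, so a reader
        # never observes the first update without the second.
        with self._lock:
            self._a = value
            self._b = -value

    def snapshot(self):
        with self._lock:
            return self._a, self._b


def count_torn_reads(iterations=10_000):
    """Run a writer thread against a reader and count broken invariants."""
    box = PairBox()
    torn = 0

    def writer():
        for i in range(1, iterations + 1):
            box.publish(i)

    t = threading.Thread(target=writer)
    t.start()
    for _ in range(iterations):
        a, b = box.snapshot()
        if a + b != 0:
            torn += 1
    t.join()
    return torn


print(count_torn_reads())
```

The lock makes it correct but serializes every access, which is exactly the "too much effort for too little gain" cost when the critical sections are tiny.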
|
|||||||||||||||
|
|
|
|
|
#75 |
|
Junior Member
Join Date: May 2007
Location: Vault 13
Posts: 11
|
We've pretty much hit the wall in terms of serial execution performance. So the only room to grow is through parallel execution: more cores, wider SIMDs. Future CPUs will have to adopt GPU-like features and GPUs will have to adopt CPU-like features (a more complex cache hierarchy to avoid going off-chip).
It's neither the death of the GPU nor the death of the CPU. CPUs will continue to be more efficient for serial processing and GPUs will continue to be more efficient for embarrassingly parallel problems. As always, the most efficient option will be dedicated, non-programmable silicon, if a particular problem warrants the cost (which today is predominantly the case for things like video codecs and crypto functionality). |
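One common way to put numbers on the serial versus parallel split is Amdahl's law. It isn't named in the post and is brought in here only as an illustration: if a fraction p of the work can be spread over n units, the speedup is 1 / ((1 - p) + p / n). The function name and the sample fractions are assumptions picked for the example.

```python
def amdahl_speedup(parallel_fraction, units):
    """Speedup over one unit when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / units)


# Sample fractions chosen for illustration only.
for p in (0.5, 0.9, 1.0):
    print(p, [round(amdahl_speedup(p, n), 2) for n in (2, 8, 64)])
```

A fully parallel job (p = 1.0) scales with the unit count, while a 90% parallel job can never exceed 10x no matter how many units you add. That is the usual argument for keeping fast serial cores alongside wide parallel hardware.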
|
|
|
| Tags |
| 3d rendering, software based rendering, the future of 3d |