Intel


The biggest technical problem with Intel's ambitions in the smartphone market is not the processor itself, but system architecture. It was a complete joke on Menlow and wasted so much power that it was understandable many people decided to simply write off Intel forever. With Moorestown it got much better, but still not good enough - and x86 Android/MeeGo weren't ready in that timeframe, which didn't help. So Intel's first real chance in handhelds is the 32nm Medfield SoC (single-chip with no southbridge), which Intel still expects to be available in phones by the end of 2011. And this time they *might* finally have a 'good enough' system architecture (especially in terms of standby power) even if it isn't perfect, so the bigger question becomes... what about the processor itself? Is it competitive?

Atom: Overview


The most surprising aspect of the original Intel Atom is its die size: 24.2mm² for the whole chip (on 45nm) and 6.8mm² for the core (including L1 but excluding the bus interface). While that might seem small, it's ridiculously big compared to any modern ARM core - in fact, 6.8mm² is substantially larger than the Cortex-A15 on 40nm! And what does Intel have to show for it? A mere dual-issue in-order integer pipeline with SMT (75% of the core area including the shared front-end) and a slightly more impressive mostly-128-bit MUL+ADD SSE3 pipeline (25% of the core area, though much more if you exclude the L1 caches). There's certainly nothing to brag about with that kind of die size, and it doesn't bode well for power efficiency either.

But there are a few mitigating factors. First of all, Intel's 45nm process is not very dense compared to TSMC's 40nm process (no immersion lithography and more restrictive design rules; the comparison is very different for Intel's 32nm vs TSMC's 28nm, as TSMC has since introduced restrictive design rules of its own while Intel uses dual-patterning lithography and TSMC apparently doesn't). Second, Intel optimised the design to support extremely low voltages (possibly lower than many ARM implementations, on a process arguably less accustomed to that kind of thing), which probably reduces the density of the L1 SRAM substantially. Third, they probably sacrificed area at every turn to save power and compensate for their inherent architectural inefficiency.

Atom: Deep-Dive


Atom is an in-order dual-issue architecture but it has an ace up its sleeve: CISC! The classical argument against x86 (and CISC in general) is that many instructions are very complex and the processor must waste transistors to decode them into multiple simpler RISC-like operations. This is true but it also means x86 and ARM instructions are not directly comparable because you might need more of the latter to do the same thing. And this is especially true for Atom because its internal representation is still CISC: amazingly enough it can even process many 'load-op-store' instructions (e.g. "ADD [memory], register" or "value_at_memory=value_at_memory+register") in a single cycle (using only 1 of its 2 pipelines although the other can't handle a load/store at the same time).
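To make the instruction-count difference concrete, here's a small hedged sketch in C of what a compiler might do with a read-modify-write of memory; the assembly in the comments is purely illustrative (register allocation and exact output depend on the compiler and flags), not a claim about any particular toolchain.

    /* Hypothetical example: a read-modify-write of a memory location. */
    void bump(int *counter, int delta)
    {
        *counter += delta;
        /* x86 can express the whole operation as one load-op-store instruction
           (illustrative registers):
               add [eax], edx
           ARM, being a load/store architecture, needs three instructions:
               ldr r2, [r0]
               add r2, r2, r1
               str r2, [r0]
        */
    }

On Atom, an instruction like that single x86 ADD goes down one of the two pipelines in one cycle, which is exactly the 'internal CISC' advantage described above.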

That means the load and store units must be placed just before and after the integer ALUs respectively (in the same pipeline) rather than in parallel with them (as a separate pipeline), which results in an effective load-to-use latency of 0 cycles (only for instructions combining load & execute) but also 3 more pipeline stages. Atom has a very high 13-cycle misprediction penalty on a 16-stage integer pipeline (3xFetch/3xDecode/3xDispatch/3xLoad/1xExecute/2xExceptions-SMT/1xCommit), but this still seems like a good trade-off. However, there's yet another catch: this CISC trick means Intel decided not to also add Out-of-Order Completion for integer/load-store instructions (unlike the ARM11!), so the pipeline will stall completely on a cache miss. And while integer multiplication is the only operation with multi-cycle latency, it will stall all single-cycle instructions until it's finished because they cannot be allowed to complete first. At least INT can issue ahead of FP and vice versa.
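As a quick sanity check on those figures (and one plausible reading of where the 13-cycle number comes from, assuming branches only resolve at the execute stage - my assumption, not an Intel statement):

    3 (fetch) + 3 (decode) + 3 (dispatch) + 3 (AGU/load) + 1 (execute) + 2 (exceptions/SMT) + 1 (commit) = 16 stages
    misprediction penalty ~= stages up to and including execute = 3 + 3 + 3 + 3 + 1 = 13 cycles

In other words, the extra load stages inserted ahead of the ALUs are a big part of why the mispredict penalty is so high.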

Atom's performance is hampered further by two other factors: x86 only has 8 registers (versus 16 for x64 and 15 for ARM) and x86 instructions only use 2 operands (versus 3 for ARM). The first means a lot more register spilling (data which could have been kept in registers has to be stored to and loaded back from memory - at least the faster load/store comes in handy here), and the second means a lot more instructions are wasted moving data around the few registers that exist (e.g. if you want to keep the values of both B and C when executing A=B+C, you need an extra MOV to copy B to another register). This also makes it harder to find independent instructions to dual-issue. And even though Atom supports x64 (and with a good implementation it could benefit more from it than OoOE CPUs do), neither MeeGo nor Android will support it in the Medfield timeframe.
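And for the two-operand point specifically, a similarly hedged illustration (again, the assembly comments are indicative only, not real compiler output):

    /* Hypothetical example: A = B + C where B stays live afterwards. */
    int sum_keep_input(int b, int c)
    {
        int a = b + c;
        /* ARM's three-operand ADD writes to a separate destination register:
               add r2, r0, r1      ; a = b + c, b and c untouched
           x86's two-operand ADD overwrites one of its sources, so b has to be
           copied first if we still need it:
               mov eax, ebx        ; copy b
               add eax, ecx        ; a = b + c
        */
        return a + b;   /* b is used again, so it really must be preserved */
    }

That extra MOV is pure overhead: it consumes an issue slot and, in a two-wide in-order machine, can directly cost throughput.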

Finally, there's Hyper-Threading (aka SMT). This is a huge deal because it's a perfect fit for Atom's architecture, including most of the problems we've just highlighted. Hard to dual-issue instructions? You can issue instructions from two different threads on the same cycle (while the decoder takes turns between them). Heavy branch misprediction penalty or (most important of all) costly pipeline stall after a cache miss? The other thread is unaffected and hopefully ready to pick up some of the slack. If it wasn't for SMT, an in-order dual-issue CISC architecture wouldn't make nearly as much sense. So how good is it in the real world?

Estimates: Let's dare to be stupid!


Comparing Atom's power efficiency to the Cortex-A9's requires comparable performance and power numbers. Here's a very rough way to estimate performance based on our analysis of a few real-world benchmarks and CoreMark: for multithreaded workloads, a single-core 2GHz Atom is quite comparable to a dual-core 1GHz Cortex-A9. And for single-threaded workloads, the 2GHz Atom would be significantly but not unbelievably faster. This is (to massively oversimplify the problem) because SMT and OoOE both do a reasonably good job of using the resources available in their own very different ways.

It is true that Intel's own numbers for SPECint2000 tell a very different story, but SPECint is also very much a compiler benchmark, ARM has never released any numbers of its own, and Intel's numbers are not credible (Atom is 33% faster per-clock than an A9 for single-threaded workloads? Don't be silly!) so it's just one more data point at best. We might be underestimating Atom's performance but don't expect miracles.

Unfortunately, this means power efficiency doesn't look good at all for Intel. Let's consider 3 Atom SKUs: Z500 (800MHz @ 0.65W), Z530 (1.6GHz @ 2W), and Z550 (2GHz @ 2.4W). As for the Cortex-A9, ARM sells two dual-core macros (aka Osprey) on the TSMC 40G process: one is power optimised (800MHz @ 0.5W) and the other is performance optimised (2GHz @ 1.9W). Since we previously concluded Atom would have to run at nearly twice the frequency of a dual-core A9 to achieve the same performance, ARM's power efficiency is about twice as good as Intel's. In reality these power numbers are not directly comparable (full-chip TDP including L2 versus core-only 'Total Power', although they're closer than you might think) and we might be underestimating Atom's performance, so ARM's power efficiency might 'only' be 50% higher here.
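For transparency, here's the back-of-the-envelope arithmetic behind 'about twice as good', assuming (my assumption, purely for illustration) that the performance-optimised dual-core macro scales roughly linearly with frequency down to the ~1GHz point we need:

    dual-core A9 @ ~1GHz : 1.9W x (1.0/2.0) ~= 0.95W   (crude linear scaling, ignores voltage scaling)
    Atom Z550 @ 2GHz     : 2.4W                         (roughly equivalent performance per our earlier estimate)
    ratio                : 2.4 / 0.95 ~= 2.5x

which lands in the same ballpark as the factor of two above once you account for the TDP-versus-core-power caveat and the possibility that we're underestimating Atom.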

These estimates are extremely approximate but the conclusion is clear: although it's apparently not as bad as some critics and/or competitors make it out to be, Intel's Atom is still far from competitive in either die size or power consumption. While many of Intel's design choices (e.g. SMT) seem to make sense on paper, the devil is in the details and the result is disappointing. And that's on a 45nm process which is not as dense as TSMC's 40nm but would probably be *more* power efficient thanks to High-K. On the other hand, AMD's Bobcat is more impressive (at least in terms of perf/mm² - but that goes beyond the scope of this article), so it's unlikely the 45nm Atom is as good as it gets. The real question is how much better and how soon.

Medfield: A New Hope


Will the 32nm Medfield have not only an improved system architecture but also an improved Atom processor? Signs point to no. At best we'll probably see a few minor changes like an increase from 24KB to 32KB of L1 data cache and the only potential surprise lies in the Atom core's synthesis (it's very possible that the quality of the implementation on 45nm was a significant factor). But Intel might actually get away with this thanks to their process advantage: if they and their partners do manage to get Medfield phones out in 2H11, that's quite a bit earlier than any phone based on 28nm ARM processors (at the earliest: Q1 2012 for Tegra3 & Q2/Q3 2012 for OMAP5/Snapdragon2). So how would it compare to the current 40nm ARM generation?

We estimated the original 45nm Atom requires at least 50% more power than a dual-core Cortex-A9 at half the frequency. It is conceivable that Atom on 32nm could get very close to the power efficiency of competing 40nm ARM processors. As for performance, it should actually be higher if the smartphone variant can reach 2GHz. So amazingly enough (and contrary to my own expectations when I began writing this article!) Intel does have a window of opportunity with Medfield.

On the other hand, we have assumed lots of luck and perfect execution for everything so far, and reality does tend to get in the way. The best case scenario is probably that Medfield is slightly worse than existing 40nm SoCs in every way except for slightly higher CPU/GPU performance, and then they might get design wins at a few major OEMs (e.g. Nokia) and achieve double-digit market share of the high-end (>$400?) Android/MeeGo phone market. The worst case scenario is obviously that they fail completely (see: Moorestown).

It's complicated


Even if Intel executes properly on the hardware front, an immature software ecosystem (e.g. x86 Android, although MeeGo should be fine) might delay their partners' projects until everyone decides to pull the plug in favour of unambiguously superior competitors (e.g. 28nm Tegra3). That scenario would be somewhat comparable to what the NVIDIA Tegra1 or Freescale i.MX5x went through (although that goes beyond the scope of this article). And if Intel doesn't manage to improve the system architecture/efficiency enough on Medfield, or the Atom core still burns massively more power than a dual-core A9 on 40nm, then they're in big trouble (and many people think that will obviously be the case - it's certainly a possibility, but I don't think it's quite that likely).

The original selling points that Intel had in mind for Atom (Flash support and significantly higher performance than ARM) are long gone, but they are attacking the handheld market more aggressively than ever before. Is this a turning point? Maybe. Intel can't have much of an impact in a single generation, and they've kept the roadmap beyond Medfield very close to their chest. This implies we might finally see some real processor architecture changes, but it's not clear what (3-issue in-order is bizarre, 2-issue Out-of-Order is overkill with SMT, and 3-issue Out-of-Order is just too much). Even if we hope for more, incremental changes are still the most likely, and it remains to be seen whether that can be enough to compete with dual-core A15s and quad-core A9s on 28nm (and quad-core A15s on 20nm, but probably only one year after Intel's 22nm platform).

With luck, Intel will carve itself a niche in the handheld market and grow it further over the next several years. But another threat is emerging at the same time: ARM is attacking netbooks, notebooks, servers, and even desktops...


ARM Going Heavy


Servers


ARM has been making noise about entering the server market for quite some time. Do they stand a chance? The current generation from Marvell (quad-core PJ4) and Calxeda (quad-core Cortex-A9) probably won't set the world on fire. They are still 32-bit processors and don't even have the 40-bit addressing capability of the Cortex-A15, and their performance per chip isn't very impressive, to say the least. If someone is going to invest in a different architecture for cloud computing or networking, today they'd be much better off switching to Tilera.

On the other hand, ARM is a mainstream ISA and an investment in it today will certainly be more useful 5 or 10 years from now. Calxeda presumably has a roadmap to 8-core Cortex-A15s to bring that point home to potential customers. But another risk is that Intel might decide to release a many-core Atom chip for servers, and that would probably close the gap enough that the switch would no longer be worth the effort. Still, for now there does seem to be a small opportunity here.

NVIDIA Project Denver: HPC & Gaming


Another part of the server market to consider is High Performance Computing. NVIDIA took everyone by surprise at CES 2011 when they announced Project Denver, their own high-performance ARMv8 core, and Mike Rayfield has confirmed that it would first be used in NVIDIA's Maxwell GPU generation, so we're looking at 2013-2014 on 20nm. This will allow NVIDIA to provide the entire platform to server customers rather than depend on x86 processors and chipsets from Intel/AMD, as well as increase CUDA performance and programming flexibility by reducing the latency between the CPU and GPU. The fact that it's ARM-based doesn't really have any disadvantage in this market (it's custom software on Linux anyway) and it's certainly easier and more credible than a proprietary ISA.

But Project Denver isn't only for HPC on Linux: it's also eventually for consumers on Windows 8 (indeed, it's not obvious NVIDIA could justify the R&D cost with HPC alone). And historically, NVIDIA's HPC chips have always doubled up as ultra-high-end gaming chips - so might NVIDIA try to attack the desktop CPU market at the same time and with the same chip? No, because Windows 8 for ARM won't be able to run existing x86 software (it won't have an emulation layer, as per Microsoft's own admission), so existing PC games won't work - not exactly a great selling point for an ultra-high-end gaming rig. The only way around that would be if NVIDIA built their own x86 emulation layer on top (software-centric to bypass licensing issues, but perhaps with some dedicated instructions a la China's Loongson processor?). That could be a viable way to run a lot of legacy software, but it could still be too slow for a high-end gaming platform.

They can push developers to release universal binaries that work on both ARM and x86 (which isn't that difficult, and would be helped further if one of the next-generation consoles used ARM CPUs), but that still won't make older games work. The only solution (besides x86 support) is ARM performance fast enough that all newer games run perfectly, and x86 emulation fast enough that all older games work. But given the performance requirements of games that will be released within the next 2+ years (>3GHz quad-core Sandy Bridge for a stable 30FPS?) before anyone could even consider universal binaries, that's easier said than done. It looks like there might be no good solution here.

Windows 8: It's not a bug, it's a feature!


Microsoft's software strategy for Windows 8 is an interesting and complicated question mark. There are some good leaks about Jupiter (new UI framework/layer) and Mosh (presumably tablet-only tile-based OS UI) along with a possible application store, and we know Steve Ballmer called Windows 8 their 'riskiest product bet'. Simply adding ARM support to Windows isn't that risky; at worst, it will fail and they'll be stuck with the x86 version for all intents and purposes. What would certainly be a lot more risky is forcing a new software development paradigm on everyone - and it looks like that's exactly what Microsoft wants to do with Jupiter and their app store, and combined with Mosh it's unlikely an outside observer would even recognise this as Windows.

In theory, managed applications (i.e. .NET) are compiled Just-in-Time and therefore ISA-independent, so it's quite possible that Microsoft will get those working on ARM. But it's likely that Microsoft's strategy (not just for tablets but for all consumer applications) is Jupiter, and that implies an ironic fact: from Microsoft's perspective, an x86-to-ARM emulator for native applications would probably be a bad thing, because it reduces the incentive for developers to create new applications with Jupiter.

This makes for a viable platform in tablets, netbooks, and maybe even low-cost 14-to-15" notebooks. But it's completely unthinkable for desktops (except maybe HTPCs) and it seems both Microsoft and ARM are aware of that. It remains to be seen what happens in the Windows 9 timeframe, but for now we must attend to reality.

Returns: Don't let the door hit you on the way back


But before any of that becomes a concern, we'll first have Windows 8 tablets running on the same SoCs as Android tablets. Their success will depend almost exclusively on the quality of the user interface and developer environment (both of which will be very different from Windows 7). And then we'll also have ARM Windows 8 netbooks and notebooks. Most of these will probably use 1.5-2.0GHz dual-core Cortex-A15s on 28nm (quad-core A9s are also possible but not quite as desirable) or 32/22nm Intel Atom SoCs. While the lack of x86 compatibility is unlikely to be a big deal for mainstream consumers on tablets, it almost certainly will be for netbooks and especially notebooks.

There's only one thing worse than weak sales: strong sales with a very high return rate. This is what doomed Linux on netbooks and let Microsoft eventually take back the entire market with a cheaper Windows XP license. The fact it's still Windows would increase the number of satisfied customers compared to Linux, but there will also be more people who buy it without understanding the trade-off at purchase time. If ARM wants to achieve sustainable success in these new markets, they need the difference to be as obvious as possible (e.g. very different Windows SKU name, OEMs indicating it as clearly as possible in the specs and on stickers, and educated retailers). They must not let history repeat itself or the market might never recover.

Overall, ARM is in but x86 isn't out. The barrier to entry for these markets remains significant but it is no longer overwhelming. With luck, ARM could steal quite a lot of share from x86 within 5 years. But the level of uncertainty is extremely high on all sides so it's impossible for anyone to make any reliable prediction. And of course, that's what makes this so interesting.