Marvell
Marvell once claimed that they might have more engineers working on their ARM cores than ARM itself. While that is almost certainly an exaggeration, this isn't: as of late 2009, Marvell was shipping 1 billion (yes, billion!) chips annually, of which 650 million included their in-house ARM cores. That represents ~16% market share for the ARM ISA (4B ARM processors in 2009) - and yet Marvell is a low-key company that few people have even heard of, let alone think about when considering the ARM market. That's because Marvell sells a lot of chips that are deeply embedded inside a huge variety of products (e.g. networking/printers/storage/CE) while the ARM cores themselves are also deeply embedded within those chips.
Marvell's processor expertise comes from two distinct acquisitions: the widely publicised purchase of Intel's XScale business in 2006 and the rarely mentioned acquisition of the much smaller ASICA in 2003. By the time Marvell acquired XScale, they already had their own Out-of-Order Execution architecture with both single-issue and dual-issue variants. They then merged the teams, which resulted primarily in the Sheeva PJ1 in 2008 and the PJ4 in 2009.
From DEC to Intel to Marvell: StrongARM & XScale
But first let's look at the original high-performance ARM cores: StrongARM and XScale. The StrongARM is very similar to the ARM9 (5-stage pipeline/Harvard architecture/no branch prediction) but was available about two years earlier in mid-1996 (and so it used ARMv4 rather than ARMv5). It was the only ARM processor which could compete on performance with MIPS cores (which had a 5-stage pipeline from the start).
XScale is also very similar to the ARM11 (once again being available much earlier) and achieved higher clocks despite having 7 stages rather than 8 (no separate issue stage) thanks both to good design and custom logic. But while it was very impressive on release, the rate of innovation over the years wasn't. The only major improvements were made in 2004 with the PXA27x: Wireless SpeedStep (downclock the processor based on load) and the ill-advised Wireless MMX (essentially the full MMX ISA combined with integer SSE, used mostly for video decoding - a ridiculous idea as they would have been much better served by fixed-function hardware).
And even less excusable is the rate of process migration: XScale started on 180nm and was still on 130nm when Marvell bought it in 2006 (despite Intel ramping 65nm hard on the desktop) so it was actually at a process disadvantage versus Texas Instruments and Qualcomm! This is a mistake Intel is trying not to repeat with Atom although Atom tape-outs are still somewhat behind desktop processors (but even at TSMC the 'High Performance' variants of 28nm High-K and 20nm will be available first so Intel isn't alone there).
ASICA: OoOE! Or not, kinda!
Surprisingly enough, the Feroceon processors designed by the ASICA team are a bit more exciting. Marvell describes most of them as sporting Out-of-Order Execution but in most cases that's very misleading as instructions are still issued/dispatched in-order. There is some ambiguity as to whether some of their cores support basic Out-of-Order Issue (Instruction Shelving), but it seems extremely unlikely that any of them support register renaming like the Cortex-A9/A15.
They have a basic ROB (to maintain in-order retirement) that allows for a Variable Length Pipeline, which means some execution units will still be executing an older instruction when a newer one is already done. Arguably this should be called Out-of-Order Completion, but that term is often used for a more specific trick that doesn't require a ROB, so Marvell's marketing decision is understandable. They also have a completely in-order variant ('Dragonite') for minimum-cost applications such as storage and wireless (e.g. the highly successful 88W8686, which kickstarted the mobile WiFi market and was used in the early iPhones and iPod Touch); it is most comparable to the ARM7TDMI and Cortex-R4.
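To make the distinction concrete, here is a minimal Python sketch of in-order issue with out-of-order completion and in-order retirement. It is a toy model, not a description of any Feroceon core: the instruction names and latencies are invented, one instruction issues per cycle, and data dependencies are ignored entirely.

```python
def simulate(program):
    """program: list of (name, latency) pairs, issued in order, one per cycle.

    Returns (completion_order, retirement), where retirement is a list of
    (name, cycle) pairs in the order the ROB releases them.
    """
    # Each instruction finishes once its (variable) latency has elapsed.
    finish = [(issue_cycle + latency, name)
              for issue_cycle, (name, latency) in enumerate(program)]

    # Completion can happen out of order; sorted() is stable, so ties keep
    # program order.
    completion_order = [name for _, name in sorted(finish, key=lambda f: f[0])]

    # The ROB only releases its oldest entry, so retirement stays in order.
    retirement, retire_cycle = [], 0
    for done, name in finish:
        retire_cycle = max(retire_cycle, done)
        retirement.append((name, retire_cycle))
    return completion_order, retirement


# Hypothetical latencies, chosen only to show the effect.
program = [("load", 4), ("add", 1), ("mul", 3), ("sub", 1)]
completed, retired = simulate(program)
print(completed)  # ['add', 'load', 'sub', 'mul']
print(retired)    # [('load', 4), ('add', 4), ('mul', 5), ('sub', 5)]
```

In this example the add finishes two cycles before the load that precedes it, yet it cannot retire until the load does.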
Here's a quick overview of the various cores with a variable pipeline length:
- 88FR301: First Feroceon core, 5-8 stages, single-issue (ARMv5) (refresh: 88FR331)
- 88FR531: Also known as Jolteon, 5-8 stages, dual-issue (ARMv5) (refresh: 88FR571)
- PJ1: Also known as 88FR131/Mohawk, 5-8 stages, single-issue (ARMv5 with Wireless MMX).
- PJ4: Unknown codename, 6-9 stages, dual-issue (ARMv7 with Wireless MMX & Optional NEON).
Surprisingly Fast
We know the DMIPS scores of the PJ1 and PJ4: 1.46/MHz and 2.41/MHz respectively. The former is about what you'd expect for a single-issue core with a short pipeline length, but the latter is more surprising. The Cortex-A8/A9 achieve 2.0/MHz and 2.5/MHz respectively, whereas Qualcomm's Snapdragon does 2.1/MHz. Could Marvell really be that close without any Out-of-Order Issue or register renaming logic? Not likely but maybe, and that says more about how bad Dhrystone is as a benchmark than it does about the PJ4. For example, Dhrystone fits completely within any modern CPU's L1 cache, so OoOE with register renaming doesn't help as much as in the real world.
But an interesting possibility is that the PJ4 does support some kind of Out-of-Order Issue without register renaming. The architecture requires a register scoreboard anyway (it keeps track of data dependencies by tracking which registers will be written to by instructions already in progress - which means new instructions reading those registers must wait) and there's also a ROB to make sure instructions always retire in-order to the register file. So if you added a small shelving buffer, you could issue any two independent instructions in a given cycle (unlike the Cortex-A8 which must always issue two adjacent instructions).
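The idea can be sketched in a few lines of Python. This is purely illustrative and makes no claim about how the PJ4 actually works: the `Instr` tuple, the window size, the latencies and the conservative hazard rules are all assumptions. The `in_order` flag models a front end that can only issue adjacent instructions, while the default mode lets younger instructions pass a stalled one held in a small shelving buffer.

```python
from collections import namedtuple

# name must be unique; dest/srcs are register names; latency is in cycles.
Instr = namedtuple("Instr", "name dest srcs latency")


def can_issue(ins, older_waiting, busy):
    """Scoreboard-style check, with no register renaming available."""
    regs = set(ins.srcs) | {ins.dest}
    # A register still being written by an in-flight instruction blocks us
    # (RAW on a source, or WAW on the destination).
    if any(r in busy for r in regs):
        return False
    for old in older_waiting:
        if old.dest in regs:        # RAW or WAW against a shelved instruction
            return False
        if ins.dest in old.srcs:    # WAR: the older one has not read it yet
            return False
    return True


def run(program, window=4, width=2, in_order=False):
    """Return {name: issue_cycle} for a toy dual-issue front end."""
    waiting = list(program)
    busy = {}        # scoreboard: register -> cycle its result is available
    issued_at = {}
    cycle = 0
    while waiting:
        # Drop scoreboard entries whose results are available this cycle.
        busy = {r: t for r, t in busy.items() if t > cycle}
        skipped = []
        issued_now = 0
        for ins in waiting[:window]:
            if issued_now == width:
                break
            if can_issue(ins, skipped, busy):
                busy[ins.dest] = cycle + ins.latency
                issued_at[ins.name] = cycle
                issued_now += 1
            elif in_order:
                break                # a stall blocks everything behind it
            else:
                skipped.append(ins)  # shelved: younger instructions may pass
        waiting = [i for i in waiting if i.name not in issued_at]
        cycle += 1
    return issued_at


program = [
    Instr("ldr", "r1", ("r0",), 3),  # hypothetical 3-cycle load
    Instr("add", "r2", ("r1",), 1),  # needs r1, so it must wait for the load
    Instr("sub", "r3", ("r4",), 1),  # independent
    Instr("eor", "r5", ("r6",), 1),  # independent
]
print(run(program, in_order=True))  # {'ldr': 0, 'add': 3, 'sub': 3, 'eor': 4}
print(run(program))                 # {'ldr': 0, 'sub': 0, 'eor': 1, 'add': 3}
```

With adjacent-only issue, the independent sub and eor are stuck behind the stalled add; with the shelving buffer they slip past it and the whole sequence issues one cycle sooner.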
It's not obvious that this is the case but if so the PJ4 is a nice step between the A8 and A9. Without register renaming, Out-of-Order Issue will be limited when instructions reuse old registers (a false dependency) but ARM has 15 General Purpose Registers versus only 8 for 32-bit x86, so with a smaller window size than the A9 (which you'd want for lower cost anyway) it's probably a good trade-off. And the PJ4 is actually clocked higher than the Cortex-A9 on the same process (1GHz on 55nm/1.5GHz on 40nm - not clear how much custom logic this requires) so it clearly has excellent performance per mm².
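For readers less familiar with the terminology, the short Python sketch below classifies the hazards between two instructions and shows why only the false ones disappear with renaming. The tuple encoding and the physical register name `p17` are made up for the example.

```python
def hazards(older, younger):
    """Each instruction is (dest_register, source_registers)."""
    o_dest, o_srcs = older
    y_dest, y_srcs = younger
    found = set()
    if o_dest in y_srcs:
        found.add("RAW")   # true dependency: younger needs older's result
    if y_dest in o_srcs:
        found.add("WAR")   # false dependency: younger reuses a register name
    if y_dest == o_dest:
        found.add("WAW")   # false dependency: both write the same name
    return found


add = ("r2", ("r0", "r1"))   # add r2, r0, r1
mov = ("r1", ("r3",))        # mov r1, r3  (reuses r1 as a destination)
sub = ("r4", ("r2",))        # sub r4, r2  (really needs the add's result)

print(hazards(add, mov))     # {'WAR'}
print(hazards(add, sub))     # {'RAW'}

# Renaming gives the mov a fresh (hypothetical) physical register, which
# removes the false dependency; a true RAW dependency cannot be renamed away.
renamed_mov = ("p17", mov[1])
print(hazards(add, renamed_mov))  # set()
```

Without renaming, the mov has to wait until the add has read r1 even though the two instructions share no data. The more architectural registers a compiler has to play with, the less often it is forced into that kind of reuse.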
Both the PJ1 and the PJ4 exist in multi-core configurations (optionally with heterogeneous multiprocessing as explained on the previous page) and the PJ4 has a NEON variant (used on the ARMADA 628), so they're certainly very modern processor cores that give Marvell a small competitive advantage in a wide variety of markets from embedded applications to handhelds and even servers (with the 4xPJ4-based ARMADA XP). While Marvell's success with the ARMADA 6xx has been disappointing for reasons that go beyond the scope of this article, the PJ4 itself remains competitive for now and it will be interesting to see what Marvell comes up with in the Cortex-A15 timeframe.
Qualcomm
Qualcomm presumably started work on their Scorpion CPU core in 2003-2004. They first announced it in late 2005, then announced the associated 65nm Snapdragon SoC in late 2006, started sampling Snapdragon in late 2007, and it was finally available in the Windows Mobile-based Toshiba TG01 in mid-2009 before taking over the high-end Android market in early 2010. And that's just the first-generation QSD8x50 chip; needless to say these things take time!
The second-generation 45nm MSM7x30/MSM8x55 (same chip/different SKUs) started sampling in mid-2009 and was first available on the T-Mobile G2 in late 2010. The dual-core 45nm MSM8x60 started sampling in mid-2010 and should be first available in Q2/Q3 2011. And finally, the next-generation 28nm MSM8960 with LTE support and a completely new CPU architecture is expected to sample in early 2011 and should compete head-on with Cortex-A15 SoCs (but those are only expected to sample 2-3 quarters later).
Inside Scorpion
Despite being developed completely in-house, Scorpion appears at first to be little more than a Cortex-A8 that achieves 25%+ higher clock speeds on the same process. It even has the same 13-stage dual-issue integer pipeline and 10-stage NEON pipeline that sits after the integer execution units (total fixed pipeline length of 23 cycles for NEON as on the A8). The dual-core variant also has a shared L2 cache (but only 256/512KB for single/dual-core respectively versus 512/1024KB for most competitors).
But when you look deeper, there are several important differences. The most obvious is that all the NEON units are 128-bit wide like on the A15 (Scorpion clock-gates half the unit for 64-bit instructions) and there's a pipelined VFP unit to run scalar floating-point instructions at full speed (which ARM only added back on the Cortex-A9). And presumably Scorpion's NEON also hides the L1 latency completely and supports limited dual-issue like on the Cortex-A8, so it should definitely be much faster than on the Cortex-A9.
Also noteworthy is that the MSM8672 (for tablets/smartbooks) supports heterogeneous multiprocessing somewhat like Marvell: the two CPU cores each have their own voltage regulators (which adds some engineering complexity and a tiny bit of cost to the power management chip) so that they can simultaneously run at different clocks/voltages unlike current Cortex-A9 designs (it's unclear what exactly Marvell does on that front). And there have been some indications that the two cores might have been synthesised separately (one using higher leakage transistors to reach 1.5GHz and the other limited to a slightly lower frequency and used by default for lower performance tasks) although the difference isn't as extreme as in Marvell's case.
Out-of-Order? Really?
More mysterious are the possible differences on the integer side: Qualcomm has hinted several times that Scorpion has some limited form of Out-of-Order Issue but nothing more than that is known (so most people have rightfully been skeptical that it's anything more than a gimmick, if it's even true). It certainly has no ROB (fixed length pipeline) and no register renaming of any kind. But it does achieve 2.1 DMIPS/MHz on the same compiler as the A8's 2.0 DMIPS/MHz (apparently with a small margin of error) despite all the similarities, so it's plausible that there genuinely is something going on here. It could be something rather boring like a slightly lower latency L1 cache, but that's not very likely and it probably wouldn't be enough on its own. Then what?
By elimination, it seems that Scorpion might have a small shelving buffer. This is not very different from what we just described for the Marvell PJ4, but it's necessarily even less flexible because there's no ROB. That means there's no guarantee results can be retired to the register file in-order and out-of-order instructions cannot be cancelled in the event of a pipeline flush (exception or branch misprediction), so under no circumstance should an instruction issue ahead of another which could potentially force a flush. Given that limitation (and probably more in practice), the relatively small performance boost that should be expected anyway, and the level of ambiguity in Qualcomm's claims, it's far from obvious that this is actually what Qualcomm did. But it's still the most likely possibility as far as we can tell...
(EDIT: one alternative mentioned by Gubbi on the feedback thread is zero-cycle load-to-use rather than the A8's one-cycle through more heavily delayed/skewed execution - it's not Out-of-Order Issue but you could describe it as very limited OoOE. It wouldn't make much sense for the Marvell PJ4 however, so an instruction window is still very likely there).
After the Sting: Qualcomm's Next Generation
Finally, what about Qualcomm's next-generation CPU architecture in the MSM8960? All we know is that it achieves 5x the DMIPS of the original Snapdragon at 75% less power (which is presumably for the same level of performance, so 25% *more* power at maximum performance). That's almost certainly for a dual-core configuration, so it should be 5250 DMIPS per core (2.5x the QSD8250's 2100 DMIPS). At 1.5GHz that would be 3.5 DMIPS/MHz, and at 2.0GHz about 2.6 DMIPS/MHz (3.0 DMIPS/MHz would correspond to 1.75GHz). At the upper end of that range it's only slightly slower than the Cortex-A15, which would seem to imply a similar architecture (OoOE with at least 3 instruction decoders; everything else could be quite different).
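The arithmetic behind those estimates is easy to reproduce. The sketch below uses only the figures quoted above plus two assumptions of ours: that the 5x claim covers two cores combined, and that power scales linearly with delivered performance. The function names are invented for the example. Note that 3.0 DMIPS/MHz falls out at 1.75GHz, while 2.0GHz works out to about 2.6.

```python
QSD8250_DMIPS = 2100  # original Snapdragon figure quoted above


def per_core_dmips(speedup=5, cores=2):
    """Per-core DMIPS, assuming the speedup claim covers all cores together."""
    return speedup * QSD8250_DMIPS / cores


def dmips_per_mhz(clock_mhz, speedup=5, cores=2):
    return per_core_dmips(speedup, cores) / clock_mhz


def relative_peak_power(speedup=5, power_at_same_perf=0.25):
    """Assumes power scales linearly with delivered performance."""
    return speedup * power_at_same_perf


print(per_core_dmips())            # 5250.0
for mhz in (1500, 1750, 2000):
    print(mhz, dmips_per_mhz(mhz))
# 1500 3.5
# 1750 3.0
# 2000 2.625
print(relative_peak_power())       # 1.25, i.e. 25% more power at full tilt
```

If the 5x figure actually referred to a single core, every per-MHz number here would double, which is one reason to treat these estimates as a rough bracket rather than a prediction.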
But it might not be that simple: heterogeneous multiprocessing would complicate the numbers, and more importantly it's very possible that it supports SMT (2 threads/core) which would imply lower performance for single-threaded workloads but probably slightly lower cost and higher power efficiency most of the time. And it gets even more complicated when we consider smaller implementation details (e.g. single large issue queue or multiple small ones like the Cortex-A15?) - there's really no point speculating any further without more information. But either way, Qualcomm's CPU roadmap looks strong on paper and it only remains to be seen whether these upcoming chips manage to be as successful as the original Snapdragon.