#26 |
|
Member
Join Date: Jan 2009
Posts: 229
|
Thanks everyone! The app is now published to Google Play:
https://play.google.com/store/apps/d...divine.rgbench
The results may differ slightly from earlier versions, even v1.2, because I found and fixed one more bug. |
|
|
|
|
|
#27 |
|
Member
Join Date: Jan 2009
Posts: 229
|
I apologize to everyone who tested the earlier versions, since they had bugs in the code.
I think everything is fine now, though. Please use the Play version from now on; I have removed the APKs from my site. If I ever meet anyone who spent time on the previous versions, I will buy you a beer to compensate for your time. |
|
|
|
|
|
#28 |
|
Member
Join Date: Jan 2009
Posts: 229
|
Some results reported by users:
Nexus 7 (Tegra 3): 1488 MFlops
Galaxy S2X (Snapdragon S3 dual-core): 1175 MFlops
Galaxy S2 (Exynos 4 dual-core): 998 MFlops
Acer A100 (Tegra 2): 714 MFlops |
|
|
|
|
|
#29 |
|
Tiled
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
|
So how close to peak performance is the benchmark achieving? I'm more ignorant about ARM CPU performance than maybe I ought to be. Someone clue me in
__________________
A major redesign of the core ALU pineapple boomerang fortress. |
|
|
|
|
|
#30 |
|
Member
Join Date: Jan 2006
Location: France
Posts: 200
|
On my good old SGS1 (i9000) under JB (cm10 20/09/2012) and Mackay Kernel v0.66 :
1thread : ... 68 Mflops
__________________
- I'm french. Sorry if you don't understand what i say - |
|
|
|
|
|
#31 | |
|
Member
Join Date: Jan 2009
Posts: 229
|
Quote:
edit: Much lower efficiency on Tegra 3 though. |
|
|
|
|
|
|
#32 |
|
Member
Join Date: Jan 2009
Posts: 229
|
|
|
|
|
|
|
#33 | |
|
Member
Join Date: Jan 2006
Location: France
Posts: 200
|
Quote:
Nice new bench btw.
__________________
- I'm french. Sorry if you don't understand what i say - |
|
|
|
|
|
|
#34 |
|
Member
Join Date: Jan 2009
Posts: 229
|
Copying my comment from TR forums:
I think after looking at the multithreaded numbers so far, we can conclude the following: Cortex-A9 implementations are achieving about 0.4 flops/cycle in multithreaded mode on Exynos and OMAP. Snapdragon S3 is also doing 0.4 flops/cycle. However, Tegra 3 appears to be the exception to the rule and is only achieving about 0.32 flops/cycle on average. Tegra 2 is also stuck at about 0.36 flops/cycle. I wonder if it has something to do with Nvidia's memory controller.

About the single-threaded results: the single-threaded and multithreaded versions actually work on different problem sizes. The single-threaded one works on smaller matrices. This was to ensure that the single-thread case does not take very long to run, but I am now starting to think that was a poor design decision on my part. So results from the single-threaded and multithreaded modes are not directly comparable. I think I will provide settings to choose the matrix sizes yourself sometime. |
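A minimal sketch of the arithmetic behind these flops/cycle figures, assuming they are normalized per core. The MFlops values are the user-reported results from earlier in the thread; the clock speeds and core counts are assumed nominal values that the thread does not state, so the output is only a rough cross-check.

```python
# Flops per cycle per core = MFlops / (clock in MHz * number of cores).
# MFlops figures are the results reported earlier in the thread.
# ASSUMPTION: the clock speeds and core counts below are nominal values
# chosen for illustration; they were not measured during the benchmark.

def flops_per_cycle(mflops: float, clock_mhz: float, cores: int) -> float:
    """Average floating-point operations per clock cycle, per core."""
    return mflops / (clock_mhz * cores)

# (device, reported MFlops, assumed clock in MHz, assumed core count)
reported = [
    ("Nexus 7 (Tegra 3)", 1488, 1200, 4),
    ("Galaxy S2 (Exynos 4 dual-core)", 998, 1200, 2),
    ("Acer A100 (Tegra 2)", 714, 1000, 2),
]

for name, mflops, mhz, cores in reported:
    rate = flops_per_cycle(mflops, mhz, cores)
    print(f"{name}: {rate:.2f} flops/cycle/core")
```

If a device throttles or runs at a different clock than assumed here, the per-cycle figure shifts accordingly, which is why the actual clock during the run matters.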
|
|
|
|
|
#35 |
|
Member
Join Date: Jan 2009
Posts: 229
|
|
|
|
|
|
|
#36 | |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,631
|
Quote:
An important question is whether loads and stores can be issued on the second issue cycle of the FP64 operation. My guess is not.

The typical textbook tiling algorithm for traditional matrix multiplication (like this http://en.wikipedia.org/wiki/Loop_tiling) will do an RMW on every element. It's good for cache locality but poor for register locality. On a CPU with as much load and store as FLOP co-issuing capability this may not be a problem, but on many (most?) platforms this won't be the case.

Cortex-A9 would definitely benefit from anything that can improve the load/store to FLOP ratio. There just aren't really enough registers to hide a lot. Not really sure what'd be the best kernel for something like this; someone else here probably already has experience with something like that. |
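To illustrate the register-locality point, here is a hedged sketch comparing a textbook tiled multiply that does a read-modify-write on every element with a variant that accumulates each output element in a local variable (standing in for a register). The function names and the access-counting scheme are assumptions for illustration, not the benchmark's actual kernel.

```python
def matmul_rmw(a, b, n, tile):
    """Textbook tiling: C[i][j] is read, updated and written back for
    every multiply-add. Counts array accesses as written in the loop."""
    c = [[0.0] * n for _ in range(n)]
    mem_ops = 0
    for kk in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(n):
                for k in range(kk, min(kk + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        c[i][j] += a[i][k] * b[k][j]
                        mem_ops += 4  # load A, load B, load C, store C
    return c, mem_ops


def matmul_acc(a, b, n, tile):
    """Same tiling, but each C element is held in a local accumulator
    across the k-tile and stored once at the end of that tile."""
    c = [[0.0] * n for _ in range(n)]
    mem_ops = 0
    for kk in range(0, n, tile):
        for jj in range(0, n, tile):
            for i in range(n):
                for j in range(jj, min(jj + tile, n)):
                    acc = c[i][j]
                    mem_ops += 1  # load C once per k-tile
                    for k in range(kk, min(kk + tile, n)):
                        acc += a[i][k] * b[k][j]
                        mem_ops += 2  # load A, load B
                    c[i][j] = acc
                    mem_ops += 1  # store C once per k-tile
    return c, mem_ops


n, tile = 8, 4
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
c1, ops_rmw = matmul_rmw(a, b, n, tile)
c2, ops_acc = matmul_acc(a, b, n, tile)
fmacs = n ** 3
print(f"RMW: {ops_rmw / fmacs:.2f} memory accesses per multiply-add")
print(f"accumulator: {ops_acc / fmacs:.2f} memory accesses per multiply-add")
```

Both variants compute the same product; only the number of loads and stores per multiply-add changes, which is the ratio the post argues Cortex-A9 is sensitive to.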
|
|
|
|
|
|
#37 | |
|
Member
Join Date: Jan 2009
Posts: 229
|
Quote:
|
|
|
|
|
|
|
#38 |
|
Member
Join Date: Jan 2009
Posts: 229
|
My blog post describing some preliminary analysis including GCC generated assembly of the innermost loop to keep Exophase happy
http://codedivine.org/2012/09/25/pre...sis-rgbenchmm/ |
|
|
|
|
|
#39 |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,631
|
Thanks, I just read it
I wonder what kind of results you get just running the inner loop on the same data over and over again, to try to isolate the effects of cache misses (and to a much lesser extent, branch mispredicts). If cache misses really are showing up as a large percentage, some prefetch instructions would help.

If the inner loop cost w/o memory stalls is close to as high as the number you reported then I'm pretty surprised. I really expect there to be just 16 cycles for the FMACs, 6 cycles for the loads, and - if Cortex-A9 dispatches VFP/NEON instructions anything like A8 does - zero cycles for everything else since it'd be running in parallel. Are you sure that the Tegra 3 is running at the clock speed you think it is, throughout the duration of the test?

On closer look, I think you might be suffering stall cycles due to a WAW hazard with the second vld (to d6) hitting the most recent vmacd (accessing d6). They'd appear almost back to back in the VFP pipeline, depending on how loads work. I wouldn't expect it to push you anywhere close to what you're seeing though, not by itself.

Are you compiling with VFPv3-d16? Is Tegra 2 using that? If so, I'd be curious to see a version compiled with full VFPv3 and a bigger kernel that can make use of more registers.

Last edited by Exophase; 25-Sep-2012 at 20:15. |
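The cycle estimate above can be turned into a rough upper bound. This is a back-of-envelope sketch only: it uses the 16 and 6 cycle figures quoted in the post, and it assumes 8 FP64 FMACs per loop iteration (2 cycles each), a count the post does not state. The measured value is the approximate Cortex-A9 multithreaded figure reported earlier in the thread.

```python
# Simple issue model: loop cycles = FMAC cycles + load cycles, with all
# other instructions assumed to overlap (as the post suggests).
# ASSUMPTION: 8 FMACs per iteration; this is a hypothetical count that is
# merely consistent with 16 cycles at 2 cycles per FP64 FMAC.

def ideal_flops_per_cycle(n_fmacs: int, fmac_cycles: int, load_cycles: int) -> float:
    """Flops per cycle if the loop ran with no stalls; 1 FMAC = 2 flops."""
    flops = 2 * n_fmacs
    return flops / (fmac_cycles + load_cycles)

ideal = ideal_flops_per_cycle(n_fmacs=8, fmac_cycles=16, load_cycles=6)
measured = 0.4  # approximate flops/cycle reported earlier in the thread

print(f"ideal: {ideal:.2f} flops/cycle")
print(f"measured is about {measured / ideal:.0%} of the ideal")
```

Under these assumptions the stall-free loop would be well above the measured rate, which is consistent with the suspicion that memory stalls or pipeline hazards account for the gap.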
|
|
|
|
|
#40 | |||
|
Member
Join Date: Jan 2009
Posts: 229
|
Quote:
Quote:
Quote:
I guess I should look at using the -mfpu=vfpv3-d16 flag instead? I don't think I will be changing the Play Store kernel code now, as people seem happy enough with it, but I will do a build and post a link to an APK here soon for analysis purposes.

Thanks for your detailed remarks. This is why I love B3D (and also the Tech Report forums). I got severely burnt by trolling at the RWT forums (though partly because of my own idiocy and miscommunication). |
|||
|
|
|
|
|
#41 |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,631
|
You should be fine continuing to NOT use that flag, I just wanted to make sure you weren't. This should also mean you're free to use a bigger kernel without worrying about running out of registers just yet. That should give you a better load/FLOP ratio as well as help avoid that stall. Could also help with prefetching.
Don't feel too bad about RWT, I get burned there too |
|
|
|
|
|
#42 | |
|
super willyjuice
Join Date: May 2005
Location: Astoria, NY
Posts: 998
|
Quote:
Good work codedivine, removing the JIT variable is essential for benchmarking. Definitely worth a 5 star rating (and not because you're from B3D!). |
|
|
|
|
|
|
#43 |
|
Member
Join Date: Jan 2009
Posts: 229
|
Thanks willardjuice
I was not clear about VFP versions earlier, so here is some clarification:
1. Using the default NDK switches for the ARMv7-A ABI, you get the 16-register version of VFP. I was using this until now. I tried using a larger inner loop and got register spills and bad performance.
2. If you are sure that your hardware supports 32-register VFP, but you are not sure about NEON support, you can pass the vfpv3-d32 flag to the NDK. I have not tried this.
3. If you enable NEON on the ARMv7-A ABI (thus excluding chips like Tegra 2), you automatically also get the 32-register version of VFP. I experimented with this, then increased the size of the inner loop to 16 FMACs and 8 loads, and it compiled with no spills. Performance went from 1170 MFlops to 1270 MFlops on the Snapdragon S3 device I tested on.

Apart from the VFP testing, I think the order in which I am iterating over the tiles is not necessarily ideal. I can experiment with that sometime. |
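One way to see why the larger inner loop needs the 32-register VFP is to count registers for an outer-product register block. This is a hedged sketch: it assumes the kernel is structured as an mr x nr block of accumulators updated from mr values of A and nr values of B per step, which the posts do not confirm. The helper name `register_block_cost` is hypothetical.

```python
def register_block_cost(mr: int, nr: int) -> dict:
    """Per-step cost of an mr x nr outer-product register block.

    ASSUMPTION: one double register per accumulator and per loaded
    operand, with no register reuse tricks.
    """
    fmacs = mr * nr           # one multiply-add per accumulator
    loads = mr + nr           # mr values of A plus nr values of B
    registers = fmacs + loads
    return {
        "fmacs": fmacs,
        "loads": loads,
        "registers": registers,
        "loads_per_fmac": loads / fmacs,
        "fits_d16": registers <= 16,
        "fits_d32": registers <= 32,
    }

for shape in [(2, 4), (4, 4)]:
    cost = register_block_cost(*shape)
    print(shape, cost)
```

Under this assumed structure, a 4 x 4 block gives the 16 FMACs and 8 loads mentioned above and needs more than 16 double registers, which would explain spills with the 16-register VFP and none with 32. The speedup reported in the post, from 1170 to 1270 MFlops, works out to roughly 8.5%.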
|
|
|
|
|
#44 |
|
Member
Join Date: Jan 2010
Posts: 119
|
Could you please add output of the results to the log (adb logcat), so that runs are easy to automate? Or is there a results file already?
|
|
|
|
|
|
#45 | |
|
Member
Join Date: Jan 2009
Posts: 229
|
Quote:
I wanted to keep this one simple so that people (especially people who review phones) don't feel scared. I hope the benchmark is picked up by some reviewers, as I would like to see more accurate reporting about Android devices.

(Oh, and by the way, check out my second benchmark, which measures memory bandwidth. It is not too consistent but gives a rough estimate of memory bandwidth. It is based on the STREAM benchmark. https://play.google.com/store/apps/d...vine.rgbenchbw ) |
|
|
|
|
|
|
#46 | |
|
Member
Join Date: Dec 2007
Posts: 423
|
Quote:
__________________
Speaking for myself. |
|
|
|
|
|
|
#47 |
|
Senior Member
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,631
|
|
|
|
|