Old 24-Sep-2012, 17:08   #26
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Thanks everyone! The app is now published to Google Play:

https://play.google.com/store/apps/d...divine.rgbench

The results might differ slightly (but not by much) even from v1.2, as I found and fixed one bug.
Old 24-Sep-2012, 17:25   #27
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

I apologize to everyone who tested the earlier versions, since I had bugs in the code.

However, I think everything is good now. Please use the Play version from now on. I have removed the APKs from my site.

If I ever meet anyone who spent time on the previous versions, I will buy you a beer to compensate for your time.
Old 25-Sep-2012, 04:25   #28
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Some results reported by users:

Nexus 7 (Tegra 3): 1488 MFlops
Galaxy S2X (Snapdragon S3 dual-core): 1175 MFlops
Galaxy S2 (Exynos 4 dual-core): 998 MFlops
Acer A100 (Tegra 2): 714 MFlops
Old 25-Sep-2012, 09:33   #29
Rys
Tiled
 
Join Date: Oct 2003
Location: Kings Langley, UK
Posts: 2,675
Default

So how close to peak performance is the benchmark achieving? I'm more ignorant about ARM CPU performance than maybe I ought to be. Someone clue me in
__________________
A major redesign of the core ALU pineapple boomerang fortress.
Old 25-Sep-2012, 12:43   #30
Rootax
Member
 
Join Date: Jan 2006
Location: France
Posts: 200
Default

On my good old SGS1 (i9000) under JB (cm10 20/09/2012) and Mackay Kernel v0.66 :

1thread : ... 68 Mflops
__________________
- I'm french. Sorry if you don't understand what i say -
Old 25-Sep-2012, 13:15   #31
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Quote:
Originally Posted by Rys View Post
So how close to peak performance is the benchmark achieving? I'm more ignorant about ARM CPU performance than maybe I ought to be. Someone clue me in
If I am interpreting Laurent06 correctly, Cortex A9 should do 1 fp64 flop/cycle. Based on that, I would say around 40% of peak on some Cortex A9 systems. I don't know for sure about Snapdragons since I didn't find any relevant Qualcomm documentation.

edit: Much lower efficiency on Tegra 3 though.
Old 25-Sep-2012, 13:42   #32
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Quote:
Originally Posted by Rootax View Post
On my good old SGS1 (i9000) under JB (cm10 20/09/2012) and Mackay Kernel v0.66 :

1thread : ... 68 Mflops
If I am not wrong, the Cortex A8 VFP is not pipelined. Hence the result.
Old 25-Sep-2012, 15:34   #33
Rootax
Member
 
Join Date: Jan 2006
Location: France
Posts: 200
Default

Quote:
Originally Posted by codedivine View Post
If I am not wrong, the Cortex A8 VFP is not pipelined. Hence the result.
Well, my phone runs JB well, and that's what matters to me.

Nice new bench btw.
Old 25-Sep-2012, 15:39   #34
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Copying my comment from TR forums:

I think, after looking at the multithreaded numbers so far, we can conclude the following: Cortex A9 implementations (Exynos and OMAP) are achieving about 0.4 flops/cycle in multithreaded mode. Snapdragon S3 is also doing 0.4 flops/cycle.

However, Tegra 3 appears to be the exception to the rule and is only achieving about 0.32 flops/cycle on average. Tegra 2 is also stuck at about 0.36 flops/cycle. I wonder if it has something to do with Nvidia's memory controller.

About the single-threaded results: the single-threaded and multithreaded versions actually work on different problem sizes. The single-threaded one works on smaller matrices. This was to ensure that the single-threaded case does not take very long to run, but I am now starting to think that was a poor design decision on my part. So results from the single-threaded and multithreaded modes are not directly comparable. I think I will provide settings to choose the matrix sizes yourself sometime.
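For readers who want to reproduce the flops/cycle figures in this post, here is a minimal sketch of the arithmetic, reading "flops/cycle" as a per-core number. The MFlops values are the user-reported results from earlier in the thread. The core counts and clock speeds are assumed nominal specs and were not measured; only the Nexus 7 figure (1.2 GHz quad-core) is stated in this thread.

```python
# User-reported multithreaded results from this thread: (MFlops, cores, MHz).
# Core counts and clocks are ASSUMED nominal specs, not measurements.
RESULTS = {
    "Nexus 7 (Tegra 3)": (1488, 4, 1200),
    "Galaxy S2X (Snapdragon S3)": (1175, 2, 1500),
    "Galaxy S2 (Exynos 4)": (998, 2, 1200),
    "Acer A100 (Tegra 2)": (714, 2, 1000),
}


def flops_per_cycle_per_core(mflops, cores, mhz):
    """MFlops divided by (cores * MHz) gives flops per cycle per core."""
    return mflops / (cores * mhz)


if __name__ == "__main__":
    for device, (mflops, cores, mhz) in RESULTS.items():
        value = flops_per_cycle_per_core(mflops, cores, mhz)
        print(f"{device}: {value:.2f} flops/cycle per core")
```

If a device throttles or runs below its nominal clock during the test, the real flops/cycle is higher than this estimate, so the numbers are only as good as the assumed frequency.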
Old 25-Sep-2012, 15:50   #35
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Quote:
Originally Posted by Rootax View Post
Well, my phone runs JB well, and that's what matters to me.

Nice new bench btw.
Thanks
Old 25-Sep-2012, 17:45   #36
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,631
Default

Quote:
Originally Posted by Rys View Post
So how close to peak performance is the benchmark achieving? I'm more ignorant about ARM CPU performance than maybe I ought to be. Someone clue me in
Cortex-A9 can apparently issue all FP64 operations once every other cycle. Traditional matrix multiplication algorithms for an NxN matrix require O(N^3) FLOPs, where (N^3 - N^2) of those are FMADDs and N^2 of them are FADDs. Both can be issued every other cycle on Cortex-A9 (according to Laurent, the TRM must be wrong about FP64 FADDs issuing every cycle), but FMACs (they're 3-op/destructive) have a really long latency so they're hard to fill, especially since I don't think there's a special forwarding path between dependent FMACs like there are for integer MACs.

An important question is if loads and stores can be issued on the second issue cycle of the FP64 operation. My guess is not. The typical text book tiling algorithm for traditional matrix multiplication (like this http://en.wikipedia.org/wiki/Loop_tiling) will do an RMW on every element. It's good for cache locality but poor for register locality. On a CPU with as much load and store as FLOP co-issuing capability this may not be a problem, but on many (most?) platforms this won't be the case. Cortex-A9 would definitely benefit from anything that can improve the load/store to FLOP ratio. There just aren't really enough registers to hide a lot.

Not really sure what'd be the best kernel for something like this.. someone else here probably already has experience with something like that..
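To illustrate the register locality point above, here is a rough sketch in Python (not the benchmark's actual native code). Both functions use the same loop tiling. The first reads, updates and writes back a C element for every FMAC, as the textbook version does. The second keeps the running sum in a local variable, standing in for a register, for the whole k-tile. The function names and the access counter are made up for illustration.

```python
def matmul_tiled_rmw(a, b, n, tile):
    """Tiled multiply that reads, updates and writes c[i][j] for every FMAC."""
    c = [[0.0] * n for _ in range(n)]
    c_accesses = 0  # loads plus stores of C elements
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] = c[i][j] + a[i][k] * b[k][j]
                            c_accesses += 2  # one load and one store per FMAC
    return c, c_accesses


def matmul_tiled_acc(a, b, n, tile):
    """Same tiling, but the sum stays in a local for the whole k-tile."""
    c = [[0.0] * n for _ in range(n)]
    c_accesses = 0
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = c[i][j]  # single load per k-tile
                        for k in range(kk, min(kk + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc  # single store per k-tile
                        c_accesses += 2
    return c, c_accesses
```

With tile size T, the second form touches C about T times less often for the same number of FMACs, which is the kind of load/store to FLOP improvement being discussed.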
Old 25-Sep-2012, 18:32   #37
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Quote:
Originally Posted by Exophase View Post
Cortex-A9 can apparently issue all FP64 operations once every other cycle. Traditional matrix multiplication algorithms for an NxN matrix require O(N^3) FLOPs, where (N^3 - N^2) of those are FMADDs and N^2 of them are FADDs. Both can be issued every other cycle on Cortex-A9 (according to Laurent, the TRM must be wrong about FP64 FADDs issuing every cycle), but FMACs (they're 3-op/destructive) have a really long latency so they're hard to fill, especially since I don't think there's a special forwarding path between dependent FMACs like there are for integer MACs.

An important question is if loads and stores can be issued on the second issue cycle of the FP64 operation. My guess is not. The typical text book tiling algorithm for traditional matrix multiplication (like this http://en.wikipedia.org/wiki/Loop_tiling) will do an RMW on every element. It's good for cache locality but poor for register locality. On a CPU with as much load and store as FLOP co-issuing capability this may not be a problem, but on many (most?) platforms this won't be the case. Cortex-A9 would definitely benefit from anything that can improve the load/store to FLOP ratio. There just aren't really enough registers to hide a lot.

Not really sure what'd be the best kernel for something like this.. someone else here probably already has experience with something like that..
My kernel does 6 loads for 8 FMACs
Old 25-Sep-2012, 18:56   #38
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

I wrote a blog post with some preliminary analysis, including the GCC-generated assembly of the innermost loop (to keep Exophase happy):
http://codedivine.org/2012/09/25/pre...sis-rgbenchmm/
Old 25-Sep-2012, 19:57   #39
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,631
Default

Thanks, I just read it

I wonder what kind of results you get just running the inner loop on the same data over and over again, to try to isolate the effects of cache misses (and to a much lesser extent, branch mispredicts). If cache misses really are showing up as a large percentage some prefetch instructions would help. If the inner loop cost w/o memory stalls is close to as high as the number you reported then I'm pretty surprised. I really expect there to be just 16 cycles for the FMACs, 6 cycles for the loads, and - if Cortex-A9 dispatches VFP/NEON instructions anything like A8 does - zero cycles for everything else since it'd be running in parallel.
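The cycle estimate in the paragraph above can be turned into a rough upper bound on throughput. This is only a sketch of that arithmetic: the 8 FMACs and 6 loads per iteration, the two-cycle FP64 issue rate and the 0.4 flops/cycle observation all come from this thread, while one cycle per load and two flops per FMAC are assumptions made here.

```python
FMACS_PER_ITER = 8    # from the thread: 8 FMACs per inner-loop iteration
LOADS_PER_ITER = 6    # from the thread: 6 loads per iteration
CYCLES_PER_FMAC = 2   # from the thread: FP64 ops issue every other cycle
CYCLES_PER_LOAD = 1   # ASSUMPTION implied by "6 cycles for the loads"
FLOPS_PER_FMAC = 2    # ASSUMPTION: one multiply plus one add


def ideal_cycles_per_iter():
    """Cycles per iteration if FMACs and loads never overlap or stall."""
    return FMACS_PER_ITER * CYCLES_PER_FMAC + LOADS_PER_ITER * CYCLES_PER_LOAD


def ideal_flops_per_cycle():
    """Throughput bound that follows from the ideal cycle count."""
    return FMACS_PER_ITER * FLOPS_PER_FMAC / ideal_cycles_per_iter()


def cycles_per_iter_at(flops_per_cycle):
    """Cycles per iteration implied by an observed flops/cycle figure."""
    return FMACS_PER_ITER * FLOPS_PER_FMAC / flops_per_cycle


if __name__ == "__main__":
    print("ideal cycles per iteration:", ideal_cycles_per_iter())
    print("ideal flops/cycle: %.2f" % ideal_flops_per_cycle())
    print("cycles per iteration at 0.4 flops/cycle: %.0f" % cycles_per_iter_at(0.4))
```

Under these assumptions the ideal is 22 cycles per iteration, while 0.4 flops/cycle corresponds to roughly 40, so the gap in question is on the order of 18 cycles per iteration.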

Are you sure that the Tegra 3 is running at the clock speed you think it is, throughout the duration of the test?

On closer look, I think you might be suffering stall cycles due to a WAW hazard with the second vld (to d6) hitting the most recent vmacd (accessing d6). They'd appear almost back to back in the VFP pipeline, depending on how loads work. I wouldn't expect it to push you anywhere close to what you're seeing though, not by itself.

Are you compiling with VFPv3-d16? Is Tegra 2 using that? If so, I'd be curious to see a version compiled with full VFPv3 and a bigger kernel that can make use of more registers.

Last edited by Exophase; 25-Sep-2012 at 20:15.
Old 25-Sep-2012, 20:41   #40
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Quote:
Originally Posted by Exophase View Post
Thanks, I just read it

I wonder what kind of results you get just running the inner loop on the same data over and over again, to try to isolate the effects of cache misses (and to a much lesser extent, branch mispredicts). If cache misses really are showing up as a large percentage some prefetch instructions would help. If the inner loop cost w/o memory stalls is close to as high as the number you reported then I'm pretty surprised. I really expect there to be just 16 cycles for the FMACs, 6 cycles for the loads, and - if Cortex-A9 dispatches VFP/NEON instructions anything like A8 does - zero cycles for everything else since it'd be running in parallel.
Good suggestion. I will try that, though I might not report back for 1-2 weeks due to work.

Quote:
Are you sure that the Tegra 3 is running at the clock speed you think it is, throughout the duration of the test?
Oh. No, I have not verified that. Those results are what a user reported using a Nexus 7. I don't have a Tegra 3 device, so I cannot say for sure. I just looked up the specs of the T30L in the Nexus 7 (1.2GHz quad-core) and assumed that frequency in my calculation of cycles taken to execute. How does one verify that on Android?
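One possible way to check, sketched here as an assumption and not something anyone in the thread confirmed: Linux kernels with cpufreq support expose the current clock of each core, in kHz, in a sysfs file named scaling_cur_freq. Android kernels usually include it, but vendors may differ, and offline cores may have no readable file. The helper names below are made up for illustration.

```python
from pathlib import Path


def parse_khz(text):
    """cpufreq sysfs files hold a single integer frequency in kHz."""
    return int(text.strip())


def read_cur_freq_mhz(cpu=0, root="/sys/devices/system/cpu"):
    """Return the current clock of one core in MHz, or None if unavailable."""
    path = Path(root) / f"cpu{cpu}" / "cpufreq" / "scaling_cur_freq"
    try:
        return parse_khz(path.read_text()) / 1000.0
    except (OSError, ValueError):
        # Core offline, cpufreq not exposed, or permission denied.
        return None


if __name__ == "__main__":
    for cpu in range(4):
        print(f"cpu{cpu}: {read_cur_freq_mhz(cpu)} MHz")
```

Sampling this repeatedly while the benchmark runs would show whether the clock holds steady. On a device without Python, the same file can be read with adb shell cat.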

Quote:
On closer look, I think you might be suffering stall cycles due to a WAW hazard with the second vld (to d6) hitting the most recent vmacd (accessing d6). They'd appear almost back to back in the VFP pipeline, depending on how loads work. I wouldn't expect it to push you anywhere close to what you're seeing though, not by itself.

Are you compiling with VFPv3-d16? Is Tegra 2 using that? If so, I'd be curious to see a version compiled with full VFPv3 and a bigger kernel that can make use of more registers.
I used the default GCC flags for the armv7a target from the Android NDK r8b with GCC 4.6. These are the relevant flags that the NDK was using: -march=armv7-a -mfloat-abi=softfp -mfpu=vfp

I guess I should look at using the -mfpu=vfpv3-d16 flag instead? I don't think I will be changing the Play Store kernel code now, as people seem happy enough with it, but I will do a build and post a link to an APK here soon for analysis purposes.

Thanks for your detailed remarks. This is why I love B3D (and also techreport forums). I got severely burnt by trolling at RWT forums (though partly because of my idiocy and miscommunication)
Old 25-Sep-2012, 21:44   #41
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,631
Default

You should be fine continuing to NOT use that flag, I just wanted to make sure you weren't. This should also mean you're free to use a bigger kernel without worrying about running out of registers just yet. That should give you a better load/FLOP ratio as well as help avoid that stall. Could also help with prefetching.

Don't feel too bad about RWT, I get burned there too
Old 25-Sep-2012, 22:52   #42
willardjuice
super willyjuice
 
Join Date: May 2005
Location: Astoria, NY
Posts: 998
Default

Quote:
Don't feel too bad about RWT, I get burned there too
You guys are doing better than me; I can't even figure out how to read David's forum!

Good work codedivine, removing the JIT variable is essential for benchmarking. Definitely worth a 5 star rating (and not because you're from B3D!).
Old 26-Sep-2012, 04:22   #43
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Thanks willardjuice

I was not clear about VFP versions earlier, so here is some clarification:

1. Using the default NDK switches for the ARMv7-A ABI, you get the 16-register version of VFP.
I was using this until now. I tried using a larger inner loop and got register spills and bad performance.

2. If you are sure that your hardware supports 32-register VFP, but you are not sure about NEON support, you can pass the vfpv3-d32 flag to the NDK. I have not tried this.

3. If you enable NEON on the ARMv7-A ABI (thus excluding chips like Tegra 2), you automatically also get the 32-register version of VFP. I experimented with this, then increased the size of the inner loop to 16 FMACs and 8 loads, and it compiled with no spills. Performance went up from 1170 MFlops to 1270 MFlops on a Snapdragon S3 device I tested on.

Apart from the VFP testing, I think the order in which I am iterating over the tiles is not necessarily ideal. I can experiment with that sometime.
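The load and FMAC counts quoted in this thread (6 loads for 8 FMACs, then 8 loads for 16 FMACs) are consistent with a register-blocked update of a small rows x cols block of C. The block shapes below (2x4 and 4x4) are an assumption chosen to reproduce those counts; the actual kernel layout is not given in the thread. The register estimate is a naive one that keeps every accumulator and every loaded operand in its own double-precision register.

```python
def block_stats(rows, cols):
    """Per k-step cost of a rows x cols register-blocked update of C."""
    loads = rows + cols               # one slice of A plus one slice of B
    fmacs = rows * cols               # every A value meets every B value
    regs = rows * cols + rows + cols  # accumulators plus loaded operands
    return loads, fmacs, regs


if __name__ == "__main__":
    for rows, cols in [(2, 4), (4, 4)]:
        loads, fmacs, regs = block_stats(rows, cols)
        print(f"{rows}x{cols}: {loads} loads, {fmacs} FMACs, "
              f"{loads / fmacs:.2f} loads per FMAC, about {regs} registers")
```

Under this naive count the 2x4 block needs 14 registers, which fits the 16-register VFP, while the 4x4 block needs 24, which would only fit the 32-register version. That matches the spills seen with the larger loop on the default build, although a compiler can reuse operand registers, so the real requirement may be lower.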
Old 26-Sep-2012, 11:31   #44
OlegSH
Member
 
Join Date: Jan 2010
Posts: 119
Default

Could you please add output of the results to the log (adb logcat), so that it is easy to automate? Or is there a results file already?
Old 26-Sep-2012, 11:40   #45
codedivine
Member
 
Join Date: Jan 2009
Posts: 229
Default

Quote:
Originally Posted by OlegSH View Post
Could you please add output of the results to the log (adb logcat), so that it is easy to automate? Or is there a results file already?
Hi. I am working on a "Pro" version of the app which will allow more settings, provide more detailed data, etc. It will be a little while, though, as I have school work to do too.

I wanted to keep this one simple so that people (especially people who review phones) don't feel scared.

I hope the benchmark is picked up by some reviewers, as I would like to see more accurate reporting about Android devices.

(Oh, and by the way, check out my second benchmark, which measures memory bandwidth. It is not too consistent but gives a rough estimate. It is based on the STREAM benchmark: https://play.google.com/store/apps/d...vine.rgbenchbw )
Old 12-Oct-2012, 10:34   #46
Laurent06
Member
 
Join Date: Dec 2007
Posts: 423
Default

Quote:
Originally Posted by Exophase View Post
If so ARM is lying on their TRMs again :/

http://infocenter.arm.com/help/topic...h02s03s02.html
I edited my message, the TRM is in fact correct.
__________________
Speaking for myself.
Old 13-Oct-2012, 18:32   #47
Exophase
Senior Member
 
Join Date: Mar 2010
Location: Cleveland, OH
Posts: 1,631
Default

Quote:
Originally Posted by Laurent06 View Post
I edited my message, the TRM is in fact correct.
Thanks for the clarification.
