Old 17-Mar-2014, 13:12   #1
DSC
Naughty Boy!
 
Join Date: Jul 2003
Posts: 689
Khronos Releases OpenGL ES 3.1 Specification

https://www.khronos.org/news/press/k...-specification

http://www.anandtech.com/show/7867/k...s-opengl-es-31

Quote:
March 17, 2014 – San Francisco, Game Developer’s Conference – The Khronos™ Group today announced the immediate release of the OpenGL® ES 3.1 specification, bringing significant functionality enhancements to the industry-leading, royalty-free 3D graphics API that is used on nearly all of the world’s mobile devices. OpenGL ES 3.1 provides access to state-of-the-art graphics processing unit (GPU) functionality with portability across diverse mobile and embedded operating systems and platforms. The full specification and reference materials are available for immediate download at http://www.khronos.org/registry/gles/.
DSC is offline   Reply With Quote
Old 17-Mar-2014, 15:37   #2
codedivine
Member
 
Join Date: Jan 2009
Posts: 270
Default

Wow, that was quick. Finally, compute shaders on mobile. I hope it is quickly rolled out on iOS, Android, BB10, etc.
codedivine is offline   Reply With Quote
Old 17-Mar-2014, 18:39   #3
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Yeah! Everything I need

(and not some crap like tessellation or geometry shaders)

---

Compute shaders: check
Indirect dispatch: check
Indirect draw: check
Atomics: check
Atomic counter buffers: check
Barriers: check
Reinterpret cast float<->int: check
Buffers (UAVs): check
Image store/load (RW texture UAVs): check
Packing / unpacking instructions (for small types) + float exp/mantissa generation: check
Gather: check
Thread block shared memory (LDS): check!!!!1

This API is basically as good as DirectX 11. Everything important is there. You could run a next gen console engine on top of this.
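The indirect bits in the list above are just small fixed-layout structs living in a GPU buffer. A CPU-side sketch (field meanings per the GL/GLES specs; Python only for illustration):

```python
import struct

def draw_arrays_indirect_cmd(count, instance_count, first):
    # {count, instanceCount, first, reserved} -- in ES 3.1 the fourth field
    # is reserved and must be zero (desktop GL 4.x uses it as baseInstance).
    return struct.pack("<4I", count, instance_count, first, 0)

def dispatch_indirect_cmd(groups_x, groups_y, groups_z):
    # {num_groups_x, num_groups_y, num_groups_z} for DispatchComputeIndirect.
    return struct.pack("<3I", groups_x, groups_y, groups_z)

print(len(draw_arrays_indirect_cmd(6, 1024, 0)))  # 16 bytes
print(len(dispatch_indirect_cmd(64, 1, 1)))       # 12 bytes
```

Because a compute shader can write these structs itself, the GPU can decide what and how much to draw without a CPU round trip.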

EDIT: Hah, the forum said I had too many images (smilies) in my post. Good news is always worth celebrating.

Last edited by sebbbi; 17-Mar-2014 at 19:15.
sebbbi is offline   Reply With Quote
Old 17-Mar-2014, 19:16   #4
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Link to the specification:
http://www.khronos.org/opengles/sdk/docs/man31/
sebbbi is offline   Reply With Quote
Old 17-Mar-2014, 20:10   #5
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,934
Default

Quote:
Originally Posted by sebbbi View Post
Yeah! Everything I need
(and not some crap like tessellation or geometry shaders)
Agreed. Can you see anything missing from DX11 you'd like? Or anything from OGL 4.4 you'd rather see exposed before geometry shaders and tessellation?

Quote:
Thread block shared memory (LDS): check!!!!1
I'm very curious how that will work out in practice for game developers. Unlike on the desktop, the way shared memory is implemented varies enormously from architecture to architecture (you could argue it's either flawed or too different on some architectures).

What kind of LDS usage and access patterns do you think matter in practice? And how much would a slow LDS implementation hurt performance of the kind of renderer you're thinking of?
__________________
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Arun is offline   Reply With Quote
Old 18-Mar-2014, 08:10   #6
ToTTenTranz
Senior Member
 
Join Date: Jul 2008
Posts: 3,531
Default

What's wrong with tessellation on mobile?

Given an identical end result, isn't a lower polygon count + tessellation cheaper/more power-efficient than a higher polygon count?
ToTTenTranz is offline   Reply With Quote
Old 18-Mar-2014, 08:21   #7
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Quote:
Originally Posted by ToTTenTranz View Post
What's wrong with tessellation on mobile?

Given an identical end result, isn't a lower polygon count + tessellation cheaper/more power-efficient than a higher polygon count?
In theory, yes. But in practice, making tessellation work properly with LODs and complex content pipelines (including third-party 3D modeling/animation tools) is not that straightforward, especially if your intention is to reduce the polygon count (optimize rendering). Artists used to polygon modelling also need to learn a new (quite different) way to model things.

Tessellation cannot efficiently replace per-pixel displacement mapping techniques (such as parallax occlusion mapping or QDM), because tessellating to single pixel triangles both kills quad efficiency and overloads the triangle/primitive setup engines. If you don't tessellate to single pixel triangles, you basically cannot use tessellation on rough surfaces, because the vertices will wobble over the small surface details (causing an unstable look).
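To put rough numbers on the quad-efficiency point (illustrative figures, not measurements): GPUs shade pixels in 2x2 quads, so a triangle covering a single pixel still pays for four shading lanes.

```python
def quad_efficiency(covered_pixels, quads_touched):
    # Fraction of shaded lanes doing useful work; each 2x2 quad is 4 lanes.
    return covered_pixels / (quads_touched * 4)

print(quad_efficiency(1, 1))    # a single-pixel triangle: 25% of lanes useful
print(quad_efficiency(64, 16))  # a perfectly quad-aligned 8x8 triangle: 100%
```

Real coverage sits between these extremes, but it shows why pixel-sized triangles are so expensive to shade.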
sebbbi is offline   Reply With Quote
Old 18-Mar-2014, 09:18   #8
ltcommander.data
Member
 
Join Date: Apr 2010
Posts: 578
Default

Are OES 3.1 compute shaders comparable to OGL 4.3 and DX11 compute shaders, or are there limitations? By not having tessellation there are die area and power savings from omitting the tessellator, but if compute shaders, vertex shaders, and pixel shaders are all full-featured and comparable to OGL 4.x/DX11, with unified shaders, are there really many savings from omitting geometry shaders?
ltcommander.data is offline   Reply With Quote
Old 18-Mar-2014, 09:43   #9
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Quote:
Originally Posted by Arun View Post
Agreed. Can you see anything missing from DX11 you'd like? Or anything from OGL 4.4 you'd rather see exposed before geometry shaders and tessellation?
Quick answer (in order of my preference):
- Good, well-defined and portable texture compression (raw data accessible from compute shaders = can do realtime GPU compression).
- Asynchronous compute (multiple concurrent compute queues in addition to the render queue) (CUDA, GCN*).
- Multi draw indirect (from OpenGL 4.3 / GCN*)
- Multi draw (read draw call count from a GPU buffer) (OpenGL 4.4) (https://www.opengl.org/registry/spec...parameters.txt)
- Ballot (from CUDA and GCN*). Return value in a 32/64 bit integer (one wave, each thread sets one bit).
- Sparse texture (PRT / hardware virtual texture) (from OpenGL 4.4 / DirectX 11.2)
- Bindless resources (from OpenGL 4.4 / Nvidia extensions / GCN*)

GCN* = see the AMD Sea Islands instruction set (here: http://developer.amd.com/wordpress/m...chitecture.pdf). This hardware is close to next gen consoles / the hardware used by the Mantle API.

Long answer:

Multi draw indirect is not necessarily required, since you can render the whole scene in a single draw call without it. All the required ingredients are included in ES 3.1: indirect draw, indirect dispatch, gl_VertexID, and unordered (UAV) buffer loads from the vertex shader.

My concern here is the performance of UAV (buffer) reads on mobile hardware. On modern PC hardware, storing vertex (and constant) data to UAVs in SoA layout is actually more efficient for the GPU than using vertex buffers: the shader compiler can reorder the calculation between the partial vertex stream reads and hide latency much better than with AoS style (big struct) vertices. So performance is actually better when storing vertex data in custom (UAV) buffers than in a fat vertex buffer. I am just hoping that mobile hardware will behave similarly. Modern PC hardware has flexible general-purpose L1 and L2 caches that work as well for UAVs as they do for constant buffers or vertex buffers.
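A toy illustration of the layout difference (names and data are made up): with SoA, each attribute stream is its own independent fetch, instead of one big interleaved struct per vertex.

```python
# SoA: one flat stream per attribute, indexed by the vertex id.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
normals   = [(0.0, 0.0, 1.0)] * 3
uvs       = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def fetch_vertex_soa(vertex_id):
    # Three independent loads; a shader compiler can interleave the math
    # depending on each stream instead of waiting on one big struct read.
    return positions[vertex_id], normals[vertex_id], uvs[vertex_id]

# AoS: the equivalent "fat vertex" layout, one interleaved record per vertex.
aos_vertices = list(zip(positions, normals, uvs))
print(fetch_vertex_soa(1) == aos_vertices[1])  # same data, different layout
```

Same bytes either way; the win is in how independently the partial reads can be scheduled.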

Nobody except the hardware engineers themselves knows yet how well PowerVR chips perform in compute shaders and (cache-friendly, but multiple-indirection) UAV buffer reads.

Hardware sparse texturing (PRT) and/or bindless resources are not that critical for us, because we have been using software virtual texturing (shader-based indirection) for multiple projects, and are perfectly happy with it. The most optimal virtual texture indirection code is just 4 (1d) ALU instructions. Custom anisotropic filtering is quite hacky, but trilinear is straightforward and fast (and definitely enough for a mobile game).
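For illustration, a CPU-side sketch of that indirection (the page-table contents are invented; the real shader does the same page lookup followed by a scale-and-bias):

```python
# Hypothetical page table: each entry maps a virtual page to a bias and
# scale inside the physical cache atlas.
PAGES_PER_SIDE = 8
PAGE_TABLE = {(0, 0): (0.25, 0.5, 0.125)}   # (bias_u, bias_v, scale)

def resolve_uv(u, v):
    # One indirection fetch, then a multiply-add per component -- roughly
    # the "4 (1d) ALU instructions" hot path described above.
    page = (int(u * PAGES_PER_SIDE), int(v * PAGES_PER_SIDE))
    bias_u, bias_v, scale = PAGE_TABLE[page]
    return u * scale + bias_u, v * scale + bias_v

print(resolve_uv(0.1, 0.05))  # physical atlas UV inside the cached page
```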

The thing I am most concerned about is the state of texture compression in OpenGL in general. Our virtual texturing relies heavily on real-time DXT texture compression. We write directly, in a compute shader, on top of a DXT5-compressed VT atlas (aliased to a 32-32-32-32 integer target). Modern GPUs actually do optimized DXT5 compression (simple endpoint selection) faster than they copy uncompressed 8888 data (DXT5 texture compression is also BW bound, but obviously the write BW is only 25% of the uncompressed case). Even if real-time texture compression were slightly slower on mobile devices than copying data to the VT cache atlas, it wouldn't matter much, since the amortized cost is so small. In the average case each generated texture page is sampled 200+ times (60 frames per second, 4 seconds = 240 frames) before it goes out of the screen. Texture compression saves 75% of the bandwidth cost of these 200+ sampling operations, and thus would save a huge amount of battery life on mobile devices (and also boost performance on BW-limited mobile devices). We need to do real-time compression to virtual texture pages because we blend decals on top of the texture data (this saves a huge amount of rendering cost in scenes that have lots of decals, and decals are needed to get lots of texture variety into scenes).
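The arithmetic behind those percentages, with an assumed 128x128 page size (the format ratios themselves are fixed: RGBA8 is 4 bytes/texel, DXT5 is 16 bytes per 4x4 block = 1 byte/texel):

```python
PAGE_TEXELS = 128 * 128            # assumed virtual-texture page size
RGBA8_BYTES = PAGE_TEXELS * 4      # uncompressed 8888 page
DXT5_BYTES = PAGE_TEXELS * 1       # DXT5/BC3: one quarter the footprint
SAMPLED_FRAMES = 240               # 60 fps * 4 s, as in the post

write_ratio = DXT5_BYTES / RGBA8_BYTES                  # write BW vs copy
sampling_saved = SAMPLED_FRAMES * (RGBA8_BYTES - DXT5_BYTES)
print(write_ratio)       # 0.25: compressed write BW is 25% of uncompressed
print(sampling_saved)    # bytes of read bandwidth saved over the page's life
```

Even a modest per-page compression cost is repaid hundreds of times over by the cheaper sampling.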

I just hope we don't need to use uncompressed data on mobile devices while we can use proper texture compression on consoles and PCs. That would be very awkward.

Asynchronous compute is great. Our shaders doing rasterization (shadow map or textureless g-buffer rendering) are completely bound by fixed function units (such as triangle/primitive setup, ROP fill rate, attribute caches, etc). Executing ALU & BW heavy operations such as lighting and post processing simultaneously increases performance and GPU utilization dramatically (as the bottlenecks are different). We need this for mobile devices as well.

Ballot instruction (in CUDA and GCN*) is good for reducing LDS traffic (and instruction counts in general), because it allows you to do prefix sum calculation for a wave/warp using just a few instructions. Prefix sum is very important for many GPU algorithms. ES 3.1 has bitCount and bitFieldExtract instructions. All we need is a ballot instruction. Ballot = each thread inputs one boolean to the ballot instruction, and the ballot instruction returns the same packed (one bit per thread) 32/64 bit integer for all threads in the wave/warp.
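Emulated on the CPU (Python standing in for shader code), the ballot-based exclusive prefix sum looks like this: every lane contributes one bit, and a lane's result is the popcount of the ballot masked to the lanes below it.

```python
def ballot(flags):
    # Each lane sets one bit of the shared result, as the instruction does.
    bits = 0
    for lane, flag in enumerate(flags):
        if flag:
            bits |= 1 << lane
    return bits

def exclusive_prefix_sum(ballot_bits, lane):
    # popcount of the bits below this lane -- just bitCount plus a mask,
    # i.e. instructions ES 3.1 already provides.
    return bin(ballot_bits & ((1 << lane) - 1)).count("1")

flags = [True, False, True, True, False, True]
wave = ballot(flags)
print([exclusive_prefix_sum(wave, i) for i in range(len(flags))])
# [0, 1, 1, 2, 3, 3]
```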
Quote:
Originally Posted by Arun View Post
What kind of LDS usage and access patterns do you think matter in practice? And how much would a slow LDS implementation hurt performance of the kind of renderer you're thinking of?
If append buffers (atomic counter buffers) are as fast on mobile hardware as they are in GCN (almost equal speed to normal linear write), these can be used for many tasks requiring compacting data. This greatly reduces the need for fast LDS (in steps like occlusion culling and scene setup). However LDS is still needed for post processing, blur kernels being the most important use case. LDS saves lots of bandwidth and sampling cost in blur kernels. Modern lighting algorithms also load potentially visible lights to LDS (by screen region or hashed cluster identifier), and read the light data from LDS for each pixel in the same cluster (again saving bandwidth). Hopefully these use cases are fast enough, as compute shaders are much more efficient (saves battery life) compared to pixel shaders in these use cases (data is as close to the execution units as possible = much more energy efficient to read the data repeatedly).
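A rough way to see the blur-kernel savings (tile width and radius are example numbers of mine): count the external memory taps a 1D kernel needs with and without shared memory.

```python
def external_fetches_1d(tile_width, radius, use_lds):
    if use_lds:
        # The thread group cooperatively loads the tile plus its apron once;
        # every kernel tap afterwards is a shared-memory read.
        return tile_width + 2 * radius
    # Without LDS, every thread taps texture memory for each kernel sample.
    return tile_width * (2 * radius + 1)

print(external_fetches_1d(64, 8, use_lds=False))  # 1088 memory taps
print(external_fetches_1d(64, 8, use_lds=True))   # 80 memory taps
```

Caches hide some of the naive version's cost, but the LDS version moves far fewer bytes, which is what matters for power.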
sebbbi is offline   Reply With Quote
Old 18-Mar-2014, 09:46   #10
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Quote:
Originally Posted by ltcommander.data View Post
Are OES 3.1 compute shaders comparable to OGL 4.3 and DX11 compute shaders, or are there limitations? By not having tessellation there are die area and power savings from omitting the tessellator, but if compute shaders, vertex shaders, and pixel shaders are all full-featured and comparable to OGL 4.x/DX11, with unified shaders, are there really many savings from omitting geometry shaders?
According to the OpenGL ES 3.1 reference (http://www.khronos.org/opengles/sdk/docs/man31/), the compute shaders are fully featured. I didn't find any DirectX 11 compute shader (or indirect draw) feature that I miss. The only thing not mentioned was the defined value for the maximum LDS size... Hopefully no implementation returns zero for this.
sebbbi is offline   Reply With Quote
Old 18-Mar-2014, 14:41   #11
silent_guy
Senior Member
 
Join Date: Mar 2006
Posts: 2,322
Default

That's an awesome post, sebbi!
silent_guy is offline   Reply With Quote
Old 18-Mar-2014, 15:12   #12
codedivine
Member
 
Join Date: Jan 2009
Posts: 270
Default

Quote:
Originally Posted by sebbbi
- Asynchronous compute (multiple concurrent compute queues in addition to render queue). (CUDA, GCN*).
Just expanding on sebbi's point here.

These multiple async compute queues are not really exposed in DX and GL right now. DX and GL seem to serialize all kernels (graphics and compute) into the same queue, afaik. It would be nice to have this support.

GCN (in the Tahiti generation) allows up to 2 ACEs, and newer GCN parts such as Bonaire and Kaveri allow up to 8. These are exposed in OpenCL, and I did find nice performance improvements in some apps by using multiple queues. From Mantle's public presentations so far, it looks like Mantle supports multiple async compute queues on GCN as well.

On the Nvidia side, some support has been present since Fermi, though it had weird restrictions, and Kepler improves things a lot. This support has only been exposed in CUDA so far, with Nvidia's lackluster OpenCL driver not exposing it, afaik.

Anyway, would be great if these were available in DX or GL.
codedivine is offline   Reply With Quote
Old 18-Mar-2014, 15:53   #13
Lazy8s
Senior Member
 
Join Date: Oct 2002
Posts: 3,036
Default

As long as Apple and Google adopt this update to their API, GPGPU can finally move forward a little more on mobile.
Lazy8s is online now   Reply With Quote
Old 18-Mar-2014, 17:32   #14
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Quote:
Originally Posted by codedivine View Post
These multiple async compute queues are not really exposed in DX and GL right now. DX and GL seem to serialize all kernels (graphics and compute) into the same queue, afaik. It would be nice to have this support.
Yes, this is the biggest thing I would want to have in future desktop OpenGL and DirectX versions as well. It shouldn't be that hard, since CUDA and OpenCL (and, it seems, Mantle) already expose it. It should be a big GPU performance boost, even for GPUs that only have 2 ACEs (two simultaneous tasks should be enough to get most of the gains).

Last edited by sebbbi; 18-Mar-2014 at 17:37.
sebbbi is offline   Reply With Quote
Old 18-Mar-2014, 18:41   #15
Rurouni
Member
 
Join Date: Sep 2008
Posts: 356
Default

Do we need new mobile hardware, or is the current generation (Adreno 3XX / Rogue / Mali T6XX) enough (at least hardware-wise)?
Rurouni is offline   Reply With Quote
Old 18-Mar-2014, 20:32   #16
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Quote:
Originally Posted by Rurouni View Post
Do we need new mobile hardware, or is the current generation (Adreno 3XX / Rogue / Mali T6XX) enough (at least hardware-wise)?
Rogue and Tegra K1 at least support ES 3.1 (according to Anandtech).
sebbbi is offline   Reply With Quote
Old 18-Mar-2014, 23:15   #17
ams
Member
 
Join Date: Jul 2012
Posts: 884
Default

Quote:
Originally Posted by sebbbi View Post
Tessellation cannot efficiently replace per-pixel displacement mapping techniques (such as parallax occlusion mapping or QDM), because tessellating to single pixel triangles both kills quad efficiency and overloads the triangle/primitive setup engines. If you don't tessellate to single pixel triangles, you basically cannot use tessellation on rough surfaces, because the vertices will wobble over the small surface details (causing an unstable look).
Tessellation is not meant to replace displacement mapping, but rather to complement it, while replacing some existing bump mapping techniques that have their own set of drawbacks: http://www.nvidia.com/object/tessellation.html

"At its most basic, displacement mapping can be used as a drop-in replacement for existing bump mapping techniques. Current techniques such as normal mapping create the illusion of bumpy surfaces through better pixel shading. All these techniques work only in particular cases, and are only partially convincing when they do work. Take the case of parallax occlusion mapping, a very advanced form of bump mapping. Though it produces the illusion of overlapping geometry, it only works on flat surfaces and only in the interior of the object (see image above). True displacement mapping has none of these problems and produces accurate results from all viewing angles."

And while there are obviously tradeoffs even with programmable tessellation, NVIDIA believes that tessellation is a "key technology for efficient geometry", even in the ultra mobile space. They claim " > 50x Triangle Savings vs Brute Force ES2.0"

OpenGL ES 3.1 is clearly a great step forward for the ultra mobile space, but it may be challenging to quickly and easily bring console games to the ultra mobile space without full OpenGL 4.x support (including support for tessellation and geometry shaders). I can understand why OpenGL ES 3.x was designed as it was, but I don't consider it a great thing that some key features of OpenGL 4.x were omitted entirely when upcoming ultra mobile hardware from NVIDIA, Qualcomm, etc. will support said features.

Last edited by ams; 18-Mar-2014 at 23:38.
ams is offline   Reply With Quote
Old 18-Mar-2014, 23:39   #18
PixResearch
Member
 
Join Date: May 2010
Location: Kings Langley
Posts: 122
Default

Really nice step forwards

Quote:
it may be challenging to quickly and easily bring console games to the ultra mobile space without full OpenGL 4.x support
Why would it be difficult?

I'm not sure I've seen a game with tessellation where turning it off wasn't an option. Geometry shaders are a little more ingrained, but not hard to swap out. I'm all for tessellation, since when used well it is indeed a key technology for efficient geometry. However, let's be honest: the majority of applications of tessellation to date have excessively over-tessellated surfaces to make them look round. There aren't many titles where enabling tessellation actually decreases the geometry load.
PixResearch is offline   Reply With Quote
Old 18-Mar-2014, 23:55   #19
ams
Member
 
Join Date: Jul 2012
Posts: 884
Default

Here is what Tim Sweeney had to say:

most importantly the software runs full open GL on mobile hardware. To have the full graphics API that's available on PC and the highest in platforms available in the industry is just a breakthrough. It enables us to bring graphics up to the next level without any compromises. With full Open GL it knocks down the remaining major barrier between PC level graphics and mobile level graphics. From here onward, I think we're going to see the performance between mobile, PC and high-end console gaming continue to narrow to the point where the differences between the platforms really blur.
ams is offline   Reply With Quote
Old 19-Mar-2014, 00:20   #20
PixResearch
Member
 
Join Date: May 2010
Location: Kings Langley
Posts: 122
Default

I don't believe he's saying he would find it difficult to port desktop content onto an API without tessellation/GS in that quote... Maybe I'm missing something?

It would of course be easier not to change anything and run desktop OpenGL 4 directly, but I'm suggesting it isn't meaningfully harder to use this new OGLES either. Perhaps naively, I'd like to hope most developers know that simply copy-pasting a game unmodified from a 100W CPU + 200W GPU environment onto a mobile SoC using closer to 2W would be a pretty bad idea (assuming we don't want heatsinks, fans, and permanently connected power cables to start appearing in tablets).

Once you're profiling your apps for power consumption and bandwidth performance, the API isn't going to be too high on the priority list. Just because something can be done doesn't mean it should be.

(PS: not knocking Nvidia, btw; I'm excited to see what K1 can do. Part of me hopes they don't get unfairly penalised by inefficient ports off the back of the API support, as it looks like a decent architecture.)
PixResearch is offline   Reply With Quote
Old 19-Mar-2014, 01:22   #21
ams
Member
 
Join Date: Jul 2012
Posts: 884
Default

What he means is that the lack of API support for certain features fundamental to Unreal Engine 4 (such as geometry shaders and tessellation) creates both hurdles and compromises in bringing UE4 PC/console games to the ultra mobile space (even before accounting for any performance profiling that would be needed for ultra mobile hardware).
ams is offline   Reply With Quote
Old 19-Mar-2014, 09:54   #22
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Quote:
Originally Posted by ams View Post
OpenGL ES 3.1 is clearly a great step forward for the ultra mobile space, but it may be challenging to quickly and easily bring console games to the ultra mobile space without full OpenGL 4.x support (including support for tessellation and geometry shaders).
Geometry shaders sound like a much better idea than they actually are. It's very hard to find a use case for them, because there's almost always an alternative that is faster. I have measured a 2.8x performance drop (for primitive/triangle bound cases) from adding a pass-through geometry shader (that doesn't do anything) on a high end PC GPU. Unless you intend to use geometry shaders only for low polygon meshes, this rules out most use cases regarding mesh rendering.

Even the "best case" scenarios, such as particle rendering, are actually faster when you instead run a compute shader + vertex shader combination (or just a vertex shader using the modulo of the vertex id to expand the quads). Exotic use cases I have encountered, such as DOF (bokeh shape quads) rendering using geometry shaders, are also (much) slower with a geometry shader than with a compute shader.
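A sketch of that vertex-id expansion (the corner table is my choice of a two-triangle quad order): draw count*6 vertices with no vertex buffer, and derive the particle index and quad corner from the id alone.

```python
# Two triangles covering a unit quad; any consistent winding works.
CORNERS = [(0, 0), (1, 0), (1, 1), (0, 0), (1, 1), (0, 1)]

def expand_particle_vertex(vertex_id):
    # What a vertex shader would do with gl_VertexID: which particle this
    # vertex belongs to, and which quad corner to emit for it.
    return vertex_id // 6, CORNERS[vertex_id % 6]

print(expand_particle_vertex(0))   # first corner of particle 0
print(expand_particle_vertex(7))   # second corner of particle 1
```

The particle index then drives a (UAV) buffer fetch for position/size, with no geometry shader stage involved.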
Quote:
Originally Posted by ams View Post
They claim " > 50x Triangle Savings vs Brute Force ES2.0"
Marketing departments obviously like corner cases that provide big numbers. Reality is obviously different. I don't think many next gen games use tessellation yet, so the lack of tessellation doesn't affect porting to mobile devices much right now. In the future, games might use tessellation more, and that's when you might want to have it on mobiles as well (assuming the performance is good enough).
sebbbi is offline   Reply With Quote
Old 19-Mar-2014, 13:44   #23
JohnH
Member
 
Join Date: Mar 2002
Location: UK
Posts: 581
Default

Quote:
Originally Posted by sebbbi View Post
Marketing departments obviously like corner cases that provide big numbers. Reality is obviously different. I don't think many next gen games use tessellation yet, so the lack of tessellation doesn't affect porting to mobile devices much right now. In the future, games might use tessellation more, and that's when you might want to have it on mobiles as well (assuming the performance is good enough).
One of the biggest issues with DX11-style tessellation is that LOD is not continuously variable across a patch, i.e. any LOD difference between the edges of a patch is resolved at its edges. This means that in order to get better LOD management, surfaces need to be pre-broken down into smaller patches, which means it's entirely probable that in real use cases DX11-style tessellation actually increases the amount of input geometry instead of reducing it.
JohnH is online now   Reply With Quote
Old 19-Mar-2014, 14:58   #24
sebbbi
Senior Member
 
Join Date: Nov 2007
Posts: 1,388
Default

Quote:
Originally Posted by JohnH View Post
One of the biggest issues with DX11-style tessellation is that LOD is not continuously variable across a patch, i.e. any LOD difference between the edges of a patch is resolved at its edges. This means that in order to get better LOD management, surfaces need to be pre-broken down into smaller patches, which means it's entirely probable that in real use cases DX11-style tessellation actually increases the amount of input geometry instead of reducing it.
Agreed completely. That's one of the big design problems in DX11-style tessellation. Unfortunately, it's quite hard to change without a completely new design and new hardware.
sebbbi is offline   Reply With Quote
Old 19-Mar-2014, 15:28   #25
ams
Member
 
Join Date: Jul 2012
Posts: 884
Default

Quote:
Originally Posted by sebbbi View Post
I don't think many next gen games use tessellation yet, so the lack of tessellation doesn't affect porting to mobile devices much right now. In the future, games might use tessellation more, and that's when you might want to have it on mobiles as well (assuming the performance is good enough).
Yes, but that transition is happening as we speak. Now that AAA game developers are targeting next gen consoles rather than ancient prior gen consoles, features such as tessellation will be used in more and more new games (with varying degrees of tessellated detail, of course). Tessellation is a key feature in the latest Unreal Engine, CryEngine, etc. And the lack of full API parity with PCs and next gen consoles means that a separate mobile renderer would be needed for the ultra mobile space (on top of any additional performance profiling needed to optimize for said hardware).
ams is offline   Reply With Quote
