#1 | |
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
|
NOTE: This thread was split from the Microsoft Surface thread in the Handheld Forums.
Quote:
Since the resolution is high enough that you can't resolve the individual pixels? Your eye effectively integrates/resolves the high resolution image...
__________________
The content of this message is my personal opinion only. |
|
|
|
|
|
|
#2 | ||
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,764
|
Quote:
A couple of days ago I cleaned up a friend's PC, which was a mess, and put it through a few tests. The GT210 it carries has 4 TMUs and merely 8GB/s of bandwidth over a 64-bit bus. Even at the highest resolution, AF didn't cost more than a fraction of performance, and the loss wasn't even noticeable (a couple of fps), quite unlike any amount of multisampling (up to ~1/3rd the 1xAA performance with 4xAA enabled). Further to that, since we're in a Surface tablet thread: I haven't seen any benchmarks showing how much performance the ULP GF in Tegra3 loses with AF enabled, but it's at least capable of it. On the other hand, it's not capable of MSAA due to the lack of tiling. The two aren't related, but it's not as if Tegra3 as a SoC has any bandwidth to spare; rather the contrary. If AF should cost more performance on it than on even the lowest desktop GPU, it would more likely be due to the lack of TMUs. That thing shouldn't have more than 2 TMUs anyway. Yes, loops cost bandwidth, but that's an indirect issue and not the primary bottleneck. Quote:
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
||
|
|
|
|
|
#3 | |||
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
|
Quote:
MSAA is similarly dependent on the scene, but more so on geometric frequencies. For simple geometry, there's no reason it has to cost much either, since MSAA compression should handle the bandwidth usage. Quote:
Quote:
__________________
The content of this message is my personal opinion only. |
|||
|
|
|
|
|
#4 | ||||
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
|
Quote:
Quote:
Quote:
Quote:
I'm also not convinced framebuffer bandwidth is the real limitation in the long run. Certainly while we're still in the space of rendering with basic shaders, simple and extremely low resolution texturing and the like it's a win, but it's not clear that framebuffer bandwidth is a significant factor in desktop games for instance. So unless you believe that the mobile graphics world will evolve significantly differently (and so far it really has just mirrored the evolution of desktop graphics with a few omissions), I'm not sure we can necessarily pronounce IMR dead in power-constrained environments.
For my part I expect the graphics pipeline portion of rendering a frame to be increasingly small as we move forward, with more and more work being done in generic compute/software.
For my part I'm actually becoming less and less interested in pure tablets, or even tablets + keyboard "covers". After having played with a few "convertibles", I'm much more leaning towards a good ULV big core (17W or lower), with a detachable tablet portion + keyboard (ideally with more batteries, transformer style), and a nice digitizer. Mobile hardware just hasn't scaled up in performance as quickly as desktop hardware seems to be scaling down in power usage.
__________________
The content of this message is my personal opinion only. |
||||
|
|
|
|
|
#5 | |||||
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,764
|
Quote:
Quote:
Quote:
Quote:
Quote:
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
|||||
|
|
|
|
|
#6 | ||
|
Senior Member
|
Quote:
Quote:
A true TBDR paired with a renderer that plays to its strengths is another matter of course. Not sure how many iOS games render-to-TBDR, so to speak. |
||
|
|
|
|
|
#7 | |||
|
Senior Member
|
Quote:
I am curious about the API bits you mentioned that are inconvenient for a TBDR. I had no idea about this and I'd like to know more. Care to share? Quote:
When we get to ~10MB of cache on die, around 14 nm or 10 nm for sure, then we *could* have the entire framebuffer on die, or at least the z buffer. I think that could shift the paradigm quite a bit. Also, since UI is such an important job for mobile GPUs and the systems are so bandwidth constrained, that alone can be quite useful. Quote:
|
|||
|
|
|
|
|
#8 |
|
Regular
|
Except that some developers have moved to SOFTWARE tiling with deferred shading to avoid doing multiple geometry passes ... so not really.
__________________
Cinematic is the new streamlined. |
|
|
|
|
|
#9 | |
|
Senior Member
Join Date: Mar 2002
Posts: 3,779
|
IMRs do HSR as well through early Z. The only thing they may do extra is go through the tile's polys twice (Z-pass first), but that often has questionable value, and can be a loss in a well optimized game with object-level sorting.
Where tiled renderers have a bandwidth advantage is primarily with alpha rendering. They also have a small advantage with the efficiency of large block writes to the framebuffer. The penalty is the bandwidth cost of binning (vertices pass through the chip multiple times). Quote:
The stylus lets you run any desktop software comfortably, as it doesn't need a low density UI that also considers how the finger blocks the view of whatever is under it. |
|
|
|
|
|
|
#10 |
|
Regular
|
It would be so nice if hardware tilers were given the necessary information to do this as well ... all this time and the APIs are still brain dead.
__________________
Cinematic is the new streamlined. |
|
|
|
|
|
#11 | ||||||
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
|
Quote:
Quote:
Phone CPU/GPU performance is basically uninteresting to me beyond the threshold where it can run the basic OS, e-mail, etc. I have zero desire to ever play a game on a tiny phone screen that has barely enough space to see let alone interact. Samsung Series 7 Slate has been out for a while. It's awesome, and the stylus obviously works great with Photoshop. Surface Pro looks to be basically the same thing one generation newer... Quote:
Quote:
Quote:
Quote:
1) Typing without tactile feedback is pretty awful... so much so that I don't really care to have a keyboard at all unless it's a decent tactile one. The tactile surface one might suffice, but we'll see.
2) If you're going to have a keyboard, might as well fit some more battery in the enclosure for it.
3) Stylus is great. I don't use it exclusively or anything, but for anything more precise, or drawing or even writing to some extent, it's quite pleasant. Personally I use it for rough work, math, etc. in OneNote. OneNote can even convert my math scrawling to symbols (!!) and it works very well.
Thus I basically see no reason why I can't have it all... touch, keyboard, stylus, good performance and ability to run anything I want. The convertible aspect means that I can use just the tablet portion when it's more convenient to do that (on a bus, etc) but turn it into a laptop when I want to get some work done. After I've seen some of the convertible systems, they seem like a strict superset of what you get in other mobile devices. I won't claim that these are the primary concerns for everyone, but it is hard to argue that the "strict" tablets have any advantages over convertibles going forward other than perhaps price, which gets muddy if you still buy a laptop in addition...
__________________
The content of this message is my personal opinion only. |
||||||
|
|
|
|
|
#12 | ||
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,764
|
Quote:
Quote:
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
||
|
|
|
|
|
#13 | |
|
Senior Member
Join Date: Mar 2006
Posts: 1,686
|
Quote:
|
|
|
|
|
|
|
#14 | |
|
Senior Member
|
Quote:
Analytic AO needs 2 geometry passes anyway. Even if you are doing deferred shading in software, if the entire G buffer can be had in high speed memory, which seems possible with interposers, that would change sweet spots considerably. Even if you can't fit the full G buffer on die, if you can just fit the ID buffer, that is still a big win. ID buffer = what a hw TBDR generates to decide which tri/pixel combination to shade. Essentially Frame based Deferred Rendering in hw without any cost of binning.
__________________
The views presented here are my own and not my employer's. Last edited by rpg.314; 27-Oct-2012 at 02:30. |
|
|
|
|
|
|
#15 | |
|
Senior Member
|
Quote:
2) Not having to shade quads is a win.
3) And there's this: http://www.google.com/patents/US20110254852 It's a new IMG patent that describes how you can use a TBDR to save both pixel and texel fill rate with shadow mapping and the like. Basically, don't rasterize shadow maps immediately after binning is complete. Wait until the next render wants to look up the texels. Then rasterize just one tile of the shadow map opportunistically and immediately use it to shade the fragments. This way both the z testing and the subsequent texture filtering can be done out of on-chip buffers. Since the final render is going to have fairly large spatial coherence, it could save quite a few lookups. I don't think this technique can be copied by an IMR.
4) On a handheld we have much larger resolution and, since the screen size is small, the physical size of objects (in mm across the screen) is small. I am just thinking aloud here, but I think fewer tris would be needed to hide the curvature too. Which could tip the balance in a TBR's favor for this market. |
|
|
|
|
|
|
#16 | ||||
|
Senior Member
|
Quote:
Wouldn't using a UAV immediately after rendering to it stall an IMR as well? A TBR should do no worse than an IMR in such case. Quote:
Quote:
Quote:
|
||||
|
|
|
|
|
#17 | ||
|
Member
Join Date: Nov 2007
Posts: 938
|
Quote:
I personally try to avoid all techniques that require rendering geometry twice, because geometry transform/rasterization is the step that has by far the most fluctuating running time. Draw call count, vertex/triangle count, quad efficiency, overdraw/fillrate, etc. all change radically depending on the rendered scene. Screen pixel count is always the same (720p = 922k pixels). All algorithms you process just once for each pixel on the screen incur a constant cost. That's why I like deferred techniques (= processing after all geometry is rasterized). A constant, stable frame rate is the most important thing for games. Worst-case performance is what matters in algorithm selection; average performance is meaningless (unless it's guaranteed to amortize over the frame).
I am not a particular fan of LiDR and its descendants (including Forward+). Depth pre-pass doubles the most fluctuating part of the frame rendering (draw calls / geometry processing). It is also a waste of GPU resources. All the 2000+ programmable shader "cores" of modern GPUs are basically idling while the GPU crunches through all the scene draw calls and renders them to the Z-buffer (depth testing, filling, triangle setup, etc. fixed-function work). Memory bandwidth is also underutilized (just vertex fetching and only depth writes, no texture reads or color writes at all). For good GPU utilization you have to have a balanced load at every stage of your graphics rendering pipeline. Depth pre-pass isn't balanced at all.
Various displacement mapping techniques will be used more and more in future games, and these make the extra geometry pass even more expensive. DX11 supports vertex tessellation and conservative depth output. Tessellation will promote usage of vertex-based displacement mapping techniques, and conservative depth is very useful for pixel-based displacement mapping techniques (it allows early-z and hi-z to be enabled with pixel shader programmable depth output).
A side note: Programmable depth output and pixel discard aren't a good thing for TBDRs (making pixel shader based displacement quite inefficient). Vertex tessellation also adds some extra burden (how bad that is remains to be seen in the future).
Brute force deferred rendering with fat g-buffers isn't the best choice in the long run either. Basically all source textures are compressed (DXT variants, DX11 even adds an HDR format). A forward renderer simply reads each DXT texture once per pixel. A deferred renderer reads the compressed texture, outputs it to an uncompressed render target and later reads the uncompressed texture from the render target. DXT5 is 1 byte per pixel, while uncompressed (8888 or 11f-11f-10f) is 4 bytes per pixel. Forward reads 1 byte per texture layer used; deferred reads 5 bytes and writes 4 bytes (9x the bandwidth). This isn't a big problem yet, because most games don't have more than two textures per object (8 channels, for example, can fit: rgb color, xy normal, roughness, specular, opacity). But in the future materials will become more complex and g-buffers will become fatter (as we need to store all the texture data in the g-buffer for later stages).
I personally like to keep the geometry rendering pass as cheap as possible. Rendering to three or four render targets and reading three or four textures isn't cheap. Overdraw gets very expensive, and quad efficiency and texture cache efficiency play a big (unpredictable) role in the performance. It's better just to store the (interpolated) texture coordinates in the g-buffer. This way you get a very fast pixel shader (with no texture cache stalls), quad efficiency and/or overdraw doesn't matter much, full fill rate (no MRTs), low BW requirement, etc. All the heavy lifting is done later, once per pixel, in a compute shader. Compressed textures are read only once, and no uncompressed texture data is written to or read from the g-buffers.
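To make the byte counting above concrete, here is a minimal sketch of the per-pixel tally. It only uses the two figures quoted in the post (1 byte per pixel for DXT5, 4 bytes per pixel uncompressed); the function names are made up for illustration, and it assumes no overdraw and ignores caches, so treat it as back-of-the-envelope arithmetic rather than a measurement.

```python
# Per-pixel bytes for one texture layer, using the figures quoted in the post.
DXT5_BYTES = 1          # compressed source texture, 1 byte per pixel
UNCOMPRESSED_BYTES = 4  # 8888 or 11f-11f-10f render target, 4 bytes per pixel


def forward_traffic(layers: int) -> int:
    """Forward rendering: each compressed texture layer is read once per pixel."""
    return layers * DXT5_BYTES


def fat_gbuffer_traffic(layers: int) -> dict:
    """Fat g-buffer deferred rendering (hypothetical helper).

    Each layer is read compressed, written uncompressed to the g-buffer,
    and read back uncompressed in the lighting pass.
    """
    reads = layers * (DXT5_BYTES + UNCOMPRESSED_BYTES)
    writes = layers * UNCOMPRESSED_BYTES
    return {"reads": reads, "writes": writes, "total": reads + writes}


if __name__ == "__main__":
    for layers in (1, 2, 4):
        fwd = forward_traffic(layers)
        dfr = fat_gbuffer_traffic(layers)
        print(layers, fwd, dfr, dfr["total"] / fwd)
```

The ratio stays at 9 regardless of layer count in this simple model; what grows with material complexity is the absolute traffic per pixel.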
This kind of system minimizes the fluctuating cost from geometry processing/rasterization and it compares very well to a TBDR in scenes that have high overdraw. An IMR still has more overdraw than a TBDR, but the overdraw is dirt cheap. (**) What matters in the future isn't the geometry rasterization performance. Geometry rasterization is only around 10%-20% of the whole frame rendering cost if you use advanced deferred rendering techniques. TBDR/IMR aren't that different if 80%+ of frame rendering time is spent running compute shaders.
(**) The biggest downside of the technique described above is that the "texture coordinate" (= texture address) must contain enough data to distinguish all the texture pixels that might be visible in the frame (and bilinear combinations of those). Basically with current architectures this means you need a big texture atlas, and you need to store all your textures there. This is not a viable strategy for games that have a gigabyte worth of textures loaded in memory at once. Virtual texturing however only tries to keep texture data in memory that is required to render the current frame. The whole data set fits in a single 8192x8192 atlas (virtual texture page cache). With this kind of single atlas, the "texture coordinate" problem becomes trivial: just store a 32 bit (normalized int 16-16) texture coordinate in the g-buffer. Quote:
Page visibility determination (from the depth buffer) of course takes some extra time, but you can combine it with some other full screen pass to minimize its impact. Rendering several smaller shadow frustums (pages) of course increases draw call count (and vertex overhead), but techniques such as merge-instancing can basically eliminate that problem (single draw call per page / subobject culling for reduced vertex overhead). With some DrawInstancedIndirect/DispatchIndirect trickery that's doable, but dynamic kernel dispatching by other kernels would make things much better (GK110 will be the first GPU to support this). |
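For the 32 bit (normalized int 16-16) texture coordinate mentioned earlier in this post, a rough sketch of the packing could look like the following. The bit layout (u in the high half, v in the low half) and the helper names are assumptions for illustration only; a real renderer would just write a two-channel 16-bit unorm render target and let the hardware do the conversion.

```python
ATLAS_SIZE = 8192  # texels per side of the virtual texture page cache (from the post)


def _unorm16(x: float) -> int:
    """Quantize a value in [0, 1] to a 16-bit unsigned normalized integer."""
    x = min(max(x, 0.0), 1.0)
    return int(x * 65535.0 + 0.5)


def pack_uv(u: float, v: float) -> int:
    """Pack (u, v) into one 32-bit word: u in the high 16 bits, v in the low 16 (assumed layout)."""
    return (_unorm16(u) << 16) | _unorm16(v)


def unpack_uv(word: int) -> tuple:
    """Inverse of pack_uv, returning floats in [0, 1]."""
    return ((word >> 16) & 0xFFFF) / 65535.0, (word & 0xFFFF) / 65535.0


# 65536 codes spread over 8192 texels leaves this many sub-texel steps per texel
# for bilinear weights.
SUBTEXEL_STEPS = 65536 // ATLAS_SIZE
```

With an 8192-wide atlas that leaves 8 addressable positions inside each texel, which is why a single atlas makes the addressing problem manageable while a multi-gigabyte texture set would not fit in 16 bits per axis.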
||
|
|
|
|
|
#18 | |||
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Quote:
Quote:
IMO, the most bandwidth efficient way to do deferred rendering is to do it on a TB(D)R with the right extensions (programmable blending, being able to use tile memory as scratch not being output, etc.). Even in a worst-case scenario where you don't benefit from the deferred rendering, you're still not really using more memory bandwidth than a forward renderer. I'd say that's pretty cool! Quote:
Agreed. And whether that's what the workload looks like or not, shader core efficiency is key.
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles) "[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions." |
|||
|
|
|
|
|
#19 | |
|
Member
Join Date: Jan 2010
Posts: 117
|
Quote:
|
|
|
|
|
|
|
#20 |
|
Epsilon plus three
Join Date: Feb 2002
Location: Chania
Posts: 7,764
|
I'm afraid I've completely lost connection with the above.
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs. |
|
|
|
|
|
#21 | |
|
Unknown.
Join Date: Aug 2002
Location: UK
Posts: 4,877
|
Quote:
As for MSAA, it will reduce tile size rather than reduce tile memory per pixel. Otherwise since SGX only has 64bpp you wouldn't even have enough space to do 4xMSAA for 32bpp framebuffers
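A tiny sketch of that trade-off, under loudly stated assumptions: "64bpp" is read here as the on-chip tile storage available per pixel, the 32x32 base tile is purely hypothetical, and depth storage is ignored. None of these numbers are taken from SGX documentation.

```python
def fits_per_pixel_budget(color_bpp: int, msaa_samples: int, budget_bpp: int = 64) -> bool:
    """Would all color samples of one pixel fit if the tile size stayed fixed?"""
    return color_bpp * msaa_samples <= budget_bpp


def pixels_per_tile(base_pixels: int, msaa_samples: int) -> int:
    """With a fixed amount of tile memory, N samples per pixel leaves 1/N as many pixels per tile."""
    return base_pixels // msaa_samples


if __name__ == "__main__":
    base = 32 * 32  # hypothetical tile size at 1x
    print(fits_per_pixel_budget(32, 4))  # 4 x 32 = 128 bits, over a 64-bit budget
    print(pixels_per_tile(base, 4))      # so shrink the tile instead
```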
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles) "[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions." |
|
|
|
|
|
|
#22 | ||||||||
|
AndyTX
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
|
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Quote:
Anyways, this is an interesting conversation but we're pretty off-topic for this thread... might be worth someone splitting this?
__________________
The content of this message is my personal opinion only. |
||||||||
|
|
|
|
|
#23 | |||
|
Member
Join Date: Nov 2007
Posts: 938
|
Quote:
Let's first go through the easy case of bilinear filtering from a virtual texture. In this case, the texture coordinate already implicitly contains the mip level, as the indirection texture lookup transforms the texture coordinate to the correct 128x128 pixel page depending on the mip level (gradients). Basically the x,y texture coordinate pair contains all the info you need. Bilinear filtering isn't exactly a hot technique itself, but if you use anisotropic mip hardware to calculate the lod level based on gradients (min of x,y clamped to max+1 instead of max) you will get higher detail on slopes. I call this "bilinear anisotropic", and we used it in Trials Evolution (hacks like this are required for 60 fps on current gen consoles).
Trilinear isn't much harder. The virtual texture indirection lookup basically truncates the mip value to floor(mip). That data is implicitly stored in the x,y texture coordinate pair for free. The only extra data you need to store is the frac(mip) portion. A 4 bit normalized [0,1] integer is enough for this purpose. We are blending between two adjacent mip levels after all, not two completely different images (**), so using 8 bits or even more (floats
Anisotropic filtering with virtual texturing is still a topic that hasn't been researched a lot ("Carmack's Hack" being the state of the art for performance
If you want to do proper anisotropic filtering, you obviously need to store both gradients. The virtual texture indirection lookup points you to a location that stores the most detailed data you need (minimum of the gradients). Both gradients are increments to this value. The smaller gradient increment is always in the range [0,1] (when measured in mip levels); the larger can be more than that (but it's always positive). Again 4 bits should be enough for the first one, and if we share a 16 bit value for them, we have 12 bits remaining. That's more than enough for the second gradient bias.
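The bit budget described above (4 bits for frac(mip), or 4 + 12 bits sharing one 16 bit value for the two gradient increments) can be sketched roughly as follows. The post does not say how the 12-bit value is encoded, so the unsigned 4.8 fixed-point format used here is an assumption, as are all the function names.

```python
import math


def quantize_unorm(x: float, bits: int) -> int:
    """Quantize a value in [0, 1] to an unsigned normalized integer with the given bit count."""
    levels = (1 << bits) - 1
    x = min(max(x, 0.0), 1.0)
    return int(x * levels + 0.5)


def split_mip(mip: float) -> tuple:
    """Trilinear case: floor(mip) is implied by the page lookup, frac(mip) is stored in 4 bits."""
    base = math.floor(mip)
    return base, quantize_unorm(mip - base, 4)


def pack_aniso(small_inc: float, large_inc: float) -> int:
    """Anisotropic case: pack both gradient increments (in mip levels) into 16 bits.

    Low 4 bits: smaller increment, always in [0, 1].
    High 12 bits: larger increment as unsigned 4.8 fixed point (assumed encoding).
    """
    small = quantize_unorm(small_inc, 4)
    large = min(int(max(large_inc, 0.0) * 256.0 + 0.5), 0xFFF)
    return (large << 4) | small


def unpack_aniso(word: int) -> tuple:
    """Inverse of pack_aniso."""
    return (word & 0xF) / 15.0, (word >> 4) / 256.0
```

With this particular (assumed) encoding the larger increment covers just under 16 mip levels in steps of 1/256, which is far more range than a texture pyramid needs, so other splits of the 12 bits would work as well.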
Another way to approach this problem is to prefilter 128x128 tiles as 128x64 and 64x128 anisotropic tiles. Now we also use the gradient values to adjust the indirection lookup before storing the texture coordinate in the g-buffer. We can store these tiles adjacent to the original 128x128 (basically splitting the cache into 256x128 tiles) if we do not want to increase the indirection texture size (as the coordinate bias to anisotropic pages is easy to calculate). Alternatively we can use a hash instead of the indirection texture (cuckoo hash is guaranteed O(1), can be easily coded with no branching/flow control, has no dependency chains and benefits nicely from GPU latency hiding). As an extra bonus, this technique saves bandwidth compared to standard anisotropic filtering, but it doubles the virtual texture cache atlas size.
The last, and the most ambitious, way is to store no texture coordinate data at all. Use rasterization only for a depth pre-pass. The depth value is translated to a 3d-coordinate in the lighting shader (all deferred renderers do this already). If you have unique mapping in the virtual texture (***), you can do a hash lookup using this world coordinate to get the virtual texture coordinate. The naive thing would be to add all virtual texture pixels to a hash based on their 3d world coordinates (and update the hash whenever a page is loaded). A better way would be to have a sparse multilayer volume texture where the texture coordinates could be queried (this is basically a hash as well, but hash nodes are (8x8x8) volumes instead of single pixels, and it would be easy to query if the GPU has paged virtual memory, AMD's PRT OpenGL extension for example). It would contain only the surfaces visible on the screen (or in the virtual texture cache, because it's a superset of screen pixels). This kind of structure wouldn't need to be super high resolution, because texture coordinates are linearly interpolated along polygons (linear filtering from a volume texture would work just fine).
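For readers who haven't met cuckoo hashing, here is a small CPU-side sketch of why the lookup is O(1): every key can live in exactly one of two slots, so a lookup always reads two slots and nothing else. This is plain Python for illustration, not shader code; the class name, the hash constants and the integer keys are all made up, and a real table would rehash or grow instead of raising on a failed insert.

```python
class CuckooTable:
    """Two-table cuckoo hash mapping integer keys to values (illustrative sketch)."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self.slots = [[None] * capacity, [None] * capacity]

    def _index(self, which: int, key: int) -> int:
        # Two multiplicative hashes; the odd constants are arbitrary choices.
        mult = (0x9E3779B1, 0x85EBCA6B)[which]
        return (((key * mult) & 0xFFFFFFFF) >> 16) % self.capacity

    def lookup(self, key: int):
        # Always exactly two reads, independent of table contents. On a GPU both
        # fetches can be issued up front and the result chosen with a select.
        a = self.slots[0][self._index(0, key)]
        b = self.slots[1][self._index(1, key)]
        if a is not None and a[0] == key:
            return a[1]
        if b is not None and b[0] == key:
            return b[1]
        return None

    def insert(self, key: int, value, max_kicks: int = 32) -> None:
        # Overwrite in place if the key is already stored.
        for which in (0, 1):
            i = self._index(which, key)
            entry = self.slots[which][i]
            if entry is not None and entry[0] == key:
                self.slots[which][i] = (key, value)
                return
        # Otherwise place the item, evicting the occupant to its other table.
        item, which = (key, value), 0
        for _ in range(max_kicks):
            i = self._index(which, item[0])
            item, self.slots[which][i] = self.slots[which][i], item
            if item is None:
                return
            which ^= 1
        # The item still held here is not stored; a real table would rehash.
        raise RuntimeError("cuckoo insert failed; rehash or grow the table")
```

All the cost of collisions is pushed to insertion time (page loads, in the scheme above), which is what keeps the per-pixel lookup fixed at two fetches.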
(**) When using trilinear filtering, the virtual texture atlas has a single mip level. This allows you to use hardware trilinear filtering to blend between the current level and the one below it. It increases virtual texture atlas memory consumption by 25%. That's usually not a big deal.
(***) You would want to have unique mapping for other purposes as well. It allows you to have unique decals on all your objects in the world, and it allows you to precalculate object based texture transformations to the virtual texture cache (for example colorization). Unique virtual mapping shouldn't be confused with unique physical mapping. You don't need to store all versions of pages on the hard drive (like Rage does); you can burn the decals (and colorizations, etc) into pages during page loading. Quote:
Quote:
--> Please someone move this discussion to its own thread. Thank you! |
|||
|
|
|
|
|
#24 |
|
Regular
|
I'm not saying there is no trade off, I'm saying that however you dice it the bandwidth needed either for an early Z pass or for some form of through framebuffer deferred shading is always going to stay significant. The raison d'etre for hardware tilers is not going away (although if the geometry load keeps increasing at some point they will need object level binning to compete).
__________________
Cinematic is the new streamlined. |
|
|
|
|
|
#25 | |
|
a.k.a. Ingenu
Join Date: Feb 2002
Location: Apsley, U.K.
Posts: 2,729
|
Quote:
That, along with a standard texture layout, would allow for immense worlds streamed into memory, requiring less memory but likely more bandwidth. (That shouldn't be too much of a problem with SSDs becoming mainstream, but with memory already being bandwidth limited today I'm not too sure how that would go. I haven't done any estimate of the required bandwidth recently either ;p)
__________________
So many things to do, and yet so little time to spend... |
|
|
|
|