Old 24-Oct-2012, 22:22   #1
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
Default Deferred rendering, virtual texturing, IMR vs TB(D)R, and more...

NOTE: This thread was split from the Microsoft Surface thread in the Handheld Forums.


Quote:
Originally Posted by Ailuros View Post
As for the "free" thing, today's desktop GPUs have fillrate to spare; AF doesn't consume any worthwhile bandwidth nor memory footprint last time I checked in contrast to anti-aliasing.
AF definitely consumes bandwidth - it effectively bumps up the mip levels that you sample. It also is slower in the texture samplers, so the only case in which it is "free" is when you have samplers and bandwidth to spare, which is less and less the case (as these are typically the real bottlenecks). And it's never "free" in terms of power.

Quote:
Originally Posted by Ailuros View Post
Since when does a simple higher resolution equal to supersampling?
Since the resolution is high enough that you can't resolve the individual pixels? Your eye effectively integrates/resolves the high resolution image...
__________________
The content of this message is my personal opinion only.
Old 24-Oct-2012, 23:53   #2
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,764
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
AF definitely consumes bandwidth - it effectively bumps up the mip levels that you sample. It also is slower in the texture samplers, so the only case in which it is "free" is when you have samplers and bandwidth to spare, which is less and less the case (as these are typically the real bottlenecks). And it's never "free" in terms of power.
Nothing is completely free in 3D anyway, unless the system is heavily CPU bound, for example. Those types of GPUs need more TMUs, and with those you'll get more samplers anyway. As for bandwidth overall, when a small form factor GPU doesn't usually lose more than 10% for 4x multisampling, with the right number of TMUs there shouldn't be any worthwhile cost for AF either, especially since AF algorithms have been adaptive for eons now.

A couple of days ago I cleaned up a friend's PC, which was a mess, and put it through a few tests. The GT210 it carries has 4 TMUs and merely 8GB/s bandwidth over a 64bit bus. Even at the highest resolution AF didn't cost more than a fraction of performance, which wasn't even noticeable (a couple of fps), quite contrary to any amount of multisampling (up to ~1/3rd the 1xAA performance with 4xAA enabled).

Further to that, since we're in a Surface tablet thread: I haven't seen any benchmarks showing how much performance the ULP GF in Tegra3 loses with AF enabled, but it's at least capable of it. On the other hand it's not capable of MSAA due to the lack of tiling. The two aren't related, but it's not as if Tegra3 as a SoC has any bandwidth to spare, rather the contrary. If AF should cost more performance on it than on even the lowest desktop GPU, it would more likely be due to the lack of TMUs. That thing shouldn't have more than 2 TMUs anyway. Yes, loops cost bandwidth, but it's an indirect issue and not the primary bottleneck.

Quote:
Since the resolution is high enough that you can't resolve the individual pixels? Your eye effectively integrates/resolves the high resolution image...
It's not supersampling in the strict sense, either way you twist it. I understood your initial point, but the above hairsplitting gets to the point where it's rather ridiculous. As if the eye could resolve pixels at 1280 on a sub 10" display medium unless you glue your nose on to it.
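
The "can the eye resolve it" question can at least be put into rough numbers. A minimal sketch, assuming a 1280x800 panel with a 10" diagonal (the 800 and the viewing distances are made up for illustration) and taking roughly 60 pixels per degree, i.e. one pixel per arcminute, as the figure usually quoted for 20/20 acuity:

```python
import math

def pixels_per_degree(width_px, height_px, diagonal_in, distance_in):
    """Pixels covered by one degree of visual angle at a given viewing distance."""
    # Pixel density from the panel diagonal (hypothetical panel dimensions).
    ppi = math.hypot(width_px, height_px) / diagonal_in
    # Length on the screen, in inches, subtended by one degree of visual angle.
    inches_per_degree = 2.0 * distance_in * math.tan(math.radians(0.5))
    return ppi * inches_per_degree

if __name__ == "__main__":
    for distance in (6, 12, 18, 24):  # viewing distances in inches (assumed)
        ppd = pixels_per_degree(1280, 800, 10.0, distance)
        print(f"{distance:2d} in: {ppd:5.1f} px/deg")
```

Under these assumptions the panel only reaches the 60 px/deg ballpark at around two feet, so whether individual pixels are resolvable depends heavily on how close the device is held.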
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs.
Old 25-Oct-2012, 00:50   #3
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
Default

Quote:
Originally Posted by Ailuros View Post
Even at the highest resolution AF didn't cost more than a fraction of performance, which wasn't even noticeable (a couple of fps), quite contrary to any amount of multisampling (up to ~1/3rd the 1xAA performance with 4xAA enabled)
You're trivializing something that is data dependent and far more complex than you seem to be taking into account. For instance, higher resolution will incur *less* AF (lighten the load). It's similarly dependent on available texture resolution - it will only do anything if you're min filtering textures (i.e. texture resolution exceeds projected screen shading rate). Certainly if you have a high screen resolution and/or low texture resolutions AF will do absolutely nothing, and thus be "free" :P

MSAA is similarly dependent on the scene, but more-so on geometric frequencies. For simple geometry, there's no reason it has to cost much either since MSAA compression should handle the bandwidth usage.

Quote:
Originally Posted by Ailuros View Post
Yes loops cost in bandwidth, but it's an indirect issue and not the primary bottleneck.
How is it indirect? AF *directly* affects the MIP calculation (hence the colored tunnel tests) by using the minor axis and making more pixels use higher mip levels. Certainly the line integration and additional samples are costly too, but it's a directly related issue (more taps is expensive *because* of bandwidth).
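
The effect on the mip calculation can be sketched with the textbook formulation of anisotropic filtering (the one used in the OpenGL anisotropic filtering extension): the tap count is the clamped ratio of the footprint axes, and the LOD is taken from the major axis divided by that tap count. Real hardware uses its own approximations, so treat `mip_lod` and its parameters as illustrative names only:

```python
import math

def mip_lod(p_major, p_minor, max_aniso=1):
    """Return (lod, taps) for a pixel footprint measured in base-level texels.

    p_major / p_minor are the lengths of the footprint's long and short axes.
    max_aniso=1 models plain trilinear filtering.
    """
    p_minor = max(p_minor, 1e-6)  # guard against a degenerate footprint
    taps = max(1, min(math.ceil(p_major / p_minor), max_aniso))
    # Trilinear picks the mip from the major axis. AF divides the major axis
    # by the tap count, which moves the lookup toward the minor axis and
    # therefore toward a more detailed (larger) mip level.
    lod = max(0.0, math.log2(max(p_major / taps, 1e-6)))
    return lod, taps

if __name__ == "__main__":
    for max_aniso in (1, 4, 16):
        print(max_aniso, mip_lod(16.0, 2.0, max_aniso))
```

For a 16x2 texel footprint this goes from LOD 4 with one tap (trilinear) to LOD 1 with eight taps at 16x AF. Every mip level toward the base has four times as many texels, so each tap walks a much larger working set on top of there being more taps.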

Quote:
Originally Posted by Ailuros View Post
It's not supersampling in the strict sense, either way you twist it. I understood your initial point, but the above hairsplitting gets to the point where it's rather ridiculous. As if the eye could resolve pixels at 1280 on a sub 10" display medium unless you glue your nose on to it.
But that's the point... my point was that you don't need to render at high-dpi resolutions to make images look good. Better sampling (more AA, better filtering, etc) is really what you want, and brute force pixel shading at the higher frequency is a poor use of hardware resources to that end.
Old 25-Oct-2012, 20:42   #4
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
Default

Quote:
Originally Posted by Ailuros View Post
Indirect under the reasoning that by the time you need to loop for any of the AF related calculations (due to the absence of quad TMUs for the lowest common denominators) the bandwidth requirements increase by far more.
Uhh, the TMUs (which by the way is not exactly a well-defined concept that exists in the same way on every architecture) pretty much always have to loop for AF. I see no compelling reason to massively complicate the design by distributing the integration of a single sample when there's plenty of parallelism in different samples anyways. Thus all you're saying is "if texture throughput is sufficient, texturing will not be a bottleneck"... ok? Doesn't that go for pretty much anything?

Quote:
Originally Posted by Xmas View Post
The degree of anisotropy at a sample point is not dependent on resolution (assuming constant aspect ratio).
Sure the ratio does not change, but the number of samples you take does, simply because of hitting the most detailed mip level. In mobile where texture resolutions are extremely low, this is an especially pronounced effect.
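
A toy model of that effect, with one loud assumption: once the lookup is clamped to the most detailed mip, probes spaced closer than one texel apart are treated as redundant, so at most `ceil(major)` taps are counted as useful. `af_cost` and `res_scale` are hypothetical names, and actual samplers decide this in their own ways:

```python
import math

def af_cost(p_major, p_minor, res_scale, max_aniso=16):
    """Return (lod, useful_taps) for a footprint after scaling screen resolution.

    p_major / p_minor: footprint axes in base-level texels at res_scale == 1.
    res_scale: screen resolution multiplier; the footprint shrinks as it grows.
    """
    major = p_major / res_scale
    minor = p_minor / res_scale
    # The anisotropy ratio is unchanged by res_scale, as Xmas points out.
    ratio_taps = max(1, min(math.ceil(major / minor), max_aniso))
    # Assumption: at mip 0, taps less than a texel apart add nothing.
    useful_taps = max(1, min(ratio_taps, math.ceil(major)))
    lod = max(0.0, math.log2(major / ratio_taps))
    return lod, useful_taps

if __name__ == "__main__":
    for scale in (0.5, 1, 2, 8):
        print(scale, af_cost(8.0, 1.0, scale))
```

With an 8:1 footprint the ratio stays at 8 for every scale, but the useful tap count in this model drops from 8 to 4 to 1 as the screen resolution goes up relative to the texture.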

Quote:
Originally Posted by sebbbi View Post
Of course the downside being that you can run demanding games only for two hours or so, and the battery is dead
But that's the rub in general. Power efficiency of computing a frame is a separate metric from raw performance, and one that people are normally ill-equipped to measure meaningfully. It's also not a simple "which is better" situation because you need to reach a minimum level of performance before a solution is interesting at all. Running serial algorithms (parallelism always has power overhead) on a low power single core CPU and finishing your frame in a few minutes might be the most power-efficient, but it's hardly interesting.

Quote:
Originally Posted by Grall View Post
Problem with HD4000 is it's an immediate-mode renderer, which spends a LOT of its fillrate and power budget drawing and shading invisible pixels.
Meh, while I admit tiled renderers have advantages in framebuffer bandwidth and power, I'm fairly certain that Tegra is not a tiled renderer either, so it's hardly the only way to play the game. Also I don't think modern IMG stuff sorts or does other hidden surface removal in the tile... I believe they are tiled, but pretty much just run like an IMR inside each tile (like Larrabee). There have been API changes over the years that make it infeasible to run any other way (and still be spec compliant), especially when you get into DX10 and 11.

I'm also not convinced framebuffer bandwidth is the real limitation in the long run. Certainly while we're still in the space of rendering with basic shaders, simple and extremely low resolution texturing and the like it's a win, but it's not clear that framebuffer bandwidth is a significant factor in desktop games for instance. So unless you believe that the mobile graphics world will evolve significantly differently (and so far it really has just mirrored the evolution of desktop graphics with a few omissions), I'm not sure we can necessarily pronounce IMR dead in power-constrained environments. For my part I expect the graphics pipeline portion of rendering a frame to be increasingly small as we move forward, with more and more work being done in generic compute/software.

For my part I'm actually becoming less and less interested in pure tablets, or even tablets + keyboard "covers". After having played with a few "convertibles", I'm much more leaning towards a good ULV big core (17W or lower), with a detachable tablet portion + keyboard (ideally with more batteries, transformer style), and a nice digitizer. Mobile hardware just hasn't scaled up in performance as quickly as desktop hardware seems to be scaling down in power usage.
Old 25-Oct-2012, 21:03   #5
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,764
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
Uhh, the TMUs (which by the way is not exactly a well-defined concept that exists in the same way on every architecture) pretty much always have to loop for AF.
Loop a little or loop a lot?

Quote:
Meh, while I admit tiled renderers have advantages in framebuffer bandwidth and power, I'm fairly certain that Tegra is not a tiled renderer either, so it's hardly the only way to play the game.
Of course the ULP GF doesn't use any tiling; it uses large caches to solve the bandwidth problem. That doesn't work for multisampling, however, which is probably the primary reason why only coverage AA is supported.

Quote:
Also I don't think modern IMG stuff sorts or does other hidden surface removal in the tile... I believe they are tiled, but pretty much just run like an IMR inside each tile (like Larrabee).
I'd love to see what things look like on a theoretical LRB with very thin diagonally placed triangles.

Quote:
I'm also not convinced framebuffer bandwidth is the real limitation in the long run. Certainly while we're still in the space of rendering with basic shaders, simple and extremely low resolution texturing and the like it's a win, but it's not clear that framebuffer bandwidth is a significant factor in desktop games for instance. So unless you believe that the mobile graphics world will evolve significantly differently (and so far it really has just mirrored the evolution of desktop graphics with a few omissions), I'm not sure we can necessarily pronounce IMR dead in power-constrained environments. For my part I expect the graphics pipeline portion of rendering a frame to be increasingly small as we move forward, with more and more work being done in generic compute/software.
On a pure GPU integration statistical level it's a 50-50 ballgame between TBDRs and IMRs in the small form factor markets right now. If you turn the perspective slightly, skip the deferred part (which is limited to IMG only) and concentrate on tile based small form factor GPUs, they're the vast majority. It's not that there aren't any bandwidth savings with tiling and early-Z combinations in those IMRs, rather the contrary.

Quote:
Mobile hardware just hasn't scaled up in performance as quickly as desktop hardware seems to be scaling down in power usage.
I tend to disagree. The performance and efficiency jumps for smartphone/mW platforms have been huge over the years, and the bump will only get significantly larger with the coming generation of small form factor GPUs. How long ago was it that, for example, OMAP3 GPUs had barely 1 GFLOP of arithmetic throughput? Meanwhile the Adreno320 in Qualcomm S4 smartphones should exceed the 50 GFLOPs mark.
Old 26-Oct-2012, 04:47   #6
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by Grall View Post
This advantage hardly matters in a tablet, because almost nobody runs computing-intensive apps on such a device. It kills the battery for starters, and second, even intel's portable chips are no damn good at computing anyway; not compared to a desktop processor.
Quote:
Problem with HD4000 is it's an immediate-mode renderer, which spends a LOT of its fillrate and power budget drawing and shading invisible pixels. This is no good in a portable device - arguably no good in a desktop setting either really, but neither AMD nor Nvidia (nor Intel for that matter) seem interested in wanting to do anything real about this issue...yet, anyway.

Time and power constraints may change that eventually.
It's an Early Z IMR. Obscured pixels matter less than you think.

A true TBDR paired with a renderer that plays to its strengths is another matter of course. Not sure how many iOS games render-to-TBDR, so to speak.
__________________
The views presented here are my own and not my employer's.
Quote:
Originally Posted by Alexko View Post
So in a nutshell, model [BLANK] will have [BLANK], up to [BLANK], and even [BLANK] for a power consumption of just [BLANK]. Impressive.
Old 26-Oct-2012, 05:35   #7
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
Meh, while I admit tiled renderers have advantages in framebuffer bandwidth and power, I'm fairly certain that Tegra is not a tiled renderer either, so it's hardly the only way to play the game. Also I don't think modern IMG stuff sorts or does other hidden surface removal in the tile... I believe they are tiled, but pretty much just run like an IMR inside each tile (like Larrabee). There have been API changes over the years that make it infeasible to run any other way (and still be spec compliant), especially when you get into DX10 and 11.
Modern IMG stuff does HSR within a tile. Promise.

I am curious about the API bits you mentioned that are inconvenient for a TBDR. I had no idea about this and I'd like to know more. Care to share?

Quote:
I'm also not convinced framebuffer bandwidth is the real limitation in the long run. Certainly while we're still in the space of rendering with basic shaders, simple and extremely low resolution texturing and the like it's a win, but it's not clear that framebuffer bandwidth is a significant factor in desktop games for instance. So unless you believe that the mobile graphics world will evolve significantly differently (and so far it really has just mirrored the evolution of desktop graphics with a few omissions), I'm not sure we can necessarily pronounce IMR dead in power-constrained environments. For my part I expect the graphics pipeline portion of rendering a frame to be increasingly small as we move forward, with more and more work being done in generic compute/software.
With tessellation, the cost of doing Early Z pass can become quite a bit more. Also, in one of your presentations, I remember you had mentioned that Tiled deferred and Tiled forward had pretty much the same bandwidth on an Early Z IMR. Well, it would have a lot better GPU bandwidth numbers for Tiled forward on a TBDR. Not to mention all the savings CPU side.

When we get to ~10MB cache on die, around 14 nm or 10 nm for sure, then we *could* have the entire framebuffer on die, or at least the z buffer. I think that could shift the paradigm quite a bit.
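
As a back-of-the-envelope check on what ~10MB buys, here is a small sketch. It assumes 32-bit colour and 32-bit depth with no MSAA and no compression, and treats the cache as 10 MiB; the resolutions are just examples:

```python
def buffer_mib(width, height, bytes_per_pixel=4, samples=1):
    """Size of one render target in MiB (uncompressed)."""
    return width * height * bytes_per_pixel * samples / (1024 * 1024)

CACHE_MIB = 10.0  # assumed on-die capacity

if __name__ == "__main__":
    for width, height in ((1280, 720), (1920, 1080), (2048, 1536)):
        z_only = buffer_mib(width, height)       # depth buffer alone
        colour_and_z = 2 * z_only                # colour + depth
        print(f"{width}x{height}: z {z_only:.2f} MiB "
              f"(fits: {z_only <= CACHE_MIB}), "
              f"colour+z {colour_and_z:.2f} MiB "
              f"(fits: {colour_and_z <= CACHE_MIB})")
```

Under those assumptions a 1080p z buffer (about 7.9 MiB) fits, colour plus z at 1080p does not, and at 2048x1536 even the z buffer alone (12 MiB) is too large.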

Also, since UI is such an important job for mobile GPUs and the systems are so bandwidth constrained, that alone can be quite useful.

Quote:
For my part I'm actually becoming less and less interested in pure tablets, or even tablets + keyboard "covers". After having played with a few "convertibles", I'm much more leaning towards a good ULV big core (17W or lower), with a detachable tablet portion + keyboard (ideally with more batteries, transformer style), and a nice digitizer. Mobile hardware just hasn't scaled up in performance as quickly as desktop hardware seems to be scaling down in power usage.
That was my belief too. It's good to hear confirmation from someone with experience. Is there anything in particular that attracted you towards a transformer + stylus device? Anything that pushed you away from tablet + covers?
Old 26-Oct-2012, 06:01   #8
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,221
Default

Quote:
Originally Posted by rpg.314 View Post
When we get to ~10MB cache on die, around 14 nm or 10 nm for sure, then we *could* have the entire framebuffer on die, or at least the z buffer. I think that could shift the paradigm quite a bit.
Except that some developers have moved to SOFTWARE tiling with deferred shading to avoid doing multiple geometry passes ... so not really.
__________________
Cinematic is the new streamlined.
Old 26-Oct-2012, 06:16   #9
Mintmaster
Senior Member
 
Join Date: Mar 2002
Posts: 3,779
Default

Quote:
Originally Posted by rpg.314 View Post
Modern IMG stuff does HSR within a tile. Promise.
IMRs do HSR as well through early Z. The only thing they may do extra is go through the tile's polys twice (Z-pass first), but that often has questionable value, and can be a loss in a well optimized game with object-level sorting.

Where tiled renderers have a bandwidth advantage is primarily with alpha rendering. They also have a small advantage with the efficiency of large block writes to the framebuffer. The penalty is the bandwidth cost of binning (vertices pass through the chip multiple times).
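
A rough way to see the trade-off is a crude traffic estimate. The model below ignores caches and framebuffer compression entirely: an IMR is assumed to read and write the destination colour for every blended layer, while a tiler is assumed to blend on chip, write each pixel out once, and pay for writing and re-reading the binned vertex data. All function names and the per-vertex size are made up for illustration:

```python
def imr_blend_bytes(pixels, layers, bytes_per_pixel=4):
    """External traffic for alpha blending on an IMR: read + write per layer."""
    return pixels * layers * 2 * bytes_per_pixel

def tiler_blend_bytes(pixels, vertices, bytes_per_pixel=4, bytes_per_vertex=32):
    """External traffic on a tiler: one resolve write plus binning traffic."""
    resolve = pixels * bytes_per_pixel            # tile written out once
    binning = vertices * bytes_per_vertex * 2     # binned data written, then read back
    return resolve + binning

if __name__ == "__main__":
    pixels = 1280 * 720
    print("IMR  :", imr_blend_bytes(pixels, layers=8))
    print("Tiler:", tiler_blend_bytes(pixels, vertices=100_000))
```

With eight blended layers and 100k vertices the tiler comes out well ahead in this model. Push the vertex count up far enough (heavy tessellation, say) and the binning term starts to dominate instead.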
Quote:
That was my belief too. It's good to hear confirmation from someone with experience. Is there anything in particular that attracted you towards a transformer + stylus device? Anything that pushed you away from tablet + covers?
Touch+stylus is simply superior to touch-only. A stylus is superior to a mouse in every way except in ease of switching from the keyboard (few tenths of a second extra) and cost (minimal now). Fingers are inferior to the mouse in every way except for some multitouch gestures (and, of course, the convenience of being permanently attached to the human body).

The stylus lets you run any desktop software comfortably, as it doesn't need a low density UI that also considers how the finger blocks the view of whatever is under it.
Old 26-Oct-2012, 07:21   #10
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,221
Default

Quote:
Originally Posted by Mintmaster View Post
object-level sorting.
It would be so nice if hardware tilers were given the necessary information to do this as well ... all this time and the APIs are still brain dead.
Old 26-Oct-2012, 17:53   #11
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
Default

Quote:
Originally Posted by Ailuros View Post
I'd love to see what things look like on a theoretical LRB with very thin diagonally placed triangles.
Well no GPUs like thin/skinny triangles, but obviously tiled ones don't like any triangles that span too many tiles.

Quote:
Originally Posted by Ailuros View Post
I tend to disagree. The performance and efficiency jumps for smartphone/mW platforms have been huge over the years, and the bump will only get significantly larger with the coming generation of small form factor GPUs. How long ago was it that, for example, OMAP3 GPUs had barely 1 GFLOP of arithmetic throughput? Meanwhile the Adreno320 in Qualcomm S4 smartphones should exceed the 50 GFLOPs mark.
Well but that level of performance I'm simply not interested in. Ultimately for tablet stuff I'm looking at something more in the range of 5-15W, and I fully expect the ULV desktop parts to be in that range with similar performance to their current 17W offerings in a year or two. Nothing I've seen from mobile makes me think that their perf is going to scale high enough in that time to get near the performance of such a chip, CPU *or* GPU-wise. Love to be proven wrong, as I'd love to play with some more exotic hardware than what we currently have on PC.

Phone CPU/GPU performance is basically uninteresting to me beyond the threshold where it can run the basic OS, e-mail, etc. I have zero desire to ever play a game on a tiny phone screen that has barely enough space to see let alone interact.

Quote:
Originally Posted by NRP View Post
I'd be interested in a tablet that can handle Photoshop.
Samsung Series 7 Slate has been out for a while. It's awesome, and the stylus obviously works great with Photoshop. Surface Pro looks to be basically the same thing one generation newer...

Quote:
Originally Posted by rpg.314 View Post
I am curious about the API bits you mentioned that are inconvenient for a TBDR. I had no idea about this and I'd like to know more. Care to share?
Sure; two of the big ones are predicated rendering and the semantics of UAV access from pixel shaders. Both of them let you set up dependencies between arbitrary pixels and subsequent draw calls. For UAV accesses they are allowed to be unordered within a single draw call, but all must complete before the next draw call takes place (!). This isn't too bad for an IMR (just means you need to put a barrier after each draw call that has a UAV bound), but for a TBR this requires flushing all bins between every draw call. That is a disaster...
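
To make the asymmetry concrete, here is a deliberately simplified count. A frame is modelled as a list of draws, each flagged by whether a UAV is bound; the IMR pays one barrier per UAV-bound draw, while the tiler has to flush its bins after each one. This is only a sketch of the argument above, not of any particular driver or chip:

```python
def imr_barriers(draws):
    """Barriers an IMR inserts: one after each draw that has a UAV bound."""
    return sum(1 for uav_bound in draws if uav_bound)

def tbr_bin_flushes(draws):
    """Bin flushes on a tiler: one after each UAV-bound draw, plus a final
    flush if there is still unflushed work at the end of the frame."""
    flushes = 0
    pending = False
    for uav_bound in draws:
        pending = True
        if uav_bound:
            flushes += 1      # everything binned so far must be rendered now
            pending = False
    return flushes + (1 if pending else 0)

if __name__ == "__main__":
    no_uav = [False] * 100
    all_uav = [True] * 100
    print(imr_barriers(no_uav), tbr_bin_flushes(no_uav))
    print(imr_barriers(all_uav), tbr_bin_flushes(all_uav))
```

The counts look similar, but the costs do not: a barrier on an IMR is a pipeline drain, whereas each flush on a tiler means walking every tile and moving tile contents to and from memory, which throws away most of the benefit of binning.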

Quote:
Originally Posted by rpg.314 View Post
With tessellation, the cost of doing Early Z pass can become quite a bit more. Also, in one of your presentations, I remember you had mentioned that Tiled deferred and Tiled forward had pretty much the same bandwidth on an Early Z IMR. Well, it would have a lot better GPU bandwidth numbers for Tiled forward on a TBDR.
Not totally true, because on a TBR you would implement tiled deferred using framebuffer reads and discard the G-buffer (the G-buffer is effectively just per-tile scratch space), so it will be similar there too. But agreed that a naive implementation would work that way.

Quote:
Originally Posted by Mintmaster View Post
Where tiled renderers have a bandwidth advantage is primarily with alpha rendering.
Indeed, and that's pretty significant for the majority of stuff that you see on mobile. So much so that I expect people to start considering implementing binning in software (even though GPUs are fairly poor at that sort of data structure creation at the moment) for particles, etc. to avoid such a massive waste of bandwidth on IMR blending.
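
For illustration, software binning of particles boils down to something like the following CPU-side sketch. A real implementation would run on the GPU with atomics or a sort; the tile size, screen size and the `(x, y, radius)` particle layout here are all assumptions:

```python
from collections import defaultdict

def bin_particles(particles, width, height, tile_size=32):
    """Map each screen tile to the indices of the particles overlapping it.

    particles: iterable of (x, y, radius) in pixels.
    Returns {(tile_x, tile_y): [particle_index, ...]}.
    """
    max_tx = (width - 1) // tile_size
    max_ty = (height - 1) // tile_size
    bins = defaultdict(list)
    for index, (x, y, radius) in enumerate(particles):
        # Tile range covered by the particle's bounding box, clamped to the screen.
        tx0 = max(0, int((x - radius) // tile_size))
        tx1 = min(max_tx, int((x + radius) // tile_size))
        ty0 = max(0, int((y - radius) // tile_size))
        ty1 = min(max_ty, int((y + radius) // tile_size))
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)

if __name__ == "__main__":
    particles = [(10, 10, 4), (30, 10, 4), (100, 100, 1)]
    print(bin_particles(particles, width=128, height=128))
```

Each tile's list can then be blended in fast local memory and written out once, instead of read-modify-writing the framebuffer for every particle.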

Quote:
Originally Posted by rpg.314 View Post
Is there anything in particular that attracted you towards a transformer + stylus device? Anything that pushed you away from tablet + covers?
A few things actually.

1) Typing without tactile feedback is pretty awful... so much so that I don't really care to have a keyboard at all unless it's a decent tactile one. The tactile surface one might suffice, but we'll see.

2) If you're going to have a keyboard, might as well fit some more battery in the enclosure for it.

3) Stylus is great. I don't use it exclusively or anything, but for anything more precise, or drawing or even writing to some extent, it's quite pleasant. Personally I use it for rough work, math, etc. in OneNote. OneNote even can convert my math scrawling to symbols (!!) and it works very well.

Thus I basically see no reason why I can't have it all... touch, keyboard, stylus, good performance and ability to run anything I want. The convertible aspect means that I can use just the tablet portion when it's more convenient to do that (on a bus, etc) but turn it into a laptop when I want to get some work done. After I've seen some of the convertible systems, they seem like a strict superset of what you get in other mobile devices.

I won't claim that these are the primary concerns for everyone, but it is hard to argue that the "strict" tablets have any advantages over convertibles going forward other than perhaps price, which gets muddy if you still buy a laptop in addition...
Old 26-Oct-2012, 21:45   #12
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,764
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
Well no GPUs like thin/skinny triangles, but obviously tiled ones don't like any triangles that span too many tiles.
The point was that IMG's tiling method is unique, otherwise they wouldn't have a number of patents for it.

Quote:
Well but that level of performance I'm simply not interested in. Ultimately for tablet stuff I'm looking at something more in the range of 5-15W, and I fully expect the ULV desktop parts to be in that range with similar performance to their current 17W offerings in a year or two. Nothing I've seen from mobile makes me think that their perf is going to scale high enough in that time to get near the performance of such a chip, CPU *or* GPU-wise. Love to be proven wrong, as I'd love to play with some more exotic hardware than what we currently have on PC.
Tablets obviously have a much higher power envelope than smartphones. Apple has been doubling GPU performance on a yearly cadence since the iPad2 (well, for the 4th generation iPad it's less than a year, but it could very well be some sort of mid life kicker until their true next generation tablet arrives). We'll see how all next generation small form factor GPUs perform in real time after they're released, but I wouldn't be as quick to underestimate IMG's Rogue or even the ULP GF in NV's Wayne. Scalability for the former in terms of clusters doesn't end at 16 clusters, nor at just 1 TFLOP of fp throughput (even though FLOPs are as meaningless a metric as triangle throughput once used to be). It'll come down to what perf/W looks like in order to make any sort of comparison in the first place, or better, how much GPU performance anyone can squeeze into that 5-15W tablet power envelope, all other factors included.
Old 26-Oct-2012, 23:41   #13
silent_guy
Senior Member
 
Join Date: Mar 2006
Posts: 1,686
Default

Quote:
Originally Posted by Ailuros View Post
The point was that IMG's tiling method is unique, otherwise they wouldn't have a number of patents for it.
http://knowyourmeme.com/photos/312563
Old 27-Oct-2012, 02:23   #14
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by MfA View Post
Except that some developers have moved to SOFTWARE tiling with deferred shading to avoid doing multiple geometry passes ... so not really.
Users of tiled forward rendering might not agree. MSAA is quite useful.

Analytic AO needs 2 geometry passes anyway.

Even if you are doing deferred shading in software, if the entire G buffer can be had in high speed memory, which seems possible with interposers, that would change sweet spots considerably.

Even if you can't fit the full G buffer on die, if you can just fit the ID buffer, that is still a big win.

ID buffer = what a hw TBDR generates to decide which tri/pixel combination to shade. Essentially Frame based Deferred Rendering in hw without any cost of binning.

Last edited by rpg.314; 27-Oct-2012 at 02:30.
Old 27-Oct-2012, 02:42   #15
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by Mintmaster View Post
IMRs do HSR as well through early Z. The only thing they may do extra is go through the tile's polys twice (Z-pass first), but that often has questionable value, and can be a loss in a well optimized game with object-level sorting.

Where tiled renderers have a bandwidth advantage is primarily with alpha rendering. They also have a small advantage with the efficiency of large block writes to the framebuffer. The penalty is the bandwidth cost of binning (vertices pass through the chip multiple times).
1) That analysis is correct but would certainly change in view of large amounts of tessellation. You would not want to render geometry twice.

2) Not having to shade quads is a win.

3) And there's this

http://www.google.com/patents/US20110254852

It's a new IMG patent that describes how you can use a TBDR to save both pixel and texel fill rate with shadow mapping and the like.

Basically, don't rasterize shadow maps immediately after binning is complete. Wait until the next render wants to lookup the texels. Then rasterize just one tile of the shadow map opportunistically and then immediately use it to shade the fragments.

This way both the z testing and the subsequent texture filtering can be done out of on chip buffers. Since the final render is going to have fairly large spatial coherence, it could save quite a bit of lookups.
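
Going only by the description in this post (not by the patent text itself), the idea can be caricatured in software as a shadow map whose tiles are rasterized on first lookup. `LazyShadowMap` and the `rasterize_tile` callback are hypothetical names, and unlike a real on-chip tile buffer this sketch never evicts anything:

```python
class LazyShadowMap:
    """Shadow map whose tiles are rasterized only when first sampled."""

    def __init__(self, rasterize_tile, tile_size=32):
        self._rasterize_tile = rasterize_tile  # callback: (tile_x, tile_y) -> 2D depth list
        self._tile_size = tile_size
        self._tiles = {}                       # tiles rasterized so far
        self.tiles_rasterized = 0

    def depth_at(self, x, y):
        size = self._tile_size
        key = (x // size, y // size)
        tile = self._tiles.get(key)
        if tile is None:
            # Deferred work: this tile is only rasterized because someone sampled it.
            tile = self._rasterize_tile(*key)
            self._tiles[key] = tile
            self.tiles_rasterized += 1
        return tile[y % size][x % size]

def flat_tile(tile_x, tile_y, size=32):
    """Toy rasterizer: a tile of constant depth that depends on its position."""
    depth = 0.1 * (tile_x + tile_y + 1)
    return [[depth] * size for _ in range(size)]

if __name__ == "__main__":
    shadow = LazyShadowMap(flat_tile)
    for x, y in ((3, 4), (10, 20), (31, 31), (40, 5)):
        shadow.depth_at(x, y)
    print("tiles rasterized:", shadow.tiles_rasterized)
```

Four lookups touch only two tiles here, so only two get rasterized. Shadow map tiles that the final render never samples cost nothing, which is where the fill rate saving would come from.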

I don't think this technique can be copied by an IMR.

4) On a handheld we have much larger resolution and since the screen size is small, the physical size of objects (in mm across the screen) is small. I am just thinking aloud here, but I think the tris needed to hide the curvature would be less too. Which could tip the balance in a TBR's favor for this market.
Old 27-Oct-2012, 03:17   #16
rpg.314
Senior Member
 
Join Date: Jul 2008
Location: /
Posts: 4,070
Default

Quote:
Originally Posted by Andrew Lauritzen View Post
Sure; two of the big ones are predicated rendering and the semantics of UAV access from pixel shaders. Both of them let you set up dependencies between arbitrary pixels and subsequent draw calls. For UAV accesses they are allowed to be unordered within a single draw call, but all must complete before the next draw call takes place (!). This isn't too bad for an IMR (just means you need to put a barrier after each draw call that has a UAV bound), but for a TBR this requires flushing all bins between every draw call. That is a disaster...
Predicated rendering on a TBDR is no worse than rendering on IMR.

Wouldn't using a UAV immediately after rendering to it stall an IMR as well? A TBR should do no worse than an IMR in such case.
Quote:
Not totally true, because on a TBR you would implement tiled deferred using framebuffer reads and discard the G-buffer (the G-buffer is effectively just per-tile scratch space), so it will be similar there too. But agreed that a naive implementation would work that way.
I had not thought of that. That would definitely work. Wouldn't the fb read involved be beyond current APIs though?

Quote:
Indeed, and that's pretty significant for the majority of stuff that you see on mobile. So much so that I expect people to start considering implementing binning in software (even though GPUs are fairly poor at that sort of data structure creation at the moment) for particles, etc. to avoid such a massive waste of bandwidth on IMR blending.

Quote:
A few things actually.

1) Typing without tactile feedback is pretty awful... so much so that I don't really care to have a keyboard at all unless it's a decent tactile one. The tactile surface one might suffice, but we'll see.

2) If you're going to have a keyboard, might as well fit some more battery in the enclosure for it

3) Stylus is great. I don't use it exclusively or anything, but for anything more precise, or drawing or even writing to some extent, it's quite pleasant. Personally I use it for rough work, math, etc. in OneNote. OneNote even can convert my math scrawling to symbols (!!) and it works very well.

Thus I basically see no reason why I can't have it all... touch, keyboard, stylus, good performance and ability to run anything I want. The convertible aspect means that I can use just the tablet portion when it's more convenient to do that (on a bus, etc) but turn it into a laptop when I want to get some work done. After I've seen some of the convertible systems, they seem like a strict superset of what you get in other mobile devices.

I won't claim that these are the primary concerns for everyone, but it is hard to argue that the "strict" tablets have any advantages over convertibles going forward other than perhaps price, which gets muddy if you still buy a laptop in addition...
Thanks for this. I am looking for something that has handwriting recognition and math formula -> LaTeX recognition/conversion. Is there anything out there that does that?
rpg.314 is offline   Reply With Quote
Old 27-Oct-2012, 14:14   #17
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938
Default

Quote:
Originally Posted by rpg.314 View Post
1) That analysis is correct but would certainly change in view of large amounts of tessellation. You would not want to render geometry twice.

2) Not having to shade quads is a win.
My take on deferred rendering (and tile based techniques)...

I personally try to avoid all techniques that require rendering geometry twice, because geometry transform/rasterization is the step that has by far the most fluctuating running time. Draw call count, vertex/triangle count, quad efficiency, overdraw/fillrate, etc. all change radically depending on the rendered scene. Screen pixel count is always the same (720p = 922k pixels). All algorithms you process just once for each pixel in the screen incur a constant cost. That's why I like deferred techniques (= processing after all geometry is rasterized). Constant stable frame rate is the most important thing for games. Worst case performance is what matters in algorithm selection, average performance is meaningless (unless it's guaranteed to amortize over the frame).

I am not a particular fan of LiDR and its descendants (including Forward+). Depth pre-pass doubles the most fluctuating part of the frame rendering (draw calls / geometry processing). It also is a waste of GPU resources. All the 2000+ programmable shader "cores" of modern GPUs are basically idling while the GPU crunches through all the scene draw calls and renders them to the Z-buffer (depth testing, filling, triangle setup, etc. fixed function work). Memory bandwidth is also underutilized (just vertex fetching and only depth writes, no texture reads or color writes at all). For good GPU utilization you have to have balanced load at every stage of your graphics rendering pipeline. Depth pre-pass isn't balanced at all.

Various displacement mapping techniques will be used more and more in future games, and these make the extra geometry pass even more expensive. DX11 supports vertex tessellation and conservative depth output. Tessellation will promote usage of vertex based displacement mapping techniques, and conservative depth is very useful for pixel based displacement mapping techniques (allows early-z and hi-z to be enabled with pixel shader programmable depth output). A side note: The programmable depth output and pixel discard isn't a good thing for TBDRs (making pixel shader based displacement quite inefficient). Vertex tessellation also adds some extra burden (how bad that is remains to be seen in the future).

Brute force deferred rendering with fat g-buffers isn't the best choice in the long run either. Basically all source textures are compressed (DXT variants, DX11 even adds an HDR format). A forward renderer simply reads each DXT texture once per pixel. A deferred renderer reads the compressed texture, outputs it to an uncompressed render target and later reads the uncompressed texture from the render target. DXT5 is 1 byte per pixel, while uncompressed (8888 or 11f-11f-10f) is 4 bytes per pixel. Forward reads 1 byte per texture layer used; deferred reads 5 bytes and writes 4 bytes (9 bytes of traffic in total, so 9x the bandwidth). This isn't a big problem yet, because most games don't have more than two textures per object (8 channels for example can fit: rgb color, xy normal, roughness, specular, opacity). But in the future the materials will become more complex and the g-buffers will become fatter (as we need to store all the texture data to the g-buffer for later stages).
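The per-pixel arithmetic in the paragraph above can be checked with a minimal sketch. It assumes one 4-channel layer costs 1 byte compressed and 4 bytes uncompressed, as quoted, and it ignores overdraw and cache effects; the function names are made up for illustration.

```python
# Per-pixel texture traffic, using the byte counts quoted in the post.
COMPRESSED_BYTES = 1    # DXT5: 1 byte per pixel
UNCOMPRESSED_BYTES = 4  # 8888 or 11f-11f-10f: 4 bytes per pixel


def forward_bytes(layers: int) -> int:
    """Forward rendering: each compressed layer is read once per pixel."""
    return layers * COMPRESSED_BYTES


def deferred_bytes(layers: int) -> int:
    """Fat g-buffer: read compressed, write uncompressed, read it back later."""
    reads = layers * (COMPRESSED_BYTES + UNCOMPRESSED_BYTES)
    writes = layers * UNCOMPRESSED_BYTES
    return reads + writes


for layers in (1, 2, 4):
    f = forward_bytes(layers)
    d = deferred_bytes(layers)
    print(f"{layers} layer(s): forward {f} B, deferred {d} B, ratio {d // f}x")
```

The ratio stays at 9x regardless of layer count; what grows with material complexity is the absolute number of bytes moved per pixel.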

I personally like to keep geometry rendering pass as cheap as possible. Rendering to three or four render targets and reading three or four textures isn't cheap. Overdraw gets very expensive and quad efficiency and texture cache efficiency play a big (unpredictable) role in the performance. It's better just to store the (interpolated) texture coordinates to the g-buffer. This way you get a very fast pixel shader (with no texture cache stalls), quad efficiency and/or overdraw doesn't matter much, full fill rate (no MRTs), low BW requirement, etc. All the heavy lifting is done later, once a pixel, in a compute shader. Compressed textures are read only once, and no uncompressed texture data is written/read from the g-buffers. This kind of system minimizes the fluctuating cost from geometry processing/rasterization and it compares very well to a TBDR in scenes that have high overdraw. An IMR still has more overdraw than a TBDR, but the overdraw is dirt cheap. (**)

What matters in the future isn't the geometry rasterization performance. Geometry rasterization is only around 10%-20% of the whole frame rendering cost if you use advanced deferred rendering techniques. TBDR/IMR aren't that different if 80%+ of frame rendering time is spent running compute shaders.

(**) The biggest downside of the technique described above is that the "texture coordinate" (= texture address) must contain enough data to distinguish all the texture pixels that might be visible in the frame (and bilinear combinations of those). Basically with current architectures this means you need a big texture atlas, and you need to store all your textures there. This is not a viable strategy for games that have a gigabyte worth of textures loaded in memory at once. Virtual texturing however only tries to keep texture data in memory that is required to render the current frame. The whole data set fits in a single 8192x8192 atlas (virtual texture page cache). With this kind of single atlas, the "texture coordinate" problem becomes trivial: Just store a 32 bit (normalized int 16-16) texture coordinate to the g-buffer.
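As a rough illustration of the 16-16 normalized coordinate mentioned above, here is a small Python sketch of packing and unpacking a UV pair into 32 bits. The helper names are hypothetical, and the rounding mode is an assumption (round to nearest); a real shader would use whatever the render target format does.

```python
def pack_uv(u: float, v: float) -> int:
    """Pack two [0,1] coordinates into one 32-bit value (16 bits each)."""
    def to_unorm16(x: float) -> int:
        x = min(max(x, 0.0), 1.0)        # clamp to the normalized range
        return int(x * 65535.0 + 0.5)    # round to nearest step
    return (to_unorm16(u) << 16) | to_unorm16(v)


def unpack_uv(packed: int) -> tuple:
    """Recover the two normalized coordinates from the packed value."""
    u = ((packed >> 16) & 0xFFFF) / 65535.0
    v = (packed & 0xFFFF) / 65535.0
    return u, v


packed = pack_uv(0.3, 0.75)
print(hex(packed), unpack_uv(packed))
```

With an 8192-texel-wide atlas, 16 bits per axis leaves 65536 / 8192 = 8 addressable positions per texel, i.e. about 3 bits of sub-texel precision for bilinear weights.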
Quote:
Originally Posted by rpg.314 View Post
Basically, don't rasterize shadow maps immediately after binning is complete. Wait until the next render wants to lookup the texels. Then rasterize just one tile of the shadow map opportunistically and then immediately use it to shade the fragments.

[...]

I don't think this technique can be copied by an IMR.
This technique is very similar to virtual shadow mapping. Virtual shadow mapping works pretty much like virtual texturing, except that you use projected shadow map texture coordinates instead of the mesh texture coordinates. By using the depth buffer and the shadow map matrix you can calculate all the visible pages. Each page (frustum) is rendered separately (we can of course combine neighboring pages into single frustums to speed up the processing). Shadow map fetching uses the same indirection texture approach as virtual texturing (cuckoo hashing is also a pretty good fit for GPU). The best thing about this technique is that it always renders shadow maps at the correct 1:1 screen resolution. Oversampling/undersampling is much reduced compared to techniques such as cascaded shadow mapping.

Page visibility determination (from depth buffer) of course takes some extra time, but you can combine it with some other full screen pass to minimize its impact. Rendering several smaller shadow frustums (pages) of course increases draw call count (and vertex overhead), but techniques such as merge-instancing can basically eliminate that problem (single draw call per page / subobject culling for reduced vertex overhead). With some DrawInstancedIndirect/DispatchIndirect trickery that's doable, but dynamic kernel dispatching by other kernels would make things much better (GK110 will be the first GPU to support this).
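The page visibility step described above can be sketched on the CPU as follows. This is only an illustration under several assumptions: the world positions are taken as already reconstructed from the depth buffer, the shadow matrix maps straight into [0,1) shadow map coordinates, and there is a single page resolution level. `visible_shadow_pages`, `LIGHT_MATRIX` and the 8x8 page grid are all made-up names and values.

```python
def transform(m, p):
    """Apply a 4x4 row-major matrix to a 3D point, with perspective divide."""
    x, y, z = p
    out = [m[r][0] * x + m[r][1] * y + m[r][2] * z + m[r][3] for r in range(4)]
    w = out[3]
    return out[0] / w, out[1] / w, out[2] / w


def visible_shadow_pages(world_positions, shadow_matrix, pages_per_side):
    """Return the set of shadow map pages touched by the visible pixels."""
    pages = set()
    for p in world_positions:
        u, v, _depth = transform(shadow_matrix, p)
        if 0.0 <= u < 1.0 and 0.0 <= v < 1.0:  # skip points outside the map
            pages.add((int(u * pages_per_side), int(v * pages_per_side)))
    return pages


# Hypothetical top-down light covering world x,z in [0,100).
LIGHT_MATRIX = [
    [0.01, 0.0, 0.0, 0.0],   # u = x / 100
    [0.0, 0.0, 0.01, 0.0],   # v = z / 100
    [0.0, 1.0, 0.0, 0.0],    # depth = y
    [0.0, 0.0, 0.0, 1.0],
]
points = [(5.0, 0.0, 5.0), (55.0, 2.0, 80.0), (56.0, 0.0, 81.0), (150.0, 0.0, 0.0)]
print(sorted(visible_shadow_pages(points, LIGHT_MATRIX, 8)))
```

Only the pages in the returned set would need to be rendered; on a GPU the same marking would typically be done per pixel in a full screen pass.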
sebbbi is offline   Reply With Quote
Old 27-Oct-2012, 18:33   #18
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default

Quote:
Originally Posted by sebbbi View Post
A side note: The programmable depth output and pixel discard isn't a good thing for TBDRs
If the depth/visibility was finalised near the start of the pixel shader, then with the right extra hardware you wouldn't have to compute the rest of the program if you knew some subsequent object would overwrite that pixel (which a TBDR can know about unlike an IMR). That would help significantly for some uses, although probably not for others.

Quote:
Originally Posted by sebbbi View Post
It's better just to store the (interpolated) texture coordinates to the g-buffer. This way you get a very fast pixel shader (with no texture cache stalls), quad efficiency and/or overdraw doesn't matter much, full fill rate (no MRTs), low BW requirement, etc. All the heavy lifting is done later, once a pixel, in a compute shader.
Maybe I'm missing something, but how would the ddx/ddy calculation work in that compute shader? How do you know the neighboring pixel is part of the same object and what happens if the neighboring pixels are all of different objects? The only safe way I can think of implementing this is to also store ddx/ddy, and then you aren't really saving much bandwidth and your texture instructions go at 1/4th the normal speed because they work on individual pixels instead of quads...

IMO, the most bandwidth efficient way to do deferred rendering is to do it on a TB(D)R with the right extensions (programmable blending, being able to use tile memory as scratch not being output, etc.). Even in a worst-case scenario where you don't benefit from the deferred rendering, you're still not really using more memory bandwidth than a forward renderer. I'd say that's pretty cool!

Quote:
This technique is very similar to virtual shadow mapping. Virtual shadow mapping works pretty much like virtual texturing, except that you use projected shadow map texture coordinates instead of the mesh texture coordinates. By using the depth buffer and the shadow map matrix you can calculate all the visible pages. Each page (frustum) is rendered separately (we can of course combine neighboring pages into single frustums to speed up the processing). Shadow map fetching uses the same indirection texture approach as virtual texturing (cuckoo hashing is also a pretty good fit for GPU). The best thing about this technique is that it always renders shadow maps at the correct 1:1 screen resolution. Oversampling/undersampling is much reduced compared to techniques such as cascaded shadow mapping.
Agreed there are some similarities. However the architecture described in the patent would have a lower performance overhead and save most of the read bandwidth. And the bandwidth saving also applies to many post-processing and/or downsampling passes, in some cases it could even save the write bandwidth. Obviously all at the cost of some extra hardware...

Quote:
Originally Posted by sebbbi View Post
TBDR/IMR aren't that different if 80%+ of frame rendering time is spent running compute shaders.
Agreed. And whether that's what the workload looks like or not, shader core efficiency is key.
__________________
Focusing on non-graphics projects in 2013 (but I still love triangles)
"[...]; the kind of variation which ensues depending in most cases in a far higher degree on the nature or constitution of the being, than on the nature of the changed conditions."
Arun is offline   Reply With Quote
Old 27-Oct-2012, 20:07   #19
OlegSH
Member
 
Join Date: Jan 2010
Posts: 117
Default

Quote:
Originally Posted by Arun View Post
IMO, the most bandwidth efficient way to do deferred rendering is to do it on a TB(D)R with the right extensions (programmable blending, being able to use tile memory as scratch not being output, etc.). Even in a worst-case scenario where you don't benefit from the deferred rendering, you're still not really using more memory bandwidth than a forward renderer. I'd say that's pretty cool!
Not cool if you don't have enough scratch to handle deferred lighting at full resolution (I am talking about Uncharted: Golden Abyss), not to mention MSAA scratch requirements. I don't believe in a bright future for deferred rendering on a TB(D)R, because of developers' complaints about the 360's EDRAM limitations with deferred shading, and after the sub-res Uncharted: Golden Abyss fail.
OlegSH is offline   Reply With Quote
Old 27-Oct-2012, 20:20   #20
Ailuros
Epsilon plus three
 
Join Date: Feb 2002
Location: Chania
Posts: 7,764
Default

Quote:
Originally Posted by OlegSH View Post
I don't believe in a bright future for deferred rendering on a TB(D)R, because of developers' complaints about the 360's EDRAM limitations with deferred shading, and after the sub-res Uncharted: Golden Abyss fail.
I'm afraid I've completely lost connection with the above.
__________________
People are more violently opposed to fur than leather; because it's easier to harass rich ladies than motorcycle gangs.
Ailuros is offline   Reply With Quote
Old 27-Oct-2012, 20:31   #21
Arun
Unknown.
 
Join Date: Aug 2002
Location: UK
Posts: 4,877
Default

Quote:
Originally Posted by OlegSH View Post
Not cool if you don't have enough scratch to handle deferred lighting at full resolution (I am talking about Uncharted: Golden Abyss), not to mention MSAA scratch requirements. I don't believe in a bright future for deferred rendering on a TB(D)R, because of developers' complaints about the 360's EDRAM limitations with deferred shading, and after the sub-res Uncharted: Golden Abyss fail.
I don't think it makes sense to look at specific examples on specific hardware and draw general conclusions from them. Anyway Uncharted: Golden Abyss kept the light accumulation buffer on-chip but the G-Buffers were still off-chip because they didn't have enough space on SGX. While still limited, you'd obviously expect next-generation hardware to have more space...

As for MSAA, it will reduce tile size rather than reduce tile memory per pixel. Otherwise since SGX only has 64bpp you wouldn't even have enough space to do 4xMSAA for 32bpp framebuffers
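A toy model of the "MSAA reduces tile size" point, assuming a fixed on-chip budget of 64 bits per pixel of a nominal tile and counting color samples only (depth storage is ignored). The 32x32 nominal tile and the function name are assumptions for illustration, not SGX specifics.

```python
def pixels_per_tile(tile_pixels, budget_bits_per_pixel, samples, bits_per_sample):
    """How many pixels fit on chip when each pixel stores `samples` samples."""
    total_bits = tile_pixels * budget_bits_per_pixel
    fit = total_bits // (samples * bits_per_sample)
    return min(tile_pixels, fit)  # never larger than the nominal tile


TILE_PIXELS = 32 * 32  # hypothetical nominal tile
for samples in (1, 2, 4, 8):
    n = pixels_per_tile(TILE_PIXELS, 64, samples, 32)
    print(f"{samples}x MSAA at 32bpp: {n} pixels per tile")
```

Under these assumptions 4xMSAA at 32bpp needs 128 bits per pixel, so the effective tile halves rather than the per-pixel storage shrinking.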
Arun is offline   Reply With Quote
Old 27-Oct-2012, 21:29   #22
Andrew Lauritzen
AndyTX
 
Join Date: May 2004
Location: British Columbia, Canada
Posts: 1,840
Default

Quote:
Originally Posted by rpg.314 View Post
Predicated rendering on a TBDR is no worse than rendering on IMR.
Not true - what if I set up a predicate and render a pixel at 0,0 and then predicate a draw call that renders a triangle on the opposite corner of the screen? I've now set up an arbitrary spatial dependency, which defeats any ability to bin/re-sort the incoming geometry.

Quote:
Originally Posted by rpg.314 View Post
Wouldn't using a UAV immediately after rendering to it stall an IMR as well? A TBR should do no worse than an IMR in such case.
Well "stall" in that the draw call has to finish through the pipeline before the next one starts, yes. But there's a large order of magnitude difference... IMRs apply the draw calls "wide" across the GPU for the most part and only pipeline multiple together when necessary. TBRs bin geometry spatially first, and really want to bin *all* geometry for a given render target before starting to shade tiles. This is simply not possible with the above UAV/predication semantics (really the same issue in both cases) - every draw call must flush *all* tiles before starting binning the next. That's extremely bad.
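The cost difference can be illustrated with a toy count of how many times a tiler has to process its bins for one render target. The model is a deliberate simplification: each draw is reduced to a flag saying whether a UAV is bound, and a flush is assumed after every such draw, as the semantics above require. `count_tile_passes` is a hypothetical name.

```python
def count_tile_passes(draws_use_uav):
    """Number of full bin flushes a tiler needs for a list of draw calls.

    Each entry is True if that draw has a UAV bound. Without UAVs all
    draws are binned together and every tile is shaded once at the end.
    """
    passes = 0
    pending = False
    for uses_uav in draws_use_uav:
        pending = True
        if uses_uav:
            passes += 1      # must finish all tiles before the next draw
            pending = False
    if pending:
        passes += 1          # final flush for the remaining binned draws
    return passes


no_uav = [False] * 100
some_uav = [i % 10 == 9 for i in range(100)]
all_uav = [True] * 100
print(count_tile_passes(no_uav), count_tile_passes(some_uav), count_tile_passes(all_uav))
```

Every extra pass means writing out and reloading tile contents for the whole render target, whereas an IMR only pays a pipeline barrier per UAV draw.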

Quote:
Originally Posted by rpg.314 View Post
I had not thought of that. That would definitely work. Wouldn't the fb read involved be beyond current APIs though?
Tegra - while not tiled per se - has a framebuffer read extension and IIRC Apple actually just added one to iOS as well. That plus "discard" should be all you really need.

Quote:
Originally Posted by rpg.314 View Post
Thanks for this. I am looking for something that has handwriting recognition and math formula -> LaTeX recognition/conversion. Is there anything out there that does that?
Right, so OneNote will get you the handwriting -> math symbols part, just need to convert the resulting symbols to LaTeX I guess.

Quote:
Originally Posted by rpg.314 View Post
Users of tiled forward rendering might not agree. MSAA is quite useful.
... and totally usable with tiled deferred. In fact with sufficient MSAA compression hardware and the ability for samplers to read these compressed buffers (not always ubiquitous right now I will admit), the overhead is actually not dissimilar to forward rendering. And if we exposed some bits from the compression format to user space it could be even cheaper, as I've discussed in my talks in the past.

Quote:
Originally Posted by sebbbi View Post
I personally try to avoid all techniques that require rendering geometry twice, because geometry transform/rasterization is the step that has by far the most fluctuating running time.
I completely agree, and I extend that to not liking to run expensive pixel shaders for the same reason - it's too variable how long they take due to hardware scheduling decisions and features. Spikes are bad... I'd rather predictably run at the "worst case" speed all the time than randomly spike up and down in performance.

Quote:
Originally Posted by sebbbi View Post
But in the future the materials will become more complex and the g-buffers will become fatter (as we need to store all the texture data to the g-buffer for later stages).
Right but as you note, there's no reason to store that texture data to the g-buffer ultimately, only interpolants (and even then you could store the sources and barycentrics if you had a ton of them, but that's not exactly common). The only reason people tend to store the texture data itself these days is because typically it's smaller than uv + gradients (6 floats, although you should be able to get away with 5, or obviously 3 for isotropic sampling). If you have >2-3 textures using the same coordinates though, it becomes cheaper to defer the texture lookup too. All these things are fairly straightforward choices though - there's no need to make a big deal out of it; just test which is faster for a specific case.

Quote:
Originally Posted by sebbbi View Post
Basically with current architectures this means you need a big texture atlas, and you need to store all your textures there.
Sure, but ultimately this will be solved by bindless textures/resources referenced in constant buffers or similar. i.e. you just store a pointer/offset to the material data in the G-buffer and look up into it in the deferred pass. There may be an interim time when people use virtual texturing for this, but in the long run there's no need for atlases/continuous address spaces for this kind of work. This will be far more efficient than redundantly dumping all this data into the G-buffer itself (much of it is constant), and basically reduces the role of the rendering pipeline to just rasterization (and perhaps displacement mapping) and some basic attribute interpolation.

Anyways, this is an interesting conversation but we're pretty off-topic for this thread... might be worth someone splitting this?
__________________
The content of this message is my personal opinion only.
Andrew Lauritzen is offline   Reply With Quote
Old 28-Oct-2012, 10:25   #23
sebbbi
Member
 
Join Date: Nov 2007
Posts: 938
Default

Quote:
Originally Posted by Arun View Post
Maybe I'm missing something, but how would the ddx/ddy calculation work in that compute shader? How do you know the neighboring pixel is part of the same object and what happens if the neighboring pixels are all of different objects? The only safe way I can think of implementing this is to also store ddx/ddy, and then you aren't really saving much bandwidth and your texture instructions go at 1/4th the normal speed because they work on individual pixels instead of quads...
Didn't explain it, because I tried to keep my post as short as possible (this is off topic discussion after all)... but I failed miserably.

Let's first go through the easy case of bilinear filtering from virtual texture. In this case, the texture coordinate already implicitly contains the mip level, as the indirection texture lookup transforms the texture coordinate to the correct 128x128 pixel page depending on the mip level (gradients). Basically the x,y texture coordinate pair contains all the info you need. Bilinear filtering isn't exactly a hot technique itself, but if you use anisotropic mip hardware to calculate the lod level based on gradients (min of x,y clamped to max+1 instead of max) you will get higher detail on slopes. I call this "bilinear anisotropic", and we used it in Trials Evolution (hacks like this are required for 60 fps on current gen consoles).

Trilinear isn't much harder. The virtual texture indirection lookup basically truncates the mip value to floor(mip). That data is implicitly stored to the x,y texture coordinate pair for free. All extra data you need to store is the frac(mip) portion. A 4 bit normalized [0,1] integer is enough for this purpose. We are blending between two adjacent mip levels after all, not two completely different images (**), so using 8 bits or even more (floats) is pure overkill for this purpose. However if you have a traditional texture atlas (and not a virtual textured atlas), the texture coordinate doesn't implicitly contain any extra information about the mip level, and you need to have extra bits to store the mip level.
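A minimal sketch of the 4 bit frac(mip) idea: the integer part is assumed to be carried implicitly by the indirection lookup, and only the fractional blend weight is quantized to 16 steps. The function names are hypothetical and rounding to nearest is an assumption.

```python
import math


def split_mip(mip: float):
    """Split a mip value into floor(mip) and a 4-bit quantized fraction."""
    base = math.floor(mip)
    frac = mip - base
    q = min(int(frac * 15.0 + 0.5), 15)  # 4-bit unorm: 0..15 maps to 0..1
    return base, q


def blend_weight(q: int) -> float:
    """Decode the 4-bit value back to a trilinear blend weight."""
    return q / 15.0


base, q = split_mip(2.5)
print(base, q, blend_weight(q))
```

With round to nearest, the decoded weight is off by at most 1/30 of a mip level, which is consistent with the claim that more bits are overkill for blending two adjacent levels.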

Anisotropic filtering with virtual texturing is still a topic that hasn't been researched a lot ("Carmack's Hack" being the state of the art for performance). Anisotropic filtering can be approximated just by using the trilinear version above, and adjusting the mip calculation based on the gradients (just like we did for our "bilinear anisotropic" for Trials Evolution). This doesn't require any extra g-buffer storage, but can sometimes result in slight oversampling (FXAA in Trials Evolution did take care of that). Of course this isn't a perfect solution, and we absolutely need to do better in the future. Good texture filtering quality is as important as good antialiasing quality.

If you want to do proper anisotropic filtering, you obviously need to store both gradients. Virtual texture indirection lookup points you to a location that stores the most detailed data you need (minimum of the gradients). Both gradients are increments to this value. The smaller gradient increment is always in range of [0,1] (when measured in mip levels), the larger can be more than that (but it's always positive). Again 4 bits should be enough for the first one, and if we share a 16 bit value for them, we have 12 bits remaining. That's more than enough for the second gradient bias.
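One possible way to share a 16 bit value between the two gradient increments, sketched in Python. The 4 bit field for the smaller increment follows the paragraph above; the encoding of the larger one as 4.8 fixed point in the remaining 12 bits is purely an assumption for illustration, as are the function names.

```python
def pack_gradients(small: float, large: float) -> int:
    """Pack two mip-level increments into 16 bits.

    small: in [0,1], stored as a 4-bit unorm.
    large: >= 0, stored as 12-bit fixed point with 8 fractional bits
           (assumed layout, range 0 .. 15.996 mip levels).
    """
    s = min(int(min(max(small, 0.0), 1.0) * 15.0 + 0.5), 15)
    l = min(int(max(large, 0.0) * 256.0 + 0.5), 0xFFF)
    return (s << 12) | l


def unpack_gradients(packed: int):
    """Decode the two increments from the packed 16-bit value."""
    small = ((packed >> 12) & 0xF) / 15.0
    large = (packed & 0xFFF) / 256.0
    return small, large


packed = pack_gradients(1.0, 3.5)
print(hex(packed), unpack_gradients(packed))
```

Whether 8 fractional bits are the right split for the larger increment would need testing against real content; the point is only that both values fit comfortably in one 16 bit channel.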

Another way to approach this problem is to prefilter 128x128 tiles as 128x64 and 64x128 anisotropic tiles. Now we also use the gradient values to adjust the indirection lookup before storing the texture coordinate to the g-buffer. We can store these tiles adjacent to the original 128x128 (splitting the cache basically to 256x128 tiles) if we do not want to increase the indirection texture size (as coordinate bias to anisotropic pages is easy to calculate). Alternatively we can use a hash instead of the indirection texture (cuckoo hash is guaranteed O(1), can be easily coded with no branching/flow control, has no dependency chains and benefits nicely from GPU latency hiding). As an extra bonus, this technique saves bandwidth compared to standard anisotropic filtering, but it doubles the virtual texture cache atlas size.
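For readers unfamiliar with cuckoo hashing, here is a small CPU-side sketch of why lookups are O(1): a key can only ever live in one of two slots, so a lookup is exactly two probes. The class name, the toy hash functions and the integer page keys are all assumptions; a real implementation would use better hashes and rehash on insertion failure. Keys are assumed to be inserted only once.

```python
class CuckooTable:
    """Two-table cuckoo hash mapping integer page ids to values."""

    def __init__(self, size, max_kicks=32):
        self.size = size
        self.max_kicks = max_kicks
        self.slots = [[None] * size, [None] * size]

    def _hash(self, which, key):
        # Toy hash functions, chosen only to keep the example readable.
        if which == 0:
            return key % self.size
        return (key // self.size) % self.size

    def lookup(self, key):
        # Exactly two probes; on a GPU both loads can be issued
        # unconditionally and the result selected without branching.
        for which in (0, 1):
            entry = self.slots[which][self._hash(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value):
        entry = (key, value)
        which = 0
        for _ in range(self.max_kicks):
            idx = self._hash(which, entry[0])
            # Place the entry and pick up whatever was there before.
            entry, self.slots[which][idx] = self.slots[which][idx], entry
            if entry is None:
                return
            which = 1 - which  # evicted entry tries its other table
        raise RuntimeError("cuckoo insert failed; a real table would rehash")


table = CuckooTable(size=8)
for page_id, cache_slot in ((1, "a"), (9, "b"), (17, "c")):
    table.insert(page_id, cache_slot)
print(table.lookup(1), table.lookup(9), table.lookup(17), table.lookup(25))
```

All the variable cost sits in insertion (done when a page is loaded), while the per-pixel lookup stays fixed, which is the property that makes it attractive as a replacement for the indirection texture.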

The last, and the most ambitious way is to store no texture coordinate data at all. Use rasterization only for a depth pre-pass. The depth value is translated to a 3d-coordinate in the lighting shader (all deferred renderers do this already). If you have unique mapping in the virtual texture (***), you can do a hash lookup using this world coordinate to get the virtual texture coordinate. The naive thing would be to add all virtual texture pixels to a hash based on their 3d world coordinates (and update the hash whenever a page is loaded). A better way would be to have a sparse multilayer volume texture where the texture coordinates could be queried (this is basically a hash as well, but hash nodes are (8x8x8) volumes instead of single pixels, and it would be easy to query if the GPU has paged virtual memory, AMD's PRT OpenGL extension for example). It would contain only the surfaces visible in the screen (or virtual texture cache, because it's a superset of screen pixels). This kind of structure wouldn't need to be super high resolution, because texture coordinates are linearly interpolated along polygons (linear filtering from volume texture would work just fine).

(**) When using trilinear filtering, the virtual texture atlas has a single extra mip level. This allows you to use hardware trilinear filtering to blend between the current level and the one below it. It increases virtual texture atlas memory consumption by 25%. That's usually not a big deal.

(***) You would want to have unique mapping for other purposes as well. It allows you to have unique decals on all your objects in world, and it allows you to precalculate object based texture transformations to the virtual texture cache (for example colorization). Unique virtual mapping shouldn't be confused with unique physical mapping. You don't need to store all versions of pages to the hard drive (like Rage does), you can burn the decals (and colorizations, etc) to pages during page loading.
Quote:
Originally Posted by Andrew Lauritzen View Post
Sure, but ultimately this will be solved by bindless textures/resources referenced in constant buffers or similar. i.e. you just store a pointer/offset to the material data in the G-buffer and look up into it in the deferred pass. There may be an interim time when people use virtual texturing for this, but in the long run there's no need for atlases/continuous address spaces for this kind of work. This will be far more efficient than redundantly dumping all this data into the G-buffer itself (much of it is constant), and basically reduces the role of the rendering pipeline to just rasterization (and perhaps displacement mapping) and some basic attribute interpolation.
Absolutely. Fully featured GPU virtual memory and data addressing is the future. AMD is touting it with HSA, Nvidia is touting it with Kepler, even ARM's Mali-T604 papers talk about GPU virtual memory. AMD's PRT OpenGL extensions are the first developer controllable virtual memory API for GPUs. It's currently only available for texturing (and has pretty big 64 kB pages), but it's a very good first step. I hope we will soon have a unified 64 bit address space between CPU and GPU with same sized pages (preferably small 4 kB pages) and total developer control over handling page faults and virtual mappings. That would allow us to do all kinds of crazy deferred rendering implementations.
Quote:
Originally Posted by Andrew Lauritzen View Post
Anyways, this is an interesting conversation but we're pretty off-topic for this thread... might be worth someone splitting this?
Agreed. This is indeed an interesting topic, and unfortunately something that's not been discussed enough.

--> Please someone move this discussion to its own thread. Thank you!
sebbbi is offline   Reply With Quote
Old 28-Oct-2012, 13:54   #24
MfA
Regular
 
Join Date: Feb 2002
Posts: 5,221
Default

Quote:
Originally Posted by rpg.314 View Post
Users of tiled forward rendering might not agree. MSAA is quite useful.
I'm not saying there is no trade off, I'm saying that however you dice it the bandwidth needed either for an early Z pass or for some form of through framebuffer deferred shading is always going to stay significant. The raison d'etre for hardware tilers is not going away (although if the geometry load keeps increasing at some point they will need object level binning to compete).
__________________
Cinematic is the new streamlined.
MfA is offline   Reply With Quote
Old 28-Oct-2012, 14:58   #25
Rodéric
a.k.a. Ingenu
 
Join Date: Feb 2002
Location: Apsley, U.K.
Posts: 2,729
Default

Quote:
Originally Posted by sebbbi View Post
I hope we will soon have a unified 64 bit address space between CPU and GPU with same sized pages (preferably small 4 kB pages) and total developer control over handling page faults and virtual mappings. That would allow us to do all kinds of crazy deferred rendering implementations.
Amen to that. I've been asking for that feature for a few years already; empowering developers is the right way to go.
That, along with a standard texture layout, would allow immense worlds to be streamed into memory, requiring less memory but likely more bandwidth.
(That shouldn't be too much of a problem with SSDs becoming mainstream, but with memory already being bandwidth limited today I'm not too sure how that would go. I haven't done any estimate of the required bandwidth recently either ;p)
__________________
So many things to do, and yet so little time to spend...
Rodéric is offline   Reply With Quote
