

Stenciling for “deferred” lighting

I've seen many papers describe using stencil culling to speed up rendering a light in a deferred (or semi-deferred) renderer; an Insomniac paper, for instance. It confuses me a bit.

The approach is normally: clear the stencil, render the front/back faces of the light's bounding volume into the stencil buffer, then render a screen-space quad with stencil culling enabled. The reasoning: the screen-space quad gets the best utilization out of pixel quads, and the stencil test rejects pixels before the expensive pixel shader runs.

Pros: the stencil test will reject all pixels not in the bounds. Cons: the stencil clear is not cheap (it's actually more expensive than a combined depth+stencil clear, IIRC); stencil culling is done in big blocks, so it won't actually reject that many pixels along the edges of intersections with the geometry; you have 2x the pixels to render for the stencil operations (though you'll probably get the fast depth-only path there too); and changing render targets to point at the stencil buffer can stall the pipeline.
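As a sanity check, the net effect of the two stencil passes can be simulated per pixel with plain depth comparisons. This is a minimal sketch; the pass order, the LESS depth test, and the set-on-depth-fail convention are my assumptions, since there are several equivalent variants of this trick:

```python
def stencil_after_volume_passes(scene_depth, front_depth, back_depth):
    """Simulate the stencil value at one pixel after rendering the
    light volume's back faces then front faces (depth writes off).

    Convention assumed here: depth test is LESS, and the stencil op
    fires on depth FAIL (the scene sample is in front of the volume
    fragment). Back faces set the bit, front faces clear it again.
    """
    stencil = 0
    # Back-face pass: the fragment at back_depth fails the LESS test
    # when the scene is in front of it -> mark the pixel "maybe lit".
    if back_depth >= scene_depth:
        stencil = 1
    # Front-face pass: if the scene is in front of the front face too,
    # the pixel is outside the volume -> clear the mark.
    if front_depth >= scene_depth:
        stencil = 0
    return stencil

# The screen-space quad then shades only pixels with stencil == 1,
# i.e. where the scene depth lies inside the volume:
inside = stencil_after_volume_passes(0.50, 0.40, 0.60)  # 1
before = stencil_after_volume_passes(0.30, 0.40, 0.60)  # 0
behind = stencil_after_volume_passes(0.70, 0.40, 0.60)  # 0
```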

Alternate approach: render the light's bounding volume using front OR back faces (depth pass or depth fail, respectively) and turn on depth bounds clamping.

Pros: depth bounds clamping early-rejects many pixels outside your bounds (performance suggests this, at least), especially for small lights, and there's no stencil-clearing tax. Cons: some blocks of pixels may not be early-rejected, because the depth bounds test is not as fine-grained as stencil, and pixel quads may not be optimally filled due to the triangulated nature of the bounding volume.
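For contrast, the depth bounds test is just a coarse range check against the depth already stored at each pixel. A sketch with made-up names; real hardware applies it per block of pixels, which is the granularity caveat above:

```python
def depth_bounds_accepts(stored_depth, zmin, zmax):
    """The depth bounds test checks the depth ALREADY in the depth
    buffer at this pixel (not the incoming fragment's depth) against
    [zmin, zmax]. Hardware evaluates it per block, so edge blocks that
    straddle the range still get shaded."""
    return zmin <= stored_depth <= zmax

# With the bounds set to the light volume's depth extent, pixels whose
# scene depth is far outside the light are rejected before shading:
hits = [depth_bounds_accepts(d, 0.40, 0.60)
        for d in (0.10, 0.45, 0.55, 0.90)]
```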

In ALL the situations we've tried at work, these trade-offs have favored the latter approach, even for relatively large lights. Some issues that might favor the first approach: switching render targets to render shadows anyway, a big light with a really expensive shader, folding non-spatial rejection bits into the stencil operations, or something else I'm completely not considering for some reason (?).

We'll have to try this again to make sure we're not missing something. I could imagine depth bounds clamping failing if your early-z gets borked for some reason. I have to believe that others have tried our approach as well, and it's somewhat surprising that no one ever mentions it (too obvious?). But for now I'll call it a competitive advantage, I guess :P.

Comments (3)
  1. Hi,

    I think you forgot one “con” for the alternate approach:
    some of the arithmetic for position reconstruction can be moved to the vertex shader, but only when you render a full-screen quad.
    (In the pixel pipe, you can then reconstruct position from linear Z with only a mad and 2 interpolants.)

    When you render bounding geometry, the position reconstruction can no longer be prepared in the vertex shader.

  2. We read directly from the depth buffer and thus don't have linear Z :). Three components of the reverse projection matrix can be factored into the vertex shader, so you do perspective-space Z * matClipToWorld[2] + vecXYW (a float4 interpolated from the vertex shader). Then you have a divide (recip + mul). So yes, it's 3 instructions and 1 interpolant instead of 1-2, but you don't have to store linear Z (a full screen's worth of RAM), and reconstructing perspective Z from a globally stored linear Z buffer, rather than sampling the depth buffer, takes 3 instructions itself (on the RSX at least). I think a better argument for screen-space tiles than saving a few arithmetic instructions is merging multiple lights to reduce sampling of giant textures (and potentially maximize cache reuse). We'll probably try tiles again. However, stencil culling still feels like a potentially anti-useful trade-off.
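    The factoring described in comment #2 is easy to check numerically. A minimal sketch with my own names (the matrix below is arbitrary, not a real clip-to-world matrix): only the z term varies per pixel, so x*col0 + y*col1 + col3 can be interpolated from the vertex shader, leaving the pixel shader one mad plus the w divide.

    ```python
    def column(m, k):
        return [row[k] for row in m]

    def mat_vec(m, v):
        return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

    def reconstruct_world(m_clip_to_world, x_ndc, y_ndc, z):
        # "Vertex shader" part: everything independent of per-pixel z.
        vec_xyw = [x_ndc * c0 + y_ndc * c1 + c3
                   for c0, c1, c3 in zip(column(m_clip_to_world, 0),
                                         column(m_clip_to_world, 1),
                                         column(m_clip_to_world, 3))]
        # "Pixel shader" part: one mad, then recip + mul for the divide.
        h = [c2 * z + v for c2, v in zip(column(m_clip_to_world, 2), vec_xyw)]
        inv_w = 1.0 / h[3]
        return [h[0] * inv_w, h[1] * inv_w, h[2] * inv_w]

    # Check against the unfactored transform for an arbitrary matrix/point.
    M = [[1.0, 0.2, 0.0, 2.0],
         [0.0, 1.0, 0.1, 3.0],
         [0.3, 0.0, 2.0, 1.0],
         [0.0, 0.0, 0.5, 1.0]]
    x, y, z = 0.25, -0.5, 0.6
    h = mat_vec(M, [x, y, z, 1.0])
    direct = [h[i] / h[3] for i in range(3)]
    factored = reconstruct_world(M, x, y, z)
    ```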

  3. And actually, come to think of it, you might still be able to do a single mad by computing the following in the vertex shader: ProjectToFarPlane(posWorld - posCamera). There may be issues with correct interpolation for geometry clipping through the near/far planes, but you might be able to get around that with trickery.
