Introduction
This is part 2 of the CTRT series. You can read the complete series here:
- Part 1: Rendering a Sphere
- Part 2: BRDFs and Shadows
- Part 3: Phong and Reflections
Last time, we discussed the initial design and setup for a compile-time ray tracer that used `constexpr` functions to generate the final image. If you want a refresher, you can read the full article here. In today’s journey, we will continue developing the ray tracer by adding some Lambertian shading along with some shadows. Before we can do any of this, however, we should address a rather glaring problem: generating our image at compile-time is at least $6 \times 10^5$ times slower than doing so at runtime and uses an exorbitant amount of memory!
Before we dive in, let’s record our current findings:
| Scene | Time (Dynamic) | Time (Compile-Time) |
| --- | --- | --- |
| Just background | 1.67 ms | 19m 37s |
| Red Sphere | 3.10 ms | 34m 36s |
**Note:** At the top of each section, you will find links to all the files discussed there, in case you want to read the code along with the explanations. The entire source code for this post can be found here.
Understanding constexpr
In order to understand the times that we’re seeing for the compile-time version, we first need to discuss what happens when the compiler performs computations at compile-time. It is worth noting the following observations:
- The conclusions drawn here are based mostly on the behaviour exhibited by MSVC. Different compilers may vary slightly.
- This is by no means an accurate explanation of what exactly is happening. Since we have no real way of debugging the compiler itself, the observations made here are based on the general behaviours of other compilers.
Parsing constexpr
Consider the following simplified example:
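The listing here is a minimal `constexpr` function; based on the walkthrough that follows, it would be along these lines (the exact formatting is an assumption):

```cpp
// A small constexpr function: nothing about it is special other than
// the specifier itself.
constexpr int add(int a, int b)
{
    return a + b;
}

// Because result is declared constexpr, the right-hand side must be a
// constant expression, which forces add(1, 2) to be evaluated by the
// compiler.
constexpr int result = add(1, 2);

static_assert(result == 3);
```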
When the compiler sees this code, it must first perform a series of checks to ensure that the code complies with the following:
- The code must meet all of the requirements for `constexpr` functions. The full list can be seen here.
- The code must not result in undefined behaviour (UB).
- The result of evaluating `add` must itself be a constant expression.
Let’s pretend we’re the compiler for a second and parse the code ourselves. We begin with the expression `constexpr int result = add(1, 2);`. After finishing all necessary syntax checks, we note that `result` is declared `constexpr`, so we must determine whether the right-hand side of the `=` results in a constant expression. Clearly `1` and `2` are integral constants and therefore `constexpr`, so we’re fine here. Next, we note that `add` is itself marked `constexpr`, so let’s go into the function and check it. It takes two `int`s by copy and then requires `operator+` to be defined. Since the two parameters are fundamental types, all that remains is to check for undefined behaviour. Adding 1 and 2 cannot result in any form of overflow, division by 0, or other arithmetic errors, so the result is fine. Therefore, the entire expression is constant and we can compute the result at compile-time.
You’d be right to think that this is incredibly tedious. The compiler is essentially interpreting the individual instructions while checking them against the pre-established rules to ensure that nothing illegal is done. Since this must be done for every single line, template instantiation, array access, etc., we can easily see how the evaluation time quickly escalates with the complexity of the code.
The other part of the puzzle is related to the increase in memory usage. Compiling the second scene results in over 50GB of RAM being used by the compiler! Where is all of this memory coming from?
Memoization… No it’s Not a Typo
Before we get started, let’s do a bit of maths. We know that our image is $512 \times 512$ and that each pixel requires 3 floats. Assuming that each float requires 4 bytes, we end up with $512 \times 512 \times 3 \times 4 = 3145728$ bytes of memory to represent the image, or approximately 3.1 MB. So if the base memory usage of our image is 3.1 MB, how do we get to the 50 GB that the compiler is using?
To understand this, we first need to talk about memoization, a technique used to improve the performance of recursive algorithms by caching the results of calls as they become available. The assumption is the following: if a function is pure (it does not modify any external state and always returns the same value for the same parameters), then we can safely cache the result of a call with a given set of arguments. If we invoke that same function again with the exact same arguments, we can simply return the stored value instead of re-calculating it.
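As a runtime illustration of the idea (all names here are hypothetical, not part of the ray tracer), a memoized function caches its results in a map keyed on the arguments, so the underlying computation only ever runs once per distinct input:

```cpp
#include <map>

// Counts how often the "real" computation actually runs.
long long calls{0};

long long slow_square(int x)
{
    ++calls;
    return static_cast<long long>(x) * x;
}

// Memoized front-end: look the argument up in a cache first, and only
// fall back to the real computation on a miss.
long long memoized_square(int x)
{
    static std::map<int, long long> cache{};
    auto it = cache.find(x);
    if (it != cache.end())
    {
        return it->second;
    }

    auto result = slow_square(x);
    cache.emplace(x, result);
    return result;
}
```

Calling `memoized_square(12)` twice performs the multiplication only once; the second call is a pure cache hit.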
Given the way the memory grows as the compiler runs, the likely explanation is that the compiler caches the result of each function invocation in an attempt to speed up the process. While it is true that our ray tracer is effectively pure, we never evaluate a function with the same arguments twice, so the compiler ends up caching a significant number of results that are never used. This caching, combined with the extensive parsing and error-checking that must be performed and the non-trivial nature of the code, leads to the compiler taking significantly longer than the runtime equivalent.
So, now that we know all of this, can we do anything about it? The answer is yes. To an extent at least.
Improving Performance
There are two things that we can do immediately to reduce the load on the compiler:
- Reduce the depth of the call-stack and
- Simplify the complexity of the call-stack itself.
Let’s start with the second one. If we read through our code, we will find that there are several things that we can simplify:
- The `Vector` class is currently templated to allow us to change the precision between floats and doubles. Given the high memory usage, it would probably be unwise to attempt doubles, so let’s remove the template and commit the entire code to using floats. This is standard in graphics anyway, so it’s not a big deal.
- The `ShapeContainer` and `MaterialContainer` classes can be removed. These are essentially wrappers that hide the loop over all the shapes or the individual accesses to the material array. Since removing them doesn’t hinder readability, we can move their logic into the main `render` function. Note that this not only simplifies the call-stack (one less template), but also reduces its depth.
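As a sketch, the de-templated `Vector` might look like this (the member and operator set shown here is an assumption; only the commitment to `float` comes from the text):

```cpp
// Vector committed to float precision: no template parameter needed.
class Vector
{
public:
    constexpr Vector() : x{0.0f}, y{0.0f}, z{0.0f}
    {}

    constexpr Vector(float x_, float y_, float z_) : x{x_}, y{y_}, z{z_}
    {}

    constexpr Vector operator+(const Vector& rhs) const
    {
        return Vector{x + rhs.x, y + rhs.y, z + rhs.z};
    }

    constexpr Vector operator*(float s) const
    {
        return Vector{x * s, y * s, z * s};
    }

    float x, y, z;
};

static_assert((Vector{1.0f, 2.0f, 3.0f} * 2.0f).y == 4.0f);
```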
The reduction of the depth of the call-stack can be easily accomplished by re-writing any recursive functions as iterative versions. The only recursive function we currently have is `utils::sqrt`, which we can re-write as follows:
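A sketch of the iterative version using the Newton-Raphson method (the `utils` namespace matches the text; the fixed iteration count and the handling of non-positive inputs are assumptions):

```cpp
namespace utils
{
    constexpr float sqrt(float x)
    {
        if (x <= 0.0f)
        {
            return 0.0f;
        }

        // Newton-Raphson converges quadratically, so a fixed number of
        // iterations is more than enough for float precision, and the
        // bounded loop keeps the call-stack flat.
        float guess{x};
        for (int i{0}; i < 32; ++i)
        {
            guess = 0.5f * (guess + x / guess);
        }

        return guess;
    }
} // namespace utils

static_assert(utils::sqrt(4.0f) > 1.999f && utils::sqrt(4.0f) < 2.001f);
```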
Re-Structuring the Ray Tracer
Now that we have cleaned up the code a bit, it’s a good time to re-think the architecture and simplify things a bit further. Given the current state of the code, it doesn’t make much sense to have the `Scene` class be the one iterating over all the shapes. This adds complexity in two ways: first, it increases the size of the call-stack, and second, it increases the complexity through the template it receives (plus the usage of CRTP). What we’re going to do instead is re-factor the class so it becomes a container for our shapes, materials, and whatever else we want to add.
Since the materials and shapes are currently held by the `std::variant` wrappers and will ultimately be contained in arrays, we can just pass the types of the arrays as template parameters. Upon construction of the class, we can simply move the arrays so they are held there. We can draft this as follows:
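A minimal sketch of this draft, with plain `int`s standing in for the shape and material wrappers (all member and accessor names are assumptions):

```cpp
#include <array>
#include <utility>

// Scene as a pure container: the array types are template parameters
// and the arrays are moved into the members on construction.
template<typename ShapeContainer, typename MaterialContainer>
class Scene
{
public:
    constexpr Scene(ShapeContainer&& shapes, MaterialContainer&& materials) :
        m_shapes{std::move(shapes)}, m_materials{std::move(materials)}
    {}

    constexpr const ShapeContainer& shapes() const
    {
        return m_shapes;
    }

    constexpr const MaterialContainer& materials() const
    {
        return m_materials;
    }

private:
    ShapeContainer m_shapes;
    MaterialContainer m_materials;
};

// Usage with ints standing in for the variant wrappers.
using Shapes = std::array<int, 2>;
using Materials = std::array<int, 1>;
constexpr Scene<Shapes, Materials> scene{Shapes{1, 2}, Materials{3}};
static_assert(scene.shapes()[1] == 2);
```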
We still need a way of specifying the colour of the background. We can do this by creating a simple interface for CRTP that we can then use to create the different background colours we need. Following the same pattern as before, this would look something like this:
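Following the same CRTP pattern, a sketch of the background interface might be (the `colour` signature, the `Colour` stand-in, and the gradient endpoints are assumptions; only the `Background`/`GradientBackground` names come from the text):

```cpp
struct Colour
{
    float r, g, b;
};

// CRTP base: forwards the call to the derived class.
template<typename T>
class Background
{
public:
    constexpr Colour colour(float t) const
    {
        return static_cast<const T*>(this)->colour(t);
    }
};

// One possible background: lerp from white to a light blue based on
// the vertical component of the ray direction.
class GradientBackground : public Background<GradientBackground>
{
public:
    constexpr Colour colour(float t) const
    {
        return Colour{(1.0f - t) + t * 0.5f,
                      (1.0f - t) + t * 0.7f,
                      (1.0f - t) + t * 1.0f};
    }
};

constexpr GradientBackground background{};
static_assert(background.colour(0.0f).r == 1.0f);
```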
We can then change the template for the `Scene` class so that it also receives the background type and stores it as a `Background<T>` member.
If we attempted to compile this, we would receive a compiler error complaining about the conversion between the two classes.
The problem here is that we’re trying to store the derived class `GradientBackground` as if it were the base class. Clearly this doesn’t work, since we can’t cast one into the other (hence the error message). We could try changing it into a reference, but that still wouldn’t work. Unfortunately, the only solution here is to change the `typename T` argument of the `Scene` template into `class BackgroundColour`. We can then replace `Background<T>` with `BackgroundColour` and the code will compile. Since we aren’t using CRTP any more, we should remove the base class (it offers no extra value and just lengthens the call-stack).
Now that the `Scene` class is just a container, we’re going to simplify the `Renderer::render` function as well. Before, we had two variants: one for simple renders (just trace the background) and one for the sphere. We’re going to collapse these into a single function that handles both. Fortunately, since we have now bundled all of the data inside the `Scene` class, we can simplify the template parameters of the function so it only takes two things: the `Image` and `Scene` classes. The new function looks like this:
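A skeleton of the collapsed function (the `Image`/`Scene` stand-ins and all member names are assumptions; the per-pixel ray tracing is reduced to a background fill so the sketch stays self-contained):

```cpp
#include <array>
#include <cstddef>

struct Colour
{
    float r, g, b;
};

// Fixed-size image: the dimensions are template parameters so the
// pixel array can live in a std::array.
template<std::size_t Width, std::size_t Height>
struct Image
{
    static constexpr std::size_t width{Width};
    static constexpr std::size_t height{Height};
    std::array<Colour, Width * Height> pixels{};
};

// Minimal scene stand-in that only knows its background colour.
struct Scene
{
    Colour background{0.2f, 0.3f, 0.4f};
};

class Renderer
{
public:
    // The single render function: it receives only the image and the
    // scene, and loops over every pixel.
    template<typename ImageType, typename SceneType>
    static constexpr ImageType render(ImageType image, const SceneType& scene)
    {
        for (std::size_t row{0}; row < ImageType::height; ++row)
        {
            for (std::size_t col{0}; col < ImageType::width; ++col)
            {
                // A full tracer would shoot a ray here; we only shade
                // with the background colour.
                image.pixels[row * ImageType::width + col] = scene.background;
            }
        }

        return image;
    }
};

constexpr auto image = Renderer::render(Image<4, 4>{}, Scene{});
static_assert(image.pixels[0].g == 0.3f);
```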
Now that we have cleaned up the code, let’s run the tests once again. Since the runtime numbers barely changed, we will ignore them and instead focus on the compile-time results:
| Scene | Time (Before) | Time (After) |
| --- | --- | --- |
| Just background | 19m 37s | 21m 6s |
| Red Sphere | 34m 36s | 11m 43s |
The increase in compile-times between the first and second versions for the first case is due to the re-structuring. Recall that the original code ran through a simplified version of the `render` function, which used a significantly simpler call-stack. The new version uses the same function for both scenes, which carries an increased complexity. In spite of this, the difference in performance is relatively small (roughly 7%), which is a testament to the performance improvements in the new version. The relevant comparison here is the second case, since it’s the closest in terms of implementation. Notice that the new version is 3 times faster than the original!
So, with all of this in mind, we will continue expanding the system by adding
shading and shadows.
Shades of Light
Shading requires two main things: a BRDF (bidirectional reflectance distribution function), which describes the way light is reflected by the surface, and the material itself, which contains things like colour, diffuse properties, reflectance, etc. Both of these are going to be represented with the same CRTP/`std::variant` pattern we have already seen, so we will not dedicate too much time to them. The base class for the BRDFs will look like this:
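A sketch of the CRTP base, together with the Lambertian implementation discussed below (only the `fn` and `rho` names come from the text; the `Colour`/`Vector` stand-ins, the parameter lists, and the member names are assumptions):

```cpp
struct Colour
{
    float r, g, b;
};

struct Vector
{
    float x, y, z;
};

// CRTP base for BRDFs: fn is the BRDF itself, rho the bihemispherical
// reflectance.
template<typename T>
class BRDF
{
public:
    constexpr Colour fn(const Vector& incoming, const Vector& outgoing) const
    {
        return static_cast<const T*>(this)->fn(incoming, outgoing);
    }

    constexpr Colour rho() const
    {
        return static_cast<const T*>(this)->rho();
    }
};

// Perfect diffuse reflection: the BRDF is constant and equal to the
// diffuse colour scaled by 1/pi.
class Lambertian : public BRDF<Lambertian>
{
public:
    constexpr Lambertian(float kd, Colour cd) : m_kd{kd}, m_cd{cd}
    {}

    constexpr Colour fn(const Vector&, const Vector&) const
    {
        constexpr float inv_pi{0.31830988618f};
        return Colour{m_kd * m_cd.r * inv_pi,
                      m_kd * m_cd.g * inv_pi,
                      m_kd * m_cd.b * inv_pi};
    }

    constexpr Colour rho() const
    {
        return Colour{m_kd * m_cd.r, m_kd * m_cd.g, m_kd * m_cd.b};
    }

private:
    float m_kd;
    Colour m_cd;
};

constexpr Lambertian lambertian{0.5f, Colour{1.0f, 0.0f, 0.0f}};
static_assert(lambertian.rho().r == 0.5f);
```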
Here `fn` is the main BRDF function itself, and `rho` is the bihemispherical reflectance. For now, we will define a single class implementing perfect diffuse shading, otherwise known as Lambertian shading. Similar to the `Shape` classes, we will also add a wrapper for the `std::variant` that will hold the different sub-classes.
The `Material` class now gets re-written to contain a single function that returns the colour of the surface at the current hit point. The new interface looks like this:
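Following the same CRTP pattern once more, a sketch of this single-function interface might be (the `shade` name, its parameters, and the stand-in types are all assumptions):

```cpp
struct Colour
{
    float r, g, b;
};

// Minimal stand-in for the data describing a ray-surface intersection.
struct ShadeRecord
{
    Colour base_colour;
};

// CRTP base for materials: one function that returns the colour of the
// surface at the current hit point.
template<typename T>
class Material
{
public:
    constexpr Colour shade(const ShadeRecord& record) const
    {
        return static_cast<const T*>(this)->shade(record);
    }
};

// A trivial material that just echoes the recorded colour back.
class FlatMaterial : public Material<FlatMaterial>
{
public:
    constexpr Colour shade(const ShadeRecord& record) const
    {
        return record.base_colour;
    }
};

constexpr FlatMaterial flat{};
static_assert(flat.shade(ShadeRecord{Colour{1.0f, 0.0f, 0.0f}}).r == 1.0f);
```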
Since shading also needs light sources, let’s add those now. Note that the `Scene` class will also have to be modified to accept the new light array. The main interface for lights looks like this:
And we will create a single sub-class, which will be our ambient light. With all of this together, the `Scene` class now becomes:
The Empty Case
Great! So now we have a way of creating new lights, materials, and BRDFs. So, all that we have to do is re-write our generator functions so that we can re-create our previous scenes with the new system. For the first scene, we don’t have shapes or lights, so we could just do something like this:
We hit compile and… it doesn’t work: we are greeted with a compiler error.
So what is going on here? Well, looking at the definition of `ShapeWrapper`, we see that we didn’t provide a default constructor. Before we provide one, though, let us look a bit closer at the definition of `std::variant`. If we read through the constructor list, we will see this:
> Default constructor. Constructs a variant holding the value-initialized value of the first alternative (`index()` is zero).
Given our current setup, this means that if we were to provide a default constructor to `ShapeWrapper`, it would default-initialise a `Plane`! This is obviously wrong, since we don’t have (or want) a plane in our scene. So how do we solve this problem?
The easiest way by far is to create an “empty” shape that essentially does nothing. We can then put it as the first entry in the variant, which means that the wrapper will default-initialise to this special shape. So, with this in mind, we are going to add two classes: `EmptyShape` and `EmptyMaterial`.
Notice that since we already have an ambient light (which effectively does nothing), we can just use that as the empty light. After adding these classes to the wrappers and adding default constructors, our code compiles correctly.
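The pattern can be sketched as follows (the `EmptyShape` name comes from the text; the other shape stubs, the `shadow_hit` signature, and the wrapper layout are assumptions):

```cpp
#include <variant>

// An inert shape that never hits anything.
struct EmptyShape
{
    constexpr bool shadow_hit(float&) const
    {
        return false;
    }
};

struct Plane
{
    constexpr bool shadow_hit(float& t) const
    {
        t = 0.0f;
        return true;
    }
};

struct Sphere
{
    float radius{1.0f};
    constexpr bool shadow_hit(float& t) const
    {
        t = radius;
        return true;
    }
};

// Because EmptyShape is listed first, a default-constructed wrapper
// holds the inert shape instead of an unwanted Plane or Sphere.
struct ShapeWrapper
{
    std::variant<EmptyShape, Plane, Sphere> shape{};
};

static_assert(std::variant<EmptyShape, Plane, Sphere>{}.index() == 0);
```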
Finally, we can sub-class the `Material` class to create a matte material that holds the Lambertian BRDF we created earlier. All that the `shade` function needs to do is loop over the lights, adding their contributions and accumulating the final colour. Let’s create a new scene with two spheres and a plane to showcase the shading, and we will be left with a scene like this:
Not bad, not bad. Now let’s add some shadows, since those are relatively straight-forward.
Shadow of Light
For the shadows, we are going to have to perform the following changes:
- The lights need a way of knowing whether they cast shadows or not. Since this is common to all lights, we will add it to the base class directly. Note that it should default to false, since ambient lights do not cast shadows.
- The shapes will need a second hit function, which we will call `shadow_hit`. The purpose of this function is to provide a simplified intersection routine that only computes two things: whether the ray hits the surface, and the value of $t$ at the intersection point.
- The `Matte::shade` function will now need to take into account whether the light casts shadows. If it does, and the hit point turns out to be occluded, only the colour from the ambient light should remain. To check this, we just have to create a shadow ray and test it with the `shadow_hit` function of every shape. This does mean that the interface of the material class will have to be changed to accept the container of shapes, like this:
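The shadow test at the heart of this change can be sketched as follows (the `in_shadow` helper, the `Shape` stand-in, and the array-of-shapes interface are all assumptions; only `shadow_hit` comes from the text):

```cpp
#include <array>
#include <cstddef>

struct Vector
{
    float x, y, z;
};

struct Ray
{
    Vector origin;
    Vector direction;
};

// Minimal shape stand-in: reports a hit at a fixed t along any ray.
struct Shape
{
    float t_hit;
    constexpr bool shadow_hit(const Ray&, float& t) const
    {
        t = t_hit;
        return true;
    }
};

// The hit point is in shadow if any shape intersects the shadow ray
// closer than the light itself.
template<std::size_t N>
constexpr bool in_shadow(const Ray& shadow_ray,
                         const std::array<Shape, N>& shapes,
                         float distance_to_light)
{
    for (const auto& shape : shapes)
    {
        float t{0.0f};
        if (shape.shadow_hit(shadow_ray, t) && t < distance_to_light)
        {
            return true;
        }
    }

    return false;
}

constexpr std::array<Shape, 1> wall{{Shape{2.0f}}};
constexpr Ray shadow_ray{Vector{0.0f, 0.0f, 0.0f}, Vector{0.0f, 1.0f, 0.0f}};
static_assert(in_shadow(shadow_ray, wall, 5.0f));
static_assert(!in_shadow(shadow_ray, wall, 1.0f));
```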
Once all of this is done, we can now set our light to cast shadows and we’ll be greeted with this:
Perfect! We now have a respectable ray tracer! Now, for the sake of argument, let’s look at the compile-time and… it takes approximately 42 minutes. This isn’t bad at all considering just how much the code is doing, but it does raise the question: could we do any better?
The answer is yes, but we will have to start sacrificing the interface we have built in order to do so. Strictly speaking, we don’t need the various CRTP base classes, so we could strip those away (note that we still need the variant wrappers). We could also attempt to move more of the code into the `Renderer::render` function, but this would come at the cost of flexibility if we wanted to add different kinds of materials (which we definitely will). So overall, I believe we’re in a pretty good state right now, save for one thing: aliasing.
To Sample or not to Sample
The final thing we will touch on in today’s journey is whether we can add multi-sampling. In ray tracing, sampling involves generating a random point inside the bounds of the pixel (generally a point with coordinates in the range 0 to 1) and using it to compute the direction of our ray. This process is then repeated several times (16 samples is a good number for our case) and the final colour is the average of all the samples.
This presents a couple of problems. The first is random number generation. Most random number generators rely on things like `<chrono>` or `<random>` from the STL, neither of which is marked `constexpr` and therefore cannot be used. We could abuse the `__TIME__` macro, which provides us with the time at which the compilation is happening. The problem is that, because the entire sequence of random numbers must be known at compile-time, the sequence itself must have a fixed size. This has several problems:
- The sequence will always be the same. In general, it isn’t a good idea to sample based on a pattern since this can lead to artefacts, so we usually tend to randomly shuffle the sample list, which we wouldn’t be able to do here. Granted, we could also generate a sequence of random integers that we could use to index into the sample list, but that just means we now have to carry two random sequences instead of one.
- We can’t increase the size of the samples for adaptive sampling. If we’re just trying to increase the number of samples per pixel this isn’t really a problem (since that number is generally fixed in most simple cases), but for more advanced techniques, the number of samples we may need will be different. Therefore, we would have to know beforehand the total number of samples we would require to run the entire render, which leads us back to the first problem (since we’re basically sampling in the same sequence).
This leads us to the second problem: memory usage and compile-time. Adding a very conservative 6 samples per pixel to the first scene results in the compiler using all of the system memory (64GB in my case) and taking well over 30 minutes to finish. Since we are going to be adding at least one more feature (reflections) to the ray tracer, it would be best not to tempt fate, as running out of memory results in a hard crash of the compiler.
So, in summary, while we could have some form of random sampling, we won’t be adding it here. The lack of random sampling will also rule out other techniques such as ambient occlusion, glossy reflections, and more advanced rendering techniques since they all rely on samplers.
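Although we ultimately leave sampling out, the `__TIME__` trick mentioned above can be sketched with a simple linear congruential generator. Everything here besides the use of `__TIME__` is an assumption; the multiplier and increment are the classic Numerical Recipes LCG constants:

```cpp
// Fold the characters of __TIME__ ("hh:mm:ss") into a seed. The seed
// is fixed for a given compilation, which is exactly the limitation
// discussed above: the whole sequence is fixed once the seed is.
constexpr unsigned int time_seed()
{
    constexpr const char* time_str{__TIME__};
    unsigned int seed{0};
    for (int i{0}; time_str[i] != '\0'; ++i)
    {
        seed = seed * 10u + static_cast<unsigned int>(time_str[i]);
    }

    return seed;
}

// One step of a linear congruential generator.
constexpr unsigned int lcg_next(unsigned int state)
{
    return state * 1664525u + 1013904223u;
}

// Map a state to a float in [0, 1) using its top 24 bits.
constexpr float random_float(unsigned int state)
{
    return static_cast<float>(state >> 8) / static_cast<float>(1u << 24);
}

static_assert(random_float(0xFFFFFFFFu) < 1.0f);
```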
Conclusion
So, what did we learn in this trip?
Compile-time computations effectively turn the compiler into an interpreter that must check each line for undefined behaviours, language rules, etc. This results in the compiler taking a long time to produce the results we want and using a significant amount of memory.
To help reduce this problem, we have to:
- Simplify the call-stack by removing unnecessary templates (or simplifying said templates whenever possible), and
- Shorten the call-stack by removing function calls whenever possible as well as re-writing recursive functions.
Note that all of these should be done within reason so as to not lose readability or flexibility.
We cannot store instances of CRTP derived classes as their base class (in contrast to how we would store pointers to base classes with virtual functions). Instead, we can pass an additional template parameter.
`std::variant` has a default constructor that will initialise to the first entry in its list of types. Therefore, if we want an “empty” type, we must provide it and make sure it is listed first.
Random sampling is not possible at compile-time, since the numbers must be known at compile-time. It would also result in increased compile-times and memory usage.
We are almost at the end of this journey. Join me next time where we will explore the following:
- How can we add reflections without causing memory usage and compile-times to explode?
- Can we parallelise our code?
References
The following are the references used for this article in case you wish to read more about the discussed concepts.
- Constant Expressions
- `std::variant`
- PBRT
- Ray Tracing From the Ground Up
- Understanding C++ `constexpr` Performance
- Generate random numbers in C++ at compile time
Edits:
- 22/09/2024: Improved TOC and fixed some phrasing due to blog style changes.