Introduction
This is part 2 of the CTRT series. You can read the complete series here:
- Part 1: Rendering a Sphere
- Part 2: BRDFs and Shadows
- Part 3: Phong and Reflections
Last time, we discussed the initial design and setup for a compile-time ray tracer that used `constexpr` functions to generate the final image. If you want a refresher, you can read the full article here. In today’s journey, we will continue developing the ray tracer by adding some Lambertian shading along with some shadows. Before we can do any of this, however, we should address a rather glaring problem: generating our image at compile-time is at least $6 \times 10^5$ times slower than doing so at runtime and uses an exorbitant amount of memory!
Before we dive in, let’s record our current findings:
| Scene | Time (Dynamic) | Time (Compile-Time) |
| --- | --- | --- |
| Just background | 1.67 ms | 19m 37s |
| Red Sphere | 3.10 ms | 34m 36s |
**Note:** At the top of each section, you will find links to all the files discussed there, in case you want to read the code along with the explanations. The entire source code for this post can be found here.
Understanding constexpr
In order to understand the times that we’re seeing for the compile-time version, we first need to discuss what happens when the compiler performs computations at compile-time. It is worth noting the following observations:
- The conclusions drawn here are based mostly on the behaviour exhibited by MSVC. Different compilers may vary slightly.
- This is by no means an accurate explanation of what exactly is happening. Since we have no real way of debugging the compiler itself, the observations made here are based on the general behaviours of other compilers.
Parsing constexpr
Consider the following simplified example:
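The listing here is a minimal `constexpr` function; based on the walkthrough that follows, it would be along these lines (the exact formatting is an assumption):

```cpp
// A small constexpr function: nothing about it is special other than
// the specifier itself.
constexpr int add(int a, int b)
{
    return a + b;
}

// Because result is declared constexpr, the right-hand side must be a
// constant expression, which forces add(1, 2) to be evaluated by the
// compiler.
constexpr int result = add(1, 2);

static_assert(result == 3);
```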
When the compiler sees this code, it must first perform a series of checks to ensure that the code complies with the following:
- The code must meet all of the requirements for `constexpr` functions. The full list can be seen here.
- The code must not result in undefined behaviour (UB).
- The result of evaluating `add` must itself be a constant expression.
Let’s pretend we’re the compiler for a second and parse the code ourselves. We begin with the expression `constexpr int result = add(1, 2);`. After finishing all necessary syntax checks, we note that `result` is declared `constexpr`, so we must determine whether the right-hand side of the `=` results in a constant expression. Clearly `1` and `2` are integral constants and therefore `constexpr`, so we’re fine here. Next, we note that `add` is itself marked `constexpr`, so let’s go into the function and check it. It takes two `int`s by copy and then requires `operator+` to be defined. Since the two parameters are fundamental types, all that remains is to check for undefined behaviour. Adding 1 and 2 cannot result in any form of overflow, division by 0, or other arithmetic errors, so the result is fine. Therefore, the entire expression is constant and we can compute the result at compile-time.
You’d be right to think that this is incredibly tedious. The compiler is essentially interpreting the individual instructions while checking them against the pre-established rules to ensure that nothing illegal is done. Since this must be done for every single line, template instantiation, array access, etc., we can easily see how the evaluation time quickly escalates with the complexity of the code.
The other part of the puzzle is related to the increase in memory usage. Compiling the second scene results in over 50GB of RAM being used by the compiler! Where is all of this memory coming from?
Memoization… No it’s Not a Typo
Before we get started, let’s do a bit of maths. We know that our image is $512 \times 512$ and that each pixel requires 3 floats. Assuming that each float requires 4 bytes, we end up with $512 \times 512 \times 3 \times 4 = 3145728$ bytes of memory to represent the image, or approximately 3.1 MB. So if the base memory usage of our image is 3.1 MB, how do we get to the 50 GB that the compiler is using?
To understand this, we first need to talk about memoization, a technique used to improve the performance of recursive algorithms by caching the results of calls as they become available. The assumption is the following: if a function is pure (it does not modify any external state and always returns the same value for the same parameters), then we can safely cache the result of a call with a given set of arguments. If we invoke that same function again with the exact same arguments, we can simply return the stored value instead of re-calculating it.
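As a runtime illustration of the idea (all names here are hypothetical, not part of the ray tracer), a memoized function caches its results in a map keyed on the arguments, so the underlying computation only ever runs once per distinct input:

```cpp
#include <map>

// Counts how often the "real" computation actually runs.
long long calls{0};

long long slow_square(int x)
{
    ++calls;
    return static_cast<long long>(x) * x;
}

// Memoized front-end: look the argument up in a cache first, and only
// fall back to the real computation on a miss.
long long memoized_square(int x)
{
    static std::map<int, long long> cache{};
    auto it = cache.find(x);
    if (it != cache.end())
    {
        return it->second;
    }

    auto result = slow_square(x);
    cache.emplace(x, result);
    return result;
}
```

Calling `memoized_square(12)` twice performs the multiplication only once; the second call is a pure cache hit.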
Given the way the memory grows as the compiler runs, the likely explanation is that the compiler caches the result of each function invocation in an attempt to speed up the process. While it is true that our ray tracer is effectively pure, we never evaluate a function with the same arguments twice, so the compiler ends up caching a significant number of results that are never used. This caching, combined with the extensive parsing and error-checking that must be performed and the non-trivial nature of the code, leads to the compiler taking significantly longer than the runtime equivalent.
So, now that we know all of this, can we do anything about it? The answer is yes. To an extent at least.
Improving Performance
There are two things that we can do immediately to reduce the load on the compiler:
- Reduce the depth of the call-stack and
- Simplify the complexity of the call-stack itself.
Let’s start with the second one. If we read through our code, we will find that there are several things that we can simplify:
- The `Vector` class is currently templated to allow us to change the precision between floats and doubles. Given the high memory usage, it would probably be unwise to attempt doubles, so let’s remove the template and commit the entire code to using floats. This is standard in graphics anyway, so it’s not a big deal.
- The `ShapeContainer` and `MaterialContainer` classes can be removed. These are essentially wrappers that hide the loop over all the shapes or the individual accesses to the material array. Since removing them doesn’t hinder readability, we can move their logic into the main `render` function. Note that this not only simplifies the call-stack (one less template), but also reduces its depth.
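As a sketch, the de-templated `Vector` might look like this (the member and operator set shown here is an assumption; only the commitment to `float` comes from the text):

```cpp
// Vector committed to float precision: no template parameter needed.
class Vector
{
public:
    constexpr Vector() : x{0.0f}, y{0.0f}, z{0.0f}
    {}

    constexpr Vector(float x_, float y_, float z_) : x{x_}, y{y_}, z{z_}
    {}

    constexpr Vector operator+(const Vector& rhs) const
    {
        return Vector{x + rhs.x, y + rhs.y, z + rhs.z};
    }

    constexpr Vector operator*(float s) const
    {
        return Vector{x * s, y * s, z * s};
    }

    float x, y, z;
};

static_assert((Vector{1.0f, 2.0f, 3.0f} * 2.0f).y == 4.0f);
```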
The reduction of the depth of the call-stack can be easily accomplished by re-writing any recursive functions as iterative versions. The only recursive function we currently have is `utils::sqrt`, which we can re-write as follows:
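A sketch of the iterative version using the Newton-Raphson method (the `utils` namespace matches the text; the fixed iteration count and the handling of non-positive inputs are assumptions):

```cpp
namespace utils
{
    constexpr float sqrt(float x)
    {
        if (x <= 0.0f)
        {
            return 0.0f;
        }

        // Newton-Raphson converges quadratically, so a fixed number of
        // iterations is more than enough for float precision, and the
        // bounded loop keeps the call-stack flat.
        float guess{x};
        for (int i{0}; i < 32; ++i)
        {
            guess = 0.5f * (guess + x / guess);
        }

        return guess;
    }
} // namespace utils

static_assert(utils::sqrt(4.0f) > 1.999f && utils::sqrt(4.0f) < 2.001f);
```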
Re-Structuring the Ray Tracer
Now that we have cleaned up the code a bit, it’s a good time to re-think the architecture and simplify things a bit further. Given the current state of the code, it doesn’t make much sense to have the `Scene` class be the one iterating over all the shapes. This adds complexity in two ways: first, it increases the size of the call-stack, and second, it increases the complexity through the template it receives (plus the usage of CRTP). What we’re going to do instead is re-factor the class so it becomes a container for our shapes, materials, and whatever else we want to add.
Since the materials and shapes are currently held by the `std::variant` wrappers and will ultimately be contained in arrays, we can just pass the types of the arrays as template parameters. Upon construction of the class, we can simply move the arrays so they are held there. We can draft this as follows:
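A minimal sketch of this draft, with plain `int`s standing in for the shape and material wrappers (all member and accessor names are assumptions):

```cpp
#include <array>
#include <utility>

// Scene as a pure container: the array types are template parameters
// and the arrays are moved into the members on construction.
template<typename ShapeContainer, typename MaterialContainer>
class Scene
{
public:
    constexpr Scene(ShapeContainer&& shapes, MaterialContainer&& materials) :
        m_shapes{std::move(shapes)}, m_materials{std::move(materials)}
    {}

    constexpr const ShapeContainer& shapes() const
    {
        return m_shapes;
    }

    constexpr const MaterialContainer& materials() const
    {
        return m_materials;
    }

private:
    ShapeContainer m_shapes;
    MaterialContainer m_materials;
};

// Usage with ints standing in for the variant wrappers.
using Shapes = std::array<int, 2>;
using Materials = std::array<int, 1>;
constexpr Scene<Shapes, Materials> scene{Shapes{1, 2}, Materials{3}};
static_assert(scene.shapes()[1] == 2);
```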
We still need a way of specifying the colour of the background. We can do this by creating a simple interface for CRTP that we can then use to create the different background colours we need. Following the same pattern as before, this would look something like this:
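Following the same CRTP pattern, a sketch of the background interface might be (the `colour` signature, the `Colour` stand-in, and the gradient endpoints are assumptions; only the `Background`/`GradientBackground` names come from the text):

```cpp
struct Colour
{
    float r, g, b;
};

// CRTP base: forwards the call to the derived class.
template<typename T>
class Background
{
public:
    constexpr Colour colour(float t) const
    {
        return static_cast<const T*>(this)->colour(t);
    }
};

// One possible background: lerp from white to a light blue based on
// the vertical component of the ray direction.
class GradientBackground : public Background<GradientBackground>
{
public:
    constexpr Colour colour(float t) const
    {
        return Colour{(1.0f - t) + t * 0.5f,
                      (1.0f - t) + t * 0.7f,
                      (1.0f - t) + t * 1.0f};
    }
};

constexpr GradientBackground background{};
static_assert(background.colour(0.0f).r == 1.0f);
```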
We can then change the template for the `Scene` class so that it also receives the background type and stores it as a `Background<T>` member.
If we attempted to compile this, we would receive a compiler error complaining about the conversion between the two classes.
The problem here is that we’re trying to store the derived class `GradientBackground` as if it were the base class. Clearly this doesn’t work, since we can’t cast one into the other (hence the error message). We could try changing it into a reference, but that still wouldn’t work. Unfortunately, the only solution here is to change the `typename T` argument of the `Scene` template into `class BackgroundColour`. We can then replace `Background<T>` with `BackgroundColour` and the code will compile. Since we aren’t using CRTP any more, we should remove the base class (it offers no extra value and just lengthens the call-stack).
Now that the `Scene` class is just a container, we’re going to simplify the `Renderer::render` function as well. Before, we had two variants: one for simple renders (just trace the background) and one for the sphere. We’re going to collapse these into a single function that handles both. Fortunately, since we have now bundled all of the data inside the `Scene` class, we can simplify the template parameters of the function so it only takes two things: the `Image` and `Scene` classes. The new function looks like this:
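A skeleton of the collapsed function (the `Image`/`Scene` stand-ins and all member names are assumptions; the per-pixel ray tracing is reduced to a background fill so the sketch stays self-contained):

```cpp
#include <array>
#include <cstddef>

struct Colour
{
    float r, g, b;
};

// Fixed-size image: the dimensions are template parameters so the
// pixel array can live in a std::array.
template<std::size_t Width, std::size_t Height>
struct Image
{
    static constexpr std::size_t width{Width};
    static constexpr std::size_t height{Height};
    std::array<Colour, Width * Height> pixels{};
};

// Minimal scene stand-in that only knows its background colour.
struct Scene
{
    Colour background{0.2f, 0.3f, 0.4f};
};

class Renderer
{
public:
    // The single render function: it receives only the image and the
    // scene, and loops over every pixel.
    template<typename ImageType, typename SceneType>
    static constexpr ImageType render(ImageType image, const SceneType& scene)
    {
        for (std::size_t row{0}; row < ImageType::height; ++row)
        {
            for (std::size_t col{0}; col < ImageType::width; ++col)
            {
                // A full tracer would shoot a ray here; we only shade
                // with the background colour.
                image.pixels[row * ImageType::width + col] = scene.background;
            }
        }

        return image;
    }
};

constexpr auto image = Renderer::render(Image<4, 4>{}, Scene{});
static_assert(image.pixels[0].g == 0.3f);
```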
Now that we have cleaned up the code, let’s run the tests once again. Since the runtime numbers barely changed, we will ignore them and instead focus on the compile-time results:
| Scene | Time (Before) | Time (After) |
| --- | --- | --- |
| Just background | 19m 37s | 21m 6s |
| Red Sphere | 34m 36s | 11m 43s |
The increase in compile-times between the first and second versions for the first case is due to the re-structuring. Recall that the original code ran through a simplified version of the `render` function, which used a significantly simpler call-stack. The new version uses the same function for both scenes, which carries an increased complexity. In spite of this, the difference in performance is relatively small (roughly 7%), which is a testament to the performance improvements in the new version. The relevant comparison here is the second case, since it’s the closest in terms of implementation. Notice that the new version is 3 times faster than the original!
So, with all of this in mind, we will continue expanding the system by adding
shading and shadows.
Shades of Light
Shading requires two main things: a BRDF (bidirectional reflectance distribution function), which describes the way light is reflected by the surface, and the material itself, which contains things like colour, diffuse properties, reflectance, etc. Both of these are going to be represented with the same CRTP/`std::variant` pattern we have already seen, so we will not dedicate too much time to them. The base class for the BRDFs will look like this:
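A sketch of the CRTP base, together with the Lambertian implementation discussed below (only the `fn` and `rho` names come from the text; the `Colour`/`Vector` stand-ins, the parameter lists, and the member names are assumptions):

```cpp
struct Colour
{
    float r, g, b;
};

struct Vector
{
    float x, y, z;
};

// CRTP base for BRDFs: fn is the BRDF itself, rho the bihemispherical
// reflectance.
template<typename T>
class BRDF
{
public:
    constexpr Colour fn(const Vector& incoming, const Vector& outgoing) const
    {
        return static_cast<const T*>(this)->fn(incoming, outgoing);
    }

    constexpr Colour rho() const
    {
        return static_cast<const T*>(this)->rho();
    }
};

// Perfect diffuse reflection: the BRDF is constant and equal to the
// diffuse colour scaled by 1/pi.
class Lambertian : public BRDF<Lambertian>
{
public:
    constexpr Lambertian(float kd, Colour cd) : m_kd{kd}, m_cd{cd}
    {}

    constexpr Colour fn(const Vector&, const Vector&) const
    {
        constexpr float inv_pi{0.31830988618f};
        return Colour{m_kd * m_cd.r * inv_pi,
                      m_kd * m_cd.g * inv_pi,
                      m_kd * m_cd.b * inv_pi};
    }

    constexpr Colour rho() const
    {
        return Colour{m_kd * m_cd.r, m_kd * m_cd.g, m_kd * m_cd.b};
    }

private:
    float m_kd;
    Colour m_cd;
};

constexpr Lambertian lambertian{0.5f, Colour{1.0f, 0.0f, 0.0f}};
static_assert(lambertian.rho().r == 0.5f);
```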
Here `fn` is the main BRDF function itself, and `rho` is the bihemispherical reflectance. For now, we will define a single class implementing perfect diffuse shading, otherwise known as Lambertian shading. Similar to the `Shape` classes, we will also add a wrapper for the `std::variant` that will hold the different sub-classes.
The `Material` class now gets re-written to contain a single function that returns the colour of the surface at the current hit point. The new interface looks like this:
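Following the same CRTP pattern once more, a sketch of this single-function interface might be (the `shade` name, its parameters, and the stand-in types are all assumptions):

```cpp
struct Colour
{
    float r, g, b;
};

// Minimal stand-in for the data describing a ray-surface intersection.
struct ShadeRecord
{
    Colour base_colour;
};

// CRTP base for materials: one function that returns the colour of the
// surface at the current hit point.
template<typename T>
class Material
{
public:
    constexpr Colour shade(const ShadeRecord& record) const
    {
        return static_cast<const T*>(this)->shade(record);
    }
};

// A trivial material that just echoes the recorded colour back.
class FlatMaterial : public Material<FlatMaterial>
{
public:
    constexpr Colour shade(const ShadeRecord& record) const
    {
        return record.base_colour;
    }
};

constexpr FlatMaterial flat{};
static_assert(flat.shade(ShadeRecord{Colour{1.0f, 0.0f, 0.0f}}).r == 1.0f);
```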
Since shading also needs light sources, let’s add those now. Note that the `Scene` class will also have to be modified to accept the new light array. The main interface for lights looks like this:
And we will create a single sub-class, which will be our ambient light. With all of this together, the `Scene` class now becomes:
The Empty Case
Great! So now we have a way of creating new lights, materials, and BRDFs. So, all that we have to do is re-write our generator functions so that we can re-create our previous scenes with the new system. For the first scene, we don’t have shapes or lights, so we could just do something like this:
We hit compile and… it doesn’t work: we are greeted with a compiler error.
So what is going on here? Well, looking at the definition of `ShapeWrapper`, we see that we didn’t provide a default constructor. Before we provide one, though, let us look a bit closer at the definition of `std::variant`. If we read through the constructor list, we will see this:
> Default constructor. Constructs a variant holding the value-initialized value of the first alternative (`index()` is zero).
Given our current setup, this means that if we were to provide a default constructor to `ShapeWrapper`, it would default-initialise a `Plane`! This is obviously wrong, since we don’t have (or want) a plane in our scene. So how do we solve this problem?
The easiest way by far is to create an “empty” shape that essentially does nothing. We can then put it as the first entry in the variant, which means that the wrapper will default-initialise to this special shape. So, with this in mind, we are going to add two classes: `EmptyShape` and `EmptyMaterial`.
Notice that since we already have an ambient light (which effectively does nothing), we can just use that as the empty light. After adding these classes to the wrappers and adding default constructors, our code compiles correctly.
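The pattern can be sketched as follows (the `EmptyShape` name comes from the text; the other shape stubs, the `shadow_hit` signature, and the wrapper layout are assumptions):

```cpp
#include <variant>

// An inert shape that never hits anything.
struct EmptyShape
{
    constexpr bool shadow_hit(float&) const
    {
        return false;
    }
};

struct Plane
{
    constexpr bool shadow_hit(float& t) const
    {
        t = 0.0f;
        return true;
    }
};

struct Sphere
{
    float radius{1.0f};
    constexpr bool shadow_hit(float& t) const
    {
        t = radius;
        return true;
    }
};

// Because EmptyShape is listed first, a default-constructed wrapper
// holds the inert shape instead of an unwanted Plane or Sphere.
struct ShapeWrapper
{
    std::variant<EmptyShape, Plane, Sphere> shape{};
};

static_assert(std::variant<EmptyShape, Plane, Sphere>{}.index() == 0);
```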
Finally, we can sub-class the `Material` class to create a matte material that holds the Lambertian BRDF we created earlier. All that the `shade` function needs to do is loop over the lights, adding their contributions and accumulating the final colour. Let’s create a new scene with two spheres and a plane to showcase the shading, and we will be left with a scene like this:
Not bad, not bad. Now let’s add some shadows, since those are relatively straight-forward.
Shadow of Light
For the shadows, we are going to have to perform the following changes:
- The lights need a way of knowing whether they cast shadows or not. Since this is common to all lights, we will add it to the base class directly. Note that it should default to false, since ambient lights do not cast shadows.
- The shapes will need a second hit function, which we will call `shadow_hit`. The purpose of this function is to provide a simplified intersection routine that only computes two things: whether the ray hits the surface, and the value of $t$ at the intersection point.
- The `Matte::shade` function will now need to take into account whether the light casts shadows. If it does, and the hit point turns out to be occluded, only the colour from the ambient light should remain. To check this, we just have to create a shadow ray and test it with the `shadow_hit` function of every shape. This does mean that the interface of the material class will have to be changed to accept the container of shapes, like this:
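The shadow test at the heart of this change can be sketched as follows (the `in_shadow` helper, the `Shape` stand-in, and the array-of-shapes interface are all assumptions; only `shadow_hit` comes from the text):

```cpp
#include <array>
#include <cstddef>

struct Vector
{
    float x, y, z;
};

struct Ray
{
    Vector origin;
    Vector direction;
};

// Minimal shape stand-in: reports a hit at a fixed t along any ray.
struct Shape
{
    float t_hit;
    constexpr bool shadow_hit(const Ray&, float& t) const
    {
        t = t_hit;
        return true;
    }
};

// The hit point is in shadow if any shape intersects the shadow ray
// closer than the light itself.
template<std::size_t N>
constexpr bool in_shadow(const Ray& shadow_ray,
                         const std::array<Shape, N>& shapes,
                         float distance_to_light)
{
    for (const auto& shape : shapes)
    {
        float t{0.0f};
        if (shape.shadow_hit(shadow_ray, t) && t < distance_to_light)
        {
            return true;
        }
    }

    return false;
}

constexpr std::array<Shape, 1> wall{{Shape{2.0f}}};
constexpr Ray shadow_ray{Vector{0.0f, 0.0f, 0.0f}, Vector{0.0f, 1.0f, 0.0f}};
static_assert(in_shadow(shadow_ray, wall, 5.0f));
static_assert(!in_shadow(shadow_ray, wall, 1.0f));
```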
Once all of this is done, we can now set our light to cast shadows and we’ll be greeted with this:
Perfect! We now have a respectable ray tracer! Now, for the sake of argument, let’s look at the compile-time and… it takes approximately 42 minutes. This isn’t bad at all considering just how much the code is doing, but it does raise the question: could we do any better?
The answer is yes, but we will have to start sacrificing the interface we have built in order to do so. Strictly speaking, we don’t need the various CRTP base classes, so we could strip those away (note that we still need the variant wrappers). We could also attempt to move more of the code into the `Renderer::render` function, but this would come at the cost of flexibility if we wanted to add different kinds of materials (which we definitely will). So overall, I believe we’re in a pretty good state right now, save for one thing: aliasing.
To Sample or not to Sample
The final thing we will touch on in today’s journey is whether we can add multi-sampling. In ray tracing, sampling involves generating a random point inside the bounds of the pixel (generally a point with coordinates in the range 0 to 1) and using it to compute the direction of our ray. This process is then repeated several times (16 samples is a good number for our case) and the final colour is the average of all the samples.
This presents a couple of problems. The first is random number generation. Most random number generators rely on things like `<chrono>` or `<random>` from the STL, neither of which is marked `constexpr` and therefore cannot be used. We could abuse the `__TIME__` macro, which provides us with the time at which the compilation is happening. The problem is that, because the entire sequence of random numbers must be known at compile-time, the sequence itself must have a fixed size. This has several problems:
- The sequence will always be the same. In general, it isn’t a good idea to sample based on a pattern since this can lead to artefacts, so we usually tend to randomly shuffle the sample list, which we wouldn’t be able to do here. Granted, we could also generate a sequence of random integers that we could use to index into the sample list, but that just means we now have to carry two random sequences instead of one.
- We can’t increase the size of the samples for adaptive sampling. If we’re just trying to increase the number of samples per pixel this isn’t really a problem (since that number is generally fixed in most simple cases), but for more advanced techniques, the number of samples we may need will be different. Therefore, we would have to know beforehand the total number of samples we would require to run the entire render, which leads us back to the first problem (since we’re basically sampling in the same sequence).
This leads us to the second problem: memory usage and compile-time. Adding a very conservative 6 samples per pixel to the first scene results in the compiler using all of the system memory (64GB in my case) and taking well over 30 minutes to finish. Since we are going to be adding at least one more feature (reflections) to the ray tracer, it would be best not to tempt fate, as running out of memory results in a hard crash of the compiler.
So, in summary, while we could have some form of random sampling, we won’t be adding it here. The lack of random sampling will also rule out other techniques such as ambient occlusion, glossy reflections, and more advanced rendering techniques since they all rely on samplers.
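Although we ultimately leave sampling out, the `__TIME__` trick mentioned above can be sketched with a simple linear congruential generator. Everything here besides the use of `__TIME__` is an assumption; the multiplier and increment are the classic Numerical Recipes LCG constants:

```cpp
// Fold the characters of __TIME__ ("hh:mm:ss") into a seed. The seed
// is fixed for a given compilation, which is exactly the limitation
// discussed above: the whole sequence is fixed once the seed is.
constexpr unsigned int time_seed()
{
    constexpr const char* time_str{__TIME__};
    unsigned int seed{0};
    for (int i{0}; time_str[i] != '\0'; ++i)
    {
        seed = seed * 10u + static_cast<unsigned int>(time_str[i]);
    }

    return seed;
}

// One step of a linear congruential generator.
constexpr unsigned int lcg_next(unsigned int state)
{
    return state * 1664525u + 1013904223u;
}

// Map a state to a float in [0, 1) using its top 24 bits.
constexpr float random_float(unsigned int state)
{
    return static_cast<float>(state >> 8) / static_cast<float>(1u << 24);
}

static_assert(random_float(0xFFFFFFFFu) < 1.0f);
```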
Conclusion
So, what did we learn in this trip?
Compile-time computations effectively turn the compiler into an interpreter that must check each line for undefined behaviours, language rules, etc. This results in the compiler taking a long time to produce the results we want and using a significant amount of memory.
To help reduce this problem, we have to:
- Simplify the call-stack by removing unnecessary templates (or simplifying said templates whenever possible), and
- Shorten the call-stack by removing function calls whenever possible as well as re-writing recursive functions.
Note that all of these should be done within reason so as to not lose readability or flexibility.
We cannot store instances of CRTP derived classes as their base class (in contrast to how we would store pointers to base classes with virtual functions). Instead, we can pass an additional template parameter.
`std::variant` has a default constructor that will initialise to the first entry in its list of types. Therefore, if we want an “empty” type, we must provide it and make sure it is listed first.
Random sampling is not possible at compile-time, since the numbers must be known at compile-time. It would also result in increased compile-times and memory usage.
We are almost at the end of this journey. Join me next time where we will explore the following:
- How can we add reflections without causing memory usage and compile-times to explode?
- Can we parallelise our code?
References
The following are the references used for this article in case you wish to read more about the discussed concepts.
- Constant Expressions
- `std::variant`
- PBRT
- Ray Tracing From the Ground Up
- Understanding C++ `constexpr` Performance
- Generate random numbers in C++ at compile time
Edits:
- 22/09/2024: Improved TOC and fixed some phrasing due to blog style changes.