Final Project:
Neural Radiance Field!
In this final project, we are going to implement a Neural Radiance Field (NeRF) from scratch. Newer methods like Gaussian Splatting have made rendering faster and improved its quality, but by building NeRF step by step we really get a feel for how it works: how volume rendering is handled, how positional encodings are used, and how a neural network is trained to reconstruct a scene. It also gives us a solid understanding of the basics.
Same as our last project, we will split this project into two parts.
Part 1: Fit a Neural Field to a 2D Image
Before we jump into implementing a real NeRF, though, we will first try the idea out on a 2D image. A NeRF is essentially a neural network that has learned a 3D scene: the input is a 3D position (and a viewing direction) and the output is the color and opacity (density) at that point, with the scene stored implicitly in the network's weights rather than as an explicit grid of points.
Since we are starting in 2D, we first create a model that only takes a 2D input, the (x, y) pixel coordinates of the image, and returns the RGB color of that pixel.
Our 2D neural network architecture
Our architecture is a simple feedforward neural network with 3 hidden layers. We use ReLU as our activation function, and we apply a Sinusoidal Positional Encoding (PE) to the input, which expands its dimensionality and tells our model where in the image the pixel is located.
We also keep the original input in the PE so the formulation is:
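For each input coordinate, with L frequency levels, this is the standard sinusoidal encoding from the NeRF paper:

PE(x) = \left[\, x,\ \sin(2^0 \pi x),\ \cos(2^0 \pi x),\ \sin(2^1 \pi x),\ \cos(2^1 \pi x),\ \ldots,\ \sin(2^{L-1} \pi x),\ \cos(2^{L-1} \pi x) \,\right]

As a rough PyTorch sketch of how the PE and the 2D network fit together (the layer names are illustrative, and the Sigmoid on the output is my assumption for keeping colors in [0, 1]):

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding that also keeps the original input."""
    def __init__(self, num_frequencies: int):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(num_frequencies) * torch.pi)

    def forward(self, x):                             # x: (N, D)
        scaled = x[..., None] * self.freqs            # (N, D, L)
        enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
        return torch.cat([x, enc.flatten(start_dim=-2)], dim=-1)   # (N, D + 2*D*L)

class Field2D(nn.Module):
    """MLP that maps encoded (x, y) pixel coordinates to an (r, g, b) color."""
    def __init__(self, L: int = 10, hidden: int = 256):
        super().__init__()
        self.pe = PositionalEncoding(L)
        in_dim = 2 + 2 * 2 * L                        # raw xy plus sin/cos per frequency
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),       # colors in [0, 1]
        )

    def forward(self, xy):
        return self.net(self.pe(xy))
```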
We now implement a dataloader that randomly selects N pixels at every training iteration, giving us the pixel coordinates as an Nx2 array and the corresponding RGB values as an Nx3 array. We normalize both for better results: the coordinates as x = x / image_width, y = y / image_height, and the colors as rgbs = rgbs / 255.0.
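A minimal sketch of this sampler (assuming the image is a NumPy uint8 array; the class name and details are illustrative):

```python
import numpy as np

class PixelSampler:
    """Randomly samples N pixels per iteration, returning normalized coords and colors."""
    def __init__(self, image: np.ndarray):
        h, w = image.shape[:2]
        v, u = np.mgrid[0:h, 0:w]                        # all pixel coordinates
        self.coords = np.stack([u.ravel() / w,           # x normalized by image width
                                v.ravel() / h], axis=1)  # y normalized by image height
        self.rgbs = image.reshape(-1, 3) / 255.0         # colors normalized to [0, 1]

    def sample(self, n: int):
        idx = np.random.randint(0, len(self.coords), size=n)
        return (self.coords[idx].astype(np.float32),     # (N, 2)
                self.rgbs[idx].astype(np.float32))       # (N, 3)
```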
We further use the Adam optimizer with a learning rate of 1e-4 and train the model for 1000 iterations, and get the following results:
The below images show the process of optimizing the network to fit on this image:
Here I have chosen the following hyperparameters:
Learning rate: 1e-2
Hidden Size: 256
L: 10
Batch Size: 10,000
10 Iterations
100 Iterations
200 Iterations
400 Iterations
1000 Iterations
Plot showing the training PSNR across iterations
We run the same on a different image, with the same hyperparameters and get the following results:
10 Iterations
100 Iterations
200 Iterations
400 Iterations
1000 Iterations
Plot showing the training PSNR for our second image across iterations
Now we try some hyperparameter tuning. We start by tuning L and set it to 40. We get the following results, which show that setting L to a very high value does not improve the reconstruction:
Our image result after 1000 Iterations
Our training curve for L=40
Next we tune a different parameter and set the hidden dimension to 32 instead of 256. We get the following results, which show that lowering the hidden dimension degrades the result, as the network no longer has enough capacity to fit the image:
Our image result after 1000 Iterations (hidden dimension = 32)
Our training curve for hidden dimension = 32
Part 2: Fit a Neural Radiance Field from Multi-view Images
Now that we are familiar with the basics of NeRF from fitting a 2D image, we can transition to 3D. For this we are going to use the Lego Bulldozer dataset from the original NeRF paper, at a resolution of 200x200.
For this we first need to convert 3D camera coordinates into 3D world coordinates, which we implement with the following function:
Then we implement the function to convert 2D image (pixel) coordinates into 3D camera coordinates:
Afterwards, we can implement the function that turns pixel coordinates into rays. For this function we need K (the intrinsic matrix), the camera-to-world matrix, and the pixel coordinates. We find the camera origin by simply transforming the origin of the camera frame with our camera-to-world function from above.
For finding the ray direction we use the following function:
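That is, r_d = (X_w - r_o) / ||X_w - r_o||, the normalized vector from the camera origin to the world-space point of the pixel at depth 1. A batched NumPy sketch of all three functions (the names follow the descriptions above; the exact signatures in the real code may differ):

```python
import numpy as np

def transform(c2w: np.ndarray, x_c: np.ndarray) -> np.ndarray:
    """Camera-to-world: apply the 4x4 c2w matrix to (N, 3) camera-space points."""
    x_h = np.concatenate([x_c, np.ones_like(x_c[:, :1])], axis=1)    # homogeneous (N, 4)
    return (x_h @ c2w.T)[:, :3]

def pixel_to_camera(K: np.ndarray, uv: np.ndarray, s: np.ndarray) -> np.ndarray:
    """Invert the intrinsics: (u, v) pixels at depth s -> (N, 3) camera coordinates."""
    uv_h = np.concatenate([uv, np.ones_like(uv[:, :1])], axis=1)     # (N, 3)
    return (uv_h @ np.linalg.inv(K).T) * s[:, None]

def pixel_to_ray(K: np.ndarray, c2w: np.ndarray, uv: np.ndarray):
    """Return ray origins and unit ray directions in world space for each pixel."""
    n = uv.shape[0]
    r_o = np.tile(c2w[:3, 3], (n, 1))                                # camera center in world
    x_w = transform(c2w, pixel_to_camera(K, uv, np.ones(n)))         # pixel point at depth 1
    r_d = x_w - r_o
    r_d = r_d / np.linalg.norm(r_d, axis=1, keepdims=True)
    return r_o, r_d
```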
Now we adjust our dataloader to sample rays from the images. We build the full grid of 200x200 (u, v) coordinates for each image and stack them, then apply the previously mentioned functions to get the rays, i.e. the ray origin and ray direction, together with the label/RGB value of each pixel. We sample randomly by simply selecting random entries from this stack. We also shift the pixel coordinates by 0.5 so that we sample from the center of each pixel rather than its corner. Once we have the ray origin and ray direction, we sample multiple points along the ray between a near distance of 2 and a far distance of 6, as sketched below.
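For the point sampling along each ray, a sketch of the idea (near = 2.0 and far = 6.0 as above; the number of samples per ray and the small jitter during training are assumptions, not necessarily the exact settings used here):

```python
import torch

def sample_along_rays(r_o, r_d, n_samples=64, near=2.0, far=6.0, perturb=True):
    """Sample 3D points along each ray between near and far.

    r_o, r_d: (N, 3) ray origins and unit directions.
    Returns points (N, n_samples, 3) and the depths t (N, n_samples).
    """
    t = torch.linspace(near, far, n_samples, device=r_o.device)      # (n_samples,)
    t = t.expand(r_o.shape[0], n_samples)                            # (N, n_samples)
    if perturb:
        # jitter each sample within its bin so training sees the whole interval
        t = t + torch.rand_like(t) * (far - near) / n_samples
    points = r_o[:, None, :] + t[..., None] * r_d[:, None, :]        # (N, n_samples, 3)
    return points, t
```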
We can now use viser to visualize our cameras, rays and points on our rays, to make sure we have the right setup:
This is the visualization of our setup. We display all cameras, choose random rays, and show the points along those rays that we are going to sample. These are the rays and samples we draw at every training iteration.
This is the visualization of only one camera in our setup. We choose random rays from this single camera to see whether the rays all lie inside its frustum.
Now we can continue on to implement the architecture of our NeRF model. We use the following architecture:
This new architecture is very similar to the network we used in 2D, but we extend it to accept 3D input: first a 3D point in space, and second a 3D viewing direction. The outputs are a 3D RGB color and a density value.
We use a Sigmoid to keep the output color within the range (0, 1) and a ReLU to keep the output density non-negative. We also adjust our PE function to accept 3D input, and we concatenate the encoded viewing direction into the RGB output branch.
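A rough PyTorch sketch of this network (the trunk depth, the frequency counts L_x = 10 and L_d = 4, and the head sizes are my assumptions and may differ from the exact architecture shown above):

```python
import torch
import torch.nn as nn

def positional_encoding(x, L):
    """Sinusoidal encoding that also keeps the raw input (same idea as in Part 1)."""
    freqs = 2.0 ** torch.arange(L, device=x.device) * torch.pi
    scaled = x[..., None] * freqs                                    # (..., D, L)
    enc = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)  # (..., D, 2L)
    return torch.cat([x, enc.flatten(start_dim=-2)], dim=-1)

class NeRF(nn.Module):
    """MLP mapping a 3D point and a 3D view direction to an RGB color and a density."""
    def __init__(self, hidden=256, L_x=10, L_d=4):
        super().__init__()
        self.L_x, self.L_d = L_x, L_d
        in_x = 3 + 3 * 2 * L_x
        in_d = 3 + 3 * 2 * L_d
        self.trunk = nn.Sequential(
            nn.Linear(in_x, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Sequential(nn.Linear(hidden, 1), nn.ReLU())   # sigma >= 0
        self.feature = nn.Linear(hidden, hidden)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + in_d, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),                         # rgb in (0, 1)
        )

    def forward(self, x, d):
        h = self.trunk(positional_encoding(x, self.L_x))
        sigma = self.density_head(h)
        # the encoded view direction only influences the color branch
        rgb = self.color_head(torch.cat([self.feature(h),
                                         positional_encoding(d, self.L_d)], dim=-1))
        return rgb, sigma
```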
Once we have set up our network, we can create our volume rendering function. The actual (continuous) volume rendering equation is as follows:
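Written out, this is the rendering integral from the original NeRF paper:

C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)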
Since we cannot integrate over continuous space, we use the discrete version of this equation:
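In discrete form this becomes a quadrature over the N samples per ray, where \delta_i is the distance between adjacent samples:

\hat{C}(r) = \sum_{i=1}^{N} T_i \left(1 - e^{-\sigma_i \delta_i}\right) c_i, \qquad T_i = \exp\!\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)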
We can understand this through an analogy: as the ray travels through the volume, it picks up color depending on the density and color at each sample, and ends up with a saturated final color.
Here c_i is the color obtained from the network at sample i, and T_i is the probability that the ray does not terminate before reaching sample location i.
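A sketch of this discrete compositing in PyTorch (assuming a uniform spacing delta between samples; a real implementation could instead compute the per-sample spacing from the depths returned by sample_along_rays):

```python
import torch

def volrend(sigmas, rgbs, step_size):
    """Composite per-sample colors into per-ray colors.

    sigmas: (N_rays, N_samples, 1) densities, rgbs: (N_rays, N_samples, 3) colors,
    step_size: distance delta_i between adjacent samples along each ray.
    """
    alphas = 1.0 - torch.exp(-sigmas * step_size)                     # (N_rays, N_samples, 1)
    # T_i = prod_{j < i} (1 - alpha_j): probability the ray survives up to sample i
    ones = torch.ones_like(alphas[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = trans * alphas                                          # (N_rays, N_samples, 1)
    return (weights * rgbs).sum(dim=1)                                # (N_rays, 3)
```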
We then implement the training loop, which uses the previously mentioned dataset functions to get the ray origins and ray directions from our intrinsic matrix, pixel coordinates, and camera-to-world matrices. We then feed the sampled points and ray directions into the network to get the color and density at each point, and composite them into pixel colors with the volume rendering function.
We train for 1000 iterations with a batch size of 10,000 rays and a learning rate of 1e-3. In the following we can see the training loop and the results over the course of training; these specifically show not a training sample but an unseen validation sample:
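Roughly, each training step ties the sketches above together like this (dataset.sample_rays is a hypothetical helper returning ray origins, directions, and ground-truth colors; the PSNR formula assumes colors in [0, 1]):

```python
import torch

model = NeRF().cuda()                                   # the network sketched above, on GPU
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for it in range(1000):
    # rays_o, rays_d: (B, 3) rays; pixels: (B, 3) ground-truth colors
    rays_o, rays_d, pixels = dataset.sample_rays(10_000)
    points, t = sample_along_rays(rays_o, rays_d)       # (B, S, 3), (B, S)
    dirs = rays_d[:, None, :].expand_as(points)         # same view direction per sample
    rgbs, sigmas = model(points, dirs)
    pred = volrend(sigmas, rgbs, step_size=(6.0 - 2.0) / points.shape[1])
    loss = torch.nn.functional.mse_loss(pred, pixels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    psnr = -10.0 * torch.log10(loss)                    # tracked for the plot below
```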
After 10 Iterations
After 100 Iterations
After 150 Iterations
After 300 Iterations
After 800 Iterations
Plot showing the validation PSNR of our 3D NeRF model across iterations
This is our final result. I have used the cameras from our test dataset to render images and create a spinning 3D gif view of our Lego Bulldozer.
Gif showing the Lego Bulldozer from different angles
For the final part we implement the bells and whistles of rendering depth. It is very similar to the volume rendering described above, except that instead of compositing the per-sample colors into a pixel color, we composite the per-sample depths (each sample's distance along the ray) into a per-pixel depth using the same weights. Since depth is only a scalar, we normalize it and render it in grayscale.
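A sketch of this depth compositing (reusing the weights from volrend above; the per-ray depth is normalized here before rendering it as a grayscale frame):

```python
import torch

def volrend_depth(sigmas, t, step_size):
    """Composite per-sample depths t (N_rays, N_samples) into a normalized per-ray depth."""
    alphas = 1.0 - torch.exp(-sigmas * step_size)                     # (N_rays, N_samples, 1)
    ones = torch.ones_like(alphas[:, :1])
    trans = torch.cumprod(torch.cat([ones, 1.0 - alphas + 1e-10], dim=1), dim=1)[:, :-1]
    weights = (trans * alphas)[..., 0]                                # (N_rays, N_samples)
    depth = (weights * t).sum(dim=1)                                  # expected ray depth
    return (depth - depth.min()) / (depth.max() - depth.min() + 1e-10)
```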
The following shows our gif of the depth of the Lego Bulldozer:
Gif showing the depth of the Lego Bulldozer from different angles