From Dust

2024 - Virtual reality opera.

Composer Michel van der Aa, together with vocal ensemble Sjaella, wanted to create an opera experience in virtual reality where the viewer is the protagonist in the story.

In this VR experience, winner of 'Best Immersive Work' at the Cannes Film Festival 2025, the viewer answers a personal questionnaire before going in. The answers influence how the viewer goes through the experience, for instance by changing how the digital singers react to you, or how the virtual worlds look and feel. Based on this questionnaire, an external PC also generates virtual spaces and environments and injects them into the experience, using a locally running open source (CC0) neural network with custom middleware. On top of all that, the viewer has the autonomy to choose different paths and portals to walk through, which adds even more choice and variety. All the while they are being sung to by 6 virtual singers in directional audio, and they hear different music mixes based on where they are in the virtual world and the choices they make.

Experience stills

Development Highlights

I was brought on as a VFX artist to work on certain particle and shader effects for this project. But during development, I got involved with other aspects of the project as well, and fell into the role of lead technical artist.

While we were talking about getting the singers' facial expressions into the virtual 3D space and going over industry standards, the director mentioned that, for a while now, he had had the idea to project an actual video of the singer's performance onto the face of the 3D model. He also described the difficulties they had run into before when trying to make this a reality.

Research

I started doing some research to see if this was possible and achievable within the time, team size and budget.

/media/from_dust/last_of_us_face_capture.jpg

Modern video game facial performance capture for real time rendering usually uses a camera mounted in front of the face of the performer. This recording of the performer is then used as data to animate the rigged face of the 3D model.

Doing this well, without falling into the uncanny valley, requires a lot of tech (there are multiple off-the-shelf solutions) and a lot of manual work and expertise. It is also not the actual performance: it's a translated visualisation of the facial performance, mapped onto a 3D model of a face that resembles the performer's, or onto a different face entirely.

To my knowledge, the only production that has successfully done true face video capture into real time rendering is the videogame L.A. Noire (developed by Team Bondi, published by Rockstar Games) in 2011. The technology, called MotionScan, had an elaborate and expensive setup with dozens of cameras, and the performer needed to be stationary while recording.

Making this tech viable took a lot of extra work and money, which is a sign that the standards used today are a far more reliable and cost effective way to capture performances.

/media/from_dust/neural_network_visual.png

So how were we going to get this to work in 2024? Building a custom neural network solution ourselves would take a lot of time and resources. Plus, the output would not represent the actual face of the singer, and would give off an uncanny valley feel with no real way to address the issues. Current weighted neural network technology is a black box solution with limited control, even with proper middleware and targeted model training. Rendering hours of footage this way consistently would also be a major challenge.

/media/from_dust/azure_kinect_headmount.jpg

We looked into using a Microsoft Azure Kinect to capture 3D data of a face. We did some tests, but found that the resolution of the camera and the size and weight of the device would make stable capture during live performance very difficult.

Then I remembered that the latest iPhone models have a depth sensor in the front-facing camera module. The resolution of the iPhone selfie camera is high enough to get a decent picture, but the output from the depth sensor is very low resolution, misses data at sharp angles, and is very noisy. Even with these caveats, we still thought this would be the best solution within budget, time, and comfort for the performers.

Plus, if this risky experiment did not work out, we would have a backup plan: use the captured footage with the industry standard technology mentioned above. After a lot of discussion we set out to do a test with one of the singers, to see how workable this idea would be.

Prototyping

After trying out several apps to capture picture and depth data simultaneously at the best possible quality, we chose Record3D, since it could export the depth data as a 32-bit EXR with floating-point precision. The video would be shot at 30 fps if we wanted to use the maximum resolution of the camera.

As you can see in the capture example of one image frame next to this paragraph, the resolution of the depth map is very low, and it has a lot of imperfections. Overcoming this became one of the biggest challenges of this project.

After setting up rudimentary tooling in WebGL (more on this later) to process and convert the data into something that we could stream into Unity, we set out to test several methods of “projecting” the video footage as a face with dimensionality.

/media/from_dust/first_realtime_lighting_tests.jpg

Here you can see 2 methods tested with dynamic lighting using the rough output from the tooling.

On the left is the 3D model of the singer as a traditional mesh based on a 3D scan.

In the middle sits a VFX graph object generating particles at certain depths based on the frame data.

On the right is a projection on a segmented mesh plane using vertex and fragment shaders. As you can see, the data from the depth image is very unstable, and lacks detail, especially around the eyes.
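The idea behind the projection on the right can be illustrated outside of a shader. The Python below is only a minimal sketch, not the project's vertex shader: it builds a segmented plane and pushes each vertex along z by a bilinearly sampled depth value. The function names (`sample_bilinear`, `displace_plane`), the `depth_scale` parameter and the plain list-of-lists depth map are all hypothetical and chosen for illustration.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]


def sample_bilinear(depth: List[List[float]], u: float, v: float) -> float:
    """Sample a depth map at normalised coordinates (u, v) in [0, 1]."""
    rows, cols = len(depth), len(depth[0])
    x = u * (cols - 1)
    y = v * (rows - 1)
    x0, y0 = int(x), int(y)
    x1 = min(x0 + 1, cols - 1)
    y1 = min(y0 + 1, rows - 1)
    fx, fy = x - x0, y - y0
    top = depth[y0][x0] * (1 - fx) + depth[y0][x1] * fx
    bottom = depth[y1][x0] * (1 - fx) + depth[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def displace_plane(
    depth: List[List[float]],
    segments_x: int,
    segments_y: int,
    depth_scale: float = 1.0,  # hypothetical artistic control
) -> List[Vertex]:
    """Return the vertices of a segmented plane displaced along z by depth.

    The plane spans [0, 1] x [0, 1] and has (segments_x + 1) * (segments_y + 1)
    vertices, listed row by row.
    """
    if segments_x < 1 or segments_y < 1:
        raise ValueError("the plane needs at least one segment per axis")
    vertices: List[Vertex] = []
    for j in range(segments_y + 1):
        v = j / segments_y
        for i in range(segments_x + 1):
            u = i / segments_x
            z = sample_bilinear(depth, u, v) * depth_scale
            vertices.append((u, v, z))
    return vertices


# Tiny example: a 2 x 2 depth map that ramps from 0 (left) to 1 (right).
ramp = [[0.0, 1.0], [0.0, 1.0]]
plane = displace_plane(ramp, segments_x=2, segments_y=1)
```

Because every vertex takes its z value straight from the depth image, any noise or hole in that image shows up directly as a spike or dent in the surface, which is consistent with the instability visible in the test render.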

The initial tests revealed that the fidelity of the depth data was not going to be enough as it was. I really wanted to give this idea of projecting the face a proper shot, so after some convincing, I went and created an elaborate tooling environment to process the captured data. After weeks of research, contemplation, testing, and a lot of ah-ha moments in the shower, we had a stable, workable solution!

I created a specialized tool GUI in JavaScript using WebGL and WebAssembly to process all the data we needed. I built it in JavaScript for the flexibility, the wide support, and especially the fast PNG image compression in most browsers. A PHP backend saves the image data blobs using multiple threads in the background, to spread the workload as much as possible during rendering. It all runs on a local Apache session, so it could easily be set up on any Mac or PC.

/media/from_dust/from_dust_custom_tooling_preview.jpg

This tool has a lot of features and settings using a comprehensive process pipeline spread out in 30+ (compute) shader passes including:

Multi frame interpolation and de-noising
Radial fill and smoothing of missing pixels in depth data
Face landmarking and stabilization (using MediaPipe Face Mesh)
Masking eyes, lips, teeth, mouth areas for PBR material properties
Teeth detection and depth correction / align with closest lip depth
SDF depth fill out for areas out of view from the camera, or missing depth fidelity (for eyes, eyelids, cheeks, chin, jaw and neck)
(Vertex) Normal map generation using interpolated depth data and face mesh
Vertical / horizontal depth offset data layer to have depth control on all 3 axes
Masking headset and environment using a radial noise fill
Selfie lens corrections to get close to a parallel view of the face
Per frame dynamic face gradient mask depending on the expression to blend with the rest of the 3D model
and a lot more...
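To give a feel for what one of these passes does, here is a minimal Python sketch of filling missing depth pixels from their valid neighbours. It is a simple stand-in technique (an iterative neighbour-average fill that grows inward from the edges of a hole), not the radial fill used in the tool. The function name, the `missing` sentinel value and the `max_passes` limit are assumptions made for this example.

```python
from typing import List

DepthMap = List[List[float]]


def fill_missing_depth(
    depth: DepthMap,
    missing: float = 0.0,   # assumed sentinel for "no depth data"
    max_passes: int = 64,   # hypothetical safety limit
) -> DepthMap:
    """Fill missing pixels with the average of their valid 8-neighbours.

    Each pass fills only the pixels that touch valid data, so holes close
    from the outside in. The input is not modified.
    """
    rows, cols = len(depth), len(depth[0])
    current = [row[:] for row in depth]
    for _ in range(max_passes):
        updated = [row[:] for row in current]
        filled_any = False
        for y in range(rows):
            for x in range(cols):
                if current[y][x] != missing:
                    continue
                # Collect valid values in the 3 x 3 window around (x, y).
                neighbours = [
                    current[ny][nx]
                    for ny in range(max(0, y - 1), min(rows, y + 2))
                    for nx in range(max(0, x - 1), min(cols, x + 2))
                    if current[ny][nx] != missing
                ]
                if neighbours:
                    updated[y][x] = sum(neighbours) / len(neighbours)
                    filled_any = True
        current = updated
        if not filled_any:
            break  # nothing left to fill (or nothing valid to fill from)
    return current


# A one-pixel hole in the middle of a flat surface.
holed = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
filled = fill_missing_depth(holed)
```

In a shader pipeline the same idea maps naturally onto ping-pong frame buffers: each pass reads the previous result and writes the slightly more complete one.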

Because almost all processes are run through (compute) shaders or WebAssembly, the entire pipeline is extremely fast, and results display in near real time with a semi-decent GPU. This makes processing thousands of frames actually doable within a couple of hours. We could run parallel instances using multiple browser windows, and it ran very stably in a Chromium-based browser (V8 engine).

In the end, we did not have to build / train any custom neural network on our data. I was surprised how much can be done by just layering fragment shaders, combining multiple frame buffers to interpolate and generate missing image data. This gave us a very stable output, super fast rendering, and a lot of control.
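As a rough illustration of combining multiple frames to recover missing data, the Python sketch below blends each depth pixel with the same pixel in neighbouring frames, skipping samples that have no data. This is an assumed, simplified version of the idea, not the project's shader code: the function name, the `radius` window and the distance-based weights are hypothetical.

```python
from typing import List

DepthMap = List[List[float]]


def temporal_denoise(
    frames: List[DepthMap],
    index: int,
    radius: int = 1,       # hypothetical: how many frames to look back/ahead
    missing: float = 0.0,  # assumed sentinel for "no depth data"
) -> DepthMap:
    """Blend frame `index` with its temporal neighbours, per pixel.

    Frames closer in time get a higher weight. Missing samples are ignored,
    so a hole in one frame is filled from the frames around it.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    first = max(0, index - radius)
    last = min(len(frames) - 1, index + radius)
    out = [[missing] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            total = 0.0
            weight_sum = 0.0
            for f in range(first, last + 1):
                value = frames[f][y][x]
                if value == missing:
                    continue
                weight = 1.0 / (1 + abs(f - index))
                total += value * weight
                weight_sum += weight
            if weight_sum > 0.0:
                out[y][x] = total / weight_sum
    return out


# The middle frame has a hole; it is rebuilt from the frames before and after.
sequence = [[[1.0]], [[0.0]], [[3.0]]]
repaired = temporal_denoise(sequence, index=1)
```

A plain blend like this smears fast motion, which is one reason a real pipeline would stabilize the face first, for example with landmarks, before combining frames.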

Production

Feeling confident enough about the tech, it was time to shoot the entire opera. We had 3 days to shoot all 6 singers in all possible outcomes and variants. Everything was choreographed and marked out with colored tape in a grid on the floor. The singers would wear full motion tracking suits, and a headset with an iPhone for facial recording.

In between recordings I would run the footage through the tooling, and make adjustments to the tooling on the fly. I remember the motion tracking crew looking over my shoulder and asking what software I was using, and I told them I had built it myself. The expression on their faces is something I can still recall vividly.

It was an intense couple of days. Hearing the singing up close, feeling it in your chest, was quite an emotional experience for me. Seeing the results was very satisfying.

Since we have 6 virtual singers performing at once, we had to find a solution to store all the different face data layers needed, and be able to run them all in sync with the music. Running 6 separate videos created sync issues, so we decided to run all 6 singers in one giant 8K (8192 x 4320 pixels) video. This was the maximum the H.265 codec could support, and gave us enough fidelity to work with. Having all the data in one RenderTexture did create quite a lot of UV mapping complexity, since we have 2 dimensions to map: the singer, and then the specific data part of that singer.
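The two-level mapping can be sketched as a small function. The layout used here is purely hypothetical (singers in a 2 x 3 grid, each singer cell split into a 2 x 2 grid of data layers); the actual atlas layout is not described in this write-up. The sketch only shows how a local UV coordinate is remapped into the shared texture.

```python
from typing import Tuple


def atlas_uv(
    singer: int,
    layer: int,
    u: float,
    v: float,
    singer_cols: int = 2,  # hypothetical layout: 2 x 3 singers
    singer_rows: int = 3,
    layer_cols: int = 2,   # hypothetical layout: 2 x 2 layers per singer
    layer_rows: int = 2,
) -> Tuple[float, float]:
    """Map a local (u, v) in [0, 1] to a coordinate in the shared atlas.

    First dimension: which singer cell. Second dimension: which data layer
    tile inside that cell.
    """
    if not 0 <= singer < singer_cols * singer_rows:
        raise ValueError("singer index outside the atlas")
    if not 0 <= layer < layer_cols * layer_rows:
        raise ValueError("layer index outside the singer cell")

    singer_x, singer_y = singer % singer_cols, singer // singer_cols
    layer_x, layer_y = layer % layer_cols, layer // layer_cols

    cell_w, cell_h = 1.0 / singer_cols, 1.0 / singer_rows
    tile_w, tile_h = cell_w / layer_cols, cell_h / layer_rows

    atlas_u = singer_x * cell_w + layer_x * tile_w + u * tile_w
    atlas_v = singer_y * cell_h + layer_y * tile_h + v * tile_h
    return atlas_u, atlas_v


# Centre of the first layer of the second singer.
centre = atlas_uv(singer=1, layer=0, u=0.5, v=0.5)
```

With this assumed layout, each layer tile in an 8192 x 4320 frame would be 2048 x 720 pixels. Keeping the remap in one place means the shaders only need a singer index and a layer index.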

/media/from_dust/from_dust_singer_vfx_controller.jpg

Combining both visual methods (custom shaders on meshes, and VFX particle systems) created a way to blend between a rigid and fluid form of the singer. With this we could change their form based on the world narrative and interactions with the viewer.

I built a massive controller script for each singer that would manage all parts of the shaders and VFX systems on the model while exposing a limited amount of values that can be animated on the timeline, or through other scripts.

Each singer model consists of multiple skinned meshes that each have multiple materials and shaders. Each mesh has a corresponding VFX object that generates the dust particles based on the position of the skinned mesh and the “fluid state” of the singer.
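The pattern of one exposed value driving many internal settings can be sketched as follows. This is a language-neutral illustration written in Python, not the project's controller script: the class names, the `mesh_opacity` and `particle_spawn_rate` fields, and the linear blend are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MeshPart:
    """One skinned mesh with its paired particle system (hypothetical fields)."""
    name: str
    mesh_opacity: float = 1.0
    particle_spawn_rate: float = 0.0


@dataclass
class SingerController:
    """Exposes a single animatable value and fans it out to every part."""
    parts: List[MeshPart] = field(default_factory=list)
    max_spawn_rate: float = 1000.0  # hypothetical particles per second

    def set_fluid_state(self, fluid_state: float) -> None:
        """0.0 is fully rigid (mesh only), 1.0 is fully fluid (dust only)."""
        t = min(1.0, max(0.0, fluid_state))  # clamp to the valid range
        for part in self.parts:
            part.mesh_opacity = 1.0 - t
            part.particle_spawn_rate = t * self.max_spawn_rate


controller = SingerController(parts=[MeshPart("head"), MeshPart("torso")])
controller.set_fluid_state(0.25)
```

Animating one value on a timeline is much easier to manage than keyframing every material and particle property separately, which is the appeal of this kind of controller.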

The control script overwrites and generates all these assets based on the singer model that is imported. When we needed to update the models, and had to do a new import, I just had to run a few script commands, and it would overwrite and repopulate all the custom shaders and VFX objects, saving a lot of time and effort.

/media/from_dust/from_dust_faces_with_motion.jpg

After a few weeks of processing the huge amount of data, timing all the motion capture and videos to the music tracks, we were finally at the stage to see it all work together.

Seeing the actual video of the face combined with the motion capture on the model really solidified the performance and nuance of the singers. The moment it all clicked, we breathed a huge sigh of relief.

There was still a huge amount of refinement and tweaking involved (there will always be something to improve), but seeing it finally move out of the uncanny valley was a feeling of immense satisfaction, especially given the time constraints and budget we had.