Technical Art & Creative Technology

From Dust


2024 - Virtual reality opera.

Composer Michel van der Aa, together with vocal ensemble Sjaella, wanted to create an opera experience in virtual reality where the viewer is the protagonist in the story.

In this VR experience, winner of the Cannes Film Festival 2025 award for best immersive work, the viewer answers a personal questionnaire before entering. The answers influence how the viewer goes through the experience, for instance by changing how the digital singers react to you, or how the virtual worlds look and feel. Based on the same questionnaire, an external PC generates virtual spaces and environments and injects them into the experience, using a locally running open source (CC0) neural network with custom middleware. On top of all that, the viewer is free to choose different paths and portals to walk through, adding even more choice and variety. All the while, they are sung to by 6 virtual singers in directional audio, and hear different music mixes based on where they are in the virtual world and the choices they make.


Experience stills


Development Highlights


I was brought on as a VFX artist to work on certain particle and shader effects for this project. But during development, I got involved with other aspects of the project as well, and fell into the role of lead technical artist.

While we were talking about getting the singers' facial expressions into the virtual 3D space and going over industry standards, the director mentioned that, for a while now, he had wanted to project an actual video of each singer's performance onto the face of the 3D model. He also described the difficulties they had run into before, trying to make this a reality.



Research



I started doing some research to see whether this was possible and achievable within the time, team size and budget.


/media/from_dust/last_of_us_face_capture.jpg
Modern video game facial performance capture for real time rendering usually uses a camera mounted in front of the face of the performer. This recording of the performer is then used as data to animate the rigged face of the 3D model.

To do this well, and not fall into the uncanny valley, requires a lot of tech (there are multiple off-the-shelf solutions), and a lot of hand work and expertise. Also, this is not the actual performance, it's a translated visualisation of the facial performance onto the 3D model of a face resembling that of the performer, or a different face entirely.
/media/from_dust/la_noire_capture.jpg
To my knowledge, the only company that has successfully done true face video capture into real time rendering is Rockstar Games with the videogame L.A. Noire in 2011. The technology called MotionScan had an elaborate expensive setup with dozens of cameras, and the performer needed to be stationary while recording.

It took a lot of extra work and money to make this tech viable, which suggests that the standards used today are a far more reliable and cost-effective way to capture performances.
/media/from_dust/neural_network_visual.png
So how were we going to get this to work in 2024? Building a custom neural network solution ourselves would take a lot of time and resources. Plus, the output would not represent the actual face of the singer, and would give off an uncanny valley feel with no real way to address the issues. Current weighted neural network technology is a black box solution with limited control, even with proper middleware and targeted model training. Also, consistently rendering hours of footage this way would be a major challenge.
/media/from_dust/azure_kinect_headmount.jpg
We looked into using a Microsoft Azure Kinect to capture 3D data of a face. We did some tests, but found that the resolution of the camera and the size and weight of the device would make stable capture during live performance very difficult.

Then I remembered that the latest iPhone models have a depth sensor in the front-facing camera module. The resolution of the iPhone selfie camera is high enough to get a decent picture, but the output from the depth sensor is very low resolution, misses data at sharp angles, and is very noisy. Even with these caveats, we still thought this would be the best solution within budget, time, and comfort for the performers.

Plus, if this risky experiment did not work out, we would have a backup plan: use the captured footage with the industry standard technology mentioned above. After a lot of discussion we set out to do a test with one of the singers, to see how workable this idea would be.



Prototyping


After trying out several apps to capture picture and depth data simultaneously at the best possible quality, we chose Record3D, since it could export the depth data as 32-bit EXR with floating-point precision. The video would be shot at 30 fps if we wanted to use the maximum resolution of the camera.


/media/from_dust/source_image_data.png
As you can see in the capture example of one image frame next to this paragraph, the resolution of the depth map is very low, and it has a lot of imperfections. Overcoming this became one of the biggest challenges of this project.

After setting up rudimentary tooling in WebGL (more on this later) to process and convert the data into something that we can stream into Unity, we set out to test several methods of “projecting” the video footage as a face with dimensionality.


/media/from_dust/first_realtime_lighting_tests.jpg
Here you can see 2 methods tested with dynamic lighting using the rough output from the tooling.

On the left is the 3D model of the singer as a traditional mesh based on a 3D scan.

In the middle sits a VFX graph object generating particles at certain depths based on the frame data.

On the right is a projection on a segmented mesh plane using vertex and fragment shaders. As you can see, the data from the depth image is very unstable, and lacks detail, especially around the eyes.
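To make the idea of projecting onto a segmented mesh plane more concrete, here is a minimal Python sketch of the displacement step. It is only an illustration under stated assumptions: in the project this ran in vertex and fragment shaders inside Unity, while this version builds the displaced grid on the CPU. The function name `displaced_grid`, the `depth_scale` parameter and the choice to push vertices along negative Z are all hypothetical.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[int, int, int]


def displaced_grid(
    depth: List[List[float]], depth_scale: float = 1.0
) -> Tuple[List[Vec3], List[Triangle]]:
    """Build a grid mesh with one vertex per depth sample.

    X and Y span a unit plane centred on the origin; each vertex is pushed
    along -Z by its depth sample (hypothetical convention). Needs at least
    a 2x2 depth map.
    """
    rows, cols = len(depth), len(depth[0])
    if rows < 2 or cols < 2:
        raise ValueError("depth map must be at least 2x2")

    vertices: List[Vec3] = []
    for y in range(rows):
        for x in range(cols):
            u = x / (cols - 1)  # 0..1 across the plane
            v = y / (rows - 1)  # 0..1 down the plane
            vertices.append((u - 0.5, 0.5 - v, -depth[y][x] * depth_scale))

    # Two triangles per grid cell, indexing into the vertex list.
    triangles: List[Triangle] = []
    for y in range(rows - 1):
        for x in range(cols - 1):
            i = y * cols + x
            triangles.append((i, i + cols, i + 1))
            triangles.append((i + 1, i + cols, i + cols + 1))
    return vertices, triangles


verts, tris = displaced_grid([[0.0, 0.5], [1.0, 0.25]], depth_scale=2.0)
print(len(verts), len(tris))
```

Because every vertex takes its offset directly from one depth sample, any noise or hole in the depth image shows up as a spike or dent in the surface, which is why the unstable depth data was so visible in this test.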

The initial tests revealed that the current depth data fidelity was not going to be enough. I really wanted to give this idea of projecting the face a proper shot, so after some convincing, I went and created an elaborate tooling environment to process the captured data. After weeks of research, contemplation, testing, and a lot of ah-ha moments in the shower, we had a stable, workable solution!



I created a specialized tool GUI in JavaScript using WebGL and WebAssembly to process all the data we needed. I built it in JavaScript for the flexibility, the wide support, and especially the fast PNG image compression in most browsers. A PHP backend saves the image data blobs using multiple threads in the background, to spread the workload as much as possible during rendering. It all runs on a local Apache session, so it can easily be set up on any Mac or PC.


/media/from_dust/from_dust_custom_tooling_preview.jpg
This tool has a lot of features and settings, built on a comprehensive processing pipeline spread out over 30+ (compute) shader passes, including:
  • Multi frame interpolation and de-noising
  • Radial fill and smoothing of missing pixels in depth data
  • Face landmarking and stabilization (using MediaPipe Face Mesh)
  • Masking eyes, lips, teeth, mouth areas for PBR material properties
  • Teeth detection and depth correction / align with closest lip depth
  • SDF depth fill out for areas out of view from the camera, or missing depth fidelity (for eyes, eyelids, cheeks, chin, jaw and neck)
  • (Vertex) Normal map generation using interpolated depth data and face mesh
  • Vertical / horizontal depth offset data layer to have depth control on all 3 axes
  • Masking headset and environment using a radial noise fill
  • Selfie lens corrections to get close to a parallel view of the face
  • Per frame dynamic face gradient mask depending on the expression to blend with the rest of the 3d model
  • and a lot more...
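As an illustration of the kind of work one of these passes does, below is a minimal Python sketch of filling missing depth pixels. It is not the radial fill from the tool: it swaps in a simpler neighbour-average fill that grows inward from the valid pixels, one ring per pass. The function name `fill_holes`, the convention that `0.0` marks a missing sample, and the `max_passes` limit are assumptions made for this example.

```python
from typing import List

Depth = List[List[float]]


def fill_holes(depth: Depth, missing: float = 0.0, max_passes: int = 64) -> Depth:
    """Fill missing samples with the average of their valid 8-neighbours.

    Each pass only reads the result of the previous pass, the same way a
    ping-pong pair of frame buffers would, so holes close one ring at a time.
    """
    rows, cols = len(depth), len(depth[0])
    current = [row[:] for row in depth]

    for _ in range(max_passes):
        holes = [
            (y, x)
            for y in range(rows)
            for x in range(cols)
            if current[y][x] == missing
        ]
        if not holes:
            break

        nxt = [row[:] for row in current]
        changed = False
        for y, x in holes:
            neighbours = [
                current[ny][nx]
                for ny in range(max(0, y - 1), min(rows, y + 2))
                for nx in range(max(0, x - 1), min(cols, x + 2))
                if (ny, nx) != (y, x) and current[ny][nx] != missing
            ]
            if neighbours:
                nxt[y][x] = sum(neighbours) / len(neighbours)
                changed = True
        current = nxt
        if not changed:  # nothing valid to grow from
            break
    return current


print(fill_holes([[1.0, 0.0, 3.0]]))
```

A real pass would also smooth the filled region afterwards, since a plain average leaves visible plateaus in larger holes.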

Because almost all processes run through (compute) shaders or WebAssembly, the entire pipeline is extremely fast, and results display in near real time on a semi-decent GPU. This makes processing thousands of frames actually doable within a couple of hours. We could run parallel instances using multiple browser windows, and it ran very stably in a browser running the Chrome V8 engine.

In the end, we did not have to build / train any custom neural network on our data. I was surprised how much can be done by just layering fragment shaders, combining multiple frame buffers to interpolate and generate missing image data. This gave us a very stable output, super fast rendering, and a lot of control.
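To give a feel for how combining multiple frame buffers can recover missing data without any neural network, here is a minimal Python sketch of a temporal filter. It is a simplified stand-in, not the interpolation used in the tool: for each pixel it takes the median of the valid samples in a small window of neighbouring frames. The name `temporal_median`, the `radius` parameter and the use of `0.0` as the missing-sample marker are assumptions made for this example.

```python
from statistics import median
from typing import List

Frame = List[List[float]]


def temporal_median(
    frames: List[Frame], index: int, radius: int = 2, missing: float = 0.0
) -> Frame:
    """Denoise frame `index` using the frames up to `radius` steps around it.

    Missing samples are skipped, so a pixel that is absent in one frame is
    rebuilt from the same pixel in nearby frames. The median rejects single
    frame spikes that an average would smear into the result.
    """
    lo = max(0, index - radius)
    hi = min(len(frames), index + radius + 1)
    window = frames[lo:hi]
    rows, cols = len(window[0]), len(window[0][0])

    out: Frame = []
    for y in range(rows):
        row = []
        for x in range(cols):
            samples = [f[y][x] for f in window if f[y][x] != missing]
            row.append(median(samples) if samples else missing)
        out.append(row)
    return out


# One pixel over five frames: a spike (9.0) and a dropout (0.0).
frames = [[[1.0]], [[9.0]], [[0.0]], [[2.0]], [[3.0]]]
print(temporal_median(frames, index=2))
```

A plain per-pixel window like this only holds up when the face is stabilized first, which is one reason to run landmarking and stabilization before any multi-frame step.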



Production


/media/from_dust/recording_motion.jpg
Feeling confident enough about the tech, it was time to shoot the entire opera. We had 3 days to shoot all 6 singers in all possible outcomes and variants. Everything was choreographed and marked out with colored tape in a grid on the floor. The singers would wear full motion tracking suits, and a headset with an iPhone for facial recording.

In between recordings I would run the footage through the tooling, and make adjustments to the tooling on the fly. I remember the motion tracking crew looking over my shoulder asking what software I was using, and I told them I had built it myself. The expression on their faces was something I can still recall vividly.

It was an intense couple of days. Hearing the singing up close, feeling it in your chest, was quite an emotional experience for me. Seeing the results was very satisfying.
/media/from_dust/video_face_data.jpg
Since we have 6 virtual singers performing at once, we had to find a solution to store all the different face data layers needed, and be able to run them all in sync with the music. Running 6 separate videos created sync issues, so we decided to run all 6 singers in one giant 8K (8192 x 4320 pixels) video. This was the maximum the H.265 codec could support, and gave us enough fidelity to work with. Having all the data in one RenderTexture did create quite a lot of UV mapping complexity, since we have 2 dimensions to map: the singer, and then the specific data part of this singer.
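The two-level UV lookup can be sketched as follows. This is a minimal Python illustration, not the project's shader code, and the atlas layout is entirely hypothetical: it assumes the 6 singers sit in a 3 x 2 grid, and that each singer's cell is split into a 2 x 2 grid of data layers (for example colour, depth, normals and masks). The names `AtlasLayout`, `tile_rect` and `atlas_uv` are made up for this example.

```python
from dataclasses import dataclass
from typing import Tuple

Rect = Tuple[float, float, float, float]  # u_offset, v_offset, u_scale, v_scale


@dataclass(frozen=True)
class AtlasLayout:
    # Hypothetical layout: 3 x 2 singer cells, each with 2 x 2 data layers.
    singer_cols: int = 3
    singer_rows: int = 2
    layer_cols: int = 2
    layer_rows: int = 2


def tile_rect(layout: AtlasLayout, singer: int, layer: int) -> Rect:
    """Return the normalized atlas rectangle for one layer of one singer."""
    if not 0 <= singer < layout.singer_cols * layout.singer_rows:
        raise ValueError("singer index out of range")
    if not 0 <= layer < layout.layer_cols * layout.layer_rows:
        raise ValueError("layer index out of range")

    # First dimension: which singer cell. Second: which layer inside it.
    singer_col, singer_row = singer % layout.singer_cols, singer // layout.singer_cols
    layer_col, layer_row = layer % layout.layer_cols, layer // layout.layer_cols

    u_scale = 1.0 / (layout.singer_cols * layout.layer_cols)
    v_scale = 1.0 / (layout.singer_rows * layout.layer_rows)
    col = singer_col * layout.layer_cols + layer_col
    row = singer_row * layout.layer_rows + layer_row
    return (col * u_scale, row * v_scale, u_scale, v_scale)


def atlas_uv(rect: Rect, u: float, v: float) -> Tuple[float, float]:
    """Map a local 0..1 face UV into the shared atlas texture."""
    return (rect[0] + u * rect[2], rect[1] + v * rect[3])


layout = AtlasLayout()
rect = tile_rect(layout, singer=4, layer=1)
print(rect, atlas_uv(rect, 0.5, 0.5))
```

Returning an offset and a scale per tile keeps the per-pixel work down to one multiply-add on each axis, which is the shape a shader would want. A real setup would also need to account for which corner the engine treats as the UV origin, and leave padding between tiles so video compression does not bleed one layer into the next.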
/media/from_dust/from_dust_singer_vfx_controller.jpg
Combining both visual methods (custom shaders on meshes, and VFX particle systems) created a way to blend between a rigid and fluid form of the singer. With this we could change their form based on the world narrative and interactions with the viewer.

I built a massive controller script for each singer that would manage all parts of the shaders and VFX systems on the model while exposing a limited amount of values that can be animated on the timeline, or through other scripts.

Each singer model consists of multiple skinned meshes that each have multiple materials and shaders. Each mesh has a corresponding VFX object that generates the dust particles based on the position of the skinned mesh and the “fluid state” of the singer.

The control script overwrites and generates all these assets based on the singer model that is imported. When we needed to update the models, and had to do a new import, I just had to run a few script commands, and it would overwrite and repopulate all the custom shaders and VFX objects, saving a lot of time and effort.
/media/from_dust/from_dust_faces_with_motion.jpg
After a few weeks of processing the huge amount of data, timing all the motion capture and videos to the music tracks, we were finally at the stage to see it all work together.

Seeing the actual video of the face combined with the motion capture on the model really solidifies the performance and nuance of the singers. The moment it all clicked we had a huge sigh of relief.

There was still a huge amount of refinement and tweaking involved (there will always be something to improve), but seeing it finally move out of the uncanny valley was a feeling of immense satisfaction. Especially working with the time constraints and budget we had.

Official Trailer


Watch the official project trailer for the Cannes Film Festival viewing.

Behind the scenes video


Take a look behind the scenes of the entire production. From the vocal recording, motion capture, 3D modeling, visual effects and development.

recording shoot photos / sketches

Development screens

Project Information


Project duration: 40 weeks.

Roles:

• Technical art lead
• VFX artist
• 3D modeling topology & texture cleanup
• Face performance capture tech
• Tooling for face recording data processing
• Development support

Tools used:

• Unity
• Blender
• Adobe Photoshop
• three.js
• PHP / ImageMagick
• Custom-built tools

Project Credits:

Michel van der Aa
Roles : Composition, director, script
https://vanderaa.net/
Madelon Kooijman
Roles : Dramaturge
Roland Smeenk
Roles : Lead developer
Quint Vrolijk
Roles : Technical artist
Stein van de Ven
Roles : Lead 3D artist
Glenn Wustlich
Roles : 3D characters artist, 3D model cleanup, rigging
Bas Jansen
Roles : 3D artist
Niels Weber
Roles : Character animation, mocap cleanup, development support
Michael Hussar
Roles : Gen AI integration, onboarding tool
Sean Simon
Roles : Gen AI integration, onboarding tool
Iris van Wees
Roles : Digital clothing designer
Ruud op den Kelder
Roles : VFX artist, development support
Judith de Zwart
Roles : Stylist Sjaella
Bart Wagemakers (Exalto studio)
Roles : Vocal recording engineer
Joost Rietdijk
Roles : Director of Photography
Daniëlle de Jonge
Roles : Production
Aram Balian
Roles : Production
Rosita Wouda (doubleA)
Roles : Financial director
Norman Vladimir
Roles : Publicity manager
Het Nieuwe Kader, Niels Bosch
Roles : Mocap Xsens suit recording
Tomas Valečka
Roles : Mocap Audio engineer
Arjen Oosterbaan
Roles : Mocap Producer
Anne van Brunschot
Roles : Assistant director
David Wolfswinkel
Roles : Assistant director