Radish Trial 6 Advanced Scenes
The trial of the radishes is meant as a guided, self-learning tutorial without step-by-step instructions. Instead it focuses on exploratory learning by actively using the tools to solve increasingly challenging tasks.
>> Trial 6 focuses on creating more advanced scenes <<.
This article provides some background information and various tips required or helpful to accomplish these objectives.
See forum post for more information about the trials.
- 1 Some Background Information
- 2 Cinematography Techniques
- 3 Scene (Definition) Tuning
- 4 Cutscene Anchoring
- 5 Gameplay Scenes
- 6 Custom Voicelines
- 7 Phoneme Extractor GUI
- 8 Testing Custom Audio In Storyboard UI
- 9 Adding Other Languages
- 10 Generating Lipsync Animations Without Audio
- 11 Tips Scene Definitions
- 12 Tips Storyboard UI (SBUI)
- 13 Tips Phoneme Extraction/Tuning
- 14 Tips Speechfile Packing
Some Background Information
In the previous trial the focus was set on interactive dialogue scenes and the dialogue flow. Now it’s time to concentrate on the presentation, that is: fine-tune animations, mimics, voicelines and camera framing. This will not only make dialogue scenes more interesting but also allows creating more engaging cinematic cutscenes which present quest progress without player interaction.
Adjusting animations and camera framing is only one part of the story. A cinematic scene should also have some visual structure to support the narrative. This is a very broad topic and way beyond the scope of this article, which focuses only on the technical side. As a start, read at least this short introduction, which touches on some of these aspects.
Cinematography Techniques

Rule of Thirds
One simple cinematography technique which is also mentioned in the above introduction is the “Rule of Thirds”. It’s a rule of thumb on how to frame the shot to make the composition look more interesting: the camera frame should be divided by two equally spaced horizontal and two equally spaced vertical lines, and important elements of the scene should be placed along those lines or their intersections.
Storyboard UI supports this by providing a Rule-of-Thirds overlay in the camera mode:
The overlay can be toggled on and off with a hotkey.
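The geometry behind the overlay is trivial. As a quick illustrative sketch (not part of any radish tool), here is how the guide lines and their four intersections (“power points”) can be computed for a given frame resolution:

```python
def thirds_grid(width, height):
    # Two vertical and two horizontal guide lines ...
    xs = [width / 3, 2 * width / 3]
    ys = [height / 3, 2 * height / 3]
    # ... and their four intersections, the preferred spots for key elements.
    points = [(x, y) for x in xs for y in ys]
    return xs, ys, points

xs, ys, points = thirds_grid(1920, 1080)
print(xs, ys)  # [640.0, 1280.0] [360.0, 720.0]
```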
In addition to a visually pleasing frame composition and varying between appropriate static shot types, scenes can also benefit from carefully defined camera movements. In the Witcher 3 scene system this works by defining two or more key frames (meaning a specific camera position and its direction) and letting the scene system smoothly interpolate the camera position and direction (and other parameters) automatically between the key frames.
Storyboard UI does not provide any direct support or preview for camera setting interpolation. Instead it is required to define separate shots with static cameras in SBUI and afterwards connect these cameras in the scene yml definition by setting up the first camera as a start cam and the second one as the end cam. In the encoder this is called camera blends and an example definition may look like this:
In this case the camera blend starts in the element shot_1 and ends in the element shot_3 because it spans multiple voicelines, which have to be separate elements. But spanning multiple elements is not necessary: it’s possible to define multiple (but not overlapping!) camera blends in one element, e.g. a long pause element without any voicelines. However, camera blends cannot span dialog section boundaries and cannot be defined in choice sections.
Depending on the defined duration (that is the time between the key frames) the interpolation between the camera parameters will be slower or faster. Most of the time only a subtle, very slow movement will improve the scene while fast changes will most likely draw too much attention from viewers and distract from the actual content.
Be aware that “smoothly” also means that the interpolation across more than two key frames (additional ones can be inserted between start and end) will be a curve that may overshoot a key frame significantly if not carefully defined. Some experimentation will be necessary.
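This overshoot is easy to reproduce outside the game. The sketch below uses a Catmull-Rom spline as a stand-in smooth interpolator (the engine’s actual curve type is an assumption here, so treat this as an analogy only):

```python
def catmull_rom(p0, p1, p2, p3, t):
    # One Catmull-Rom segment between p1 and p2, evaluated at t in [0, 1].
    return ((2 * t**3 - 3 * t**2 + 1) * p1
            + (t**3 - 2 * t**2 + t) * (p2 - p0) / 2
            + (-2 * t**3 + 3 * t**2) * p2
            + (t**3 - t**2) * (p3 - p1) / 2)

# A camera value moves from 0.0 to 1.0 and then holds at 1.0 ...
keys = [0.0, 1.0, 1.0, 1.0]
samples = [catmull_rom(*keys, t / 100) for t in range(101)]

# ... but the smooth curve swings past the key frame value of 1.0:
print(max(samples) > 1.0)  # True (peak of ~1.074 about a third into the segment)
```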
One simple encode-able example can be found in the docs.scenes/test.examples directory (test_storyboard_cam.blend.yml) which also explains the parameters. The “bigbunny.example” in docs.scenes is an example of a camera flyby with multiple keyframes.
Depth of Field and Focus
An advanced cinematography technique is to adjust the depth of field (DOF). It’s the area in front of the camera that appears sharp and can be used to direct the attention of viewers to certain elements and to blur out unimportant ones, e.g. crowds of npcs surrounding the main actor(s) of the scene.
Here are three example shots of the same scene with different DOF settings to visualize the effect (click on the image to see the video with rapid switches between the DOF settings):
The camera definitions logged from Storyboard UI contain a default DOF setting:
But in most cases it should be adjusted (or at least tuned down) as it makes the encoded scenes “blurry” if the actors are not in the focused sweet spot area.
Both the blur and the focus settings are specified by two distance values (near, far) which (as a simplified analogy) define the distance range from the camera position in which the blur or focus takes effect. So to put some actor(s) (or the interesting scene element) into focus and blur out the near and far parts of the scene the settings should be spaced roughly like this:
[ ... blur near ... [focus near ... actor(s) ... focus far] ... blur far ... ]
It’s possible to change the values individually in SBUI but most of the time it is easier to use the rough (!) automatic DOF centering (see hotkey help). It tries to adjust the settings to “somewhat adequate” values that put the selected actor into the sweet spot of the current shot.
Though the settings can be tweaked individually afterwards, too.
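To make the spacing rule concrete, here is a hypothetical helper that mimics what an automatic DOF centering might compute for an actor at a given distance. The key names and margin values are made up for illustration, they are not the actual SBUI/radish setting names:

```python
# Key names below are hypothetical, not the actual SBUI/radish setting names.
def center_dof(actor_dist, focus_margin=1.0, blur_margin=3.0):
    # Put the actor in the middle of the focus range and start the blur
    # well outside of it, following the spacing shown above.
    return {
        "focus_near": actor_dist - focus_margin,
        "focus_far": actor_dist + focus_margin,
        "blur_near": actor_dist - focus_margin - blur_margin,
        "blur_far": actor_dist + focus_margin + blur_margin,
    }

dof = center_dof(5.0)  # actor 5 m in front of the camera
# blur_near (1.0) < focus_near (4.0) < actor (5.0) < focus_far (6.0) < blur_far (9.0)
```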
Unfortunately, due to technical limitations there is only limited support to preview the effect in SBUI. As a prerequisite for the preview to work at all, no layered environment definitions may be active (for example, no weather environment). Even then the preview in SBUI works only partially: only the ‘far’ plane is blurred but not the ‘near’ area. Layered environments can be deactivated with the console command envui_disable_envs() which is part of the envui package. But make sure the depth of field is not switched off in the game’s post-processing settings!
As a side note, the DOF settings, like all camera parameters, are interpolated in camera blends, which can be used to create interesting visual effects.
Scene (Definition) Tuning
As briefly hinted in the dialog scene tutorial (see docs.scenes/tutorial) the ‘storyboard’ part of a radish scene definition is responsible for the actual presentation of a scene. Most of the tuning for the presentation has to be done in this part of the definition. But first it’s important to understand where settings have to be added or changed.
For every section in the dialogscript part of the scene definition, a dedicated, equally named section in the storyboard part can be defined: it allows to attach a variety of different ‘scene events’ for specific “dialogline” or “pause” elements of the dialogscript section.
All dialogscript elements can be referenced in the storyboard part either positionally (by their position in the respective section, e.g. PAUSE_1 for the first pause in the section or GERALT_2 for the second textline of the actor named GERALT) or by the name of an additional, prepended named HINT/CUE definition (see example below). Most of the time (e.g. in all Storyboard UI dumped definitions) the named CUE references should be preferred as they don’t rely on the order of elements and also improve the readability of the storyboard.
The previously described camera blends are an example for one type of scene events related to cameras (but there are many other types).
Every scene event type has a different purpose and most have different settings that can be adjusted.
See the short example definitions in the ‘docs.scenes/test.examples’ directory as a reference for all supported scene events and their respective settings.
Of particular interest are the animation events (normal, additive and mimic animations): although adding custom animations is not supported by the radish modding tools, Witcher 3 has a lot of reusable animations that can be further tweaked by some settings to customize them for the scene. For example animations can be ‘clipped’ to play back just a specific part, have a smooth transition to the previous and/or next or idle animation (‘blend’), their intensity can be adjusted by a ‘weight’ parameter and they can be slowed down or sped up with a ‘stretch’ parameter. In addition a sequence of multiple animations can be set up with different starting positions for any dialogline or pause element in a scene section.
For technical reasons the Storyboard UI in-game mod cannot provide a preview for any of these settings and it also cannot assign multiple animations per shot or set up their starting point within a shot. As a consequence tuning these parameters is only possible in the yml definition and the results have to be checked by encoding and reviewing the scene in the game. Nevertheless SBUI is useful to lay out the rough scene, set up some variations of a shot (e.g. different cameras or animations) or simply to search and preselect animations or mimics.
To ease the tuning of a scene and to get a better “overview” of the storyboard parts of a scene definition the radish scene encoder automatically “renders” a “debug timeline” which visualizes the positions, durations and sequence of animations and other scene events (e.g. camera changes) into a text file.
It looks like this:
At this point it is *highly* recommended to use Sublime Text as an editor and to install the radish modding tools sublime support package (from the download section). The package contains auto completions for a couple of scene events, completions for animation names and a coloring scheme for scene timelines. This improves the overall usefulness of the debug timeline, especially with more complex scenes containing many actors and props (note the folded actor timelines in the screenshot):
With some practice many changes can be prepared using the debug timeline without the need to check every change in the game (see this short example video). Nevertheless make sure you set up the ‘Utility Scene Autostarter’ from trial 4 correctly to play back encoded scenes in the game as fast as possible (when you have to).
Adding/Adjusting Animations To A Scene
Once a scene definition has been changed by manual tweaking it becomes impractical to use Storyboard UI to adjust and dump the *full* scene again. One possible way to add or exchange an animation in an already defined scene is to create a new (or adjust an existing) shot in a SBUI scene, log the definition and afterwards manually merge multiple parts from the dump into the scene definition yml: namely the relevant part of the dumped repository, the part of the production assets and the actual animation usage in the storyboard section referencing the new animation in question:
However an easier way is to just use inline animations in the definition and thus make the manual transfer of the repository and production parts unnecessary. The above example is then considerably reduced to just adding/changing the actual animation and attaching it to the actor directly in the storyboard:
The only required information from the SBUI dump is the repository name of the animation (in this case
Poses and Animation Blending
Every actor always has a “pose” defined. Poses are basically the “idle” animations and are active whenever no specific animation is played, e.g. in scene choice sections. For this reason pose animations are always looped.
If a specific animation is played and stops, the pose idle animation takes over. Most of the usable animations intended for dialogs are named in a way which indicates a compatible pose idle animation (e.g. geralt_high_standing_determined… is compatible with all pose idle animations defined as high, standing and determined). But even in a compatible transition there may be a visible, sudden change of the actor’s pose at the end of the animation. To reduce this animation jerk, a blend-in and also a blend-out can be defined for every anim, like this:
These settings define the duration (in seconds) of the blend, either from the previous anim (a specific or the idle one) to this one, or from this one to the next (again, either a specific or the idle one). However making these blends too long will look unnatural - so some experimentation will be required. Usually a good starting point is something between 0.3 and 0.7 seconds.
Since this blending is applied to the *played* part of the animation the clipping has to be considered, too:
It’s also possible to slowdown or speedup animations (‘stretch’ setting) and to change the ‘weight’ of the animation (which is basically the intensity of the overlay of the animation - try out a couple of different weights to get a feeling for the effect).
Also be aware that the ‘stretch’ parameter is applied to the played range of the animation (that is: *after* the clipping). As seen in the above video it’s much easier to get a handle on the resulting duration by using the debug timeline visualization.
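The interplay of clipping and stretch boils down to simple math. The semantics below are assumptions (verify against the debug timeline): clipping selects the played range first, then stretch scales its duration:

```python
# Assumed semantics (verify against the debug timeline): clipping selects
# the played range first, then 'stretch' scales its duration as a
# multiplier (> 1.0 plays slower, < 1.0 plays faster).
def played_duration(anim_len, clip_front=0.0, clip_end=None, stretch=1.0):
    clip_end = anim_len if clip_end is None else clip_end
    return (clip_end - clip_front) * stretch

# A 4 s animation clipped to its middle 2 s and slowed down by stretch 1.5
# plays for 3 s on the timeline; a blend-in/out then eats into that played
# range, not into the clipped-away parts.
print(played_duration(4.0, clip_front=1.0, clip_end=3.0, stretch=1.5))  # 3.0
```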
Some animations like generic walking, running or riding are very short and consist of only one or at most a couple of cycles - they simply stop when used as actor.anim scene events. In addition the actor moves on the spot and does not advance.
There are basically two options to loop these animations:
- set up a custom pose with the animation as idle animation in SBUI (select the animation intended to be looped in the pose animation list and add an actor.pose event in the appropriate element in the yml definition)
- set up a sequence of the same animation with actor.anim events as many times as required
The first solution is easier to set up but does not work reliably if the animation has to be synced with another actor’s animation (e.g. rider and horse): pose animations seem to start at slightly random positions and this may result in some clipping (e.g. between rider and horse). The second one syncs reliably but the subsequent starting positions have to be set manually, which may require more fine-tuning (or some calculations for correct positioning).
Additionally, in both cases the actor has to be moved accordingly by properly defined placement interpolation events.
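For the second option the starting positions are plain arithmetic. A small sketch (no radish API involved) for calculating back-to-back start times of a repeated cycle:

```python
def loop_starts(first_start, cycle_duration, repeats):
    # Start times for a back-to-back sequence of the same animation event.
    return [first_start + i * cycle_duration for i in range(repeats)]

# A 1.25 s walk cycle repeated 4 times, starting 0.5 s into the element:
print(loop_starts(0.5, 1.25, 4))  # [0.5, 1.75, 3.0, 4.25]
```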
Most of the actor animations with some movement also move the involved actor while playing. But at the end of the animation the actor’s position is always reset to the starting position. To reposition an actor (at any time) to a specific position the ‘actor.placement’ scene events can be used.
However for looped animations like walking cycles (or any other animation which does not have actor movement encoded) it is required to manually add continuous, smooth placement updates. In the yml definition this can be done with ‘placement interpolation’ scene events, similar to camera interpolation (see ‘test_storyboard_actor.placement.interpolation.yml’):
Additional key frames can be set between the start and end to define a more specific path. In addition to the position also the rotation will be interpolated. The last parameter defines the ‘ease-in’ and ‘ease-out’ just like for camera blends.
The easiest way to define an interpolation is to create two dedicated shots in SBUI (one for the start, the other for the end position), place the actor in both shots, log the definition and either change the placement events into interpolation events or just use the coordinates to write the above scene event sequence manually, e.g. attached to only one element (as above). However an appropriate amount of time needs to pass between the events or you’ll just get unnatural sliding - so expect some iterations of fine-tuning.
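Conceptually a placement interpolation just blends position and rotation between the key frames. A simplified sketch (linear only, without the ease-in/ease-out the scene system adds on top):

```python
def lerp(a, b, t):
    return a + (b - a) * t

def interp_placement(start, end, t):
    # start/end: ([x, y, z], [roll, pitch, yaw]); t in [0, 1]
    pos = [lerp(a, b, t) for a, b in zip(start[0], end[0])]
    rot = [lerp(a, b, t) for a, b in zip(start[1], end[1])]
    return pos, rot

# Halfway between two placement key frames 4 m apart with a 90 degree turn:
pos, rot = interp_placement(([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]),
                            ([4.0, 2.0, 0.0], [0.0, 0.0, 90.0]), 0.5)
print(pos, rot)  # [2.0, 1.0, 0.0] [0.0, 0.0, 45.0]
```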
One thing to keep in mind is that actor movement (from animations) in scenes does not respect any collisions, that includes the ground: actors will clip with any terrain bump or float over terrain dents.
If necessary this can be manually fixed with placement interpolation (Z-Axis) as well. Although most of the time it’s just easier to frame the shot and hide the clipping.
Also, any scene props can be positioned and/or moved along a placement interpolation path with dedicated scene events (see test.examples).
Normal animations do not contain any facial movements. There is a dedicated set of animations for this and they are attached with ‘actor.anim.mimic’ scene events. The available mimic animations can be previewed in Storyboard UI.
Unfortunately SBUI does not support assigning a mimic animation and a voiceline at the same time in the same shot. However this constraint does not apply to yml scene definitions.
The settings for ‘
actor.anim.mimic’ events are basically the same as for “normal” animation events. And just like normal animations events an ‘
actor.anim.mimic’ can be defined as an “inline” animation and directly attached to an actor:
Take special care with the weight setting for mimic animations, e.g. if your actors are smiling like idiots you should tune down the weight for the mimic. Here are four examples with anim.mimic weights of 1.0, 0.66, 0.33 and 0.0:
As a good rule of thumb you should not go overboard with the mimic weight: the more subtle the mimics are, the more convincing and natural the result will look. Sometimes even one mildly raised eyebrow will be enough to emphasize a reaction.
The encoder automatically generates ‘actor.lookat’ events for scenes with multiple actors and makes actors look at each other. However it’s possible to override these autogenerated scene events by specifying a custom looked-at actor or static point, the speed of the change or the involved rotating bodyparts (body). It’s also possible to “disable” a specific look-at by defining a ‘none’ target: look-at events are on-top modifiers for animations and some animations may already contain gaze changes, too.
SBUI provides a hotkey to cycle between all actors as look-at target and also allows the definition of static look-at points (make sure to adjust the distance or the actor will be squint-eyed), but the advanced settings like speed and the involved bodyparts can only be set up in the definition (see the above debug timeline video or the test.examples).
Cutscene Anchoring

Most cutscene-type scenes probably include some camera shots embedding the narrative into the surroundings and are intended to be played back at exactly the same location. As mentioned in the 2nd part of the dialog scene tutorial the ‘placement’ key in the production part of a scene definition can be used to attach the scene to either an actor or to a specific world location (via the tag of a spawned entity).
Attaching a scene to a tagged, statically positioned entity is done via ‘scenepoints’ (a special type of layer entities) which can be created in radish quest UI at the location where the scene should be played back. However a precise placement of the scenepoint is not required: it defines only the anchorpoint (origin of the coordinate system for the scene) for all placement settings of this particular scene.
To ensure every scenepoint can be fetched individually the encoder automatically attaches an autogenerated tag, derived from the project and scenepoint name, to every scenepoint.
The exact name of the attached tag can be inspected in modeditor in the appropriate “scenepoints.w2l” encoded layer file. But the scheme is always ‘<modname>_<hubname>_<given scenepoint name>_sp’, so for example the ‘hubtest’ project scenepoint named ‘examine_corpse’ in the prologue area will be expanded to ‘hubtest_prologue_examine_corpse_sp’.
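The tag scheme, expressed as a tiny helper:

```python
def scenepoint_tag(modname, hubname, scenepoint_name):
    # Tag scheme: <modname>_<hubname>_<given scenepoint name>_sp
    return f"{modname}_{hubname}_{scenepoint_name}_sp"

print(scenepoint_tag("hubtest", "prologue", "examine_corpse"))
# hubtest_prologue_examine_corpse_sp
```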
Setting this tag as placement will always playback the scene at this location IF the layer with the scenepoint is visible. But be aware that if a scenepoint entity’s location is changed afterwards the played scene will also move (or rotate).
If you just want to change (or simply bind to) a scenepoint for a Storyboard UI prepared scene without moving the scene actors and prop placements you can start SBUI with ‘sbui_with_scenepoint(<scenepoint tagname>)’. This will keep the defined scene and just recalculate all positions. So it is safe to use the command on an “existing” storyboard even multiple times.
The other option to bind a scene to a specific location is to create the scene in Storyboard UI and use the coordinates information below the placement settings in the logged definition to create a new scenepoint (either in radish quest UI or manually in a quest layer definition).
Either way the placement tag in the logged scene description still needs to be set manually to the scenepoint tag.
Please note that SBUI logs the rotation in the dumped “world coordinates of used origin” info as ‘[pitch, yaw, roll]’, like the EulerAngles struct in witcher scripts expects it. But the radish quest UI and the radish encoders expect rotations to be ordered as ‘[roll, pitch, yaw]’.
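A small helper for converting between the two orders (to avoid mixing them up when copying values):

```python
def sbui_to_radish_rotation(rot):
    # SBUI logs [pitch, yaw, roll]; radish quest UI and the radish
    # encoders expect [roll, pitch, yaw].
    pitch, yaw, roll = rot
    return [roll, pitch, yaw]

print(sbui_to_radish_rotation([10.0, 90.0, 0.0]))  # [0.0, 10.0, 90.0]
```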
Gameplay Scenes

Aside from dialog-type and cutscene-type scenes there is one other category supported by the radish modding tools: ‘gameplay scenes’. Those scenes do not interrupt the normal 3rd person view and can be used to play back some custom (even randomized) comments by an npc (or the player) or even more complex interactions between multiple actors with animations, dialoglines and even mimics, e.g.:
Be aware that using animations or pose changes in gameplay scenes may lead to poor blending with the currently active npc gameplay pose or animation. In practice, playing voicelines and mimics should work most of the time, and maybe also some gesture animations defined as ‘actor.anim.additive’ scene events. Check out the two short example definitions for gameplay scenes in the ‘docs.scenes/test.examples’ folder and experiment for yourself (how about a mod adding some scenes with mimics, gestures and new comments from npc bystanders?).
Custom Voicelines

The radish modding tools support adding custom audio voicelines to the game. Added voicelines can be used in any scene just like vanilla voicelines, including their selection in Storyboard UI to set up new scenes.
In addition it’s also possible to generate somewhat passable lipsync animations for custom voicelines. There is a more detailed HOWTO in the docs.lipsync folder of the encoder package (HOWTO.generate.lipsynced.w3speech.txt) but as a short overview it basically works like this:
- 1. prepare audio as wav files and the corresponding spoken text as a strings.csv with string ids assigned
- 2. extract initial phoneme timings
- 3. tune phoneme timings manually in the GUI
- 4. convert the above wav audio into ‘wem’-format
- 5. generate lipsync animation and pack wem audio as w3speech file
- 6. test voicelines and lipsync animations in SBUI
In step 4 the audio is converted into the ‘wem’ format from Audiokinetic Wwise to be usable in Witcher 3. Since newer versions of Wwise generate incompatible wem formats, an older version that was used to create the Witcher 3 audio has to be downloaded from here.
A detailed HOWTO for step 4 can be found in the ‘docs.speech’ folder (HOWTO.wem.conversion.txt). Coincidentally there is also a video showing more or less the same.
Step 5 is automated by the radish build pipeline so the following will describe the GUI and its usage in steps 2 and 3 and afterwards the necessary tasks to add custom voicelines into SBUI’s selection list (step 6).
The Big Picture In A Nutshell
In order to generate lipsync animation that is adequately synced with the audio it is required to have (a) correct timings tied to the audio and (b) information about what kind of lip animation should be generated at those timings.
Information for (b) will be directly acquired from the text corresponding to the audio by translating the text into a ‘phoneme’ sequence, e.g.:
Phonemes are basically a set of standardized symbols and each defines a unique pronunciation.
Since the translated phoneme sequence does not contain any timing information the Phoneme Extractor GUI tries to extract phonemes from the audio (a) as well. Unfortunately this is rather difficult and while the result does contain phonemes and timings, this phoneme sequence does not necessarily match the expected phoneme sequence from the text. Nevertheless most of the time it is a good baseline for manual adjustments.
Once the timings are tuned (step 3) the radish speech encoder uses the phonemes from the *translated* phoneme sequence to pick lip animation snippets from a set of lipsync animation snippets already extracted from vanilla game voicelines. Using the translated phoneme sequence as ground truth ensures the picked snippets resemble the *intended* text much more accurately. Based on the corrected timings these snippets are then combined into an animation sequence (step 5).
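The matching between the two phoneme sequences can be approximated with a standard sequence alignment. This is only a rough sketch of the idea, not the actual extractor code, and single characters stand in for the (usually multi-character) phoneme symbols:

```python
from difflib import SequenceMatcher

def align_phonemes(from_text, from_audio):
    # Keep phonemes where both sequences agree; extracted audio segments
    # with no matching text phoneme become the invalid placeholder '_'.
    out = []
    sm = SequenceMatcher(a=from_text, b=from_audio, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            out.extend(from_text[i1:i2])
        else:
            out.extend("_" * (j2 - j1))
    return out

print(align_phonemes(list("həloʊ"), list("hɛloʊ")))  # ['h', '_', 'l', 'o', 'ʊ']
```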
A more detailed “big picture” diagram can be found at the beginning of the “HOWTO.tuning-options.txt” in the docs.lipsync folder.
Phoneme Extractor GUI
After step 1 the strings csv with the text strings for the audio voicelines (and only for those!) should be in the speech folder of the project (make sure the string ids are within the project’s string-id space and do NOT overlap with any other strings of the project!). Additionally all of the audio files should be in the ‘speech/speech.en.wav’ folder.
The Phoneme GUI can be started with the ‘speech/_extract-phonemes-from-audio.bat’ batch file. It automatically scans the speech.en.wav folder and lists the found wav files in the ‘audio selection’ queue at the bottom. It will also automatically start extracting phonemes (step 2) for wav files named with a string id from the strings csv as prefix.
But string ids can also be assigned to audio files interactively (see the string id assignment example video). Once the wav files are processed they are renamed and the duration is added to the prefix of the filename; this information is required for the subsequent packaging into the w3speech file. As soon as the processing of one audio file finishes its phoneme timings can be adjusted while remaining unfinished files from the queue are processed in the background. Restarting the GUI will pick up where it left off when it was stopped last time.
Selecting a processed audio from the queue displays its audio waveform (top) with some phoneme blocks positioned and scaled according to the extracted timing information (directly below the waveform):
The waveform can be zoomed (mouse wheel) and dragged (while middle button is pressed) with the phoneme blocks scaling and moving accordingly. The selected audio can be played by pressing the space bar. It’s also possible to playback only a specific part by setting a startpoint (left mouse button) and/or an endpoint (right mouse button) within the waveform. Setting an endpoint before the starting point will remove the endpoint.
A table with exact timings for every phoneme is also shown (left ‘phoneme segment positions’ panel) and allows to activate/deactivate specific phoneme segments or to adjust the start, end or intensity (aka weight) by dragging the appropriate sliders. However most of the time it’s easier to adjust the timings by directly dragging (left mouse button) the left or right phoneme block boundary below the waveform. Depending on the ‘phoneme block drag mode’ (changeable in a panel on the right) the neighboring block boundaries will be adjusted slightly differently (proportionally, or not at all in ‘unconstrained’ mode). You’ll have to experiment to get a feeling for how it works.
Additional panels (on the right) show useful but read-only information: the assigned input text for the audio, its phoneme translation and a table with all the initial timing and matching information at the time the audio was selected. It also highlights phonemes with a low confidence score from the automatic extraction in yellow.
Phoneme Timing Tuning
As already mentioned the automatic phoneme extraction is prone to errors (see below for an explanation of some warnings and errors you might spot).
A common problem are mismatches between text-translated phonemes and audio-extracted phonemes. These mismatches are indicated by a placeholder phoneme segment labeled ‘_’. No valid lipsync animation exists for such a block, so all of these blocks are set inactive by default. But this may also create gaps in the phoneme sequence at positions with spoken audio (easily seen in the block sequence below the audio waveform) which need to be corrected.
There is a “gap close” button which automatically tries to close all gaps ripped by ‘_’ segments by simply extending the neighboring segments. However this may fail in some cases and may also close gaps where a gap *should* be according to the audio. In addition the newly extended timings of the neighboring blocks might now be off as well.
It’s *always* a good idea to verify and manually readjust the timings before saving.
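The assumed behavior of the “gap close” button can be sketched like this (an illustration of the idea, not the actual tool code):

```python
def close_gaps(segments):
    # segments: [{"phoneme", "start", "end", "active"}, ...] sorted by start
    active = [s for s in segments if s["active"]]
    for left, right in zip(active, active[1:]):
        if right["start"] > left["end"]:       # a gap left by a dropped '_' block
            mid = (left["end"] + right["start"]) / 2
            left["end"] = mid                  # extend both neighbors until
            right["start"] = mid               # they meet in the middle
    return active

segs = [
    {"phoneme": "h", "start": 0.0, "end": 0.1, "active": True},
    {"phoneme": "_", "start": 0.1, "end": 0.3, "active": False},  # deactivated
    {"phoneme": "l", "start": 0.3, "end": 0.5, "active": True},
]
closed = close_gaps(segs)
# 'h' and 'l' are extended to meet in the middle of the former '_' gap
```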
In theory phoneme block boundaries can be moved to overlap neighboring blocks and sometimes this may even be useful to squeeze in or extend some phoneme blocks that would otherwise be too short. You can experiment to see the results in-game and decide for yourself.
As a rule of thumb it is better to include all text-translated-phonemes even if they cannot be heard or are only short or overlapped segments because they still define the lip movement and get smoothed anyway - so it may look better even with short phoneme segments.
Here is an example video showing the phoneme extraction and tuning workflow for two audio files. Notice the usage of the auto-gap-close feature and the manual adjustments of falsely merged blocks.
Warnings and Errors In Steps 2 and 5
The automatic extraction of phoneme timings from an audio file and the subsequent matching with the phoneme sequence from the text in step 2 may result in a mismatch or a poor match indicated by a low “score”. This situation will be logged in the console window:
This is merely a strong suggestion to manually adjust this particular phoneme block, especially if it was replaced by a ‘_’ segment - which should be checked anyway.
After adjusting all phonemes it’s also possible that in step 5 the following error(s) or warning(s) are produced by the lipsync generation:
The first warning means that no lipsync animation snippet was found for a specific phoneme id, in the above example for ‘_’, which is by definition an invalid placeholder phoneme segment. The last two warnings indicate that an exact match for a required lipsync animation snippet including the previous and following phonemes (which form its context) was not found and some fallback snippet was used (with a similar but different context). This is not necessarily a bad thing - but it’s not optimal either.
The error at the end means that even fallback snippets could not be found for some phonemes (in this case it is related to the first warning). This should not happen very often. But if it does, you have multiple options as a workaround:
- deactivate that specific phoneme in the phoneme extractor and just extend the neighboring phonemes. The quality of the lipsync animation result will depend on the lipsync animation of the surrounding phonemes
- create an alias for the missing phoneme in the repository ‘<encoder dir>/repo.lipsync/phoneme.alias.repo.yml’ file and set it to use some more or less similar phoneme
- ignore it and live with the fact that no animation will be generated for the unknown phoneme(s) (probably a visible gap in the anim)
But most of the time it will just be a forgotten active ‘_’-block which should be easy to fix.