Audio Nodes: Build your entire audio workflow in one Space
Audio has always been the most broken step in the creative process, not because the tools are bad, but because they live somewhere else. The script lives in a doc, the voiceover in a synthesis tool, the music in a licensing platform, the effects in a download folder somewhere. By the time you have everything, you’ve touched five tools and lost your train of thought twice.
The Audio nodes in Spaces are part of the workflow itself. They receive inputs, generate audio, and move it to the next step, all without leaving the canvas. Here’s what you can build right now.
What are Audio Nodes in Spaces?
Three nodes bring audio generation directly into the Spaces canvas: Voiceover, Music Generator, and Sound Effects. They follow the same logic as every other node: they take text as input, generate audio as output, and connect to text nodes, video generators, and the Video Combiner. In a fragmented workflow, every variation costs time: another export, another import, another manual mix. In Spaces, that same change takes seconds: update the prompt, run again, and the workflow does the rest.
Meet the three audio nodes
- Voiceover converts text into natural narration using hundreds of voices powered by ElevenLabs and Google. You control speed, stability, and voice similarity, and get up to 10 takes per run to pick the one that fits best.
- Music Generator creates original compositions from a descriptive prompt using Google Lyria and ElevenLabs Music models. It produces up to 30 seconds of audio and up to 10 variations per run, with no licensing fees and no stock libraries.
- Sound Effects generates any sound effect from a text description, with control over duration, loop mode for continuous ambiances, and Prompt Influence to decide how literally the model interprets your instruction.
What you can do with Audio Nodes
1. Write the prompt directly inside the node
The most straightforward entry point is to open the node, write your description or script, and generate, with no intermediate nodes and no extra steps. The Voiceover receives the script, the Music Generator receives the musical description, and Sound Effects receives the sound description. If you have the idea, you execute it right there.

2. Run multiple prompts and get a full audio set in one go
Instead of connecting a fixed text to the node, you can connect a list of prompts so the node processes all of them in a single run. If you have 8 scenes that each need a different sound effect, you can describe all 8 in a list, and the node generates them in a single run. The same goes for music: 3 scene descriptions produce 3 tracks in parallel, from the same node, in the same execution.
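Conceptually, the batch behavior is a parallel map over the prompt list. The sketch below models that in plain Python; `generate_sound_effect`, the placeholder filenames, and the thread pool are illustrative stand-ins, not the Spaces API:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_sound_effect(prompt: str) -> str:
    """Stand-in for a single generation call; a real node would
    return an audio clip rather than a derived filename."""
    return f"{prompt[:20].strip().replace(' ', '_')}.wav"

def run_batch(prompts: list[str]) -> list[str]:
    """Fan a list of prompts out to parallel generations,
    preserving the order of the input list."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(generate_sound_effect, prompts))

scene_prompts = [
    "Creaking wooden door, slow, single event",
    "Rain on a tin roof, loop-friendly ambience",
    "Distant thunder rumble, 4 seconds",
]
clips = run_batch(scene_prompts)  # one output per prompt, same order
```

The key property is the one the node gives you: one list in, one ordered set of outputs back, in a single execution.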

3. Let the Assistant build the prompt for you
Unlike a plain text field, the Assistant node is a language model that understands context. You can give it a reference such as an image, a video, an audio file, or a piece of text, and it analyzes, interprets, and generates the ideal prompt for the audio node downstream, without any manual writing or back-and-forth between tools.
For Voiceover, the Assistant can draft the script from a brief or a product image. For Music Generator, it can describe the mood in precise musical terms based on the visual you pass in. For Sound Effects, it can translate a scene, whether visual or sonic, into a technical sound description. And when you need scale, the Assistant can generate a list of prompts instead of just one, so that list feeds the node in batch and you get multiple audio outputs from a single instruction.

4. Use a reference audio to generate a variation
If you have a reference audio that captures exactly the mood you’re after, or an effect you want to replicate with variations, you can connect it directly to the Assistant and ask it to analyze and describe the style. That description passes to the generator node, where the AI interprets the sound, translates it into language, and produces something new: distinct from the original, but coherent with it.

5. Combine your audio and clips into one finished video
The Video Combiner node takes one or more video clips and a single audio track and assembles them into a finished video, joining the clips in the sequence you define and placing the audio on top of the result. It does one thing: take everything you’ve built across the canvas, video clips, music, voiceover, and turn it into the final output in a single run.
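Spaces handles the assembly for you, but for intuition, this is roughly the step a manual workflow would run through ffmpeg. The `combiner_command` helper below is hypothetical, and it assumes all clips share the same resolution and codec parameters:

```python
def combiner_command(clips: list[str], audio: str, output: str) -> list[str]:
    """Build an ffmpeg invocation that concatenates the clips in order
    and lays a single audio track over the joined result."""
    cmd = ["ffmpeg"]
    for clip in clips:
        cmd += ["-i", clip]
    cmd += ["-i", audio]  # the audio track is the last input
    n = len(clips)
    # Chain the n video streams through ffmpeg's concat filter,
    # then map the final input (index n) as the audio.
    filter_expr = "".join(f"[{i}:v]" for i in range(n)) + f"concat=n={n}:v=1:a=0[v]"
    cmd += ["-filter_complex", filter_expr,
            "-map", "[v]", "-map", f"{n}:a",
            "-shortest", output]
    return cmd
```

The point of the node is that you never write this by hand: reorder the clip connections on the canvas and the equivalent command is rebuilt for you.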

Best use cases for audio nodes
These nodes fit naturally into a wide range of production scenarios. Here’s where they shine:
- Ad localization at scale. Generate all language versions of a voiceover in one batch run, keeping the same script structure across every output.
- Original music for brand content. Create custom background tracks for product demos, social ads, and campaign videos without licensing fees or stock libraries.
- Scene-specific sound design. Produce tailored effects for each moment in a short film, product video, or tutorial — all from a single List node connected to Sound Effects.
- Voiceover-driven explainers. Turn a product brief or a landing page into a narrated video by passing the text to a Voiceover node and connecting the output directly to the Video Combiner.
- Audio variations for A/B testing. Generate multiple takes or music variations in one run and test which version performs better before committing to a final edit.
- Multilingual training and educational content. Produce the same narration across different languages and accents without re-recording anything.
If you’re combining the audio nodes with the Assistant and a List node, the scope expands further: one workflow can handle an entire content series with different scripts, different moods, and different formats in a single execution.
How to use Audio Nodes in Spaces
Prompts are creative briefs for the model. The more specific you are, the closer the output gets to what you actually have in mind. Here’s what works for each node.
- For Voiceover with ElevenLabs, the v3 model responds to audio tags and punctuation, not just descriptions. Use tags like [urgent], [whispers], or [dramatic] to shape tone, and [breathing] or [laughing] to make delivery feel human rather than studio-polished.
Ellipses (…) create natural pauses and tension, while a period forces a drop in intonation, useful for informational or serious lines. You can also insert precise timed pauses directly in the script: writing (pause 1.5) generates a real pause of that exact duration in the voiceover. Avoid overusing exclamation marks, as they can push the model toward an overly enthusiastic read.
“[urgent] The window closes in three days. Don’t miss it… [flatly] Terms and conditions apply.”
- For Voiceover with Google, describe the scenario, not just the voice type. Google’s synthesis responds well to narrative context and direct style instructions.
“A racing driver speaking through a low-fidelity intercom, engine noise in the background, slightly breathless. Informative tone, no dramatic inflection.”
- For Music Generator with Google Lyria, be explicit about genre, tempo, instrumentation, and the emotional function of the track in your project.
“Cinematic synthwave, 110 BPM, dark atmosphere, heavy bass and distorted synthesizers, driving rhythm for an action sequence.”
- For Sound Effects, specify whether you need a looping texture or a single event, and always include duration. The model responds very differently depending on whether you’re building a background layer or a precise moment.
“Busy city street ambiance, mid-morning, distant traffic, occasional footsteps on pavement, light wind. Loop-friendly.”
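If you assemble voiceover scripts programmatically, a small helper can sanity-check the timed-pause syntax described above for ElevenLabs voiceovers before you run the node. `timed_pauses` is an illustrative utility, not part of Spaces or ElevenLabs:

```python
import re

def timed_pauses(script: str) -> list[float]:
    """Extract the durations of (pause X) markers from a script,
    so you can check how much silence a take will contain."""
    return [float(m) for m in re.findall(r"\(pause\s+([0-9.]+)\)", script)]

script = "[urgent] The window closes in three days. (pause 1.5) Don’t miss it…"
# sum(timed_pauses(script)) gives the total scripted silence in seconds
```

A check like this is most useful in batch runs, where a typo in one script's pause markers is easy to miss until you listen back.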
Why Audio Nodes make Spaces more powerful
Most AI tools generate one asset at a time. The audio nodes in Spaces let you build something more: a system where script, voiceover, music, and sound effects all live on the same canvas, connected and ready to run together.
The Assistant node adds another layer: give it multiple references at once, a product image, a reference track, a script brief, and it builds the prompt from all of them together, not from a single input in isolation. The more context you bring into the workflow, the sharper the output.
When a teammate opens the Space, they don’t just see the final video. They see every prompt, every connection, every decision. The workflow is the documentation. Nothing gets lost in the handoff, because there is no handoff.
Start simple: drop a Voiceover node on the canvas and write your script. The rest of the connections will follow naturally.