Veo 3.1 Prompts Guide: Audio, References, and Cinematic Control

When I first started looking at Veo 3.1 seriously, what stood out to me was not only the image quality. It was the fact that Google was clearly pushing it beyond silent clip generation.

Veo 3.1 feels different because it treats video more like an audiovisual scene. You are not only describing what the camera should see. You are often describing what the scene should sound like, how it should move, and how one shot should connect to the next.

That is why I would not describe Veo 3.1 as just another text-to-video model. I think of it more like a soundstage director. It is strongest when you give it a full scene to stage: subject, movement, environment, camera language, mood, and sometimes even dialogue or background sound.

In this guide I want to focus on four practical questions:

which Veo 3.1 version to use and why
which controls actually matter in a real workflow
how to prompt Veo 3.1 without fighting the model
how to use Veo inside a more repeatable pipeline in Phygital+

Veo 3.1 versions compared: Veo 3.1 vs Fast vs Lite

The Veo 3.1 family makes more sense once you stop asking for one universal “best” model.

As of April 2026, Google positions the current lineup like this on Vertex AI:

Veo 3.1 for the highest visual fidelity and final production quality
Veo 3.1 Fast for faster generation with high quality
Veo 3.1 Lite for lower-cost, high-volume iteration

That distinction matters because the models are not only priced differently. They also fit different stages of the workflow.

Here is the simplest practical way I would frame them:

Version	Best use	Why I would use it
`Veo 3.1`	premium final shots	best when the clip itself needs to look like the strongest possible output
`Veo 3.1 Fast`	standard production workflows	good when I want speed without dropping too far in quality
`Veo 3.1 Lite`	high-volume testing and scaling	best for rapid branching, cheaper prompt iteration, and production systems that need volume

Google’s model docs also separate the current GA models from older preview IDs. The important practical point is this: if you are building now, use the current Veo 3.1, Veo 3.1 Fast, and Veo 3.1 Lite generation models, not the older preview endpoints that were phased out.

My working rule

I would use the lineup like this:

start in Lite when I need many prompt branches cheaply
move to Fast when the idea is clearer and I want stronger iterations
move to Veo 3.1 when the shot is close to final and visual fidelity matters more than speed

That is a much better way to use Veo than throwing your highest-cost model at every first draft.
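The rule can be written down as a small lookup so it is applied the same way every time. This is only a sketch: the stage labels (`explore`, `refine`, `final`) and the `pick_tier` helper are my own names, not anything defined by Google or Phygital+.

```python
# Hypothetical stage labels mapped to the Veo 3.1 tiers named in this guide.
TIER_BY_STAGE = {
    "explore": "Veo 3.1 Lite",  # many cheap prompt branches
    "refine": "Veo 3.1 Fast",   # clearer idea, stronger iterations
    "final": "Veo 3.1",         # near-final shot, fidelity over speed
}


def pick_tier(stage: str) -> str:
    """Return the tier this guide's working rule suggests for a stage."""
    if stage not in TIER_BY_STAGE:
        raise ValueError(f"unknown stage: {stage!r}")
    return TIER_BY_STAGE[stage]


for stage in ("explore", "refine", "final"):
    print(stage, "->", pick_tier(stage))
```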

Veo 3.1 pricing in Phygital: what matters in practice

In this guide I give pricing in Phygital credits, not in Vertex AI dollars.

Based on the current Phygital UI screenshots checked on April 16, 2026, the visible cost for the same setup is:

Veo 3.1 = 1460 credits
Veo 3.1 Fast = 560 credits

These screenshots show the same practical settings:

16:9
720p
4-second duration
Generate audio turned off

That is the most useful comparison for a reader because it reflects the real choice inside the Phygital workspace, not an abstract API billing table.

The practical conclusion is simple:

Veo 3.1 Fast is the cheaper branching layer
Veo 3.1 is the premium finishing layer

For this exact setup, Veo 3.1 costs about 2.6x the credits of Veo 3.1 Fast (1460 / 560 ≈ 2.61), so I would not use it for early exploration unless I already knew the shot direction was strong.
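The ratio is easy to check, and the same arithmetic shows why branching in Fast pays off. A minimal sketch, assuming the two credit prices from the screenshots; the three-branch scenario and the `workflow_cost` helper are hypothetical, chosen only for illustration.

```python
# Credit prices from the Phygital UI screenshots (16:9, 720p, 4 s, audio off).
CREDITS = {"Veo 3.1": 1460, "Veo 3.1 Fast": 560}


def workflow_cost(fast_branches: int, final_renders: int) -> int:
    """Credits for exploring in Fast and finishing in Veo 3.1."""
    return (fast_branches * CREDITS["Veo 3.1 Fast"]
            + final_renders * CREDITS["Veo 3.1"])


ratio = CREDITS["Veo 3.1"] / CREDITS["Veo 3.1 Fast"]
print(f"Veo 3.1 costs {ratio:.2f}x Veo 3.1 Fast")  # 2.61x

# Hypothetical scenario: three exploratory branches, then one final render,
# compared with running all four generations in Veo 3.1.
mixed = workflow_cost(fast_branches=3, final_renders=1)
all_premium = workflow_cost(fast_branches=0, final_renders=4)
print(mixed, all_premium)  # 3140 5840
```

In that scenario the mixed approach uses 3140 credits against 5840 for four premium generations, which is the practical argument for treating Fast as the branching layer.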

I keep Lite in the version comparison as the lower-cost tier from Google’s official model lineup, but I am not quoting a Phygital credit number for it here, because I do not yet have a verified in-product screenshot for that mode.

Phygital Veo 3.1 credit pricing screenshot — **Veo 3.1** at 1460 credits for 16:9, 720p, 4 seconds, audio off.

Phygital Veo 3.1 Fast credit pricing screenshot — **Veo 3.1 Fast** at 560 credits for the same setup.

What Veo 3.1 officially supports

This part matters most before you plan a shot around features that the model, or the surface you reach it through, may not fully support.

According to Google’s official Vertex AI docs, Veo 3.1 supports:

text-to-video
image-to-video
prompt rewriting
reference asset images
extend video
first-and-last-frame generation
9:16 and 16:9 output
4, 6, or 8 second clips
up to 4 outputs per prompt
24 FPS

There are also a few practical constraints worth remembering:

reference image-to-video only supports 8 seconds
image-to-video input images can be up to 20 MB
4K support is still marked as preview in Vertex AI docs
support can differ depending on whether you are using Vertex AI, Gemini API, or Flow
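These limits are easy to check before a request is sent. The sketch below encodes only the values listed above. The `VideoRequest` class, its field names, and `validate` are hypothetical helpers rather than part of any Google SDK, and the limits themselves can differ between surfaces.

```python
from dataclasses import dataclass

ASPECT_RATIOS = {"16:9", "9:16"}
DURATIONS_S = {4, 6, 8}
MAX_OUTPUTS = 4
MAX_IMAGE_BYTES = 20 * 1024 * 1024  # "20 MB", assumed here to mean 20 MiB


@dataclass
class VideoRequest:
    aspect_ratio: str = "16:9"
    duration_s: int = 8
    outputs: int = 1
    uses_reference_images: bool = False
    input_image_bytes: int = 0  # 0 means no input image


def validate(req: VideoRequest) -> list[str]:
    """Return a list of problems; an empty list means the request looks fine."""
    problems = []
    if req.aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"aspect ratio {req.aspect_ratio} is not 16:9 or 9:16")
    if req.duration_s not in DURATIONS_S:
        problems.append(f"duration {req.duration_s}s is not 4, 6, or 8")
    if not 1 <= req.outputs <= MAX_OUTPUTS:
        problems.append(f"outputs must be between 1 and {MAX_OUTPUTS}")
    if req.uses_reference_images and req.duration_s != 8:
        problems.append("reference image-to-video only supports 8 seconds")
    if req.input_image_bytes > MAX_IMAGE_BYTES:
        problems.append("input image is larger than 20 MB")
    return problems


# A 4-second clip with reference images should be flagged.
print(validate(VideoRequest(duration_s=4, uses_reference_images=True)))
```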

According to the official Vertex AI model docs:

Veo 3.1 and Veo 3.1 Fast support text and image input
Veo 3.1 Lite is currently in preview; the model family docs list it for text and image workflows, but with fewer supported control features than the higher tiers
output video format is MP4
framerate is 24 FPS
output count can go up to 4 videos per prompt

According to Flow help:

some features are model-specific inside Flow
Ingredients to Video is supported in Veo 3.1 Fast, but not in Veo 3.1 Quality
Extend is landscape-only in Flow
unsupported feature combinations can trigger a “switching you to a compatible model” message

This is exactly why Veo should be treated as a family of workflows, not a single checkbox list.
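One way to make that concrete is a small compatibility check run before generating. The sketch below records only the two Flow restrictions mentioned above. The dictionary layout, the mode names as keys, and the `check_flow` helper are my own assumptions, and Flow's real feature matrix has more rows and changes over time.

```python
# Only the Flow facts stated in this guide; everything else is left unknown.
INGREDIENTS_SUPPORTED = {"Fast": True, "Quality": False}
LANDSCAPE = "16:9"  # assumption: "landscape" in Flow corresponds to 16:9


def check_flow(mode: str, feature: str, aspect_ratio: str = LANDSCAPE) -> str:
    """Describe whether a Flow mode/feature combination is expected to work."""
    if feature == "Ingredients to Video" and mode in INGREDIENTS_SUPPORTED:
        if INGREDIENTS_SUPPORTED[mode]:
            return "supported"
        return "unsupported: expect a switch to a compatible model"
    if feature == "Extend" and aspect_ratio != LANDSCAPE:
        return "unsupported: Extend is landscape-only in Flow"
    return "not covered by this table"


print(check_flow("Quality", "Ingredients to Video"))
print(check_flow("Fast", "Extend", aspect_ratio="9:16"))
```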

Prompting theory for Veo 3.1: think like a soundstage director

Kling often rewards camera-direction thinking. Hailuo often rewards performance-first thinking. Runway often rewards edit-intent thinking.

Veo 3.1 rewards full-scene staging.

That means a good Veo prompt usually answers:

what the camera is looking at
what the subject is doing
what the environment contributes
what the scene should feel like visually
what the scene should sound like

Google’s official prompting guide suggests a five-part formula:

[Cinematography] + [Subject] + [Action] + [Context] + [Style & Ambiance]

That is already useful, but for real work I would extend it into a practical six-part prompt structure:

Subject	Action	Scene	Camera	Lighting	Audio
who or what is on screen	what happens physically	where it happens	how the shot is seen	what shapes the mood visually	dialogue, ambient sound, or SFX

This extra Audio column matters because Veo is one of the few mainstream video models where sound can no longer be treated like an afterthought.

The most useful prompting habits

I would use these rules:

start with the camera language, not just the subject
describe sound deliberately instead of leaving it vague
avoid contradicting your image references with your text prompt
use references when consistency matters more than improvisation
use first-and-last-frame when you need a specific transformation

A simple Veo prompt formula I would actually use

[Shot type / movement] + [main subject] + [precise action] + [environment] + [lighting / mood] + [sound design or dialogue]

Example:

Medium tracking shot, a woman in a silver raincoat moving through a neon market alley, glancing back over her shoulder as paper lanterns sway above her. Wet pavement, crowded night market, reflective surfaces, cool magenta-blue lighting, atmospheric cyberpunk realism. Ambient sound: distant chatter, rain, scooter hum.

That is much more useful than simply writing “a cool cyberpunk woman walking in the rain.”
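If you write many prompts, a tiny template keeps the formula honest. This is a sketch under my own assumptions: the `ScenePrompt` class and its field names are hypothetical, and the output is just a string to paste into whichever Veo surface you use.

```python
from dataclasses import dataclass


@dataclass
class ScenePrompt:
    shot: str         # shot type / camera movement
    subject: str      # main subject
    action: str       # precise action
    environment: str  # where it happens
    lighting: str     # lighting / mood
    audio: str = ""   # sound design or dialogue, e.g. "Ambient sound: rain"

    def render(self) -> str:
        """Join the parts in formula order, camera language first."""
        visual = f"{self.shot}, {self.subject} {self.action}."
        parts = [visual, f"{self.environment}, {self.lighting}."]
        if self.audio:
            parts.append(self.audio.rstrip(".") + ".")
        return " ".join(parts)


prompt = ScenePrompt(
    shot="Medium tracking shot",
    subject="a woman in a silver raincoat",
    action="moving through a neon market alley",
    environment="Wet pavement, crowded night market",
    lighting="cool magenta-blue lighting",
    audio="Ambient sound: distant chatter, rain, scooter hum",
)
print(prompt.render())
```

Leaving `audio` empty simply drops the last sentence, which makes it obvious at a glance when a prompt has no sound direction at all.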

21 Veo 3.1 prompt examples

These prompts are original examples written to follow the logic of Google’s official Veo prompting guidance; none are copied from the official examples.

The two Veo generations below were made from prompts in this guide, so the reader can see what the model looks like in practice inside the article itself.

Cinematic example

Generated from the boxer prompt: low-angle morning run, pale blue haze, gritty realism.

Product ad example

Generated from the skincare serum prompt: bright editorial tabletop motion with a premium beauty-ad feel.

Cinematic

Cinematic prompt 1

Wide establishing shot, a solitary lighthouse keeper opening the door to a cliffside lantern room during a violent storm. Ocean spray lashes the windows, cold dawn light breaks through dense clouds, weathered realism. Ambient sound: crashing waves, metal rattling, wind howling.

Cinematic prompt 2

Slow dolly in, an elderly violin maker adjusting a single string by candlelight in a cramped wooden workshop. Dust floats in the air, warm amber light, intimate period-drama mood. Ambient sound: wood creaking softly, distant thunder, a delicate violin note.

Cinematic prompt 3

Low-angle tracking shot, a young boxer jogging alone under an elevated train before sunrise. Empty city blocks, pale blue morning haze, gritty cinematic realism. Ambient sound: rhythmic footsteps, train rumble overhead, faint breath in cold air.

Product ad

Product ad prompt 4

Macro lens close-up, a matte black smartwatch rotating slowly on a wet basalt stone while droplets slide across the glass. Minimal studio environment, high-contrast rim lighting, premium luxury ad style. SFX: soft electronic pulse, water droplets tapping stone.

Product ad prompt 5

Smooth tabletop dolly shot, a bottle of citrus skincare serum placed beside sliced yuzu and dewy leaves as morning light passes through a bathroom window. Clean editorial beauty setup, bright natural lighting, fresh premium tone. Ambient sound: quiet room tone, water trickle, soft glass contact.

Product ad prompt 6

Wide-to-medium reveal, a futuristic electric bike standing in a concrete parking structure as bands of sunlight move across the frame. Architectural shadows, polished commercial look, subtle metallic highlights. Ambient sound: light electrical hum, distant city traffic, soft tire roll.

Social media

Social media prompt 7

Vertical handheld shot, a creator opens a tiny studio apartment makeover reveal by pulling back a curtain to show a dramatically transformed space. Bright afternoon light, energetic lifestyle mood, clean modern decor. Ambient sound: excited laugh, fabric swish, apartment room tone.

Social media prompt 8

Vertical medium shot, a barista lifts a finished iced matcha latte toward the camera while sunlight flickers through a café window. Casual creator aesthetic, warm spring color palette, authentic social vibe. Ambient sound: ice clinking, espresso machine hiss, low café chatter.

Social media prompt 9

Vertical tracking shot, a streetwear designer walks through racks of garments and pauses to touch a bold embroidered jacket. Loft studio environment, soft directional light, documentary-fashion tone. Ambient sound: hangers sliding, muted footsteps, distant sewing machine.

Music video

Music video prompt 10

Slow circular dolly shot, a singer in a mirrored silver coat stands in the center of an empty swimming pool at night as shallow water reflects moving lights. Dream-pop atmosphere, blue and violet haze, cinematic concert styling. Audio: the singer softly repeats "stay with the light" over distant synth pads.

Music video prompt 11

Wide crane shot, a drummer performs alone on a rooftop in heavy fog while red aviation lights blink in the distance. Industrial skyline, moody monochrome palette, dramatic backlight. Ambient sound: drum hits echoing in open air, low city wind.

Music video prompt 12

POV shot moving through a crowded underground club toward a dancer in a gold mask beneath strobing amber light. Sweaty, chaotic, high-energy electronic music video mood. Ambient sound: muffled bass, crowd noise, sharp hi-hat accents.

Lifestyle

Lifestyle prompt 13

Medium handheld shot, a father teaches his daughter to ride a bicycle down a tree-lined neighborhood street at golden hour. Soft summer light, suburban realism, warm family-ad tone. Ambient sound: bicycle wheels on pavement, birds, distant laughter.

Lifestyle prompt 14

Close-up with shallow depth of field, a ceramic artist brushes glaze onto a handmade bowl near an open studio window. Calm creative workspace, muted earthy palette, tactile lifestyle storytelling. Ambient sound: brush on clay, soft breeze, distant urban birds.

Lifestyle prompt 15

Tracking shot, a traveler steps off an early train into a quiet mountain town wrapped in mist and carries a worn leather bag down the platform. Cool morning light, reflective travel-film mood. Ambient sound: train brakes fading, suitcase wheels, distant station announcements.

Fantasy

Fantasy prompt 16

Wide fantasy shot, a young queen walks through a flooded stone hall while glowing fish move beneath the water around her boots. Moonlit ruin, silver-blue palette, regal dark fantasy mood. Ambient sound: water ripples, distant choir, soft echoing footsteps.

Fantasy prompt 17

Low-angle push in, a masked forest guardian raises a lantern as thousands of spores illuminate the trees around them. Ancient woodland at dusk, enchanted realism, rich teal and gold light. Ambient sound: insects, leaves stirring, faint magical chime.

Fantasy prompt 18

High-angle reveal, a floating market suspended among clouds drifts past enormous tethered balloons while merchants cross rope bridges. Soft sunrise light, painterly fantasy worldbuilding. Ambient sound: ropes creaking, wind, faraway voices and bells.

Advanced control / transitions

Advanced control / transitions prompt 19

Create a smooth transition between the provided start frame of a deserted museum hallway and the provided end frame of a hidden gallery chamber. The camera glides forward through the corridor, passes a marble statue, and arrives in the final room as warm hidden lights gradually come alive. Ambient sound: echoing footsteps, low room tone, subtle electric hum.

Advanced control / transitions prompt 20

Using the provided character and apartment reference images, create a medium shot of the same woman sitting at a kitchen table at 2 a.m., staring at an unfinished letter before slowly folding it shut. Cool refrigerator light, quiet emotional realism. Ambient sound: refrigerator buzz, distant city traffic, paper folding.

Advanced control / transitions prompt 21

[00:00-00:02] Medium shot, a marine biologist in orange rain gear kneels on a stormy shoreline examining a glowing shell. Ambient sound: waves and wind. [00:02-00:04] Close-up, she lifts the shell and it pulses with blue light. SFX: low harmonic resonance. [00:04-00:06] Reverse shot, reflected light washes across her face as she looks toward the sea. [00:06-00:08] Wide crane shot, the shoreline begins to shimmer with hundreds of glowing shells beneath the night rain.
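Timestamped prompts like this one are tedious to write by hand, so it can help to assemble them from a shot list. A minimal sketch: the `[MM:SS-MM:SS]` format copies the prompt above, while `build_timeline`, the two-second default, and the shortened shot descriptions are my own illustrative choices.

```python
def stamp(seconds: int) -> str:
    """Format whole seconds as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def build_timeline(shots: list[str], seconds_per_shot: int = 2) -> str:
    """Prefix each shot with a [start-end] range and join into one prompt."""
    segments = []
    for index, shot in enumerate(shots):
        start = index * seconds_per_shot
        end = start + seconds_per_shot
        segments.append(f"[{stamp(start)}-{stamp(end)}] {shot}")
    return " ".join(segments)


shots = [
    "Medium shot, a marine biologist kneels on a stormy shoreline.",
    "Close-up, she lifts the shell and it pulses with blue light.",
    "Reverse shot, reflected light washes across her face.",
    "Wide crane shot, the shoreline shimmers with glowing shells.",
]
print(build_timeline(shots))
```

Four shots at two seconds each fill an 8-second clip, the longest duration listed in the supported features.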

How to build a Veo 3.1 workflow in Phygital+

Veo becomes much more useful when it is treated as one layer in a bigger workflow rather than as the entire workflow.

In practice, I would use Phygital+ like this:

Start with a concept image, storyboard frame, or clean prompt concept.
Branch into Veo 3.1 Lite or Fast for several motion directions.
Keep only the branch where subject, motion, and audio logic actually work together.
Add references if the same character or product must survive into the next shot.
Use first-and-last-frame or extend when the scene needs continuity instead of another disconnected clip.
Move the most promising branch into Veo 3.1 for the strongest final output.

That matters because the creative problem is usually not “generate one random clip.”

Usually the problem is closer to:

keep this character stable
test three motion directions
preserve the same mood across several clips
connect one shot to the next
avoid restarting from zero every time

That is exactly where a node-based workspace helps. Instead of bouncing between tabs, lost prompts, and disconnected reference files, you can keep generation, branching, comparison, and refinement inside one visual pipeline.

[Screenshot placeholder: Phygital+ workflow with Veo branches, reference images, and final selected output]

This is the part people often underestimate. Veo is powerful, but the real creative advantage comes from how you organize the iterations around it.

A practical Veo 3.1 workflow for consistency

If I needed one character to stay recognizable across several clips, I would not rely on text alone.

I would do this instead:

Generate or prepare a strong anchor image of the character.
Create 2-3 Veo branches from that same starting point.
Compare which branch preserves identity, costume, and mood best.
Save the winning references and reuse them for the next shot.
Use first-and-last-frame only when the transition itself matters.

That kind of workflow is much more stable than rewriting the entire character from scratch in every prompt.

It is also where Phygital+ fits naturally. The value is not “Veo inside a box.” The value is seeing the whole decision tree clearly.

FAQ

Why is first-and-last-frame not working the way I expect?

In Google’s official model docs, some frame-controlled workflows have narrower constraints than normal text-to-video. For example, reference image-to-video is documented as 8 seconds only. In API workflows, feature support can also depend on the surface, model ID, and current SDK version.

Why did Flow switch me to a compatible model?

Because Flow does not expose every feature in every Veo mode. Google’s Flow help docs explicitly say that unsupported feature combinations can trigger a switch to a compatible model.

Why are my ingredients not keeping the character consistent?

Google’s Flow guidance warns against conflicting instructions between prompt and visual references. It also recommends clean subject or product references on plain or segmented backgrounds. If the references are noisy or your text contradicts them, consistency usually drops.

Why is audio behaving strangely in some generations?

Because audio generation is still an actively improving part of the Veo experience. Google’s Flow help pages specifically note known issues, including muted speech when minors appear and on-screen subtitles triggering incorrectly in some speech generations.

Should I use Veo 3.1 Quality or Fast in Flow?

Use Quality when the final clip matters more than flexibility. Use Fast when you need more iteration speed or a feature combination that is only available there. Flow’s own feature matrix currently shows that some controls, like Ingredients to Video, are not exposed identically across the two.

When should I use Lite instead of Fast?

Use Lite when the goal is broad iteration, cheaper testing, or high-volume generation. Use Fast when you still care about speed, but want a stronger working-quality layer before moving to the flagship model.

Is Veo better for free prompting or controlled workflows?

It can do both, but I think Veo becomes much more valuable in controlled workflows. The moment you need continuity, references, staged audio, or transitions between shots, the model benefits from a pipeline mindset.
