Veo 3.1 prompts guide: how to create cinematic videos with audio, references, and more control

When I first started looking at Veo 3.1 seriously, what stood out to me was not only the image quality. It was the fact that Google was clearly pushing it beyond silent clip generation.

Veo 3.1 feels different because it treats video more like an audiovisual scene. You are not only describing what the camera should see. You are often describing what the scene should sound like, how it should move, and how one shot should connect to the next.

That is why I would not describe Veo 3.1 as just another text-to-video model. I think of it more like a soundstage director. It is strongest when you give it a full scene to stage: subject, movement, environment, camera language, mood, and sometimes even dialogue or background sound.

In this guide I want to focus on four practical questions:

  • which Veo 3.1 version to use and why
  • which controls actually matter in a real workflow
  • how to prompt Veo 3.1 without fighting the model
  • how to use Veo inside a more repeatable pipeline in Phygital+

Veo 3.1 versions compared: Veo 3.1 vs Fast vs Lite

The Veo 3.1 family makes more sense once you stop asking for one universal “best” model.

As of April 2026, Google positions the current lineup like this on Vertex AI:

  • Veo 3.1 for the highest visual fidelity and final production quality
  • Veo 3.1 Fast for faster generation with high quality
  • Veo 3.1 Lite for lower-cost, high-volume iteration

That distinction matters because the models are not only priced differently. They also fit different stages of the workflow.

Here is the simplest practical way I would frame them:

| Version | Best use | Why I would use it |
| --- | --- | --- |
| Veo 3.1 | premium final shots | best when the clip itself needs to look like the strongest possible output |
| Veo 3.1 Fast | standard production workflows | good when I want speed without dropping too far in quality |
| Veo 3.1 Lite | high-volume testing and scaling | best for rapid branching, cheaper prompt iteration, and production systems that need volume |

Google’s model docs also separate the current GA models from older preview IDs. The important practical point is this: if you are building now, use the current Veo 3.1, Veo 3.1 Fast, and Veo 3.1 Lite generation models, not the older preview endpoints that were phased out.

My working rule

I would use the lineup like this:

  • start in Lite when I need many prompt branches cheaply
  • move to Fast when the idea is clearer and I want stronger iterations
  • move to Veo 3.1 when the shot is close to final and visual fidelity matters more than speed

That is a much better way to use Veo than throwing your highest-cost model at every first draft.
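
The working rule above can be written down as a small lookup. This is only a sketch: the stage names (`explore`, `refine`, `final`) are hypothetical labels I made up for illustration, and the returned strings are display names from the lineup, not API model IDs.

```python
# Hypothetical stage names mapped to the tiers from the lineup above.
# The values are display names, not API model IDs.
TIER_BY_STAGE = {
    "explore": "Veo 3.1 Lite",  # many cheap prompt branches
    "refine": "Veo 3.1 Fast",   # clearer idea, stronger iterations
    "final": "Veo 3.1",         # near-final shot, fidelity over speed
}


def pick_tier(stage: str) -> str:
    """Return the tier this guide's working rule suggests for a stage."""
    if stage not in TIER_BY_STAGE:
        expected = ", ".join(sorted(TIER_BY_STAGE))
        raise ValueError(f"unknown stage {stage!r}; expected one of: {expected}")
    return TIER_BY_STAGE[stage]


for stage in ("explore", "refine", "final"):
    print(f"{stage:>8} -> {pick_tier(stage)}")
```

Keeping the rule in one place makes it harder to reach for the most expensive tier by habit.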

Veo 3.1 pricing in Phygital: what matters in practice

In this section I give pricing in Phygital credits, not in Vertex AI dollars.

Based on the current Phygital UI screenshots checked on April 16, 2026, the visible cost for the same setup is:

  • Veo 3.1 = 1460 credits
  • Veo 3.1 Fast = 560 credits

These screenshots show the same practical settings:

  • 16:9
  • 720p
  • 4-second duration
  • Generate audio turned off

That is the most useful comparison for a reader because it reflects the real choice inside the Phygital workspace, not an abstract API billing table.

The practical conclusion is simple:

  • Veo 3.1 Fast is the cheaper branching layer
  • Veo 3.1 is the premium finishing layer

For this exact setup, Veo 3.1 costs a little more than 2.5x the credits of Veo 3.1 Fast, so I would not use it for early exploration unless I already knew the shot direction was strong.
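
To make the arithmetic concrete, here is a minimal sketch that uses only the two credit figures from the screenshots above (16:9, 720p, 4 seconds, audio off). The function names and the "five Fast drafts plus one final render" plan are illustrative assumptions, and credit prices may differ for other settings.

```python
# Credit costs taken from the Phygital screenshots cited above
# (16:9, 720p, 4 seconds, audio off). Other settings may cost more or less.
CREDITS = {"Veo 3.1": 1460, "Veo 3.1 Fast": 560}


def cost_ratio() -> float:
    """How many times more credits Veo 3.1 costs than Veo 3.1 Fast."""
    return CREDITS["Veo 3.1"] / CREDITS["Veo 3.1 Fast"]


def plan_cost(fast_drafts: int, final_renders: int) -> int:
    """Total credits for a plan mixing Fast drafts and Veo 3.1 finals."""
    return (fast_drafts * CREDITS["Veo 3.1 Fast"]
            + final_renders * CREDITS["Veo 3.1"])


print(round(cost_ratio(), 2))  # 2.61
print(plan_cost(5, 1))         # 4260: five Fast drafts, one Veo 3.1 final
print(plan_cost(0, 6))         # 8760: all six generations on Veo 3.1
```

In this hypothetical plan, exploring in Fast and finishing once in Veo 3.1 costs less than half of running all six generations on the premium tier.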

Lite stays in the version comparison as the lower-cost tier from Google’s official model lineup, but I am not quoting a Phygital credit number for it here, because I do not yet have a verified in-product screenshot for that mode.

[Screenshot: Phygital credit pricing, Veo 3.1 at 1460 credits for 16:9, 720p, 4 seconds, audio off]
[Screenshot: Phygital credit pricing, Veo 3.1 Fast at 560 credits for the same setup]

What Veo 3.1 officially supports

This part matters most before you plan a shot around something that the model or the surface may not fully support.

According to Google’s official Vertex AI docs, Veo 3.1 supports:

  • text-to-video
  • image-to-video
  • prompt rewriting
  • reference asset images
  • extend video
  • first-and-last-frame generation
  • 9:16 and 16:9 output
  • 4, 6, or 8 second clips
  • up to 4 outputs per prompt
  • 24 FPS

There are also a few practical constraints worth remembering:

  • reference image-to-video supports only 8-second clips
  • image-to-video input images can be up to 20 MB
  • 4k support is still marked as preview in Vertex AI docs
  • support can differ depending on whether you are using Vertex AI, Gemini API, or Flow

That last point matters a lot. Veo is not one single interface. The model family is shared, but some controls arrive in one surface earlier than another or are exposed in slightly different ways.
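
The limits listed above are easy to check before spending credits. The sketch below is a local pre-flight check only, not part of any Google SDK: the `VideoRequest` class and its field names are hypothetical, the limits are the ones quoted in this guide from the Vertex AI docs, and treating "20 MB" as 20 × 1024 × 1024 bytes is my assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits as quoted in this guide from the Vertex AI docs.
ALLOWED_ASPECT_RATIOS = {"16:9", "9:16"}
ALLOWED_DURATIONS = {4, 6, 8}           # seconds
MAX_OUTPUTS = 4                         # videos per prompt
MAX_IMAGE_BYTES = 20 * 1024 * 1024      # assumption: "20 MB" read as binary MB


@dataclass
class VideoRequest:
    """Hypothetical request shape, used only for this local check."""
    aspect_ratio: str = "16:9"
    duration_seconds: int = 8
    num_outputs: int = 1
    uses_reference_images: bool = False
    input_image_bytes: Optional[int] = None


def validate(req: VideoRequest) -> List[str]:
    """Return a list of human-readable problems; empty means no known issue."""
    problems = []
    if req.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        problems.append(f"aspect ratio {req.aspect_ratio} is not 16:9 or 9:16")
    if req.duration_seconds not in ALLOWED_DURATIONS:
        problems.append(f"duration {req.duration_seconds}s is not 4, 6, or 8")
    if not 1 <= req.num_outputs <= MAX_OUTPUTS:
        problems.append(f"output count {req.num_outputs} is outside 1-{MAX_OUTPUTS}")
    if req.uses_reference_images and req.duration_seconds != 8:
        problems.append("reference image-to-video supports only 8 seconds")
    if req.input_image_bytes is not None and req.input_image_bytes > MAX_IMAGE_BYTES:
        problems.append("input image is larger than 20 MB")
    return problems


print(validate(VideoRequest(duration_seconds=6, uses_reference_images=True)))
```

Because support differs between Vertex AI, Gemini API, and Flow, a passing check here does not guarantee that a given surface will accept the request.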

How to use Veo 3.1 features without overcomplicating the workflow

The easiest mistake with Veo is trying to use every control at once.

The better approach is to ask: what problem am I solving in this shot?

Text-to-video

This is the cleanest starting point when the idea is still fluid.

Use it when:

  • you are exploring concepts
  • you want to test mood, action, or camera language
  • you do not yet have a strong visual anchor

Why it matters:

This is the fastest way to learn how the model interprets your scene logic before you add more constraints.

Image-to-video

This is where Veo becomes much more reliable for commercial work.

Use it when:

  • the opening frame matters
  • the composition already exists
  • brand, product, or character continuity matters

Why it matters:

Once the first frame is anchored, the generation stops feeling like a total re-invention every time.

Reference asset images / ingredients

Google’s official docs and Flow help pages make this one of the most important practical features.

Use it when:

  • the same character must stay recognizable across multiple clips
  • one object or product needs to remain stable
  • you want several shots to share the same visual identity

Why it matters:

Prompt wording alone is often not enough for consistency. Reference assets give the model something concrete to preserve.

First and last frame

This is one of the best storytelling features in the whole Veo stack.

Use it when:

  • you want a controlled transition between two images
  • you want to animate between two planned points of view
  • you need a specific beginning and ending composition

Why it matters:

This is much more controllable than hoping a freeform prompt lands on the transition you imagined.

Extend

Extend is the feature that turns short clip generation into sequencing.

Use it when:

  • one shot is close, but too short
  • you want to continue a motion beat
  • you want to build longer audiovisual continuity from existing clips

Why it matters:

It helps Veo become part of a scene-building workflow instead of a one-shot toy.

Native audio

This is the feature that changes how you prompt Veo.

Use it when:

  • dialogue matters
  • sound effects matter
  • ambient atmosphere is part of the scene itself

Why it matters:

You are not only directing visuals anymore. You are staging an audiovisual moment.

Veo 3.1 technical notes that are actually useful

Some technical details are worth surfacing because they change how I would work.

According to the official Vertex AI model docs:

  • Veo 3.1 and Veo 3.1 Fast support text and image input
  • Veo 3.1 Lite is currently in preview; the model family docs also cover it for text and image workflows, but with fewer supported control features than the higher tiers
  • output video format is MP4
  • framerate is 24 FPS
  • output count can go up to 4 videos per prompt

According to Flow help:

  • some features are model-specific inside Flow
  • Ingredients to Video is supported in Veo 3.1 Fast, but not in Veo 3.1 Quality
  • Extend is landscape-only in Flow
  • unsupported feature combinations can trigger a “switching you to a compatible model” message

This is exactly why Veo should be treated as a family of workflows, not a single checkbox list.
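
As an illustration of treating these notes as rules rather than a checkbox list, here is a small sketch that encodes only the two Flow constraints mentioned above. The mode names (`fast`, `quality`), feature names, and orientation strings are hypothetical labels for this example; Flow's real feature matrix is larger and can change.

```python
from typing import List


def flow_conflicts(mode: str, feature: str, orientation: str = "landscape") -> List[str]:
    """Return known conflicts for a Flow setup, based on the notes above.

    mode: "fast" or "quality" (hypothetical labels for the two Flow modes)
    feature: "ingredients" or "extend" (hypothetical labels)
    orientation: "landscape" or "portrait"
    """
    conflicts = []
    if feature == "ingredients" and mode == "quality":
        conflicts.append(
            "Ingredients to Video is not supported in Veo 3.1 Quality; "
            "Flow may switch to a compatible model"
        )
    if feature == "extend" and orientation != "landscape":
        conflicts.append("Extend is landscape-only in Flow")
    return conflicts


print(flow_conflicts("quality", "ingredients"))
print(flow_conflicts("fast", "extend", orientation="portrait"))
```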

Prompting theory for Veo 3.1: think like a soundstage director

Kling often rewards camera-direction thinking. Hailuo often rewards performance-first thinking. Runway often rewards edit-intent thinking.

Veo 3.1 rewards full-scene staging.

That means a good Veo prompt usually answers:

  • what the camera is looking at
  • what the subject is doing
  • what the environment contributes
  • what the scene should feel like visually
  • what the scene should sound like

Google’s official prompting guide suggests a five-part formula:

[Cinematography] + [Subject] + [Action] + [Context] + [Style & Ambiance]

That is already useful, but for real work I would extend it into a practical six-part prompt structure:

| Subject | Action | Scene | Camera | Lighting | Audio |
| --- | --- | --- | --- | --- | --- |
| who or what is on screen | what happens physically | where it happens | how the shot is seen | what shapes the mood visually | dialogue, ambient sound, or SFX |

This extra Audio column matters because Veo is one of the few mainstream video models where sound can no longer be treated like an afterthought.

The most useful prompting habits

I would use these rules:

  • start with the camera language, not just the subject
  • describe sound deliberately instead of leaving it vague
  • avoid contradicting your image references with your text prompt
  • use references when consistency matters more than improvisation
  • use first-and-last-frame when you need a specific transformation

A simple Veo prompt formula I would actually use

[Shot type / movement] + [main subject] + [precise action] + [environment] + [lighting / mood] + [sound design or dialogue]

Example:

Medium tracking shot, a woman in a silver raincoat moving through a neon market alley, glancing back over her shoulder as paper lanterns sway above her. Wet pavement, crowded night market, reflective surfaces, cool magenta-blue lighting, atmospheric cyberpunk realism. Ambient sound: distant chatter, rain, scooter hum.

That is much more useful than simply writing “a cool cyberpunk woman walking in the rain.”
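
If you write many prompts, the formula can be kept as a small template so that no part gets forgotten. This is a sketch of my own, not anything Veo requires: the `ShotPrompt` class and its field names are hypothetical, and the output is just a plain string in the same order as the formula above.

```python
from dataclasses import dataclass, asdict


@dataclass
class ShotPrompt:
    """Hypothetical container for the six parts of the prompt formula."""
    camera: str     # shot type / movement
    subject: str    # main subject
    action: str     # precise action
    scene: str      # environment
    lighting: str   # lighting / mood
    audio: str      # sound design or dialogue

    def render(self) -> str:
        # Refuse to render if any part is blank, so gaps are caught early.
        parts = {name: text.strip().rstrip(".") for name, text in asdict(self).items()}
        missing = [name for name, text in parts.items() if not text]
        if missing:
            raise ValueError(f"missing prompt parts: {', '.join(missing)}")
        return (
            f"{parts['camera']}, {parts['subject']} {parts['action']}. "
            f"{parts['scene']}, {parts['lighting']}. "
            f"{parts['audio']}."
        )


prompt = ShotPrompt(
    camera="Medium tracking shot",
    subject="a woman in a silver raincoat",
    action="moving through a neon market alley, glancing back over her "
           "shoulder as paper lanterns sway above her",
    scene="Wet pavement, crowded night market, reflective surfaces",
    lighting="cool magenta-blue lighting, atmospheric cyberpunk realism",
    audio="Ambient sound: distant chatter, rain, scooter hum",
)
print(prompt.render())
```

The point of the template is the blank check: a prompt with no audio line fails loudly instead of silently leaving sound to chance.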

21 Veo 3.1 prompt examples

These prompts are original examples that follow the logic of Google’s official Veo prompting guidance; they are not copied from official examples.

The two Veo generations below were made from prompts in this guide, so the reader can see what the model looks like in practice inside the article itself.

Cinematic example

Generated from the boxer prompt: low-angle morning run, pale blue haze, gritty realism.

Product ad example

Generated from the skincare serum prompt: bright editorial tabletop motion with a premium beauty-ad feel.

Cinematic

Cinematic prompt 1
Wide establishing shot, a solitary lighthouse keeper opening the door to a cliffside lantern room during a violent storm. Ocean spray lashes the windows, cold dawn light breaks through dense clouds, weathered realism. Ambient sound: crashing waves, metal rattling, wind howling.
Cinematic prompt 2
Slow dolly in, an elderly violin maker adjusting a single string by candlelight in a cramped wooden workshop. Dust floats in the air, warm amber light, intimate period-drama mood. Ambient sound: wood creaking softly, distant thunder, a delicate violin note.
Cinematic prompt 3
Low-angle tracking shot, a young boxer jogging alone under an elevated train before sunrise. Empty city blocks, pale blue morning haze, gritty cinematic realism. Ambient sound: rhythmic footsteps, train rumble overhead, faint breath in cold air.

Product ad

Product ad prompt 4
Macro lens close-up, a matte black smartwatch rotating slowly on a wet basalt stone while droplets slide across the glass. Minimal studio environment, high-contrast rim lighting, premium luxury ad style. SFX: soft electronic pulse, water droplets tapping stone.
Product ad prompt 5
Smooth tabletop dolly shot, a bottle of citrus skincare serum placed beside sliced yuzu and dewy leaves as morning light passes through a bathroom window. Clean editorial beauty setup, bright natural lighting, fresh premium tone. Ambient sound: quiet room tone, water trickle, soft glass contact.
Product ad prompt 6
Wide-to-medium reveal, a futuristic electric bike standing in a concrete parking structure as bands of sunlight move across the frame. Architectural shadows, polished commercial look, subtle metallic highlights. Ambient sound: light electrical hum, distant city traffic, soft tire roll.

Social media

Social media prompt 7
Vertical handheld shot, a creator opens a tiny studio apartment makeover reveal by pulling back a curtain to show a dramatically transformed space. Bright afternoon light, energetic lifestyle mood, clean modern decor. Ambient sound: excited laugh, fabric swish, apartment room tone.
Social media prompt 8
Vertical medium shot, a barista lifts a finished iced matcha latte toward the camera while sunlight flickers through a café window. Casual creator aesthetic, warm spring color palette, authentic social vibe. Ambient sound: ice clinking, espresso machine hiss, low café chatter.
Social media prompt 9
Vertical tracking shot, a streetwear designer walks through racks of garments and pauses to touch a bold embroidered jacket. Loft studio environment, soft directional light, documentary-fashion tone. Ambient sound: hangers sliding, muted footsteps, distant sewing machine.

Music video

Music video prompt 10
Slow circular dolly shot, a singer in a mirrored silver coat stands in the center of an empty swimming pool at night as shallow water reflects moving lights. Dream-pop atmosphere, blue and violet haze, cinematic concert styling. Audio: the singer softly repeats "stay with the light" over distant synth pads.
Music video prompt 11
Wide crane shot, a drummer performs alone on a rooftop in heavy fog while red aviation lights blink in the distance. Industrial skyline, moody monochrome palette, dramatic backlight. Ambient sound: drum hits echoing in open air, low city wind.
Music video prompt 12
POV shot moving through a crowded underground club toward a dancer in a gold mask beneath strobing amber light. Sweaty, chaotic, high-energy electronic music video mood. Ambient sound: muffled bass, crowd noise, sharp hi-hat accents.

Lifestyle

Lifestyle prompt 13
Medium handheld shot, a father teaches his daughter to ride a bicycle down a tree-lined neighborhood street at golden hour. Soft summer light, suburban realism, warm family-ad tone. Ambient sound: bicycle wheels on pavement, birds, distant laughter.
Lifestyle prompt 14
Close-up with shallow depth of field, a ceramic artist brushes glaze onto a handmade bowl near an open studio window. Calm creative workspace, muted earthy palette, tactile lifestyle storytelling. Ambient sound: brush on clay, soft breeze, distant urban birds.
Lifestyle prompt 15
Tracking shot, a traveler steps off an early train into a quiet mountain town wrapped in mist and carries a worn leather bag down the platform. Cool morning light, reflective travel-film mood. Ambient sound: train brakes fading, suitcase wheels, distant station announcements.

Fantasy

Fantasy prompt 16
Wide fantasy shot, a young queen walks through a flooded stone hall while glowing fish move beneath the water around her boots. Moonlit ruin, silver-blue palette, regal dark fantasy mood. Ambient sound: water ripples, distant choir, soft echoing footsteps.
Fantasy prompt 17
Low-angle push in, a masked forest guardian raises a lantern as thousands of spores illuminate the trees around them. Ancient woodland at dusk, enchanted realism, rich teal and gold light. Ambient sound: insects, leaves stirring, faint magical chime.
Fantasy prompt 18
High-angle reveal, a floating market suspended among clouds drifts past enormous tethered balloons while merchants cross rope bridges. Soft sunrise light, painterly fantasy worldbuilding. Ambient sound: ropes creaking, wind, faraway voices and bells.

Advanced control / transitions

Advanced control / transitions prompt 19
Create a smooth transition between the provided start frame of a deserted museum hallway and the provided end frame of a hidden gallery chamber. The camera glides forward through the corridor, passes a marble statue, and arrives in the final room as warm hidden lights gradually come alive. Ambient sound: echoing footsteps, low room tone, subtle electric hum.
Advanced control / transitions prompt 20
Using the provided character and apartment reference images, create a medium shot of the same woman sitting at a kitchen table at 2 a.m., staring at an unfinished letter before slowly folding it shut. Cool refrigerator light, quiet emotional realism. Ambient sound: refrigerator buzz, distant city traffic, paper folding.
Advanced control / transitions prompt 21
[00:00-00:02] Medium shot, a marine biologist in orange rain gear kneels on a stormy shoreline examining a glowing shell. Ambient sound: waves and wind. [00:02-00:04] Close-up, she lifts the shell and it pulses with blue light. SFX: low harmonic resonance. [00:04-00:06] Reverse shot, reflected light washes across her face as she looks toward the sea. [00:06-00:08] Wide crane shot, the shoreline begins to shimmer with hundreds of glowing shells beneath the night rain.
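
The timestamped layout of prompt 21 can also be assembled from a list of beats. The sketch below is a hypothetical helper that only formats text: it mirrors the bracketed `[MM:SS-MM:SS]` style used in that prompt and checks the total against the 4, 6, or 8 second clip lengths listed earlier. It does not guarantee that the model follows the timing exactly.

```python
from typing import List

ALLOWED_TOTALS = {4, 6, 8}  # clip lengths in seconds, as listed in this guide


def stamp(seconds: int) -> str:
    """Format a second count as MM:SS."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


def build_timeline_prompt(beats: List[str], beat_seconds: int = 2) -> str:
    """Join beats into one prompt with [MM:SS-MM:SS] prefixes."""
    total = len(beats) * beat_seconds
    if total not in ALLOWED_TOTALS:
        raise ValueError(f"total length {total}s is not 4, 6, or 8 seconds")
    parts = []
    for index, beat in enumerate(beats):
        start = index * beat_seconds
        end = start + beat_seconds
        parts.append(f"[{stamp(start)}-{stamp(end)}] {beat.strip()}")
    return " ".join(parts)


print(build_timeline_prompt([
    "Medium shot, a marine biologist kneels on a stormy shoreline. Ambient sound: waves and wind.",
    "Close-up, she lifts the shell and it pulses with blue light. SFX: low harmonic resonance.",
]))
```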

How to build a Veo 3.1 workflow in Phygital+

Veo becomes much more useful when it is treated as one layer in a bigger workflow rather than as the entire workflow.

In practice, I would use Phygital+ like this:

  1. Start with a concept image, storyboard frame, or clean prompt concept.
  2. Branch into Veo 3.1 Lite or Fast for several motion directions.
  3. Keep only the branch where subject, motion, and audio logic actually work together.
  4. Add references if the same character or product must survive into the next shot.
  5. Use first-and-last-frame or extend when the scene needs continuity instead of another disconnected clip.
  6. Move the most promising branch into Veo 3.1 for the strongest final output.

That matters because the creative problem is usually not “generate one random clip.”

Usually the problem is closer to:

  • keep this character stable
  • test three motion directions
  • preserve the same mood across several clips
  • connect one shot to the next
  • avoid restarting from zero every time

That is exactly where a node-based workspace helps. Instead of bouncing between tabs, lost prompts, and disconnected reference files, you can keep generation, branching, comparison, and refinement inside one visual pipeline.

[Screenshot placeholder: Phygital+ workflow with Veo branches, reference images, and final selected output]

This is the part people often underestimate. Veo is powerful, but the real creative advantage comes from how you organize the iterations around it.

A practical Veo 3.1 workflow for consistency

If I needed one character to stay recognizable across several clips, I would not rely on text alone.

I would do this instead:

  1. Generate or prepare a strong anchor image of the character.
  2. Create 2-3 Veo branches from that same starting point.
  3. Compare which branch preserves identity, costume, and mood best.
  4. Save the winning references and reuse them for the next shot.
  5. Use first-and-last-frame only when the transition itself matters.

That kind of workflow is much more stable than rewriting the entire character from scratch in every prompt.

It is also where Phygital+ fits naturally. The value is not “Veo inside a box.” The value is seeing the whole decision tree clearly.

FAQ

Why is first-and-last-frame not working the way I expect?

In Google’s official model docs, some frame-controlled workflows have narrower constraints than normal text-to-video. For example, reference image-to-video is documented as 8 seconds only. In API workflows, feature support can also depend on the surface, model ID, and current SDK version.

Why did Flow switch me to a compatible model?

Because Flow does not expose every feature in every Veo mode. Google’s Flow help docs explicitly say that unsupported feature combinations can trigger a switch to a compatible model.

Why are my ingredients not keeping the character consistent?

Google’s Flow guidance warns against conflicting instructions between prompt and visual references. It also recommends clean subject or product references on plain or segmented backgrounds. If the references are noisy or your text contradicts them, consistency usually drops.

Why is audio behaving strangely in some generations?

Because audio generation is still an actively improving part of the Veo experience. Google’s Flow help pages specifically note known issues, including muted speech when minors appear and on-screen subtitles triggering incorrectly in some speech generations.

Should I use Veo 3.1 Quality or Fast in Flow?

Use Quality when the final clip matters more than flexibility. Use Fast when you need more iteration speed or a feature combination that is only available there. Flow’s own feature matrix currently shows that some controls, like Ingredients to Video, are not exposed identically across the two.

When should I use Lite instead of Fast?

Use Lite when the goal is broad iteration, cheaper testing, or high-volume generation. Use Fast when you still care about speed, but want a stronger working-quality layer before moving to the flagship model.

Is Veo better for free prompting or controlled workflows?

It can do both, but I think Veo becomes much more valuable in controlled workflows. The moment you need continuity, references, staged audio, or transitions between shots, the model benefits from a pipeline mindset.
