Guide, updated May 10, 2026 · 6 min read

Turn a song into a video, from one audio file in

You have a track you keep coming back to. Suno output, a DAW export, a studio cut, a voiceover, a podcast clip. To put it on a feed, the audio is not enough. You need visuals, synced captions, and a structure that survives autoplay. This guide shows how to do all of that from a single audio file.

To turn a song into a video, take the audio file and pair it with visuals (one cover image or a set of scene images), generate synced captions from the lyrics or speech, add an outro, and render the whole thing as a vertical 9:16 video. The cleanest way to do this without juggling tools is a music to video flow that takes the audio in and outputs the finished video, with the captions, scene transitions, and overlays already wired together.

Most creators already have the song. The bottleneck is the video around it. The default DIY recipe: open a video editor (CapCut, Premiere), find a stock cover or generate images somewhere, drop the audio in, run a separate transcription tool for the lyrics, hand-time the captions, build an end card, export, and hope nothing drifts. This guide walks through what each part actually needs (audio, captions, cover or scenes, elements, outro, render) and shows the same job inside one flow that takes an audio file in and gives a finished video out.

Before you start

  • An audio file you own the rights to (MP3, WAV, M4A). The track, narration, or any audio that will drive the timeline.
  • An idea of the visual: a single cover image, or a few images you want as scenes mapped to the song.
  • Optional: a logo or a watermark you want overlaid on the video.

What changes when one tool handles the whole job

| Step | DIY stack | Dayvid Music to Video |
| --- | --- | --- |
| Audio in | Drop into a video editor, line up the timeline by hand | Upload one file, the timeline is the song |
| Captions | Run a separate transcription tool, copy the SRT, hand-time corrections | Auto-transcribed from the audio, edit inline, style in the same step |
| Visuals | Stock search or AI image tool in another tab, download, drag in | Pick a single cover image, or moving images mode for multiple scene images |
| Scene timing | Set keyframes manually, eyeball where the chorus hits | Map images to time ranges in the Scenes step |
| Logo and overlays | Re-add per project, hope sizes match across videos | Pick from saved elements, including MOV overlays with alpha |
| Outro | Build an end card from a template, save and reuse if you remember | Pick from your outros or skip |
| Output | Export, transcode, hope the aspect ratio is right | Vertical 9:16 video, ready for a feed |

1. Start the project and upload the audio

Open the Music to Video flow and name the project. The Setup step lets you pick a preset (a saved style for caption font, color, animation, layout) or save a new one as you go. From there, the Audio step is the only place the song needs to be uploaded. Drop the file in. The audio becomes the spine of the video: every other step times itself off this track. One audio track per project, so if you are layering narration over music, mix them down to a single file before this step.

  • Supported audio: standard formats (MP3, WAV, M4A). Use a clean export from your DAW or AI music tool.
  • If the song has a long silent intro, trim it before upload. The video will follow whatever the file gives it.
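Both prep tasks above (mixing narration over music into one file, and trimming a silent intro) happen before upload, in whatever audio tool you prefer. As one possible sketch, assuming you have ffmpeg installed, the helpers below only build the command lines. The function names and the -50 dB threshold are illustrative choices, not something the flow requires.

```python
from typing import List


def mixdown_command(music: str, narration: str, out: str) -> List[str]:
    """Build an ffmpeg command that mixes two audio files into one."""
    return [
        "ffmpeg", "-i", music, "-i", narration,
        # amix mixes the inputs into one stream; duration=longest keeps
        # the output as long as the longer input (usually the music bed).
        "-filter_complex", "amix=inputs=2:duration=longest",
        out,
    ]


def trim_intro_command(src: str, out: str, threshold_db: int = -50) -> List[str]:
    """Build an ffmpeg command that drops leading silence from a file."""
    # start_periods=1 trims only the silence at the start of the file.
    # threshold_db is an assumed starting point; tune it for your track.
    audio_filter = f"silenceremove=start_periods=1:start_threshold={threshold_db}dB"
    return ["ffmpeg", "-i", src, "-af", audio_filter, out]
```

To actually run one of these, pass the list to `subprocess.run(cmd, check=True)`. Listen to the result before uploading, since the mix levels and the silence threshold usually need a pass by ear.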

2. Auto-transcribe and style the captions

The Subtitles step runs the audio through transcription and produces word-level captions synced to the song. Review the transcript, fix anything the model heard wrong (artist names, made-up words, stylized lyrics), then style the captions: font, size, color, highlight color for the active word, position on screen. The captions are part of the render, not a separate file. They survive feed compression and they keep the video watchable when sound is off, which is common for autoplay on Shorts and Reels.

  • Word-level captions (one or two words highlighted at a time) read better in vertical than full-line captions.
  • If the lyrics are repetitive, fix one chorus then copy the corrections through. The transcription is editable plain text.
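To make the two ideas above concrete, here is a minimal sketch of what word-level captions look like as data, and how a single correction can be carried through every repeat of a chorus. The `(text, start, end)` tuple shape and both function names are assumptions for illustration; the flow's internal format is not documented in this guide.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, float, float]  # (text, start_seconds, end_seconds)


def apply_corrections(words: List[Word], fixes: Dict[str, str]) -> List[Word]:
    """Replace every misheard word, matching case-insensitively."""
    return [(fixes.get(text.lower(), text), start, end) for text, start, end in words]


def group_words(words: List[Word], size: int = 2) -> List[Word]:
    """Merge consecutive words into caption chunks of at most `size` words."""
    chunks = []
    for i in range(0, len(words), size):
        group = words[i:i + size]
        text = " ".join(w[0] for w in group)
        # A chunk starts with its first word and ends with its last word.
        chunks.append((text, group[0][1], group[-1][2]))
    return chunks
```

With `fixes = {"sooner": "Suno"}`, every occurrence of the misheard word is corrected in one pass while its timing stays untouched, which is why fixing one chorus and copying the fix through is cheap.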

3. Pick a cover, or switch on moving images

The Cover step decides the look behind the audio. Two modes. Single image: one static background that holds the whole song. Good for lyric videos, tracks with strong cover art, or when the song carries the energy by itself. Moving images: multiple images that change with the song, each tied to a time range. This is where a song stops feeling like a still and starts feeling like a music video. Pick the mode based on how busy the audio is. A slow ballad does fine on one image. A four-on-the-floor track wants the visual to move.

  • If you are not sure, start with single image. You can switch to moving images on the next render and reuse the rest.
  • The cover sits behind everything else: captions and overlays sit on top of it.
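Because the output is vertical 9:16, it helps to know which part of a cover image can fill that frame. This guide does not say how the flow fits an image into the frame, so the sketch below assumes a plain center crop and only does the arithmetic. The function name is hypothetical.

```python
from typing import Tuple


def center_crop_9x16(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered 9:16 box."""
    if width * 16 > height * 9:
        # Image is wider than 9:16: keep full height, trim the sides.
        crop_w, crop_h = height * 9 // 16, height
    else:
        # Image is 9:16 or taller: keep full width, trim top and bottom.
        crop_w, crop_h = width, width * 16 // 9
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h
```

For a square 3000 x 3000 cover, this keeps a strip 1687 pixels wide in the middle, so anything important near the left or right edge of square cover art would fall outside the frame under this assumption.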

4. If moving images, upload assets and map scenes

When moving images mode is on, two extra steps appear. The Assets step is where the scene images come from: upload your own (drag a folder in), pick from the library (anything you have used or generated before), or mix both. The Scenes step is where each image gets a time range in the song. Drag the boundaries to match the structure of the track: intro on image 1, verse on image 2, chorus on image 3, bridge on image 4, outro back on the cover. The boundaries are how the visual matches the audio without keyframes.

  • A four-minute song does well with 6 to 10 scene images. More than that gets choppy in vertical.
  • Reuse images across projects through the Library; assets you uploaded once stay there.
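The scene mapping described above boils down to a sorted list of start times. As a rough sketch (the names and data shapes are assumptions, not the flow's own format), the code below spreads images evenly as a starting point and looks up which image is on screen at a given moment. For a four-minute song, 6 to 10 images works out to 40 to 24 seconds per scene.

```python
from bisect import bisect_right
from typing import List, Tuple


def even_scenes(duration: float, images: List[str]) -> List[Tuple[float, str]]:
    """Give each image an equal slice; returns (start_seconds, image) pairs."""
    step = duration / len(images)
    return [(i * step, image) for i, image in enumerate(images)]


def image_at(scenes: List[Tuple[float, str]], t: float) -> str:
    """Look up which image is on screen at time t (seconds)."""
    starts = [start for start, _ in scenes]
    # bisect_right finds the last scene whose start is <= t.
    index = max(bisect_right(starts, t) - 1, 0)
    return scenes[index][1]
```

An even split is only a first draft. In practice you would then move the boundaries to where the verse, chorus, and bridge actually start, which is what dragging the boundaries in the Scenes step does.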

5. Add elements and an outro

The Elements step is where overlays go: a channel logo in a corner, a watermark, a decorative piece tied to the brand, a MOV overlay with alpha for animated effects (rain, sparkles, light leaks). These persist across the whole video by default, or you can limit them to a time range if you want them to come and go. The Outro step picks the end card: a saved outro video that runs after the song finishes (subscribe prompt, channel name, social handles), or skip it if the audio carries to the end. Both steps draw from saved assets, so videos in the same channel stay visually consistent without rebuilding from scratch each time.

  • MOV overlays with alpha keep transparency, so they stack cleanly on top of the cover or scenes.
  • If you ship videos on a recurring schedule, save one outro and reuse it across every track to anchor the channel.
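The "whole video by default, or a time range" behavior of elements can be pictured as an optional start and end on each overlay. This is a minimal sketch under that assumption; the `Element` class and `active_elements` function are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    name: str
    start: Optional[float] = None  # None means "from the beginning"
    end: Optional[float] = None    # None means "until the end"


def active_elements(elements: List[Element], t: float) -> List[str]:
    """Return the names of the overlays visible at time t (seconds)."""
    names = []
    for element in elements:
        after_start = element.start is None or t >= element.start
        before_end = element.end is None or t < element.end
        if after_start and before_end:
            names.append(element.name)
    return names
```

A logo with no range stays on screen for the whole video, while a rain overlay set to 10 to 20 seconds appears only during that window.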

6. Render and grab the video

The Render step submits the job. The output is a vertical 9:16 video with captions, scenes, elements, and outro baked in. Wait for the render to finish, then download the file or go straight to the Publish step if your brand is connected to a YouTube channel. The Publish step is optional: the rendered video sits in your library either way, ready to upload manually to whatever platform you ship on.

  • The render is one finished file, not a project file. There is no second editor pass needed.
  • If you want to tweak the captions or swap a scene image, edit the project and re-render. The earlier choices stay saved.

Frequently asked questions

You can use anything you have as an audio file: original tracks from a DAW, AI music output (Suno, Udio), studio recordings, narration, podcast clips, voice memos. The flow does not care about the source. It cares that the audio is clean and the file is a standard format.

You do not need a script. The transcription comes from the audio in the Subtitles step. The visuals come from a cover image you provide (or a set of scene images for moving images mode). Nothing in the flow asks for a script. The audio is the script.

Start with single-image mode and a placeholder cover. The video still renders with full captions and structure. You can swap the cover or switch to moving images later and re-render. The captions, elements, and outro stay saved.

The output is vertical 9:16, sized for short-form feeds and Shorts. The flow renders one video per project at this aspect ratio.

The flow does not generate the song. You bring the audio. The flow handles everything that wraps around the audio (captions, visuals, scenes, elements, outro, render), but the song itself comes from you, your DAW, your studio session, or an AI music tool you already use.

Word-level transcription is accurate on clean vocals and clear narration. It struggles with heavy reverb, stylized lyrics, made-up words, and rare proper nouns. The transcript is editable inline before render, so anything the model misses is a one-line fix.

A style can be reused across songs. The Setup step saves the project as a preset: caption style, layout, animation, brand colors. The next song you make in the flow starts from that preset, so a series of singles or a weekly drop stays visually consistent without rebuilding.

If your brand is connected to a YouTube channel, the Publish step at the end of the flow uploads the rendered video to your channel as a private draft, with title, description, tags, and thumbnail filled in. You flip it to public from YouTube Studio when you are ready. For other platforms, download the video from your library and upload manually.
