This pipeline lets you create unlimited high-quality AI avatar videos from just a portrait image and an audio file, completely free and self-hosted.
It runs entirely inside ComfyUI using WAN 2.1 (14B, fp8) for image-to-video generation and InfiniteTalk for realistic lip-sync animation, producing output comparable to paid services like HeyGen and Kling.
Setup is streamlined with an included shell script that automatically downloads all required models (WAN 2.1, InfiniteTalk, MelBandRoformer, the WAN VAE, the UMT5-XXL text encoder, CLIP Vision, and the Lightx2v LoRA) into the correct ComfyUI folder structure.
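As a rough illustration of what the setup script does, the sketch below creates the standard ComfyUI model folders and wraps each download in a skip-if-exists check. The folder names follow common ComfyUI conventions, and the URLs are placeholders, not the real links from the bundled script.

```shell
#!/usr/bin/env sh
# Hedged sketch of the download script's layout. Paths assume the standard
# ComfyUI conventions; real model URLs live in the bundled script and are
# replaced with placeholders here.
COMFYUI_DIR="${COMFYUI_DIR:-$HOME/ComfyUI}"

mkdir -p \
  "$COMFYUI_DIR/models/diffusion_models" \
  "$COMFYUI_DIR/models/text_encoders" \
  "$COMFYUI_DIR/models/clip_vision" \
  "$COMFYUI_DIR/models/vae" \
  "$COMFYUI_DIR/models/loras"

# fetch() skips files that already exist, so re-running the script is cheap.
fetch() {
  dest="$1"; url="$2"
  if [ -f "$dest" ]; then
    echo "skip: $dest"
  else
    echo "would download $url -> $dest"  # swap echo for: wget -q -O "$dest" "$url"
  fi
}

# Placeholder URLs -- substitute the real links from the bundled script.
fetch "$COMFYUI_DIR/models/diffusion_models/wan2.1_i2v_14b_fp8.safetensors" "WAN_21_URL"
fetch "$COMFYUI_DIR/models/loras/lightx2v.safetensors" "LIGHTX2V_LORA_URL"
```

The skip-if-exists guard matters in practice: the full model set is tens of gigabytes, and an interrupted run should resume rather than re-download everything.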
The workflow is packaged as a single JSON file. Drag it into ComfyUI, upload your portrait and audio, set the frame count based on audio duration, and run. Cloud GPU rental via vast.ai makes it accessible without expensive local hardware.
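The frame-count step is simple arithmetic: frames = audio duration x the workflow's frame rate. The sketch below assumes you already know the clip length (e.g. from ffprobe) and uses 25 fps as an assumed rate; check the fps actually configured in your workflow before relying on this number.

```shell
# Frame count = audio duration (seconds) x workflow fps.
# AUDIO_SECONDS is assumed known (e.g. via ffprobe); FPS=25 is an assumption,
# not a value confirmed by this workflow -- match it to your fps node.
AUDIO_SECONDS=12
FPS=25
FRAMES=$(( AUDIO_SECONDS * FPS ))
echo "set frame count to: $FRAMES"
```

Rounding up by a few frames is harmless; falling short truncates the audio, so when in doubt overshoot slightly.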
Key Features
Talking-head avatar video generation from a single portrait image and MP3 audio
Lip-sync powered by InfiniteTalk inside ComfyUI
WAN 2.1 (14B) image-to-video model with optional 720p upscaling
Automated model download script for one-command setup
Drag-and-drop ComfyUI workflow JSON for instant use
Free alternative to HeyGen and Kling; runs on cloud GPUs via vast.ai