So, lately I’ve been doing a lot of AI video work. Tweaking tools, making tools, experimenting with tools. I don’t have an impressive rig, so the first thing I usually have to do is optimize things. I have 16 GB of VRAM, so I’m limited in what I can do. I’m on the higher consumer end of things, but at the bottom of the higher end.
My preferred model right now is Wan Vace Multitalk FusionX 14B. It pretty much does everything: single or multi-speaker lip sync, image to video, text to video, video to video. An all-in-one type of solution that has great output with lower resource requirements than most other solutions.
The below video is a single-speaker test. I generated the image, recorded myself doing a Spongebob impression, used the recording of myself to create a cloned voice, then used the cloned voice with a text-to-speech tool called OpenAudio to generate the vocal output. Then I ran the image with the audio file through the model mentioned above to generate the video, with a text prompt to help guide it.
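The steps above can be sketched as a simple chain. To be clear, everything in this sketch is a hypothetical stand-in just to show the data flow; none of these function names or arguments are the actual APIs of OpenAudio or the Wan VACE model.

```python
# Hypothetical sketch of the single-speaker pipeline described above.
# None of these functions are real APIs -- they stand in for the actual
# tools (OpenAudio for voice cloning/TTS, Wan VACE Multitalk for video).

def clone_voice(reference_recording: str) -> str:
    """Stand-in for building a cloned voice from a recording of yourself."""
    return f"voice_model_from({reference_recording})"

def text_to_speech(voice_model: str, script: str) -> str:
    """Stand-in for generating the vocal track with the cloned voice."""
    return f"audio({voice_model}, {script!r})"

def generate_video(image: str, audio: str, prompt: str) -> str:
    """Stand-in for the image + audio + text-prompt video generation step."""
    return f"video({image}, {audio}, prompt={prompt!r})"

# The flow: recording -> cloned voice -> TTS audio -> lip-synced video.
voice = clone_voice("my_spongebob_impression.wav")
audio = text_to_speech(voice, "I'm ready!")
video = generate_video("generated_image.png", audio, "character speaking to camera")
```

The point is just the hand-offs: the cloned voice feeds the TTS step, and the TTS audio plus the generated image feed the video model.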
My wife and I recorded character voices for the next video. This time I used Dia to create the vocals so that I could generate the conversation with more emotion than OpenAudio allows. Same video model used as above.
If nothing else, this should give you an idea of how far AI tools have come. I use various tools for text to speech, sound effects, background music, songs, etc.
MMAudio - Run a video through MMAudio with a simple prompt to automatically generate sound effects that match what’s going on in the video. I do recommend a vocal remover, as this occasionally produces garbled vocals if the video has mouth movement.
https://github.com/hkchengrex/MMAudio
AudioCraft - Generate sound effects and even music (with the MusicGen side of AudioCraft).
https://github.com/GrandaddyShmax/audiocraft_plus
Dia - Conversational AI TTS system that generates a conversation between multiple speakers, with emotion.
https://github.com/nari-labs/dia
OpenAudio - Simple voice cloning TTS system.
https://github.com/fishaudio/fish-speech
Kokoro TTS - TTS system with prebuilt voices and the ability to blend them. I made a ComfyUI node for this.
https://github.com/GeekyGhost/ComfyUI-Geeky-Kokoro-TTS
https://github.com/GeekyGhost/Geeky-Ghost-Kokoro-TTS-Webui
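The voice blending mentioned above essentially comes down to taking a weighted average of two voice style vectors. A minimal sketch of the idea, assuming voices are plain lists of floats (in the actual Kokoro implementation they are tensors loaded from voice files, so this is an illustration of the math, not the real API):

```python
def blend_voices(voice_a, voice_b, weight=0.5):
    """Blend two voice style vectors with a linear weighted average.

    Toy illustration: real Kokoro voices are tensors, but the blend is
    the same idea. weight=1.0 gives pure voice_a, weight=0.0 gives
    pure voice_b, and anything in between mixes the two styles.
    """
    if len(voice_a) != len(voice_b):
        raise ValueError("voice vectors must be the same length")
    return [weight * a + (1.0 - weight) * b for a, b in zip(voice_a, voice_b)]

# Tiny toy vectors standing in for real voice embeddings.
af_voice = [0.2, 0.8, -0.1]
am_voice = [0.6, 0.0, 0.3]
blended = blend_voices(af_voice, am_voice, weight=0.7)  # 70% / 30% mix
print(blended)
```

Sliding `weight` between 0 and 1 is what lets you dial in a voice partway between two of the prebuilt ones.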
WanGP - A video-centric UI for AI video models.
https://github.com/deepbeepmeep/Wan2GP
ComfyUI - The most versatile AI UI around; steep learning curve for advanced features, but approachable for newer users at the same time. A jack-of-all-trades AI UI.
https://github.com/comfyanonymous/ComfyUI
Geeky Ghost Tools - I also used a lot of my own tools.
https://github.com/GeekyGhost
Thanks for reading! As a bonus, here are some additional test videos below, made using various models and techniques.
Hope you enjoyed this and that it sparked your curiosity and creativity in some way.