Music

Caffeine and daydreams - Reprise

It truly "takes a village": many helpful members of the Pixorama Discord community helped me create this video locally on my computer.

Sebastian Antony

13 Feb 2026 • 4 min read

This time I re-did my favorite Suno song, Caffeine and Daydreams, but with the animations created with LTX2 locally on my computer.

In a recent post I wrote that I would not attempt to do local video generation as it simply took too long to be viable.

Discovering GGUF models

Discord member syndRome69 (Bill) was the catalyst who told me that LTX2 with GGUF models can run very fast.

He was kind enough to share a couple of workflows with me and he even helped me download the necessary GGUF models, as well as make a change to my YAML file so that the extra model path could be found.

I was getting pretty decent run times, but the text-to-video vocals were not too great. He also shared with me an image-to-video workflow.

I discussed the poor audio tonality with Discord member Hammel (Chris) who has experimented a lot with music videos, and had originally shared with me the LTX model that gave great quality (but was extremely slow on my PC).

He suggested it would be better to use mp3 audio, for example, from Suno, rather than use the LTX-generated audio.

The workflow

Discord member Asd (Jeet) told me he had a workflow that works with GGUF models and accepts audio input, and he helped me set it up. This became the basis for my local LTX-2 music video generations. The only change I made was to use GGUF models.

Hammel assisted with optimizing the prompts. He had done several experiments using a Gemini gem. For example:

Building upon the reference frame, the cinematic scene begins with the girl confidently placing a wide-brimmed cowboy hat onto her head, adjusting the brim with a smile. She then leans forward gracefully to retrieve an acoustic guitar from a stand just out of frame. As she straightens up with the instrument, she immediately falls into the rhythm of the music. The scene is accompanied by a pristine, studio-quality recording of an upbeat country song with clear vocals and acoustic strumming. The audio is high-fidelity with no digital artifacts. She begins to play the guitar with enthusiasm, her fingers strumming in perfect time with the track. She sings along with a warm, joyful expression, her lips perfectly lip-syncing to the lyrics. She sways to the beat, fully embodying the happy country vibe.

This prompt gave very good quality output, but I was getting a lot of error messages in my console.

Discord Knight Ivo, who developed and maintains the ComfyUI Easy Installer, helped me debug the error messages. I ran some batch files that updated the necessary components and removed some unused items.

The workflow now runs smoothly on my RTX 4090 with 24 GB VRAM. These are the typical run times:

Prompt executed in 318.33 seconds - Opening 30 sec Sequence

Prompt executed in 114.64 seconds - 11 sec Closing clip
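To put those two console lines on the same footing, here is a small Python sketch that works out how many seconds of rendering each second of finished video cost. The timings are the ones reported above; the function and variable names are just my own labels for illustration.

```python
# Timings copied from the two console lines above:
# (label, clip length in seconds, render time in seconds)
clips = [
    ("Opening sequence", 30.0, 318.33),
    ("Closing clip", 11.0, 114.64),
]


def render_ratio(clip_seconds: float, render_seconds: float) -> float:
    """Seconds of rendering needed per second of finished video."""
    return render_seconds / clip_seconds


ratios = {label: render_ratio(length, render) for label, length, render in clips}

for label, ratio in ratios.items():
    print(f"{label}: {ratio:.1f} s of rendering per second of video")
```

Both clips come out at roughly 10 to 11 seconds of rendering per second of video, which suggests (from these two samples only) that render time grows about in step with clip length.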

30-second segments

I realized that longer video generations were making my computer extremely sluggish and unresponsive. I decided to break my music into 30-second clips. I used Gemini to teach me how to break up the music mp3 file into 30-second clips via Audacity.
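I did the actual cutting by hand in Audacity, but the split points themselves are simple to work out. The Python sketch below lists the start and end time of each 30-second segment. The track length of 191 seconds is a made-up example value, not the real length of my song, and the function names are only for illustration.

```python
def segment_bounds(total_seconds: float, segment_seconds: float = 30.0):
    """Return (start, end) pairs covering the track; the last one may be shorter."""
    if total_seconds <= 0 or segment_seconds <= 0:
        raise ValueError("durations must be positive")
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + segment_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds


def as_timestamp(seconds: float) -> str:
    """Format a time in seconds as m:ss for typing into an audio editor."""
    minutes, secs = divmod(int(round(seconds)), 60)
    return f"{minutes}:{secs:02d}"


TRACK_SECONDS = 191.0  # hypothetical track length, for illustration only

for number, (start, end) in enumerate(segment_bounds(TRACK_SECONDS), start=1):
    print(f"Segment {number}: {as_timestamp(start)} to {as_timestamp(end)}")
```

With this example length, the last segment is a short leftover clip rather than a full 30 seconds, much like my closing clip.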

Multiple character angles

I realized that breaking the video into smaller segments gives me the freedom to vary the images rather than keeping the singer in a static pose. I used OpenArt / Nano Banana Pro to generate some alternate views of the singer, whom I created as a character named Wynonna. This makes it easy to re-use her for other images.

Upscale to 4K

The video was created at 720p resolution. After combining the clips in CapCut and adding my watermark, I saved the result at 1080p. I then used Topaz Video to upscale it to 4K, which took about 25-30 minutes! I believe this gives better quality than simply saving as 4K directly out of CapCut. The file became a whopping 1.27 GB in size!

This video is in 4K resolution, so watch on a large monitor or TV if possible.

Conclusion

The GGUF version of LTX2 makes it feasible to create a three-minute 720p music video in about 30 minutes of render time on an RTX 4090 with 24 GB VRAM. Formerly this quality was only possible by rendering via cloud apps, which are very costly in terms of credits (i.e. $ out of your pocket).
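As a rough sanity check on that 30-minute figure, the Python sketch below multiplies out the numbers. It assumes that every 30-second clip renders in about the same time as my opening sequence (318.33 seconds) and that no clip needs to be re-rendered, which is a simplification.

```python
import math

RENDER_SECONDS_PER_CLIP = 318.33  # measured for the 30-second opening sequence
CLIP_SECONDS = 30                 # length of each generated segment
VIDEO_SECONDS = 180               # a three-minute video

# Assumption: every clip takes about as long as the opening one, with no retries.
clips_needed = math.ceil(VIDEO_SECONDS / CLIP_SECONDS)
total_render_minutes = clips_needed * RENDER_SECONDS_PER_CLIP / 60

print(f"{clips_needed} clips, about {total_render_minutes:.1f} minutes of rendering")
```

That works out to six clips and a little over half an hour of pure rendering, not counting time spent on prompts, editing, or upscaling.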

Generating locally also has an advantage in that you don't encounter arbitrary censorship rules.