Visual FaQtory
ComfyUI-first local visuals yard for story-driven text2img → img2vid generation, live visuals playout, and QR crowd control.

About This Project
Visual FaQtory has moved well past vague “AI video” territory. The v0.6.5-beta repo contains a working Python pipeline that reads long-form story text, splits it into overlapping windows, generates chained visuals cycle by cycle, and finishes them with a stitch / interpolate / upscale pass. The centre of gravity is still the local ComfyUI path, because it gives the best-tested free-unlimited results: JuggernautXL SDXL for key imagery, SVD XT for motion. Around that core, the repo also ships LTX-Video and Veo backend lanes, OBS/SRT live-visual helpers, and a QR-driven crowd-control system for show use.
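The overlapping-window split can be sketched roughly like this (a minimal illustration of the idea; `window` and `overlap` are hypothetical parameters, not the engine's actual API):

```python
def sliding_windows(paragraphs, window=3, overlap=1):
    """Yield overlapping windows of paragraphs for cycle-by-cycle generation."""
    step = window - overlap
    for start in range(0, max(len(paragraphs) - overlap, 1), step):
        yield paragraphs[start:start + window]
```

With `window=3` and `overlap=1`, consecutive windows share one paragraph, which is what lets each generation cycle inherit context from the previous one.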
Current Highlights
- ComfyUI-first local generation flow with text-to-image first and image-to-video second
- Best-tested free-unlimited stack built around JuggernautXL SDXL + SVD XT
- Paragraph-driven sliding-story engine with reinject chaining between cycles
- Finalizer pipeline for stitching, interpolation, upscaling, and optional audio muxing
- QR crowd control with queue, overlay, rate limiting, and fail-open behaviour
- Extra backend lanes for LTX-Video, Veo, Diffusers-style experiments, and live show routing
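The reinject chaining between cycles might look something like the following sketch, with `text2img` and `img2vid` standing in for the real backend calls (all names here are illustrative, not the repo's actual functions):

```python
from typing import Callable, List, Optional

def run_cycles(
    prompts: List[str],
    text2img: Callable[[str, Optional[str]], str],  # (prompt, init image) -> keyframe path
    img2vid: Callable[[str], List[str]],            # keyframe path -> ordered frame paths
) -> List[List[str]]:
    """Generate clips cycle by cycle, reinjecting each clip's last frame
    as the init image for the next cycle's keyframe."""
    clips: List[List[str]] = []
    carry: Optional[str] = None
    for prompt in prompts:
        keyframe = text2img(prompt, carry)  # carry is None on the first cycle
        frames = img2vid(keyframe)
        clips.append(frames)
        carry = frames[-1]  # the reinject link between cycles
    return clips
```

The key property is that each cycle's visual state carries forward, so the story evolves continuously instead of restarting from noise every clip.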
Actual State
The inspected repo is Visual FaQtory v0.6.5-beta. It already ships the main CLI, the sliding story engine, backend abstractions, prompt synthesis, run-state tracking, quality inspection helpers, and the full post-processing chain. That makes the honest story pretty simple: this is a working visuals pipeline, not a concept slide.
ComfyUI First, On Purpose
- Local ComfyUI remains the main production lane for this project
- JuggernautXL SDXL handles the strongest keyframe generation right now
- SVD XT is the best-tested free-unlimited motion layer in the current setup
- LTX-Video and Veo are present as serious expansion lanes, not the current centre of gravity
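A backend abstraction that keeps these lanes swappable typically reduces to a small shared interface; one hypothetical shape (class and method names are assumptions, not the repo's actual code):

```python
from abc import ABC, abstractmethod

class GenerationBackend(ABC):
    """Minimal interface the pipeline can target so lanes stay swappable."""

    @abstractmethod
    def generate_clip(self, prompt: str) -> str:
        """Produce a clip for the prompt and return its output path."""

class ComfyUIBackend(GenerationBackend):
    """Local lane: SDXL keyframe + SVD XT motion via a ComfyUI workflow."""

    def generate_clip(self, prompt: str) -> str:
        # A real implementation would POST a workflow graph to the local
        # ComfyUI server and poll for the rendered output file.
        return f"output/comfyui/{prompt[:16]}.mp4"
```

An LTX-Video or Veo lane would then be another subclass, and the sliding-story engine never needs to know which lane is active.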
Live Visuals Setup
The repo also covers the live-show angle. There is a crowd-control server with QR generation, overlay endpoints, queue APIs, and prompt filtering, plus OBS/SRT watcher scripts for split-box deployments where the GPU machine and the streaming machine are separate. On stage, the system can evolve visuals from a running text story while still letting the crowd steer the next turns without derailing the operator flow.
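The rate-limiting-with-fail-open behaviour could be approximated with a sliding-window limiter that defaults to allowing requests if anything inside it breaks (a sketch under assumed semantics, not the repo's implementation):

```python
import time
from collections import deque

class FailOpenRateLimiter:
    """Allow at most `limit` crowd prompts per `window` seconds per client;
    on internal errors, fail open so the show keeps running."""

    def __init__(self, limit: int = 5, window: float = 60.0):
        self.limit = limit
        self.window = window
        self.hits: dict[str, deque] = {}

    def allow(self, client_id: str, now: float = None) -> bool:
        try:
            now = time.monotonic() if now is None else now
            q = self.hits.setdefault(client_id, deque())
            # Drop hits that have aged out of the sliding window.
            while q and now - q[0] > self.window:
                q.popleft()
            if len(q) >= self.limit:
                return False
            q.append(now)
            return True
        except Exception:
            return True  # fail open: a limiter bug must never block the crowd
```

Failing open is a deliberate show-time trade-off: a broken limiter should degrade into "everyone gets through" rather than freezing crowd interaction mid-set.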