If you have spent any time generating images with Stable Diffusion, you have likely encountered the limitations of basic web interfaces. They get the job done for simple prompts, but the moment you want to chain multiple models together, add conditional logic, or build a repeatable production pipeline, you hit a wall. ComfyUI tears that wall down. It is a node-based visual editor that gives you granular, programmatic control over every step of the diffusion process while remaining approachable enough that you do not need to write a single line of code. In this guide, we will take you from zero to building advanced workflows that rival anything the proprietary platforms can produce.
What Is ComfyUI?
ComfyUI is an open-source, node-based graphical interface for Stable Diffusion and related generative AI models. Unlike traditional text-box-and-button interfaces, ComfyUI represents each step of the image generation process as a visual node on a canvas. You connect nodes together to form workflows, essentially programming the diffusion pipeline without writing code.
Created by developer comfyanonymous and released on GitHub, ComfyUI has rapidly grown into one of the most popular ways to interact with Stable Diffusion. Its architecture mirrors how professional creative tools like Houdini, Nuke, and Unreal Engine's Blueprint system work, where complex operations are broken into discrete, reusable blocks that you wire together visually. Every node represents a single operation: loading a model, encoding a prompt, sampling noise, decoding latent images, and everything in between. The result is complete transparency into what is happening at every stage of generation.
ComfyUI supports Stable Diffusion 1.5, SDXL, Stable Diffusion 3, Flux, and a growing list of other architectures. It runs locally on your hardware, meaning your images stay private and you are not paying per-generation fees. The community has built hundreds of custom nodes that extend its functionality into areas like video generation, 3D model creation, and batch processing automation.
Why ComfyUI Over Other Interfaces?
The Stable Diffusion ecosystem offers several interface options, with AUTOMATIC1111's Web UI and Forge being the most well-known alternatives. Each has its strengths, but ComfyUI occupies a unique position that makes it the tool of choice for serious practitioners.
AUTOMATIC1111 (A1111) is the veteran of the space. It provides a clean, tab-based interface that is easy to understand and quick to get started with. However, its architecture is fundamentally linear. You fill in settings, click generate, and wait. If you want to chain an img2img pass after your initial generation, you have to manually copy the image over, switch tabs, and adjust settings again. Workflows are not reproducible in any structured way, and extending functionality requires installing extensions that can conflict with each other.
Forge improved on A1111 by optimizing memory management and adding better support for newer models like SDXL. It is faster on lower-end hardware and supports some features that A1111 does not. But it still inherits the same linear, form-based interaction model.
ComfyUI takes a fundamentally different approach. Here is what sets it apart:
- Full pipeline visibility — You can see every operation, every connection, and every parameter at a glance. There is no hidden magic happening behind the scenes.
- Reproducibility — Workflows can be saved as JSON files and shared. Someone else can load your exact workflow and, given the same seed, settings, and models, reproduce your results.
- Efficiency — ComfyUI only re-executes nodes whose inputs have changed. If you tweak a prompt but keep the same model, it does not reload the model.
- VRAM optimization — Its execution engine is remarkably memory-efficient, often using less VRAM than alternatives for identical operations.
- Extensibility — Custom nodes integrate seamlessly into the graph system. There is no separate extension architecture that can break between updates.
- Multi-model pipelines — You can chain different models, apply multiple ControlNets, and route outputs through complex logic without any workarounds.
Installation & Setup
Getting ComfyUI running on your machine is straightforward. You will need a computer with a dedicated GPU (NVIDIA is best supported, though AMD and Apple Silicon work), Python 3.10 or newer, and at least 8 GB of VRAM for comfortable SDXL generation.
System Requirements
- GPU: NVIDIA GTX 1070 or better (RTX 3060 12GB+ recommended)
- RAM: 16 GB system RAM minimum, 32 GB recommended
- Storage: 20 GB free for ComfyUI plus models (SD checkpoints typically run 2-7 GB each; Flux variants can be considerably larger)
- OS: Windows 10/11, Linux, or macOS (Apple Silicon supported)
- Python: 3.10, 3.11, or 3.12
Installation Steps
The fastest method is to clone the repository directly from GitHub and install the Python dependencies:
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
python main.py

For Windows users, the community provides a portable package that bundles everything including Python and Git. Download the latest release from the GitHub releases page, extract it, and run the included batch file. No manual Python installation required.
Once launched, ComfyUI opens in your browser at http://127.0.0.1:8188. You should see a blank canvas or the default workflow. Before generating anything, you need at least one checkpoint model. Download a Stable Diffusion model (SDXL base is a solid starting point) and place it in the models/checkpoints/ folder inside your ComfyUI directory.
Understanding the Node System
Everything in ComfyUI revolves around nodes and connections. If you have ever used Blender's shader editor, Unreal Engine Blueprints, or even visual programming tools like Scratch, the concept will feel familiar. If not, it is simple to grasp.
A node is a rectangular block on the canvas that performs a single operation. It has inputs (on the left side) and outputs (on the right side). You drag connections from an output of one node to the input of another to create a flow of data. The data types are color-coded: purple for model data, yellow for CLIP embeddings, pink for latent images, blue for VAE, and so on.
The execution order is determined automatically by the connections. ComfyUI reads the graph, figures out which nodes depend on which, and runs them in the correct sequence. You do not have to worry about ordering; just connect the nodes logically and press "Queue Prompt" to run the workflow.
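Resolving an execution order from a dependency graph like this is a topological sort. The following is a minimal sketch of the idea (not ComfyUI's actual implementation), using the basic text-to-image graph built later in this guide as the example:

```python
from collections import deque

def execution_order(deps):
    """Kahn's algorithm: deps maps each node to the nodes it depends on.
    Returns a valid execution order, dependencies first."""
    remaining = {node: set(d) for node, d in deps.items()}
    ready = deque(n for n, d in remaining.items() if not d)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for other, d in remaining.items():
            if node in d:
                d.remove(node)
                if not d:
                    ready.append(other)
    if len(order) != len(deps):
        raise ValueError("cycle detected in workflow graph")
    return order

# The basic text-to-image graph from this guide:
graph = {
    "Load Checkpoint": [],
    "CLIP Text Encode (pos)": ["Load Checkpoint"],
    "CLIP Text Encode (neg)": ["Load Checkpoint"],
    "Empty Latent Image": [],
    "KSampler": ["Load Checkpoint", "CLIP Text Encode (pos)",
                 "CLIP Text Encode (neg)", "Empty Latent Image"],
    "VAE Decode": ["KSampler", "Load Checkpoint"],
    "Save Image": ["VAE Decode"],
}
order = execution_order(graph)
```

Whatever order comes out, every node runs only after everything it depends on, which is why you never have to think about sequencing when wiring a graph.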
Key Concepts
- Checkpoint: The base model file that contains the learned weights. This is the foundation of any generation.
- Latent space: A compressed mathematical representation of the image. Diffusion happens in latent space, not pixel space, which is what makes it fast.
- Conditioning: The text prompt after it has been encoded by CLIP into a numerical format the model understands.
- Sampling: The iterative denoising process that transforms random noise into a coherent image guided by the conditioning.
- VAE (Variational Autoencoder): The component that translates between latent space and pixel space. It encodes images into latents and decodes latents back into viewable images.
Your First Workflow: Text-to-Image
Let us build the most fundamental workflow from scratch. This will generate an image from a text prompt, and understanding each connection here will serve as the foundation for everything more complex.
Start with a blank canvas (right-click and choose "Clear" if the default workflow is loaded). Then add these nodes by double-clicking the canvas or right-clicking and selecting "Add Node":
- Load Checkpoint — Select your downloaded model file from the dropdown.
- CLIP Text Encode (Prompt) — You will need two of these: one for the positive prompt and one for the negative prompt.
- Empty Latent Image — Set your desired resolution (1024x1024 for SDXL).
- KSampler — The heart of the generation process.
- VAE Decode — Converts the latent output to a visible image.
- Save Image — Writes the result to disk.
Now connect them: Load Checkpoint's MODEL output goes to KSampler's model input. Its CLIP output goes to both CLIP Text Encode nodes. Connect the positive encoder's CONDITIONING output to KSampler's positive input, and the negative to the negative input. Empty Latent Image's LATENT output connects to KSampler's latent_image input. KSampler's LATENT output goes to VAE Decode's samples input. Load Checkpoint's VAE output goes to VAE Decode's vae input. Finally, VAE Decode's IMAGE output connects to Save Image.
Type a prompt in the positive CLIP encoder (something like "a majestic mountain landscape at golden hour, photorealistic, 8k"), add common negative terms in the negative encoder ("blurry, low quality, deformed"), set the KSampler to 20 steps with the Euler sampler and a CFG of 7.0, then hit "Queue Prompt." Your first ComfyUI image should appear within seconds.
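The same workflow can also be driven programmatically: ComfyUI exposes its queue over a small HTTP API on port 8188, and workflows exported in API format are plain JSON graphs where each connection is a ["node_id", output_index] pair. Below is a hedged sketch of the graph above in that format; the node IDs are arbitrary, and "sd_xl_base_1.0.safetensors" is a placeholder for whichever checkpoint you actually installed:

```python
import json
import urllib.request

# API-format graph mirroring the workflow built above. Outputs of
# CheckpointLoaderSimple are indexed MODEL=0, CLIP=1, VAE=2.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1],
                     "text": "a majestic mountain landscape at golden hour"}},
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1], "text": "blurry, low quality"}},
    "4": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0],
                     "negative": ["3", 0], "latent_image": ["4", 0],
                     "seed": 42, "steps": 20, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal",
                     "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "ComfyUI"}},
}

def queue_prompt(workflow, host="127.0.0.1", port=8188):
    """POST the graph to a running ComfyUI instance's /prompt endpoint."""
    data = json.dumps({"prompt": workflow}).encode("utf-8")
    req = urllib.request.Request(f"http://{host}:{port}/prompt", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling queue_prompt(workflow) against a running instance enqueues the generation exactly as if you had pressed "Queue Prompt" in the browser.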
Essential Nodes You Should Know
As you move beyond basic text-to-image, these core nodes will appear in nearly every workflow you build. Understanding what they do and how to configure them is essential.
KSampler
The KSampler is where the actual diffusion magic happens. Its key parameters include steps (how many denoising iterations to run, typically 20-30), cfg (classifier-free guidance scale, controlling how closely the output matches your prompt), sampler_name (the algorithm used for each step), and scheduler (how the noise schedule is distributed across steps). For most uses, Euler or DPM++ 2M SDE with the Karras scheduler produces excellent results. Higher CFG values create images that follow the prompt more literally but can introduce artifacts above 10-12.
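The cfg parameter implements classifier-free guidance: at each step the sampler makes two noise predictions, one conditioned on your prompt and one on the negative, then extrapolates between them. A schematic sketch of that combination step (real samplers do this on tensors, not lists):

```python
def cfg_combine(uncond, cond, scale):
    # Classifier-free guidance: start from the unconditional prediction
    # and push `scale` times further in the direction the prompt suggests.
    # scale = 1.0 reduces to the conditional prediction alone; large
    # scales over-amplify the difference, which is where artifacts come from.
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

blended = cfg_combine([0.0, 1.0], [1.0, 2.0], 7.0)
```

This is why CFG 7 follows the prompt more literally than CFG 2, and why pushing past 10-12 starts to burn the image: the extrapolation overshoots.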
VAE Encode / Decode
VAE Decode converts latent tensors into RGB images you can see. VAE Encode does the reverse, converting a pixel image into a latent representation, which is essential for img2img workflows where you start from an existing image. Some checkpoint models include a built-in VAE, but you can also load a separate, higher-quality VAE using the "Load VAE" node for crisper colors and sharper details.
CLIP Text Encode
This node takes your text prompt and converts it into conditioning vectors. SDXL models use dual CLIP encoders (CLIP-L and CLIP-G), and the Load Checkpoint node handles both automatically. Prompts can include emphasis syntax like (important word:1.3) to weight certain tokens more heavily.
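Conceptually, the emphasis syntax assigns a weight multiplier to a span of tokens before encoding. A simplified sketch of the parsing (ComfyUI's real parser also handles nesting, escapes, and bare parentheses defaulting to a 1.1 boost):

```python
import re

# Matches explicit (phrase:weight) spans, e.g. "(majestic:1.3)"
WEIGHTED = re.compile(r"\(([^():]+):([0-9.]+)\)")

def extract_weights(prompt):
    """Return (cleaned_prompt, [(phrase, weight), ...]) for explicit
    emphasis spans; everything else implicitly weighs 1.0."""
    weights = [(m.group(1), float(m.group(2)))
               for m in WEIGHTED.finditer(prompt)]
    cleaned = WEIGHTED.sub(lambda m: m.group(1), prompt)
    return cleaned, weights

cleaned, weights = extract_weights(
    "a (majestic:1.3) mountain at (golden hour:1.1)")
```

The weight is later applied by scaling the corresponding token embeddings, which is why values far from 1.0 can distort the whole image rather than just the weighted phrase.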
LoRA Loader
LoRA (Low-Rank Adaptation) files are small model add-ons trained to inject specific styles, characters, or concepts. The LoRA Loader node sits between your checkpoint and the rest of the pipeline. It takes the MODEL and CLIP from Load Checkpoint, applies the LoRA modification, and outputs the modified MODEL and CLIP. You can chain multiple LoRA Loaders in sequence to combine effects, adjusting the strength of each independently.
Intermediate Techniques
Image-to-Image (img2img)
Image-to-image generation starts from an existing picture rather than pure noise. To set this up, replace the Empty Latent Image node with a "Load Image" node followed by a "VAE Encode" node. The encoded latent then connects to the KSampler. The key parameter is denoise on the KSampler: at 0.7, roughly 70% of the noise schedule is re-run, so much of the image is regenerated while the broad structure of the source survives. Lower values preserve more of the source; higher values allow more creative freedom.
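Under the hood, denoise works by building a longer noise schedule and running only its tail. A sketch of the arithmetic, assuming behavior along the lines of ComfyUI's KSampler (the real implementation operates on sigma schedules):

```python
def partial_schedule(steps: int, denoise: float):
    """Return (total_schedule_steps, steps_actually_run) for an
    img2img pass; sampling starts partway down the stretched schedule."""
    if denoise >= 1.0:
        return steps, steps
    total = int(steps / denoise)   # stretch the schedule...
    return total, steps            # ...but only run the last `steps` of it

# 20 steps at denoise 0.5: a 40-step schedule, of which the final 20 run,
# so sampling begins from a half-noised version of the source image.
total, run = partial_schedule(20, 0.5)
```

This is why low denoise values feel "faithful": the sampler never visits the high-noise region where composition is decided.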
Inpainting
Inpainting lets you regenerate specific regions of an image while keeping the rest intact. You need a dedicated inpainting model or a regular model with the correct setup. Load your image, create a mask (using the built-in mask editor or a "Load Image (as Mask)" node), and connect both to a "Set Latent Noise Mask" node before feeding it into the KSampler. The masked region will be regenerated while unmasked areas remain untouched. This is incredibly powerful for fixing hands, swapping backgrounds, or adding elements to existing compositions.
ControlNet
ControlNet is where ComfyUI truly shines. ControlNet models accept a conditioning image (an edge map, depth map, pose skeleton, or similar structural guide) and use it to control the spatial composition of the output. In ComfyUI, add a "Load ControlNet Model" node and an "Apply ControlNet" node. The Apply node takes the conditioning from your CLIP encoder and the ControlNet's conditioning image, merging them so the KSampler respects both the text prompt and the structural guide. You can preprocess source images using nodes from the ControlNet Preprocessors pack, which includes Canny edge detection, depth estimation, OpenPose, and many more.
Advanced Workflows
IP-Adapter for Style and Subject Transfer
IP-Adapter allows you to condition your generation on a reference image rather than text alone. This is extraordinary for style transfer ("generate in the style of this painting"), face consistency across multiple images, and subject-driven generation. The setup requires the IPAdapter Plus custom node pack, an IP-Adapter model file, and a CLIP Vision model. Once installed, you add an "IPAdapter Apply" node that takes your model, a reference image, and optionally an attention mask. It modifies the model's behavior so that outputs share visual characteristics with the reference. Combining IP-Adapter with text prompts and ControlNet gives you an incredibly precise level of control.
Multi-ControlNet Pipelines
One of ComfyUI's greatest advantages is the ability to stack multiple ControlNets simultaneously. For example, you might use a depth ControlNet to maintain the 3D structure of a scene while simultaneously applying a Canny edge ControlNet to preserve fine details and an OpenPose ControlNet to lock in character poses. Each "Apply ControlNet" node chains to the previous one's conditioning output, and you can adjust the strength of each independently. This multi-ControlNet approach is how professionals achieve the kind of precise, production-quality results that seem impossible with simpler interfaces.
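In API-format JSON, the chaining is explicit: each Apply node takes the previous one's conditioning output. A hedged fragment illustrating the wiring (node IDs are hypothetical, the model filenames are placeholders for files you would download yourself, and nodes "2", "20", and "21" stand in for a text encoder and two preprocessed guide images elsewhere in the graph):

```python
multi_controlnet = {
    "10": {"class_type": "ControlNetLoader",
           "inputs": {"control_net_name": "depth.safetensors"}},
    "11": {"class_type": "ControlNetLoader",
           "inputs": {"control_net_name": "canny.safetensors"}},
    "12": {"class_type": "ControlNetApply",
           "inputs": {"conditioning": ["2", 0],   # from CLIP Text Encode
                      "control_net": ["10", 0],
                      "image": ["20", 0],         # preprocessed depth map
                      "strength": 0.8}},
    "13": {"class_type": "ControlNetApply",
           "inputs": {"conditioning": ["12", 0],  # chained from node 12
                      "control_net": ["11", 0],
                      "image": ["21", 0],         # preprocessed edge map
                      "strength": 0.5}},
}
```

Node 13's conditioning output then feeds the KSampler's positive input, so the sampler sees the text prompt plus both structural guides, each at its own strength.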
Upscaling Pipelines
For high-resolution output, ComfyUI supports multi-stage upscaling workflows. A common approach is to generate at your model's native resolution, then use an upscale model (like 4x-UltraSharp or RealESRGAN) via the "Upscale Image (using Model)" node, followed by a second KSampler pass at a low denoise value (0.3-0.5) to add fine detail at the higher resolution. This two-stage process produces images that are dramatically sharper and more detailed than a single high-resolution generation. You can even route the pipeline through different models for each stage, using a fast model for the initial composition and a detail-oriented model for the refinement pass.
Working with Flux Models
Flux, developed by Black Forest Labs, is one of the most significant open-source model releases since Stable Diffusion itself. Its architecture differs from SD 1.5 and SDXL, requiring a slightly different workflow configuration in ComfyUI. Fortunately, ComfyUI has supported Flux natively since shortly after its release.
To use Flux in ComfyUI, download the Flux model files (available in various quantized sizes from the community). Place the model in your models/checkpoints/ or models/unet/ folder, depending on the format. Flux uses a T5-XXL text encoder in addition to CLIP, so you will need the "DualCLIPLoader" node instead of relying on the checkpoint's built-in CLIP. Load the T5 and CLIP-L models separately.
The workflow structure is similar to standard Stable Diffusion, but with these key differences: Flux uses a different noise scheduler, so set the KSampler scheduler to "simple" or "normal" with the "euler" sampler for best results. CFG values should be much lower than SD (1.0 for the guidance-distilled version, 3.5 for the full version). Flux generates at 1024x1024 natively and handles various aspect ratios well. The results are remarkable, with Flux excelling at prompt adherence, text rendering within images, and photorealistic output that rivals proprietary solutions.
For users with limited VRAM, Flux can be run using quantized versions (GGUF format via the ComfyUI-GGUF custom node pack) that reduce memory requirements from 24 GB down to as little as 8 GB, with only marginal quality loss. This makes Flux accessible on consumer hardware, which is a significant achievement for the open-source community.
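The savings from quantization are easy to estimate from bits per weight. Flux's transformer has roughly 12 billion parameters; a back-of-the-envelope sketch (weights only, ignoring activations, the text encoders, and the VAE):

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16 = weight_vram_gb(12, 16)   # full half-precision Flux
q4   = weight_vram_gb(12, 4.5)  # a ~4-bit GGUF quant with per-block scales
```

The arithmetic lines up with the figures above: around 24 GB at FP16 versus single-digit gigabytes for aggressive quants, which is what puts Flux within reach of 8 GB cards.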
Performance Tips
Getting the most out of your hardware requires some tuning. Here are the most impactful optimizations you can apply right away.
- Enable FP16/FP8 mode — Running models in half precision (or quarter precision with FP8) dramatically reduces VRAM usage with minimal quality loss. Use the --force-fp16 launch argument or configure it per-node.
- Use the TAESD decoder — While generating, swap VAE Decode for the TAESD (Tiny AutoEncoder for Stable Diffusion) preview node. It provides real-time previews of the generation in progress at a fraction of the VRAM cost.
- Smart queue management — ComfyUI caches loaded models in VRAM. If you are working with a single model, it stays loaded between generations. Avoid unnecessary model switches to prevent constant loading and unloading.
- Resolution matters — Generate at native resolution first, then upscale. Trying to generate a 2048x2048 image in a single pass requires four times the VRAM and often produces worse results than a 1024 base with an upscale pass.
- Attention optimizations — ComfyUI uses PyTorch's scaled dot product attention by default, which is well-optimized on modern GPUs. On older hardware, you can use --use-split-cross-attention to reduce peak VRAM usage at a slight speed cost.
- Batch processing — When generating multiple images, use the batch_size parameter on the Empty Latent Image node rather than queuing multiple single generations. Batching is more efficient because the model stays loaded and shared across all images in the batch.
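The resolution and batching advice both come down to latent tensor size. For SD 1.5 and SDXL, latents have 4 channels at an 8x spatial downscale, so the memory math is simple to sketch:

```python
import math

def latent_shape(batch_size, width, height, channels=4, downscale=8):
    """Shape of the latent tensor an Empty Latent Image node produces
    for SD 1.5/SDXL (4 channels, 8x spatial downscale)."""
    return (batch_size, channels, height // downscale, width // downscale)

def relative_cost(shape_a, shape_b):
    """Ratio of element counts, a rough proxy for activation memory."""
    return math.prod(shape_a) / math.prod(shape_b)

# A single 2048x2048 pass costs 4x the latent memory of a 1024 base:
factor = relative_cost(latent_shape(1, 2048, 2048),
                       latent_shape(1, 1024, 1024))
```

A batch of four 1024 images has the same element count as one 2048 image, but the batch shares one loaded model and one schedule, which is why batching beats both oversized single passes and repeated single-image queues.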
Community Resources
The ComfyUI ecosystem is vibrant and growing rapidly. Here are the best places to find workflows, custom nodes, and help when you get stuck.
- ComfyUI Manager — An essential custom node that adds a package manager directly into the ComfyUI interface. It lets you browse, install, and update custom nodes without touching the command line. Install this first and it will make everything else easier.
- OpenArt and Civitai — Both platforms host large libraries of downloadable ComfyUI workflows. You can find complete pipelines for specific styles, techniques, and model combinations, ready to import with a single drag-and-drop onto your canvas.
- GitHub repositories — The comfyanonymous/ComfyUI repo is the primary source for updates and issue tracking. Many popular custom node packs like ComfyUI-Impact-Pack, ComfyUI-AnimateDiff-Evolved, and ComfyUI-KJNodes have their own repos with detailed documentation.
- Reddit and Discord — The r/comfyui subreddit and various Stable Diffusion Discord servers have active ComfyUI channels where users share workflows, troubleshoot issues, and discuss new techniques. The ComfyUI Matrix/Discord server is particularly active.
- YouTube tutorials — Channels dedicated to ComfyUI tutorials have exploded in popularity. Search for beginner walkthroughs and specific technique breakdowns for visual, step-by-step guidance on building workflows.
ComfyUI represents a paradigm shift in how we interact with generative AI models. Its node-based approach gives you the power and flexibility that creative professionals demand, wrapped in a visual interface that makes complex pipelines manageable. Whether you are generating your first image or building production automation workflows, the investment in learning ComfyUI pays dividends that compound with every new model and technique the community develops. Start with the basic text-to-image workflow, experiment with ControlNet and IP-Adapter, and before long you will be building pipelines that would be impossible in any other tool.