Pre-trained models like Stable Diffusion and Flux are extraordinarily capable, but they know nothing about your specific character designs, your brand aesthetic, or the particular style you have spent years developing. Custom LoRA training changes that. It lets you teach an existing model new concepts, styles, faces, and objects in a matter of hours, using as few as fifteen images and a consumer GPU. The result is a small, portable file that you can share, stack with other LoRAs, and use across any compatible interface. If you have ever wished a model could generate something that looks exactly like your vision rather than an approximation of it, training your own LoRA is the path forward.
What Is a LoRA?
LoRA stands for Low-Rank Adaptation, a technique originally developed for fine-tuning large language models that has been adapted brilliantly for image generation. The core insight behind LoRA is that you do not need to modify every weight in a multi-billion-parameter model to teach it something new. Instead, LoRA injects small, trainable matrices into specific layers of the model, typically the attention layers where the model decides which parts of the image to focus on during generation. These matrices capture the difference between the original model and your desired output.
In practical terms, a full Stable Diffusion XL (SDXL) checkpoint is roughly 6.9 GB. A LoRA file that modifies that same model to produce a specific art style or character is typically between 10 MB and 200 MB, depending on the network rank you choose during training. This is possible because LoRA decomposes the weight updates into two low-rank matrices. If a layer has a weight matrix of dimensions 1024 by 1024, a full fine-tune would require updating all 1,048,576 parameters. A LoRA with a rank of 32 only updates 65,536 parameters for that same layer, less than seven percent of the original, while capturing the vast majority of the desired adaptation.
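The parameter arithmetic above is easy to check directly. A minimal sketch in plain Python, using the same dimensions as the example in the text:

```python
# Parameter counts for a 1024x1024 weight matrix: full fine-tune vs. rank-32 LoRA.
# LoRA replaces the dense update with two low-rank factors, A (rank x d) and B (d x rank).
d, rank = 1024, 32

full_params = d * d                  # every weight updated
lora_params = d * rank + rank * d    # only the two low-rank factors

print(full_params)                 # 1048576
print(lora_params)                 # 65536
print(lora_params / full_params)   # 0.0625, i.e. 6.25 percent of the original
```

The ratio scales with rank: rank 16 would halve the trainable parameters again, while rank 128 would quadruple them, which is exactly why file size tracks the rank you choose.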
When you use a LoRA during inference, the small adapter matrices are merged with the base model's weights on the fly. You can adjust the strength of this merge from 0.0 (no effect) to 1.0 (full effect) and even beyond, giving you precise control over how much influence the LoRA has on the final output. You can also stack multiple LoRAs simultaneously, combining a style LoRA with a character LoRA, for example, though you will need to balance their strengths to avoid conflicts.
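The merge-at-inference idea can be sketched on a single tiny layer. This follows the scaling convention used by common LoRA implementations (the update is B @ A, scaled by strength and alpha/rank); real loaders apply it across every adapted layer in the network, and the matrices here are toy values chosen so the arithmetic is visible:

```python
# Single-layer sketch of the inference-time merge: W' = W + s * (alpha/rank) * (B @ A).
def matmul(X, Y):
    """Plain-Python matrix multiply, enough for this 2x2 illustration."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (identity, for clarity)
B = [[1.0], [0.0]]             # trained up-projection, shape (2, rank=1)
A = [[0.0, 2.0]]               # trained down-projection, shape (rank=1, 2)
alpha, rank = 1.0, 1

def merged(strength):
    delta = matmul(B, A)       # the low-rank update B @ A
    scale = strength * alpha / rank
    return [[W[i][j] + scale * delta[i][j] for j in range(2)] for i in range(2)]

print(merged(0.0))  # [[1.0, 0.0], [0.0, 1.0]] : strength 0.0 leaves the base model
print(merged(1.0))  # [[1.0, 2.0], [0.0, 1.0]] : full-strength LoRA applied
```

Because the merge is linear in `strength`, values between 0.0 and 1.0 interpolate smoothly, and stacking two LoRAs simply sums two such deltas, which is why conflicting LoRAs need their strengths balanced.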
What Can You Train a LoRA For?
The versatility of LoRA training is one of its greatest strengths. Here are the most common use cases for custom LoRA training:
- Art styles — Capture a specific illustration style, painting technique, or visual aesthetic. Whether it is anime cel-shading, oil painting impasto, watercolor washes, or a particular artist's signature look, a style LoRA can teach the model to reproduce it consistently.
- Characters — Train a LoRA on a fictional character so the model can generate them in any pose, outfit, or setting while maintaining consistent facial features and proportions.
- Real faces — Train on photographs of a specific person to generate realistic portraits of them in various contexts. This is commonly used for personalized avatar generation and creative projects.
- Objects and products — Teach the model to generate a specific product, vehicle, piece of furniture, or any physical object that does not exist in the base model's training data.
- Concepts and aesthetics — Train on a broader visual concept like "cyberpunk neon rain" or "vintage Kodachrome photography" to give the model a nuanced understanding of a particular mood or atmosphere.
- Clothing and fashion — Capture specific garments, fabric patterns, or fashion styles that the model can then apply to any character or scene.
Hardware Requirements
LoRA training is significantly less demanding than full model fine-tuning, but it still requires a capable GPU. The minimum practical configuration is an NVIDIA GPU with 12 GB of VRAM, which includes cards like the RTX 3060 12GB, RTX 4070, and RTX 4070 Ti. With 12 GB you can train SDXL LoRAs at reasonable batch sizes and resolutions. Training on an RTX 3060 12GB typically takes two to four hours for a standard dataset, while an RTX 4090 with 24 GB can finish the same job in under an hour.
If you do not have a suitable local GPU, cloud GPU services offer an excellent alternative. RunPod is the most popular choice in the Stable Diffusion community, offering RTX 4090 and A100 instances at competitive hourly rates. Vast.ai provides even cheaper options by aggregating idle GPU capacity from individuals and data centers, though availability can be less predictable. Google Colab's free tier has become restrictive for training workloads, but its paid Pro tier with A100 access works well for shorter training runs.
Beyond the GPU, you will need at least 32 GB of system RAM and enough storage for your dataset, the base model, and the training outputs. A typical training session generates multiple checkpoint files at different epochs, so plan for 2 to 5 GB of output storage. Training time scales roughly linearly with dataset size and epoch count. A 30-image dataset trained for 15 epochs at a resolution of 1024 by 1024 will take approximately 90 minutes on an RTX 4090, three hours on an RTX 3060 12GB, or around 40 minutes on a cloud A100.
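The timing estimates above come down to a simple step count. The sketch below assumes the common defaults of 10 dataset repeats and batch size 2, and the seconds-per-step figure is a hypothetical placeholder you would measure on your own hardware, not a benchmark:

```python
# Total optimizer steps = (images * repeats * epochs) / batch_size.
images, repeats, epochs, batch_size = 30, 10, 15, 2

steps = images * repeats * epochs // batch_size
print(steps)  # 2250

sec_per_step = 2.4  # hypothetical per-step time for 1024x1024 SDXL training
print(steps * sec_per_step / 60)  # about 90 minutes
```

Doubling the dataset or the epoch count doubles the step count, which is the "roughly linear" scaling described above.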
Preparing Your Dataset
Your dataset is the single most important factor in training quality. A well-prepared set of 20 images will outperform a sloppy set of 200 every time. For most LoRA training goals, aim for 15 to 50 high-quality images. Character LoRAs tend toward the lower end of that range (15 to 25 images showing the character from multiple angles), while style LoRAs benefit from more examples (30 to 50 images demonstrating the style across varied subjects).
Every image in your dataset should meet these quality standards:
- Resolution — At least 1024 by 1024 pixels for SDXL training. Smaller images get upscaled, which introduces artifacts that the model will learn. Higher resolution source material is always better.
- Variety — For a character, include different angles (front, three-quarter, side, back), different expressions, different lighting conditions, and different outfits if relevant. For a style, include different subjects rendered in that style.
- Consistency — All images should clearly demonstrate the concept you are training. If you are training a character, every image should unmistakably show that character. Remove any images where the character is partially obscured, poorly drawn, or off-model.
- Clean backgrounds — Simple or varied backgrounds help the model focus on the subject rather than associating specific backgrounds with the concept.
- No watermarks or text — The model will learn to reproduce watermarks and text overlays if they appear in your training data.
Crop your images to focus on the relevant subject matter. For character training, include a mix of close-up portraits, upper-body shots, and full-body images. Use square or near-square aspect ratios when possible, though modern training scripts support bucketing, which handles mixed aspect ratios automatically by grouping similarly shaped images together during training.
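Before moving on to captioning, it is worth running a quick pre-flight check over the dataset. A minimal sketch of the resolution rule, where the function and its name are illustrative and actual image loading is left to Pillow or whichever library you prefer:

```python
# Flag images below the 1024-pixel minimum recommended for SDXL training.
MIN_SIDE = 1024

def needs_upscaling(width, height, min_side=MIN_SIDE):
    """True if the image falls below the training resolution on its short side."""
    return min(width, height) < min_side

print(needs_upscaling(1024, 1024))  # False: meets the SDXL minimum
print(needs_upscaling(1600, 900))   # True: short side is only 900 pixels
```

Running a check like this over every file catches the undersized images that would otherwise be silently upscaled, baking upscaling artifacts into the LoRA.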
Captioning Your Images
Every training image needs a text caption that describes what the model should associate with that image. Captioning quality directly impacts how well your LoRA responds to prompts after training. You have two main approaches: automatic captioning followed by manual refinement, or fully manual captioning.
For automatic captioning, BLIP-2 and WD Tagger (also called WD 1.4 Tagger) are the most commonly used tools. BLIP-2 generates natural language descriptions ("a woman with red hair standing in a garden"), while WD Tagger produces comma-separated booru-style tags ("1girl, red hair, garden, standing, blue dress, sunny"). Most training tools include built-in captioning integration, or you can use standalone tools like the Kohya caption helper.
Regardless of which auto-captioning tool you use, manual refinement is essential. Review every caption and correct errors. The most critical step is establishing a trigger word, a unique token that activates your LoRA during inference. For a character named "Aria," your trigger word might be aria_char. For a style, it might be painterly_style_v1. Use something that does not conflict with existing tokens in the model's vocabulary. Add the trigger word to the beginning of every caption. Then remove any tags that describe the core concept you are training, because you want the model to associate those visual features with your trigger word rather than with generic tags.
For example, if you are training a red-haired character and a caption reads aria_char, 1girl, red hair, blue eyes, standing in a park, sunny day, you might remove "red hair" and "blue eyes" since those are defining features that should be bound to the trigger word. Keep tags that describe variable elements like pose, setting, and lighting, because you want the model to learn that those features can change while the core identity remains constant.
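The caption edit described above is mechanical enough to script. A hedged sketch, in which the trigger word and the identity tag set are placeholders for your own concept:

```python
TRIGGER = "aria_char"
IDENTITY_TAGS = {"red hair", "blue eyes"}  # defining features to bind to the trigger

def refine(caption: str) -> str:
    """Prepend the trigger word and drop tags that describe the core identity."""
    tags = [t.strip() for t in caption.split(",")]
    kept = [t for t in tags if t and t.lower() not in IDENTITY_TAGS and t != TRIGGER]
    return ", ".join([TRIGGER] + kept)

print(refine("1girl, red hair, blue eyes, standing in a park, sunny day"))
# aria_char, 1girl, standing in a park, sunny day
```

A loop over every .txt caption file in the dataset folder applies the same rule everywhere, but the output should still be reviewed by hand; auto-taggers make mistakes that no string filter will catch.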
Choosing a Training Tool
Several tools exist for LoRA training, each with different interfaces and capabilities. The most established and widely recommended option is Kohya_ss GUI, a graphical interface built on top of the Kohya SD-Scripts training library. It provides a comprehensive set of controls for every training parameter, supports both SD 1.5 and SDXL, and has been battle-tested by thousands of users. Installation involves cloning the GitHub repository and running the setup script, which handles all Python dependencies.
LoRA Easy Training Scripts is a more streamlined alternative that simplifies the configuration process. It trades some of Kohya's advanced options for a more approachable interface, making it a good choice for beginners who do not need granular control over every parameter. For cloud-based training, Civitai's on-site trainer lets you train LoRAs directly in the browser without any local setup. It is limited in customization compared to local tools, but it handles the entire pipeline from dataset upload to trained model with minimal friction.
Training Parameters Explained
Understanding the key training parameters is essential for getting good results. Here are the most important settings and what they control:
- Learning rate — Controls how much the model adjusts its weights in response to each training example. Too high and the model overfits or produces artifacts. Too low and training takes forever with minimal learning. For LoRA training, a learning rate between 1e-4 and 5e-4 works well with the AdamW optimizer. If using Prodigy, set it to 1.0, as the optimizer self-adjusts.
- Epochs — One epoch means the model has seen every image in your dataset once. Most LoRA training runs between 10 and 25 epochs. Fewer epochs produce a more subtle, flexible LoRA. More epochs produce a stronger effect but risk overfitting.
- Batch size — How many images the model processes simultaneously before updating weights. A batch size of 1 works but produces noisy gradients. A batch size of 2 to 4 is ideal if your VRAM can handle it. Larger batches generally produce smoother, more stable training.
- Network rank (dim) — The rank of the LoRA matrices, controlling how much information the adapter can store. Common values are 16, 32, 64, and 128. A rank of 32 is a solid default. Higher ranks capture more detail but produce larger files and can overfit more easily. For simple concepts, rank 16 suffices. For complex styles, rank 64 or 128 may be warranted.
- Network alpha — A scaling factor applied to the LoRA weights. Setting alpha equal to half the rank (for example, alpha 16 with rank 32) is the most common convention. This scales the learning rate effectively by alpha/rank, providing more stable training.
- Optimizer — AdamW is the standard choice and works reliably. Prodigy is a newer adaptive optimizer that automatically adjusts the learning rate during training, eliminating the need to manually tune it. Prodigy has become increasingly popular because it removes one of the most finicky parameters from the equation. AdamW8bit reduces memory usage with negligible quality loss.
- Scheduler — Controls how the learning rate changes over the course of training. Cosine with restarts is a popular choice that gradually reduces the learning rate, then resets it periodically. Constant with warmup starts with a low learning rate and ramps up to the target over a specified number of steps, preventing early instability.
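The alpha-to-rank relationship described above reduces to simple arithmetic: the LoRA update is scaled by alpha divided by rank, so setting alpha to half the rank halves the effective learning rate relative to alpha equal to rank. In numbers:

```python
rank, alpha = 32, 16      # the common "alpha equals half the rank" convention
base_lr = 2e-4            # nominal learning rate passed to the optimizer

effective_scale = alpha / rank
print(effective_scale)              # 0.5
print(base_lr * effective_scale)    # 0.0001, i.e. the update behaves like lr 1e-4
```

This is why rank and alpha should be changed together: raising the rank while leaving alpha fixed quietly weakens the update, which can look like the LoRA "not learning" when the real issue is the implicit scale.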
The Training Process
With your dataset prepared, captions written, and parameters configured, here is the step-by-step process for launching a training run using Kohya_ss GUI:
- Step 1: Organize your dataset into a folder structure that Kohya expects. Create a folder named with the format [repeats]_[concept], for example 10_aria_char. The number indicates how many times each image is repeated per epoch. Place your images and their corresponding caption text files (same filename, .txt extension) inside this folder.
- Step 2: Open Kohya_ss GUI in your browser and navigate to the LoRA training tab. Set the model path to your base checkpoint (for example, an SDXL model). Set the output directory where trained LoRA files will be saved.
- Step 3: Configure the training parameters as discussed above. Set the resolution to 1024 for SDXL, choose your optimizer and scheduler, and set the epoch count. Enable "Save every N epochs" to generate intermediate checkpoints you can test.
- Step 4: Under the network settings, choose LoRA as the network type, set the rank and alpha values, and ensure the training target includes both the UNet and the text encoder (training the text encoder helps the model associate your trigger word more strongly with the visual concept).
- Step 5: Click Start Training. Monitor the loss curve in the console output or through the built-in TensorBoard integration. A healthy training run shows the loss decreasing steadily and then leveling off. If the loss drops to near zero, the model is likely overfitting.
# Example folder structure
training_data/
image/
10_aria_char/
image_001.png
image_001.txt
image_002.png
image_002.txt
...
model/
sd_xl_base_1.0.safetensors
output/
aria_char_e10.safetensors
aria_char_e15.safetensors
    aria_char_e20.safetensors
Testing and Iterating
Training a LoRA is rarely a one-shot process. The checkpoints saved at different epochs will have different characteristics, and testing them systematically is essential for finding the best one. Load each checkpoint into your preferred generation interface (ComfyUI, A1111, or Forge) and generate a batch of test images using your trigger word combined with various prompts.
Start with simple prompts that directly invoke the trained concept, like aria_char standing in a meadow, to verify the LoRA has learned the basic identity. Then test with more complex prompts that push the model into unfamiliar territory: different settings, different poses, different lighting conditions. A well-trained LoRA should maintain consistency while adapting to new contexts.
Watch for signs of overfitting, which occurs when the model has memorized the training images too literally. Overfitted LoRAs tend to reproduce training image compositions regardless of the prompt, produce artifacts or distortions when asked for poses or angles not present in the training data, and show a "burned" look with oversaturated colors or unnatural sharpness. If you see these symptoms, try the checkpoint from an earlier epoch. If all checkpoints show overfitting, reduce the epoch count, lower the learning rate, or increase your dataset size.
Also experiment with the LoRA weight (strength) during inference. Some LoRAs work best at 0.7 or 0.8 rather than full strength, giving a cleaner result with fewer artifacts while still capturing the trained concept. Finding the right weight is part of the testing process.
Training LoRAs for Flux
Flux, developed by Black Forest Labs, introduced a new model architecture that differs significantly from Stable Diffusion 1.5 and SDXL. Training LoRAs for Flux requires some adjustments to your workflow and expectations.
The most immediate difference is VRAM requirements. Flux is a larger model, and training LoRAs for it typically requires 24 GB of VRAM at minimum, putting it firmly in RTX 4090 or cloud A100 territory. Quantization techniques can bring this down to 16 GB in some configurations, but training quality may suffer. If you are using cloud GPUs, plan for A100-40GB or A100-80GB instances for comfortable Flux LoRA training.
The Kohya SD-Scripts library has added Flux support, and the Kohya_ss GUI can be used for Flux LoRA training with the correct configuration. You will need to select the Flux model as your base, set the architecture to Flux in the model settings, and adjust your resolution and training parameters accordingly. Flux natively handles 1024 by 1024 resolution, and its text encoder setup differs from SDXL, pairing a CLIP-L encoder with the much larger T5-XXL rather than using SDXL's dual CLIP arrangement.
The AI-toolkit by Ostris is another popular option specifically designed for Flux LoRA training. It provides a streamlined configuration file format and supports gradient checkpointing to reduce VRAM usage. Training parameters for Flux LoRAs tend to favor lower learning rates (around 1e-4 with AdamW or 1.0 with Prodigy) and lower network ranks (16 to 32), as Flux's architecture appears to require less adapter capacity to capture new concepts. Training times are longer than SDXL due to the larger model size, typically two to three times longer for equivalent epoch counts.
Despite the higher hardware requirements, Flux LoRAs can produce stunning results. Flux's superior prompt adherence and photorealism extend to fine-tuned outputs, meaning your trained concepts will integrate more naturally into generated scenes with better lighting coherence, more accurate anatomy, and sharper detail than their SDXL equivalents.
Sharing Your LoRA
Once you have trained and tested a LoRA that you are proud of, sharing it with the community is straightforward and rewarding. The two primary platforms for distributing LoRA models are Civitai and Hugging Face.
Civitai is the largest community hub for Stable Diffusion and Flux models. It provides a dedicated upload flow for LoRAs, including version management, tagging, and a gallery system for showcasing example outputs. When uploading to Civitai, include comprehensive metadata: the base model your LoRA was trained on, recommended weight settings, trigger words, and any known limitations. Upload at least four to six high-quality preview images that demonstrate the LoRA's capabilities across different prompts and settings. Good preview images are the biggest factor in whether people download and use your model.
Hugging Face is the more technical platform, favored by researchers and developers. It uses a Git-based repository system where each model gets its own repo with version history. Hugging Face is particularly good for LoRAs you want to integrate into automated pipelines, as its API makes programmatic access straightforward. Include a detailed model card (README) with training parameters, dataset description, usage instructions, and license information.
Regardless of where you share, document your work thoroughly. State which base model the LoRA was trained for, list the recommended inference parameters including weight strength and trigger words, describe any known issues or limitations, and specify the license under which you are releasing the model. Well-documented LoRAs build trust, attract more users, and contribute meaningfully to the open-source generative AI ecosystem.