qidao123.com Tech Community - IT Enterprise Service Reviews · App Marketplace

Title: Training your own Stable Diffusion 3 model with diffusers

Author: 自由的羽毛 Date: 2024-9-3 07:33

Stable Diffusion training code based on diffusers

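This post introduces training code for Stable Diffusion models built on the diffusers library. It covers LoRA, ControlNet, IP-Adapter, and AnimateDiff, as well as LoRA training for the recent Stable Diffusion 3.
Existing trainers such as kohya-ss are convenient to use, but their source code is heavily wrapped and verbose, which makes it hard for a beginner like me to read. I therefore rewrote the training code on top of diffusers and removed many redundant parts. If you want to understand how training works at the code level, a star on the repository would be appreciated.
GitHub: https://github.com/SongwuJob/simple-SD-trainer
The code is mainly adapted from diffusers, with reference to several open-source projects. I am a beginner myself, so please forgive any mistakes.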
Image Caption

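Image captions are an important part of training text-to-image models and are needed for LoRA, ControlNet, and so on. Common captioning methods fall roughly into two categories:

SDWebUI Tagger: a tagger used from the webui interface. In essence it is a classification model that predicts tags.
VLM: a vision-language model understands the dense semantics of an image better and can produce detailed tags. This is the method we recommend.

In our experiments we use GLM-4v-9b to caption the training images. Specifically, we prompt the VLM with query = "please describe this image into prompt words, and reply us with keywords like xxx, xxx, xxx, xxx" to output the caption. For example, GLM-4v can caption a single image as follows: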

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4v-9b", trust_remote_code=True)

query = "please describe this image into prompt words, and reply us with keywords like xxx, xxx, xxx, xxx"
image = Image.open("your image").convert('RGB')  # replace with the path to your image

# Build the chat-mode inputs (text query plus image).
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "image": image, "content": query}],
    add_generation_prompt=True, tokenize=True, return_tensors="pt",
    return_dict=True,
)
inputs = inputs.to(device)

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4v-9b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

# top_k=1 makes sampling effectively greedy; max_new_tokens caps the caption length.
gen_kwargs = {"max_new_tokens": 77, "do_sample": True, "top_k": 1}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    # Drop the prompt tokens and keep only the newly generated ones.
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    caption = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(caption)
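
A VLM reply is free text, so it can help to tidy it before storing it as a caption. The helper below is only a sketch of one possible clean-up step (split on commas, trim, drop duplicates); `normalize_caption` and its `max_tags` default are hypothetical and are not part of the repository.

```python
def normalize_caption(raw: str, max_tags: int = 40) -> str:
    """Turn a raw VLM reply into a clean comma-separated tag string."""
    tags, seen = [], set()
    # Treat newlines like commas so multi-line replies are handled too.
    for piece in raw.replace("\n", ",").split(","):
        tag = piece.strip().strip(".").strip()
        key = tag.lower()  # compare case-insensitively when de-duplicating
        if tag and key not in seen:
            seen.add(key)
            tags.append(tag)
    return ", ".join(tags[:max_tags])


print(normalize_caption("white hair, anime style,\nWhite Hair, pink background."))
# white hair, anime style, pink background
```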


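LoRA training

Stable Diffusion XL (SDXL) is an advanced latent diffusion model (LDM) variant designed to generate high-quality images from text descriptions. Building on SD1.5 (2.1), SDXL offers stronger capabilities and better performance, which makes it a powerful tool for many generative AI applications.
Our LoRA training code, train_text_to_image_lora_sdxl.py, is adapted from diffusers and kohya-ss.

We rewrote the dataset class as BaseDataset.py so that it is easy to read how the data is loaded.
To simplify training, we removed some of the parameters in the diffusers code and adjusted some settings.

To train your own LoRA model with diffusers, first caption the training data, either with the SDWebUI tagger or with a vision-language model. The resulting data.json looks like this: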

[
  {
    "image": "1.jpg",
    "text": "white hair, anime style, pink background, long hair, jacket, black and red top, earrings, rosy cheeks, large eyes, youthful, fashion, illustration, manga, character design, vibrant colors, hairstyle, clothing, accessories, earring design, artistic, contemporary, youthful fashion, graphic novel, digital drawing, pop art influence, soft shading, detailed rendering, feminine aesthetic"
  },
  {
    "image": "2.jpg",
    "text": "cute, anime-style, girl, long, wavy, hair, green, plaid, blazer, blush, big, expressive, eyes, hoop, earrings, soft, pastel, colors, youthful, innocent, charming, fashionable"
  }
]
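
As a rough sketch of how a file in this format could be produced, the helper below walks an image folder and writes one {"image", "text"} entry per file. `build_data_json` and the `caption_fn` callable are hypothetical names introduced for illustration; in practice `caption_fn` would wrap the GLM-4v call shown earlier or a tagger.

```python
import json
from pathlib import Path
from typing import Callable

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def build_data_json(image_dir: str, caption_fn: Callable[[Path], str]) -> list:
    """Write image_dir/data.json with one {"image", "text"} entry per image.

    caption_fn is a hypothetical callable: it receives an image path and
    returns the caption string for that image.
    """
    root = Path(image_dir)
    entries = []
    for path in sorted(root.iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue  # skip data.json itself and any other non-image file
        entries.append({"image": path.name, "text": caption_fn(path).strip()})
    with open(root / "data.json", "w", encoding="utf-8") as f:
        json.dump(entries, f, ensure_ascii=False, indent=2)
    return entries
```

Calling `build_data_json("/path/to/your/data", my_caption_fn)` would then leave a data.json next to the images, which is where the training script's JSON_FILE variable points in the examples.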


Once the images are captioned, run sh train_text_to_image_lora_sdxl.sh to train your LoRA model. See GitHub for the full code:

# Paths: adjust these to your environment.
export MODEL_NAME="/path/to/your/model"
export OUTPUT_DIR="lora/rank32"
export TRAIN_DIR="/path/to/your/data"
export JSON_FILE="/path/to/your/data/data.json"

# Train LoRA weights for the UNet (rank 32) and the text encoder (rank 8).
accelerate launch ./stable_diffusion/train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path="$MODEL_NAME" \
  --train_data_dir="$TRAIN_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --json_file="$JSON_FILE" \
  --height=1024 --width=1024 \
  --train_batch_size=2 \
  --random_flip \
  --rank=32 --text_encoder_rank=8 \
  --gradient_accumulation_steps=2 \
  --num_train_epochs=30 --repeats=5 \
  --checkpointing_steps=1000 \
  --learning_rate=1e-4 \
  --text_encoder_lr=1e-5 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=500 \
  --mixed_precision="fp16" \
  --train_text_encoder \
  --seed=1337
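
To get a feel for how the flags above translate into training length, the following back-of-the-envelope sketch estimates the number of optimizer steps. It assumes a single GPU by default and that --repeats simply repeats each image that many times per epoch (the kohya-ss convention); the actual script may count differently, and `estimate_optimizer_steps` is a name made up for this illustration.

```python
import math


def estimate_optimizer_steps(num_images, repeats, train_batch_size,
                             gradient_accumulation_steps, num_train_epochs,
                             num_processes=1):
    """Rough estimate of optimizer updates for a run (illustrative only)."""
    samples_per_epoch = num_images * repeats
    # All processes together consume train_batch_size * num_processes samples per batch.
    batches_per_epoch = math.ceil(samples_per_epoch / (train_batch_size * num_processes))
    # One optimizer update happens every `gradient_accumulation_steps` batches.
    updates_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
    return updates_per_epoch * num_train_epochs


# Hypothetical dataset of 40 images with the flags from the script above.
steps = estimate_optimizer_steps(
    num_images=40, repeats=5, train_batch_size=2,
    gradient_accumulation_steps=2, num_train_epochs=30,
)
print(steps)  # 1500
```

Under these assumptions, a 500-step warmup would cover about a third of such a short run, which is one reason to check this number before launching.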


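Stable Diffusion 3 LoRA training

The SD3 LoRA training code, train_text_to_image_lora_sd3.py, is adapted from train_dreambooth_lora_sd3.py in diffusers. SD3 training still has many open problems; this is only simple training code based on diffusers, and it appears to work well with max_sequence_length=77.

Data preprocessing (image captioning) and the data.json format are the same as for SDXL.
We rewrote the dataset as SD3BaseDataset.py so that it is easy to read how the data is loaded.
To simplify training, we removed some diffusers parameters and adjusted some settings.
The DiT architecture needs a fairly large rank to give good results. We suggest 64-128 (64 for smaller training sets, 128 for larger ones).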
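
To make the cost of a larger rank concrete: a LoRA adapter on a linear layer with d_in inputs and d_out outputs adds two matrices of shapes (r, d_in) and (d_out, r), that is r * (d_in + d_out) extra parameters. The sketch below compares a few ranks; the 1536-wide projection is only an illustrative size chosen for this example, not a value read from the SD3 checkpoint.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank, d_in), B is (d_out, rank)."""
    return rank * d_in + d_out * rank


d = 1536  # illustrative hidden size (an assumption for this example)
full = d * d  # parameters in the frozen d x d weight itself
for rank in (8, 32, 64, 128):
    added = lora_param_count(d, d, rank)
    print(f"rank={rank:3d}  added={added:7d}  ratio={added / full:.3f}")
```

The added parameter count grows linearly with the rank, so going from 64 to 128 doubles the trainable weights per adapted layer.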

After captioning the training images, run sh train_text_to_image_lora_sd3.sh to train your LoRA model. See GitHub for the full code:

# Paths: adjust these to your environment.
export MODEL_NAME="/path/to/your/stable-diffusion-3-medium-diffusers"
export OUTPUT_DIR="lora/rank32"
export TRAIN_DIR="/path/to/your/data"
export JSON_FILE="/path/to/your/data/data.json"

# Train LoRA weights for the SD3 transformer (rank 64) and the text encoder (rank 8).
accelerate launch ./stable_diffusion/train_text_to_image_lora_sd3.py \
  --pretrained_model_name_or_path="$MODEL_NAME" \
  --train_data_dir="$TRAIN_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --json_file="$JSON_FILE" \
  --mixed_precision="fp16" \
  --height=1024 --width=1024 \
  --random_flip \
  --train_batch_size=2 \
  --checkpointing_steps=1000 \
  --gradient_accumulation_steps=2 \
  --learning_rate=1e-4 \
  --text_encoder_lr=5e-6 \
  --rank=64 --text_encoder_rank=8 \
  --lr_scheduler="constant_with_warmup" --lr_warmup_steps=500 \
  --num_train_epochs=30 \
  --scale_lr --train_text_encoder \
  --seed=1337
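
In the upstream diffusers example scripts, --scale_lr multiplies the base learning rate by the gradient accumulation steps, the per-device batch size, and the number of processes. Assuming this script keeps that behaviour (not verified here), the effective learning rate can be sketched as follows; `scaled_learning_rate` is a name used only for this illustration.

```python
def scaled_learning_rate(learning_rate, gradient_accumulation_steps,
                         train_batch_size, num_processes=1):
    """Mirror the --scale_lr rule used by the upstream diffusers examples."""
    return learning_rate * gradient_accumulation_steps * train_batch_size * num_processes


# Flags from the script above, on a single GPU.
effective_lr = scaled_learning_rate(1e-4, gradient_accumulation_steps=2, train_batch_size=2)
print(f"{effective_lr:.1e}")  # 4.0e-04
```

So the 1e-4 given on the command line would become roughly 4e-4 on one GPU and grow further with more GPUs; dropping --scale_lr keeps the value as written.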


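ControlNet training

Our ControlNet training code, train_controlnet_sdxl.py, is adapted from diffusers.

We rewrote the dataset as ControlNetDataset.py so that it is easy to read how the data is loaded.
We rewrote the data loading process and removed some diffusers parameters to simplify training.

To try out ControlNet training, you can download a suitable dataset from Hugging Face, such as controlnet_sdxl_animal. The training data then needs some simple preprocessing into the following layout:

Training data directory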

controlnet_data
├──images/ (image files)
│ ├──0.png
│ ├──1.png
│ ├──......
├──conditioning_images/ (conditioning image files)
│ ├──0.png
│ ├──1.png
│ ├──......
├──data.json


data.json format

[
  {
    "text": "a person walking a dog on a leash",
    "image": "images/1.png",
    "conditioning_image": "conditioning_images/1.png"
  },
  {
    "text": "a woman walking her dog in the park",
    "image": "images/2.png",
    "conditioning_image": "conditioning_images/2.png"
  }
]
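
Because every entry points at two files, a missing conditioning image is an easy mistake to make. The following sketch checks a data.json of this shape before training; it assumes the paths are relative to the data root, as the directory layout suggests, and `find_problems` is a hypothetical helper rather than something shipped with the repository.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("text", "image", "conditioning_image")


def find_problems(data_root: str, json_name: str = "data.json") -> list:
    """Return human-readable problems found in a ControlNet data.json."""
    root = Path(data_root)
    with open(root / json_name, encoding="utf-8") as f:
        entries = json.load(f)
    problems = []
    for i, entry in enumerate(entries):
        for key in REQUIRED_KEYS:
            if key not in entry:
                problems.append(f"entry {i}: missing key '{key}'")
        # Paths in data.json are taken to be relative to the data root.
        for key in ("image", "conditioning_image"):
            if key in entry and not (root / entry[key]).is_file():
                problems.append(f"entry {i}: file not found: {entry[key]}")
    return problems
```

An empty list from `find_problems("controlnet_data")` means every entry has its caption, image, and conditioning image in place.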


With the training images prepared, run sh train_controlnet_sdxl.sh to train the ControlNet model. See GitHub for the full code:

# Paths: adjust these to your environment.
export MODEL_DIR="/path/to/your/model"
export OUTPUT_DIR="controlnet"
export TRAIN_DIR="controlnet_data"
export JSON_FILE="controlnet_data/data.json"

# Train a ControlNet on top of the SDXL base model.
accelerate launch ./stable_diffusion/train_controlnet_sdxl.py \
  --pretrained_model_name_or_path="$MODEL_DIR" \
  --train_data_dir="$TRAIN_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --json_file="$JSON_FILE" \
  --mixed_precision="fp16" \
  --width=1024 --height=1024 \
  --learning_rate=1e-5 \
  --checkpointing_steps=1000 \
  --num_train_epochs=5 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=500 \
  --train_batch_size=1 --dataloader_num_workers=4 \
  --gradient_accumulation_steps=2 \
  --seed=1337


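IP-Adapter training

IP-Adapter is a tuning-free method for personalized text-to-image generation (no per-subject training is needed at inference time). It comes in several versions, such as IP-Adapter-Plus and IP-Adapter-FaceID. Here we reproduce the training code of IP-Adapter-Plus so that you can fine-tune it on a small dataset. For example, you can fine-tune IP-Adapter-Plus on an anime dataset for personalized anime image generation.
Our training code, train_ip_adapter_plus_sdxl.py, is adapted from IP-Adapter.

We rewrote the dataset as IPAdapterDataset.py so that it is easy to see how the data is loaded.
We train IP-Adapter-Plus-SDXL because it captures fine-grained image information better.

In essence, the IP-Adapter training objective is a reconstruction task, so the dataset format is the same as for LoRA fine-tuning. After captioning the training images, run sh train_ip_adapter_plus_sdxl.sh to train the IP-Adapter. See GitHub for the full code: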

# Paths: adjust these to your environment.
export MODEL_NAME="/path/to/your/stable-diffusion-xl-base-1.0"
export PRETRAIN_IP_ADAPTER_PATH="/path/to/your/.../sdxl_models/ip-adapter-plus_sdxl_vit-h.bin"
export IMAGE_ENCODER_PATH="/path/to/your/.../models/image_encoder"
export OUTPUT_DIR="ip-adapter"
export TRAIN_DIR="images"
export JSON_FILE="images/data.json"

# Fine-tune the pretrained IP-Adapter-Plus weights on your own images.
accelerate launch ./stable_diffusion/train_ip_adapter_plus_sdxl.py \
  --pretrained_model_name_or_path="$MODEL_NAME" \
  --image_encoder_path="$IMAGE_ENCODER_PATH" \
  --pretrained_ip_adapter_path="$PRETRAIN_IP_ADAPTER_PATH" \
  --data_json_file="$JSON_FILE" \
  --data_root_path="$TRAIN_DIR" \
  --mixed_precision="fp16" \
  --height=1024 --width=1024 \
  --train_batch_size=2 \
  --dataloader_num_workers=4 \
  --learning_rate=1e-05 \
  --weight_decay=0.01 \
  --output_dir="$OUTPUT_DIR" \
  --save_steps=10000 \
  --seed=1337


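AnimateDiff training

AnimateDiff is an open-source text-to-video (T2V) technique. It extends the original text-to-image model with a motion module that learns reliable motion priors from large-scale video datasets. Here we rewrote the AnimateDiff training code with LoRA. Note that the training code is reproduced on SD1.5 models using the latest diffusers library:

Our training code refers to AnimationDiff with train; to keep it simple, we use the latest diffusers.
We rewrote the dataset as AnimateDiffDataset.py so that it is easy to see how the video data and its captions are loaded.

Note that we fine-tune the pretrained AnimateDiff with LoRA, which greatly reduces CUDA memory requirements. To fine-tune an AnimateDiff model, you can download video data from Hugging Face, such as webvid10M. The processed data.json looks like this: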

[
  {
    "video": "stock-footage-grilled-chicken-wings.mp4",
    "text": "Grilled chicken wings."
  },
  {
    "video": "stock-footage-waving-australian-flag-on-top-of-a-building.mp4",
    "text": "Waving Australian flag on top of a building."
  }
]
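
A video dataset typically has to turn each clip into a fixed number of frames before it can be batched. The sketch below shows one common way to pick frame indices (a random window with a fixed stride). It is an assumption about what such a loader might do, not a description of AnimateDiffDataset.py, and `sample_frame_indices` with its default values is a hypothetical helper.

```python
import random


def sample_frame_indices(total_frames, num_frames=16, stride=4, rng=None):
    """Pick `num_frames` indices spaced `stride` apart from a random start.

    If the video is too short for the requested stride, fall back to the
    largest stride that fits (at least 1) and clamp indices to the last frame.
    """
    rng = rng or random.Random()
    if total_frames <= 0:
        raise ValueError("video has no frames")
    # Largest stride such that (num_frames - 1) * stride still fits in the video.
    if num_frames > 1:
        stride = max(1, min(stride, (total_frames - 1) // (num_frames - 1)))
    span = (num_frames - 1) * stride
    start = rng.randint(0, max(0, total_frames - 1 - span))
    return [min(start + i * stride, total_frames - 1) for i in range(num_frames)]
```

Passing a seeded `random.Random` as `rng` makes the sampling reproducible; videos shorter than `num_frames` end up with the last frame repeated.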


Specifically, we use PEFT to fine-tune the AnimateDiff model with LoRA:

import torch
from diffusers import AutoencoderKL, DDPMScheduler, MotionAdapter, UNet2DConditionModel, UNetMotionModel
from peft import LoraConfig
from transformers import CLIPTextModel, CLIPTokenizer

# `args` is the argparse namespace built by the training script (not shown here).

# Load scheduler, tokenizer and models.
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision)
text_encoder = CLIPTextModel.from_pretrained(args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision)
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision, variant=args.variant)
unet = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision, variant=args.variant)

# AnimateDiff: UNet2DConditionModel -> UNetMotionModel
motion_adapter = MotionAdapter.from_pretrained(args.motion_module, torch_dtype=torch.float16)
unet = UNetMotionModel.from_unet2d(unet, motion_adapter)

# Freeze all base model parameters to save memory; only the LoRA weights are trained.
unet.requires_grad_(False)
vae.requires_grad_(False)
text_encoder.requires_grad_(False)

# Use PEFT to attach LoRA to the attention projections of the SD UNet and the motion adapter.
unet_lora_config = LoraConfig(
    r=args.rank,
    lora_alpha=args.rank,  # alpha == r, so the LoRA scaling factor alpha / r is 1
    init_lora_weights="gaussian",
    target_modules=["to_k", "to_q", "to_v", "to_out.0"],
)

# Add the adapter; the trainable LoRA params should be kept in float32 by the training script.
unet.add_adapter(unet_lora_config)
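
The target_modules list decides which layers receive LoRA weights. For a list of names, PEFT matches a module when its full name equals an entry or ends with "." plus that entry. The sketch below imitates that suffix rule on a few made-up module names (they are illustrative, not dumped from UNetMotionModel) to show why the attention projections of both the spatial layers and the motion modules are picked up, while other layers are left frozen. `is_lora_target` is a helper written for this example, not a PEFT function.

```python
TARGET_MODULES = ["to_k", "to_q", "to_v", "to_out.0"]


def is_lora_target(module_name, targets=TARGET_MODULES):
    """Suffix match in the style PEFT applies to a list of target names."""
    return any(module_name == t or module_name.endswith("." + t) for t in targets)


# Illustrative module names only.
names = [
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.0",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_out.1",
    "down_blocks.0.motion_modules.0.transformer_blocks.0.attn1.to_k",
    "down_blocks.0.resnets.0.conv1",
]
for name in names:
    print(f"{is_lora_target(name)!s:5}  {name}")
```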


With the training videos prepared, run sh train_animatediff_with_lora.sh to train the AnimateDiff model. See GitHub for the full code:

# Paths: adjust these to your environment.
export MODEL_NAME="/path/to/your/Realistic_Vision_V5.1_noVAE"
export MOTION_MODULE="/path/to/your/animatediff-motion-adapter-v1-5-2"
export OUTPUT_DIR="animatediff"
export TRAIN_DIR="webvid"
export JSON_FILE="webvid/data.json"

# Fine-tune AnimateDiff (SD1.5 base plus motion adapter) with rank-8 LoRA.
accelerate launch ./stable_diffusion/train_animatediff_with_lora.py \
  --pretrained_model_name_or_path="$MODEL_NAME" \
  --motion_module="$MOTION_MODULE" \
  --train_data_dir="$TRAIN_DIR" \
  --output_dir="$OUTPUT_DIR" \
  --json_file="$JSON_FILE" \
  --resolution=512 \
  --train_batch_size=1 \
  --rank=8 \
  --gradient_accumulation_steps=2 \
  --num_train_epochs=10 \
  --checkpointing_steps=10000 \
  --learning_rate=1e-5 \
  --lr_scheduler="constant_with_warmup" \
  --lr_warmup_steps=500 \
  --mixed_precision="fp16" \
  --seed=1337

