1 Task Requirements

Reproduce the paper Training-Free Layout Control with Cross-Attention Guidance in the Diffusers version of StableDiffusionXL (no code provided for this).


2 Implementation Process

The rough plan is to merge two pieces of code, i.e., to port the paper's idea into StableDiffusionXL. The first step is to find where the StableDiffusionXL code lives.

The paper's core idea is that, starting from a plain text-to-image model, layout-controlled images can be generated without any fine-tuning or training. So StableDiffusionXL itself does not need to be implemented; it can simply be called through an API.
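
For example, loading SDXL through Diffusers takes only a few lines; a minimal sketch using the public base checkpoint:

import torch
from diffusers import StableDiffusionXLPipeline

# Load the pretrained SDXL pipeline; no training or fine-tuning involved.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")
image = pipe("a cat sitting on a park bench").images[0]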


3 The U-Net Network

U-Net is a convolutional neural network (CNN) architecture for image segmentation. It was proposed by Olaf Ronneberger et al. in 2015, originally to tackle medical image segmentation.

U-Net owes its name to its U-shaped network structure, which consists of two parts: an encoder and a decoder.

The encoder progressively extracts features from the input image while reducing spatial resolution. The decoder upsamples the feature maps back to the original input size and progressively produces the segmentation result.

U-Net's key innovation is the skip connections introduced in the decoder: feature maps from the encoder are concatenated with the corresponding feature maps in the decoder. These skip connections let the decoder exploit feature information from multiple levels, improving segmentation accuracy and detail preservation, as the sketch below illustrates.
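
A minimal, generic sketch of a skip connection (not the SDXL implementation): the saved encoder feature map is concatenated onto the upsampled decoder feature map along the channel axis.

import torch
import torch.nn as nn

# Generic U-Net skip connection: upsample the decoder features, then
# concatenate the saved encoder features along the channel dimension.
enc_feat = torch.randn(1, 64, 128, 128)        # saved on the encoder path
dec_feat = torch.randn(1, 64, 64, 64)          # arriving on the decoder path

up = nn.ConvTranspose2d(64, 64, kernel_size=2, stride=2)
dec_up = up(dec_feat)                          # -> (1, 64, 128, 128)
merged = torch.cat([dec_up, enc_feat], dim=1)  # -> (1, 128, 128, 128)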

Network architecture:
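
The module tree below can be reproduced by printing the UNet; a minimal sketch using the public SDXL base checkpoint:

from diffusers import UNet2DConditionModel

# Load only the UNet from the SDXL base checkpoint and print its module tree.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
print(unet)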

UNet2DConditionModel(
(conv_in): Conv2d(4, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_proj): Timesteps()
(time_embedding): TimestepEmbedding(
(linear_1): Linear(in_features=320, out_features=1280, bias=True)
(act): SiLU()
(linear_2): Linear(in_features=1280, out_features=1280, bias=True)
)
(add_time_proj): Timesteps()
(add_embedding): TimestepEmbedding(
(linear_1): Linear(in_features=2816, out_features=1280, bias=True)
(act): SiLU()
(linear_2): Linear(in_features=1280, out_features=1280, bias=True)
)
(down_blocks): ModuleList(
(0): DownBlock2D(
(resnets): ModuleList(
(0-1): 2 x ResnetBlock2D(
(norm1): GroupNorm(32, 320, eps=1e-05, affine=True)
(conv1): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=320, bias=True)
(norm2): GroupNorm(32, 320, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
)
)
(downsamplers): ModuleList(
(0): Downsample2D(
(conv): Conv2d(320, 320, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
)
)
)
(1): CrossAttnDownBlock2D(
(attentions): ModuleList(
(0-1): 2 x Transformer2DModel(
(norm): GroupNorm(32, 640, eps=1e-06, affine=True)
(proj_in): Linear(in_features=640, out_features=640, bias=True)
(transformer_blocks): ModuleList(
(0-1): 2 x BasicTransformerBlock(
(norm1): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
(attn1): Attention(
(to_q): Linear(in_features=640, out_features=640, bias=False)
(to_k): Linear(in_features=640, out_features=640, bias=False)
(to_v): Linear(in_features=640, out_features=640, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=640, out_features=640, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm2): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
(attn2): Attention(
(to_q): Linear(in_features=640, out_features=640, bias=False)
(to_k): Linear(in_features=2048, out_features=640, bias=False)
(to_v): Linear(in_features=2048, out_features=640, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=640, out_features=640, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm3): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
(ff): FeedForward(
(net): ModuleList(
(0): GEGLU(
(proj): Linear(in_features=640, out_features=5120, bias=True)
)
(1): Dropout(p=0.0, inplace=False)
(2): Linear(in_features=2560, out_features=640, bias=True)
)
)
)
)
(proj_out): Linear(in_features=640, out_features=640, bias=True)
)
)
(resnets): ModuleList(
(0): ResnetBlock2D(
(norm1): GroupNorm(32, 320, eps=1e-05, affine=True)
(conv1): Conv2d(320, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
(norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
(conv_shortcut): Conv2d(320, 640, kernel_size=(1, 1), stride=(1, 1))
)
(1): ResnetBlock2D(
(norm1): GroupNorm(32, 640, eps=1e-05, affine=True)
(conv1): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
(norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
)
)
(downsamplers): ModuleList(
(0): Downsample2D(
(conv): Conv2d(640, 640, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1))
)
)
)
(2): CrossAttnDownBlock2D(
(attentions): ModuleList(
(0-1): 2 x Transformer2DModel(
(norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
(proj_in): Linear(in_features=1280, out_features=1280, bias=True)
(transformer_blocks): ModuleList(
(0-9): 10 x BasicTransformerBlock(
(norm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(attn1): Attention(
(to_q): Linear(in_features=1280, out_features=1280, bias=False)
(to_k): Linear(in_features=1280, out_features=1280, bias=False)
(to_v): Linear(in_features=1280, out_features=1280, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=1280, out_features=1280, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(attn2): Attention(
(to_q): Linear(in_features=1280, out_features=1280, bias=False)
(to_k): Linear(in_features=2048, out_features=1280, bias=False)
(to_v): Linear(in_features=2048, out_features=1280, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=1280, out_features=1280, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm3): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(ff): FeedForward(
(net): ModuleList(
(0): GEGLU(
(proj): Linear(in_features=1280, out_features=10240, bias=True)
)
(1): Dropout(p=0.0, inplace=False)
(2): Linear(in_features=5120, out_features=1280, bias=True)
)
)
)
)
(proj_out): Linear(in_features=1280, out_features=1280, bias=True)
)
)
(resnets): ModuleList(
(0): ResnetBlock2D(
(norm1): GroupNorm(32, 640, eps=1e-05, affine=True)
(conv1): Conv2d(640, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
(norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
(conv_shortcut): Conv2d(640, 1280, kernel_size=(1, 1), stride=(1, 1))
)
(1): ResnetBlock2D(
(norm1): GroupNorm(32, 1280, eps=1e-05, affine=True)
(conv1): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
(norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
)
)
)
)
(up_blocks): ModuleList(
(0): CrossAttnUpBlock2D(
(attentions): ModuleList(
(0-2): 3 x Transformer2DModel(
(norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
(proj_in): Linear(in_features=1280, out_features=1280, bias=True)
(transformer_blocks): ModuleList(
(0-9): 10 x BasicTransformerBlock(
(norm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(attn1): Attention(
(to_q): Linear(in_features=1280, out_features=1280, bias=False)
(to_k): Linear(in_features=1280, out_features=1280, bias=False)
(to_v): Linear(in_features=1280, out_features=1280, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=1280, out_features=1280, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(attn2): Attention(
(to_q): Linear(in_features=1280, out_features=1280, bias=False)
(to_k): Linear(in_features=2048, out_features=1280, bias=False)
(to_v): Linear(in_features=2048, out_features=1280, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=1280, out_features=1280, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm3): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(ff): FeedForward(
(net): ModuleList(
(0): GEGLU(
(proj): Linear(in_features=1280, out_features=10240, bias=True)
)
(1): Dropout(p=0.0, inplace=False)
(2): Linear(in_features=5120, out_features=1280, bias=True)
)
)
)
)
(proj_out): Linear(in_features=1280, out_features=1280, bias=True)
)
)
(resnets): ModuleList(
(0-1): 2 x ResnetBlock2D(
(norm1): GroupNorm(32, 2560, eps=1e-05, affine=True)
(conv1): Conv2d(2560, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
(norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
(conv_shortcut): Conv2d(2560, 1280, kernel_size=(1, 1), stride=(1, 1))
)
(2): ResnetBlock2D(
(norm1): GroupNorm(32, 1920, eps=1e-05, affine=True)
(conv1): Conv2d(1920, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
(norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
(conv_shortcut): Conv2d(1920, 1280, kernel_size=(1, 1), stride=(1, 1))
)
)
(upsamplers): ModuleList(
(0): Upsample2D(
(conv): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
)
)
(1): CrossAttnUpBlock2D(
(attentions): ModuleList(
(0-2): 3 x Transformer2DModel(
(norm): GroupNorm(32, 640, eps=1e-06, affine=True)
(proj_in): Linear(in_features=640, out_features=640, bias=True)
(transformer_blocks): ModuleList(
(0-1): 2 x BasicTransformerBlock(
(norm1): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
(attn1): Attention(
(to_q): Linear(in_features=640, out_features=640, bias=False)
(to_k): Linear(in_features=640, out_features=640, bias=False)
(to_v): Linear(in_features=640, out_features=640, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=640, out_features=640, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm2): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
(attn2): Attention(
(to_q): Linear(in_features=640, out_features=640, bias=False)
(to_k): Linear(in_features=2048, out_features=640, bias=False)
(to_v): Linear(in_features=2048, out_features=640, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=640, out_features=640, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm3): LayerNorm((640,), eps=1e-05, elementwise_affine=True)
(ff): FeedForward(
(net): ModuleList(
(0): GEGLU(
(proj): Linear(in_features=640, out_features=5120, bias=True)
)
(1): Dropout(p=0.0, inplace=False)
(2): Linear(in_features=2560, out_features=640, bias=True)
)
)
)
)
(proj_out): Linear(in_features=640, out_features=640, bias=True)
)
)
(resnets): ModuleList(
(0): ResnetBlock2D(
(norm1): GroupNorm(32, 1920, eps=1e-05, affine=True)
(conv1): Conv2d(1920, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
(norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
(conv_shortcut): Conv2d(1920, 640, kernel_size=(1, 1), stride=(1, 1))
)
(1): ResnetBlock2D(
(norm1): GroupNorm(32, 1280, eps=1e-05, affine=True)
(conv1): Conv2d(1280, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
(norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
(conv_shortcut): Conv2d(1280, 640, kernel_size=(1, 1), stride=(1, 1))
)
(2): ResnetBlock2D(
(norm1): GroupNorm(32, 960, eps=1e-05, affine=True)
(conv1): Conv2d(960, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=640, bias=True)
(norm2): GroupNorm(32, 640, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
(conv_shortcut): Conv2d(960, 640, kernel_size=(1, 1), stride=(1, 1))
)
)
(upsamplers): ModuleList(
(0): Upsample2D(
(conv): Conv2d(640, 640, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)
)
)
(2): UpBlock2D(
(resnets): ModuleList(
(0): ResnetBlock2D(
(norm1): GroupNorm(32, 960, eps=1e-05, affine=True)
(conv1): Conv2d(960, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=320, bias=True)
(norm2): GroupNorm(32, 320, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
(conv_shortcut): Conv2d(960, 320, kernel_size=(1, 1), stride=(1, 1))
)
(1-2): 2 x ResnetBlock2D(
(norm1): GroupNorm(32, 640, eps=1e-05, affine=True)
(conv1): Conv2d(640, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=320, bias=True)
(norm2): GroupNorm(32, 320, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(320, 320, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
(conv_shortcut): Conv2d(640, 320, kernel_size=(1, 1), stride=(1, 1))
)
)
)
)
(mid_block): UNetMidBlock2DCrossAttn(
(attentions): ModuleList(
(0): Transformer2DModel(
(norm): GroupNorm(32, 1280, eps=1e-06, affine=True)
(proj_in): Linear(in_features=1280, out_features=1280, bias=True)
(transformer_blocks): ModuleList(
(0-9): 10 x BasicTransformerBlock(
(norm1): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(attn1): Attention(
(to_q): Linear(in_features=1280, out_features=1280, bias=False)
(to_k): Linear(in_features=1280, out_features=1280, bias=False)
(to_v): Linear(in_features=1280, out_features=1280, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=1280, out_features=1280, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm2): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(attn2): Attention(
(to_q): Linear(in_features=1280, out_features=1280, bias=False)
(to_k): Linear(in_features=2048, out_features=1280, bias=False)
(to_v): Linear(in_features=2048, out_features=1280, bias=False)
(to_out): ModuleList(
(0): Linear(in_features=1280, out_features=1280, bias=True)
(1): Dropout(p=0.0, inplace=False)
)
)
(norm3): LayerNorm((1280,), eps=1e-05, elementwise_affine=True)
(ff): FeedForward(
(net): ModuleList(
(0): GEGLU(
(proj): Linear(in_features=1280, out_features=10240, bias=True)
)
(1): Dropout(p=0.0, inplace=False)
(2): Linear(in_features=5120, out_features=1280, bias=True)
)
)
)
)
(proj_out): Linear(in_features=1280, out_features=1280, bias=True)
)
)
(resnets): ModuleList(
(0-1): 2 x ResnetBlock2D(
(norm1): GroupNorm(32, 1280, eps=1e-05, affine=True)
(conv1): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(time_emb_proj): Linear(in_features=1280, out_features=1280, bias=True)
(norm2): GroupNorm(32, 1280, eps=1e-05, affine=True)
(dropout): Dropout(p=0.0, inplace=False)
(conv2): Conv2d(1280, 1280, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(nonlinearity): SiLU()
)
)
)
(conv_norm_out): GroupNorm(32, 320, eps=1e-05, affine=True)
(conv_act): SiLU()
(conv_out): Conv2d(320, 4, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
)

4 Errors While Adapting the Code

4.1 Error 1

After swapping the loaded model for SDXL and running inference.py, the full traceback is:

Traceback (most recent call last):
File "D:\Code\StableDiffusionXL\layout-guidance-main\inference.py", line 114, in main
unet = unet_2d_condition.UNet2DConditionModel(**unet_config).from_pretrained(cfg.general.model_path,
File "F:\anaconda\anaconda3\envs\DeltaZero\lib\site-packages\diffusers\modeling_utils.py", line 483, in from_pretrained
model = cls.from_config(config, **unused_kwargs)
File "F:\anaconda\anaconda3\envs\DeltaZero\lib\site-packages\diffusers\configuration_utils.py", line 210, in from_config
model = cls(**init_dict)
File "F:\anaconda\anaconda3\envs\DeltaZero\lib\site-packages\diffusers\configuration_utils.py", line 567, in inner_init
init(self, *args, **init_kwargs)
File "D:\Code\StableDiffusionXL\layout-guidance-main\my_model\unet_2d_condition.py", line 136, in __init__
down_block = get_down_block(
File "D:\Code\StableDiffusionXL\layout-guidance-main\my_model\unet_2d_blocks.py", line 65, in get_down_block
return CrossAttnDownBlock2D(
File "D:\Code\StableDiffusionXL\layout-guidance-main\my_model\unet_2d_blocks.py", line 537, in __init__
out_channels // attn_num_head_channels,
TypeError: unsupported operand type(s) for //: 'int' and 'list'

This is because the custom UNet code is still the SD version: in SDXL's config, attention_head_dim is a per-block list rather than a single int, so the integer division fails. Switching to the SDXL version of the code fixes it.
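
For illustration, a hedged sketch of the guard that handles both forms (names follow the traceback; the actual handling in diffusers may differ):

# attn_num_head_channels is an int in SD but a per-block list in SDXL's
# config, so resolve it for the current block before dividing.
if isinstance(attn_num_head_channels, (list, tuple)):
    attn_num_head_channels = attn_num_head_channels[i]  # i = down-block index
num_heads = out_channels // attn_num_head_channels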

4.2 Error 2

After switching to SDXL, the model inputs presumably need to change as well, but since I had not read the paper, I did not know where to change things.

The main issue is that SDXL's inputs include a parameter, added_cond_kwargs, that should be set, but I did not know what this parameter means.

added_cond_kwargs (dict, optional): a kwargs dictionary containing additional embeddings that, if specified, are added to the embeddings passed to the UNet blocks.

Let me find some online examples of SDXL usage to see how this parameter is set.

https://littlenyima.github.io/posts/27-sdxl-stable-diffusion-xl/index.html
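
From such examples, a minimal sketch of what added_cond_kwargs holds (the dict keys follow diffusers' StableDiffusionXLPipeline; the concrete sizes below are illustrative assumptions):

import torch

batch = 2
# Pooled embedding from SDXL's second (OpenCLIP) text encoder: (batch, 1280).
add_text_embeds = torch.randn(batch, 1280)
# original_size + crops_coords_top_left + target_size -> 6 ids per sample.
add_time_ids = torch.tensor([[1024, 1024, 0, 0, 1024, 1024]] * batch,
                            dtype=torch.float32)

added_cond_kwargs = {"text_embeds": add_text_embeds, "time_ids": add_time_ids}
# noise_pred = unet(latent_model_input, t,
#                   encoder_hidden_states=prompt_embeds,
#                   added_cond_kwargs=added_cond_kwargs).sample

This also explains the add_embedding input size of 2816 in the module tree above: 6 time ids × 256 Fourier features each, plus the 1280-dim pooled text embedding.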

In a Kaggle environment, installing these three pinned packages resolves the problem (quoted from an issue thread):

!pip install diffusers==0.26.3
!pip install transformers==4.38.1
!pip install accelerate==0.27.2

Next, look at how the original code uses the attention mechanism.

4.3 Error 3

The first run of the code fails as follows:

Traceback (most recent call last):
File "/home/zql/code/layout-guidance-main/inference_xl.py", line 74, in main
pil_images = pipe(
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/zql/code/layout-guidance-main/sdxl.py", line 1223, in __call__
noise_pred, attn_map_integrated_up, attn_map_integrated_mid, attn_map_integrated_down = \
ValueError: not enough values to unpack (expected 4, got 1)

The call returns only one value, but four are expected; time to find where the mismatch comes from.

Even though I modified the UNet2DConditionModel code, it still returns a single value. The pipeline must be using the stock UNet2DConditionModel from the library instead of loading my custom one, so I should re-initialize it explicitly.

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
# Swap in the customized UNet (note: from_pretrained is a classmethod, so the
# freshly constructed instance is discarded and a new model is loaded).
pipe.unet = UNet2DConditionModel(**unet_config).from_pretrained(cfg.general.model_path, subfolder="unet")
pipe.to("cuda:0")

Solved.
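
A quick sanity check (a sketch; the expected module path depends on the local file layout) that the custom class is actually in use:

# Should print the class defined in my_model/unet_2d_condition_xl.py,
# not diffusers' built-in UNet2DConditionModel.
print(type(pipe.unet))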

4.4 Error 4

After initialization succeeded, a dtype mismatch appeared; the full traceback:

Error executing job with overrides: ['general.save_path=./example_output']
Traceback (most recent call last):
File "/home/zql/code/layout-guidance-main/inference_xl.py", line 73, in main
pil_images = pipe(
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/zql/code/layout-guidance-main/sdxl.py", line 1224, in __call__
self.unet(
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zql/code/layout-guidance-main/my_model/unet_2d_condition_xl.py", line 1146, in forward
emb = self.time_embedding(t_emb, timestep_cond)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 579, in forward
sample = self.linear_1(sample)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 117, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype, but got Half and Float

Appending .float() to the offending tensor resolves it. (The root cause is presumably that the custom UNet was loaded in float32 while the pipeline runs in float16.)
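
A sketch of the fix at the failing call site in forward (an alternative, also an assumption, would be to load the custom UNet with torch_dtype=torch.float16 so no cast is needed):

# Cast the sinusoidal time embedding to the dtype of the embedding MLP's
# weights before the matmul, instead of hard-coding one precision.
t_emb = t_emb.to(dtype=self.time_embedding.linear_1.weight.dtype)
emb = self.time_embedding(t_emb, timestep_cond)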

4.5 Error 5

With the error above fixed, the following one appeared:

  File "/home/zql/code/layout-guidance-main/my_model/unet_2d_blocks_xl.py", line 1288, in forward
hidden_states = resnet(hidden_states, temb)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/diffusers/models/resnet.py", line 327, in forward
hidden_states = self.norm1(hidden_states)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 288, in forward
return F.group_norm(
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/functional.py", line 2606, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: Expected weight to be a vector of size equal to the number of channels in input, but got weight of shape [640] and input of shape [640, 64, 64]

Maybe a shape problem? Print the input shape first.

hidden_states.shape: torch.Size([2, 640, 64, 64])

It somehow resolved itself later… In hindsight, the failing input in the traceback is 3-D ([640, 64, 64]) while GroupNorm expects 4-D (N, C, H, W); the batch dimension had likely been stripped somewhere, the same symptom as Error 6 below.

4.6 Error 6

After that, the up part of the U-Net raised the following error; almost everything is resolved now.

  File "/home/zql/code/layout-guidance-main/my_model/unet_2d_condition_xl.py", line 1285, in forward
sample, cross_atten_prob = upsample_block(
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zql/code/layout-guidance-main/my_model/unet_2d_blocks_xl.py", line 2529, in forward
hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1)
RuntimeError: Tensors must have same number of dimensions: got 3 and 4

Printing the shapes of the relevant tensors:

hidden_states.shape: torch.Size([1280, 32, 32])
res_hidden_states.shape: torch.Size([2, 1280, 32, 32])

The shapes clearly differ, but it was not obvious which one was wrong.

Found it: the attention computation returns two values, but I was only receiving one.
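
A minimal, runnable sketch of the bug pattern (attn_block is a stand-in for the modified attention block):

import torch

def attn_block(x):
    # Stand-in: the modified block returns (hidden_states, attn_probs).
    return x, torch.ones(x.shape[0], 32, 32)

x = torch.randn(2, 1280, 32, 32)

# Buggy: catch the tuple in one name, then index one level too deep.
out = attn_block(x)
bad = out[0][0]                        # shape (1280, 32, 32): batch dim lost

# Fixed: unpack both return values explicitly.
hidden_states, cross_attn_prob = attn_block(x)
print(bad.shape, hidden_states.shape)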

4.7 Error 7

Finally, when computing the gradient of the loss with respect to latents, the following error occurs:

  File "/home/zql/code/layout-guidance-main/sdxl.py", line 1263, in __call__
grad_cond = torch.autograd.grad(loss.requires_grad_(True), [latents])[0]
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/autograd/__init__.py", line 436, in grad
result = _engine_run_backward(
File "/home/zql/anaconda3/envs/sdxl/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

This means one or more tensors marked as requiring gradients were never used in the computation graph. We have to make sure loss actually depends on latents; if the loss computation is unrelated to latents, taking its gradient with respect to latents is meaningless, since the gradient would be zero or undefined.

The relevant code:

# Use autograd to compute the gradient of the loss w.r.t. latents.
grad_cond = torch.autograd.grad(loss.requires_grad_(True), [latents])[0]

Printing loss shows a real value, so perhaps the loss computation really is disconnected from latents? But in principle it should depend on them.

How should this be fixed? I think the first step is to check what the original, working code does at runtime.

After running the original code: why is latent_model_input.requires_grad printed there True?

It turned out that the function being run disables gradients right from its decorator:

@torch.no_grad()  # <- disables autograd for the entire __call__
@replace_example_docstring(EXAMPLE_DOC_STRING)
def __call__(
    self,
    vae,
    tokenizer,
    text_encoder,

Speechless. Lesson learned: watch out for this decorator in the future.
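
A runnable sketch of the standard workaround: re-enable autograd locally inside the no-grad context for the guidance step only (unet_step and layout_loss are hypothetical stand-ins for the real computation):

import torch

def unet_step(latents):
    return latents * 2.0          # stand-in for the UNet forward pass

def layout_loss(x):
    return x.square().mean()      # stand-in for the cross-attention layout loss

@torch.no_grad()
def denoise_step(latents, grad_scale=1.0):
    with torch.enable_grad():     # overrides the surrounding no_grad
        latents = latents.detach().requires_grad_(True)
        loss = layout_loss(unet_step(latents))
        grad_cond = torch.autograd.grad(loss, [latents])[0]
    return latents.detach() - grad_scale * grad_cond

print(denoise_step(torch.randn(1, 4, 8, 8)).shape)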

4.8 Error 8

The GPU runs out of memory (OOM):

torch.OutOfMemoryError: CUDA out of memory. 
Tried to allocate 160.00 MiB. GPU 0 has a total capacity of 23.60 GiB of which 132.50 MiB is free.
Including non-PyTorch memory, this process has 23.44 GiB memory in use.
Of the allocated memory 22.79 GiB is allocated by PyTorch, and 351.34 MiB is reserved by PyTorch but unallocated.
If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

At inference time the feed-forward (FFN) activations are huge: 24 GB of VRAM is not enough, and neither was 32 GB. The main reason is that layout guidance needs gradients, so every intermediate tensor on the path from latents through the SDXL computation has to be kept alive for the backward pass.
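
A few standard levers that might help (hedged suggestions, not a verified fix for this setup; pipe is the StableDiffusionXLPipeline from above):

pipe.enable_attention_slicing()            # compute attention in slices
pipe.unet.enable_gradient_checkpointing()  # recompute activations in backward
# And/or apply layout guidance only during the first few denoising steps,
# running the remaining steps fully under torch.no_grad().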


5 Server Configuration Errors

5.1 Cannot reach docker.com during apt-get update

Problem: running sudo apt-get update cannot connect to download.docker.com; the full error:

Err:10 https://download.docker.com/linux/ubuntu focal InRelease                                                                                                                             
Cannot initiate the connection to download.docker.com:443 (2001::caa0:8010). - connect (101: Network is unreachable) Could not connect to download.docker.com:443 (108.160.167.30), connection timed out
Reading package lists... Done
W: Failed to fetch https://download.docker.com/linux/ubuntu/dists/focal/InRelease Cannot initiate the connection to download.docker.com:443 (2001::caa0:8010). - connect (101: Network is unreachable) Could not connect to download.docker.com:443 (108.160.167.30), connection timed out
W: Some index files failed to download. They have been ignored, or old ones used instead.

In the end this was never really solved. The main goal was to install gpustat, and after the conda environment was set up, gpustat was somehow available anyway; I am not sure why.

5.2 Out of disk space

Installing torch failed due to insufficient disk space; the full error:

ERROR: Could not install packages due to an OSError: 
[Errno 28] No space left on device: '/home/zql/.local/lib/python3.8/site-packages/mpmath'

Some disk space needs to be freed.