Zero-Shot Text-to-Image Generation

Created: Apr 20, 2022
Tags: Multimodal
📄 Paper: Zero-Shot Text-to-Image Generation · Authors: Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever

Why I Chose This Paper

notion image
Research that understands two modalities jointly, as CLIP does, is very active. Image captioning — describing a given image in language — is likewise one of the longer-established research areas. Going the other direction, however, from language to vision — that is, generating an image from given text — is far harder than either of those tasks. Because this paper shows that images can be generated naturally from language, I judged it to be a major contribution and chose it for review.

Introduction

Zero-Shot Image Generation
Drawing an object or scene you have never seen is not easy, yet artists readily paint scenes from myths handed down only by word of mouth. DALL-E shows that, given enough data and a large enough model, a zero-shot approach is possible: the model can generate novel images it never encountered during training.
Two-Stage Approach
By tokenizing the visual input, DALL-E generates high-quality images while handling image data in the same stream as language. In addition, whereas prior work was largely confined to datasets such as MS-COCO, many recent studies have demonstrated the potential of large-scale generative models; it is therefore meaningful that the authors experimented with how scaling both the dataset and the model size affects performance.
DALL-E is trained with a 12B-parameter model on 250M (text, image) pairs.

DALL-E

DALL-E starts from the idea of treating text and images as a single stream of data. The paper therefore first explains how images are vector-quantized into a sequence.

1. Stage-1

dVAE (discrete Variational AutoEncoder)
A dVAE compresses a 256x256 image into a 32x32 grid of image tokens. DALL-E uses a codebook of 8192 tokens.
notion image
As the figure above shows, the goal of Stage-1 is to obtain q(z|x) with the dVAE. Using this structure greatly reduces the context size the transformer must handle.
One issue is that with the conventional VQ approach, as the codebook grows, the cost of comparing each encoder vector against every codebook vector grows with it. It is therefore preferable to use a continuous relaxation in which the encoder directly predicts codebook indices.
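As a minimal sketch (not the paper's code) of the conventional VQ lookup being discussed: each encoder output vector is compared against every codebook entry, so the lookup cost scales with the codebook size K (8192 in DALL-E).

```python
import numpy as np

def vq_lookup(z_e, codebook):
    """z_e: (N, D) encoder vectors; codebook: (K, D). Returns nearest-code indices (N,)."""
    # Expanded squared-distance formula avoids materializing an (N, K, D) tensor.
    d = (z_e ** 2).sum(1)[:, None] - 2.0 * z_e @ codebook.T + (codebook ** 2).sum(1)[None, :]
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
z_e = rng.normal(size=(32 * 32, 64))    # one 32x32 grid of 64-dim encoder outputs (toy dims)
codebook = rng.normal(size=(8192, 64))  # DALL-E uses a codebook of 8192 entries
idx = vq_lookup(z_e, codebook)
print(idx.shape)  # one code index per spatial position
```

Every position requires N·K distance computations, which is the cost the continuous relaxation below sidesteps by predicting indices directly.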
continuous relaxation
The Gumbel-softmax paper proved that if Gumbel noise is added to the logits and a temperature scale τ is applied, the result behaves like an argmax as τ becomes infinitely small.
notion image
Taking the Gumbel softmax of the encoder's outputs to obtain the code index is far simpler.

2. Stage-2

Stage-2 ์—์„œ๋Š” 256 BPE token ๊ณผ ์œ„์—์„œ ์ธ์ฝ”๋”ฉํ•œ 32x32 = 1024 ์ด๋ฏธ์ง€ ํ† ํฐ์„ concat ํ•˜์—ฌ ์ด 1280 tokens ๋ฅผ ๋งŒ๋“ ๋‹ค. ์ด๋ ‡๊ฒŒ ๋งŒ๋“ค์–ด์ง„ text์™€ image ํ† ํฐ์„ autoregressive transformer์— ๋„ฃ์–ด image ์™€ text์˜ joint distribution ์˜ log-likelihood ๋ฅผ maximize ํ•˜๋Š”๋ฐ ๋ชฉ์ ์ด ์žˆ๋‹ค.
  • image: x
  • text: y
  • image tokens: z
  • q_φ(z | x): the dVAE encoder's distribution over image tokens (φ is the dVAE encoder)
  • p_θ(x | y, z): the dVAE decoder's distribution over images given the image tokens
  • p_ψ(y, z): the transformer's joint distribution over the (text, image) tokens
๋‹ค์Œ์—์„œ joint distribution (text, image) ๋Š”
์ฆ‰, x,y,z(text,image,token)์˜ joint distribution ์€ y,z ๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ transformer ์•„์›ƒํ’‹์œผ๋กœ ๋‚˜์˜จ ์ด๋ฏธ์ง€ distribution x ์˜ decoding ๋œ ๊ฒฐ๊ณผ์™€ text distribution ์˜ ๊ณฑ์œผ๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค.
  1. ์ฒซ๋ฒˆ์งธ ํ…€์€ dVAE ๋””์ฝ”๋”๊ฐ€ ๋ณต์›ํ•˜๋Š” distbribution์„ maximize ํ•˜๋Š” ํ…€์ด๋‹ค.
  1. ๋‘๋ฒˆ์งธ ํ…€์€ dVAE ์ธ์ฝ”๋”๊ฐ€ ํ† ํฐ์„ ์˜ˆ์ธกํ•˜๋Š” ํ…€์ด๋‹ค.
  1. ์„ธ๋ฒˆ์งธ ํ…€์€ Autoregressive Transformer์˜ Text, Image Joint distribution์„ Maximizeํ•˜๋Š” ํ…€์ด๋‹ค.
์ด๋ฅผ ๋ชจ๋‘ Maximize ํ•˜๋ฉด x,y (ํ…์ŠคํŠธ,์ด๋ฏธ์ง€)์— ๋Œ€ํ•œ distribution ์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค.
Transformer input
The input to the transformer is structured as follows.
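As a toy illustration of that layout (token ids below are random placeholders, not real BPE or dVAE codes), the Stage-2 sequence is simply the text tokens followed by the image tokens:

```python
import numpy as np

rng = np.random.default_rng(0)
# Up to 256 BPE text tokens (the paper uses a 16,384-entry text vocabulary)...
text_tokens = rng.integers(0, 16384, size=256)
# ...concatenated with the 32x32 = 1024 dVAE image tokens (codebook of 8192).
image_tokens = rng.integers(0, 8192, size=32 * 32)
sequence = np.concatenate([text_tokens, image_tokens])
print(sequence.shape)  # 256 + 1024 = 1280 tokens in total
```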
Masked attention on Vision
notion image
image๊ฐ™์€ ๊ฒฝ์šฐ๋Š” ๊ทผ์ฒ˜์˜ ํ”ฝ์…€์— ์˜ํ–ฅ์„ ๋ฐ›๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ๋‹ค. transformer์˜ masking์„ ์กฐ์ ˆํ•˜์—ฌ row-masking, colume masking, conv maksing์„ layer ๋งˆ๋‹ค ๋‹ค๋ฅด๊ฒŒ ์‚ฌ์šฉํ•˜์˜€๋‹ค๊ณ  ํ•œ๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด row-masking์€
notion image
์ด๋ฏธ์ง€์˜ ํŠน์ • token ์€ ์•ž์˜ ๋ชจ๋“  text๋ฅผ ์ฐธ์กฐํ•˜์ง€๋งŒ ํ•ด๋‹น row์˜ ์ „๋‹จ๊ณ„ ํ† ํฐ๋งŒ ์ฐธ์กฐํ•˜๊ฒŒ ๋˜์–ด์žˆ๋‹ค.

3. Data Collection

As a proof of concept, the authors first experimented on relatively little data:
Conceptual Captions: experiments were first run on its 3.3 million text-image pairs.
Afterwards, 250 million text-image pairs were collected from the internet, a scale similar to the JFT dataset.

4. Training Big Model

Various techniques for training a huge model were applied, including mixed precision, PowerSGD, and distributed training.
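To make the PowerSGD part concrete, here is a hedged NumPy sketch of its core idea (one power-iteration step of low-rank gradient compression; the real implementation adds error feedback and warm starts): a gradient matrix M is approximated by a rank-r product P @ Q.T, so workers only need to communicate P and Q instead of M.

```python
import numpy as np

def powersgd_round(M, Q):
    """One PowerSGD compression round.
    M: (n, m) gradient matrix; Q: (m, r) factor reused from the previous step."""
    P = M @ Q                    # (n, r) projection of the gradient
    P, _ = np.linalg.qr(P)       # orthonormalize the columns of P
    Q_new = M.T @ P              # (m, r) updated second factor
    return P, Q_new              # decompressed gradient ~ P @ Q_new.T

rng = np.random.default_rng(0)
M = rng.normal(size=(256, 128))          # stand-in gradient matrix
Q = rng.normal(size=(128, 8))            # rank-8 compression
P, Q = powersgd_round(M, Q)
approx = P @ Q.T
print(P.size + Q.size, "floats communicated instead of", M.size)
```

The approximation is lossy, which is why error feedback (accumulating what was lost into the next step's gradient) matters in practice.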

Experiments

1. MS-COCO and CUB

notion image
๊ธฐ์กด GAN์ด๋‚˜ ๋‹ค๋ฅธ ๋ฐฉ์‹๋“ค๋ณด๋‹ค FID๋‚˜ IS ๊ฐ€ ๋†’๋‹ค. FID์™€ IS ๋Š” ๋ณดํ†ต Perceptual metric, ์ฆ‰ ์–ผ๋งˆ๋‚˜ ์ง„์งœ ๊ฐ™์€์ง€์— ๋Œ€ํ•œ Score ๋กœ์„œ ๊ธฐ์กด GAN๋ณด๋‹ค ํ›จ์”ฌ ์žˆ์„ ๋ฒ•ํ•œ ์ด๋ฏธ์ง€๋ฅผ ์ƒ์„ฑํ•œ๋‹ค๊ณ  ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋‹ค.

2. Reranking

notion image
k ์žฅ์„ sample ํ•œ ํ›„ CLIP score ๋กœ ranking์„ ๋งค๊ฒจ์„œ ๊ฐ€์žฅ ๋†’์€ ๊ฒƒ์„ ์‚ฌ์šฉํ•˜๋ฉด ํ›จ์”ฌ ๋” ๊ทธ๋Ÿด์‹ธํ•œ ์ด๋ฏธ์ง€๊ฐ€ ์ƒ์„ฑ๋œ๋‹ค. ์‚ฌ์‹ค ์ด๋Ÿฌํ•œ ํ›„์ฒ˜๋ฆฌ ๊ณผ์ •์€ ๋งค์šฐ Resource-Heavy ํ•œ ๋ฐฉ๋ฒ•์ด๋ผ๊ณ  ์ƒ๊ฐ๋œ๋‹ค.

3. Qualitative

notion image
What is interesting is not only that the model generates previously unseen objects such as an "armchair in the shape of an avocado",
but also that, as in the figure above, when prompted with "the exact same cat on the top as a sketch on the bottom" and given only the top half of the image, it actually draws the corresponding sketch below.
This suggests that it has developed a rudimentary ability to compose unusual concepts at high levels of abstraction.
If the model had simply memorized its training data, it could not generate such novel outputs. That it can infer the meaning of prompts it never saw during training — including images unlikely to exist in reality — and render them well means the model has learned basic concepts through a high degree of abstraction.

Conclusion

The paper presents DALL-E, a model that surpasses prior text-to-image generation not only qualitatively but also in generalization. In other words, it suggests that large models paired with large datasets can be the breakthrough that pushes model performance forward.
