YOLOv4: Optimal Speed and Accuracy of Object Detection

Created: May 24, 2022
Tags: Vision

Paper: YOLOv4: Optimal Speed and Accuracy of Object Detection
Authors: Alexey Bochkovskiy, Chien-Yao Wang, Hong-Yuan Mark Liao

Why this model

This is the YOLOv4 paper, the next entry in an ongoing series of paper reviews that started with YOLO v1. Each new YOLO version has introduced new techniques, and YOLOv4 changes a great deal by adopting many recent methods. This post introduces YOLOv4, which is both more accurate and faster than its predecessors.

0. Abstract

Many features are claimed to improve CNN accuracy. Some of them work only for particular problems, datasets, or models, whereas others, such as batch normalization and residual connections, apply to most models, tasks, and datasets.
YOLOv4 achieves state-of-the-art (SOTA) results by combining the following recent deep learning techniques:
1) WRC (Weighted-Residual-Connections)
2) CSP (Cross-Stage-Partial-Connections)
3) CmBN (Cross mini-Batch Normalization)
4) SAT (Self-Adversarial-Training)
5) Mish Activation
6) Mosaic Data Augmentation
7) DropBlock Regularization
8) CIoU Loss

1. Introduction

Most of the accurate modern neural networks do not run in real time, and they need many GPUs for training because of their large mini-batch sizes. YOLOv4 therefore aims at a real-time object detector that can be trained and used with a single conventional GPU, even in environments without much computing power.
notion image
The main goal of YOLOv4 is to build a fast object detector for production systems, optimized for parallel computation rather than for a low theoretical amount of computation. The authors state that, as Figure 1 shows, the model can be trained and tested on a single conventional GPU and delivers high-quality results in real time, which makes it easy to use.

Contributions of YOLOv4

  1. An efficient and powerful object detection model was developed. Anyone with a 1080 Ti or 2080 Ti GPU can train it and obtain fast and accurate results.
  2. The influence of the recent Bag-of-Freebies and Bag-of-Specials methods of object detection was verified during detector training. (Bag-of-Freebies and Bag-of-Specials are explained later in this post.)
  3. Recent methods such as CBN, PAN, and SAM were modified to make them suitable for training on a single GPU.


2. Related work

2.1. Object detection models

notion image
A modern detector consists of two parts, a backbone and a head. The backbone passes the input image through a CNN to produce feature maps, and typically uses a model pretrained on the ImageNet image classification dataset, such as ResNet, DenseNet, or VGGNet. The head predicts the bounding boxes and classes of objects, so this is where detection actually happens.
Heads fall into two categories: one-stage and two-stage object detectors. The R-CNN family is two-stage, while YOLO, SSD, and RetinaNet are one-stage. A one-stage detector solves localization (finding where an object is) and classification (identifying what it is) at the same time, whereas a two-stage detector solves them sequentially. More recently, anchor-free one-stage detectors such as CenterNet and CornerNet have appeared.

Recent detectors also insert layers between the backbone and the head to collect feature maps from different stages; this part is called the neck. It is usually composed of several bottom-up and top-down paths, and FPN, BiFPN, and PAN belong to this category.
notion image

2.2. Bag of freebies

Bag of Freebies (BoF) refers to methods that change only the training strategy or increase only the training cost, without increasing the inference cost. Data augmentation, loss functions, and regularization belong here.

  1. Data augmentation
👉 Increasing the variability of the input images makes the model more robust to images obtained from different environments.
Examples: photometric distortions (brightness, contrast, hue, saturation, added noise) and geometric distortions (scaling, crop, flip, rotate).
Types
  • Random erase
  • CutOut
  • MixUp
  • CutMix
  • Style transfer GAN
Some researchers focus their augmentation methods on simulating occlusion (objects being partially hidden), and report good results in both image classification and object detection.
Random erase selects a box-shaped region of the image and fills it with random values in the range 0 to 255, CutOut fills the region with 0, MixUp alpha-blends two images together with their labels, and CutMix combines the ideas of CutOut and MixUp. Style-transfer GANs are also used for augmentation.
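
As a rough illustration of the region-based and blending augmentations described above, the following minimal sketch implements CutOut, MixUp, and CutMix on tiny grayscale images stored as nested lists. The function names, the fixed patch position, and the area-based label weighting in `cutmix` are assumptions made for this example; they are not code from the paper.

```python
def cutout(image, top, left, size, fill=0):
    """Return a copy of `image` with a size x size square set to `fill`."""
    out = [row[:] for row in image]
    for r in range(top, min(top + size, len(out))):
        for c in range(left, min(left + size, len(out[0]))):
            out[r][c] = fill
    return out


def mixup(image_a, label_a, image_b, label_b, lam):
    """Alpha-blend two images and their one-hot labels with weight `lam`."""
    image = [[lam * a + (1 - lam) * b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(image_a, image_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label


def cutmix(image_a, label_a, image_b, label_b, top, left, size):
    """Paste a square patch of `image_b` into a copy of `image_a`.

    Labels are mixed in proportion to the pasted area (assumed rule).
    """
    out = [row[:] for row in image_a]
    height, width = len(out), len(out[0])
    pasted = 0
    for r in range(top, min(top + size, height)):
        for c in range(left, min(left + size, width)):
            out[r][c] = image_b[r][c]
            pasted += 1
    lam = 1 - pasted / (height * width)  # share of image_a that survives
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return out, label
```

Random erase would differ from `cutout` only in filling the square with random values between 0 and 255 instead of a constant.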

  2. Imbalanced sampling
👉 This addresses bias in the semantic distribution within a dataset.
  • There is a data imbalance between different classes.
    • two-stage detectors: solved with hard negative example mining or online hard example mining
    • one-stage detectors: example mining is not applicable because they use a dense prediction architecture
      → focal loss, proposed by Lin et al., is applied instead
  • A one-hot hard representation cannot easily express the degree of association between different categories.
    • Label smoothing converts hard labels into soft labels during training, which makes the model more robust.

What is focal loss?
Focal loss can be seen as an extension of the cross-entropy loss that down-weights easy examples and focuses training on hard negative examples.
Put simply, it gives larger weight to cases that are hard or easily misclassified (for example, objects that are only partially visible, or the actual objects that have to be classified).
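
To make the down-weighting concrete, here is a minimal sketch of the focal loss for a single example, written in terms of the probability `p_true` that the model assigns to the correct class. The default `gamma=2.0` and the neutral `alpha=1.0` are illustrative choices for this example, not values taken from this post.

```python
import math


def cross_entropy(p_true):
    """Standard cross-entropy for the probability given to the true class."""
    return -math.log(p_true)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Cross-entropy scaled by (1 - p_true) ** gamma.

    Well-classified examples (p_true near 1) contribute almost nothing,
    so training concentrates on hard examples.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)
```

With `gamma=2`, an easy example with `p_true = 0.9` keeps only about 1% of its cross-entropy loss, while a hard example with `p_true = 0.1` keeps about 81%. Setting `gamma=0` recovers plain cross-entropy.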
What is label smoothing?
Label smoothing turns a hard label (a one-hot encoded vector with 1 at the correct index and 0 elsewhere) into a soft label whose entries lie between 0 and 1.

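The conversion from hard to soft labels can be sketched in a few lines. This example assumes the common uniform variant, in which a fraction `epsilon` of the probability mass is spread evenly over all classes; the function name and the default `epsilon=0.1` are assumptions for illustration.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot label with the uniform distribution over its classes."""
    num_classes = len(one_hot)
    return [(1.0 - epsilon) * y + epsilon / num_classes for y in one_hot]
```

For a 4-class label `[0, 1, 0, 0]` and `epsilon=0.1`, the true class ends up with 0.925 and each other class with 0.025, so the entries still sum to 1.
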
  3. Bounding Box (BBox) regression
[Existing methods]
  • Traditional object detectors usually use Mean Square Error (MSE) to regress directly on the center point coordinates plus height and width of the BBox (e.g. {x_center, y_center, w, h}), or on the upper left and lower right points (e.g. {x_top_left, y_top_left, x_bottom_right, y_bottom_right}).
  • Anchor-based methods:
    • estimate the offset corresponding to each coordinate (e.g. {x_center_offset, y_center_offset, w_offset, h_offset} or {x_top_left_offset, y_top_left_offset, x_bottom_right_offset, y_bottom_right_offset}).
However, directly estimating the coordinate values of each BBox point treats them as independent variables and does not consider the integrity of the object itself.
To address this, the loss function was changed.

[Improved approaches]
  • IoU loss
    • Considers the coverage of the predicted BBox area and the ground truth BBox area.
      Computing the IoU against the ground truth involves all four coordinates of the BBox at once, so they are treated as a single unit rather than as four separate targets.
      → Because IoU is a scale-invariant representation, it avoids the problem that the l_1 or l_2 loss on {x, y, w, h} grows with the scale of the object.
  • GIoU loss
    • Includes the shape and orientation of the object in addition to the coverage area.
    • Finds the smallest BBox that covers both the predicted BBox and the ground truth BBox, and uses this box as the denominator in place of the denominator originally used in the IoU loss.
  • DIoU loss: additionally considers the distance between the centers of the objects.
  • CIoU loss
    • Considers the overlapping area, the distance between center points, and the aspect ratio at the same time.
    • Achieves faster convergence and better accuracy on the BBox regression problem.
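
The four overlap measures listed above can be compared side by side with a small sketch. It assumes boxes are given as (x1, y1, x2, y2) tuples with positive width and height, and it follows the commonly published definitions of GIoU, DIoU, and CIoU; the helper names are made up for this example. Each corresponding loss is 1 minus the returned value.

```python
import math


def _intersection_and_union(a, b):
    """Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter, area_a + area_b - inter


def iou(a, b):
    inter, union = _intersection_and_union(a, b)
    return inter / union


def giou(a, b):
    """IoU minus the share of the enclosing box not covered by the union."""
    inter, union = _intersection_and_union(a, b)
    enclose = ((max(a[2], b[2]) - min(a[0], b[0]))
               * (max(a[3], b[3]) - min(a[1], b[1])))
    return inter / union - (enclose - union) / enclose


def diou(a, b):
    """IoU minus the squared center distance over the enclosing diagonal."""
    center_dist2 = (((a[0] + a[2]) - (b[0] + b[2])) / 2) ** 2 \
        + (((a[1] + a[3]) - (b[1] + b[3])) / 2) ** 2
    diag2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return iou(a, b) - center_dist2 / diag2


def ciou(pred, gt):
    """DIoU with an additional aspect-ratio consistency penalty."""
    v = (4 / math.pi ** 2) * (
        math.atan((gt[2] - gt[0]) / (gt[3] - gt[1]))
        - math.atan((pred[2] - pred[0]) / (pred[3] - pred[1]))
    ) ** 2
    overlap = iou(pred, gt)
    # Guard against 0/0 when the boxes are identical (v == 0, overlap == 1).
    alpha = v / ((1 - overlap) + v) if v > 0 else 0.0
    return diou(pred, gt) - alpha * v
```

For two non-overlapping boxes the plain IoU is 0 regardless of how far apart they are, whereas GIoU, DIoU, and CIoU become more negative as the boxes move apart, so they still provide a useful training signal in that case.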

  4. Regularization
  • DropOut
  • DropPath
  • Spatial DropOut
  • DropBlock


2.3. Bag of specials

Bag of Specials (BoS) refers to methods that increase the inference cost slightly but improve the accuracy of object detection significantly.
Plugin modules and post-processing methods belong here. Plugin modules enhance a particular attribute of the model, such as enlarging the receptive field, introducing an attention mechanism, or strengthening feature integration capability. Post-processing is a method for screening the prediction results of the model.


1) Plugin modules

1. Receptive field enhancement modules
  • SPP
    • Originates from Spatial Pyramid Matching (SPM).
    • SPM splits the feature map into several d x d equal blocks, with d in {1, 2, 3, ...}, forming a spatial pyramid, and then extracts bag-of-word features.
      • (It is not clear to me why bag-of-word features appear here.)
    • In YOLOv3 [63], the SPP module was improved by concatenating the outputs of max-pooling with kernel size k x k (k = {1, 5, 9, 13}) and stride 1.
    • Under this design, a relatively large k x k max-pooling effectively increases the receptive field of the backbone feature.
    • YOLOv3-608 with the improved SPP module: on the MS COCO dataset, AP_50 improves by 2.7% at the cost of 0.5% extra computation.
    • SPP background
      Earlier CNN architectures all required a fixed input size (e.g. 224 x 224), so an image had to be cropped or warped to that size before it could pass through the network. Cropping can cut off part of an object, and warping distorts its original appearance. This raised the question of whether a CNN could be trained regardless of the size or aspect ratio of the input image. The idea was to pass the image through the conv layers at any size and then, before the FC layers, apply a pooling step that brings the feature maps to the same size.
      This is how spatial pyramid pooling came about: it extracts a fixed-size feature vector from CNN feature maps of different sizes.
      SPPNet reference: https://yeomko.tistory.com/14
  • ASPP
  • RFB
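
The YOLO-style SPP block described above (stride-1 max-pooling with several kernel sizes, outputs concatenated) can be sketched on a single-channel feature map stored as a nested list. This is a simplified illustration: out-of-range positions are simply ignored, which is assumed here to stand in for padding that never wins the maximum, and the function names are hypothetical.

```python
def max_pool_same(fmap, k):
    """k x k max pooling with stride 1 that preserves the spatial size."""
    height, width = len(fmap), len(fmap[0])
    radius = k // 2
    return [
        [
            max(
                fmap[i][j]
                for i in range(max(0, y - radius), min(height, y + radius + 1))
                for j in range(max(0, x - radius), min(width, x + radius + 1))
            )
            for x in range(width)
        ]
        for y in range(height)
    ]


def spp_block(fmap, kernel_sizes=(1, 5, 9, 13)):
    """Return one pooled map per kernel size, i.e. the 'concatenated' channels."""
    return [max_pool_same(fmap, k) for k in kernel_sizes]
```

Because every branch keeps the input resolution, the branches can be stacked along the channel axis; a single strong activation then influences a 13 x 13 neighbourhood in the largest branch, which is how the block widens the receptive field at low cost.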

2. Attention modules
The attention modules commonly used in object detection are divided into channel-wise attention and point-wise attention; the representatives of these two attention models are Squeeze-and-Excitation (SE) and the Spatial Attention Module (SAM), respectively.

3. Feature integration
  • Early practice used skip connections or hyper-columns to integrate low-level physical features with high-level semantic features.
  • Since multi-scale prediction methods such as FPN became popular, many lightweight modules that integrate different feature pyramids have been proposed:
    • SFAM
    • ASFF
    • BiFPN

4. Activation functions
A good activation function lets the gradient propagate more efficiently without adding much extra computational cost.
  • LReLU
  • PReLU
  • ReLU6
  • Scaled Exponential Linear Unit (SELU)
  • Swish
  • hard-Swish
  • Mish

* Swish and Mish are worth noting because both are continuously differentiable activation functions.
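
For reference, several of the activation functions listed above can be written as scalar functions in a few lines. This is a minimal sketch using their usual textbook definitions; the `slope=0.01` default for leaky ReLU is an assumed illustrative value, and the naive `exp` calls are only suitable for moderate input magnitudes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def leaky_relu(x, slope=0.01):
    """LReLU: a small fixed slope for negative inputs."""
    return x if x >= 0 else slope * x


def relu6(x):
    """ReLU clipped at 6."""
    return min(max(0.0, x), 6.0)


def swish(x):
    """x * sigmoid(x): smooth and non-monotonic near zero."""
    return x * sigmoid(x)


def softplus(x):
    return math.log1p(math.exp(x))


def mish(x):
    """x * tanh(softplus(x)): smooth, slightly negative for x < 0."""
    return x * math.tanh(softplus(x))
```

Unlike the ReLU variants, `swish` and `mish` have no kink at zero, which is the continuous differentiability noted above.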

2) Post-processing

NMS, which is commonly used in deep-learning-based object detection, belongs to this category. NMS filters out BBoxes that poorly predict the same object and keeps only the candidate BBoxes with a higher response. The way NMS tries to improve is consistent with optimizing an objective function.

NMS as used in R-CNN [19]
  • The originally proposed NMS does not consider context information.
  • R-CNN adds the classification confidence score as a reference and performs greedy NMS in order of confidence score, from high to low.
Work after R-CNN
  • soft NMS [1]: considers the problem that object occlusion may degrade the confidence score in greedy NMS with IoU score.
  • DIoU NMS [99]: builds on soft NMS and adds center point distance information to the BBox screening process.

Because none of these post-processing methods directly refer to the captured image features, post-processing is no longer required in the subsequent development of anchor-free methods.
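
The greedy NMS procedure described above (visit boxes from the highest to the lowest confidence score and drop those that overlap an already kept box too much) can be sketched as follows. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the `iou_threshold=0.5` default is an illustrative choice rather than a value from this post.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any kept box too strongly.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Soft NMS would lower the score of an overlapping box instead of discarding it, and DIoU NMS would replace the plain IoU in the comparison with a measure that also accounts for center distance.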

3. Methodology

The basic aim of YOLOv4 is not to lower the theoretical computation indicator (BFLOP), but to make the network run fast in production systems and to optimize it for parallel computation. To this end, the following models are considered for GPU and VPU.
  • For GPU
    • Models with a small number of groups (1 to 8) in their convolutional layers, such as CSPResNeXt50 and CSPDarknet53.
  • For VPU
    • Models that use grouped convolution but not Squeeze-and-Excitation (SE) blocks, including EfficientNet-lite, MixNet, GhostNet, and MobileNetV3.


3.1 Selection of architecture

The architecture selection has the following goals.
  1. Find the optimal balance among the input network resolution, the number of convolutional layers, the number of parameters (filter_size^2 * filters * channels / groups), and the number of layer outputs (filters).
    1. ex) CSPResNeXt50 is better than CSPDarknet53 for classification, whereas CSPDarknet53 is better for object detection, so a reference model that is optimal for classification is not always optimal for a detector.
  2. Select additional blocks that increase the receptive field, and the best method of parameter aggregation from different backbone levels for different detector levels.
    1. ex) FPN, PAN, ASFF, BiFPN
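
The parameter-count expression quoted in the first goal is easy to check with a worked example. The sketch below assumes a square filter, ignores bias terms, and assumes the channel count is divisible by the number of groups; the function name is hypothetical.

```python
def conv_params(filter_size, filters, channels, groups=1):
    """Weights in one conv layer: filter_size^2 * filters * channels / groups."""
    return filter_size ** 2 * filters * channels // groups
```

For example, a 3 x 3 convolution from 32 input channels to 64 filters has 9 * 64 * 32 = 18432 weights, and using 32 groups reduces this to 576, which illustrates why grouped convolutions are attractive when the parameter budget is tight.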
Unlike a classifier, a detector requires the following:
  • Higher input resolution → to detect multiple small objects.
  • More layers → for a larger receptive field that covers the increased input size.
  • More parameters → for greater capacity to detect multiple objects of different sizes in a single image.

notion image
Assuming that a model with a larger receptive field size and more parameters is the better backbone, the table above and numerous experiments indicate that CSPDarknet53 is the best choice among the three backbones compared.

The influence of receptive fields of different sizes is as follows.
  • Up to the object size: the entire object can be viewed.
  • Up to the network size: the context around the object can be viewed.
  • Exceeding the network size: the number of connections between an image point and the final activation increases.

3.2 Selection of BoF and BoS

BoF and BoS must be selected to improve detector training. Among the activation functions mentioned above, PReLU and SELU are harder to train and ReLU6 is designed specifically for quantized networks, so the authors removed them from the candidate list.
For regularization, DropBlock had already been shown to outperform many other methods, so the authors did not hesitate to choose it.

3.3 Additional improvements

The following additional design choices make the detector more suitable for training on a single GPU.
  • New data augmentation methods: Mosaic and Self-Adversarial Training (SAT)
    • 1) Mosaic
      Whereas CutMix mixes only 2 input images, Mosaic mixes 4 training images into one, as shown below.
      → Objects can be detected outside their normal context.
      → Batch normalization computes activation statistics from 4 different images on each layer, which reduces the need for a large mini-batch size.
      notion image
    • 2) Self-Adversarial Training (SAT)
      A new data augmentation technique that operates in 2 forward-backward stages.
      Stage 1: the network alters the original image instead of its own weights.
      → In effect, it performs an adversarial attack on itself, creating the deception that there is no object in the image.
      Stage 2: the network is trained to detect objects on this modified image in the normal way.
      ⇒ Through these two stages, the data is augmented during the propagation process itself.

  • A genetic algorithm (GA) is used to select optimal hyperparameters.
  • Modified versions of existing methods are introduced so that they can be used efficiently: modified SAM, modified PAN, and CmBN (Cross mini-Batch Normalization).

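The Mosaic idea described above, mixing 4 training images into one, can be sketched in a deliberately simplified form: four equally sized images are tiled into a 2 x 2 grid and their boxes are shifted accordingly. Real implementations typically also pick a random split point and rescale or crop the tiles; those steps, as well as the function name and the (x1, y1, x2, y2) box format, are assumptions left out of or introduced for this example.

```python
def mosaic(images, boxes_per_image):
    """Tile four H x W images into one 2H x 2W image and shift their boxes.

    images: list of four 2-D lists of equal size.
    boxes_per_image: for each image, a list of (x1, y1, x2, y2) boxes.
    """
    height, width = len(images[0]), len(images[0][0])
    canvas = [[0] * (2 * width) for _ in range(2 * height)]
    # (row offset, column offset) of each tile: TL, TR, BL, BR.
    offsets = [(0, 0), (0, width), (height, 0), (height, width)]
    merged_boxes = []
    for image, boxes, (dy, dx) in zip(images, boxes_per_image, offsets):
        for r in range(height):
            canvas[dy + r][dx:dx + width] = image[r]
        for x1, y1, x2, y2 in boxes:
            merged_boxes.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy))
    return canvas, merged_boxes
```

Each resulting training sample contains objects from four different scenes, which is why a single mosaic image already exposes batch normalization to the statistics of four images.
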
BoF: Cross mini-Batch Normalization (CmBN)
CmBN is a modified version of CBN that collects statistics only between the mini-batches within a single batch.
notion image
CBN reference: https://deep-learning-study.tistory.com/635

BoS: Modified SAM & PAN
notion image

3.4 YOLOv4

YOLOv4 consists of:
  • Backbone: CSPDarknet53
  • Neck
    • Additional blocks: SPP
    • Path-aggregation blocks: PANet
  • Head: YOLOv3
notion image
Adding an SPP block on top of CSPDarknet53
  • significantly increases the receptive field,
  • separates out the most significant context features, and
  • causes almost no reduction in the network operation speed.
PANet is used instead of the FPN in YOLOv3 as the method of parameter aggregation from different backbone levels for different detector levels.


4. Experiments

Influence of different features on Classifier training, Detector training

notion image
notion image

Influence of different backbones and pretrained weightings on Detector training

notion image

Influence of different mini-batch size on Detector training

notion image

5. Results

notion image
In Figure 8 above, YOLOv4 is both faster and more accurate than the other SOTA detectors it is compared with.

6. Conclusions

The authors offer a state-of-the-art detector that is faster (FPS) and more accurate (MS COCO AP50...95 and AP50) than the other available detectors. It can be trained on a single conventional GPU with 8-16 GB of VRAM, which makes it broadly usable. Many features were added to improve the accuracy of both the classifier and the detector. These features can serve as best practice for future research and development.