
[Artificial Intelligence] Training CNN

by isdawell 2022. 4. 26.

 

๐Ÿ“Œ ๊ต๋‚ด '์ธ๊ณต์ง€๋Šฅ' ์ˆ˜์—…์„ ํ†ตํ•ด ๊ณต๋ถ€ํ•œ ๋‚ด์šฉ์„ ์ •๋ฆฌํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค. 

 

 

1๏ธโƒฃ  ๋ณต์Šต


 

โ‘  FC backpropagation 

 

 

 

 

๐Ÿ‘‰ Make sure you understand where dz(L+1) slots into dy(L)!

 

๐Ÿ’จ dy(L) = (o - o_t) * f'(z_k) * W = dz(L+1) * W

๐Ÿ’จ dz(L+1) = (o - o_t) * f'(z_k) = dy(L+1) * f'(z_k)

๐Ÿ’จ Last layer: dy(L+1) = dC/dy(L+1) = d{1/2 * (o - o_t)^2} / dy(L+1) = d{1/2 * (o - o_t)^2} / do = (o - o_t), since y(L+1) = o is the network output and o_t is the target.

 

 

 

 

๐Ÿ‘ป by chain rule 

 

  • activation gradient : dL / dy(L) = local gradient * weight 
  • local gradient : dL / dz(L) 
  • weight gradient : dL / dW(L) = local gradient * input 

 

โญ activation gradient ๊ฐ€ ๋’ค ๋ ˆ์ด์–ด์—์„œ๋ถ€ํ„ฐ ์•ž ๋ ˆ์ด์–ด๋กœ ์ „๋‹ฌ๋˜๊ณ  ๊ทธ๋ฅผ ํ†ตํ•ด ๊ณ„์‚ฐ๋œ local gradient ๋ฅผ  ์ด์šฉํ•ด weight gradient ๋ฅผ ๊ตฌํ•˜๋Š” ๊ณผ์ • ์ดํ•ดํ•˜๊ธฐ 

 

โญ Transpose ๋กœ ์ฐจ์› ๋งž์ถฐ์ฃผ๋Š”๊ฑฐ ์žŠ์ง€ ์•Š๊ธฐ 

 

 

 

โ‘ก CONV forward in Python

 

๐Ÿ‘ป CONV operation basics (3D conv)

 

  • input channels = filter channels 
  • number of filters = output channels 

 

  • Formula for the output feature map size 
    • W2 = (W1 - F + 2P) / S + 1 
    • H2 = (H1 - F + 2P) / S + 1 

 

  • Max pooling output size 
    • (W1 - Ps) / S + 1
    • Ps is the pooling size 
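
Both size formulas can be wrapped in tiny helper functions for quick checks; conv_out_size and pool_out_size are hypothetical names used only for illustration (the example numbers happen to match AlexNet's first conv and pooling layers).

```python
def conv_out_size(w1, f, p, s):
    """Conv output width/height: (W1 - F + 2P) / S + 1."""
    return (w1 - f + 2 * p) // s + 1

def pool_out_size(w1, ps, s):
    """Max-pooling output width/height: (W1 - Ps) / S + 1."""
    return (w1 - ps) // s + 1

# e.g. a 227x227 input, 11x11 filter, stride 4, no padding -> 55
print(conv_out_size(227, 11, 0, 4))   # 55
print(pool_out_size(55, 3, 2))        # 27
```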

 

 

๐Ÿ‘ป AlexNet 

 

 

  • Convolution (feature extraction) : extracts the features corresponding to each filter from the image. 
  • Dense layer (classification) : performs linear classification on the distribution of the extracted features. 
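
A rough sketch of this two-stage view, assuming the convolution stage has already produced a stack of feature maps; the shapes, names, and single dense layer are toy choices, not the actual AlexNet configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# pretend output of the feature-extraction (conv/pool) stage: 8 feature maps of size 6x6
feature_maps = rng.normal(size=(8, 6, 6))

# classification stage: flatten, then one dense layer + softmax over 10 classes
x = feature_maps.reshape(-1)                 # (288,)
W = rng.normal(size=(10, x.size)) * 0.01
b = np.zeros(10)

logits = W @ x + b                           # linear classification on the extracted features
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.argmax())                        # predicted class
```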

 

 

๐Ÿ‘ป Implementing the CNN forward pass in Python

 

 

 

  • O : output , W : weight (= the element values of a filter) , I : input , S : stride , B : bias , K : filter (refers to a single filter)
  • Total cost : W2 * H2 * K * C * F * F 
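
A minimal nested-loop sketch of the CONV forward pass under the notation above (I, K, B, S), assuming no padding; the function name conv_forward is illustrative. The innermost multiply-accumulate runs W2 * H2 * K * C * F * F times, which is exactly the total cost noted above.

```python
import numpy as np

def conv_forward(I, K, B, S=1):
    """Naive CONV forward: I (C, H1, W1), K (num_K, C, F, F), B (num_K,)."""
    C, H1, W1 = I.shape
    num_K, _, F, _ = K.shape
    H2 = (H1 - F) // S + 1
    W2 = (W1 - F) // S + 1
    O = np.zeros((num_K, H2, W2))
    for k in range(num_K):                 # each filter produces one output channel
        for y in range(H2):
            for x in range(W2):
                # multiply-accumulate over the C x F x F receptive field
                patch = I[:, y * S:y * S + F, x * S:x * S + F]
                O[k, y, x] = np.sum(patch * K[k]) + B[k]
    return O

# usage: 3-channel 5x5 input, two 3x3 filters, stride 1 -> output shape (2, 3, 3)
I = np.arange(3 * 5 * 5, dtype=float).reshape(3, 5, 5)
K = np.ones((2, 3, 3, 3))
B = np.zeros(2)
print(conv_forward(I, K, B).shape)   # (2, 3, 3)
```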

 

 

 


 

2๏ธโƒฃ CNN backpropagation


 

โ‘  CNN backpropagation 

 

(1) Forward example with C=1, K=1, S=1, P=0 

 

 

๐Ÿ’จ Worked example : computing O(0,0) 

 

  • With x=0, y=0 and (p,q) → (0,0), (1,0), (0,1), (1,1), the computation gives i00*k00 + i10*k10 + i01*k01 + i11*k11. 
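
The same O(0,0) computation in code, assuming an illustrative 3x3 single-channel input i and 2x2 kernel k:

```python
import numpy as np

i = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
k = np.array([[1., 0.],
              [0., 1.]])

# O(0,0): sum over (p, q) in {0,1}x{0,1} of i[p][q] * k[p][q]
o00 = sum(i[p, q] * k[p, q] for p in range(2) for q in range(2))
print(o00)   # i00*k00 + i10*k10 + i01*k01 + i11*k11 = 1 + 0 + 0 + 5 = 6
```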

 

 

(2) backward

 

  • Similar to backpropagation through an FC layer (by the chain rule). 

 

 

๐Ÿ’จ The key is to find dO/dI and dO/dK!

 

 

 

(2)-a. Find dO/dK

 

โ—พ a. dO / dK

 

 

โ—พ b. dL / dK

 

 

โ—พ c. Substitute the result from step a

 

 

 

๐Ÿ˜Ž loss ์— ๋Œ€ํ•œ kernel (weight) gradient ๊ฒฐ๊ณผ๋Š”

input ๊ณผ activation gradient dL/dO ์˜ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ๊ณผ ๊ฐ™๋‹ค.
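
This can be checked numerically for the C=1, K=1, S=1, P=0 case: sliding dL/dO over the input is a valid convolution (correlation) of the input with dL/dO. A minimal sketch assuming a 4x4 input and a 2x2 kernel (so a 3x3 output); dL_dO and dL_dK are illustrative names.

```python
import numpy as np

I = np.arange(16, dtype=float).reshape(4, 4)   # 4x4 input
dL_dO = np.ones((3, 3))                        # upstream (activation) gradient for the 3x3 output

# dL/dK[p, q] = sum over output positions (y, x) of dL/dO[y, x] * I[y + p, x + q]
dL_dK = np.zeros((2, 2))
for p in range(2):
    for q in range(2):
        dL_dK[p, q] = np.sum(dL_dO * I[p:p + 3, q:q + 3])
print(dL_dK)
```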

 

 

 

 

 

 

(2)-b. Find dO/dI

 

โ—พ a. dO / dI

 

 

 

 

โ—พ b. dL / dI

 

 

โ—พ c. Substitute the result from step a : if you follow the computation along the stride, check that k survives only for the outputs that ikk actually influences!

 

 

๐Ÿ˜Ž loss ์— ๋Œ€ํ•œ input gradient ๊ฒฐ๊ณผ๋Š”

์›์†Œ ๊ฐ’์ด 180 ๋„ ๋ฐฉํ–ฅ์œผ๋กœ ์œ„์น˜ํ•œ  weight ์™€

activation gradient dL/dO ์˜ ํ•ฉ์„ฑ๊ณฑ ์—ฐ์‚ฐ๊ณผ ๊ฐ™๋‹ค.
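
And the matching numerical check for the input gradient: pad dL/dO with F-1 zeros on each side and correlate it with the kernel rotated by 180 degrees (a 'full' convolution). Again a sketch for the C=1, K=1, S=1, P=0 case with illustrative names.

```python
import numpy as np

K = np.array([[1., 2.],
              [3., 4.]])                 # 2x2 kernel
dL_dO = np.ones((3, 3))                  # upstream gradient for the 3x3 output of a 4x4 input

K_rot = K[::-1, ::-1]                    # kernel rotated by 180 degrees
padded = np.pad(dL_dO, 1)                # zero-pad by F - 1 = 1 on each side -> 5x5

# dL/dI[y, x] = sum over the 2x2 window of the padded dL/dO times the rotated kernel
dL_dI = np.zeros((4, 4))
for y in range(4):
    for x in range(4):
        dL_dI[y, x] = np.sum(padded[y:y + 2, x:x + 2] * K_rot)
print(dL_dI)
```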

 

 

 

(2)-c. summary 

 

Through backpropagation we can compute the weight gradient for every layer and then carry out the weight updates.

 

 

โ‘ก Maxpooling backpropagation 

 

(1) Backpropagation through a max operation over A and B 

 

 

  • dA : when A is larger than B, O = A, and differentiating with respect to A gives 1, so dA = dO.
  • dB : when B is larger than A, O = B, and differentiating with respect to B gives 1, so dB = dO. 
  • In other words, the output gradient is passed only to the value that was the max. 
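
A minimal sketch of this gating behaviour for scalar A and B with an upstream gradient dO (illustrative names; how ties are broken is a convention choice):

```python
def max_backward(A, B, dO):
    """Route the upstream gradient dO only to the input that was the max."""
    dA = dO if A > B else 0.0
    dB = dO if B > A else 0.0   # on a tie neither receives the gradient here (a convention choice)
    return dA, dB

print(max_backward(3.0, 1.0, 0.5))   # (0.5, 0.0): only A, the max, gets the gradient
```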

 

 

(2) Backpropagation through max pooling 

 

  • Apply exactly the same idea derived in (1). 

 

์ตœ๋Œ€๊ฐ’์ด ์•„๋‹Œ ๋ถ€๋ถ„์€ ๋ชจ๋‘ 0

 

  • forward ์—์„œ ์ตœ๋Œ€๊ฐ’์— ํ•ด๋‹น๋˜๋Š” ์›์†Œ์˜ ์œ„์น˜๋ฅผ ๊ธฐ์–ตํ•ด๋‘์—ˆ๋‹ค๊ฐ€ backward ์—์„œ ๊ธฐ์–ตํ•ด๋‘” ์œ„์น˜์—๋งŒ output gradient ๋ฅผ ์ „๋‹ฌํ•œ๋‹ค. 

 

 

 

โ‘ข batch gradient descent in CNN

 

1. ๋งค์šฐ ์ž‘์€ ์ˆซ์ž๋กœ weight ์™€ bias ์˜ ์ดˆ๊ธฐ๊ฐ’์„ ์„ค์ •ํ•œ๋‹ค.

2. For the โญ entire training samples : 
   a. Forward propagation : calculate the batch of output values 
   b. Compute Loss
   c. Backpropagate errors through network 
   d. Update weights and biases 

3. Repeat step 2 until the network is well trained. 
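
The procedure above as a runnable toy loop; a tiny linear model with a squared-error loss stands in for the CNN so that only the control flow of batch gradient descent is shown (all names and numbers are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                   # the entire training set (100 samples)
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1        # targets

w = rng.normal(size=3) * 0.01                   # step 1: very small initial weights
b = 0.0                                         # step 1: bias
lr = 0.1

for it in range(200):                           # step 3: repeat until trained
    out = X @ w + b                             # step 2a: forward over the ENTIRE training set
    loss = 0.5 * np.mean((out - Y) ** 2)        # step 2b: compute loss
    d_out = (out - Y) / len(X)                  # step 2c: backpropagate the error
    dw, db = X.T @ d_out, d_out.sum()
    w -= lr * dw                                # step 2d: one update per full pass
    b -= lr * db
```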

 

 

 

However, when training runs over the entire dataset,

each weight update requires a forward pass over every training sample,

so it takes a very long time. (๐Ÿ‘€ in practice it is not done this way)

 

 

 

โ‘ฃ Stochastic gradient descent in CNN

 

(1) SGD

 

1. ๋งค์šฐ ์ž‘์€ ์ˆซ์ž๋กœ weight ์™€ bias ์˜ ์ดˆ๊ธฐ๊ฐ’์„ ์„ค์ •ํ•œ๋‹ค.

2. For โญ a mini-batch of randomly chosen training samples : 
   a. Forward propagation : calculate the mini-batch of output values 
   b. Compute Loss 
   c. Backpropagate errors through network 
   d. Update weights and biases 

3. Repeat step 2 for one epoch, i.e. until the entire training set has been covered. 

4. Repeat steps 2 and 3 until the network is well trained. 

 

  • ๋žœ๋คํ•˜๊ฒŒ mini - batch ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์…‹์„ ์ถ”์ถœํ•œ๋‹ค. ์ด๋•Œ ์ด์ „ ๋‹จ๊ณ„์— ์‚ฌ์šฉํ•œ ๋ฐ์ดํ„ฐ๋Š” ์ œ์™ธํ•œ๋‹ค. 
  • ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ํฌ๊ธฐ ๋งŒํผ์˜ ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์™€ ๊ทธ๋งŒํผ์˜ ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ๋ฅผ ์ง„ํ–‰ํ•œ๋‹ค. 
  • ํ•œ ์—ํฌํฌ ๋‹น ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ๊ฐœ์ˆ˜๋งŒํผ์˜ ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค. 

 

 

 

 

(2) SGD vs gradient descent 

 

 

์ผ๋ฐ˜์ ์ธ gradient descent ๋Š” ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ ์ „์ฒด์— ๋Œ€ํ•ด forward ๋ฅผ ํ•˜๊ณ  loss function ์„ ๊ตฌํ–ˆ๋˜ ๋ฐ˜๋ฉด์— sgd ๋Š” ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ๋‹จ์œ„๋กœ ๊ฐ€์ค‘์น˜ ์—…๋ฐ์ดํŠธ๋ฅผ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

 

 

 

๐Ÿ‘€ Compared with GD, SGD...

 

  1. ํ›ˆ๋ จ ์†๋„๊ฐ€ ๋น ๋ฅด๋‹ค. 
  2. randomness ์— ์˜ํ•ด์„œ local min ์œผ๋กœ๋ถ€ํ„ฐ ๋น ์ ธ๋‚˜์˜ฌ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„ global min ์— ๋„๋‹ฌํ•  ์ˆ˜ ์žˆ์–ด ์ตœ์ข… ์„ฑ๋Šฅ์ด ์ข‹๊ฒŒ ๋‚˜์˜ค๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค. 
