
[cs224n] Lecture 8 Notes

by isdawell 2022. 5. 9.

๐Ÿ’ก Topic: Seq2Seq, Attention, Machine Translation

 

๐Ÿ“Œ Key points

  • Task: machine translation 
  • Seq2Seq 
  • Attention 

 

๊ธฐ๊ณ„๋ฒˆ์—ญ์€ ๋Œ€ํ‘œ์ ์ธ Seq2Seq ํ˜•ํƒœ์˜ ํ™œ์šฉ ์˜ˆ์ œ ์ค‘ ํ•˜๋‚˜์ด๊ณ , attention ์ด๋ผ๋Š” ๋ฐฉ๋ฒ•๋ก ์„ ํ†ตํ•ด ์„ฑ๋Šฅ์ด ๊ฐœ์„ ๋˜์—ˆ๋‹ค.

 

 

 

 

 

1๏ธโƒฃ  Machine Translation


 

1. Machine translation

 

 

 

โœ” ์ •์˜

 

  • The task of translating an input sentence in a source language into a target language.

 

 

โœ” History 

 

โžฐ 1950s: The early history of MT 

 

  • Development began for military purposes, such as translating Russian into English. 
  • Rule-based translation systems used a simple approach of substituting words with the same meaning. 

 

โžฐ 1990s–2010s: Statistical Machine Translation (SMT)

 

  • Core idea: learn a probabilistic model from parallel data ๐Ÿ‘‰ build a probabilistic model from parallel data in two different languages. 
  • For an input sentence x, find the translation y that maximizes the conditional probability. 

 

Given a French sentence x, find the English sentence y that maximizes the probability

 

  • By Bayes rule ๐Ÿ’จ Translation model + Language model 
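Written out (x = source sentence, y = candidate translation), the decomposition is:

$$ \arg\max_y P(y \mid x) \;=\; \arg\max_y \underbrace{P(x \mid y)}_{\text{translation model}}\;\underbrace{P(y)}_{\text{language model}} $$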

 

 

0. Parallel data 

  • Data pairs consisting of a sentence to be translated and its corresponding translation: pairs of translated sentences.
  • For example, data made up of French–English sentence pairs translated by humans. 

 

 

1. Translation model 

  • ๊ฐ ๋‹จ์–ด์™€ ๊ตฌ๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ฒˆ์—ญ๋˜์–ด์•ผ ํ• ์ง€๋ฅผ ์„œ๋กœ ๋‹ค๋ฅธ ๋‘ ์–ธ์–ด๋กœ ๊ตฌ์„ฑ๋œ parallel data ๋กœ๋ถ€ํ„ฐ ํ•™์Šตํ•œ๋‹ค. 
  • Fidelity ๋ฒˆ์—ญ ์ •ํ™•๋„ 

 

2. Language model 

  • Models how to produce a good translated sentence. 
  • Trained on monolingual data in the target language so that it generates the most plausible target sentence. 
  • Fluency: producing natural sentences. 
  • e.g. RNN, LSTM, GRU 

 

 

โžฐ 2014–: Neural Machine Translation (NMT)

 

  • A single end-to-end neural network model. 
  • A sequence-to-sequence model made up of two RNNs ๐Ÿ‘‰ Encoder–Decoder. 
  • Because it is trained with backpropagation, there is no feature engineering step, and the same method can be applied to every source–target pair.
  • NMT outperforms SMT, so most translators in use today are NMT-based. 
  • However, like other deep learning models, it lacks interpretability and is hard to constrain with rules or guidelines. 

 

 

 

 

 

โœ” SMT overview 

 

A probabilistic model is built from the large amount of parallel data available. 

 

  • By Bayes rule ๐Ÿ’จ Translation model + Language model 

 

๐Ÿค” How do we train the translation model? 

 

 

An alignment that maps the correspondence between the source and target sentences is used. 

 

  • Alignment: the task of finding and pairing matching words or phrases between the source and target sentences. 

 

Case where the alignment maps words 1:1 (an extremely rare example)

 

  • When mapping the source and target sentences word by word, an additional alignment variable, which records how the mapped words are arranged, is used to capture the word-order differences between the two languages. 
  • In practice, alignments usually cannot be defined as cleanly 1:1 as in the figure above. Learning alignments therefore becomes complicated, and various training algorithms have been studied ๐Ÿ‘‰ e.g. the EM algorithm is used (see the formulation below). 
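As a formula, the alignment a is treated as a latent variable: a model P(x, a | y) is learned (for example with EM), and the translation model is obtained by marginalizing over all possible alignments:

$$ P(x \mid y) \;=\; \sum_{a} P(x, a \mid y) $$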

 

 

Alignment is complex ๐Ÿ‘‰ a factor that degrades the performance of SMT models for machine translation 

 

* ์™ผ์ชฝ์ด source, ์˜ค๋ฅธ์ชฝ์ด target  

 

1. No counterpart 

 

Case where a source word has no corresponding word in the target (translated) sentence

 

 

2. Many to one 

 

source ์—์„œ ์—ฌ๋Ÿฌ๊ฐœ ๋‹จ์–ด๋กœ ํ‘œํ˜„๋˜๋Š” ์˜๋ฏธ๊ฐ€ target ์—์„œ๋Š” ํ•˜๋‚˜์˜ ๋‹จ์–ด๋กœ ํ‘œํ˜„๋˜๋Š” ๊ฒฝ์šฐ

 

 

 

 

3. One to many 

 

source ์˜ ํ•œ ๋‹จ์–ด๊ฐ€ target ์˜ ์—ฌ๋Ÿฌ ๋‹จ์–ด๋กœ ๋ฒˆ์—ญ๋˜์–ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ

 

 

 

source ์˜ ํŠน์ •ํ•œ ํ•˜๋‚˜์˜ ๋‹จ์–ด๊ฐ€ ๋งค์šฐ ๋งŽ์€ ๋œป์„ ํฌํ•จํ•˜์—ฌ ์ƒ๋‹นํžˆ ๋งŽ์€ ๋ฒˆ์—ญ ๋‹จ์–ด์™€ ๋Œ€์‘๋˜๋Š” ๊ฒฝ์šฐ → Very fertile 

 

 

 

4. Many to many (phrase level)

 

Idiomatic expressions realized as phrases or clauses map across multiple words

 

Learning must handle many-to-many alignments, which makes it complex 

 

 

๐Ÿ’จ The translation model has to learn many kinds of combinations that account for word position, how many target words correspond to each source word, and so on 

 - Probability of particular words aligning also depends on position in the sentence 

 - Probability of particular words having particular fertility (number of corresponding words)

 

 

 

 

โœ” Decoding for SMT 

 

Decoding ๐Ÿ‘‰ the process of generating the best translation using the learned translation model and language model 

 

 

์–ด๋–ค ๋ฌธ์žฅ์ด best ์ธ์ง€ ์–ด๋–ป๊ฒŒ ์ฐพ์„ ์ˆ˜ ์žˆ์„๊นŒ

 

  • Comparing the probability of every possible translation y requires far too much computation. 
  • Answer: decompose the problem into several factors and find the global optimum (the best translation) by combining the optimal solutions of each factor, i.e. decode with dynamic programming ๐Ÿ‘‰ a heuristic search algorithm 

 

Decoding process 

 

 

์ตœ์ ํ•ด ์กฐํ•ฉ์œผ๋กœ ๊ฐ€์žฅ best ๋ฒˆ์—ญ๋ฌธ์„ ๋„์ถœ

 

 

 

 

โœ” SMT ์˜ ๋‹จ์  

 

  • Extremely complex: the overall system is built from many subcomponents (such as alignment), so the pipeline is complicated. 
  • Feature engineering is required to capture particular language phenomena in order to improve performance. 
  • Additional resources are needed for good performance, such as collecting phrases with the same meaning. 
  • A lot of human effort is needed to build and maintain the model: constructing paired datasets, training each subcomponent, etc. 

 

 

 

 

 

 

2๏ธโƒฃ Sequence to Sequence 


 

1. Neural Machine Translation 

 

 

 

โœ” Sequence-to-sequence overview 

 

  • 2๊ฐœ์˜ RNN ์œผ๋กœ ๊ตฌ์„ฑ๋œ ๋ชจ๋ธ 

 

Encoder ๐Ÿ’จ source sentence encoding 

 

  • source ๋ฌธ์žฅ์˜ context ์ •๋ณด๋ฅผ ์š”์•ฝํ•œ๋‹ค. 
  • context ์š”์•ฝ ์ •๋ณด์ธ last hidden state ๋ฅผ decoder ๋กœ ์ „๋‹ฌํ•˜์—ฌ initial hidden state ๋กœ ์‚ฌ์šฉํ•œ๋‹ค. 

 

 

Decoder ๐Ÿ’จ generates target sentence 

 

  • encoding ์„ condition ์œผ๋กœ target sentence ๋ฅผ ์ƒ์„ฑํ•˜๋Š” conditional language model 
  • target sentence ํŠน์ • ๋‹จ์–ด๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ๊ทธ ๋‹ค์Œ์— ์˜ฌ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๊ธฐ ๋•Œ๋ฌธ์— language model ์ด๋ผ ์นญํ•  ์ˆ˜ ์žˆ๋‹ค.
  • source sentence ์˜ context ์š”์•ฝ ์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ target sentence ๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 

 

๐Ÿ’จ Train: at each timestep, the ground-truth word of the target sentence is used as the input for predicting the word at the next timestep. 

 

๐Ÿ’จ Test: no target sentence is given, so the decoder RNN's output at each timestep is fed back as the input at the next timestep. 

 

 

โž• Teacher forcing: a hyperparameter that sets how often the ground-truth word is fed to the decoder during training (a rough sketch follows below). 

  - Prevents a word generated incorrectly early on from cascading into a badly generated sentence and hurting training. 

  - Should be set appropriately so that the model does not overfit. 

 

 

โœ” Applications of sequence-to-sequence 

 

  • Used as the training model for many text tasks: 

1. Machine Translation

2. Dialogue: predicting the next utterance 

3. Parsing: syntactic analysis of a sentence 

4. Time series 

5. Text summarization 

6. voice generation 

7. Code generation 

 

 

 

โœ” Training NMT 

 

  • SMT uses Bayes' rule to split the model into two parts trained separately, whereas NMT computes the conditional probability directly. 
  • How? ๐Ÿ‘‰ a big parallel corpus + end-to-end training 

 

The simplest form of an NMT model

 

  • Framed as a loss-minimization problem for a neural network trained with backpropagation ๐Ÿ‘‰ the negative log likelihood is used as the loss (written below) 
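The loss for one training example is the negative log likelihood of the reference translation, averaged over the T decoder steps:

$$ J(\theta) \;=\; -\frac{1}{T} \sum_{t=1}^{T} \log P\big(y_t \mid y_1, \ldots, y_{t-1},\, x\big) $$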

 

 

 

 

 

2. Seq2Seq  

 

 

โœ” Multi-layer RNNs 

 

 

  • Layers are stacked alongside the progression of time steps so that higher-level features can be learned. 
  • Most RNN models with good performance use a multi-layer RNN structure, but they are not made as deep as ordinary feed-forward networks or CNNs. 
  • 2–4 encoder layers and 4 decoder layers ๐Ÿ’จ reported to give the best performance. 
  • After Transformer-based models appeared, 12 or even 24 layers came into use (e.g. BERT). 

 

 

โœ” NMT decoding methods

 

 

1) Greedy decoding 

 

argmax

 

๋ถ€์ž์—ฐ์Šค๋Ÿฌ์šด ๋ฌธ์žฅ ์ƒ์„ฑ example

 

  • At every step, choose the next word with the highest probability ๐Ÿ’จ the argmax word. 
  • It does not guarantee the optimal solution: a word that has already been chosen cannot be revised later in light of the words that follow, so unnatural sentences can be produced. 
  • Still, it works far more efficiently than the exhaustive method (a sketch follows below). 
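A minimal greedy-decoding loop as a sketch (the `decoder_step` function and the special-token ids are hypothetical; a real decoder would also handle embeddings and batching):

```python
import torch

def greedy_decode(decoder_step, init_hidden, bos_id, eos_id, max_len=50):
    tokens, hidden = [bos_id], init_hidden
    for _ in range(max_len):
        logits, hidden = decoder_step(torch.tensor([tokens[-1]]), hidden)
        next_id = int(logits.argmax(dim=-1))   # pick the single most likely next word (argmax)
        if next_id == eos_id:                  # stop when the end-of-sentence token is produced
            break
        tokens.append(next_id)
    return tokens[1:]                          # drop the <BOS> token
```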

 

 

 

2) Exhaustive search decoding 

 

 

  • ์ƒ์„ฑ๋  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋“  target ๋ฌธ์žฅ์„ ๋น„๊ตํ•˜์—ฌ ํ™•๋ฅ ์ด ๊ฐ€์žฅ ๋†’์€ ๋ฌธ์žฅ์„ ์„ ํƒํ•˜๋Š” ๋ฐฉ๋ฒ• 
  • O(V) ๊ณ„์‚ฐ ๋ณต์žก๋„๊ฐ€ ๋„ˆ๋ฌด ํฌ๋‹ค๋Š” ๋‹จ์  

 

 

 

3) Beam Search decoding 

 

 

  • Key idea: decode while keeping the k best partial translations (hypotheses). 
  • The beam size k is usually set to a number between 5 and 10.

 

  • The k*k candidates are compared and the k best hypotheses are kept; this process is repeated (see the sketch below). 
  • It does not guarantee the optimal solution either, but because more diverse translations can survive, it keeps better candidates that greedy search might miss.  
  • Decoding terminates on the EOS token, a predefined maximum timestep T, or once n completed sentences have been collected, which keeps the decoding process from running too long. 

 

 

 

์ฆ‰, ๊ฐ step ๋‹จ์–ด๋ณ„๋กœ 2๊ฐœ์˜ ํ›„๋ณด๊ตฐ์„ ๊ฐ๊ฐ ์„ค์ •ํ•˜๊ณ , step ์ „์ฒด hypotheses ์ค‘ 2๊ฐœ๋ฅผ ์„ ํƒํ•˜๋Š” ๊ณผ์ •์„ ๋ฐ˜๋ณตํ•œ๋‹ค.

 

  • The two candidates with the largest log likelihood keep being selected, repeating until the EOS token ends the sentence. 

 

 

  • The sum of log likelihoods is normalized by the length of the generated sentence ๐Ÿ‘‰ when word sequences are compared by their summed log likelihood, longer sentences get smaller sums, so shorter sentences would be preferred; to prevent this, an extra normalization step divides the log likelihood by the sentence length, as in the score below. 
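The normalized score used to compare completed hypotheses of different lengths t:

$$ \text{score}(y_1, \ldots, y_t) \;=\; \frac{1}{t} \sum_{i=1}^{t} \log P\big(y_i \mid y_1, \ldots, y_{i-1},\, x\big) $$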

 

 

 

 

3. Pros and cons of NMT 

 

๐Ÿ‘€ Advantages 

 

  • Better Performance : more fluent, better use of context, better use of phrase similarities 
  • A single NN optimized end to end: no subcomponents are needed. 
  • less human engineering effort 

 

๐Ÿ‘€ Disadvantages 

 

  • Difficult to control: rules and guidelines are hard to apply 
  • less interpretable → hard to debug 

 

 

4. BLEU (Bilingual Evaluation Understudy)

 

An evaluation metric for machine translation ๐Ÿ‘‰ based on n-gram precision 

 

  • ๊ธฐ๊ณ„๋ฒˆ์—ญ๋œ ๋ฌธ์žฅ๊ณผ ์‚ฌ๋žŒ์ด ๋ฒˆ์—ญํ•œ ๋ฌธ์žฅ์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ, ๋””์ฝ”๋”ฉ ๊ณผ์ • ์ดํ›„ ์ƒ์„ฑ๋œ ๋ฌธ์žฅ์˜ ์„ฑ๋Šฅ์„ ํ‰๊ฐ€ํ•œ๋‹ค. 
  • N-gram precision ๊ธฐ๋ฐ˜์˜ similarity score + penalty for too short translation 
  • ๊ทธ๋Ÿฌ๋‚˜ ์ •์„ฑ์ ์œผ๋กœ ๋” ์ข‹์€ ๋ฌธ์žฅ์ž„์—๋„ score ๊ฐ€ ๋‚ฎ๊ฒŒ ๋‚˜์˜ฌ์ˆ˜ ์žˆ๋Š” ๋ถˆ์™„์ „ํ•œ ์ง€ํ‘œ์ด๋‹ค. 
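In its usual form (stated here from the standard BLEU definition rather than the lecture slides), with p_n the modified n-gram precision, w_n weights (typically uniform over n = 1..4), c the candidate length and r the reference length:

$$ \text{BLEU} \;=\; \text{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big), \qquad \text{BP} \;=\; \min\big(1,\; e^{\,1 - r/c}\big) $$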

 

 

 

5. Limitations of NMT

 

 

  1. Misinterpreting idiomatic expressions 
  2. Learned bias: even when the source sentence says nothing about gender, the translation assigns gender biases to occupations 
  3. Uninterpretable: for expressions it cannot interpret, the translation model produces arbitrary sentences ๐Ÿ‘‰ unlike humans, the model is brittle on inputs it was not trained on 
  4. Out of vocabulary words 
  5. Domain mismatch between train and test data 
  6. Maintaining context over longer text 
  7. Low-resource language pairs 

 

 

 

 

3๏ธโƒฃ Attention 


 

1. Bottleneck problem 

 

โœ” source ๋ฌธ์žฅ์˜ ๋งˆ์ง€๋ง‰ encoding ๋งŒ์„ ์ž…๋ ฅ์œผ๋กœ ์‚ฌ์šฉ

 

 

  • Information bottleneck: information is lost. 
  • To produce a good translation, the encoding would need to hold all the information about the source sentence, and a single vector is usually not enough for that. 

 

 

2. Attention  

 

โœ” The solution to the bottleneck problem

 

  • Decoder ์˜ ๊ฐ step ์—์„œ source sentence ์˜ ํŠน์ • ๋ถ€๋ถ„์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๋„๋ก Encoder ์™€ Connection ์„ ์ถ”๊ฐ€ํ•œ๋‹ค. 

 

โœ” Attention ์ด decoding ์— ๋ฐ˜์˜๋˜๋Š” ์ˆœ์„œ 

 

โ‘  Compute the attention scores: the dot product between each encoder step's encoding and the decoder hidden state measures their similarity. 

 

์ดˆ๋ก์ƒ‰ = decoder ์˜ hidden state

 

 

โ‘ก Pass the scores through a softmax to compute the attention distribution (attention weights). 

 

๋‹ค์Œ step ์˜ ๋‹จ์–ด๋ฅผ ์ƒ์„ฑํ•˜๊ธฐ ์œ„ํ•ด encoder ์˜ ์ฒซ๋ฒˆ์งธ ๋‹จ์–ด์ธ il (=he) ์— ๋งŽ์€ attention

 

โ‘ข Use the attention distribution to take a weighted sum of the encoder hidden states, producing the attention output. 

 

 

โ‘ฃ Concatenate the attention output with the decoder hidden state and generate the word for that step. 

 

 

 

 

โ‘ค ๊ฐ step ๋ณ„ decoder hidden state ๋งˆ๋‹ค ์ ์šฉ 

 

 

 

 

 

โœ” ์ˆ˜์‹์œผ๋กœ ์‚ดํŽด๋ณด๊ธฐ

 

 

 

 

โœ” Attention is great

 

1. NMT ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ ๐Ÿ’จdecoder ๊ฐ€ source sentence ์˜ ํŠน์ • ์˜์—ญ์— ์ง‘์ค‘ํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•จ 

2. Source sentence ์˜ ๋ชจ๋“  ์ž„๋ฒ ๋”ฉ์„ ์ฐธ๊ณ ํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•ด bottleneck problem ์„ ํ•ด๊ฒฐํ•จ 

3. Shortcut ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค ๋ฌธ์ œ๋ฅผ ์ค„์ž„ + connection ์€ ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ hidden state ๊ฐ„์˜ ์ง€๋ฆ„๊ธธ์„ ์ œ๊ณตํ•˜์—ฌ ๊ธฐ์šธ๊ธฐ ์†Œ์‹ค ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐ 

4. Attention probability ๋ฅผ ํ†ตํ•ด ๊ฒฐ๊ณผ์— ๋Œ€ํ•œ ํ•ด์„๋ ฅ์„ ์ œ๊ณตํ•จ 

 

 

โœ” Attention ์€ ๋”ฅ๋Ÿฌ๋‹ ๊ธฐ์ˆ ์— ํ•ด๋‹นํ•œ๋‹ค.

 

 

๐Ÿ‘€ Example

 

  • Sentiment analysis: classifying sentences from car-review data as positive or negative. 
  • The model is a hierarchical attention module that applies attention layers at both the sentence level and the word level. 

 

attention ์„ ํ™œ์šฉํ•œ ๊ฐ์ •๋ถ„๋ฅ˜ ์˜ˆ์‹œ

 

  • For a review judged negative, we can check which sentences and which words received high attention scores. 

 

 

๐Ÿ‘€ Attention ์˜ ์ผ๋ฐ˜์ ์ธ ์ •์˜ 

 

The use of attention is formalized with query and value vectors: given a query, attention computes a weighted sum of the values, with weights that depend on the query.
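With values $v_1, \ldots, v_N$ and a query $q$ (the dot product shown here is just one common choice of scoring function), the general recipe is:

$$ e_i = q^{\top} v_i, \qquad \alpha = \mathrm{softmax}(e), \qquad \text{output} = \sum_{i} \alpha_i\, v_i $$

In the seq2seq case above, the encoder hidden states play the role of the values and the decoder hidden state plays the role of the query.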

 

 

