
[cs224n] Lecture 12 Notes

by isdawell 2022. 5. 23.

๐Ÿ’ก Topic: Subword Models


๐Ÿ“Œ Key points

  • Task: character-level models
  • BPE, WordPiece model, SentencePiece model, hybrid models

1๏ธโƒฃ Linguistic Knowledge 


1. Linguistics concepts

 

โœ” ์Œ์šด๋ก  Phonology

 

โ—ฝ ์–ธ์–ด์˜ '์†Œ๋ฆฌ' ์ฒด๊ณ„๋ฅผ ์—ฐ๊ตฌํ•˜๋Š” ๋ถ„์•ผ → ์‚ฌ๋žŒ์˜ ์ž…์œผ๋กœ ๋ฌดํ•œ์˜ ์†Œ๋ฆฌ๋ฅผ ๋‚ผ ์ˆ˜ ์žˆ์ง€๋งŒ, ์–ธ์–ด๋กœ ํ‘œํ˜„๋  ๋•Œ๋Š” ์—ฐ์†์ ์ธ ์†Œ๋ฆฌ๊ฐ€ ๋ฒ”์ฃผํ˜•์œผ๋กœ ๋‚˜๋ˆ ์ ธ์„œ ์ธ์‹๋œ๋‹ค. 

 

Even similar-sounding pronunciations are perceived categorically when expressed in language (caught vs. cot).

 

 

 

โœ” Morphology

 

โ—ฝ The minimal meaning-bearing units (morphemes)

โ—ฝ The branch of grammar that studies how word forms change → small units combine to complete a meaning

 

 

In Prof. Manning's experiments, morphology-aware tokenization was concluded to be difficult.

 

 

๐Ÿ‘‰ Morpheme-level units are rarely used in deep learning: splitting words into morphemes is itself difficult, and even without doing so, characters and n-grams capture the important elements of meaning well enough.

 

 

 

2. Words in writing systems 

 

โœ” Writing systems vary in how they represent words

 

์‚ฌ๋žŒ์˜ ์–ธ์–ดํ‘œ๊ธฐ ์ฒด๊ณ„๋Š” ๊ตญ๊ฐ€๋งˆ๋‹ค ์ฐจ์ด๊ฐ€ ์กด์žฌํ•˜๋ฉฐ ํ•˜๋‚˜๋กœ ํ†ต์ผ๋˜์–ด ์žˆ์ง€ ์•Š๋‹ค. 

 

โ—ฝ No segmentation: some languages, such as Chinese, are written without word segmentation.

 

โ—ฝ Compounds: some items are written with spaces yet function as a single noun, while others must be recognized as one word despite having no space.

 

 

โ—ฝ Written forms may also vary

 

3. Models below the word level

 

โœ” Need to handle large, open vocabulary

 

A word-level model has to cover far too many words, which would require an unbounded vocabulary — inefficient.

 

โ—ฝ Rich morphology: languages such as Czech with rich morphology

 

โ—ฝ Transliteration: foreign-word transcription

 

โ—ฝ Informal spelling: abbreviations, nonstandard spellings, and neologisms keep appearing


2๏ธโƒฃ Pure Character-level models 


 

1. Why approach at the character level

 

โœ” (Recap) Problems with word-level models

 

 

โ‘  Characteristics differ by language.

โ‘ก Need to handle a large, open vocabulary

 

ํ•œ์ •๋œ ํฌ๊ธฐ์˜ ๋‹จ์–ด์‚ฌ์ „ ํ˜•์„ฑ์€ ๋ฌธ์ œ์ ์ด ์กด์žฌํ•œ๋‹ค.

 

  • ํ•œ์ •๋œ ๋‹จ์–ด์‚ฌ์ „ ํฌ๊ธฐ์— ์˜ํ•ด ๋‹จ์–ด์‚ฌ์ „์— ์—†๋Š” ๋‹จ์–ด๋Š” UNK ํ† ํฐ์œผ๋กœ ๋ถ„๋ฅ˜๋˜๋Š” ๋ฌธ์ œ : OOV 
  • ๊ธฐ๊ณ„๋ฒˆ์—ญ์˜ ๊ฒฝ์šฐ ์ˆซ์ž๋‚˜ ์ด๋ฆ„๊ฐ™์ด ๋‹จ์–ด๊ฐ€ ๋ฌดํ•œ๋Œ€๋กœ ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋ฐœ์ƒ : open-vocabulary problem 
  • multi-task learning ์˜ ๊ฒฝ์šฐ ๋ชจ๋ธ์ด ๋„๋ฉ”์ธ๋งˆ๋‹ค ์ปค๋ฒ„ํ•ด์•ผํ•  ๋‹จ์–ด๊ฐ€ ๋” ๋งŽ์•„์ง„๋‹ค. 
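The UNK collapse in the first bullet can be shown in a few lines (a toy vocabulary and sentence, not any real system's):

```python
# With a fixed vocabulary, every out-of-vocabulary word collapses to one <unk> id,
# so distinct unseen words become indistinguishable to the model.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def encode(tokens, vocab):
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

ids = encode(["the", "cat", "sat", "on", "Hypatia"], vocab)
print(ids)  # [1, 2, 3, 0, 0] -- "on" and "Hypatia" are both just <unk>
```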

 

 

โœ” Advantages of character-level models

 

๐Ÿ’จ Unknown words can be handled: solves the OOV problem

๐Ÿ’จ Words with similar spellings can have similar embedding vectors

๐Ÿ’จ Connected languages such as compounds can also be analyzed

๐Ÿ’จ Meaning extracted from character n-grams: may be imperfect grammatically and semantically, but the results are good

 

 

 

๐Ÿ“Œ OOV problem: a problem that occurs frequently in NLP, where an input word cannot be processed because it is missing from the database or the embedding vocabulary.

 

https://acdongpgm.tistory.com/223 — [NLP] Solving OOV — 1. BPE (Byte Pair Encoding)

 

 

 

 

2. Pure Character level seq2seq LSTM NMT system

 

โœ” English - Czech WMT (2015) 

 

โ—ฝ Czech is a good language for trying out various character-level experiments.

โ—ฝ The character-level system distinguishes person names better than the word-level one; the word-level system simply copies the source text wherever it emits an UNK token.

โ—ฝ Training the character-level model took three weeks.

Character level translates person names better than word level.

 

 

 

 

โœ” Fully Character-Level Neural Machine Translation without Explicit Segmentation (2017)

 

Encoder

 

(1) Embed the entire sentence at the character level.

(2) Run convolutions with various filter sizes to extract features.

(3) Max-pool with stride 5 (every 3–7 characters form one word-like unit).

(4) Segment embeddings: features

(5) Highway network

(6) Bidirectional GRU
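Steps (1)–(3) can be sketched shape-wise with NumPy (random weights and embeddings, made-up filter counts and sentence length — this only shows the tensor plumbing, not the trained encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

L, d = 20, 8                                   # 20 characters, embedding dim 8
chars = rng.normal(size=(L, d))                # (1) character-level embeddings

def conv1d_same(x, width, n_filters, rng):
    """1-D convolution over the character axis with zero padding ('same' length)."""
    W = rng.normal(size=(width * x.shape[1], n_filters))
    xp = np.vstack([x, np.zeros((width - 1, x.shape[1]))])
    windows = np.stack([xp[i:i + width].ravel() for i in range(x.shape[0])])
    return np.maximum(windows @ W, 0.0)        # ReLU feature maps, shape (L, n_filters)

# (2) several filter widths, concatenated along the feature axis
feats = np.concatenate([conv1d_same(chars, w, 4, rng) for w in (3, 4, 5)], axis=1)

# (3) max pooling with stride 5: every 5-character segment becomes one unit
pooled = np.stack([feats[i:i + 5].max(axis=0) for i in range(0, L, 5)])
print(pooled.shape)  # (4, 12): 4 word-like segments, 12 features each
```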

 

The decoder is an ordinary character-level sequence model.

 

char-to-char translation performed best (better than the 2015 study).

 

 

๐Ÿ“Œ Highway network: as depth grows, optimization becomes harder; highway layers make it possible to build deep models while controlling the flow of information and maximizing trainability (similar in spirit to ResNet / LSTM).

 

๐Ÿ’จ Transform gate, Carry gate 
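A highway layer mixes a transformed signal and the raw input through exactly these two gates; a minimal NumPy sketch (weights chosen by hand just to make the carry path visible):

```python
import numpy as np

def highway(x, W_h, b_h, W_t, b_t):
    """y = t * H(x) + (1 - t) * x  with transform gate t = sigmoid(W_t x + b_t);
    the carry gate is 1 - t, so when t is near 0 the layer acts as an identity."""
    h = np.maximum(x @ W_h + b_h, 0.0)           # H(x): the transformed signal
    t = 1.0 / (1.0 + np.exp(-(x @ W_t + b_t)))   # transform gate
    return t * h + (1.0 - t) * x                 # mix transform and carry paths

d = 4
x = np.ones(d)
# A strongly negative gate bias pushes t toward 0: the input is carried through
# almost unchanged even though H(x) = 2x here.
y = highway(x, 2.0 * np.eye(d), np.zeros(d), np.eye(d), np.full(d, -20.0))
print(np.allclose(y, x, atol=1e-6))  # True
```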

 

https://lyusungwon.github.io/studies/2018/06/05/hn/ — Highway Network

 

https://datacrew.tech/resnet-1-2015/ — ResNet (1) – 2015 | DataCrew


โœ”  Character-Based Neural Machine Translation with Capacity and Compression (2018)

 

โ—ฝ Applies a Bi-LSTM seq2seq model.

โ—ฝ Comparing the char seq2seq model with a word-level BPE model, the character-based model outperforms BPE on Czech–English translation, but requires far more computation.

 

 

์–ธ์–ด์˜ ํŠน์„ฑ์— ๋”ฐ๋ผ char-based ์™€ BPE ์„ฑ๋Šฅ์— ์ฐจ์ด๊ฐ€ ์กด์žฌํ•œ๋‹ค.

 

 

๐Ÿ‘€ Character-level models perform somewhat better, but the heavy computation makes them far more expensive in time and cost.

 

 

 

 

 

3๏ธโƒฃ Sub-word Models 


 

๐Ÿ‘€ A subword model is the same as a word-level model, but uses smaller units: word pieces.

 

์œ„ํ‚ค๋…์Šค subword ์„ค๋ช… ์ฐธ๊ณ 

 

1. BPE

 

โœ” Byte Pair Encoding 

 

* The preceding paper the idea is based on: https://aclanthology.org/P16-1162.pdf → the most frequently occurring byte pair in a sequence is merged into a new byte and added to the vocabulary.

 

โ—ฝ Similar to a word-level model, but operates on word pieces.

โ—ฝ Research began as a compression algorithm and was later adapted into a word segmentation algorithm; the idea itself has nothing to do with deep learning.

โ—ฝ The most frequent pairs are found starting from character units.

โ—ฝ Purely data-driven and multilingual: applicable to any data, independent of language.

 

Start from unigrams.

 

๐Ÿ’จ The numbers on the left in the dictionary are word frequencies.

 

bigram

๐Ÿ’จ es appears 6 + 3 = 9 times.

 

The most frequent pairs in the dictionary — es, est, lo — are added as new entries and treated as single units.
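These merges can be reproduced with the standard BPE learning loop (a sketch following the reference algorithm in the Sennrich et al. paper linked above, run on the lecture's toy dictionary):

```python
import collections
import re

def get_stats(vocab):
    """Count adjacent symbol-pair frequencies over a {spaced word: count} vocab."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Rewrite the vocab with every occurrence of `pair` merged into one symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy dictionary: counts of low / lower / newest / widest, split into characters.
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
merges = []
for _ in range(3):
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)   # most frequent pair wins
    vocab = merge_vocab(best, vocab)
    merges.append(best)
print(merges)  # [('e', 's'), ('es', 't'), ('l', 'o')] -- i.e. es, est, lo
```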

 

 

 

 

 

 

 

2. WordPiece model

Tokenization is performed within words.

 

โ—ฝ A variant of the BPE algorithm

โ—ฝ Pre-segmentation + BPE: frequently occurring words are first added to the vocabulary, and BPE is applied afterwards

โ—ฝ BPE merges the most frequent pair by raw count; the WordPiece model instead merges the pair whose merge most increases the likelihood of the corpus.

 

๋นˆ๋„์ˆ˜๊ฐ€ ๋‚ฎ์€ Jet ์€ Jโœ”et ๋กœ, feud ๋Š” feโœ”ud ๋กœ ๋‚˜๋ˆ„์–ด์ง

 

โ—ฝ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” piece ๋Š” unit ์œผ๋กœ ๋ฌถ์Œ, ์ž์ฃผ ๋“ฑ์žฅํ•˜์ง€ ์•Š๋Š” ๊ฒƒ์€ ๋ถ„๋ฆฌ 

โ—ฝ Transformer, ELMo, BERT, GPT-2 ์ตœ์‹  ๋”ฅ๋Ÿฌ๋‹ ๋ชจ๋ธ์—์„œ ์ด ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์›๋ฆฌ๊ฐ€ ์‚ฌ์šฉ๋จ 
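The likelihood criterion is commonly approximated by scoring each pair as count(pair) / (count(first) · count(second)), so pairs that almost always co-occur win even when they are not the most frequent overall. A sketch on made-up counts (the score formula follows the usual public description of WordPiece, not Google's unreleased implementation):

```python
import collections

def wordpiece_scores(vocab):
    """Score adjacent pairs by count(ab) / (count(a) * count(b)) over a
    {spaced word: count} vocabulary -- BPE would use the raw pair count instead."""
    pair_counts = collections.Counter()
    unit_counts = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for s in symbols:
            unit_counts[s] += freq
        for pair in zip(symbols, symbols[1:]):
            pair_counts[pair] += freq
    return {p: c / (unit_counts[p[0]] * unit_counts[p[1]])
            for p, c in pair_counts.items()}

# Made-up counts: 'u' appears everywhere, so pairs containing it score low;
# 'g s' occurs only together and wins despite its small raw count.
vocab = {'h u g': 10, 'p u g': 5, 'p u n': 12, 'b u n': 4, 'h u g s': 5}
scores = wordpiece_scores(vocab)
best = max(scores, key=scores.get)
print(best)  # ('g', 's')
```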

 

 

 

3. SentencePiece model

Operates directly on raw text.

 

โ—ฝ An unsupervised text tokenizer package released by Google in 2018: https://github.com/google/sentencepiece

 


โ—ฝ For languages such as Chinese, where word boundaries are hard to determine, raw text is split directly at the character level.

โ—ฝ It tokenizes sentence-level input without any pre-tokenization step, so it can be used independently of the language.

โ—ฝ Unlike the original BPE, it computes a co-occurrence probability for each bigram and adds the pair with the highest value.
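Part of what makes SentencePiece language-independent is treating whitespace as an ordinary symbol '▁', which makes detokenization trivial and lossless. A sketch of just that convention (the segmentation function below is an arbitrary stand-in for the learned subword model):

```python
def encode_as_pieces(text, segment):
    """Turn spaces into the meta symbol '▁' and segment the raw character stream."""
    return segment('▁' + text.replace(' ', '▁'))

def decode_pieces(pieces):
    """No language-specific detokenizer needed: concatenate, then restore spaces."""
    return ''.join(pieces).replace('▁', ' ').strip()

# Arbitrary stand-in segmentation: fixed 4-character chunks.
segment = lambda s: [s[i:i + 4] for i in range(0, len(s), 4)]

pieces = encode_as_pieces('Hello world', segment)
print(pieces)                                  # ['▁Hel', 'lo▁w', 'orld']
print(decode_pieces(pieces) == 'Hello world')  # lossless round trip: True
```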

 

 

 

4. BERT

โ—ฝ The vocab size is large but not enormous, so word pieces are used only partially

โ—ฝ Uses relatively frequent whole words plus word pieces

 

๐Ÿ‘‰ A word missing from the vocabulary, such as Hypatia, is split into 4 word-piece vectors.

 

 

๐Ÿ‘€ How do we turn subword representations back into word-level representations?

 → Average the embedded vectors into a word vector

 → Learn a richer embedding with a CNN- or RNN-style network over the pieces
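The averaging option is essentially a one-liner. A toy sketch with random stand-in piece vectors (the ## prefix follows BERT's word-piece convention; the piece split of the word is assumed, not taken from a real tokenizer):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random stand-ins for trained word-piece embeddings of one OOV word.
pieces = ['hy', '##pat', '##ia']
piece_vecs = {p: rng.normal(size=4) for p in pieces}

# Average the piece vectors to get a single word-level vector.
word_vec = np.mean([piece_vecs[p] for p in pieces], axis=0)
print(word_vec.shape)  # (4,)
```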

 

 

 

 

 

 

4๏ธโƒฃ Hybrid Models 


 

๐Ÿ‘€ Process text at the word level by default, and handle proper nouns and out-of-vocabulary words at the character level.

 

1. Character-level POS 

 

 

โ—ฝ Convolutions over characters produce word embeddings, and a higher-level model uses those embeddings for POS tagging.

 

2. character-based LSTM (2015) 

 

character-based LSTM

 

โ—ฝ A study that performed the NMT (neural machine translation) task at the character level.

 

 

3. Character-Aware Neural Language models (2015) 

 

 

 

 

โ—ฝ Models relationships between subwords: useful when words share a common subword and thus a semantic relationship, as in eventful ~ eventfully ~ uneventful.

 

โ—ฝ Split into characters → conv layers with various filter sizes (feature representation) → max pooling (choosing which n-gram best represents the word's meaning) → highway network → word-level LSTM

 

 

โ—ฝ Achieves comparable performance with far fewer parameters.

 

 

 

 

โ—ฝ Before the highway layer, a person's name's nearest neighbors are merely words with similar spelling rather than similar meaning; after the highway network, the model outputs other people's names ๐Ÿ‘‰ it learns more meaningful words by reflecting semantics.

 

 

 

 

4. Hybrid NMT

 

Only the words cute and joli are handled at the character level.

 

โ—ฝ Works mostly at the word level, falling back to the character level only when needed.

โ—ฝ The basic architecture is seq2seq.

โ—ฝ It also achieved the best performance.

 

 

โ—ฝ The char-based translation gets the name wrong

โ—ฝ The word-based translation loses the word diagnosis, repeats the word that appeared just before po, and simply copies UNK tokens from the source

โ—ฝ The hybrid system's translation is the best as well

 

 

5. FastText embeddings

 

 

โ—ฝ A next-generation word-vector learning library succeeding word2vec; it treats a single word as being composed of several smaller units (n-grams) during training

โ—ฝ Performs better on morphologically rich languages and on rare words

โ—ฝ Both a word's n-grams and the original word itself are used in training

 

๐Ÿ‘‰ Even for unknown words, subwords can be used to compute similarity with other words, and rare, infrequent words can get embedding values by comparing their n-grams with other words'.
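The n-gram decomposition works as in the FastText paper's where example: the word is wrapped in boundary markers < and >, and its character n-grams plus the whole word form its subword set:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subwords: character n-grams of '<word>' plus the whole word."""
    w = f'<{word}>'
    grams = {w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)}
    grams.add(w)  # the full word is also kept as its own unit
    return grams

print(sorted(char_ngrams('where', 3, 3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```

The boundary markers matter: the trigram "her" inside where is distinct from the whole word "\<her\>".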

 

e.g. the Korean neologism ์–ด์ฉ”ํ‹ฐ๋น„ → ์–ด์ฉŒ๋ผ๊ณ 

 

 

๐Ÿ“Œ Practice links

 

1. https://wikidocs.net/22883 — FastText
2. https://wikidocs.net/22592 — Byte Pair Encoding (BPE)
3. https://wikidocs.net/86657 — SentencePiece
4. https://wikidocs.net/86792 — SubwordTextEncoder (IMDB practice)
5. https://wikidocs.net/99893 — Huggingface Tokenizers: the tokenizers package from the NLP startup Hugging Face provides various subword tokenizers that treat frequent subwords as single tokens

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
