
GloVe Practice

by isdawell 2022. 5. 31.

๐Ÿ“Œ Notebook link: https://colab.research.google.com/drive/148V1ytOU36pT4oX1fbcWUaC3B6F4O62s?usp=sharing 

 

GloVe

 

 

๐Ÿ’ก Paper : https://nlp.stanford.edu/pubs/glove.pdf

 

 

 

1๏ธโƒฃ  glove python 


https://wikidocs.net/22885

 


 

 

 

โœ” Concept review 

 

 

 

  • A word-embedding method that uses both count-based and prediction-based approaches 
  • Introduced to address the shortcomings of count-based LSA and prediction-based Word2Vec 

 

โ˜ LSA 

 

  • ๊ฐ ๋ฌธ์„œ์—์„œ์˜ ๊ฐ ๋‹จ์–ด์˜ ๋นˆ๋„์ˆ˜๋ฅผ ์นด์šดํŠธ ํ•œ ํ–‰๋ ฌ์ด๋ผ๋Š” ์ „์ฒด์ ์ธ ํ†ต๊ณ„ ์ •๋ณด๋ฅผ ์ž…๋ ฅ์œผ๋กœ ๋ฐ›์•„ ์ฐจ์›์„ ์ถ•์†Œ(Truncated SVD)ํ•˜์—ฌ ์ž ์žฌ๋œ ์˜๋ฏธ๋ฅผ ๋Œ์–ด๋‚ด๋Š” ๋ฐฉ๋ฒ•๋ก 
  •  ์ฝ”ํผ์Šค์˜ ์ „์ฒด์ ์ธ ํ†ต๊ณ„ ์ •๋ณด๋ฅผ ๊ณ ๋ คํ•˜๊ธฐ๋Š” ํ•˜์ง€๋งŒ '์ผ๋ณธ : ๋„์ฟ„ = ํ•œ๊ตญ : ์„œ์šธ' ๊ฐ™์€ ๋‹จ์–ด ์˜๋ฏธ์˜ ์œ ์ถ” ์ž‘์—…(Analogy task)์—๋Š” ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง„๋‹ค.

 

โœŒ Word2Vec

 

  • A prediction-based method that learns by reducing the error between the predicted and actual values through a loss function 
  • It outperforms LSA on word-analogy tasks, but because the embedding vectors only consider neighboring words inside the window, it fails to reflect the global statistics of the corpus. 

 

 

  • Vectorization objective : make (the dot product of the embedded center-word and context-word vectors) ≡ (the logarithm of their co-occurrence probability over the whole corpus).

 

 

  • The input to the glove model must be a co-occurrence matrix โญ 

 

 

โœ” ๋™์‹œ๋“ฑ์žฅํ–‰๋ ฌ 

 

์ „์น˜ํ•ด๋„ ๋™์ผํ•œ ํ–‰๋ ฌ์ด ๋œ๋‹ค๋Š” ํŠน์ง•์ด ์žˆ๋‹ค.

 

 

โœ” ๋™์‹œ๋“ฑ์žฅํ™•๋ฅ  

 

  • P(k|i) : a conditional probability computed from the co-occurrence matrix by counting the total number of appearances of a particular word i (the center word), and counting how many times a word k (a context word) appears when i appears 
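As a small numeric sketch (toy counts assumed here, not taken from the notebook): given a co-occurrence matrix X, P(k | i) is the entry X[i, k] divided by the total counts in row i:

```python
import numpy as np

# toy co-occurrence counts for a 3-word vocabulary ['ice', 'steam', 'water']
X = np.array([[0, 2, 6],
              [2, 0, 8],
              [6, 8, 0]])

def p(k, i):
    """P(k | i): co-occurrences of i with k, over all co-occurrences of i."""
    return X[i, k] / X[i].sum()

print(p(2, 0))  # P(water | ice) = 6 / (0 + 2 + 6) = 0.75
```

The probabilities in each row sum to 1, as expected for a conditional distribution.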

 

 

โœ” ์†์‹คํ•จ์ˆ˜ 

 

 

 

 

 

๐Ÿ•‍๐Ÿฆบ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ import 


 

 

๐Ÿพ  glove library 

 

pip install glove_python_binary # Colab environment 
# or, alternatively 
pip install glove_python

 

 

(1) Build the co-occurrence matrix

 

 

  • Sentence/word preprocessing + tokenization must be done beforehand!

 

from glove import Corpus, Glove

corpus = Corpus() # builds the vocab dictionary (provided by the glove library)

# build the co-occurrence matrix GloVe will use from the training data 
corpus.fit(result, window = 5) # result : the tokenized sentences prepared earlier
## the corpus input must be โญ tokenized data (a list of token lists)
## โญ window : number of surrounding words to consider

 

 

The co-occurrence matrix (the actual entries can be inspected with print)

 

๐Ÿ‘‰ We can confirm there are 54,775 distinct words!

 

 

๊ฐ ์›์†Œ๊ฐ’ : window ๋‚ด ๋™์‹œ ๋“ฑ์žฅ ๋นˆ๋„

 

 

 

(2) Train the GloVe model 

 

 

# ๋ชจ๋ธ ํ•™์Šต 
glove = Glove(no_components = 100, learning_rate = 0.05) 
## โญ no_compoenents : 100์ฐจ์›์œผ๋กœ ์ž„๋ฒ ๋”ฉ๋œ ๋ฒกํ„ฐ๋ฅผ output ์œผ๋กœ ๋‚˜์˜ค๋„๋ก 

# ํ•™์Šต์— ์ด์šฉํ•  thread ๊ฐœ์ˆ˜๋Š” 4๋กœ ์„ค์ •. ์—ํฌํฌ๋Š” 20 
glove.fit(corpus.matrix, epochs = 20 , no_threads = 4, verbose = False)
## โญ corpus.matrix : ๋™์‹œ๋“ฑ์žฅํ–‰๋ ฌ

 

  • glove takes the co-occurrence matrix as its input. 

 

 

(3) Register the matrix index information

 

glove.add_dictionary(corpus.dictionary)

 

  • Similar-word search requires the vocabulary entries that correspond to each row and column index of the sparse matrix ๐Ÿ‘‰ supply the dictionary through the add_dictionary function

 

 

 

 

 

(4) Check the results 

 

print(glove.most_similar("man"))
print(glove.most_similar("america"))

 

 

 

 

 

๐Ÿพ  gensim library 

 

# GloVe is a different word embedding that uses a slightly different technique from word2vec.
# (these pre-trained vectors are English only)
# It nevertheless exposes the same attributes and API.
import gensim
import gensim.downloader as api

model = api.load("glove-wiki-gigaword-50")
model

# or load the downloaded file directly in word2vec format
model = gensim.models.KeyedVectors.load_word2vec_format('~/gensim-data/glove-wiki-gigaword-50/glove-wiki-gigaword-50.gz')
print(model)

 

 

 

 

 

 

 

2๏ธโƒฃ  pre-trained glove 


 

โœ” Pre-trained models

 

https://wikidocs.net/33793

 


 

 

  • A method that loads embedding vectors pre-trained with Word2vec, FastText, GloVe, etc. on a huge corpus such as Wikipedia 
  • Loading a pre-trained model is a common practice in Kaggle competitions 
  • Instead of initializing a model's weights with random values, initialize them with weights trained on another problem. The original problem you then want to solve using the pre-trained weights is called the downstream task.
  • Downloading the pre-trained data takes quite a while

 

 

โ˜ ์‹ค์Šต ์˜ˆ์ œ : ๋ฌธ์žฅ์˜ ๊ธ๋ถ€์ •์„ ํŒ๋‹จํ•˜๋Š” ๊ฐ์„ฑ๋ถ„๋ฅ˜ ๋ชจ๋ธ ๋งŒ๋“ค๊ธฐ

 

 

(1) ์ผ€๋ผ์Šค embedding layer 

 

๐Ÿ‘€ ์ผ€๋ผ์Šค์˜ ์ž„๋ฒ ๋”ฉ์ธต 

 

  • ์ผ€๋ผ์Šค๋Š” ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ์›Œ๋“œ ์ž„๋ฒ ๋”ฉ์„ ์ˆ˜ํ–‰ํ•˜๋Š” Embedding  layer ๋ฅผ ์ œ๊ณตํ•œ๋‹ค.
  • ํ˜„์žฌ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ๋กœ ์ฒ˜์Œ๋ถ€ํ„ฐ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๋Š” ๋ฐฉ๋ฒ• : Embedding( )
  • ์–ด๋–ค ๋‹จ์–ด → ๋‹จ์–ด์— ๋ถ€์—ฌ๋œ ๊ณ ์œ ํ•œ ์ •์ˆ˜๊ฐ’ → ์ž„๋ฒ ๋”ฉ์ธต → ๋ฐ€์ง‘๋ฒกํ„ฐ(=์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ)
  • ๋ฐ€์ง‘๋ฒกํ„ฐ๋Š” ์ธ๊ณต์‹ ๊ฒฝ๋ง์˜ ํ•™์Šต๊ณผ์ •์—์„œ ๊ฐ€์ค‘์น˜๊ฐ€ ํ•™์Šต๋˜๋Š” ๊ฒƒ๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ํ›ˆ๋ จ๋œ๋‹ค.

 

 

 

๐Ÿ’จ 'great' was mapped to the integer 1,918 during integer encoding, so the row at index 1,918 of a lookup table with as many rows as the vocabulary size is used as the embedding vector for 'great'.

 

๐Ÿ’จ The embedding vector becomes the model's input, and the embedding vector for 'great' is learned during backpropagation.

 

 

 

๐Ÿพ  Embedding

 

 

vocab_size = 20000
output_dim = 128
input_length = 500

v = Embedding(vocab_size, output_dim, input_length=input_length)

 

  • vocab_size = size of the whole vocabulary of the text data
  • output_dim = dimensionality of the embedding vectors after word embedding
  • input_length = length of the input sequences; if every sample you have is 500 tokens long, this value is 500. 
  • Embedding takes a 2D integer tensor as input and returns a 3D real-valued tensor as output. 

 

 

1. ๋ฌธ์žฅ์˜ ๊ธ๋ถ€์ •์„ ํŒ๋‹จํ•˜๋Š” ๊ฐ์„ฑ๋ถ„๋ฅ˜ ๋ชจ๋ธ 

 

import numpy as np 
from tensorflow.keras.preprocessing.text import Tokenizer 
from tensorflow.keras.preprocessing.sequence import pad_sequences 

sentences = ['nice great best amazing', 'stop lies', 'pitiful nerd', 'excellent work', 'supreme quality', 'bad', 'highly respectable']
y_train = [1, 0, 0, 1, 1, 0, 1]

 

 

2. ์ผ€๋ผ์Šค ํ† ํฌ๋‚˜์ด์ €๋ฅผ ์‚ฌ์šฉํ•ด ๋‹จ์–ด ์ง‘ํ•ฉ ์ƒ์„ฑ 

 

tokenizer = Tokenizer() 
tokenizer.fit_on_texts(sentences) 
vocab_size = len(tokenizer.word_index) + 1 # ๐Ÿ“Œ +1 to account for the padding token (index 0)
print('vocabulary size : ', vocab_size)

 

 

 

 

 

3. Integer encoding 

 

X_encoded = tokenizer.texts_to_sequences(sentences) 
print('integer encoding result : ', X_encoded)

 

 

 

 

 

4. Padding 

 

max_len = max(len(seq) for seq in X_encoded) # pad every sequence to the length of the longest one
X_train = pad_sequences(X_encoded, maxlen=max_len, padding='post')
y_train = np.array(y_train)
print('padding result : ', X_train)

 

 

 

 

5. ๋ชจ๋ธ ์ƒ์„ฑ ๋ฐ ํ›ˆ๋ จ 

 

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten

embedding_dim = 4 

model = Sequential() 
model.add(Embedding(vocab_size, embedding_dim, input_length = max_len)) 
model.add(Flatten()) 
model.add(Dense(1, activation='sigmoid')) 

model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics=['acc']) 
model.fit(X_train, y_train, epochs=100, verbose=2)

 

During training, the values of each word's embedding vector are learned together with the weights of the output layer.

 

 

 

(2) Pre-trained word embeddings 

 

1. Load the pre-trained GloVe vectors 

 

  • Download glove.6B.zip, unzip it, and use the glove.6B.100d.txt file inside to load the pre-trained embedding vectors it contains. There are 400,000 word embedding vectors. 

 

from urllib.request import urlretrieve
import zipfile

urlretrieve("http://nlp.stanford.edu/data/glove.6B.zip", filename="glove.6B.zip")
zf = zipfile.ZipFile('glove.6B.zip')
zf.extractall() 
zf.close()


# โญ load the pre-trained GloVe vectors 


import numpy as np

embedding_dict = dict() # pre-trained embedding dictionary, loaded as {word: vector} pairs 

f = open('glove.6B.100d.txt', encoding="utf8") # load every embedding vector in glove.6B.100d.txt

for line in f:
    word_vector = line.split()
    word = word_vector[0]

    # convert the remaining 100 values into an array
    word_vector_arr = np.asarray(word_vector[1:], dtype='float32')
    embedding_dict[word] = word_vector_arr
    
f.close()

print('%s embedding vectors loaded.' % len(embedding_dict)) 

# loads the 400,000 embedding vectors

 

2. Check the embedding vector for the word 'respectable' 

 

 

 

 

๋ฌธ์žฅ ๊ธ๋ถ€์ • ์˜ˆ์ œ ๋ฌธ์žฅ์—์„œ tokenizer ์ ์šฉํ•œ ๊ฒƒ (1๋ฒˆ keras ์ฝ”๋“œ ์ฐธ๊ณ ) 

print(tokenizer.word_index.items()) # ๊ธฐ์กด ๋ฐ์ดํ„ฐ์˜ ๊ฐ ๋‹จ์–ด์™€ ๋งคํ•‘๋œ ์ •์ˆ˜๊ฐ’

###################
dict_items([('nice', 1), ('great', 2), ('best', 3), ('amazing', 4), ('stop', 5), ('lies', 6), ('pitiful', 7), ('nerd', 8), ('excellent', 9), ('work', 10), ('supreme', 11), ('quality', 12), ('bad', 13), ('highly', 14), ('respectable', 15)])



print("integer mapped to the word 'great' :", tokenizer.word_index['great'])

###################
integer mapped to the word 'great' : 2

 

 

3. From the pre-trained word-vector table, extract the values so that they map to the word-index dictionary built from our sentences 

 

embedding_matrix = np.zeros((vocab_size, 100)) # one 100-dim row per vocabulary index (row 0 stays zero for padding)

for word, index in tokenizer.word_index.items():
    # the pre-trained embedding vector mapped to this word
    vector_value = embedding_dict.get(word)
    if vector_value is not None:
        embedding_matrix[index] = vector_value

The embedding vector for 'great' is assigned to index = 2 and stored anew.
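As a self-contained toy illustration of this row assignment (hypothetical 4-dimensional vectors, not the real GloVe values):

```python
import numpy as np

# hypothetical pre-trained vectors and a tiny word index (like tokenizer.word_index)
embedding_dict = {'great': np.array([0.1, 0.2, 0.3, 0.4], dtype='float32')}
word_index = {'nice': 1, 'great': 2, 'best': 3}

embedding_matrix = np.zeros((len(word_index) + 1, 4))  # row 0 reserved for padding
for word, index in word_index.items():
    vector_value = embedding_dict.get(word)
    if vector_value is not None:       # words missing from the dict keep zero rows
        embedding_matrix[index] = vector_value

print(embedding_matrix[2])  # the 'great' vector now sits at row index 2
```

Words not covered by the pre-trained dictionary ('nice', 'best' here) keep all-zero rows, which is the same fallback behavior as the notebook's loop.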

 

 

4. ๋ชจ๋ธ ํ›ˆ๋ จ 

 

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten

output_dim = 100

model = Sequential() 

# โญ initialize the embedding layer with embedding_matrix.
# the pre-trained word embeddings are 100-dimensional, so the embedding layer's output_dim argument is 100
e = Embedding(vocab_size, output_dim, weights=[embedding_matrix], input_length=max_len, trainable=False)

# โญ trainable = False : use the pre-trained embeddings as-is (no further training)

model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X_train, y_train, epochs=100, verbose=0)

 

 

 

 

 

 

3๏ธโƒฃ fine tuning glove 


 

โœ” Fine-Tuning

 

  • A method that trains the model further by adding a minimal set of weights for the downstream task on top of all the pre-trained weights.
  • Fine-tuning is used when the pre-trained model is missing words that appear in your dataset, or when the dataset is too small to train the entire model. 

 

โœ” Mittens 

 

  • A package that improves speed when working with pre-trained GloVe models

 

pip install -U mittens
from mittens import GloVe

 

 

https://github.com/roamanalytics/mittens

 

GitHub - roamanalytics/mittens: A fast implementation of GloVe, with optional retrofitting

 

 

 

 

 

 

 

 

 
