
[Deep Learning PyTorch Textbook] Embeddings for Natural Language Processing

by isdawell 2022. 12. 30.

๐Ÿ‘€ Embedding


Embedding: the result of converting human language into a vector, a numeric form that a computer can understand.


Roles of an embedding

โ†ช Computing relatedness between words and sentences

โ†ช Capturing semantic or grammatical information (e.g., king-queen, teacher-student)


โ‘  ํฌ์†Œํ‘œํ˜„ ๊ธฐ๋ฐ˜ ์ž„๋ฒ ๋”ฉ : ์›ํ•ซ์ธ์ฝ”๋”ฉ 

 

• Sparse representation: a representation in which most values are 0; the representative method is one-hot encoding.

• One-hot encoding: represents each of N words as an N-dimensional vector.

 

 

from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
onehot_encoder = preprocessing.OneHotEncoder()

# data: a DataFrame with a categorical column (from the book's example)
a = label_encoder.fit_transform(data['column'])          # strings -> integer labels
onehot = onehot_encoder.fit_transform(a.reshape(-1, 1))  # integer labels -> one-hot rows

 

์›ํ•ซ์ธ์ฝ”๋”ฉ์˜ ๋‹จ์  

 

โ†ช Mathematically, a one-hot vector is a sparse vector with a single 1 and 0 everywhere else, so the dot product of any two one-hot word vectors is 0 and the vectors are orthogonal. Words therefore end up mutually independent, with no relations (synonyms, antonyms) between them; the short check below illustrates this.

โ†ช Representing a single word takes as many dimensions as there are words in the corpus, so one-hot encoding runs into the curse of dimensionality.

 

๐Ÿ‘‰ Alternatives: neural-network-based methods that turn words into vectors, such as Word2Vec, GloVe, and FastText, have been drawing attention.

โ‘ก Count-based embedding: CountVectorizer, TF-IDF

 

• Methods that embed words according to how often they occur.

 

 

(1) Count vector 

 

• Tokenizes the words in a document collection and builds a vector for each document by encoding each word's occurrence count.

• Tokenization and vectorization are done in one step.

 

๋ฌธ์„œ๋ฅผ ํ† ํฐ ๋ฆฌ์ŠคํŠธ๋กœ ๋ณ€ํ™˜ → ๊ฐ ๋ฌธ์„œ์—์„œ ํ† ํฐ์˜ ์ถœํ˜„ ๋นˆ๋„๋ฅผ ์นด์šดํŠธ → ๊ฐ ๋ฌธ์„œ๋ฅผ ์ธ์ฝ”๋”ฉํ•˜๊ณ  ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ 

 

from sklearn.feature_extraction.text import CountVectorizer 

corpus = [
    'This is last chance.',
    'and if you do not have this chance.',
    'you will never get any chance.',
    'will you do get this one?',
    'please, get this chance',
]

vect = CountVectorizer() 

vect.fit(corpus) 
vect.vocabulary_  # the fitted vocabulary (word -> column index)

 

# convert the count-vector result for a sentence into a dense array

vect.transform(['you will never get any chance']).toarray()
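
For reference, the whole corpus can be vectorized in one call, and the column order read back from the fitted vocabulary (a sketch; get_feature_names_out assumes scikit-learn 1.0+, older versions use get_feature_names):

X = vect.transform(corpus).toarray()
print(vect.get_feature_names_out())  # vocabulary in column order
print(X)                             # one row of token counts per document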

 

 

(2) TF-IDF 

 

• An algorithm from information retrieval used to compute term weights.

• Term Frequency (TF): how often a given word occurs within a document.

• Inverse Document Frequency (IDF): the inverse of DF, where DF (document frequency, the number of documents containing a given word) measures how common the word is across the whole document set.

 

 

โ†ช Used by keyword-based search engines, for extracting key terms, and for ranking search results.

 

from sklearn.feature_extraction.text import TfidfVectorizer

doc = ['I like machine learning', 'I love deep learning', 'I run everyday']
tfidf_vectorizer = TfidfVectorizer(min_df = 1)
# min_df: minimum document frequency; words whose DF (the number of documents
## containing the word) falls below this value are dropped from the vocabulary

tfidf_matrix = tfidf_vectorizer.fit_transform(doc)
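
To inspect the fitted result (a sketch; the exact numbers depend on sklearn's smoothing and L2 normalization, discussed below):

print(tfidf_matrix.shape)                        # (3, number of vocabulary words)
print(tfidf_vectorizer.get_feature_names_out())  # vocabulary in column order (sklearn >= 1.0)
print(tfidf_matrix.toarray())                    # dense TF-IDF weight matrix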

 

TF-IDF ๊ฐ’์€ ํŠน์ • ๋ฌธ์„œ ๋‚ด์—์„œ ๋‹จ์–ด์˜ ์ถœํ˜„๋นˆ๋„๊ฐ€ ๋†’๊ฑฐ๋‚˜ ์ „์ฒด ๋ฌธ์„œ์—์„œ ํŠน์ • ๋‹จ์–ด๊ฐ€ ํฌํ•จ๋œ ๋ฌธ์„œ๊ฐ€ ์ ์„์ˆ˜๋ก ๊ฐ’์ด ๋†’๋‹ค. ๋”ฐ๋ผ์„œ ์ด ๊ฐ’์„ ํ™œ์šฉํ•ด ๋ฌธ์„œ์— ๋‚˜ํƒ€๋‚˜๋Š” ํ”ํ•œ ๋‹จ์–ด (a, the) ๋“ค์„ ๊ฑธ๋Ÿฌ๋‚ด๊ฑฐ๋‚˜ ํŠน์ • ๋‹จ์–ด์— ๋Œ€ํ•œ ์ค‘์š”๋„๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค. 


โ‘ข Prediction-based embedding

 

• Builds word vectors with a neural network structure by "predicting" which word appears in a given context.


(1) Word2Vec 

 

• ํ…์ŠคํŠธ์˜ ๊ฐ ๋‹จ์–ด๋งˆ๋‹ค ํ•˜๋‚˜์”ฉ ์ผ๋ จ์˜ ๋ฒกํ„ฐ๋ฅผ ์ถœ๋ ฅํ•œ๋‹ค. 

• word2vec ์˜ ์ถœ๋ ฅ ๋ฒกํ„ฐ๊ฐ€ 2์ฐจ์› ๊ทธ๋ž˜ํ”„์— ํ‘œ์‹œ๋  ๋•Œ ์˜๋ฏธ๋ก ์ ์œผ๋กœ ์œ ์‚ฌํ•œ ๋‹จ์–ด์˜ ๋ฒกํ„ฐ๋Š” ์„œ๋กœ ๊ฐ€๊น๊ฒŒ ํ‘œํ˜„๋œ๋‹ค. ์ฆ‰, ํŠน์ • ๋‹จ์–ด์˜ ๋™์˜์–ด๋ฅผ ์ฐพ์„ ์ˆ˜ ์žˆ๋‹ค. 

 

import gensim
from gensim.models import Word2Vec

 

๋Œ€์ƒ ๋‹จ์–ด์™€ ์ปจํ…์ŠคํŠธ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋˜์–ด ๋„คํŠธ์›Œํฌ์— ๊ณต๊ธ‰๋œ๋‹ค.

 

→ ์€๋‹‰์ธต์—๋Š” ๊ฐ ๋‹จ์–ด์— ๋Œ€ํ•œ ๊ฐ€์ค‘์น˜๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ๋‹ค. 

 

 

๐Ÿ”น CBOW 

 

• Predicts the center word from the surrounding words.

• The weight matrix is shared by all words.

 

 

โ†ช ํฌ๊ธฐ๊ฐ€ N ์ธ ์€๋‹‰์ธต : N ์€ ์ž…๋ ฅ ํ…์ŠคํŠธ๋ฅผ ์ž„๋ฒ ๋”ฉํ•œ ๋ฒกํ„ฐ ํฌ๊ธฐ 

โ†ช V : ๋‹จ์–ด์ง‘ํ•ฉ์˜ ํฌ๊ธฐ 

โ†ช ์€๋‹‰์ธต์— ์ž„๋ฒ ๋”ฉ๋œ ๋ฒกํ„ฐ ์ฐจ์›์„ ๋‹จ์–ด์˜ ๋ฒกํ„ฐ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ!

 

model1 = gensim.models.Word2Vec(data, min_count=1, size = 100, window = 5, sg = 0)

# data : a list of tokenized documents (word lists)
# min_count : minimum frequency; words rarer than this are not trained
# size : dimension of the embedding vectors (renamed vector_size in gensim >= 4.0)
# window : size of the context window
# sg = 0 : CBOW, sg = 1 : skip-gram
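
Once training finishes, the hidden-layer weights can be read back per word (a sketch; 'chance' is a hypothetical word assumed to appear in data):

vec = model1.wv['chance']                        # the 100-dimensional vector for one word
print(model1.wv.most_similar('chance', topn=3))  # nearest words by cosine similarity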

 

 

๐Ÿ”น Skip-gram 

 

 

• Predicts the surrounding words from the center word.

• Treats the k words around the input word as the context and builds the prediction model.

 

model2 = gensim.models.Word2Vec(data, min_count=1, size=100, window=5, sg=1)


# word-to-word similarity
model2.wv.similarity('peter','wendy')
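
To see the earlier claim that similar words cluster on a 2D graph, the learned vectors can be projected with PCA (a sketch assuming matplotlib and scikit-learn are available; key_to_index is the gensim >= 4.0 attribute, index2word in 3.x):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = list(model2.wv.key_to_index)[:30]   # first 30 vocabulary words
coords = PCA(n_components=2).fit_transform([model2.wv[w] for w in words])

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()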

 

๋ฐ์ดํ„ฐ ์„ฑ๊ฒฉ, ๋ถ„์„์— ๋Œ€ํ•œ ์ ‘๊ทผ ๋ฐฉ๋ฒ• ๋ฐ ๋„์ถœํ•˜๊ณ ์ž ํ•˜๋Š” ๊ฒฐ๋ก  ๋“ฑ์„ ์ข…ํ•ฉ์ ์œผ๋กœ ๊ณ ๋ คํ•ด ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. 


(2) FastText

 

 ํŒจ์ŠคํŠธํ…์ŠคํŠธ

 

Word2vec ์˜ ๋‹จ์ ์„ ๋ณด์™„ํ•ด ๊ฐœ๋ฐœ๋œ ์ž„๋ฒ ๋”ฉ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค. Word2vec ์€ ๋ถ„์‚ฐํ‘œํ˜„์„ ์ด์šฉํ•ด ๋ถ„์‚ฐ๋ถ„ํฌ๊ฐ€ ์œ ์‚ฌํ•œ ๋‹จ์–ด๋“ค์— ๋น„์Šทํ•œ ๋ฒกํ„ฐ๊ฐ’์„ ํ• ๋‹นํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ์‚ฌ์ „์— ์—†๋Š” ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ๋Š” ๋ฒกํ„ฐ ๊ฐ’์„ ์–ป์„ ์ˆ˜ ์—†๋‹ค๋Š” ๋‹จ์ ๊ณผ, ์ž์ฃผ ์‚ฌ์šฉ๋˜์ง€ ์•Š๋Š” ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ ํ•™์Šต์ด ๋ถˆ์•ˆ์ •ํ•˜๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค. 

 

 

FastText ๋Š” ๋…ธ์ด์ฆˆ์— ๊ฐ•ํ•˜๋ฉฐ ์ƒˆ๋กœ์šด ๋‹จ์–ด์— ๋Œ€ํ•ด์„œ๋Š” "ํ˜•ํƒœ์  ์œ ์‚ฌ์„ฑ" ์„ ๊ณ ๋ คํ•œ ๋ฒกํ„ฐ ๊ฐ’์„ ์–ป์–ด ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ ์ค‘ ํ•˜๋‚˜๋‹ค. 

 

 

To assign vectors to words missing from its vocabulary, FastText represents each word in a document as a set of n-grams. After training the neural network, every word in the dataset is embedded through its n-grams, so when an unseen word appears, its meaning can be inferred by computing similarity against its n-gram subwords.
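
For example, with character trigrams a word is split as follows (a minimal sketch; < and > mark word boundaries, following the FastText convention):

def char_ngrams(word, n=3):
    # split a word into character n-grams, with < and > marking the word boundaries
    padded = '<' + word + '>'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams('where'))  # ['<wh', 'whe', 'her', 'ere', 're>']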

 

 

from gensim.test.utils import common_texts
from gensim.models import FastText

model = FastText(data, size = 4, window = 3, min_count = 1)

# data : tokenized sentences (word lists)
# size : embedding dimension (renamed vector_size in gensim >= 4.0)
# window : context width (three words on either side)
# min_count : minimum frequency; words rarer than this are not trained


# word-to-word similarity
model.wv.similarity('peter', 'wendy')
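
Unlike Word2Vec, the trained model can therefore return a vector even for a word it has never seen, assembled from that word's n-grams (a sketch; 'peterrr' is an invented out-of-vocabulary token, and key_to_index assumes gensim >= 4.0):

print('peterrr' in model.wv.key_to_index)  # False: not in the vocabulary
print(model.wv['peterrr'][:2])             # a vector is still produced from subword n-grams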

 

 

์‚ฌ์ „ํ›ˆ๋ จ๋œ ํŒจ์ŠคํŠธํ…์ŠคํŠธ ๋ชจ๋ธ ์‚ฌ์šฉ ์˜ˆ์ œ 

 

Download a Korean model pretrained on Wikipedia data and use it.

 

from __future__ import print_function
from gensim.models import KeyedVectors  # load a pretrained model through gensim

model_kr = KeyedVectors.load_word2vec_format('wiki.ko.vec')  # load the pretrained vectors


find_similar_to = '๋…ธ๋ ฅ'  # check which words are similar to '๋…ธ๋ ฅ' (effort), and how similar

for similar_word in model_kr.similar_by_word(find_similar_to):  # similar_by_word
  print("Word: {0}, Similarity: {1:.2f}".format(
        similar_word[0], similar_word[1]
    ))

 

๋…ธ๋ ฅ์ด๋ผ๋Š” ๋‹จ์–ด์— ์กฐ์‚ฌ๊ฐ€ ๋ถ™์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. 

 

 

similarities = model_kr.most_similar(positive=['๋™๋ฌผ', '์œก์‹๋™๋ฌผ'], negative=['์‚ฌ๋žŒ'])  # most_similar
print(similarities)

# positive toward "๋™๋ฌผ" (animal) and "์œก์‹๋™๋ฌผ" (carnivore), negative toward "์‚ฌ๋žŒ" (person)

# prints words related to animals while unrelated to people


โ‘ฃ Count/prediction-based embedding

 

• GloVe

 

ํšŸ์ˆ˜๊ธฐ๋ฐ˜์˜ LSA ์™€ ์˜ˆ์ธก๊ธฐ๋ฐ˜์˜ Word2vec ๋‹จ์ ์„ ๋ณด์™„ํ•˜๊ธฐ ์œ„ํ•œ ๋ชจ๋ธ์ด๋‹ค. ๊ธ€๋กœ๋ฒŒ ๋™์‹œ๋ฐœ์ƒ ํ™•๋ฅ ์„ ํฌํ•จํ•˜๋Š” ๋‹จ์–ด ์ž„๋ฒ ๋”ฉ ๋ฐฉ๋ฒ•์œผ๋กœ, ๋‹จ์–ด์— ๋Œ€ํ•œ ํ†ต๊ณ„ ์ •๋ณด์™€ skip-gram ์„ ํ•ฉ์นœ ๋ฐฉ์‹์ด๋ผ๊ณ  ํ•  ์ˆ˜ ์žˆ๋‹ค. 


from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec


glove_file = datapath(r'..\chap10\data\glove.6B.100d.txt')  # 100-dimensional embedding vectors for a large vocabulary (raw string keeps the backslashes literal)
word2vec_glove_file = get_tmpfile("glove.6B.100d.word2vec.txt")
glove2word2vec(glove_file, word2vec_glove_file)  # convert the GloVe file to word2vec format



model = KeyedVectors.load_word2vec_format(word2vec_glove_file)  # load the converted word2vec-format vectors

 

# list the words most similar to 'bill'
model.most_similar('bill')


# list words unrelated to 'cherry'
model.most_similar(negative=['cherry'])


# return words similar to 'woman' and 'king' but unrelated to 'man'
model.most_similar(positive=['woman', 'king'], negative=['man'])


# return the word least like the others among 'breakfast cereal dinner lunch'
model.doesnt_match("breakfast cereal dinner lunch".split())
