
Text Analysis ①

by isdawell 2022. 5. 14.

๐Ÿ“Œ ํŒŒ์ด์ฌ ๋จธ์‹ ๋Ÿฌ๋‹ ์™„๋ฒฝ๊ฐ€์ด๋“œ ๊ณต๋ถ€ ๋‚ด์šฉ ์ •๋ฆฌ 

 

 

๐Ÿ“Œ Practice code

 

https://colab.research.google.com/drive/1UzQNyu-rafb1SQEDcQCeCyYO54ECgULT?usp=sharing 

 

08. ํ…์ŠคํŠธ ๋ถ„์„.ipynb (Colab notebook)

 

 

1๏ธโƒฃ ํ…์ŠคํŠธ ๋ถ„์„์˜ ์ดํ•ด 


๐Ÿ‘€ ๊ฐœ์š” 

 

๐Ÿ’ก NLP ์™€ ํ…์ŠคํŠธ ๋งˆ์ด๋‹

 

 

โœ” NLP 

 

  • ์ธ๊ฐ„์˜ ์–ธ์–ด๋ฅผ ์ดํ•ดํ•˜๊ณ  ํ•ด์„ํ•˜๋Š”๋ฐ ์ค‘์ ์„ ๋‘๊ณ  ๋ฐœ์ „ 
  • ํ…์ŠคํŠธ ๋งˆ์ด๋‹์„ ํ–ฅ์ƒํ•˜๊ฒŒ ํ•˜๋Š” ๊ธฐ๋ฐ˜ ๊ธฐ์ˆ  
  • ๊ธฐ๊ณ„๋ฒˆ์—ญ, ์งˆ์˜์‘๋‹ต ์‹œ์Šคํ…œ ๋“ฑ 

 

โœ” ํ…์ŠคํŠธ ๋งˆ์ด๋‹

 

  • ๋น„์ •ํ˜• ํ…์ŠคํŠธ์—์„œ ์˜๋ฏธ์žˆ๋Š” ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ฒƒ์— ์ค‘์  

 

1. ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ : ๋ฌธ์„œ๊ฐ€ ํŠน์ • ๋ถ„๋ฅ˜ ๋˜๋Š” ์นดํ…Œ๊ณ ๋ฆฌ์— ์†ํ•˜๋Š” ๊ฒƒ์„ ์˜ˆ์ธกํ•˜๋Š” ๊ธฐ๋ฒ• 
   ex. ์‹ ๋ฌธ ๊ธฐ์‚ฌ ์นดํ…Œ๊ณ ๋ฆฌ ๋ถ„๋ฅ˜, ์ŠคํŒธ๋ฉ”์ผ ๊ฒ€์ถœ๊ณผ ๊ฐ™์€ ์ง€๋„ํ•™์Šต 

2. ๊ฐ์„ฑ๋ถ„์„ : ํ…์ŠคํŠธ์—์„œ ๋‚˜ํƒ€๋‚˜๋Š” ๊ฐ์ •/ํŒ๋‹จ/๊ธฐ๋ถ„ ๋“ฑ ์ฃผ๊ด€์  ์š”์†Œ๋ฅผ ๋ถ„์„ 
   ex. ์ œํ’ˆ ๋ฆฌ๋ทฐ ๋ถ„์„, ์˜๊ฒฌ๋ถ„์„ ๋“ฑ ๋น„์ง€๋„ ํ•™์Šต & ์ง€๋„ํ•™์Šต ์ ์šฉ 

3. ํ…์ŠคํŠธ ์š”์•ฝ : ์ค‘์š”ํ•œ ์ฃผ์ œ๋‚˜ ์ค‘์‹ฌ ์‚ฌ์ƒ์„ ์ถ”์ถœํ•˜๋Š” ๊ธฐ๋ฒ• 
   ex. ํ† ํ”ฝ ๋ชจ๋ธ๋ง 

4. ํ…์ŠคํŠธ ๊ตฐ์ง‘ํ™”์™€ ์œ ์‚ฌ๋„ ์ธก์ • : ๋น„์Šทํ•œ ์œ ํ˜•์˜ ๋ฌธ์„œ์— ๋Œ€ํ•ด ๊ตฐ์ง‘ํ™”/์œ ์‚ฌ๋„ ์ธก์ • ์ˆ˜ํ–‰   

 

 

๐Ÿ’ก ํ…์ŠคํŠธ ๋ถ„์„์˜ ์ดํ•ด 

 

โœ” ํ”ผ์ฒ˜ ์ถ”์ถœ (ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”) 

 

  • ํ…์ŠคํŠธ๋ฅผ ๋‹จ์–ด ๊ธฐ๋ฐ˜์˜ ๋‹ค์ˆ˜์˜ ํ”ผ์ฒ˜๋กœ ์ถ”์ถœํ•˜๊ณ , ์ด ํ”ผ์ฒ˜์— ๋‹จ์–ด ๋นˆ๋„์ˆ˜์™€ ๊ฐ™์€ ์ˆซ์ž๊ฐ’์„ ๋ถ€์—ฌ ๐Ÿ‘‰ ํ…์ŠคํŠธ๋Š” ๋‹จ์–ด์˜ ์กฐํ•ฉ์ธ ๋ฒกํ„ฐ๊ฐ’์œผ๋กœ ํ‘œํ˜„
  • ํ…์ŠคํŠธ๋ฅผ ๋ฒกํ„ฐ๊ฐ’์„ ๊ฐ€์ง€๋Š” ํ”ผ์ฒ˜๋กœ ๋ณ€ํ™˜ํ•˜๋Š” ๊ฒƒ์€ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ชจ๋ธ์„ ์ ์šฉํ•˜๊ธฐ ์ „์— ์ˆ˜ํ–‰ํ•ด์•ผํ•  ๋งค์šฐ ์ค‘์š”ํ•œ ์š”์†Œ์ด๋‹ค.
  • ๋Œ€ํ‘œ์ ์ธ ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™” ๋ณ€ํ™˜ ๋ฐฉ๋ฒ• : BoW, Word2Vec 

 

โœ” ๋ถ„์„ ์ˆ˜ํ–‰ ํ”„๋กœ์„ธ์Šค 

 

1. ํ…์ŠคํŠธ ์‚ฌ์ „ ์ค€๋น„์ž‘์—… (์ „์ฒ˜๋ฆฌ) 

  - ๋Œ€/์†Œ๋ฌธ์ž ๋ณ€๊ฒฝ, ํŠน์ˆ˜๋ฌธ์ž ์‚ญ์ œ ๋“ฑ์˜ ํด๋ Œ์ง• 
  - ๋‹จ์–ด ๋“ฑ์˜ ํ† ํฐํ™” ์ž‘์—… 
  - ์˜๋ฏธ์—†๋Š” ๋‹จ์–ด ์ œ๊ฑฐ ์ž‘์—… 
  - ์–ด๊ทผ ์ถ”์ถœ (Stemming, Lemmatization) 

2. ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”/์ถ”์ถœ 
  - ๊ฐ€๊ณต๋œ ํ…์ŠคํŠธ์—์„œ ํ”ผ์ฒ˜๋ฅผ ์ถ”์ถœํ•˜๊ณ  ๋ฒกํ„ฐ ๊ฐ’์„ ํ• ๋‹นํ•œ๋‹ค. 
  - BoW ๐Ÿ‘‰ Count ๊ธฐ๋ฐ˜, TF-IDF ๊ธฐ๋ฐ˜ 
  - Word2vec 


3. ML ๋ชจ๋ธ ์ˆ˜๋ฆฝ ๋ฐ ํ•™์Šต/์˜ˆ์ธก/ํ‰๊ฐ€ 

 

๐Ÿ’ก ๋ถ„์„ ํŒจํ‚ค์ง€ 

 

โœ” NLTK

 

  • ๋Œ€ํ‘œ์ ์ธ ํŒŒ์ด์ฌ NLP ํŒจํ‚ค์ง€
  • ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ์„ธํŠธ์™€ ์„œ๋ธŒ ๋ชจ๋“ˆ์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ NLP ์˜ ๊ฑฐ์˜ ๋ชจ๋“  ์˜์—ญ์„ ์ปค๋ฒ„ํ•˜๊ณ  ์žˆ์Œ 
  • ์ˆ˜ํ–‰ ์†๋„ ์ธก๋ฉด์—์„œ ์•„์‰ฌ์›€ : ์‹ค์ œ ๋Œ€๋Ÿ‰ ๋ฐ์ดํ„ฐ ๊ธฐ๋ฐ˜์—์„œ๋Š” ์ œ๋Œ€๋กœ ํ™œ์šฉ๋˜์ง€ ๋ชปํ•จ

 

โœ” Gensim

 

  • The most prominent package in the topic modeling field.
  • Provides many features, including a Word2Vec implementation (see the sketch below).
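
Since Word2Vec comes up again in the feature-vectorization step, here is a minimal gensim sketch. The toy corpus is made up for illustration; vector_size/window/min_count are the gensim 4.x parameter names.

from gensim.models import Word2Vec

# toy corpus: each document is a list of tokens (illustrative only)
sentences = [['i', 'ate', 'an', 'apple'],
             ['i', 'ate', 'a', 'banana'],
             ['i', 'worked', 'at', 'home']]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1)
print(model.wv['apple'])               # the learned 50-dim vector for 'apple'
print(model.wv.most_similar('apple'))  # nearest neighbours in embedding space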

 

โœ” SpaCy

 

  • Excellent runtime performance; the package drawing the most attention recently.

 

๐Ÿƒ‍โ™€๏ธ ํ•œ๊ตญ์–ด ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ : KoNLPy

 

 

 

 

2๏ธโƒฃ ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ 


๐Ÿ‘€ ๊ฐœ์š” 

 

๐Ÿ’ก ํด๋ Œ์ง•

 

โœ” ๋ถˆํ•„์š”ํ•œ ๋ฌธ์ž, ๊ธฐํ˜ธ๋“ฑ์„ ์‚ฌ์ „์— ์ œ๊ฑฐํ•˜๋Š” ์ž‘์—… 

โœ” HTML, XML ํƒœ๊ทธ๋‚˜ ํŠน์ • ๊ธฐํ˜ธ ๋“ฑ์„ ์‚ฌ์ „์— ์ œ๊ฑฐํ•œ๋‹ค. 

 

๐Ÿ’ก Regular expressions
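
A minimal sketch of regex-based cleansing with Python's re module; the patterns and sample string are illustrative, not the book's exact ones:

import re

raw = '<p>Hello, World!! &amp; welcome</p>'

no_tags = re.sub(r'<[^>]+>', ' ', raw)          # drop HTML/XML tags
letters = re.sub(r'[^a-zA-Z]', ' ', no_tags)    # keep alphabetic characters only
cleaned = re.sub(r'\s+', ' ', letters).strip()  # collapse repeated whitespace

print(cleaned)  # 'Hello World amp welcome'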

 

 

๐Ÿ’ก ํ…์ŠคํŠธ ํ† ํฐํ™” 

 

โœ” ๋ฌธ์žฅ ํ† ํฐํ™” 

 

  • ๋ฌธ์žฅ์˜ ๋งˆ์นจํ‘œ๋‚˜ ๊ฐœํ–‰๋ฌธ์ž(\n) ๋“ฑ ๋ฌธ์žฅ์˜ ๋งˆ์ง€๋ง‰์„ ๋œปํ•˜๋Š” ๊ธฐํ˜ธ์— ๋”ฐ๋ผ ๋ถ„๋ฆฌํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ด๋‹ค. 
  • ํ˜น์€ ์ •๊ทœํ‘œํ˜„์‹์— ๋”ฐ๋ฅธ ๋ฌธ์žฅ ํ† ํฐํ™”๋„ ๊ฐ€๋Šฅํ•˜๋‹ค. 
  • sent_tokenize( ) ๊ฐ€ ๋ฐ˜ํ™˜ํ•˜๋Š” ๊ฒƒ์€ ๊ฐ ๋ฌธ์žฅ์œผ๋กœ ๊ตฌ์„ฑ๋œ ๋ฆฌ์ŠคํŠธ ๊ฐ์ฒด์ด๋‹ค. 

 

from nltk import sent_tokenize
import nltk
nltk.download('punkt')  # download the dataset covering periods, newline characters, etc.

text_sample = 'I ate an apple. \
I ate a banana. \
I worked at home.'
sentences = sent_tokenize(text=text_sample)

print(sentences)


### ๐Ÿ“Œ Result ####
['I ate an apple.', 'I ate a banana.', 'I worked at home.']

 

 

โœ” ๋‹จ์–ด ํ† ํฐํ™” 

 

  • ๋ฌธ์žฅ์„ ๋‹จ์–ด๋กœ ํ† ํฐํ™” 
  • word_tokenize( )
  • ๊ธฐ๋ณธ์ ์œผ๋กœ ๊ณต๋ฐฑ, ์ฝค๋งˆ, ๋งˆ์นจํ‘œ, ๊ฐœํ–‰๋ฌธ์ž ๋“ฑ์œผ๋กœ ๋‹จ์–ด๋ฅผ ๋ถ„๋ฆฌํ•˜๋‚˜, ์ •๊ทœํ‘œํ˜„์‹์„ ํ†ตํ•ด ํ† ํฐํ™”๋ฅผ ์ˆ˜ํ–‰ํ•  ์ˆ˜๋„ ์žˆ๋‹ค. 
  • BoW ์™€ ๊ฐ™์ด ๋‹จ์–ด์˜ ์ˆœ์„œ๊ฐ€  ์ค‘์š”ํ•˜์ง€ ์•Š๋‹ค๋ฉด, ๋ฌธ์žฅ ํ† ํฐํ™”๋ฅผ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  ๋‹จ์–ด ํ† ํฐํ™”๋งŒ ์‚ฌ์šฉํ•ด๋„ ์ถฉ๋ถ„ํ•˜๋‹ค. 

 

from nltk import word_tokenize

sentence = 'I ate the pizza.'
words = word_tokenize(sentence)
print(words)

## ๐Ÿ“Œ Result ##
['I', 'ate', 'the', 'pizza', '.']

 

๐Ÿ‘€ ์ผ๋ฐ˜์ ์œผ๋กœ, ํ•˜๋‚˜์˜ ๋ฌธ์„œ์— ๋Œ€ํ•ด ๋ฌธ์žฅ ํ† ํฐํ™”๋ฅผ ์ ์šฉํ•˜๊ณ , ๋ถ„๋ฆฌ๋œ ๊ฐ ๋ฌธ์žฅ์— ๋Œ€ํ•ด ๋‹ค์‹œ ๋‹จ์–ด ํ† ํฐํ™”๋ฅผ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ด๋‹ค. 

 

๐Ÿ‘€ ๋ฌธ์žฅ์„ ๋‹จ์–ด๋ณ„๋กœ ํ† ํฐํ™”ํ•  ๊ฒฝ์šฐ, ๋ฌธ๋งฅ์ ์ธ ์˜๋ฏธ๊ฐ€ ๋ฌด์‹œ๋œ๋‹ค ๐Ÿ‘‰ n-gram ์„ ํ†ตํ•ด ํ•ด๊ฒฐ์„ ์‹œ๋„ 

 

๐Ÿ‘€ n-gram : ์—ฐ์†๋œ n ๊ฐœ ๋‹จ์–ด๋ฅผ ํ•˜๋‚˜์˜ ํ† ํฐํ™” ๋‹จ์œ„๋กœ ๋ถ„๋ฆฌ 

   ex. ('I', 'ate'), ('ate', 'the'), ('the', 'pizza') 
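
A quick way to reproduce the bigrams above with NLTK's ngrams helper:

from nltk import ngrams, word_tokenize

words = word_tokenize('I ate the pizza')
bigrams = list(ngrams(words, 2))
print(bigrams)
# [('I', 'ate'), ('ate', 'the'), ('the', 'pizza')]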

 

 

 

 

๐Ÿ’ก Stop word removal

 

โœ” Stop words

 

  • Words that carry little meaning for the analysis.
  • Words with little contextual meaning, such as is, the, a, will. Because they occur so frequently within sentences, they can be mistaken for important words, so removing these meaningless words beforehand matters.
  • Stop-word lists are catalogued per language.

 

import nltk
from nltk import sent_tokenize, word_tokenize
nltk.download('stopwords')

stopwords = nltk.corpus.stopwords.words('english')

# word_tokens: each sentence of the earlier text_sample tokenized into words
word_tokens = [word_tokenize(s) for s in sent_tokenize(text_sample)]
all_tokens = []

# ๐Ÿ“Œ stop-word removal loop

for sentence in word_tokens:
    filtered_words = []

    # remove stop words from each โญtokenized sentence listโญ
    for word in sentence:
        # convert everything to lowercase
        word = word.lower()
        # add the tokenized word only if it is not in the stop-word list
        if word not in stopwords:
            filtered_words.append(word)
    all_tokens.append(filtered_words)

print(all_tokens)

 

 

 

 

๐Ÿ’ก Stemming and Lemmatization

 

โœ” Stemming and Lemmatization

 

  • In English, the base word changes under many conditions: past/present tense, third-person singular, progressive form, and so on.
  • Stemming and lemmatization both derive a word's root form.
  • Lemmatization finds the root form more precisely, on a semantic basis, at the cost of taking longer.
  • Stemming finds the root word with simpler rules.

 

โœ” NLTK classes

 

  • Stemmers: Porter, Lancaster, Snowball Stemmer
  • Lemmatization: WordNetLemmatizer

 

from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()
print(stemmer.stem('happier'), stemmer.stem('happiest'))

## ๐Ÿ“Œ Result ##

happy happiest

 

๐Ÿ‘‰ The stemmer fails to recover the proper base form 'happy' in every case ('happiest' is left unchanged).

 

 

from nltk.stem import WordNetLemmatizer
import nltk
nltk.download('wordnet')

lemma = WordNetLemmatizer()
print(lemma.lemmatize('happier', 'a'), lemma.lemmatize('happiest', 'a'))

## ๐Ÿ“Œ Result ##
happy happy

 

๐Ÿ‘‰ To extract the root word more accurately, lemmatization requires the word's part of speech as input.

 

 * verb: 'v'

 * adjective: 'a'

 

 

 

3๏ธโƒฃ BoW


๐Ÿ‘€ Overview

 

๐Ÿ’ก Bag of Words

 

 

โœ” BoW feature vectorization

 

  • A model that extracts feature values by assigning frequency counts to every word in a document, ignoring context and word order.
  • Across all sentences, every word is deduplicated and laid out as a column.
  • Each word then gets a unique index, and for each individual sentence the number of times each word occurs is recorded — spelled out in the sketch below.
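
The procedure in the bullets above on a toy corpus, in plain Python, just to make the indexing concrete (CountVectorizer automates all of this later):

sentences = ['I ate an apple', 'I ate a banana']

# deduplicate all words and give each one a fixed column index
vocab = {word: idx
         for idx, word in enumerate(sorted({w for s in sentences for w in s.split()}))}

# count how often each vocabulary word appears in each sentence
bow = []
for s in sentences:
    row = [0] * len(vocab)
    for w in s.split():
        row[vocab[w]] += 1
    bow.append(row)

print(vocab)  # {'I': 0, 'a': 1, 'an': 2, 'apple': 3, 'ate': 4, 'banana': 5}
print(bow)    # [[1, 0, 1, 1, 1, 0], [1, 1, 0, 0, 1, 1]]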

 

 

 

โœ” Pros and cons

 

๐Ÿ’จ Pros

 

  • Easy and fast model building.
  • Although it relies simply on word occurrence counts, it captures document characteristics well, so it has traditionally been widely used across many fields.

 

๐Ÿ’จ Cons

 

  • Lack of contextual meaning: because word order is not considered, the contextual meaning of words is lost. n-grams can partially compensate, but remain limited for contextual interpretation.
  • Sparse matrix problem: each document is made up of different words, so far more words do not appear in a given document than do. Yet every (deduplicated) word occupies a column of the word matrix, so most cells end up filled with 0. Sparse matrices generally degrade both the runtime and the predictive performance of ML algorithms, so dedicated techniques exist to handle them.

 

๐Ÿ‘€ Dense matrix: a matrix in which most values are meaningful non-zero values.

 

 

 

๐Ÿ’ก ์‚ฌ์ดํ‚ท๋Ÿฐ CountVectorizer, TfidfVectorizer 

 

 

โœ” BOW ์˜ ํ”ผ์ฒ˜๋ฒกํ„ฐํ™” 

 

  • ๋ชจ๋“  ๋ฌธ์„œ์—์„œ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ ์นผ๋Ÿผ ํ˜•ํƒœ๋กœ ๋‚˜์—ดํ•˜๊ณ  ๊ฐ ๋ฌธ์„œ์—์„œ ํ•ด๋‹น ๋‹จ์–ด์˜ ํšŸ์ˆ˜๋‚˜ ์ •๊ทœํ™”๋œ ๋นˆ๋„๋ฅผ ๊ฐ’์œผ๋กœ ๋ถ€์—ฌํ•˜๋Š” ๋ฐ์ดํ„ฐ์„ธํŠธ ๋ชจ๋ธ๋กœ ๋ณ€๊ฒฝํ•˜๋Š” ๊ฒƒ 
  • M๊ฐœ์˜ ๋ฌธ์„œ, ์ด N ๊ฐœ์˜ ๋‹จ์–ด ๐Ÿ‘‰ MxN ํ–‰๋ ฌ 

 

 

 

โœ” CountVectorizer 

 

  • ๊ฐ ๋ฌธ์„œ์— ํ•ด๋‹น ๋‹จ์–ด๊ฐ€ ๋‚˜ํƒ€๋‚˜๋Š” ํšŸ์ˆ˜๊ฐ€ ํ–‰๋ ฌ์˜ ๊ฐ ์…€์˜ ๊ฐ’์— ํ•ด๋‹นํ•œ๋‹ค.
  • ์นด์šดํŠธ ๊ฐ’์ด ๋†’์„์ˆ˜๋ก ์ค‘์š”ํ•œ ๋‹จ์–ด๋กœ ์ธ์‹๋œ๋‹ค. 
  • ์นด์šดํŠธ๋งŒ ๋ถ€์—ฌํ•˜๋‹ค๋ณด๋‹ˆ, ๊ทธ ๋ฌธ์„œ์˜ ํŠน์ง•์„ ๋‚˜ํƒ€๋‚ด๊ธฐ ๋ณด๋‹ค๋Š” ์–ธ์–ด์˜ ํŠน์„ฑ์ƒ ๋ฌธ์žฅ์—์„œ ์ž์ฃผ ์‚ฌ์šฉ๋  ์ˆ˜๋ฐ–์— ์—†๋Š” ๋‹จ์–ด๊นŒ์ง€ ๋†’์€ ๊ฐ’์„ ๋ถ€์—ฌํ•˜๊ฒŒ ๋œ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ๋‹ค. 
  • ์‚ฌ์ดํ‚ท๋Ÿฐ์˜ CountVectorizer ํด๋ž˜์Šค๋Š” ๋‹จ์ง€ ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”๋งŒ์„ ์ˆ˜ํ–‰ํ•˜์ง„ ์•Š์œผ๋ฉฐ ์†Œ๋ฌธ์ž ์ผ๊ด„ ๋ณ€ํ™˜, ํ† ํฐํ™”, ์Šคํ†ฑ ์›Œ๋“œ ํ•„ํ„ฐ๋ง ๋“ฑ์˜ ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ๋„ ํ•จ๊ป˜ ์ˆ˜ํ–‰ํ•œ๋‹ค ๐Ÿ‘‰ ํ† ํฌ๋‚˜์ด์ง• + ๋ฒกํ„ฐํ™”๋ฅผ ๋™์‹œ์— ํ•ด์ค€๋‹ค. (๋”ฐ๋กœ ํ† ํฐํ™”๊นŒ์ง€ ์ง„ํ–‰๋œ ํ˜•ํƒœ๋ฅผ ๋„ฃ์–ด์ฃผ์–ด๋„ ์ƒ๊ด€์€ ์—†์Œ. ๊ฐ ๋ฌธ์„œ๊ฐ€ ๋ฆฌ์ŠคํŠธ ํ˜•ํƒœ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๊ธฐ๋งŒ ํ•˜๋ฉด๋จ) 

 

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> vectorizer.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], ...)
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]

 

๐Ÿ’จ ํŒŒ๋ผ๋ฏธํ„ฐ

 

max_df โ—พ ์ „์ฒด ๋ฌธ์„œ์— ๊ฑธ์ณ ๋„ˆ๋ฌด ๋†’์€ ๋นˆ๋„์ˆ˜๋ฅผ ๊ฐ€์ง„ ๋‹จ์–ด ํ”ผ์ฒ˜ ์ œ์™ธ 

๐Ÿ‘‰ max_df = 100 : ์ „์ฒด ๋ฌธ์„œ์— ๊ฑธ์ณ 100๊ฐœ ์ดํ•˜๋กœ ๋‚˜ํƒ€๋‚˜๋Š” ๋‹จ์–ด๋งŒ ํ”ผ์ฒ˜๋กœ ์ถ”์ถœ 
๐Ÿ‘‰ max_df = 0.9 : ๋ถ€๋™ ์†Œ์ˆ˜์ ์œผ๋กœ ๊ฐ’์„ ์ง€์ •ํ•˜๋ฉด ์ „์ฒด ๋ฌธ์„œ์— ๊ฑธ์ณ ๋นˆ๋„์ˆ˜ 0~90% ๊นŒ์ง€์˜ ๋‹จ์–ด๋งŒ ํ”ผ์ฒ˜๋กœ ์ถ”์ถœํ•˜๊ณ  ๋‚˜๋จธ์ง€ ์ƒ์œ„ 5%๋Š” ์ถ”์ถœํ•˜์ง€ ์•Š์Œ 
min_df โ—พ ์ „์ฒด ๋ฌธ์„œ์— ๊ฑธ์ณ ๋„ˆ๋ฌด ๋‚ฎ์€ ๋นˆ๋„์ˆ˜๋ฅผ ๊ฐ€์ง„ ๋‹จ์–ด ํ”ผ์ฒ˜ ์ œ์™ธ (๋ณดํ†ต 2~3์œผ๋กœ ์„ค์ •)
โ—พ ๋„ˆ๋ฌด ๋‚ฎ์€ ๋นˆ๋„๋กœ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋Š” ํฌ๊ฒŒ ์ค‘์š”ํ•˜์ง€ ์•Š๊ฑฐ๋‚˜ garbage ์„ฑ ๋‹จ์–ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

๐ŸŽˆ max_df ์™€ ๋™์ž‘๋ฐฉ์‹์ด ๊ฐ™์Œ 
max_features ์ถ”์ถœํ•˜๋Š” ํ”ผ์ฒ˜์˜ ๊ฐœ์ˆ˜๋ฅผ ์ œํ•œํ•˜๋ฉฐ ์ •์ˆ˜๋กœ ๊ฐ’์„ ์ง€์ •ํ•œ๋‹ค. ๊ฐ€์žฅ ๋†’์€ ๋นˆ๋„๋ฅผ ๊ฐ€์ง€๋Š” ๋‹จ์–ด์ˆœ์œผ๋กœ ์ •๋ ฌํ•ด ์ง€์ •ํ•œ ๊ฐœ์ˆ˜๊นŒ์ง€๋งŒ ํ”ผ์ฒ˜๋กœ ์ถ”์ถœํ•œ๋‹ค.
stop_words 'english' ๋กœ ์ง€์ •ํ•˜๋ฉด ์˜์–ด ๋ถˆ์šฉ์–ด๋กœ ์ง€์ •๋œ ๋‹จ์–ด๋Š” ์ถ”์ถœ์—์„œ ์ œ์™ธํ•œ๋‹ค. 
n_gram_range โ—พ ๋‹จ์–ด ์ˆœ์„œ๋ฅผ ์–ด๋Š์ •๋„ ๋ณด๊ฐ•ํ•˜๊ธฐ ์œ„ํ•œ ํŒŒ๋ผ๋ฏธํ„ฐ
โ—พ ํŠœํ”Œ ํ˜•ํƒœ๋กœ (๋ฒ”์œ„ ์ตœ์†Ÿ๊ฐ’, ๋ฒ”์œ„ ์ตœ๋Œ“๊ฐ’) ์„ ์ง€์ •ํ•œ๋‹ค. 

๐Ÿ‘‰ n_gram_range = (1,1) : ํ† ํฐํ™”๋œ ๋‹จ์–ด๋ฅผ 1๊ฐœ์”ฉ ํ”ผ์ฒ˜๋กœ ์ถ”์ถœํ•œ๋‹ค. 
๐Ÿ‘‰ n_gram_range = (1,2) : ํ† ํฐํ™”๋œ ๋‹จ์–ด๋ฅผ 1๊ฐœ์”ฉ, ๊ทธ๋ฆฌ๊ณ  ์ˆœ์„œ๋Œ€๋กœ 2๊ฐœ์”ฉ ๋ฌถ์–ด์„œ ์ถ”์ถœํ•œ๋‹ค.
analyzer โ—พ ํ”ผ์ฒ˜ ์ถ”์ถœ์„ ์ˆ˜ํ–‰ํ•œ ๋‹จ์œ„๋ฅผ ์ง€์ •
โ—พ ๊ธฐ๋ณธ๊ฐ’ : 'word' 
โ—พ character ์˜ ํŠน์ • ๋ฒ”์œ„๋ฅผ ํ”ผ์ฒ˜๋กœ ๋งŒ๋“œ๋Š” ํŠน์ •ํ•œ ๊ฒฝ์šฐ ๋“ฑ์„ ์ ์šฉํ•  ๋•Œ ์‚ฌ์šฉ๋œ๋‹ค. 
token_pattern โ—พ ํ† ํฐํ™”๋ฅผ ์ˆ˜ํ–‰ํ•˜๋Š” ์ •๊ทœ ํ‘œํ˜„์‹ ํŒจํ„ด์„ ์ง€์ • 
โ—พ '\b\w\w+\b' ๊ฐ€ ๋””ํด๋“œ ๊ฐ’์œผ๋กœ ๊ณต๋ฐฑ, ๊ฐœํ–‰๋ฌธ์ž ๋“ฑ์œผ๋กœ ๊ตฌ๋ถ„๋œ ๋‹จ์–ด๋ถ„๋ฆฌ์ž ํ˜น์€ 2๋ฌธ์ž ์ด์ƒ์˜ ๋‹จ์–ด๋ฅผ ํ† ํฐ์œผ๋กœ ๋ถ„๋ฆฌํ•œ๋‹ค. 
โ—พ analyzer='word' ์ผ๋•Œ๋งŒ ๋ณ€๊ฒฝ ๊ฐ€๋Šฅ (๋ณดํ†ต ๋””ํดํŠธ ๊ฐ’์„ ์‚ฌ์šฉ) 
tokenizer ํ† ํฐํ™”๋ฅผ ๋ณ„๋„์˜ ์ปค์Šคํ…€ ํ•จ์ˆ˜๋กœ ์ด์šฉ์‹œ ์ ์šฉ 

 

 

 

โœ” TF-IDF

 

 

  • Gives a high weight to words that appear frequently within an individual document, while penalizing generic words that appear frequently across all documents — see the formula below.
  • When each document is long and there are many documents, it tends to guarantee better predictive performance than plain counting.
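
In its commonly used form (the exact variant is not pinned down here), the weight of word w in document d is

TF-IDF(w, d) = TF(w, d) × log(N / DF(w))

where TF(w, d) is the count of w in d, DF(w) is the number of documents containing w, and N is the total number of documents. scikit-learn's TfidfVectorizer uses a smoothed IDF and L2-normalizes each row, which is why the values in the output below differ from a raw hand calculation.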

 

 

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'you know I want your love',
    'I like you',
    'what should I do ',    
]

tfidfv = TfidfVectorizer().fit(corpus)
print(tfidfv.transform(corpus).toarray())
print(tfidfv.vocabulary_)


## ๐Ÿ“Œ Result ##

[[0.         0.46735098 0.         0.46735098 0.         0.46735098 0.         0.35543247 0.46735098]
 [0.         0.         0.79596054 0.         0.         0.         0.         0.60534851 0.        ]
 [0.57735027 0.         0.         0.         0.57735027 0.         0.57735027 0.         0.        ]]

{'you': 7, 'know': 1, 'want': 5, 'your': 8, 'love': 3, 'like': 2, 'what': 6, 'should': 4, 'do': 0}

 

 

 

 

 

๐Ÿ’ก Sparse matrices for BoW vectorization

 

  • A sparse matrix allocates memory to far too many unnecessary 0 values, so it consumes a lot of memory, and its sheer size makes data access slow during computation.
  • Ways to convert a sparse matrix so it physically occupies little memory: the COO format and the CSR format.
  • CSR generally performs better, so CSR is used rather than COO.

 

โœ” Sparse matrix

 

 

 

 

โœ” COO

 

  • Stores only the non-zero data in one array, and stores the row and column positions of that data in separate arrays.

 

 

โœ” CSR

 

  • COO ํ˜•์‹์ด ํ–‰๊ณผ ์—ด์˜ ์œ„์น˜๋ฅผ ๋‚˜ํƒ€๋‚ด๊ธฐ ์œ„ํ•ด ๋ฐ˜๋ณต์ ์ธ ์œ„์น˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š” ๋ฌธ์ œ์ ์„ ํ•ด๊ฒฐํ•œ ๋ฐฉ์‹ 

 

 

import numpy as np
from scipy import sparse

data = np.array([[0, 0, 1], [0, 2, 0], [3, 0, 0]])  # mostly-zero dense matrix

sparse.coo_matrix(data)  # COO: stores (row, col, value) triplets
sparse.csr_matrix(data)  # CSR: compresses the repeated row indices

 

 

 

4๏ธโƒฃ ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ ์‹ค์Šต 


๐Ÿ“Œ ๋‰ด์Šค ๊ทธ๋ฃน ๋ถ„๋ฅ˜ ์‹ค์Šต 

 

๐Ÿ’ก ํ…์ŠคํŠธ ๋ถ„๋ฅ˜ 

 

โœ” ํŠน์ • ๋ฌธ์„œ์˜ ๋ถ„๋ฅ˜๋ฅผ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ํ†ตํ•ด ํ•™์Šตํ•ด ๋ชจ๋ธ์„ ์ƒ์„ฑํ•œ ๋’ค , ์ด ํ•™์Šต ๋ชจ๋ธ์„ ์ด์šฉํ•ด ๋‹ค๋ฅธ ๋ฌธ์„œ์˜ ๋ถ„๋ฅ˜๋ฅผ ์˜ˆ์ธก 

 

โœ” ํฌ์†Œํ–‰๋ ฌ์— ๋ถ„๋ฅ˜๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ 

 

  • ๋กœ์ง€์Šคํ‹ฑํšŒ๊ท€
  • SVM
  • ๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ 

 

๐Ÿƒ‍โ™€๏ธ Countvectorizer , TfidfVectorizer

 

๐Ÿƒ‍โ™€๏ธ model : ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ 

 

๐Ÿƒ‍โ™€๏ธ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ : GridSearchCV,  Pipeline

 

 

 

 

5๏ธโƒฃ ๊ฐ์„ฑ๋ถ„์„ 


๐Ÿ‘€ ๊ฐœ์š” 

 

๐Ÿ’ก ๊ฐ์„ฑ๋ถ„์„

 

โœ” ๋ฌธ์„œ์˜ ์ฃผ๊ด€์ ์ธ ๊ฐ์„ฑ/์˜๊ฒฌ/๊ฐ์ •/๊ธฐ๋ถ„ ๋“ฑ์„ ํŒŒ์•…ํ•˜๊ธฐ ์œ„ํ•œ ๋ฐฉ๋ฒ• 

 

โœ” ์†Œ์…œ๋ฏธ๋””์–ด, ์—ฌ๋ก ์กฐ์‚ฌ, ์˜จ๋ผ์ธ ๋ฆฌ๋ทฐ, ํ”ผ๋“œ๋ฐฑ ๋“ฑ ๋‹ค์–‘ํ•œ ๋ถ„์•ผ์—์„œ ํ™œ์šฉ๋จ 

 

โœ” ํ…์ŠคํŠธ์˜ ๋‹จ์–ด์™€ ๋ฌธ๋งฅ ๐Ÿ‘‰ ๊ฐ์„ฑ ์ˆ˜์น˜๋ฅผ ๊ณ„์‚ฐ ๐Ÿ‘‰ ๊ธ์ • ๊ฐ์„ฑ ์ง€์ˆ˜์™€ ๋ถ€์ • ๊ฐ์„ฑ ์ง€์ˆ˜ 

 

์ง€๋„ํ•™์Šต 

 

  • ํ•™์Šต ๋ฐ์ดํ„ฐ + ํƒ€๊นƒ ๋ ˆ์ด๋ธ” ๊ฐ’ 
  • ํ…์ŠคํŠธ ๋ถ„๋ฅ˜์™€ ์œ ์‚ฌํ•จ

 

๋น„์ง€๋„ํ•™์Šต

 

  • Lexicon ์ด๋ผ๋Š” ์ผ์ข…์˜ ๊ฐ์„ฑ ์–ดํœ˜ ์‚ฌ์ „์„ ์ด์šฉํ•œ๋‹ค. (ํ•œ๊ธ€ ์ง€์› X) 
  • ๊ฐ์„ฑ ์–ดํœ˜ ์‚ฌ์ „์„ ์ด์šฉํ•ด ๋ฌธ์„œ์˜ ๊ธ์ •์ , ๋ถ€์ •์  ๊ฐ์„ฑ ์—ฌ๋ถ€๋ฅผ ํŒ๋‹จํ•œ๋‹ค. 
  • ๋ณดํ†ต ๋งŽ์€ ๊ฐ์„ฑ ๋ถ„์„์šฉ ๋ฐ์ดํ„ฐ๋Š” ๊ฒฐ์ •๋œ ๋ ˆ์ด๋ธ” ๊ฐ’์„ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š๋‹ค. 

 

๐Ÿ’จ Polarity score

 

  • A sentiment lexicon holds numeric values expressing the degree of positive or negative sentiment.
  • The polarity score is determined with reference to the word's position, its surrounding words, the context, POS (tagging of grammatical elements such as noun/adjective/verb), and so on.
  • NLTK's Lexicon modules.

 

 

๐Ÿ’ก ๊ฐ์„ฑ์‚ฌ์ „ 

 

๐Ÿ’จ WordNet 

 

  • NLP ํŒจํ‚ค์ง€์—์„œ ์ œ๊ณตํ•˜๋Š” ๋ฐฉ๋Œ€ํ•œ "์‹œ๋งจํ‹ฑ ๋ถ„์„"์„ ์ œ๊ณตํ•˜๋Š” ์˜์–ด ์–ดํœ˜ ์‚ฌ์ „
  • ์‹œ๋งจํ‹ฑ semantic : ๋ฌธ๋งฅ์ƒ์˜ ์˜๋ฏธ → ๋™์ผํ•œ ๋‹จ์–ด๋‚˜ ๋ฌธ์žฅ์ด๋ผ๋„ ๋‹ค๋ฅธ ํ™˜๊ฒฝ๊ณผ ๋ฌธ๋งฅ์—์„œ๋Š” ๋‹ค๋ฅด๊ฒŒ ํ‘œํ˜„๋˜๊ฑฐ๋‚˜ ์ดํ•ด๋  ์ˆ˜ ์žˆ๋‹ค.
  • WordNet ์€ ๋‹ค์–‘ํ•œ ์ƒํ™ฉ์—์„œ ๊ฐ™์€ ์–ดํœ˜๋ผ๋„ ๋‹ค๋ฅด๊ฒŒ ์‚ฌ์šฉ๋˜๋Š” ์–ดํœ˜์˜ ์‹œ๋งจํ‹ฑ ์ •๋ณด๋ฅผ ์ œ๊ณตํ•œ๋‹ค ๐Ÿ‘‰ ํ’ˆ์‚ฌ๋กœ ๊ตฌ์„ฑ๋œ ๊ฐœ๋ณ„ ๋‹จ์–ด๋ฅผ Synset ์ด๋ผ๋Š” ๊ฐœ๋…์„ ์ด์šฉํ•ด ํ‘œํ˜„ํ•œ๋‹ค. 
  • NLTK ์—์„œ ์ œ๊ณตํ•˜๋Š” ๊ฐ์„ฑ์‚ฌ์ „์˜ ์˜ˆ์ธก์„ฑ๋Šฅ์€ ๊ทธ๋ฆฌ ์ข‹์ง„ ์•Š๋‹ค. ์‹ค์ œ ์—…๋ฌด ์ ์šฉ์—์„œ๋Š” NLTK ํŒจํ‚ค์ง€๊ฐ€ ์•„๋‹Œ ๋‹ค๋ฅธ ๊ฐ์„ฑ ์‚ฌ์ „์„ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ผ๋ฐ˜์ ์ด๋‹ค. 

 

๐Ÿ“‘ ๋Œ€ํ‘œ์ ์ธ ๊ฐ์„ฑ์‚ฌ์ „ 

 

  • NLTK ์˜ WordNet 
  • SentiWordNet : ๊ฐ์„ฑ ๋‹จ์–ด ์ „์šฉ์˜ WordNet ์„ ๊ตฌํ˜„ํ•œ ๊ฒƒ 
  • VADER : ์†Œ์…œ ๋ฏธ๋””์–ด์˜ ํ…์ŠคํŠธ์— ๋Œ€ํ•œ ๊ฐ์„ฑ๋ถ„์„์„ ์ œ๊ณตํ•˜๊ธฐ ์œ„ํ•œ ํŒจํ‚ค์ง€ 
  • Pattern : ์˜ˆ์ธก์„ฑ๋Šฅ์—์„œ ๊ฐ€์žฅ ์ฃผ๋ชฉ๋ฐ›๋Š” ํŒจํ‚ค์ง€. ํŒŒ์ด์ฌ 2.X ๋ฒ„์ „์—์„œ๋งŒ ๋™์ž‘ํ•œ๋‹ค. 

 

 

 

 

 

โž• Korean sentiment lexicons

 

โ—พ KNU : http://dilab.kunsan.ac.kr/knusl.html

โ—พ KOSAC : http://word.snu.ac.kr/kosac/ , https://github.com/mrlee23/KoreanSentimentAnalyzer

 

 

๐Ÿ“Œ Practice

 

from nltk.corpus import wordnet as wn 
from nltk.corpus import sentiwordnet as swn 
from nltk.sentiment.vader import SentimentIntensityAnalyzer

 

๐Ÿ’จ ์ง€๋„ํ•™์Šต 

 

ํ…์ŠคํŠธ ํด๋ Œ์ง•(์ •๊ทœํ‘œํ˜„์‹)  → train/test dataset → ํ…์ŠคํŠธ ๋ฒกํ„ฐํ™” (Count, TF-IDF) + ๋ถ„๋ฅ˜ ๋ชจ๋ธ๋ง (๋กœ์ง€์Šคํ‹ฑํšŒ๊ท€) by Pipeline 

 

 

๐Ÿ’จ ๋น„์ง€๋„ํ•™์Šต 

 

 

๐Ÿƒ‍โ™€๏ธ WordNet

 

from nltk.corpus import wordnet as wn

term = 'present'

synsets = wn.synsets(term)
print('return type', type(synsets))
print('number of synsets', len(synsets))
print('synsets', synsets)  # returned as a list


########### Result ##################

return type <class 'list'>
number of synsets 18
synsets [Synset('present.n.01'), Synset('present.n.02'), Synset('present.n.03'), Synset('show.v.01'), Synset('present.v.02'), Synset('stage.v.01'), Synset('present.v.04'), Synset('present.v.05'), Synset('award.v.01'), Synset('give.v.08'), Synset('deliver.v.01'), Synset('introduce.v.01'), Synset('portray.v.04'), Synset('confront.v.03'), Synset('present.v.12'), Synset('salute.v.06'), Synset('present.a.01'), Synset('present.a.02')]

 

โ—พ present.n.01

  • n : noun
  • 01 : an index distinguishing the multiple senses 'present' has as a noun

 

โ—พ synset: represents each of the multiple semantic senses a single word can carry as an individual class.

 

print(synsets[0].name())
print(synsets[0].lexname())      # POS category
print(synsets[0].definition())   # definition
print(synsets[0].lemma_names())  # lemmas (synonymous words)

################## Result #################

present.n.01
noun.time
the period of time that is happening now; any continuous stretch of time including the moment of speech
['present', 'nowadays']

 

โ—พ WordNet can express the relationship between one lexical item and another as a similarity: path_similarity()
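
A small usage sketch (the synset names follow WordNet's noun senses; exact scores depend on the installed WordNet data):

from nltk.corpus import wordnet as wn

tree = wn.synset('tree.n.01')
lion = wn.synset('lion.n.01')
tiger = wn.synset('tiger.n.02')

# similarity in (0, 1]; synsets closer in the hypernym hierarchy score higher
print(tree.path_similarity(lion))
print(lion.path_similarity(tiger))  # lion↔tiger scores higher than tree↔lion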

 

 

๐Ÿƒ‍โ™€๏ธ SentiWordNet 

 

from nltk.corpus import sentiwordnet as swn 

senti_synsets = list(swn.senti_synsets('slow')) 
print(type(senti_synsets)) 
print(len(senti_synsets)) 
print(senti_synsets)


############# Result ################

<class 'list'>
11
[SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]

 

โ—พ A SentiSynset object carries sentiment scores expressing the word's polarity and an objectivity score expressing how objective (the opposite of sentimental) it is. If a word is not sentimental at all, its objectivity score is 1 and its sentiment scores are both 0.

 

father = swn.senti_synset('father.n.01')
print(father.pos_score()) 
print(father.neg_score()) 
print(father.obj_score())

######### Result ##########
0.0
0.0
1.0

 

fabulous = swn.senti_synset('fabulous.a.01')  # carries the sense of 'wonderful'
print(fabulous.pos_score())
print(fabulous.neg_score())
print(fabulous.obj_score())

############## Result ###################
0.875
0.125
0.0

 

โ—พ IMDB review sentiment analysis (see the Colab notebook)

 

(1) Split the document into sentences.
(2) Tokenize each sentence into words and POS-tag them.
👉 steps (1) and (2) use WordNet.
(3) Create synset and senti_synset objects from the POS-tagged words.
(4) Sum the positive/negative sentiment scores over the senti_synsets; classify as positive when the total is at or above a given threshold, otherwise as negative — condensed in the sketch below.
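
A condensed sketch of steps (1)–(4); the Penn-to-WordNet POS mapping and the 0 threshold follow the usual SentiWordNet recipe, with error handling trimmed (pos_tag additionally needs nltk.download('averaged_perceptron_tagger')):

from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.corpus import sentiwordnet as swn, wordnet as wn
from nltk.stem import WordNetLemmatizer

def penn_to_wn(tag):
    # map Penn Treebank POS tags onto WordNet's POS constants
    if tag.startswith('J'): return wn.ADJ
    if tag.startswith('N'): return wn.NOUN
    if tag.startswith('R'): return wn.ADV
    if tag.startswith('V'): return wn.VERB
    return None

def swn_polarity(text, threshold=0.0):
    lemmatizer = WordNetLemmatizer()
    sentiment = 0.0
    for sentence in sent_tokenize(text):                     # (1) sentence split
        for word, tag in pos_tag(word_tokenize(sentence)):   # (2) tokenize + POS tag
            wn_tag = penn_to_wn(tag)
            if wn_tag is None:
                continue
            lemma = lemmatizer.lemmatize(word, pos=wn_tag)
            synsets = wn.synsets(lemma, pos=wn_tag)          # (3) synset lookup
            if not synsets:
                continue
            senti = swn.senti_synset(synsets[0].name())
            sentiment += senti.pos_score() - senti.neg_score()  # (4) running total
    return 1 if sentiment >= threshold else 0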

 

 

๐Ÿƒ‍โ™€๏ธ Vader 

 

โ—ฝ Aimed at sentiment analysis of social media.

 

from nltk.sentiment.vader import SentimentIntensityAnalyzer

senti_analyzer = SentimentIntensityAnalyzer()
senti_scores = senti_analyzer.polarity_scores(review_df['review'][0])  # run sentiment analysis on a single IMDB review
print(senti_scores)

######## Result ##########
{'neg': 0.13, 'neu': 0.743, 'pos': 0.127, 'compound': -0.7943}

 

  • Call the polarity_scores() method to obtain sentiment scores, determined with reference to word position, surrounding words, context, POS, and so on.
    • neg: negative, neu: neutral, pos: positive index
    • compound: combines neg, neu, and pos into a single sentiment index between -1 and 1. Positivity/negativity is decided from this value; usually 0.1 or above is judged positive and below that negative, and the threshold can be adjusted to tune predictive performance.

 

def vader_polarity(review, threshold=0.1):
    analyzer = SentimentIntensityAnalyzer()
    scores = analyzer.polarity_scores(review)

    # based on the compound value: return 1 if it is at least the threshold, else 0
    agg_score = scores['compound']
    final_sentiment = 1 if agg_score >= threshold else 0
    return final_sentiment

# run vader_polarity() on each record with apply-lambda and store the results in 'vader_preds'
review_df['vader_preds'] = review_df['review'].apply(lambda x: vader_polarity(x, 0.1))
y_target = review_df['sentiment'].values
vader_preds = review_df['vader_preds'].values

print('#### VADER prediction performance ####')
get_clf_eval(y_target, vader_preds)

# accuracy improves over SentiWordNet (recall improves especially)

 

 

 

6๏ธโƒฃ ํ† ํ”ฝ๋ชจ๋ธ๋ง - ๋‰ด์Šค๊ทธ๋ฃน 


๐Ÿ‘€ ๊ฐœ์š” 

 

๐Ÿ’ก ํ† ํ”ฝ ๋ชจ๋ธ๋ง

 

 

โœ” ๋ฌธ์„œ ์ง‘ํ•ฉ์— ์ˆจ์–ด์žˆ๋Š” ์ฃผ์ œ๋ฅผ ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ (๋น„์ง€๋„ํ•™์Šต) 

โœ” ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ๋ฐ˜์˜ ํ† ํ”ฝ ๋ชจ๋ธ๋ง์€ ์ˆจ๊ฒจ์ง„ ์ฃผ์ œ๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋Š” ์ค‘์‹ฌ ๋‹จ์–ด๋ฅผ ํ•จ์ถ•์ ์œผ๋กœ ์ถ”์ถœํ•œ๋‹ค. 

โœ” LSA (Latent semantic analysis) ์™€ LDA ๊ธฐ๋ฒ•์ด ์ž์ฃผ ์‚ฌ์šฉ๋œ๋‹ค. 

 

 

๐Ÿ’ก LDA 

 

โœ” Latent Dirichlet Allocation 

 

 * ์ฐจ์›์ถ•์†Œ์˜ LDA ์™€ ์•ฝ์–ด๋Š” ๊ฐ™์ง€๋งŒ ์„œ๋กœ ๋‹ค๋ฅธ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ž„์— ์ฃผ์˜ํ•  ๊ฒƒ 

 

 

  • ์‚ฌ์ดํ‚ท๋Ÿฐ LatentDiricheltAllocation ๋ชจ๋“ˆ ์ง€์› 
  • LDA ๋Š” Count ๊ธฐ๋ฐ˜์˜ ๋ฒกํ„ฐํ™”๋งŒ ์‚ฌ์šฉํ•œ๋‹ค. 

 

from sklearn.decomposition import LatentDirichletAllocation

 

โ—ฝ components_ : holds, per topic, a number indicating how strongly each word is assigned to it. The higher the value, the more central the word is to that topic.

 

lda = LatentDirichletAllocation(n_components=8, random_state=0)
# โญ n_components : the number of topics

lda.fit(feat_vect)
# feat_vect : the count-vectorized word matrix

print(lda.components_.shape)
lda.components_  # per topic (8 of them), how strongly each word feature is assigned to that topic
# 👉 the higher the value, the more the word feature acts as a central word of the topic

 

โ—ฝ 8๊ฐœ์˜ ํ† ํ”ฝ ์ฃผ์ œ, 1000๊ฐœ์˜ ๋‹จ์–ด

 

def display_topics(model, feature_names, no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('Topic #', topic_index)
        # topic_index : #0 ~ #7
        # topic : the 1000 word weights belonging to this topic index

        # indexes of the components_ array sorted by value in descending order
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes = topic_word_indexes[:no_top_words]

        # for each index in top_indexes, pull the word feature out of feature_names and join with spaces
        feature_concat = ' '.join([feature_names[i] for i in top_indexes])
        print(feature_concat)

# extract the names of all words in the CountVectorizer object
# (get_feature_names_out(); plain get_feature_names() in scikit-learn < 1.0)
feature_names = count_vect.get_feature_names_out()

# extract only the 15 most relevant words per topic
display_topics(lda, feature_names, 15)

# some topics are dominated by generic words, while topics like #2 surface subject words related to computer graphics

 

 

 

๐Ÿ’ก LSA  

 

โœ” Latent Semantic Analysis

 

  • BoW-based embeddings use word frequencies but have the drawback of not capturing word meaning.
  • As an alternative, LSA uses Truncated SVD to reduce dimensionality and draw out the latent meaning of words — a minimal sketch follows.

 
