
Text Analysis ②

by isdawell 2022. 5. 17.

๐Ÿ“Œ ํŒŒ์ด์ฌ ๋จธ์‹ ๋Ÿฌ๋‹ ์™„๋ฒฝ๊ฐ€์ด๋“œ ๊ณต๋ถ€ ๋‚ด์šฉ ์ •๋ฆฌ 

 

 

📌 Practice code

 

https://colab.research.google.com/drive/1aMlFfX927tDFnPUisw2M3tB6NwGy5c7q?usp=sharing 

 

08. Text Analysis (2).ipynb


 

 

1๏ธโƒฃ ๋ฌธ์„œ ๊ตฐ์ง‘ํ™” 


💡 Document clustering

 

✔ Concept

 

  • Grouping documents with similar text composition into clusters.
  • Classification-based document categorization needs target category labels in advance, but document clustering works without them, as unsupervised learning.

 

1. Tokenize and vectorize the text
2. Apply a clustering algorithm: K-means
3. Extract the key words of each cluster from cluster_centers_

 

 

✔ Practice - clustering product/service review data by category

 

 

# ๐Ÿ“Œ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ๋งŒ๋“ค๊ธฐ 
document_df = pd.DataFrame({'filename' : filename_list, 'opinion_text' : opinion_text})  


# 📌 Define the tokenizer
import string
import nltk
from nltk.stem import WordNetLemmatizer
# nltk.download('punkt'); nltk.download('wordnet')  # one-time downloads needed for tokenization/lemmatization

# translation table mapping every punctuation character to None (i.e. removing it)
remove_punct_dict = dict((ord(punct), None) for punct in string.punctuation)
lemmar = WordNetLemmatizer()

def LemTokens(tokens):
    # lemmatize each token
    return [lemmar.lemmatize(token) for token in tokens]

def LemNormalize(text):
    # lowercase, strip punctuation, word-tokenize, then lemmatize
    return LemTokens(nltk.word_tokenize(text.lower().translate(remove_punct_dict)))
    

# 📌 Feature vectorization
from sklearn.feature_extraction.text import TfidfVectorizer 

tfidf_vect = TfidfVectorizer(tokenizer = LemNormalize, stop_words = 'english', ngram_range=(1,2), min_df = 0.05, max_df = 0.85) 

feature_vect = tfidf_vect.fit_transform(document_df['opinion_text'])

 

# 📌 K-means clustering : (cars - electronics - hotels) 

from sklearn.cluster import KMeans 

# NOTE: n_clusters=5 is used for this first run; the cluster_centers output further below
# (shape (3, 4611)) comes from re-running the same code with n_clusters=3.
km_cluster = KMeans(n_clusters=5, max_iter = 10000, random_state=0) 
km_cluster.fit(feature_vect) # ✨
cluster_label = km_cluster.labels_ 
cluster_centers = km_cluster.cluster_centers_  

document_df['cluster_label'] = cluster_label

 

 

 

# 📌 Extract the key words of each cluster 

cluster_centers = km_cluster.cluster_centers_ #⭐
print('cluster_centers shape :',cluster_centers.shape)
print(cluster_centers)

# rows    : individual clusters --> 3 clusters
# columns : individual features --> 4611 word features
# values  : between 0 and 1; the closer to 1, the closer that feature is to the cluster center


cluster_centers shape : (3, 4611)
[[0.         0.00099499 0.00174637 ... 0.         0.00183397 0.00144581]
 [0.01005322 0.         0.         ... 0.00706287 0.         0.        ]
 [0.         0.00092551 0.         ... 0.         0.         0.        ]]

 

💨  cluster_model.cluster_centers_.argsort()[:,::-1]   → returns the feature indices sorted by descending center value
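A minimal sketch of how the key words can be pulled out with this (assuming km_cluster and tfidf_vect from the code above; on scikit-learn versions before 1.0, use get_feature_names() instead of get_feature_names_out()):

feature_names = tfidf_vect.get_feature_names_out()
ordered_idx = km_cluster.cluster_centers_.argsort()[:, ::-1]   # feature indices, largest center value first

for cluster_num in range(km_cluster.n_clusters):
    top_idx = ordered_idx[cluster_num, :10]                     # top 10 feature indices for this cluster
    top_words = [feature_names[idx] for idx in top_idx]         # map indices back to words
    print(f'Cluster {cluster_num} top words:', top_words)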

 

 

 

 

 

2๏ธโƒฃ ๋ฌธ์„œ ์œ ์‚ฌ๋„ 


👀 Overview

 

💡 Cosine similarity

 

✔ Cosine similarity cos θ

 

  • Based on how similar the directions of two vectors are, rather than their magnitudes 👉 compute the angle between the two vectors and use it as a numeric similarity score (see the sketch below). 
  • below 90° : similar vectors 
  • exactly 90° : unrelated vectors 
  • 90° to 180° : vectors pointing in opposite directions
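
For reference, a minimal NumPy sketch of the formula cos θ = (A · B) / (‖A‖‖B‖); the vectors are made-up examples:

import numpy as np

def cos_similarity(v1, v2):
    # dot product divided by the product of the two vector norms
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cos_similarity(np.array([1, 1, 0]), np.array([1, 0, 0])))  # ≈ 0.7071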

 

 

✔ Why cosine similarity is so often used for document similarity 

 

  • A feature-vectorized document matrix is usually sparse, and magnitude-based similarity measures between such document vectors (e.g. Euclidean distance) tend to be inaccurate.
  • Raw frequency (magnitude) comparisons are also misleading when document lengths differ. For example, if the word "Hangul" appears 3 times in document B, which has only three sentences, and 5 times in document A, which has 30 sentences, document B should still be judged as more closely related to "Hangul" despite the lower raw count. 

 

✔ scikit-learn module

 

from sklearn.metrics.pairwise import cosine_similarity
  • Works with both sparse and dense matrices
  • Accepts both matrices and arrays 

 

 

from sklearn.feature_extraction.text import TfidfVectorizer

doc_list = ['if you take the blue pill, the story ends' ,
            'if you take the red pill, you stay in Wonderland',
            'if you take the red pill, I show you how deep the rabbit hole goes']

tfidf_vect_simple = TfidfVectorizer()
feature_vect_simple = tfidf_vect_simple.fit_transform(doc_list)
print(feature_vect_simple.shape)  


(3, 18)
# 3 documents, 18 (unique) word features

 

 

from sklearn.metrics.pairwise import cosine_similarity

#📌 first parameter  : feature matrix of the reference document 
#📌 second parameter : feature matrix of the documents to compare against 


# Similarity between the first document and all documents (including itself)
pair = cosine_similarity(feature_vect_simple[0], feature_vect_simple)
print(pair)


[[1.         0.40207758 0.40425045]]

 

# Compute pairwise similarity across all documents

sim_cos = cosine_similarity(feature_vect_simple, feature_vect_simple) 
print(sim_cos) 
print(sim_cos.shape)



[[1.         0.40207758 0.40425045]
 [0.40207758 1.         0.45647296]
 [0.40425045 0.45647296 1.        ]]
 
 
(3, 3)

 

 

 

👀 Example of visualizing document similarity results - Opinion review data 
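
The plots that originally accompanied this part are not reproduced here. As a rough sketch of the kind of visualization used (assuming document_df, feature_vect and the fitted clustering from section 1; the cluster label value 1 is illustrative):

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

# documents assigned to one cluster (e.g. the hotel reviews)
hotel_idx = document_df[document_df['cluster_label'] == 1].index

# cosine similarity of the first document in the cluster against the whole cluster
sim = cosine_similarity(feature_vect[hotel_idx[0]], feature_vect[hotel_idx])[0]

# bar plot of the similarities, labelled by filename (the self-similarity of 1.0 is dropped)
sim_df = pd.DataFrame({'filename': document_df.loc[hotel_idx, 'filename'].values,
                       'similarity': sim}).sort_values(by='similarity', ascending=False)[1:]
sns.barplot(x='similarity', y='filename', data=sim_df)
plt.title('Cosine similarity to the first document in the cluster')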

 

 

 

 

 

 

 

 

3๏ธโƒฃ ํ•œ๊ธ€ ํ…์ŠคํŠธ ์ฒ˜๋ฆฌ 


📌 Sentiment analysis on the Naver movie review rating dataset 

 

💡 Why Korean NLP is hard 

 

 

โœ” ๋„์–ด์“ฐ๊ธฐ

 

  • In English, incorrect spacing simply produces nonexistent or wrong words. 
  • In Korean, however, spacing can change the meaning of a sentence. 
    • e.g. "아버지 가방에 들어가신다" (Father gets into the bag) vs. "아버지가 방에 들어가신다" (Father goes into the room)

 

✔ A wide variety of particles (josa) 

 

  • There are so many possible particle forms that they are hard to strip out during preprocessing steps such as stemming. 
  • This makes Korean harder to process than Latin-based languages. 

 

 

๐Ÿ’ก KoNLPy

 

 

✔ The representative Korean morphological analysis package 

 

  • Morphological analysis: splitting a corpus into morphemes (the smallest units that carry meaning) / stems and attaching a part-of-speech (POS) tag to each morpheme (see the sketch below). 
  • The Kkma, Hannanum, Komoran, Mecab, and Twitter (Okt) morphological analyzer modules are available. 
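
As a quick illustration of what this looks like with the Okt (Twitter) analyzer, a minimal sketch (assuming KoNLPy is installed):

from konlpy.tag import Okt

okt = Okt()
sentence = '아버지가 방에 들어가신다'
print(okt.morphs(sentence))   # list of morphemes, e.g. ['아버지', '가', '방', '에', ...]
print(okt.pos(sentence))      # (morpheme, POS tag) pairs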

 

💡 Practice

 

 

๐Ÿƒ‍โ™€๏ธ ๊ฐ์ • ๋ ˆ์ด๋ธ” ๋ถ„ํฌ ํ™•์ธ 

 

 

๐Ÿƒ‍โ™€๏ธ ํ…์ŠคํŠธ ์ „์ฒ˜๋ฆฌ 

 

# Replace null values with a space 

import re 
train_df = train_df.fillna(' ') 

# Replace digits with a space: \d matches a digit in a regular expression 
train_df['document'] = train_df['document'].apply(lambda x : re.sub(r'\d+', " ", x))


# Apply the same preprocessing to the test set 

test_df = pd.read_csv('ratings_test.txt', sep='\t')
test_df = test_df.fillna(' ')
test_df['document'] = test_df['document'].apply( lambda x : re.sub(r"\d+", " ", x) )

 

๐Ÿƒ‍โ™€๏ธ ํ† ํฐํ™” + TfidfVectorizer  

 

# 📌 Tokenize into morphemes with a Korean morphological analyzer 
# ✨ The Twitter (Okt) class is well suited to SNS-style text 

from konlpy.tag import Okt
okt = Okt() 
def tw_tokenizer(text) : 
  tokens_ko = okt.morphs(text) # tokenize into morphemes 
  return tokens_ko 
  
  
  
# 📌 Vectorize the text 
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.linear_model import LogisticRegression  
from sklearn.model_selection import GridSearchCV 

# ⭐ Pass the tokenizer defined above as the tokenizer argument of TfidfVectorizer 

## Takes 10+ minutes to run 
tfidf_vect = TfidfVectorizer(tokenizer = tw_tokenizer, ngram_range =(1,2), min_df = 3, max_df = 0.9)  
tfidf_vect.fit(train_df['document']) 
tfidf_matrix_train = tfidf_vect.transform(train_df['document'])

 

๐Ÿƒ‍โ™€๏ธ ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€ ํ›ˆ๋ จ : GridSearchCV

 

# 📌 Sentiment classification with logistic regression, tuned via GridSearchCV 

lg_clf = LogisticRegression(random_state=0)  

params = {
    'C' : [1,3.5,4.5,5.5,10]
}

grid_cv = GridSearchCV(lg_clf, param_grid = params, cv = 3, scoring='accuracy', verbose=1)  
grid_cv.fit(tfidf_matrix_train, train_df['label'])  


print(grid_cv.best_params_, round(grid_cv.best_score_,4)) 


# best at C = 3.5

 

๐Ÿƒ‍โ™€๏ธ test data set ์˜ˆ์ธก ์ˆ˜ํ–‰ 

 

# 📌 Run the final sentiment prediction on the test set 

from sklearn.metrics import accuracy_score 

tfidf_matrix_test = tfidf_vect.transform(test_df['document']) 

best_estimator = grid_cv.best_estimator_ # ⭐⭐
preds = best_estimator.predict(tfidf_matrix_test)  # ⭐⭐

print('Logistic regression accuracy : ', accuracy_score(test_df['label'], preds))

# 0.86172

 

 

 

 

 

4๏ธโƒฃ ์‹ค์Šต 


👀 Predicting product prices for Mercari, a large online marketplace 

 

💡 Competition goal and dataset overview 

 

✔ Text data such as various product attributes and product descriptions 👉 provide sellers with an estimated product price

 

train_id           : row ID
name               : product name (text)
item_condition_id  : product condition as provided by the seller (categorical)
category_name      : category name (text)
brand_name         : brand name (text)
price              : product price; the target to predict
shipping           : shipping fee flag (1 = free, paid by the seller; 0 = paid by the buyer)
item_description   : product description (text)

 

👉 Run a regression on text plus structured (categorical) data 

 

 

💡 Analysis workflow 

 

1๏ธโƒฃ ์ „์ฒ˜๋ฆฌ 

 

๐Ÿƒ‍โ™€๏ธ ์นผ๋Ÿผ๋ณ„ ๋ฐ์ดํ„ฐ ์œ ํ˜•/๋„๊ฐ’ ํ™•์ธ 

# s is the Mercari train DataFrame loaded earlier
s.info() 


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1482535 entries, 0 to 1482534
Data columns (total 8 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   train_id           1482535 non-null  int64  
 1   name               1482535 non-null  object 
 2   item_condition_id  1482535 non-null  int64  
 3   category_name      1476208 non-null  object 
 4   brand_name         849853 non-null   object 
 5   price              1482535 non-null  float64
 6   shipping           1482535 non-null  int64  
 7   item_description   1482531 non-null  object 
dtypes: float64(1), int64(3), object(4)
memory usage: 90.5+ MB

 

💨 brand_name looks like a key driver of price, but it has a lot of null values. 

💨 category_name and item_description also contain nulls.

 

 

๐Ÿƒ‍โ™€๏ธ target ๋ถ„ํฌ ํ™•์ธ 

# Look at the distribution of price 

import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline 

y_train_df = s['price'] 
plt.figure(figsize=(6,4)) 
sns.distplot(y_train_df, kde = False) # many low-priced items 👉 look again after a log transform

 

 

import numpy as np
y_train_df = np.log1p(y_train_df) 
sns.distplot(y_train_df, kde=False) 
# After the log transform the values are much closer to a normal distribution 👉 apply the log transform to the original data as well

s['price'] = np.log1p(s['price'])

 

 

๐Ÿƒ‍โ™€๏ธ ์นผ๋Ÿผ๋ณ„๋กœ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๊ธฐ - int64 ํƒ€์ž…์ธ shipping ์นผ๋Ÿผ๊ณผ item_condition_id ์นผ๋Ÿผ 

 

print('shipping value counts \n', s.shipping.value_counts())  # fairly balanced distribution 

print('item_condition_id value counts', s.item_condition_id.value_counts()) # product condition provided by the seller; the exact meaning of each value is unclear, but 1, 2 and 3 dominate

shipping value counts 
0    819435
1    663100
Name: shipping, dtype: int64
item_condition_id value counts 1    640549
3    432161
2    375479
4     31962
5      2384
Name: item_condition_id, dtype: int64

 

๐Ÿƒ‍โ™€๏ธ ์นผ๋Ÿผ๋ณ„๋กœ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๊ธฐ - object ํƒ€์ž…์ธ item_description ์นผ๋Ÿผ ์‚ดํŽด๋ณด๊ธฐ 

 

# The item_description column has few nulls, but rows with no description are filled with 'No description yet' 👉 check how common this is 

boolean_cond = s['item_description'] == 'No description yet' 

s[boolean_cond]['item_description'].count() # 82489 rows: not usable as a meaningful attribute value 👉 needs to be replaced with something more appropriate
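
The replacement itself is not shown here; one simple option (an illustrative choice, not necessarily what the original notebook does) is to treat the placeholder the same way as a missing value:

# treat the 'No description yet' placeholder like a missing description (illustrative)
s['item_description'] = s['item_description'].replace('No description yet', 'Other_Null')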

 

๐Ÿƒ‍โ™€๏ธ ์นผ๋Ÿผ๋ณ„๋กœ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๊ธฐ - object ํƒ€์ž…์ธ categort_name ์นผ๋Ÿผ ์‚ดํŽด๋ณด๊ธฐ 

 

# Look at the category_name column 
# Categories are stored as a single '/'-separated string,
# e.g. Men/Top/T-shirts 👉 split the string and store each level as its own feature 

def split_cat(category_name) : 
  try:
    return category_name.split('/') 
  except : 
    return ['Other_Null', 'Other_Null', 'Other_Null']  # value returned when category_name is Null 

# large / middle / small category
s['cat_dae'], s['cat_jung'], s['cat_so'] = zip(*s['category_name'].apply(lambda x : split_cat(x)))  

# ✨ Using zip together with * unpacks each element of the returned lists into separate columns 

print('Large category counts : \n', s['cat_dae'].value_counts()) 
print('Number of middle categories :', s['cat_jung'].nunique()) 
print('Number of small categories :', s['cat_so'].nunique()) 


Large category counts : 
Women                     664385
Beauty                    207828
Kids                      171689
Electronics               122690
Men                        93680
Home                       67871
Vintage & Collectibles     46530
Other                      45351
Handmade                   30842
Sports & Outdoors          25342
Other_Null                  6327
Name: cat_dae, dtype: int64
Number of middle categories : 114
Number of small categories : 871
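
To make the ✨ comment above concrete, a tiny standalone example (made-up values) of how zip with * splits a list of lists into parallel columns:

rows = [['Women', 'Tops', 'T-shirts'], ['Men', 'Shoes', 'Sneakers']]
dae, jung, so = zip(*rows)   # one tuple per column
print(dae)    # ('Women', 'Men')
print(jung)   # ('Tops', 'Shoes')
print(so)     # ('T-shirts', 'Sneakers')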

 

๐Ÿƒ‍โ™€๏ธ ๋„๊ฐ’ ์ฑ„์›Œ๋„ฃ๊ธฐ 

 

# Fill nulls in the brand_name, category_name, and item_description columns with 'Other_Null'  

s.brand_name = s.brand_name.fillna(value='Other_Null')
s.category_name = s.category_name.fillna(value='Other_Null')
s.item_description = s.item_description.fillna(value='Other_Null')

 

 

 

2๏ธโƒฃ ํ”ผ์ฒ˜ ์ธ์ฝ”๋”ฉ , ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™”

 

◾ For linear regression, one-hot encoding is strongly preferred 👉 one-hot encode every feature that needs encoding
◾ For feature vectorization, apply count-based vectorization to short text and TF-IDF vectorization to long text

 

1๏ธโƒฃ brand_name ์ƒํ’ˆ์˜ ๋ธŒ๋žœ๋“œ๋ช… : 4810 ๊ฐœ์˜ ์œ ํ˜•์ด๋‚˜, ๋น„๊ต์  ๋ช…๋ฃŒํ•œ ๋ฌธ์ž์—ด๋กœ ๋˜์–ด ์žˆ์œผ๋ฏ€๋กœ ๋ณ„๋„์˜ ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™” ์—†์ด ์ธ์ฝ”๋”ฉ ๋ณ€ํ™˜์„ ์ ์šฉํ•˜๊ธฐ
2๏ธโƒฃ name ์ƒํ’ˆ๋ช… : ๊ฑฐ์˜ ๊ฐœ๋ณ„์ ์ธ ๊ณ ์œ ํ•œ ์ƒํ’ˆ๋ช…์„ ๊ฐ€์ง€๊ณ  ์žˆ์Œ , ์ ์€ ๋‹จ์–ด ์œ„์ฃผ์˜ ํ…์ŠคํŠธ ํ˜•ํƒœ ๐Ÿ‘‰ Count ๊ธฐ๋ฐ˜ ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™” ๋ณ€ํ™˜ ์ ์šฉํ•˜๊ธฐ 
3๏ธโƒฃ category_name : ๋Œ€์ค‘์†Œ ๋ถ„๋ฅ˜ ์„ธ ๊ฐœ์˜ ์นผ๋Ÿผ์œผ๋กœ ๋ถ„๋ฆฌ๋˜์—ˆ๋Š”๋ฐ, ๊ฐ ์นผ๋Ÿผ๋ณ„๋กœ ์›ํ•ซ์ธ์ฝ”๋”ฉ ์ ์šฉํ•˜๊ธฐ 
4๏ธโƒฃ shipping , item_condition_id ๋‘ ์นผ๋Ÿผ ๋ชจ๋‘ ์œ ํ˜•๊ฐ’์˜ ๊ฒฝ์šฐ๊ฐ€ 2๊ฐœ, 5๊ฐœ๋กœ ์ ์œผ๋ฏ€๋กœ ์›ํ•ซ์ธ์ฝ”๋”ฉ ์ˆ˜ํ–‰ 
5๏ธโƒฃ item_description : ํ‰๊ท  ๋ฌธ์ž์—ด ํฌ๊ธฐ๊ฐ€ 145๋กœ ๋น„๊ต์  ํฌ๋ฏ€๋กœ Tfidf ๋ฒกํ„ฐํ™”ํ•˜๊ธฐ

 

๐Ÿƒ‍โ™€๏ธ ํ”ผ์ฒ˜ ๋ฒกํ„ฐํ™” 

 

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cnt_vect = CountVectorizer() 
X_name = cnt_vect.fit_transform(s.name) 

tfidf = TfidfVectorizer(max_features = 50000, ngram_range = (1,3), stop_words = 'english') 
X_descp = tfidf.fit_transform(s.item_description) 

print(X_name.shape) # sparse matrix : (1482535, 105757)
print(X_descp.shape) # sparse matrix : (1482535, 50000)

 

๐Ÿƒ‍โ™€๏ธ ํ”ผ์ฒ˜ ์ธ์ฝ”๋”ฉ : LabelBinarizer ์›ํ•ซ์ธ์ฝ”๋”ฉ์„ ์ด์šฉํ•ด ํฌ์†Œํ–‰๋ ฌ ํ˜•ํƒœ์˜ ์›ํ•ซ์ธ์ฝ”๋”ฉ ๋„์ถœ 

 

# 📌 Convert every column to be encoded into a sparse one-hot encoding with LabelBinarizer : sparse_output = True

from sklearn.preprocessing import LabelBinarizer  

# Convert the brand_name, item_condition_id, and shipping features into sparse one-hot encodings 
lb_brand_name = LabelBinarizer(sparse_output = True) 
X_band = lb_brand_name.fit_transform(s.brand_name)

lb_item_cond_id = LabelBinarizer(sparse_output = True) 
X_item_cond_id = lb_item_cond_id.fit_transform(s.item_condition_id)

lb_shipping = LabelBinarizer(sparse_output = True) 
X_shipping = lb_shipping.fit_transform(s.shipping)


# Convert the large/middle/small category features into sparse one-hot encodings 
lb_cat_dae = LabelBinarizer(sparse_output = True) 
X_cat_dae = lb_cat_dae.fit_transform(s.cat_dae)

lb_cat_so = LabelBinarizer(sparse_output = True) 
X_cat_so = lb_cat_so.fit_transform(s.cat_so)

lb_cat_jung = LabelBinarizer(sparse_output = True) 
X_cat_jung = lb_cat_jung.fit_transform(s.cat_jung)

 

 

print(type(X_band)) # csr_matrix type 
print('X_brand shape : ', X_band.shape)  # the encoding adds as many as 4,810 columns, but since it is combined with the far larger feature-vectorized text columns, this is not a big problem


<class 'scipy.sparse.csr.csr_matrix'>
X_brand shape :  (1482535, 4810)

 

 

 

Combine X_name, X_descp, and the one-hot encoded features (all as sparse matrices) into one matrix! 👉 hstack()

 

๐Ÿƒ‍โ™€๏ธ ํ•˜๋‚˜์˜ feature ํ–‰๋ ฌ X ๋„์ถœํ•˜๊ธฐ 

from scipy.sparse import hstack  
import gc
sparse_matrix_list = (X_name, X_descp, X_band, X_item_cond_id, X_shipping, X_cat_dae, X_cat_jung, X_cat_so)  
X_features_sparse = hstack(sparse_matrix_list).tocsr() 


print(type(X_features_sparse), X_features_sparse.shape)


<class 'scipy.sparse.csr.csr_matrix'> (1482535, 161569)

 

 

 

 

3๏ธโƒฃ ํšŒ๊ท€ ๋ชจ๋ธ๋ง

 

Evaluation metric : RMSLE

 

def rmsle(y , y_pred):
    # use log1p instead of log to avoid underflow/overflow when computing RMSLE 
    return np.sqrt(np.mean(np.power(np.log1p(y) - np.log1p(y_pred), 2)))

def evaluate_org_price(y_test , preds): 
    
    # the target was log1p-transformed, so restore the original scale with expm1 
    preds_exmpm = np.expm1(preds)
    y_test_exmpm = np.expm1(y_test)
    
    # compute the RMSLE on the original price scale
    rmsle_result = rmsle(y_test_exmpm, preds_exmpm)
    return rmsle_result

 

 

import gc 
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split

def model_train_predict(model,matrix_list):
    # combine the sparse matrices with scipy.sparse hstack
    X= hstack(matrix_list).tocsr()     #⭐
    
    X_train, X_test, y_train, y_test=train_test_split(X, s['price'], 
                                                      test_size=0.2, random_state=156)
    
    # train the model and predict
    model.fit(X_train , y_train)
    preds = model.predict(X_test)
    
    del X , X_train , X_test , y_train 
    gc.collect()
    
    return preds , y_test

 

๐Ÿƒ‍โ™€๏ธ ๋ฆฟ์ง€ํšŒ๊ท€ 

 

from sklearn.linear_model import Ridge

linear_model = Ridge(solver = "lsqr", fit_intercept=False)

sparse_matrix_list = (X_name, X_band, X_item_cond_id,
                      X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
linear_preds , y_test = model_train_predict(model=linear_model ,matrix_list=sparse_matrix_list)
print('RMSLE without Item Description:', evaluate_org_price(y_test , linear_preds))

sparse_matrix_list = (X_descp, X_name, X_band, X_item_cond_id,
                      X_shipping, X_cat_dae, X_cat_jung, X_cat_so)
linear_preds , y_test = model_train_predict(model=linear_model , matrix_list=sparse_matrix_list)
print('RMSLE with Item Description:',  evaluate_org_price(y_test ,linear_preds)) 




RMSLE without Item Description: 0.5023727038010544
RMSLE with Item Description: 0.4712195143433641

โญ ํฌํ•จํ–ˆ์„ ๋•Œ rmsle ๊ฐ’์ด ๋งŽ์ด ๊ฐ์†Œ ๐Ÿ‘‰ ์˜ํ–ฅ์ด ์ค‘์š”ํ•จ

 

 

 

  • (Skipped here because of the long runtime, but) LightGBM alone scores an RMSLE of about 0.455, and a weighted ensemble of the two models (0.45 × the LightGBM predictions + 0.55 × the linear model predictions) comes out to about 0.450 — see the sketch below.
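
A minimal sketch of that LightGBM run and the weighted blend (the hyperparameters are illustrative, not the exact values used):

from lightgbm import LGBMRegressor

sparse_matrix_list = (X_descp, X_name, X_band, X_item_cond_id,
                      X_shipping, X_cat_dae, X_cat_jung, X_cat_so)

lgbm_model = LGBMRegressor(n_estimators=200, learning_rate=0.5, num_leaves=125, random_state=156)
lgbm_preds, y_test = model_train_predict(model=lgbm_model, matrix_list=sparse_matrix_list)
print('RMSLE with LightGBM:', evaluate_org_price(y_test, lgbm_preds))

# weighted ensemble: 0.45 * LightGBM predictions + 0.55 * Ridge predictions
ensemble_preds = lgbm_preds * 0.45 + linear_preds * 0.55
print('RMSLE with the ensemble:', evaluate_org_price(y_test, ensemble_preds))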

 
