
[05. Clustering] K-means, Mean Shift, GMM, DBSCAN

by isdawell 2022. 5. 7.

1️⃣ K-means clustering

👀 Overview

💡 k-means clustering

✔ The most commonly used clustering algorithm.

✔ A clustering technique that picks points called centroids (cluster centers) and groups with each centroid the data points closest to it.

https://www.blopig.com/blog/2020/07/k-means-clustering-made-simple/

1. Set k cluster centroids.
2. Assign each data point to its nearest centroid.
3. Compute the mean of the data points assigned to each centroid and make that mean the new centroid.
4. Reassign each data point to the nearest of the new centroids.

👉 Repeat until the centroids no longer move.
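
The four steps above can be sketched in plain NumPy. This is only a minimal illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialization (instead of k-means++), and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=300, seed=0):
    """Hypothetical helper: plain k-means with random initial centroids."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2 and 4: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: the new centroid is the mean of the points assigned to it
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (assumed): two well-separated 2-D blobs of 50 points each
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids = kmeans_sketch(X, k=2)
print(centroids)
```

In practice `sklearn.cluster.KMeans` is the better choice, since it adds k-means++ initialization and repeated restarts.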

 

 

💡 Pros and cons

💨 Pros

✔ The most widely used algorithm for general-purpose clustering.

✔ The algorithm is simple and concise.

💨 Cons

✔ It is distance-based, so clustering accuracy drops as the number of features grows. PCA dimensionality reduction is therefore often applied first.

✔ Execution becomes slow when many iterations are needed.

✔ Choosing the number of clusters k is difficult (trial and error).

 

 

 

💡 scikit-learn k-means

💨 Parameters

from sklearn.cluster import KMeans
KMeans(n_clusters=k, init='k-means++', max_iter=n, random_state=0)

n_clusters = the number of clusters to form

init = how the initial centroid coordinates are chosen; usually set to 'k-means++'

max_iter = the maximum number of iterations; the algorithm stops earlier if the centroids stop moving before this limit is reached

💨 Main attributes

→ .labels_ : the cluster label assigned to each data point

→ .cluster_centers_ : the coordinates of each centroid, which can be used to plot the centroids

⭐ K-means clustering works very well when the data within each cluster is spread out in a roughly circular shape.

 


 

📌 Practice: iris data

🏃‍♀️ Import libraries + load data

 

from sklearn.preprocessing import scale
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline

iris = load_iris() 

irisdf = pd.DataFrame(data=iris.data, columns = ['sepal_length','sepal_width','petal_length','petal_width'])
irisdf.head(3)

 

 

🏃‍♀️ Fit the clustering model

 

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter = 300, random_state = 0) 
kmeans.fit(irisdf)

print(kmeans.labels_)

 

 

 

🏃‍♀️ Compare with the original target labels

# Compare the results
irisdf['target'] = iris.target
irisdf['cluster'] = kmeans.labels_
# Group by target, then by cluster, and count the rows in each group
# (sepal_length is only the column used for counting)
iris_result = irisdf.groupby(['target', 'cluster'])['sepal_length'].count()
print(iris_result)

 

 

 

🏃‍♀️ Visualize as 2-D data using PCA (visualizing the clustering result)

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca_transformed = pca.fit_transform(iris.data)

irisdf['pca_x'] = pca_transformed[:,0]
irisdf['pca_y'] = pca_transformed[:,1]

# Extract a separate Index for each cluster value 0, 1, 2
marker0_ind = irisdf[irisdf['cluster']==0].index
marker1_ind = irisdf[irisdf['cluster']==1].index
marker2_ind = irisdf[irisdf['cluster']==2].index

# Use those Indexes to pull pca_x, pca_y for each cluster and draw them with the markers o, s, ^
plt.scatter(x=irisdf.loc[marker0_ind,'pca_x'], y=irisdf.loc[marker0_ind,'pca_y'], marker='o')
plt.scatter(x=irisdf.loc[marker1_ind,'pca_x'], y=irisdf.loc[marker1_ind,'pca_y'], marker='s')
plt.scatter(x=irisdf.loc[marker2_ind,'pca_x'], y=irisdf.loc[marker2_ind,'pca_y'], marker='^')

plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.title('3 Clusters Visualization by 2 PCA Components')
plt.show()

 

 

 

 

 

 

2️⃣ Cluster evaluation

👀 Overview

💡 Cluster evaluation

✔ Most data sets used for clustering have no target variable.

✔ Clustering aims to find hidden groups in the data and give them meaning, to find finer-grained clusters within a single class label, or to find broader cluster levels that span data with different class labels.

✔ Because clustering is unsupervised, performance is hard to evaluate precisely, but silhouette analysis is the typical method.

💡 Silhouette analysis

💨 What silhouette analysis is

✔ It indicates how well the clusters are separated from one another.

  • Well separated means other clusters are far away while the data within the same cluster is tightly packed.
  • The better the clustering, the more the individual clusters are set apart with a similar amount of space between them.

 

 

💡 Silhouette coefficient

✔ A clustering metric defined for each individual data point

  • It shows how closely a data point is grouped with the data in its own cluster and how far it is separated from the data in other clusters.

Silhouette coefficient:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$

  • a(i) : the mean distance from the data point to the other data points in the same cluster
  • b(i) : the mean distance from the data point to the nearest cluster it does not belong to
  • b(i) - a(i) : the gap between the two clusters 👉 divided by max(a(i), b(i)) to normalize the value
  • s(i) : the silhouette coefficient of the i-th data point

✔ Interpreting the value

  • It takes a value between -1 and 1.
  • The closer to 1, the farther the point is from the neighboring cluster 👉 good
  • The closer to 0, the closer the point is to the neighboring cluster.
  • A negative value means the data point has been assigned to the wrong cluster.
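
The definitions of a(i), b(i), and s(i) can be checked with a short sketch that computes the coefficient by hand and compares it with scikit-learn. The helper name `silhouette_of` and the explicit `n_init=10` are assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances, silhouette_samples

X = load_iris().data
labels = KMeans(n_clusters=3, init='k-means++', n_init=10,
                max_iter=300, random_state=0).fit_predict(X)

def silhouette_of(i, X, labels):
    """Hypothetical helper: silhouette coefficient s(i) of one data point."""
    d = pairwise_distances(X[i:i + 1], X)[0]   # Euclidean distances from point i
    own = labels == labels[i]
    if own.sum() == 1:                         # a single-point cluster scores 0
        return 0.0
    # a(i): mean distance to the other points of the same cluster
    # (the point's distance to itself is 0, so divide by size - 1)
    a = d[own].sum() / (own.sum() - 1)
    # b(i): smallest mean distance to the points of any other cluster
    b = min(d[labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of(i, X, labels) for i in range(len(X))])
print(np.allclose(manual, silhouette_samples(X, labels)))
```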

 

✔ scikit-learn functions

sklearn.metrics.silhouette_samples( X, labels, metric = 'euclidean' )
  • labels : the cluster labels produced by clustering
  • Computes and returns the silhouette coefficient of each data point.

sklearn.metrics.silhouette_score( X, labels, metric = 'euclidean' , sample_size=None)
  • Returns the mean of the silhouette coefficients over all data.
  • In general, a higher value suggests the clustering went reasonably well. A high value alone, however, does not prove that the clustering is right.

 

 

 

💡 Conditions for a good clustering

💨 The silhouette_score value lies between -1 and 1; for a good clustering it falls between 0 and 1, and the closer to 1 the better.

💨 Besides a high overall mean, the per-cluster means should not deviate much. In other words, the mean silhouette coefficient of each cluster should stay close to the mean silhouette coefficient of the whole data set.

 

 

 

💡 Finding the best K by visualizing the per-cluster mean silhouette coefficients

💨 A high mean silhouette coefficient over all data does not necessarily mean the data was clustered with the optimal number of clusters.

💨 The mean can still be high when one cluster has very high silhouette coefficients while another cluster's points are far apart internally and score low.

💨 An appropriate number of clusters K has been chosen when the clusters keep a reasonable distance from each other and the data within each cluster is tightly packed.

 

 

 

✔ Interpreting the silhouette coefficients for k=2

  • Red dashed line 👉 the mean silhouette coefficient over all data
  • All data in cluster 1 is at or above the mean silhouette coefficient, but cluster 2 has very many data points below the mean.
  • The scatter plot of the data agrees: cluster 1 is far from cluster 0 and its data is tightly packed, while the data inside cluster 0 is widely spread.

✔ Interpreting the silhouette coefficients for k=3

Clusters 1 and 2 have silhouette coefficients above the mean, but all of cluster 0 is below the mean. The data distribution on the right also shows that the points inside cluster 0 are far apart and that it sits close to cluster 2.

✔ Interpreting the silhouette coefficients for k=4

The mean silhouette coefficients of the individual clusters are fairly uniform. All data in cluster 1 is above the mean, and about one third of cluster 3 is above the mean. The overall silhouette coefficient is lower than for k=2, but based on the visualization this can be judged the most suitable number of clusters.

 

 

 

💡 Inertia and the elbow

💨 Finding the best number of clusters k (K-means clustering)

The inertia value is read from the .inertia_ attribute of the fitted model.

✔ Visualize using model.inertia_
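
A minimal sketch of the elbow approach, assuming the iris data and a k range of 1 to 7 (both chosen only for illustration): fit KMeans for each k, collect `inertia_`, and look for the k where the decrease flattens out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

inertias = {}
for k in range(1, 8):
    # n_init=10 is set explicitly here (an assumption) so the result does not
    # depend on the default of the installed scikit-learn version
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_   # sum of squared distances to the nearest centroid

for k, value in inertias.items():
    print(k, round(value, 1))
```

Plotting `list(inertias.keys())` against `list(inertias.values())` with `plt.plot` gives the usual elbow curve.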

 

 

 

💡 AIC and BIC

💨 The inertia and silhouette approaches used with k-means are not reliable when the clusters are elliptical or differ in size, and they cannot be used with Gaussian mixture models.

💨 Choose the number of clusters k that minimizes the AIC and BIC values.

💨 This also applies to GMM models, where it is used to find the optimal n_components.

✔ Get the AIC and BIC values with model.aic(X) and model.bic(X) and plot them
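
As a hedged sketch of this selection, the loop below fits a `GaussianMixture` for several candidate `n_components` values and keeps the one with the lowest BIC. The candidate range 1 to 6, `n_init=5`, and the use of the iris data are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data

scores = []
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    # aic(X) and bic(X) are methods that take the data, not plain attributes
    scores.append((k, gm.aic(X), gm.bic(X)))

best_k_aic = min(scores, key=lambda s: s[1])[0]
best_k_bic = min(scores, key=lambda s: s[2])[0]
print('best k by AIC:', best_k_aic, '/ best k by BIC:', best_k_bic)
```

AIC and BIC can disagree; BIC penalizes extra components more strongly, so it tends to pick the smaller model.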

 

 

 

 

💡 Likelihood function

✔ P(X|θ) = given that the parameter θ is known, how plausible it is for the model to produce x = a function of x

✔ L(θ|X) = given that the observed data x is known, how plausible a particular parameter θ is = a function of θ

✔ The goal is to find the number of clusters for which the parameter θ maximizes the likelihood.

 

 


 

📌 Practice: iris data

🏃‍♀️ Computing silhouette coefficients

 

from sklearn.preprocessing import scale
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans


# ⭐ silhouette analysis metrics
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
%matplotlib inline


iris = load_iris() 
feature_name =  ['sepal_length','sepal_width','petal_length','petal_width'] 
irisdf = pd.DataFrame(data=iris.data, columns = feature_name) 
kmeans = KMeans(n_clusters=3, init = 'k-means++', max_iter=300, random_state=0).fit(irisdf)   

irisdf['cluster'] = kmeans.labels_ 

# Compute the silhouette coefficient of every individual data point
score_samples = silhouette_samples(iris.data, irisdf['cluster']) # X, labels
print('Shape of the silhouette_samples return value:', score_samples.shape)

irisdf['silhouette_coeff'] = score_samples

# Compute the mean silhouette coefficient over all data
average_score = silhouette_score(iris.data, irisdf['cluster'])
print('Iris dataset silhouette score : {0:.3f}'.format(average_score))

irisdf.head(3)

#📌 The data in cluster 1 has a high silhouette coefficient of about 0.8, but the other clusters score below the mean,
# so the final mean silhouette coefficient over all data is 0.553
#📌 The silhouette coefficients need to be examined per cluster.

irisdf.groupby('cluster')['silhouette_coeff'].mean()

#📌 The means differ considerably between clusters.

 

 

 

🏃‍♀️ Visualizing silhouette coefficients

 

def viz_silhouette(cluster_lists, X_features) : 

  # numpy is needed below for np.arange, so import it here as well
  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.metrics import silhouette_samples, silhouette_score

  import matplotlib.pyplot as plt
  import matplotlib.cm as cm

  n_cols = len(cluster_lists)

  fig,axs = plt.subplots(figsize=(4*n_cols, 4), nrows=1, ncols=n_cols) 

  for ind, n_cluster in enumerate(cluster_lists) : 

    clusterer = KMeans(n_clusters = n_cluster, max_iter = 500, random_state=0) 
    cluster_labels = clusterer.fit_predict(X_features) 

    sil_avg = silhouette_score(X_features, cluster_labels) 
    sil_values = silhouette_samples(X_features, cluster_labels) 

    y_lower = 10 

    axs[ind].set_title('Number of Cluster : '+ str(n_cluster)+'\n' \
                          'Silhouette Score :' + str(round(sil_avg,3)) )
    axs[ind].set_xlabel("The silhouette coefficient values")
    axs[ind].set_ylabel("Cluster label")
    axs[ind].set_xlim([-0.1, 1])
    axs[ind].set_ylim([0, len(X_features) + (n_cluster + 1) * 10])
    axs[ind].set_yticks([])  # Clear the yaxis labels / ticks
    axs[ind].set_xticks([0, 0.2, 0.4, 0.6, 0.8, 1])

    for i in range(n_cluster) : 
      ith_cluster_sil_values = sil_values[cluster_labels==i] 
      ith_cluster_sil_values.sort() 

      size_cluster_i = ith_cluster_sil_values.shape[0] 
      y_upper = y_lower + size_cluster_i

      color = cm.nipy_spectral(float(i) / n_cluster)
      axs[ind].fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_sil_values, \
                                facecolor=color, edgecolor=color, alpha=0.7)
      axs[ind].text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
      y_lower = y_upper + 10
    
    axs[ind].axvline(x=sil_avg, color="red", linestyle="--")

 

from sklearn.datasets import load_iris

iris=load_iris()
viz_silhouette([ 2, 3, 4,5 ], iris.data)

 

Two clusters is the appropriate number!

👀 Drawback: the run time grows sharply as the amount of data increases

 

 

 

 

 

 

 

3️⃣ Mean shift

👀 Overview

💡 Mean shift

✔ Like K-means clustering, it clusters by repeatedly moving a center toward the center of a cluster.

✔ Here the center is moved toward the region where the data is densest 👉 the cluster centers are found from the data distribution (probability density function).

✔ Cluster center = the peak of the probability density function = the point where the data is most concentrated

✔ KDE is used to estimate the probability density function: the distances between a data point and its surrounding data are fed into the KDE function, and the returned value is used to update the current position step by step.

 

 

from sklearn.cluster import MeanShift

 

  • The .cluster_centers_ attribute gives the coordinates of the cluster centers, which can be drawn when plotting the clustering result.

# X is the feature matrix to cluster and np is numpy (both assumed to be defined beforehand)
meanshift = MeanShift(bandwidth=0.8)
cluster_labels = meanshift.fit_predict(X)

print('Unique cluster labels :' , np.unique(cluster_labels))

 

 

 

🤸‍♀️ Reference for understanding KDE : https://darkpgmr.tistory.com/147

   → Apply the kernel function to each observed data point, add up the values, and divide by the number of data points to estimate the probability density function.

   → A non-parametric density estimation method that improves on the problems of the histogram method

   → Typical kernel function : the Gaussian distribution function

$$\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

K = kernel function, x = value of the random variable, xi = observed value, h = bandwidth, n = number of observations
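
The estimate described here (apply the kernel to each observation, sum, and divide by the number of observations) can be sketched for the one-dimensional case as follows. The function names, the sample values, and the evaluation grid are assumptions made for this illustration.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel function K."""
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(x, samples, h):
    """1-D kernel density estimate: sum of K((x - x_i) / h) divided by n * h."""
    x = np.atleast_1d(x)
    u = (x[:, None] - samples[None, :]) / h
    return gaussian_kernel(u).sum(axis=1) / (len(samples) * h)

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])   # assumed observations
grid = np.linspace(-10, 15, 2001)

for h in (1.0, 10.0):
    density = kde(grid, samples, h)
    # a small h gives a tall, spiky estimate; a large h gives a low, flat one
    print('h =', h, 'peak density =', round(float(density.max()), 4))
```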

 

 

🤸‍♀️ What a kernel function is

A function that satisfies three conditions: it integrates to 1, it is symmetric about the origin, and it is non-negative

 

 

 

💡 Bandwidth h

✔ It controls whether the KDE is smoothed into a gentle shape or left in a spiky shape.

✔ The value of h can strongly affect the quality of the density estimate. A small h (1.0) gives a narrow, spiky KDE that tends to overfit. A very large h (10) over-smooths, estimating the probability density function in an overly simplified way, and tends to underfit.

✔ Finding a suitable bandwidth h is very important for KDE-based mean shift clustering.

Top left: h=1, bottom right: h=10

  • The larger the bandwidth h, the smoother the KDE and the fewer cluster centers it produces.
  • Mean shift clustering does not take the number of clusters in advance; the clustering depends only on the bandwidth size 👉 setting the best h is very important 👉 estimate_bandwidth

 

from sklearn.cluster import estimate_bandwidth

bandwidth = estimate_bandwidth(X)
print('Optimal bandwidth value : ' , bandwidth)

 

💡 Pros and cons

💨 Pros

✔ It does not assume a particular shape for the data set, so it clusters flexibly.

✔ Outliers have little influence.

✔ The number of clusters does not need to be set in advance.

💨 Cons

✔ The algorithm takes a long time to run.

✔ The clustering result depends heavily on h.

👉 Mean shift clustering performs better in computer vision (object detection, motion detection) than on data sets from business analytics.

 

 

 


 

📌 Practice: iris data

🏃‍♀️ Deriving the optimal bandwidth h

 

import pandas as pd 
import numpy as np 
from sklearn.datasets import load_iris
from sklearn.cluster import MeanShift 
from sklearn.cluster import estimate_bandwidth 

iris = load_iris() 
feature_name =  ['sepal_length','sepal_width','petal_length','petal_width'] 
irisdf = pd.DataFrame(data=iris.data, columns = feature_name) 

bandwidth = estimate_bandwidth(iris.data) 
print('Bandwidth value : ', bandwidth)

## Bandwidth value :  1.2020768127998687

 

🏃‍♀️ Mean shift clustering with the optimal bandwidth

 

irisdf['target'] = iris.target 

meanshift = MeanShift(bandwidth = bandwidth) 
cluster_labels = meanshift.fit_predict(iris.data)  

print('Unique cluster labels : ', np.unique(cluster_labels))

## Unique cluster labels :  [0 1] 📌 2 clusters

 

 

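🏃‍♀️ Visualization

🏃‍♀️ Compare with the target distribution

# Store the mean shift labels in the DataFrame before grouping on them
irisdf['meanshift_label'] = cluster_labels
print(irisdf.groupby('target')['meanshift_label'].value_counts())

Note) In the original data, labels 0, 1, and 2 each had 50 samples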

 

 

 

 

 

4️⃣ GMM

👀 Overview

💡 Gaussian Mixture Model

✔ Clusters under the assumption that the data was generated by mixing several groups of data, each following its own Gaussian distribution 👉 the data is treated as a mixture of several Gaussian distributions.

✔ Gaussian distribution (normal distribution) 👉 a continuous probability function with a symmetric bell shape

✔ The whole data set is made up of several probability distribution curves with different normal shapes, and clustering is based on these different normal distributions 👉 several normal distribution curves are extracted from the data set, and each data point is assigned to the normal distribution it belongs to.

💨 Parameter estimation

① The mean and variance of each individual normal distribution

② The probability that each data point belongs to each normal distribution

💨 Parameters are estimated with the EM algorithm (an iterative algorithm) : supported by scikit-learn's GaussianMixture class

✔  Reference for GMM and the EM algorithm : https://angeloyeo.github.io/2021/02/08/GMM_and_EM.html 

   → Assuming normal distributions, start from randomly chosen parameters, obtain labels, then use those labels to re-estimate the distributions, and repeat to perform the clustering

 

from sklearn.mixture import GaussianMixture

 

⭐ n_components : the total number of Gaussian models (the counterpart of K in K-means clustering)

💨 kmeans vs GMM

With GMM, the data is clustered accurately along the direction in which it is distributed.

💨 Use in outlier detection : set a density threshold and treat every sample in a low-density region as an outlier.
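
A minimal sketch of this density-threshold idea, assuming the iris data, three components, and a cut-off at the lowest 4% of densities (the 4% figure is an arbitrary choice for the example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# score_samples returns the log of the estimated density at each sample
log_density = gmm.score_samples(X)

# Assumed threshold: samples in the lowest 4% of densities count as outliers
threshold = np.percentile(log_density, 4)
outliers = X[log_density < threshold]
print('number of outliers:', len(outliers))
```

The threshold is a judgment call; in practice it is tuned to the expected share of anomalies in the data.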

 

 

 

💡 Bayesian Gaussian Mixture Model

✔ Drives the weights of unnecessary clusters to 0 or close to 0

✔ The optimal number of clusters does not have to be found by hand.

✔ Uses scikit-learn's BayesianGaussianMixture class

✔ Even if n_components is set larger than the optimal number of clusters, the unnecessary clusters are removed automatically.

 

from sklearn.mixture import BayesianGaussianMixture

bgm = BayesianGaussianMixture(n_components = 10, n_init = 10, random_state=0)

# fit_predict fits the model and returns the labels, so a separate fit call is not needed
bgm_label = bgm.fit_predict(X)
clusterdf['bgm_label'] = bgm_label

# Visualization

 

 

np.round(bgm.weights_, 2) 

## array([0.34, 0.34, 0.33, 0. , 0., 0., 0., 0., 0., 0.])

 

→  Checking the weights with .weights_ shows that only 3 of the 10 requested clusters are needed. The weights of the remaining clusters all become 0

 

 

 

 


 

📌 Practice: iris data

🏃‍♀️ GMM clustering

 

from sklearn.mixture import GaussianMixture  
gmm = GaussianMixture(n_components=3, random_state=0).fit(iris.data) 
gmm_cluster_labels = gmm.predict(iris.data) 

irisdf['gmm_cluster'] = gmm_cluster_labels 
irisdf['target'] = iris.target 

iris_result = irisdf.groupby(['target'])['gmm_cluster'].value_counts() 
print(iris_result)

 

 

🏃‍♀️ Visualization

👉 The result is better than the K-means result. This does not mean that one algorithm is superior; the iris data set simply suits GMM clustering better.

⭐ GMM can be applied more flexibly to a variety of data sets than K-means (→ which does not work well on non-circular data, for example elliptically distributed data). It does, however, take longer to run.

 

 

 

 

 

 

5️⃣ DBSCAN

👀 Overview

Clusters by connecting core points that meet a density criterion: at least a minimum number of data points within the epsilon neighborhood

DBSCAN also clusters data distributions with varied geometric shapes well.

💡 DBSCAN

✔ The representative density-based clustering algorithm

✔ Because it relies on differences in data density within a region, it clusters effectively even when the data distribution is geometrically complex

 

 

 

💡 Parameters

from sklearn.cluster import DBSCAN

✔ epsilon : the circular region of radius epsilon centered on each data point 👉 eps

   → Usually set to a value of 1 or less.

   → A larger value widens the radius and covers more data, so the number of noise points decreases

✔ min points : the minimum number of data points that must lie inside the epsilon neighborhood for a point to be a core point 👉 min_samples

   → A larger value requires more data inside the radius, so the number of noise points increases

Assuming minpoints = 5

 

 

 

💡 Key concepts

◽  Core point : a point that has at least the minimum number of data points inside its neighborhood (point A)

   👉 Clusters are built by connecting a core point with the other core points directly reachable from it.

◽  Neighbor point : another data point located inside the neighborhood

◽  Border point : a point that does not have the minimum number of data points inside its neighborhood but has a core point as a neighbor (point B)

   👉 Forms the outer edge of a cluster.

◽  Noise point : a point that has fewer neighbors than the minimum and has no core point as a neighbor (point C)
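
The three point types can be recovered from a fitted scikit-learn model, as in the sketch below. It assumes the iris data and the parameters eps=0.6, min_samples=8; the mask names are chosen only for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data
db = DBSCAN(eps=0.6, min_samples=8, metric='euclidean').fit(X)

# Core points: indices listed in core_sample_indices_
is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True

# Noise points: label -1
is_noise = db.labels_ == -1

# Border points: belong to a cluster but are not core points themselves
is_border = ~is_core & ~is_noise

print('core:', is_core.sum(), 'border:', is_border.sum(), 'noise:', is_noise.sum())
```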

 

 

 

 

 


 

📌 Practice

🏃‍♀️ DBSCAN clustering

 

from sklearn.cluster import DBSCAN 

dbscan = DBSCAN(eps=0.6, min_samples = 8, metric = 'euclidean') 
dbscan_label = dbscan.fit_predict(iris.data) 
irisdf['dbscan_cluster'] = dbscan_label
iris_result = irisdf.groupby(['target'])['dbscan_cluster'].value_counts()

print(iris_result)

#📌 The data is split into two clusters, 0 and 1

#📌 -1 : marks the points that belong to noise

#📌 DBSCAN determines the number of clusters automatically from the algorithm

 

 

🏃‍♀️ Visualization

👉 The triangles are the noise points, so outliers can be seen at a glance.

👉 Increasing eps to 0.8 reduces the number of noise points.
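
The effect of eps on the noise count can be checked with a short loop. This sketch assumes the iris data and keeps min_samples=8; only eps changes.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris

X = load_iris().data

noise_counts = {}
for eps in (0.6, 0.8):
    labels = DBSCAN(eps=eps, min_samples=8, metric='euclidean').fit_predict(X)
    # Points labelled -1 are noise
    noise_counts[eps] = int((labels == -1).sum())

print(noise_counts)
```

With min_samples fixed, a larger eps can only keep or enlarge each neighborhood, so the noise count never goes up.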

 

 

 

 

 

 

 

 

6️⃣ Agglomerative clustering

👀 Overview

💡 Agglomerative clustering

✔ Starts with every data point as its own cluster and keeps merging the two most similar clusters until the specified number of clusters remains

✔ Ways of merging two clusters

a. Ward : merges the two clusters that increase the within-cluster variance the least, which produces clusters of fairly similar size.

b. Average : merges the two clusters with the smallest average distance between their points

c. Complete  : merges the two clusters with the smallest maximum distance between their points

💨 When the clusters differ greatly in the number of points (one cluster is much larger than the others), average or complete works better.

✔ Agglomerative clustering produces a hierarchical clustering. That is, small clusters join to form larger clusters in a hierarchical structure.

 

 

 

💡 Hierarchical clustering

✔ An algorithm that uses a hierarchical tree model to merge individual data points into similar clusters step by step

✔ It can be trained without fixing the number of clusters in advance.

✔ Dendrogram : a tool for visualizing hierarchical clustering that can handle multi-dimensional data sets.

The branch length shows how far apart the merged clusters are.

  • Cutting the dendrogram with a horizontal line splits the data into any number of clusters. Unlike k-means, which needs the number of clusters before clustering, hierarchical algorithms such as agglomerative clustering let the user choose the clusters after the clustering is done, by looking at the dendrogram.
  • However, like kmeans, it is based on distances between data points, so it cannot separate data sets with complex shapes.
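
Both routes can be sketched on the iris data: scikit-learn's `AgglomerativeClustering` with a fixed number of clusters, and SciPy's `linkage` plus `fcluster`, which builds the full merge tree first and cuts it afterwards. The use of SciPy, Ward linkage, and three clusters are assumptions made for this example.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris

X = load_iris().data

# Route 1: merge until the requested number of clusters remains
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
agg_labels = agg.fit_predict(X)

# Route 2: build the whole merge history, then cut the tree afterwards
Z = linkage(X, method='ward')                          # one row per merge
cut_labels = fcluster(Z, t=3, criterion='maxclust')    # cut into at most 3 clusters

print(len(set(agg_labels)), len(set(cut_labels)))
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree, where the height of each join reflects the distance between the merged clusters.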

 

 

 

 


 

 

 

 

🧭 Clustering summary

1. Kmeans
Clusters by moving the centroids based on distance
Hard to apply to complex data sets + hard to choose a suitable number of clusters K
The clustering result is evaluated with the silhouette coefficient
* The best k is found with the elbow method

2. Mean Shift
Clusters by moving the centroids toward where the data is densest
Computes the data density with a KDE function
Used more in computer vision, for object separation and motion tracking in image and video data, than on structured data sets.

3. GMM
Clusters under the assumption that the data was generated by a mixture of several Gaussian distributions
Applies more flexibly than Kmeans to a variety of data sets
Drawback: long run time

4. DBSCAN
The representative density-based clustering algorithm
Distinguishes core points and other point types by whether the minimum number of data points lies within epsilon, and builds clusters by connecting core points
Clusters effectively even when the data distribution is geometrically complex

 
