
1️⃣ AI•DS/📕 Machine Learning (12)

Uplift modeling Reference article 1, Reference article 2 • An uplift model predicts the incremental value that a response to a treatment can yield. https://pylift.readthedocs.io/en/latest/index.html (pylift 0.1.3 documentation) pylift is an uplift library that provides, primarily, (1) fast uplift modeling implementations and (2) evaluation tools. While other packages and more exact meth.. 2023. 6. 6.
Data Mining Classification (decision tree) 1. Basic Concepts ① Definition • Classification task • Given a collection of records (training set), we find a model for the class attribute as a function of the values of other attributes. Each record contains a set of attributes, and one of the attributes is the class. • Previously unseen records (test set) should be assigned a class as accurately as possible ↪ A test set is used to determine .. 2023. 4. 15.
Data Mining Association analysis 1. Basic Concepts ① Overview • Motivation : finding inherent regularities in data ↪ Which products are purchased together? ↪ Which products are purchased right after buying a PC? ↪ Which kinds of DNA are sensitive to a new drug? ↪ Can web documents be classified automatically? ⇨ Let's look for the association rules • Application ↪ Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis ② Association Rule Mining • Given a set of .. 2023. 3. 29.
Data Mining Preprocessing ③ 1. Data Cleaning ① Data quality → why we preprocess • Accuracy, Completeness, Consistency, Timeliness, Believability, Interpretability ② Data Cleansing • Data in the real world is dirty • Incomplete, Noisy, Inconsistent, Intentional ③ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data, e.g. missing data ↪ a sensor device broke down, the information is not easy to collect, children.. 2023. 3. 29.
Data Mining Preprocessing ② 1. Types of data sets • Relational Records ↪ collections of records, each of which consists of a fixed set of attributes • Data matrix: record data ↪ If the data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute ↪ m by n matrix, m rows, n columns • Document data: r.. 2023. 3. 15.
Data Mining Preprocessing ① 1. GOAL of the course • Basic concepts of data mining • Data preprocessing • Association, correlation, and frequent pattern analysis • Classification • Cluster and outlier analysis • Data Mining: Industry efforts and social impacts 2. Technology Trend • Explosive growth of data : from terabytes to petabytes ↪ Big data, Internet of Things, Web2.0, Scientific simulation • Motivation of Data Mining ↪ In.. 2023. 3. 15.
[05. Clustering] K-means, Mean Shift, GMM, DBSCAN 1️⃣ K-means clustering 👀 Overview 💡 k-means clustering ✔ The most commonly used clustering algorithm ✔ A clustering technique that picks specific points called centroids (cluster centers) and groups together the points closest to each center. 1. Set k cluster centroids 2. Each data point is assigned to its nearest centroid 3. Compute the mean of the data points assigned to each centroid and make it the new centroid 4. Each data point is reassigned to the nearest of the new centroids 👉 Repeat until the centroids no longer move 💡 Pros and cons 💨 Pros ✔ The most widely used algorithm for general clustering ✔ The algorithm is simple and concise 💨 Cons ✔ Because it is distance-based, when the number of attributes is large, clustering .. 2022. 5. 7.
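The four k-means steps listed in the excerpt above can be sketched in plain Python. This is a minimal illustration, not the post's own code: the function names (`k_means`, `squared_distance`, `mean_point`) are hypothetical, and taking the first k points as initial centroids is an assumption made only to keep the run deterministic (libraries usually use smarter initialisation).

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100):
    # Step 1: set k centroids (assumption: simply use the first k points).
    centroids = list(points[:k])
    labels = None
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop when assignments no longer change, so centroids stop moving.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, labels


if __name__ == "__main__":
    data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.8, 1.1), (7.9, 8.3)]
    centroids, labels = k_means(data, k=2)
    print(labels)  # [0, 1, 0, 1, 0, 1]
```

Because every step relies on distances, the result depends on feature scale and on the initial centroids, which is consistent with the distance-based drawback the excerpt starts to describe.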
[06. Dimensionality Reduction] PCA, LDA, SVD, NMF 01. Dimensionality reduction 👀 Overview Reducing the dimensions of a multi-dimensional data set made up of very many features to produce a data set with new dimensions. Algorithms: PCA, LDA, SVD, NMF. High dimensionality: sparse structure 👉 lower prediction reliability; features may be highly correlated 👉 multicollinearity degrades prediction performance. Low dimensionality: visualization becomes possible, so the data can be interpreted intuitively; the processing power needed for training can be reduced. 📌 Feature selection vs feature extraction Feature selection : removes unnecessary features that depend strongly on other features and keeps only the main features that represent the data well. Feature extraction : compresses the existing features into a smaller set of important low-dimensional features 👉 not mere compression, but features that can more implicitly and better explain .. 2022. 4. 24.
[05. Regression] Linear regression, polynomial regression, regularized regression, logistic regression, regression trees 👀 Regression analysis - A statistical technique based on the tendency of data values to return to a fixed value such as the mean - A collective term for techniques that model the relationship between several independent variables and one dependent variable. - The dependent variable is numeric (continuous). - The key to machine-learning regression is finding the 'optimal regression coefficients'! By number of independent variables: one : simple regression, several : multiple regression. By how the regression coefficients combine: linear : linear regression, nonlinear : nonlinear regression - Objective of regression analysis : find the regression coefficients (w) that minimize the RSS (residual sum of squares) 03. Gradient descent 📌 Overview 💡 Gradient descent The core technique that made possible the idea of an algorithm learning by itself from data. Through incremental, repeated computation it updates the W parameter values until the error value is minimized.. 2022. 3. 25.
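As a rough illustration of the gradient descent idea in the excerpt above (repeatedly updating the w parameters to reduce the RSS), here is a small sketch for simple linear regression y ≈ w0 + w1·x. The names `rss` and `gradient_descent`, the learning rate, and the iteration count are illustrative assumptions, and the gradient is taken of RSS divided by n so that the step size does not depend on the sample count.

```python
def rss(w0, w1, xs, ys):
    """Residual sum of squares for the line y = w0 + w1 * x."""
    return sum((y - (w0 + w1 * x)) ** 2 for x, y in zip(xs, ys))


def gradient_descent(xs, ys, lr=0.01, n_iter=5000):
    """Fit w0 (intercept) and w1 (slope) by minimizing RSS / n."""
    n = len(xs)
    w0, w1 = 0.0, 0.0  # assumption: start from zero
    for _ in range(n_iter):
        # Residuals under the current parameters.
        errors = [y - (w0 + w1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of RSS / n with respect to w0 and w1.
        grad_w0 = -2.0 / n * sum(errors)
        grad_w1 = -2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient.
        w0 -= lr * grad_w0
        w1 -= lr * grad_w1
    return w0, w1


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0, 4.0]
    ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x, no noise
    w0, w1 = gradient_descent(xs, ys)
    print(round(w0, 4), round(w1, 4), rss(w0, w1, xs, ys))
```

On this noise-free toy data the estimates approach w0 = 1 and w1 = 2; a learning rate that is too large would make the updates diverge instead.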
[04. Classification] LightGBM, stacking ensemble, Catboost 07. LightGBM 📌 Overview 💡 LightGBM Its prediction performance is similar to XGBoost, but training takes far less time and it offers a variety of features. It converts categorical features automatically (no one-hot encoding needed) and finds optimal splits for them. It uses leaf-wise tree growth rather than level-wise (balanced) tree growth. However, it overfits easily when applied to small data sets (10,000 rows or fewer). Leaf-wise tree growth : instead of keeping the tree balanced, it keeps splitting the leaf node with the largest loss value. As training iterates, it can end up with a lower prediction error than level-wise growth. 📌 Hyperparameters LightGBM's parameters are very similar to XGBoost's, but note that leaf nodes.. 2022. 3. 20.