1. Data Cleaning
① Data quality → why we preprocess
• Accuracy, Completeness, Consistency, Timeliness, Believability, Interpretability
② Data Cleansing
• Data in the real world is dirty
• Incomplete, Noisy, Inconsistent, Intentional
③ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
ex. missing data
▪ e.g., a sensor device broke down, the information was not easy to collect, or the attribute simply cannot be collected for every case (such as annual income for children)
▪ such values need to be inferred
• Missing data handling
▪ 1. Ignoring the tuple: simply dropping the tuple may not be appropriate unless the tuples with missing values are very few
▪ 2. Filling in the missing value manually: tedious, often infeasible
▪ 3. Filling it in automatically: with a global constant (e.g., "unknown"), the attribute mean, or the most probable value inferred using, e.g., a Bayesian formula or a decision tree (a sketch follows below)
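A minimal sketch of the automatic options in pandas (hypothetical toy columns; the attribute mean for a numeric attribute, a global constant for a categorical one):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "region": ["KR", None, "US", "US", None],
})

df["age"] = df["age"].fillna(df["age"].mean())   # numeric: fill with attribute mean
df["region"] = df["region"].fillna("unknown")    # categorical: fill with a global constant
print(df)
```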
④ Noisy: containing noise, errors, or outliers
ex. salary = '-10'
• Noise: random error or variance
• Noise can come from errors in data collection devices, problems during data transmission, or technology limitations (e.g., GPS error of 10–20 meters)
• Noisy data handling
▪ 1. Binning: sort the data, partition it into bins, then smooth by bin means, bin medians, or bin boundaries (sketch after this list)
▪ 2. Regression: fit the data to a regression function and smooth out the noise with the fitted values
▪ 3. Clustering: detect outliers and remove them
▪ 4. Combined computer and human inspection: detect suspicious values automatically, then have a human check them
• Noisy Handling EX: Map Matching
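A minimal sketch of binning (option 1 above), smoothing by bin means over equal-frequency bins; the price list is a common textbook toy example:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Sort, split into equal-frequency bins, replace each value by its bin mean."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    smoothed = np.empty(len(values), dtype=float)
    for bin_idx in np.array_split(order, n_bins):   # indices belonging to one bin
        smoothed[bin_idx] = values[bin_idx].mean()  # smoothing by bin means
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```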
⑤ Inconsistent: containing discrepancies in codes or names
ex. age = 42, birthday = 03/08/2010
ex. was rating 1, 2, 3, now rating A, B, C
⑥ Intentional: disguised missing data
ex. submitting a form with the birthday or region simply left at its default value
2. Data Integration
• Combining data from multiple sources into a coherent data store, as in data warehouses
① Schema Integration
• Integrating data from multiple sources with heterogeneous schemas
ex. A.cust-id = B.cust-# ?
② Entity resolution
• Identifying the matching records from multiple sources
③ Redundancy (unnecessary duplication)
• Handling
▪ Redundancy often occurs when combining multiple databases
▪ Object identification: the same attribute may have different names in different databases
▪ Derivable data: one attribute may be derived from attributes in another table; for instance, annual revenue may be a value computed from other attributes
▪ Redundancy can be detected by correlation analysis or covariance analysis!
▪ Chi-square test, PCC, covariance
• Pearson product-moment correlation coefficient (PCC)
▪ measures the correlation (linear dependence) between X and Y
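For reference, the usual sample form of the coefficient; values near ±1 suggest linear dependence (a redundancy signal), values near 0 suggest no linear correlation:

$$ r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y} $$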
④ Inconsistency: finding the true value of an attribute
ex. the price of the same book may differ slightly from bookstore to bookstore
3. Data Reduction
• Databases and data warehouses can hold petabytes of data, so running complex data analysis on them can take a very long time
• Data reduction means representing the dataset in a much smaller volume while still producing the same (or almost the same) analytical results
① Strategies
▪ 1. Dimensionality reduction
‣ Wavelet transform
‣ PCA
‣ Feature subset selection, feature creation
▪ 2. Numerosity reduction (some simply call it data reduction)
‣ Regression
‣ Histograms, clustering, sampling
‣ Data cube aggregation
▪ 3. Data compression
② Motivation for dimensionality reduction
• Curse of dimensionality
▪ As dimensionality increases, the data becomes very sparse, and sparse data makes clustering and outlier detection results much worse
▪ Density and distance between data points, notions that are critical to clustering and outlier detection, become less meaningful
▪ With few dimensions the distance differences between points are large, but as dimensionality increases the differences shrink: less meaningful
▪ As dimensionality grows, every data point is pushed toward the corners of the space, so measuring distances between points becomes almost meaningless (there is no middle region); the sketch below illustrates this
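A quick way to see this distance concentration (a minimal sketch with uniform random points; exact numbers will vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((1000, d))                    # 1000 uniform points in [0,1]^d
    dist = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from the first point
    # relative contrast (max - min) / min shrinks as d grows
    print(d, round((dist.max() - dist.min()) / dist.min(), 2))
```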
③ Dimensionality Reduction
• Purposes
- avoid the curse of dimensionality
- reduce the time and memory complexity of data mining algorithms
- make the data easier to visualize
- remove irrelevant attributes and noise
• Techniques
- Wavelet transform
- PCA
- Feature selection
• a. PCA
- use orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
- the number of principal components is at most the number of original attributes
- first principal component has the largest possible variance
- each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components
▪ each eigenvector gives a direction, and its eigenvalue gives the magnitude of the data's variance along that direction
▪ keeping only the first and second principal components, the data is mapped into the new two-dimensional space
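A minimal PCA sketch via eigendecomposition of the covariance matrix (numpy only; keeps the top two components as described above):

```python
import numpy as np

def pca(X, k=2):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # largest variance first
    components = eigvecs[:, order]           # each column is one principal component
    return Xc @ components                   # data mapped into the new k-dim space

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, k=2).shape)  # (100, 2)
```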
• b. Feature selection
- removing redundant and irrelevant features
- Redundant features: Duplicating much or all of the information
ex. the purchase price of a product and the amount of sales tax paid
- Irrelevant features: Containing no information that is useful for the data mining task
ex. students' ID is often irrelevant to the task of predicting students' GPA
- Heuristics in attribute subset selection: stepwise forward selection (pick the best), stepwise backward elimination (eliminate the worst), a combination of the two, and decision tree induction; a forward-selection sketch follows
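A sketch of the "pick the best" heuristic, assuming a caller-supplied score function that evaluates a feature subset (e.g., cross-validated accuracy) and returns a baseline for the empty subset:

```python
def forward_selection(features, score, max_k):
    """Stepwise forward selection: greedily add the feature that most improves score.

    score(subset) is assumed to evaluate a list of feature names, e.g. by
    cross-validating a model on those columns; score([]) is the baseline.
    """
    selected, best_score = [], score([])
    while len(selected) < max_k:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:   # no improvement -> stop early
            break
        selected.append(candidate)
        best_score = candidate_score
    return selected
```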
• c. Discrete Wavelet Transform
- widely used for signal processing
- wavelet : mathematical function used to divide a given function or continuous-time signal into different scale components
- One-dimensional Haar wavelet transform
▪ the original values can be reconstructed from the wavelet-transformed coefficients
▪ advantage: a large number of the detail coefficients turn out to be very small in magnitude; truncating or removing these small coefficients introduces only small errors in the reconstructed data, giving a form of "lossy" compression
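A sketch of the full one-dimensional Haar transform by repeated pairwise averaging and differencing (input length assumed to be a power of two; the input below is a common textbook example):

```python
import numpy as np

def haar_transform(x):
    """1-D Haar wavelet transform via repeated pairwise averaging and differencing."""
    x = np.asarray(x, dtype=float)
    coeffs = []
    while len(x) > 1:
        avg  = (x[0::2] + x[1::2]) / 2   # approximation at the next (coarser) level
        diff = (x[0::2] - x[1::2]) / 2   # detail coefficients; many are near zero
        coeffs = list(diff) + coeffs
        x = avg
    return list(x) + coeffs              # [overall average, coarse-to-fine details]

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```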
④ Numerosity Reduction
• Reducing the data volume by choosing alternative, smaller forms of data representation
• Parametric methods (e.g., regression)
▪ store only the model parameters instead of the actual data
• Non-parametric methods
▪ do not assume models
▪ histograms, clustering, sampling, …
• a. sampling
- selection of a subset of individual observations within a population of individuals intended to yield some knowledge about the population
- allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- choosing a representative subset
- with simple random sampling, performance can be poor when the data is skewed
- Types of Sampling
‣ Simple random sampling: equal probability of selecting any particular item
‣ Sampling without replacement
‣ Sampling with replacement: the same object can be picked more than once
‣ Stratified sampling: the population is divided into non-overlapping groups (i.e., strata) and samples are drawn from each (see the sketch after this list)
- sampling size: sample as few points as possible, but enough to preserve the original data distribution!
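A sketch of stratified sampling with pandas (hypothetical label column as the stratum); drawing the same fraction within each stratum preserves the class ratio even when the population is skewed:

```python
import pandas as pd

# Skewed population: 90% class A, 10% class B (hypothetical toy data)
df = pd.DataFrame({
    "label": ["A"] * 90 + ["B"] * 10,
    "value": range(100),
})

# Simple random sampling might miss class B entirely; stratified sampling
# draws 10% within each stratum, so the sample keeps the A:B ratio.
sample = df.groupby("label").sample(frac=0.1, random_state=0)
print(sample["label"].value_counts())  # A: 9, B: 1
```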
• b. Data cube aggregation
- lowest level of a data cube (base cuboid)
- aggregated data for an individual entity of interest
ex. the amount of sales per day
- Multiple levels of aggregation in data cubes : day → week → month → quarter → year
- Using the smallest representation which is enough to solve the task
⑤ Data compression
• Data compression is the process of encoding information using fewer bits than the original representation would use
• It is also important for improving query performance
• Almost all data warehousing systems compress data when loading the data
• a. Run-length encoding
• b. Dictionary encoding
- for each unique value, a separate dictionary entry is created; the index of the dictionary entry is stored instead of the value
- with at most four distinct values, only two bits are required for each value
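Minimal sketches of both encodings (plain Python; the two-bit figure assumes at most four distinct values):

```python
from itertools import groupby

def run_length_encode(values):
    """Replace each run of identical consecutive values with a (value, count) pair."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def dictionary_encode(values):
    """Create one dictionary entry per unique value; store small indexes instead."""
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

colors = ["red", "red", "red", "blue", "blue", "red", "green"]
print(run_length_encode(colors))  # [('red', 3), ('blue', 2), ('red', 1), ('green', 1)]
d, codes = dictionary_encode(colors)
print(d, codes)  # {'red': 0, 'blue': 1, 'green': 2} [0, 0, 0, 1, 1, 0, 2]
# With at most 4 unique values, each code fits in 2 bits.
```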
4. Data Transformation and Discretization
• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
• Methods
a. Normalization: min-max, z-score (sketch at the end of this section)
b. Discretization: concept hierarchy climbing
▪ Reducing the number of values for a given continuous attribute by dividing the range of the attribute into intervals and replacing actual data values with the interval labels
▪ Purposes
‣ To find informative cut-off points in the data
‣ To enable the use of some learning algorithms (some learning algorithms can accept only discrete variables)
‣ To reduce the data size
▪ Concept Hierarchy Generation
‣ A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
‣ facilitates drill-down and roll-up in the data
‣ Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
‣ can be explicitly specified by domain experts and/or data warehouse designers
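A numpy sketch of min-max normalization, z-score normalization, and a simple equal-width discretization (hypothetical salary values):

```python
import numpy as np

x = np.array([12000., 16000., 54000., 73600., 98000.])  # hypothetical salaries

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: v' = (v - mean) / std
zscore = (x - x.mean()) / x.std()

# Equal-width discretization into 3 intervals, replacing values with labels
edges = np.linspace(x.min(), x.max(), 4)   # 3 equal-width bins -> 4 edges
labels = np.digitize(x, edges[1:-1])       # 0 = low, 1 = mid, 2 = high

print(minmax)   # [0.     0.0465 0.4884 0.7163 1.    ] (rounded)
print(zscore)
print(labels)   # [0 0 1 2 2]
```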