1. Data Cleaning
① Data quality: why we preprocess
• Accuracy, Completeness, Consistency, Timeliness, Believability, Interpretability

② Data Cleansing
• Data in the real world is dirty
• Incomplete, Noisy, Inconsistent, Intentional
③ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
ex. missing data
▪ A sensor device may have failed, the information may never have been collected, or an attribute may be impossible to collect for every case (e.g., annual income cannot be collected for children) …
▪ missing values may need to be inferred
• Missing data handling
▪ 1. Ignoring the tuple: simply dropping tuples is reasonable only when the missing values are very few; otherwise it may not be appropriate
▪ 2. Filling in the missing value manually: tedious, infeasible
▪ 3. Filling it in automatically: with a global constant (e.g., "unknown"), the attribute mean, or the most probable value inferred by a method such as a Bayesian formula or decision tree
④ Noisy: containing noise, errors, or outliers
ex. salary = '-10'
• Noise: random error or variance
• Noise can arise from faulty data collection instruments, problems during data transmission, or technology limitations (e.g., GPS error of 10–20 meters).
• Noisy data handling
▪ 1. Binning: sort the data and partition it into bins, then smooth each bin with its mean, median, or boundaries (see the sketch after this list)

▪ 2. Regression: fit the data to a regression model to smooth out the noise
▪ 3. Clustering: detect outliers and remove them
▪ 4. Combined computer and human inspection: detect suspicious values with the computer, then have a human check them
• Noisy-data handling example: map matching (snapping noisy GPS points onto the road network)
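A minimal Python sketch of item 1 above, smoothing by bin means with equal-frequency bins; the function name and sample values are made up for illustration.

```python
# Smoothing by bin means: sort, split into equal-frequency bins,
# replace every value with its bin's mean.
def smooth_by_bin_means(values, n_bins):
    s = sorted(values)
    size = len(s) // n_bins
    smoothed = []
    for i in range(n_bins):
        # the last bin absorbs any leftover values
        bin_vals = s[i * size:] if i == n_bins - 1 else s[i * size:(i + 1) * size]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
# -> [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```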

⑤ Inconsistent: containing discrepancies in codes or names
ex. age = 42 but birthday = 03/08/2010
ex. ratings used to be 1, 2, 3 and are now A, B, C
⑥ Intentional: disguised missing data
ex. users leaving the birthday or region field at its default value instead of entering the real one
2. Data Integration
• Combining data from multiple sources into a coherent data store, as in data warehouses

① Schema Integration
• Integrating data from multiple sources with heterogeneous schemas
ex. A.cust-id = B.cust-# ?
② Entity resolution
• Identifying the matching records from multiple sources

③ Redundancy: unnecessary duplication
• Handling
▪ Redundancy occurs frequently when multiple databases are combined.
▪ Object identification: the same attribute may have different names in different databases.
▪ Derivable data: one attribute may be derived from attributes in another table; for example, annual revenue may be a value computed from other attributes.
▪ Redundant attributes can be detected by correlation analysis or covariance analysis!
▪ Chi-square test, Pearson correlation coefficient (PCC), covariance
• Pearson product-moment correlation coefficient (PCC)
▪ measures the correlation between X and Y (linear dependence)
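For reference, the standard sample form of the coefficient (added here for completeness) is:

$$ r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

r > 0 indicates positive correlation, r < 0 negative correlation, and r ≈ 0 no linear correlation.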


④ Inconsistency: finding the true value of an attribute
ex. the price of the same book may differ slightly from store to store
3. Data Reduction
• Databases and data warehouses can hold petabytes of data, so running complex data analysis on the full data set may take a very long time.
• Data reduction means obtaining a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results.
① Strategies
▪ 1. Dimensionality reduction
‣ Wavelet transform
‣ PCA
‣ Feature subset selection, feature creation
▪ 2. Numerosity reduction (some simply call it data reduction)
‣ Regression
‣ Histograms, clustering, sampling
‣ Data cube aggregation
▪ 3. Data compression
② Motivation of dimensionality reduction
• Curse of dimensionality
▪ As the number of dimensions increases, the data become very sparse. Sparse data severely degrades the results of clustering and outlier detection.

▪ Density and distance between data points, concepts that are critical to clustering and outlier detection, become less meaningful.

▪ With few dimensions the distances between points differ considerably, but as dimensionality increases those differences shrink: less meaningful.

▪ As dimensionality grows, all data points are pushed toward the corners of the space, so measuring distances between points becomes meaningless (there is no middle region).
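This concentration effect is easy to see empirically. The small numpy experiment below (my own illustration, not from the lecture) measures how the spread of pairwise distances collapses as the dimension d grows:

```python
# Distance concentration: as d grows, the gap between the farthest
# and nearest pairwise distances shrinks relative to the nearest one.
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((100, d))                # 100 uniform random points
    diffs = X[:, None, :] - X[None, :, :]   # all pairwise differences
    dist = np.sqrt((diffs ** 2).sum(-1))
    dist = dist[np.triu_indices(100, k=1)]  # keep each pair once
    ratio = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  (max-min)/min = {ratio:.2f}")
# The ratio drops toward 0 as d grows: all points become roughly
# equidistant, so distance loses its discriminative power.
```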
③ Dimensionality Reduction
• Purpose
- Avoid the curse of dimensionality
- Reduce the time and memory complexity of data mining algorithms
- Make the data easier to visualize
- Remove irrelevant features and noise
• Techniques
- Wavelet transform
- PCA
- Feature selection
• a. PCA
- Uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
- The number of principal components is no greater than the number of original attributes
- The first principal component has the largest possible variance
- Each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components

▪ Each eigenvector gives a direction, and its eigenvalue gives the variance of the data along that direction.

▪ Keeping only, say, the first and second principal components maps the data into a new, lower-dimensional space.
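A minimal numpy sketch of this procedure (my own illustration; a library such as scikit-learn's PCA class performs the same steps): eigendecompose the covariance matrix and project onto the top-k eigenvectors.

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the k principal components with the
    largest eigenvalues (i.e., the largest variances)."""
    Xc = X - X.mean(axis=0)            # 1. center each attribute
    cov = np.cov(Xc, rowvar=False)     # 2. covariance matrix
    vals, vecs = np.linalg.eigh(cov)   # 3. eigenvalues, ascending order
    order = np.argsort(vals)[::-1]     # 4. sort descending by variance
    W = vecs[:, order[:k]]             # 5. top-k eigenvectors (directions)
    return Xc @ W                      # 6. project onto the new axes

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))          # 100 observations, 5 attributes
X2 = pca(X, k=2)                       # reduced to 2 principal components
print(X2.shape)                        # (100, 2)
```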
• b. Feature selection
- Removing redundant and irrelevant features
- Redundant features: duplicating much or all of the information
ex. the purchase price of a product and the amount of sales tax paid
- Irrelevant features: containing no information that is useful for the data mining task
ex. students' IDs are often irrelevant to the task of predicting students' GPA
- Heuristics in attribute subset selection: stepwise forward selection (pick the best), stepwise backward elimination (eliminate the worst), a combination of the two, and decision tree induction (see the sketch below)
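A minimal sketch of stepwise forward selection. The scoring function `evaluate(features) -> float` is a placeholder of my own (e.g., cross-validated accuracy of a model trained on those features), not part of any specific library.

```python
def forward_selection(all_features, evaluate, k):
    """Greedily add the feature that improves the score most, k times.
    Assumes k <= len(all_features) and higher scores are better."""
    selected = []
    for _ in range(k):
        remaining = [f for f in all_features if f not in selected]
        # Score each candidate set "selected + one new feature".
        best = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
    return selected
```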

• c. Discrete Wavelet Transform
- Widely used for signal processing
- Wavelet: a mathematical function used to divide a given function or continuous-time signal into different scale components

- One-dimensional Haar wavelet transform

▪ The original data can be recovered from the wavelet transform.

▪ Using the stored coefficients, we can reconstruct the original values.

▪ Advantage: a large number of the detail coefficients turn out to be very small in magnitude. Truncating, or removing, these small coefficients introduces only small errors in the reconstructed data, giving a form of "lossy" compression.
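A minimal sketch of one level of the 1-D Haar transform and its inverse, using the (average, half-difference) convention and assuming the input length is a power of two; zeroing small difference coefficients before inverting gives the lossy compression described above.

```python
def haar_step(data):
    """One level: pairwise averages followed by pairwise half-differences."""
    avgs = [(a + b) / 2 for a, b in zip(data[0::2], data[1::2])]
    diffs = [(a - b) / 2 for a, b in zip(data[0::2], data[1::2])]
    return avgs + diffs

def haar_step_inverse(coeffs):
    """Recover each original pair: a = avg + diff, b = avg - diff."""
    half = len(coeffs) // 2
    out = []
    for avg, diff in zip(coeffs[:half], coeffs[half:]):
        out += [avg + diff, avg - diff]
    return out

data = [8, 6, 2, 4]
coeffs = haar_step(data)            # [7.0, 3.0, 1.0, -1.0]
print(haar_step_inverse(coeffs))    # [8.0, 6.0, 2.0, 4.0]
```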
④ Numerosity Reduction
• Reducing the data volume by choosing alternative, smaller forms of data representation
• Parametric methods (e.g., regression)

▪ Assume a model and store only the model parameters instead of the actual data (see the sketch after this list)
• Non-parametric methods
▪ Do not assume models
▪ histograms, clustering, sampling, …
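A tiny illustration of the parametric idea (my own example, not from the lecture): fit y ≈ ax + b to 100 noisy points, then keep only the two fitted parameters.

```python
import numpy as np

x = np.arange(100, dtype=float)
y = 3.0 * x + 7.0 + np.random.default_rng(2).normal(0, 0.5, 100)

a, b = np.polyfit(x, y, deg=1)   # store just these two numbers
print(round(a, 2), round(b, 2))  # close to 3.0 and 7.0
```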
• a. Sampling
- Selection of a subset of individual observations within a population of individuals, intended to yield some knowledge about the population
- Allows a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
- Choosing a representative subset
- Simple random sampling may perform poorly if the data are skewed.
- Types of sampling (see the sketch after this list)
‣ Simple random sampling: an equal probability of selecting any particular item
‣ Sampling without replacement
‣ Sampling with replacement: the same object can be picked more than once
‣ Stratified sampling: the population is divided into non-overlapping groups (i.e., strata)

- Sampling size: sample as little as possible, but enough to preserve the original data distribution!
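A minimal sketch of stratified sampling with Python's standard library: the same fraction is drawn from each stratum, so skewed group sizes are preserved. The record layout and group names are made up for illustration.

```python
import random

def stratified_sample(records, key, fraction, seed=0):
    """records: list of dicts; key: the field that defines the strata."""
    random.seed(seed)
    strata = {}
    for r in records:
        strata.setdefault(r[key], []).append(r)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(random.sample(group, n))    # without replacement
    return sample

data = [{"grade": g, "id": i} for i, g in
        enumerate(["A"] * 90 + ["B"] * 9 + ["C"] * 1)]
print(len(stratified_sample(data, "grade", 0.1)))  # 9 + 1 + 1 = 11 records
```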

• b. Data cube aggregation
- The lowest level of a data cube (the base cuboid)
- Holds the aggregated data for an individual entity of interest
ex. the amount of sales per day
- Multiple levels of aggregation in data cubes: day → week → month → quarter → year
- Use the smallest representation that is enough to solve the task
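A tiny illustration of climbing one aggregation level (day → month) with plain dictionaries; the sales figures are made up.

```python
daily_sales = {"2023-03-01": 120, "2023-03-02": 80, "2023-04-01": 50}

monthly_sales = {}
for day, amount in daily_sales.items():
    month = day[:7]                      # "YYYY-MM"
    monthly_sales[month] = monthly_sales.get(month, 0) + amount

print(monthly_sales)                     # {'2023-03': 200, '2023-04': 50}
```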
⑤ Data compression
• Data compression is the process of encoding information using fewer bits than the original representation would use
• It is also important for improving query performance
• Almost all data warehousing systems compress data when loading it
• a. Run-length encoding
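A minimal run-length encoding sketch (my own illustration): consecutive repeats of a value are stored as (value, run length) pairs, which pays off for sorted or low-cardinality columns.

```python
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

def rle_decode(runs):
    return [v for v, n in runs for _ in range(n)]

col = ["A", "A", "A", "B", "B", "C"]
print(rle_encode(col))                     # [['A', 3], ['B', 2], ['C', 1]]
print(rle_decode(rle_encode(col)) == col)  # True
```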

• b. Dictionary encoding
- For each unique value, a separate dictionary entry is created; the index of the dictionary entry is stored instead of the value
- e.g., with four distinct values, only two bits are required for each stored value
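A minimal dictionary-encoding sketch (my own illustration): each distinct value gets one dictionary entry, and the column stores small integer indexes; with 4 distinct values each index fits in 2 bits.

```python
def dict_encode(values):
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)  # next free index
        codes.append(dictionary[v])
    return dictionary, codes

col = ["red", "blue", "red", "green", "red", "white"]
dictionary, codes = dict_encode(col)
print(dictionary)   # {'red': 0, 'blue': 1, 'green': 2, 'white': 3}
print(codes)        # [0, 1, 0, 2, 0, 3]
```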

4. Data Transformation and Discretization
• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
• Methods
a. Normalization: min-max, z-score (see the sketch below)
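Minimal sketches of the two normalizations named above; the sample numbers are made up for illustration.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly rescale v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Center by the attribute mean and scale by its standard deviation."""
    return (v - mean) / std

# e.g., map an income of 73600 from [12000, 98000] onto [0, 1]
print(round(min_max(73600, 12000, 98000), 3))   # 0.716
# e.g., income 73600 with mean 54000 and standard deviation 16000
print(round(z_score(73600, 54000, 16000), 3))   # 1.225
```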

b. Discretization: concept hierarchy climbing
▪ Reducing the number of values for a given continuous attribute by dividing the range of the attribute into intervals and replacing actual data values with the interval labels
▪ Purposes
‣ To find informative cut-off points in the data
‣ To enable the use of some learning algorithms (some learning algorithms can accept only discrete variables)
‣ To reduce the data size


▪ Concept hierarchy generation
‣ A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
‣ Facilitates drilling and rolling in data warehouses
‣ Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
‣ Can be explicitly specified by domain experts and/or data warehouse designers
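As a concrete sketch of climbing one level of such a hierarchy, numeric ages can be replaced by the higher-level labels mentioned above; the cut-off ages here are my own assumption for illustration.

```python
def age_concept(age):
    # Hypothetical cut-offs: <20 youth, 20-59 adult, 60+ senior.
    if age < 20:
        return "youth"
    elif age < 60:
        return "adult"
    return "senior"

print([age_concept(a) for a in [8, 34, 71]])  # ['youth', 'adult', 'senior']
```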
