
๋ฐ์ดํ„ฐ๋งˆ์ด๋‹ Preprocessing โ‘ข

by isdawell 2023. 3. 29.

 

1. Data Cleaning 


 

① Data quality → why we preprocess

 

•  Accuracy, Completeness, Consistency, Timeliness, Believability, Interpretability

 

 

 

② Data Cleansing

 

•  Data in the real world is dirty

•  Incomplete, Noisy, Inconsistent, Intentional

 

 

③ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data

 

    ex. missing data

↪ e.g., a sensor broke down, the information was hard to collect, or the attribute simply cannot be collected in every case (annual income, for instance, cannot be collected for children) ⋯

↪ need to be inferred

 

•  Missing data Handling

↪ 1. Ignoring the tuple: simple, but unless the tuples with missing values are very few, just dropping them may not be appropriate

↪ 2. Filling in the missing value manually: tedious and often infeasible

↪ 3. Filling it in automatically: with a global constant (ex. 'unknown'), the attribute mean, or the most probable value inferred with a method such as a Bayesian formula or a decision tree (a small sketch follows below)
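
A minimal pandas sketch of strategies 1 and 3 (the toy columns are hypothetical, not from the original post):

```python
import pandas as pd

# Toy data with missing values
df = pd.DataFrame({"age": [25, None, 31, 40],
                   "city": ["Seoul", None, "Busan", "Seoul"]})

dropped = df.dropna()                            # 1. ignore the tuple
df["city"] = df["city"].fillna("unknown")        # 3. global constant
df["age"] = df["age"].fillna(df["age"].mean())   # 3. attribute mean
```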

 

 

 

④ Noisy: containing noise, errors, or outliers

    ex. salary = '-10' 

 

•  Noise: random error or variance

•  Noise can arise from faulty data collection instruments, problems in data transmission, or technology limitations (ex. GPS error: 10–20 meters)

 

•  Noisy data Handling

↪ 1. Binning: sort the data, partition it into bins, then smooth each bin by its mean, median, or boundaries (see the sketch after this list)

↪ 2. Regression: fit the data to a regression model to smooth out the noise

↪ 3. Clustering: detect outliers and remove them

↪ 4. Combined computer and human inspection: the computer detects suspicious values, and a human checks them
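
A small NumPy sketch of equal-frequency binning with smoothing by bin means and by bin boundaries (the values are illustrative):

```python
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = data.reshape(4, 3)                            # equal-frequency bins of size 3

smoothed_by_mean = np.repeat(bins.mean(axis=1), 3)   # smoothing by bin means

# Smoothing by bin boundaries: snap each value to the nearer bin edge
lo, hi = bins[:, :1], bins[:, -1:]
smoothed_by_boundary = np.where(bins - lo < hi - bins, lo, hi)
```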

 

 

•  Noise handling example: Map Matching (aligning noisy GPS points to the road network)

⑤ Inconsistent: containing discrepancies in codes or names

    ex. age = 42, birthday = 03/08/2010

    ex. rating was '1, 2, 3', now rating is 'A, B, C'

 

 

⑥ Intentional: disguised missing data

    ex. leaving the birthday or region field at its default value instead of entering the real value

2. Data Integration 


 

•  Combining data from multiple sources into a coherent data store, as in data warehouses

 

 

 

① Schema Integration

 

•  Integrating data from multiple sources with heterogeneous schemas

   ex. A.cust-id  =  B.cust-# ? 

 

 

② Entity resolution

 

•  Identifying the matching records from multiple sources

 

 

③ Redundancy: unnecessary duplication

 

•  Handling

↪ Redundancy often occurs when integrating multiple databases

↪ Object identification: the same attribute may have different names in different databases

↪ Derivable data: one attribute may be derived from attributes in another table; for instance, annual revenue may be a value computed from other attributes

↪ Redundancy can be detected by correlation analysis or covariance analysis!

    ↪ Chi-square test, PCC, Covariance

 

 

•  Pearson product-moment correlation coefficient (PCC)

↪ measures the correlation between X and Y (linear dependence)
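
For reference, the usual sample form of the coefficient over $n$ observations (a value near ±1 indicates strong linear dependence, hinting that one of the two attributes may be redundant):

$$ r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n\,\sigma_X\,\sigma_Y}, \qquad -1 \le r_{X,Y} \le 1 $$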

 

 

 

④ Inconsistency: finding the true value of an attribute

    ex. ์ƒ์ ๋งˆ๋‹ค ๋™์ผํ•œ ์ฑ…์˜ ๊ฐ€๊ฒฉ์ด ์กฐ๊ธˆ์”ฉ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ 

 

 

 

 

 

 

 

3. Data Reduction


 

•  ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋‚˜ ๋ฐ์ดํ„ฐ์›จ์–ดํ•˜์šฐ์Šค๋Š” petabyte ๋‹จ์œ„์˜ ๋ฐ์ดํ„ฐ๋“ค์„ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ ๋ณต์žกํ•œ ๋ฐ์ดํ„ฐ ๋ถ„์„์„ ์ ์šฉํ•  ๊ฒฝ์šฐ ์‹œ๊ฐ„์ด ์˜ค๋ž˜๊ฑธ๋ฆด ์ˆ˜ ์žˆ๋‹ค. 

•  data reduction ์€ ๋™์ผํ•œ ๋ถ„์„ ๊ฒฐ๊ณผ๋ฅผ ์ œ๊ณตํ•˜์ง€๋งŒ (produces the same analytical results)  ๋ฐ์ดํ„ฐ์…‹์„  ๋” ์ž‘์€ volume ์œผ๋กœ ์ค„์—ฌ์„œ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•œ๋‹ค. 

 

 

① Strategies

 

↪ 1. Dimensionality reduction

    ▸ Wavelet transform

    ▸ PCA

    ▸ Feature subset selection, feature creation

↪ 2. Numerosity reduction (some simply call it data reduction)

    ▸ Regression

    ▸ Histograms, clustering, sampling

    ▸ Data cube aggregation

↪ 3. Data compression

 

 

 

 

② Motivation for dimensionality reduction

 

•  ์ฐจ์›์˜ ์ €์ฃผ curse of dimensionality 

โ†ช ์ฐจ์›์ด ์ฆ๊ฐ€ํ• ์ˆ˜๋ก data ๋Š” ๋งค์šฐ sparse ํ•ด์ง„๋‹ค. ๋ฐ์ดํ„ฐ๊ฐ€ sparse ํ•ด์ง€๋ฉด ํด๋Ÿฌ์Šคํ„ฐ๋ง์ด๋‚˜ ์ด์ƒ์น˜ ๊ฐ์ง€๋ฅผ ์ง„ํ–‰ํ•  ๋•Œ ๊ฒฐ๊ณผ๊ฐ€ ๋งค์šฐ ์•ˆ์ข‹์•„์ง„๋‹ค. 

 

 

โ†ช ํด๋Ÿฌ์Šคํ„ฐ๋ง ํ˜น์€ ์ด์ƒ์น˜ ํƒ์ง€ ๋ถ„์„์—์„œ ๋งค์šฐ ์ค‘์š”ํ•œ ๊ฐœ๋…์ธ ๊ฐ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์— ๋Œ€ํ•œ ๋ฐ€๋„๋‚˜ ๊ฑฐ๋ฆฌ์— ๋Œ€ํ•œ ์ •์˜๊ฐ€ ๋ฌด์˜๋ฏธํ•ด์งˆ ์ˆ˜ ์žˆ๋‹ค. 

โ†ช ์ฐจ์›์˜ ์ˆ˜๊ฐ€ ์ ์„ ๋•Œ๋Š” ๊ฑฐ๋ฆฌ ์ฐจ์ด๊ฐ€ ํฌ์ง€๋งŒ, ์ฐจ์›์ด ์ฆ๊ฐ€ํ•˜๋ฉด ๊ฑฐ๋ฆฌ ์ฐจ์ด๊ฐ€ ์ž‘์•„์ง„๋‹ค : less meaningful 

 

โ†ช ์ฐจ์›์ด ์ปค์งˆ์ˆ˜๋ก ๋ชจ๋“  ๋ฐ์ดํ„ฐํฌ์ธํŠธ๋“ค์ด ๊ณต๊ฐ„ ์ƒ์—์„œ ์ฝ”๋„ˆ์ชฝ์œผ๋กœ ๋ชฐ๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ์ดํ„ฐ๊ฐ„์˜ ๊ฑฐ๋ฆฌ๋ฅผ ์ธก์ •ํ•˜๋Š”๊ฒŒ ๋ฌด์˜๋ฏธํ•ด์ง„๋‹ค. (there is no middle resion) 

 

 

 

 

③ Dimensionality Reduction

 

•  Purpose

   - Avoid the curse of dimensionality

   - Reduce the time and memory complexity of data mining algorithms

   - Make the data easier to visualize

   - Remove irrelevant attributes and noise

 

•  ๋ถ„์„๊ธฐ์ˆ  

   - Wavvelet transform 

   - PCA 

   - Feature selection 

 

 

•  a. PCA

   - uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components

   - the number of principal components is at most the number of original attributes

   - the first principal component has the largest possible variance

   - each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components

 

 

↪   an eigenvector gives a direction, and its eigenvalue gives the variance of the data along that direction

↪   keeping only the first and second principal components, the data is mapped onto a new two-dimensional space
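
A compact NumPy sketch of this procedure via covariance eigendecomposition (toy data; the two-component projection mirrors the description above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy data: 200 samples, 5 attributes
Xc = X - X.mean(axis=0)                  # center each attribute

cov = np.cov(Xc, rowvar=False)           # 5x5 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenpairs (ascending eigenvalues)

order = np.argsort(eigvals)[::-1]        # largest variance first
components = eigvecs[:, order[:2]]       # top-2 principal directions
projected = Xc @ components              # map data onto the new 2-D space
```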

 

 

 

•  b. Feature selection

   - removing redundant and irrelevant features

   - Redundant features: duplicating much or all of the information

      ex. the purchase price of a product and the amount of sales tax paid

   - Irrelevant features: containing no information that is useful for the data mining task

      ex. students' ID is often irrelevant to the task of predicting students' GPA

   - Heuristics in attribute subset selection: stepwise forward selection (pick the best), stepwise backward elimination (eliminate the worst), a combination of the two, decision tree induction (a forward-selection sketch follows below)
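
One way to run stepwise forward selection, using scikit-learn (illustrative; any estimator and dataset could be substituted):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Greedily add, at each round, the feature that improves the model the most
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000), n_features_to_select=2, direction="forward"
)
selector.fit(X, y)
print(selector.get_support())  # boolean mask over the original features
```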

 

 

•  c. Discrete Wavelet Transform

   - widely used for signal processing

   - wavelet: a mathematical function used to divide a given function or continuous-time signal into different scale components

 

 

   - One-dimensional Haar wavelet transform

 

↪ the original data can be reconstructed from the wavelet transform

↪ using the stored averages and detail coefficients we can reconstruct the original values

↪ advantage: a large number of the detail coefficients turn out to be very small in magnitude; truncating (removing) these small coefficients introduces only small errors in the reconstructed data, giving a form of "lossy" compression (see the sketch below)
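
A minimal sketch of one level of the 1-D Haar transform and its exact reconstruction (values illustrative):

```python
import numpy as np

def haar_step(x):
    """One Haar level: pairwise averages plus detail (difference) coefficients."""
    pairs = x.reshape(-1, 2)
    avg = pairs.mean(axis=1)                  # coarse approximation
    detail = (pairs[:, 0] - pairs[:, 1]) / 2  # detail coefficients
    return avg, detail

x = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
avg, detail = haar_step(x)

# Exact reconstruction: pair i is (avg[i] + detail[i], avg[i] - detail[i]);
# zeroing the tiny detail coefficients instead gives the "lossy" compression above
rebuilt = np.column_stack([avg + detail, avg - detail]).ravel()
assert np.allclose(rebuilt, x)
```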

 

 

 

 

 

④ Numerosity Reduction

 

•  Reducing the data volume by choosing alternative, smaller forms of data representation

 

•  Parametric methods (e.g., regression)

↪ fit a model to the data and store only the model parameters rather than the data itself

•  Non-parametric methods

  ↪ do not assume models

  ↪ histograms, clustering, sampling, …
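
A tiny illustration of the parametric idea (assumed linear toy data): ten thousand points collapse to two stored parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.random(10_000)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=x.size)  # noisy linear data

slope, intercept = np.polyfit(x, y, deg=1)  # store 2 numbers instead of 10,000
print(round(slope, 2), round(intercept, 2)) # ~3.0, ~2.0
```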

 

•  a. Sampling

  - selecting a subset of individual observations within a population, intended to yield some knowledge about the population

  - lets a mining algorithm run in complexity that is potentially sub-linear in the size of the data

  - the key is choosing a representative subset

  - with simple random sampling, performance can be poor if the data is skewed

 

  - Types of Sampling

     ▸ Simple random sampling: equal probability of selecting any particular item

     ▸ Sampling without replacement

     ▸ Sampling with replacement: the same object can be picked more than once

     ▸ Stratified sampling: the population is divided into non-overlapping groups (i.e., strata) and items are drawn from each stratum

 

 

  - sampling size: sample as little as possible, but enough to preserve the original data distribution!
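
A short NumPy sketch contrasting simple random and stratified sampling on skewed labels (hypothetical 95/5 population):

```python
import numpy as np

rng = np.random.default_rng(42)
labels = np.array([0] * 950 + [1] * 50)                # skewed: 5% minority class

simple = rng.choice(labels, size=100, replace=False)   # simple random sampling
print("minority share, simple random:", simple.mean()) # may drift from 5%

# Stratified sampling: draw from each stratum separately to keep the 95/5 split
idx0, idx1 = np.where(labels == 0)[0], np.where(labels == 1)[0]
strata = np.concatenate([rng.choice(idx0, 95, replace=False),
                         rng.choice(idx1, 5, replace=False)])
print("minority share, stratified:", labels[strata].mean())  # exactly 0.05
```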

 

•  b. Data cube aggregation

   - the lowest level of a data cube (the base cuboid)

   - holds the aggregated data for an individual entity of interest

     ex. the amount of sales per day

   - multiple levels of aggregation in data cubes: day → week → month → quarter → year

   - use the smallest representation that is enough to solve the task
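
A toy pandas example of climbing one aggregation level, from daily sales to monthly totals (hypothetical data):

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=90, freq="D"),
    "amount": range(90),
})

# Aggregate daily amounts up to monthly totals: a much smaller representation
monthly = sales.resample("MS", on="date")["amount"].sum()
print(monthly)
```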

 

 

 

⑤ Data compression

 

•  Data compression is the process of encoding information using fewer bits than the original representation would use

•  it is also important for improving query performance

•  almost all data warehousing systems compress data when loading it

 

•  a. Run-length encoding: store each run of repeated values as a (value, run length) pair
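
A minimal run-length encoder (illustrative):

```python
from itertools import groupby

def rle_encode(values):
    # Store each run as (value, run_length) instead of repeating the value
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]

print(rle_encode("AAAABBBCCD"))  # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
```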

 

 

•  b. Dictionary encoding

   - for each unique value, a separate dictionary entry is created; the index of the dictionary entry is stored instead of the value

   - with few distinct values the indexes are tiny; e.g., with four distinct values only two bits are required for each stored index
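
A minimal dictionary encoder along the same lines (illustrative):

```python
def dict_encode(values):
    # Map each distinct value to a small integer index
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

dictionary, codes = dict_encode(["red", "blue", "red", "green", "blue"])
print(dictionary)  # {'red': 0, 'blue': 1, 'green': 2}
print(codes)       # [0, 1, 0, 2, 1] -- two bits per code suffice here
```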

 

 

 

 

4. Data Transformation and Discretization


 

•  A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values

 

•  Methods

 

a. Normalization: min-max, z-score
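
The two standard formulas, mapping a value $v$ of attribute $A$ to $v'$ (for min-max, onto a new range $[\text{new\_min}, \text{new\_max}]$):

$$ v' = \frac{v - \min_A}{\max_A - \min_A}\,(\text{new\_max} - \text{new\_min}) + \text{new\_min} \qquad\qquad v' = \frac{v - \mu_A}{\sigma_A} $$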

 

 

 

b. Discretization: concept hierarchy climbing

↪ Reducing the number of values for a given continuous attribute by dividing the range of the attribute into intervals and replacing actual data values with the interval labels (a small pandas sketch follows the list below)

↪ Purposes

    ▸ To find informative cut-off points in the data

    ▸ To enable the use of some learning algorithms (some learning algorithms can accept only discrete variables)

    ▸ To reduce the data size
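
A quick pandas sketch of interval-based discretization (hypothetical ages; equal-width via cut, equal-frequency via qcut):

```python
import pandas as pd

ages = pd.Series([3, 15, 22, 37, 45, 68, 81])

# Equal-width intervals: replace raw values with interval labels
print(pd.cut(ages, bins=3, labels=["low", "mid", "high"]).tolist())

# Equal-frequency (quantile) intervals
print(pd.qcut(ages, q=3, labels=["young", "adult", "senior"]).tolist())
```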

 

 

 

↪ Concept Hierarchy Generation

    ▸ a concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse

    ▸ facilitates drilling and rolling in data warehouses

    ▸ recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)

    ▸ can be explicitly specified by domain experts and/or data warehouse designers

 

 

 

 

 

 
