1. Data Cleaning
① Data quality → why we preprocess
• Accuracy, Completeness, Consistency, Timeliness, Believability, Interpretability
② Data Cleansing
• Data in the real world is dirty
• Incomplete, Noisy, Inconsistent, Intentional
③ Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
ex. missing data
▪ e.g., a sensor device broke down, the information was not easy to collect, or the attribute simply cannot be collected for every case (such as annual income for children)
▪ such values need to be inferred
• Missing data handling
▪ 1. Ignoring the tuple: simply dropping the tuple may not be appropriate unless the tuples with missing values are very few
▪ 2. Filling in the missing value manually: tedious, often infeasible
▪ 3. Filling it in automatically: with a global constant (e.g., "unknown"), the attribute mean, or the most probable value inferred using, e.g., a Bayesian formula or a decision tree (a sketch follows below)
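A minimal sketch of the automatic options in pandas (hypothetical toy columns; the attribute mean for a numeric attribute, a global constant for a categorical one):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 47, np.nan],
    "region": ["KR", None, "US", "US", None],
})

df["age"] = df["age"].fillna(df["age"].mean())   # numeric: fill with attribute mean
df["region"] = df["region"].fillna("unknown")    # categorical: fill with a global constant
print(df)
```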
④ Noisy: containing noise, errors, or outliers
ex. salary = '-10'
• Noise: random error or variance
• Noise can come from errors in data collection devices, problems during data transmission, or technology limitations (e.g., GPS error of 10–20 meters)
• Noisy data handling
▪ 1. Binning: sort the data, partition it into bins, then smooth by bin means, bin medians, or bin boundaries (sketch after this list)
▪ 2. Regression: fit the data to a regression function and smooth out the noise with the fitted values
▪ 3. Clustering: detect outliers and remove them
▪ 4. Combined computer and human inspection: detect suspicious values automatically, then have a human check them
• Noisy Handling EX: Map Matching
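A minimal sketch of binning (option 1 above), smoothing by bin means over equal-frequency bins; the price list is a common textbook toy example:

```python
import numpy as np

def smooth_by_bin_means(values, n_bins):
    """Sort, split into equal-frequency bins, replace each value by its bin mean."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    smoothed = np.empty(len(values), dtype=float)
    for bin_idx in np.array_split(order, n_bins):   # indices belonging to one bin
        smoothed[bin_idx] = values[bin_idx].mean()  # smoothing by bin means
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```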
⑤ Inconsistent: containing discrepancies in codes or names
ex. age = 42, birthday = 03/08/2010
ex. was rating 1, 2, 3, now rating A, B, C
⑥ Intentional: disguised missing data
ex. submitting a form with the birthday or region simply left at its default value
2. Data Integration
• Combining data from multiple sources into a coherent data store, as in data warehouses
① Schema Integration
• Integrating data from multiple sources with heterogeneous schemas
ex. A.cust-id = B.cust-# ?
② Entity resolution
• Identifying the matching records from multiple sources
③ Redundancy (unnecessary duplication)
• Handling
▪ Redundancy often occurs when combining multiple databases
▪ Object identification: the same attribute may have different names in different databases
▪ Derivable data: one attribute may be derived from attributes in another table; for instance, annual revenue may be a value computed from other attributes
▪ Redundancy can be detected by correlation analysis or covariance analysis!
▪ Chi-square test, PCC, covariance
• Pearson product-moment correlation coefficient (PCC)
▪ measures the correlation (linear dependence) between X and Y
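For reference, the usual sample form of the coefficient; values near ±1 suggest linear dependence (a redundancy signal), values near 0 suggest no linear correlation:

$$ r_{X,Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^{2}}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^{2}}} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X\,\sigma_Y} $$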
④ Inconsistency: finding the true value of an attribute
ex. the price of the same book may differ slightly from bookstore to bookstore
3. Data Reduction
• Databases and data warehouses can hold petabytes of data, so running complex data analysis on them can take a very long time
• Data reduction means representing the dataset in a much smaller volume while still producing the same (or almost the same) analytical results
① Strategies
▪ 1. Dimensionality reduction
‣ Wavelet transform
‣ PCA
‣ Feature subset selection, feature creation
▪ 2. Numerosity reduction (some simply call it data reduction)
‣ Regression
‣ Histograms, clustering, sampling
‣ Data cube aggregation
▪ 3. Data compression
② Motivation for dimensionality reduction
• Curse of dimensionality
▪ As dimensionality increases, the data becomes very sparse, and sparse data makes clustering and outlier detection results much worse
▪ Density and distance between data points, notions that are critical to clustering and outlier detection, become less meaningful
▪ With few dimensions the distance differences between points are large, but as dimensionality increases the differences shrink: less meaningful
▪ As dimensionality grows, every data point is pushed toward the corners of the space, so measuring distances between points becomes almost meaningless (there is no middle region); the sketch below illustrates this
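A quick way to see this distance concentration (a minimal sketch with uniform random points; exact numbers will vary with the seed):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((1000, d))                    # 1000 uniform points in [0,1]^d
    dist = np.linalg.norm(X - X[0], axis=1)[1:]  # distances from the first point
    # relative contrast (max - min) / min shrinks as d grows
    print(d, round((dist.max() - dist.min()) / dist.min(), 2))
```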
③ Dimensionality Reduction
• Purposes
- avoid the curse of dimensionality
- reduce the time and memory complexity of data mining algorithms
- make the data easier to visualize
- remove irrelevant attributes and noise
• Techniques
- Wavelet transform
- PCA
- Feature selection
• a. PCA
- use orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
- the number of principal components is at most the number of original attributes
- first principal component has the largest possible variance
- each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components
▪ each eigenvector gives a direction, and its eigenvalue gives the magnitude of the data's variance along that direction
▪ keeping only the first and second principal components, the data is mapped into the new two-dimensional space
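A minimal PCA sketch via eigendecomposition of the covariance matrix (numpy only; keeps the top two components as described above):

```python
import numpy as np

def pca(X, k=2):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the attributes
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric matrix, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]    # largest variance first
    components = eigvecs[:, order]           # each column is one principal component
    return Xc @ components                   # data mapped into the new k-dim space

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, k=2).shape)  # (100, 2)
```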
• b. Feature selection
- removing redundant and irrelevant features
- Redundant features: Duplicating much or all of the information
ex. the purchase price of a product and the amount of sales tax paid
- Irrelevant features: Containing no information that is useful for the data mining task
ex. students' ID is often irrelevant to the task of predicting students' GPA
- Heuristics in attribute subset selection: stepwise forward selection (pick the best), stepwise backward elimination (eliminate the worst), a combination of the two, and decision tree induction; a forward-selection sketch follows
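A sketch of the "pick the best" heuristic, assuming a caller-supplied score function that evaluates a feature subset (e.g., cross-validated accuracy) and returns a baseline for the empty subset:

```python
def forward_selection(features, score, max_k):
    """Stepwise forward selection: greedily add the feature that most improves score.

    score(subset) is assumed to evaluate a list of feature names, e.g. by
    cross-validating a model on those columns; score([]) is the baseline.
    """
    selected, best_score = [], score([])
    while len(selected) < max_k:
        remaining = [f for f in features if f not in selected]
        if not remaining:
            break
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:   # no improvement -> stop early
            break
        selected.append(candidate)
        best_score = candidate_score
    return selected
```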
• c. Discrete Wavelet Transform
- widely used for signal processing
- wavelet : mathematical function used to divide a given function or continuous-time signal into different scale components
- One-dimensional Haar wavelet transform
▪ the original values can be reconstructed from the wavelet-transformed coefficients
▪ advantage: a large number of the detail coefficients turn out to be very small in magnitude; truncating or removing these small coefficients introduces only small errors in the reconstructed data, giving a form of "lossy" compression
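A sketch of the full one-dimensional Haar transform by repeated pairwise averaging and differencing (input length assumed to be a power of two; the input below is a common textbook example):

```python
import numpy as np

def haar_transform(x):
    """1-D Haar wavelet transform via repeated pairwise averaging and differencing."""
    x = np.asarray(x, dtype=float)
    coeffs = []
    while len(x) > 1:
        avg  = (x[0::2] + x[1::2]) / 2   # approximation at the next (coarser) level
        diff = (x[0::2] - x[1::2]) / 2   # detail coefficients; many are near zero
        coeffs = list(diff) + coeffs
        x = avg
    return list(x) + coeffs              # [overall average, coarse-to-fine details]

print(haar_transform([2, 2, 0, 2, 3, 5, 4, 4]))
# -> [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```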
④ Numerosity Reduction
• Reducing the data volume by choosing alternative, smaller forms of data representation
• Parametric methods (e.g., regression)
▪ store only the model parameters instead of the actual data
• Non-parametric methods
▪ do not assume models
▪ histograms, clustering, sampling, …
• a. sampling
- selection of a subset of individual observations within a population of individuals intended to yield some knowledge about the population
- allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- choosing a representative subset
- with simple random sampling, performance can be poor when the data is skewed
- Types of Sampling
‣ Simple random sampling: equal probability of selecting any particular item
‣ Sampling without replacement
‣ Sampling with replacement: the same object can be picked more than once
‣ Stratified sampling: the population is divided into non-overlapping groups (i.e., strata) and samples are drawn from each (see the sketch after this list)
- sampling size: sample as few points as possible, but enough to preserve the original data distribution!
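A sketch of stratified sampling with pandas (hypothetical label column as the stratum); drawing the same fraction within each stratum preserves the class ratio even when the population is skewed:

```python
import pandas as pd

# Skewed population: 90% class A, 10% class B (hypothetical toy data)
df = pd.DataFrame({
    "label": ["A"] * 90 + ["B"] * 10,
    "value": range(100),
})

# Simple random sampling might miss class B entirely; stratified sampling
# draws 10% within each stratum, so the sample keeps the A:B ratio.
sample = df.groupby("label").sample(frac=0.1, random_state=0)
print(sample["label"].value_counts())  # A: 9, B: 1
```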
• b. Data cube aggregation
- lowest level of a data cube (base cuboid)
- aggregated data for an individual entity of interest
ex. the amount of sales per day
- Multiple levels of aggregation in data cubes : day → week → month → quarter → year
- Using the smallest representation which is enough to solve the task
⑤ Data compression
• Data compression is the process of encoding information using fewer bits than the original representation would use
• It is also important for improving query performance
• Almost all data warehousing systems compress data when loading the data
• a. Run-length encoding
• b. Dictionary encoding
- for each unique value, a separate dictionary entry is created; the index of the dictionary entry is stored instead of the value
- with at most four distinct values, only two bits are required for each value
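Minimal sketches of both encodings (plain Python; the two-bit figure assumes at most four distinct values):

```python
from itertools import groupby

def run_length_encode(values):
    """Replace each run of identical consecutive values with a (value, count) pair."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def dictionary_encode(values):
    """Create one dictionary entry per unique value; store small indexes instead."""
    dictionary = {v: i for i, v in enumerate(dict.fromkeys(values))}
    return dictionary, [dictionary[v] for v in values]

colors = ["red", "red", "red", "blue", "blue", "red", "green"]
print(run_length_encode(colors))  # [('red', 3), ('blue', 2), ('red', 1), ('green', 1)]
d, codes = dictionary_encode(colors)
print(d, codes)  # {'red': 0, 'blue': 1, 'green': 2} [0, 0, 0, 1, 1, 0, 2]
# With at most 4 unique values, each code fits in 2 bits.
```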
4. Data Transformation and Discretization
• A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
• Methods
a. Normalization: min-max, z-score (sketch at the end of this section)
b. Discretization: concept hierarchy climbing
▪ Reducing the number of values for a given continuous attribute by dividing the range of the attribute into intervals and replacing actual data values with the interval labels
▪ Purposes
‣ To find informative cut-off points in the data
‣ To enable the use of some learning algorithms (some learning algorithms can accept only discrete variables)
‣ To reduce the data size
▪ Concept Hierarchy Generation
‣ A concept hierarchy organizes concepts (i.e., attribute values) hierarchically and is usually associated with each dimension in a data warehouse
‣ facilitates drill-down and roll-up in the data
‣ Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as youth, adult, or senior)
‣ can be explicitly specified by domain experts and/or data warehouse designers
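A numpy sketch of min-max normalization, z-score normalization, and a simple equal-width discretization (hypothetical salary values):

```python
import numpy as np

x = np.array([12000., 16000., 54000., 73600., 98000.])  # hypothetical salaries

# Min-max normalization to [0, 1]: v' = (v - min) / (max - min)
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization: v' = (v - mean) / std
zscore = (x - x.mean()) / x.std()

# Equal-width discretization into 3 intervals, replacing values with labels
edges = np.linspace(x.min(), x.max(), 4)   # 3 equal-width bins -> 4 edges
labels = np.digitize(x, edges[1:-1])       # 0 = low, 1 = mid, 2 = high

print(minmax)   # [0.     0.0465 0.4884 0.7163 1.    ] (rounded)
print(zscore)
print(labels)   # [0 0 1 2 2]
```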