
Data Mining Preprocessing ②

by isdawell 2023. 3. 15.

 

1. Types of data sets 


 

 

 

•  Relational Records 

↪ Collections of records, each of which consists of a fixed set of attributes

•  Data matrix: record data

↪ If the data objects have the same fixed set of numeric attributes, they can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute

↪ Stored as an m by n matrix: m rows (one per object), n columns (one per attribute)

 

 

 

 

•  Document data: record data 

↪ Each document becomes a term vector (word vector)

↪ Each term is a component of the vector

↪ The value of each component is the number of times the corresponding term occurs in the document
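
The term-vector idea can be sketched in a few lines of Python. This is only an illustration: the vocabulary, the two documents, and the helper name `term_vector` are made up, and splitting on whitespace is a simplifying assumption rather than real tokenization.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and documents, for illustration only.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc1 = "team play ball team score"
doc2 = "coach play play"

print(term_vector(doc1, vocabulary))  # [2, 0, 1, 1, 1]
print(term_vector(doc2, vocabulary))  # [0, 1, 2, 0, 0]
```

Each document ends up as one row of a data matrix whose columns are the terms.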

 

 

 

 

•  Transaction data: record data 

↪ A special type of record data, where each record involves a set of items

↪ Also called market basket data

 

 

 

 

•  WWW data: graph data 

↪ World Wide Web: HTML links between web pages can be represented as a graph

•  Social network data: graph data

↪ Relationships (friends, followers) between users can be modeled as a graph

•  Molecular structures: graph data

↪ Chemical compounds inherently have a graph structure

 

 

 

•  Sequential data:  ordered data 

↪ Sequences of transactions

•  Time series data:  ordered data

↪ Sequences of data points, typically measured at successive points in time spaced at uniform intervals

(ex. daily, monthly, yearly)

•  Genetic Sequence data:  ordered data

↪ Nucleic acids consist of a chain of linked units called nucleotides

•  Trajectory data:  ordered data

↪ Sequences of the locations and timestamps of moving objects  →  Locations are usually represented by latitude and longitude

 

 

 

 

•  Text data:  ordered data 

 

 

 

 

 

2. Data objects and attribute types 


 

•  Data objects 

↪ A data set can often be viewed as a collection of data objects

↪ A data object (or sample, example, instance, point, tuple) represents an entity

↪ Data objects are described by attributes

↪ In relational databases, rows are data objects and columns are attributes

•  Attributes

↪ An attribute (or dimension, feature) is a property or characteristic of an object

↪ Types

    ▸ Categorical : qualitative → Nominal, Ordinal

    ▸ Numeric : quantitative → Interval, Ratio

 

 

 

① Categorical attribute types

1) Nominal : categories, states, or "names of things"

2) Binary : special type of nominal attribute

▸ Nominal attribute with only 2 states (0, 1)

▸ Symmetric binary : both outcomes are equally important

    ⇨ ex. female and male

▸ Asymmetric binary : outcomes are not equally important

    ⇨ ex. medical test (positive vs negative) : positive is more important

▸ Convention : assign 1 to the most important outcome

    ⇨ ex. HIV positive

3) Ordinal : values have a meaningful order (ranking), but the magnitude between successive values is not known

▸ Ex. Sizes (={small, medium, large}), grades, army rankings

 

▸ CQA ) When an ordinal data set has an even number of values, the median may not be exactly defined, because two different middle categories cannot be averaged. With an odd number it is well defined. Ex. 9 shirts with sizes from {XXS, XS, S, M, L, XL}, listed in ascending order : XXS, XS, S, S, S, M, M, L, XL. The median is the 5th value, S.

▸ CQA ) An order is not defined for every set of attribute values. For example, the lexicographic order of names is only a convenience.
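
A minimal sketch of taking the median of an ordinal attribute, assuming the order of the categories is supplied explicitly. The helper `ordinal_median`, the `SIZE_ORDER` list, and the sample of shirts are hypothetical and exist only for this illustration.

```python
SIZE_ORDER = ["XXS", "XS", "S", "M", "L", "XL"]  # ascending ranks

def ordinal_median(values, order):
    """Return the middle value of an odd-length ordinal sample."""
    if len(values) % 2 == 0:
        # Two different middle categories cannot be averaged.
        raise ValueError("median may fall between two categories")
    ranked = sorted(values, key=order.index)
    return ranked[len(ranked) // 2]

shirts = ["M", "S", "XL", "XXS", "S", "L", "XS", "S", "M"]  # made-up sample
print(ordinal_median(shirts, SIZE_ORDER))  # S
```

Only the ranking is used here. No arithmetic is done on the categories, which is what keeps the operation valid for ordinal data.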

 

 

 

② Numeric attribute types

1) Interval

▸ Values are measured on a scale of equal-sized units: values have order

   ⇨ ex. temperature in °C or °F, calendar dates

▸ No true zero-point

   ⇨ ex. In a physical sense, 0 °C is arbitrary (cf. 0 K)

 

2) Ratio 

▸ Inherent zero point (true zero point)

▸ We can speak of a value as being a multiple (or ratio) of another value

   ⇨ ex. 10 K is twice as high as 5 K

   ⇨ ex. temperature in Kelvin, length, counts, monetary quantities

 

 

③ Possible data analysis

⇨ For ordinal attributes, the median and percentiles can be computed (see the CQA notes above).

⇨ For interval attributes, ratios and the coefficient of variation cannot be computed.
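
Why ratios are not meaningful on an interval scale can be checked with a small numeric experiment. The two temperatures are made-up values; the conversion formulas are the standard ones.

```python
def celsius_to_fahrenheit(c):
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    return c + 273.15

a_c, b_c = 10.0, 5.0  # two temperatures in degrees Celsius (made-up values)

# The "ratio" of the same two temperatures depends on the scale,
# because the zero point of Celsius and Fahrenheit is arbitrary.
ratio_c = a_c / b_c                                                # 2.0
ratio_f = celsius_to_fahrenheit(a_c) / celsius_to_fahrenheit(b_c)  # 50 / 41
ratio_k = celsius_to_kelvin(a_c) / celsius_to_kelvin(b_c)          # about 1.018

print(ratio_c, ratio_f, ratio_k)
```

Only the Kelvin ratio has a physical meaning, since Kelvin has a true zero point. The same reasoning explains why the coefficient of variation (standard deviation divided by mean) is not used for interval attributes.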

 

 

 

•  Continuous vs Discrete Attributes

↪ Discrete attribute: a finite or countably infinite set of possible values, often represented as integer variables

↪ Continuous attribute: real-valued, with infinitely many possible values (ex. temperature, height, weight); typically represented as floating-point variables

 

 

 

 

 

 

 

3. Basic statistical description of data 


 

•  Central tendency 

↪ Mean : the sum of the values divided by the number of values

↪ Median : the middle value of the ordered set if the number of values is odd; otherwise, the average of the two middle values

↪ Mode : the value that occurs most frequently in the set

    ⇨ Unimodal : only one mode

    ⇨ Bimodal : two modes

    ⇨ Multimodal : multiple modes
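
The three measures can be tried out with Python's standard `statistics` module. The data values below are illustrative only; `multimode` requires Python 3.8 or later.

```python
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values

mean = statistics.mean(data)        # 696 / 12
median = statistics.median(data)    # average of the two middle values 52 and 56
modes = statistics.multimode(data)  # every value with the highest frequency

print(mean, median, modes)
```

Here the data set is bimodal (52 and 70 both occur twice), and the single large value 110 pulls the mean above the median.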

 

 

 

 

•  Types of Measures 

 

↪ Distributive: a measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure’s value for the original (entire) data set

  ⇨ ex. sum, count, min, max

↪ Algebraic: a measure that can be computed by applying an algebraic function to one or more distributive measures

  ⇨ ex. mean = sum/count

↪ Holistic: a measure that must be computed on the entire data set as a whole

  ⇨ ex. median → the exact value for the whole data set cannot be obtained by merging the medians of the subsets
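
The difference between the three kinds of measures shows up as soon as the data are split into chunks. The sketch below uses a made-up partition of one small data set.

```python
import statistics

chunks = [[4, 8, 15], [16, 23], [42, 1, 7]]  # made-up partition of one data set
full = [x for chunk in chunks for x in chunk]

# Distributive: per-chunk results merge directly into the global result.
total = sum(sum(c) for c in chunks)
count = sum(len(c) for c in chunks)
largest = max(max(c) for c in chunks)

# Algebraic: an algebraic function of distributive measures.
mean = total / count

# Holistic: merging per-chunk medians does NOT give the true median in general.
median_of_medians = statistics.median(statistics.median(c) for c in chunks)
true_median = statistics.median(full)

print(total, count, largest, mean)     # 116 8 42 14.5
print(median_of_medians, true_median)  # 8 11.5
```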

 

 

 

 

 

•  Approximation of the Median 

 

↪ For grouped data, the median can be approximated by interpolating within the interval that contains it
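
A commonly used interpolation formula is median ≈ L1 + ((N/2 − F) / f_median) × width, where L1 is the lower boundary of the median interval, N the total number of values, F the sum of the frequencies of all intervals below the median interval, f_median the frequency of the median interval, and width its width. The sketch below assumes contiguous intervals and uses made-up frequencies; the function name `approx_median` is hypothetical.

```python
def approx_median(intervals):
    """intervals: list of (lower, upper, frequency), sorted by lower bound."""
    n = sum(freq for _, _, freq in intervals)
    below = 0  # total frequency of the intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and below + freq >= n / 2:
            # Interpolate linearly inside the median interval.
            return lower + (n / 2 - below) / freq * (upper - lower)
        below += freq
    raise ValueError("no data")

# Made-up grouped data: (lower, upper, frequency)
groups = [(0, 10, 4), (10, 20, 8), (20, 30, 6), (30, 40, 2)]
print(approx_median(groups))  # 17.5
```

With N = 20 the median interval is 10 to 20 (4 values lie below it), so the estimate is 10 + (10 − 4) / 8 × 10 = 17.5.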

 

 

 

•  Symmetric vs Skewed data 

 

 

⇨ Positively skewed (long right tail) : mode < median < mean

⇨ Negatively skewed (long left tail) : mean < median < mode

 

 

 

 

•  Dispersion of data 

↪ Quartiles

    ⇨  Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile)

    ⇨  IQR = Q3 - Q1

    ⇨  Five number summary : min, Q1, Q2, Q3, max

    ⇨  Outlier : usually, a value more than 1.5 × IQR above Q3 or below Q1

↪ Variance and standard deviation
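
A small sketch of the 1.5 × IQR rule. It assumes the quartiles are taken from `statistics.quantiles` with `method="inclusive"`, which interpolates between data points; other quartile conventions can give slightly different fences. The data values and the helper name `iqr_outliers` are illustrative.

```python
import statistics

def iqr_outliers(values):
    """Return the values lying more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values
print(iqr_outliers(data))  # [110]
```

For this data Q1 = 49.25 and Q3 = 64.75, so IQR = 15.5 and the fences are 26.0 and 88.0; only 110 falls outside.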

 

 

 

•  How to compute quartiles

↪ Method 1

    ⇨  Exclude the median and split the data into two halves around it; Q1 and Q3 are the medians of the lower and upper halves.

↪ Method 2

    ⇨  When the number of values is odd, so that the median is an actual value in the data (called a datum), include the median in both halves before computing. If the median is not a datum, compute as in Method 1 (median excluded).
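
Both methods can be written as one function with a flag. This is a sketch under the description above, not a reference implementation; the function name `quartiles` and the parameter `include_median` are hypothetical.

```python
import statistics

def quartiles(values, include_median=False):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half.

    include_median=False: Method 1, the median never belongs to either half.
    include_median=True:  Method 2, when the count is odd (the median is a
                          datum) the median is added to both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = data[:half + 1], data[half:]
    else:
        lower, upper = data[:half], data[n - half:]
    return (statistics.median(lower), statistics.median(data),
            statistics.median(upper))

odd = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(quartiles(odd))                       # (2.5, 5, 7.5)
print(quartiles(odd, include_median=True))  # (3, 5, 7)
```

For an even number of values the median is not a datum, so both settings give the same result.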

 

 

 

 

 

 

•  Graphic Displays of Basic Statistical Descriptions

 

① box plot

② histogram ⭐

▸ Shows a visual impression of the distribution of data

▸ Height of each histogram rectangle = frequency density of the interval : frequency / width of the interval

▸ The total area is equal to the number of data points
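
The relation between height, width, and area can be verified numerically. The data and the unequal-width interval edges are made up, and the helper `frequency_density` is a hypothetical name; intervals are treated as half-open, [left, right).

```python
def frequency_density(values, edges):
    """Return (left, right, frequency, height) for each interval [left, right)."""
    bars = []
    for left, right in zip(edges, edges[1:]):
        freq = sum(1 for v in values if left <= v < right)
        bars.append((left, right, freq, freq / (right - left)))
    return bars

values = [1, 2, 2, 3, 5, 6, 8, 12, 15, 18]  # made-up data
edges = [0, 5, 10, 20]                      # unequal-width intervals

bars = frequency_density(values, edges)
area = sum(height * (right - left) for left, right, _, height in bars)

for bar in bars:
    print(bar)
print(area)  # equals the number of data points (up to floating-point rounding)
```

The last interval is twice as wide as the others, so its bar is drawn at half the height that its raw frequency would suggest.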

 

 

③ Quantile plot

▸ With the data sorted in increasing order, each value x_i is paired with f_i, which indicates that approximately 100 f_i % of the data are less than or equal to x_i

▸ k-th q-quantile

   ⇨ 1 <= k <= (q-1)

   ⇨ q = 4 : quartile; the 1st 4-quantile is Q1, the 2nd 4-quantile is Q2, the 3rd 4-quantile is Q3

   ⇨ q = 100 : percentile
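
The points of a quantile plot can be generated as below. The choice f_i = (i − 0.5) / N is one common convention and is an assumption here, since the notes do not state which formula is used; the data values are made up.

```python
def quantile_pairs(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

prices = [40, 10, 30, 20]  # made-up values
for f, x in quantile_pairs(prices):
    print(f, x)
# 0.125 10
# 0.375 20
# 0.625 30
# 0.875 40
```

Plotting x against f gives the quantile plot; the value at f = 0.25, 0.5, 0.75 can be read off as an estimate of Q1, Q2, Q3.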

 

 

 

④ QQ plot

▸ Compares two probability distributions by plotting their quantiles against each other

▸ If the two distributions being compared are identical, the Q-Q plot follows the line y = x

⑤ Scatter plot

▸ A type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data

▸ In a QQ plot the x and y axes share the same unit (ex. # of items sold in branch 1, # of items sold in branch 2), whereas a scatter plot can pair two variables with different units

 

 

 

 

 

 

4. Data visualization techniques 


 

•  Thematic Cartography

 

① Choropleth

② Proportional symbol

③ Isarithmic

④ Dot

⑤ Dasymetric

⇨ Similar to a choropleth map, but one difference is that the regions are “not predefined”

 

 

 

 

5. Data similarity and dissimilarity 


 

•  Dissimilarity between Nominal attributes 

 

 

 

 

Computing the dissimilarity for the attribute test-1
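
For nominal attributes a standard definition is d(i, j) = (p − m) / p, where p is the number of attributes and m the number of attributes on which the two objects match. The sketch below assumes that definition; the objects and their codes are made up.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p, with p attributes and m matching values."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Hypothetical objects described by a single nominal attribute (test-1).
objects = {1: ["code A"], 2: ["code B"], 3: ["code C"], 4: ["code A"]}

print(nominal_dissimilarity(objects[1], objects[2]))  # 1.0
print(nominal_dissimilarity(objects[1], objects[4]))  # 0.0
```

With a single attribute the result is simply 0 for a match and 1 for a mismatch.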

 

 

 

 

•  Dissimilarity between Binary attributes

 

 

▸ Symmetric binary variable

The numerator is the number of mismatches

▸ Asymmetric binary variable

The numerator is the number of mismatches

⇨ For an asymmetric variable the two states are not equally important, so t, the number of attributes for which both objects are negative, is ignored.

▸ Asymmetric binary similarity : Jaccard coefficient
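
The usual formulas are written with a 2 × 2 contingency count: q (both 1), r (1 and 0), s (0 and 1), t (both 0). Assuming those definitions, symmetric d = (r + s) / (q + r + s + t), asymmetric d = (r + s) / (q + r + s), and Jaccard sim = q / (q + r + s). The two patient vectors below are made up.

```python
def contingency(obj_i, obj_j):
    """Count q (1,1), r (1,0), s (0,1), t (0,0) over paired binary attributes."""
    q = r = s = t = 0
    for a, b in zip(obj_i, obj_j):
        if a and b:
            q += 1
        elif a:
            r += 1
        elif b:
            s += 1
        else:
            t += 1
    return q, r, s, t

def symmetric_dissimilarity(obj_i, obj_j):
    q, r, s, t = contingency(obj_i, obj_j)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(obj_i, obj_j):
    q, r, s, _ = contingency(obj_i, obj_j)  # t (both negative) is ignored
    return (r + s) / (q + r + s)            # undefined if q + r + s == 0

def jaccard_similarity(obj_i, obj_j):
    q, r, s, _ = contingency(obj_i, obj_j)
    return q / (q + r + s)

# Hypothetical patients: 1 = positive/yes, 0 = negative/no.
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]

print(contingency(patient_a, patient_b))  # (2, 0, 1, 3)
```

For these two patients the asymmetric dissimilarity is 1/3 and the Jaccard coefficient is 2/3, so the two always sum to 1.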

 

 

 

 

 

 

•  Distance on Numeric Attributes: Minkowski Distance

L-h norm

⇨ h = 1 : Manhattan

⇨ h = 2 : Euclidean

⇨ h → ∞ : supremum
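
The Minkowski distance between objects i and j with p numeric attributes is d(i, j) = (|x_i1 − x_j1|^h + ... + |x_ip − x_jp|^h)^(1/h). A minimal sketch, with two made-up points:

```python
def minkowski(x, y, h):
    """L-h norm distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """Limit of the Minkowski distance as h grows without bound."""
    return max(abs(a - b) for a, b in zip(x, y))

x1, x2 = (1, 2), (3, 5)  # made-up points

manhattan = minkowski(x1, x2, 1)  # 2 + 3
euclidean = minkowski(x1, x2, 2)  # sqrt(4 + 9)
sup = supremum(x1, x2)            # max(2, 3)

print(manhattan, euclidean, sup)
```

As h increases, the largest coordinate difference dominates the sum, which is why the limit is the maximum difference.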

 

 

 

 

 

•  Common Properties of a Distance 

 

⇨ Positive definiteness : d(i, j) > 0 if i ≠ j, and d(i, i) = 0

⇨ Symmetry : d(i, j) = d(j, i)

⇨ Triangle inequality : d(i, j) ≤ d(i, k) + d(k, j)

⇨ Dissimilarity

 

 

 

•  Dissimilarity between Ordinal attributes
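
One common treatment, assumed here because the formula is not given in the text, is to replace each ordinal value by its rank r in 1..M, rescale it to z = (r − 1) / (M − 1) in [0, 1], and then apply a numeric distance to the z values. The grade scale below is made up.

```python
def ordinal_to_unit_interval(value, order):
    """Map a rank r in 1..M onto [0, 1] via z = (r - 1) / (M - 1)."""
    rank = order.index(value) + 1
    return (rank - 1) / (len(order) - 1)

def ordinal_dissimilarity(a, b, order):
    """Absolute difference of the rescaled ranks (a one-attribute distance)."""
    return abs(ordinal_to_unit_interval(a, order)
               - ordinal_to_unit_interval(b, order))

GRADES = ["fair", "good", "excellent"]  # ascending order, M = 3

print(ordinal_dissimilarity("fair", "excellent", GRADES))  # 1.0
print(ordinal_dissimilarity("good", "excellent", GRADES))  # 0.5
```

Rescaling to [0, 1] keeps attributes with many levels from outweighing attributes with few levels.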

 

 

 

 

•  Cosine similarity 
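
Cosine similarity between two vectors x and y is cos(x, y) = (x · y) / (||x|| ||y||). It is often applied to term vectors of documents. A short sketch with two made-up term-frequency vectors:

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Made-up term-frequency vectors for two documents.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]

print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Here x · y = 25, ||x|| = sqrt(42) and ||y|| = sqrt(17), so the similarity is 25 / sqrt(714), about 0.94. The function is undefined for an all-zero vector.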

 

 

•  Mixed types 

 

 

 

→ Compute the dissimilarity for each attribute with the formula that matches its type, then combine the results.

→ Weight : the importance assigned to the f-th attribute
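
One way to combine the per-attribute results is a weighted average, d(i, j) = Σ_f w_f d_f(i, j) / Σ_f w_f, where d_f is the dissimilarity on attribute f (scaled to [0, 1]) and w_f its weight. The sketch assumes this form and skips attributes whose value is missing; the function name and the numbers are hypothetical.

```python
def mixed_dissimilarity(per_attribute, weights):
    """Weighted average of per-attribute dissimilarities, each in [0, 1].

    per_attribute[f] is the dissimilarity on attribute f, already computed
    with the formula for that attribute's type, or None if a value is missing.
    weights[f] is the importance of attribute f.
    """
    num = den = 0.0
    for d, w in zip(per_attribute, weights):
        if d is None:
            continue  # a missing attribute contributes nothing
        num += w * d
        den += w
    return num / den

# Made-up example: nominal mismatch (1.0), ordinal (0.5), numeric (0.25).
print(mixed_dissimilarity([1.0, 0.5, 0.25], [1, 1, 2]))   # 0.5
print(mixed_dissimilarity([1.0, None, 0.25], [1, 1, 1]))  # 0.625
```

With all weights equal to 1 this reduces to a plain average over the attributes that are present.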

 
