Data Mining Preprocessing ②
1. Types of data sets
• Relational Records
▪ Collections of records, each of which consists of a fixed set of attributes
• Data matrix: record data
▪ If the data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
▪ m by n matrix: m rows, n columns
• Document data: record data
▪ Each document becomes a term vector (word vector)
▪ Each term is a component of the vector
▪ The value of each component is the number of times the corresponding term occurs in the document
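As a minimal sketch of the term-vector idea above, the snippet below counts term occurrences per document. The vocabulary, the two documents, and the function name `term_vector` are made up for illustration, and splitting on whitespace is an assumed simplification of real tokenization.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Return one count per vocabulary term (one vector component per term)."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and documents, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
doc1 = "team play ball team score"
doc2 = "coach team coach play"

print(term_vector(doc1, vocabulary))  # [2, 0, 1, 1, 1]
print(term_vector(doc2, vocabulary))  # [1, 2, 1, 0, 0]
```

Stacking such vectors row by row gives a document-term matrix, which is why document data is listed as a kind of record data.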
• Transaction data: record data
▪ A special type of record data, where each record involves a set of items
▪ Also called market basket data
• WWW data: graph data
▪ World Wide Web: HTML links between web pages can be represented as a graph
• Social network data: graph data
▪ Relationships (friends, followers) between users can be modeled as a graph
• Molecular structures: graph data
▪ Chemical compounds inherently have a graph structure
• Sequential data: ordered data
▪ Sequences of transactions
• Time series data: ordered data
▪ Sequences of data points, typically measured at successive times spaced at uniform time intervals (day, month, year)
• Genetic sequence data: ordered data
▪ Nucleic acids consist of a chain of linked units called nucleotides
• Trajectory data: ordered data
▪ Sequences of the locations and timestamps of moving objects → usually represented by latitude and longitude
• Text data: ordered data
2. Data objects and attribute types
• Data objects
▪ A data set can often be viewed as a collection of data objects
▪ A data object (or sample, example, instance, point, tuple) represents an entity
▪ Data objects are described by attributes
▪ In relational databases, rows are data objects and columns are attributes
• Attributes
▪ An attribute (or dimension, feature) is a property or characteristic of an object
▪ Types
▸ Categorical: qualitative → Nominal, Ordinal
▸ Numeric: quantitative → Interval, Ratio
① Categorical attribute types
1) Nominal: categories, states, or "names of things"
2) Binary: a special type of nominal attribute
▸ Nominal attribute with only 2 states (0, 1)
▸ Symmetric binary: both outcomes are equally important
➨ ex. female and male
▸ Asymmetric binary: outcomes are not equally important
➨ ex. medical test (positive vs. negative: positive is more important)
▸ Convention: assign 1 to the most important outcome
➨ ex. HIV positive
3) Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
▸ ex. sizes (= {small, medium, large}), grades, army rankings
▸ CQA) The median of an ordinal data set with an even number of values cannot always be determined exactly, because the two middle values cannot be averaged. With an odd number of values it can: if 9 shirts with sizes from {XXS, XS, S, M, L, XL} are sorted in ascending order, the median is the 5th value (S in the class example).
▸ CQA) An order is not defined for every set of attribute values. For example, the lexicographic order of names is used only for convenience.
② Numeric attribute types
1) Interval
▸ Values are measured on a scale of equal-sized units; values have order
➨ ex. temperature in °C or °F, calendar dates
▸ No true zero point
➨ ex. 0 °C is arbitrary in a physical sense (cf. 0 K)
2) Ratio
▸ Inherent zero point (true zero point)
▸ We can speak of one value as being a multiple (or ratio) of another
➨ ex. 10 K is twice as high as 5 K
➨ ex. temperature in Kelvin, length, counts, monetary quantities
③ Possible data analysis
➨ For ordinal attributes, the median and percentiles can be computed (see the CQA notes)
➨ For interval attributes, ratios and the coefficient of variation cannot be computed meaningfully
• Continuous vs Discrete Attributes
▪ Discrete attribute: a finite or countably infinite set of possible values; can be represented as integer variables
▪ Continuous attribute: real-valued, with infinitely many possible values (ex. temperature, height, weight); typically represented as floating-point variables
3. Basic statistical description of data
• Central tendency
▪ Mean
▪ Median: the middle value of the ordered set if the number of values is odd; otherwise, the average of the two middle values
▪ Mode: the value that occurs most frequently in the set
➨ Unimodal: only one mode
➨ Bimodal: two modes
➨ Multimodal: multiple modes
• Types of Measures
▪ Distributive: a measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure’s value for the original (entire) data set
➨ ex. sum, count, min, max
▪ Algebraic: a measure that can be computed by applying an algebraic function to one or more distributive measures
➨ ex. mean = sum / count
▪ Holistic: a measure that must be computed on the entire data set as a whole
➨ ex. median → the exact value for the whole data set cannot be derived from the medians of the subsets
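The three kinds of measures can be contrasted in a short sketch. The data values, the split into two partitions, and the helper names `partition_stats` and `merge_stats` are assumptions made for illustration only.

```python
from statistics import median

def partition_stats(chunk):
    """Distributive measures computed on one partition."""
    return {"sum": sum(chunk), "count": len(chunk),
            "min": min(chunk), "max": max(chunk)}

def merge_stats(a, b):
    """Merge per-partition results into the result for the whole data set."""
    return {"sum": a["sum"] + b["sum"], "count": a["count"] + b["count"],
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}

# Hypothetical data split into two partitions.
data = [1, 2, 3, 4, 100, 200]
left, right = data[:4], data[4:]

merged = merge_stats(partition_stats(left), partition_stats(right))

# Algebraic: the mean is a function of two distributive measures.
mean = merged["sum"] / merged["count"]

# Holistic: combining per-partition medians does not give the true median.
median_of_medians = median([median(left), median(right)])
true_median = median(data)

print(merged, mean)
print(median_of_medians, true_median)  # 76.25 3.5
```

The merged sum, count, min, and max match what a single pass over `data` would give, while the median has to be recomputed on the full data set.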
• Approximation of the Median
▪ For grouped data, the median can be approximated by interpolation within the interval that contains it
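A minimal sketch of the interpolation, assuming the commonly used textbook formula median ≈ L + ((N/2 - F) / f) × width, where L is the lower boundary of the median interval, F the cumulative frequency below it, and f its own frequency. The notes do not state the formula, so the formula, the interval data, and the name `approximate_median` are assumptions.

```python
def approximate_median(intervals):
    """Approximate the median of grouped data.

    intervals: list of (lower, upper, frequency) tuples, sorted and contiguous.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("grouped data contains no observations")
    cumulative = 0  # total frequency of all intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= n / 2:
            # Linear interpolation inside the median interval.
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq

# Hypothetical grouped data: (lower, upper, frequency).
groups = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]
print(approximate_median(groups))  # 15.0
```

Only the interval boundaries and frequencies are needed, which is why this works when the raw values are unavailable or too many to sort.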
• Symmetric vs Skewed data
➨ Positively skewed (long right tail): mode < median < mean
➨ Negatively skewed (long left tail): mean < median < mode
• Dispersion of data
▪ Quartiles
➨ Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile)
➨ IQR = Q3 - Q1
➨ Five-number summary: min, Q1, Q2, Q3, max
➨ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
▪ Variance and standard deviation
• How to compute quartiles
▪ First method
➨ Exclude the median and split the data into two halves around it; Q1 and Q3 are the medians of the lower and upper halves
▪ Second method
➨ If the number of values is odd, so that the median is itself a value in the data (called a datum), include the median in both halves and then compute the quartiles. If the median is not a datum, use the first method (median excluded).
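The two quartile methods above can be sketched as follows. The function names and the sample data are assumptions for illustration; both functions expect at least two values.

```python
from statistics import median

def quartiles_exclusive(values):
    """First method: the median is left out of both halves."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    # For an odd count, skip the middle element when forming the upper half.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(data), median(upper)

def quartiles_inclusive(values):
    """Second method: if the median is a datum, it joins both halves."""
    data = sorted(values)
    half = len(data) // 2
    if len(data) % 2 == 0:
        # The median is not a datum, so fall back to the first method.
        return quartiles_exclusive(data)
    return median(data[:half + 1]), median(data), median(data[half:])

# Hypothetical data with an odd number of values.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(quartiles_exclusive(sample))  # (2.5, 5, 7.5)
print(quartiles_inclusive(sample))  # (3, 5, 7)
```

The two methods agree whenever the count is even and can differ when it is odd, so it helps to state which one was used when reporting Q1, Q3, or the IQR.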
• Graphic Displays of Basic Statistical Descriptions
① Box plot
② Histogram
▸ Gives a visual impression of the distribution of the data
▸ Height of each histogram rectangle = frequency density of the interval = frequency / width of the interval
▸ The total area is equal to the number of data values
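A small sketch of the height rule, using made-up values and unequal interval widths. The name `frequency_density` and the half-open intervals [low, high) are assumptions.

```python
def frequency_density(values, edges):
    """Return (low, high, frequency, density) for each interval [low, high)."""
    bars = []
    for low, high in zip(edges, edges[1:]):
        freq = sum(1 for v in values if low <= v < high)
        bars.append((low, high, freq, freq / (high - low)))
    return bars

# Hypothetical data and bin edges with unequal widths.
values = [1, 2, 2, 3, 5, 7, 8, 12, 15, 18]
edges = [0, 5, 10, 20]

bars = frequency_density(values, edges)
for low, high, freq, density in bars:
    print(f"[{low}, {high}): frequency={freq}, height={density}")

# Area of each bar is height * width, so the areas sum to the data count.
total_area = sum(density * (high - low) for low, high, _, density in bars)
print(total_area)
```

With equal-width intervals the heights are proportional to the frequencies; the density form matters once the widths differ, as in the last interval here.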
③ Quantile plot
▸ With the data sorted in increasing order, each value x_i is paired with f_i, which indicates that approximately 100·f_i % of the data are below or equal to x_i
▸ k-th q-quantile
➨ 1 <= k <= (q-1)
➨ q = 4: quartiles; the 1st 4-quantile is Q1, the 2nd 4-quantile is Q2, the 3rd 4-quantile is Q3
➨ q = 100: percentiles
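The (f_i, x_i) pairs of a quantile plot can be computed as in the sketch below. The notes do not say how f_i is defined; the code assumes the common convention f_i = (i - 0.5) / N, and the function name and data are illustrative.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N (assumed convention)."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

# Hypothetical data.
points = quantile_plot_points([40, 10, 30, 20])
print(points)  # [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```

Plotting f_i on the x-axis against x_i on the y-axis gives the quantile plot.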
④ Q-Q plot
▸ Compares two probability distributions by plotting their quantiles against each other
▸ If the two distributions being compared are identical, the Q-Q plot follows the line y = x
⑤ Scatter plot
▸ A type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data
▸ In a Q-Q plot, the x and y axes use the same unit (ex. # of items sold in branch 1 vs. # of items sold in branch 2), whereas a scatter plot can pair two variables with different units
4. Data visualization techniques
• Thematic Cartography
① Choropleth
② Proportional symbol
③ Isarithmic
④ Dot
⑤ Dasymetric
➨ Similar to a choropleth map, except that the regions are “not predefined”
5. Data similarity and dissimilarity
• Dissimilarity between Nominal attributes
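The notes give no formula under this heading. A common choice, assumed here, is simple matching: d(i, j) = (p - m) / p, where p is the number of nominal attributes and m the number of attributes on which the two objects match. The objects and the function name are hypothetical.

```python
def nominal_dissimilarity(a, b):
    """Simple matching: share of attributes on which the two objects differ."""
    p = len(a)                                  # total number of attributes
    m = sum(1 for x, y in zip(a, b) if x == y)  # number of matches
    return (p - m) / p

# Hypothetical objects described by (color, marital status, job).
obj1 = ["red", "single", "teacher"]
obj2 = ["red", "married", "teacher"]
print(nominal_dissimilarity(obj1, obj2))
```

Identical objects get 0 and objects that differ on every attribute get 1.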
• Dissimilarity between Binary attributes
▸ Symmetric binary variable
▸ Asymmetric binary variable
➨ Because the two states of an asymmetric variable are not equally important, t, the number of attributes on which both objects are negative, is ignored
▸ Asymmetric binary similarity: Jaccard coefficient
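A minimal sketch of the binary measures, assuming the usual contingency counts: q (both 1), r (1 and 0), s (0 and 1), and t (both 0). Only t is named in the notes; the other symbols, the function names, and the two patient records are assumptions.

```python
def contingency(a, b):
    """Count the four 0/1 combinations across all attributes."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t

def symmetric_dissimilarity(a, b):
    q, r, s, t = contingency(a, b)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(a, b):
    q, r, s, _ = contingency(a, b)  # t (both negative) is ignored
    return (r + s) / (q + r + s)    # undefined if every attribute is 0 in both

def jaccard_similarity(a, b):
    q, r, s, _ = contingency(a, b)
    return q / (q + r + s)

# Hypothetical patients: 1 = symptom present or test positive.
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]
print(symmetric_dissimilarity(patient_a, patient_b))
print(asymmetric_dissimilarity(patient_a, patient_b))
print(jaccard_similarity(patient_a, patient_b))
```

The Jaccard coefficient equals 1 minus the asymmetric dissimilarity, which the three shared negatives here do not affect.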
• Distance on Numeric Attributes: Minkowski Distance
➨ h = 1: Manhattan distance
➨ h = 2: Euclidean distance
➨ h → ∞: supremum distance
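The three special cases can be checked with a short sketch of the Minkowski distance, d(x, y) = (Σ |x_f - y_f|^h)^(1/h). The function name and the two sample points are chosen for illustration.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if h == math.inf:
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1 / h)

# Hypothetical 2-D points.
x, y = (1, 2), (3, 5)
print(minkowski(x, y, 1))         # Manhattan: 5.0
print(minkowski(x, y, 2))         # Euclidean: about 3.61
print(minkowski(x, y, math.inf))  # supremum: 3
```

As h grows, the largest per-attribute difference dominates the sum, which is why the limit is the maximum difference.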
• Common Properties of a Distance
➨ Positive definiteness
➨ Symmetry
➨ Triangle inequality
➨ Dissimilarity
• Dissimilarity between Ordinal attributes
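No formula is given under this heading. A common approach, assumed here, replaces each value by its rank r in 1..M, rescales it to z = (r - 1) / (M - 1) in [0, 1], and then treats z as a numeric attribute. The state list and function name are hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to [0, 1] by its rank; needs at least 2 states."""
    rank = ordered_states.index(value) + 1  # ranks start at 1
    m = len(ordered_states)
    return (rank - 1) / (m - 1)

# Hypothetical ordinal attribute with three ordered states.
states = ["fair", "good", "excellent"]
observed = ["excellent", "fair", "good", "excellent"]

z = [ordinal_to_unit_interval(v, states) for v in observed]
print(z)  # [1.0, 0.0, 0.5, 1.0]

# With one attribute, the dissimilarity is the absolute difference of z values.
print(abs(z[0] - z[1]))  # 1.0
print(abs(z[0] - z[2]))  # 0.5
```

Rescaling to [0, 1] keeps attributes with many states from outweighing attributes with few.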
• Cosine similarity
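As a sketch, cosine similarity is the dot product of two vectors divided by the product of their lengths: cos(x, y) = (x · y) / (‖x‖ ‖y‖). The two term-frequency vectors below are made up for illustration, and the function assumes neither vector is all zeros.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors of two documents.
d1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
d2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(d1, d2), 2))  # 0.94
```

Because only the angle matters, a long document and a short one with the same term proportions get a similarity of 1.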
• Mixed types
→ Compute the dissimilarity for each attribute with the formula appropriate to its type, then combine the results
→ Weight: the importance assigned to the f-th attribute
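The combination step can be sketched as a weighted average of the per-attribute dissimilarities: d(i, j) = Σ w_f · d_f / Σ w_f. The exact formula is not given in the notes, so this form, the handling of missing values, and the numbers below are assumptions.

```python
def mixed_dissimilarity(per_attribute, weights):
    """Weighted average of per-attribute dissimilarities, each in [0, 1].

    A dissimilarity of None marks a missing value; that attribute is skipped.
    """
    numerator = 0.0
    denominator = 0.0
    for d_f, w_f in zip(per_attribute, weights):
        if d_f is None:
            continue
        numerator += w_f * d_f
        denominator += w_f
    if denominator == 0:
        raise ValueError("no attribute contributes to the dissimilarity")
    return numerator / denominator

# Hypothetical per-attribute results for one pair of objects:
# nominal (mismatch) = 1.0, ordinal = 0.5, numeric (normalized) = 0.45.
d = mixed_dissimilarity([1.0, 0.5, 0.45], weights=[1, 1, 1])
print(round(d, 2))  # 0.65
```

Setting a weight to 0 drops an attribute from the comparison, and raising it makes that attribute count more.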