1. Types of data sets
• Relational records
▪ Collections of records, each of which consists of a fixed set of attributes
• Data matrix: record data
▪ If the data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
▪ m by n matrix: m rows, n columns
• Document data: record data
▪ Each document becomes a term vector (word vector)
▪ Each term is a component of the vector
▪ The value of each component is the number of times the corresponding term occurs in the document
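The term-vector idea can be sketched in a few lines. This is only an illustration: the vocabulary, the sample sentence, and the helper name `term_vector` are made up, and tokenization is a naive whitespace split.

```python
from collections import Counter


def term_vector(document, vocabulary):
    """Return the term counts of `document`, ordered like `vocabulary`."""
    # Naive tokenization (assumption): lowercase, split on whitespace.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


vocabulary = ["team", "coach", "play", "ball", "score"]
doc = "The coach said the team will play and play until they score"
vector = term_vector(doc, vocabulary)
print(vector)  # [1, 1, 2, 0, 1]
```

Each position of the vector corresponds to one vocabulary term, so several documents encoded this way form the rows of a data matrix.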
• Transaction data: record data
▪ A special type of record data, where each record involves a set of items
▪ Also called market basket data
• WWW data: graph data
▪ World Wide Web: HTML links between web pages can be represented as a graph
• Social network data: graph data
▪ Relationships (friends, followers) between users can be modeled as a graph
• Molecular structures: graph data
▪ Chemical compounds inherently have a graph structure
• Sequential data: ordered data
▪ Sequences of transactions
• Time series data: ordered data
▪ Sequences of data points, typically measured at successive times spaced at uniform intervals (day, month, year)
• Genetic sequence data: ordered data
▪ Nucleic acids consist of a chain of linked units called nucleotides
• Trajectory data: ordered data
▪ Sequences of the locations and timestamps of moving objects → usually represented by latitude and longitude
• Text data: ordered data
2. Data objects and attribute types
• Data objects
▪ A data set can often be viewed as a collection of data objects
▪ A data object (or sample, example, instance, point, tuple) represents an entity
▪ Data objects are described by attributes
▪ In relational databases, rows are data objects and columns are attributes
• Attributes
▪ An attribute (or dimension, feature) is a property or characteristic of an object
▪ Types
▸ Categorical: qualitative → Nominal, Ordinal
▸ Numeric: quantitative → Interval, Ratio
① Categorical attribute types
1) Nominal: categories, states, or "names of things"
2) Binary: a special type of nominal attribute
▸ Nominal attribute with only 2 states (0, 1)
▸ Symmetric binary: both outcomes are equally important
⇨ ex. female and male
▸ Asymmetric binary: outcomes are not equally important
⇨ ex. medical test (positive vs. negative: positive is more important)
▸ Convention: assign 1 to the most important outcome
⇨ ex. HIV positive
3) Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
▸ ex. sizes ({small, medium, large}), grades, army rankings
▸ CQA) The median of an ordinal data set with an even number of values may not be well defined, because two different middle values cannot be averaged. With an odd number of values it is: ex. 9 shirts with sizes from {XXS, XS, S, M, L, XL}, sorted in ascending order as XXS, XS, S, S, S, M, M, L, XL, have median S (the 5th value).
▸ CQA) Not every attribute has a meaningful order. For example, the lexicographic order of names exists only for convenience.
② Numeric attribute types
1) Interval
▸ Values are measured on a scale of equal-sized units, and values have order
⇨ ex. temperature in °C or °F, calendar dates
▸ No true zero point
⇨ ex. 0 °C is arbitrary in a physical sense (cf. 0 K)
2) Ratio
▸ Inherent zero point (true zero point)
▸ We can speak of a value as being a multiple (or ratio) of another value
⇨ ex. 10 K is twice as high as 5 K
⇨ ex. temperature in Kelvin, length, counts, monetary quantities
③ Possible data analysis
⇨ For ordinal attributes, the median and percentiles can be computed (see the CQA above)
⇨ The coefficient of variation is meaningful for ratio attributes but not for interval attributes, because it requires a true zero point
• Continuous vs. discrete attributes
▪ Discrete attribute: a finite or countably infinite set of possible values, often represented as integer variables
▪ Continuous attribute: real-valued, with infinitely many possible values (ex. temperature, height, weight); typically represented as floating-point variables
3. Basic statistical description of data
• Central tendency
▪ Mean: the sum of the values divided by the number of values
▪ Median: the middle value of the ordered set if the number of values is odd; otherwise, the average of the two middle values
▪ Mode: the value that occurs most frequently in the set
⇨ Unimodal: only one mode
⇨ Bimodal: two modes
⇨ Multimodal: multiple modes
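As a quick check of the three measures, here is a small sketch using Python's standard `statistics` module. The data values are illustrative only.

```python
import statistics

# Illustrative values (already sorted for readability).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(data)        # sum / count
median = statistics.median(data)    # even count: average of the two middle values
modes = statistics.multimode(data)  # every value that attains the highest frequency

print(mean)    # 58
print(median)  # 54.0
print(modes)   # [52, 70]
```

Two values share the highest frequency here, so this data set is bimodal.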
• Types of measures
▪ Distributive: a measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the original (entire) data set
⇨ ex. sum, count, min, max
▪ Algebraic: a measure that can be computed by applying an algebraic function to one or more distributive measures
⇨ ex. mean = sum/count
▪ Holistic: a measure that must be computed on the entire data set as a whole
⇨ ex. median → the exact value for the whole data set cannot be obtained by combining the medians of the subsets
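The difference between the three kinds of measures can be seen on a toy example. The data and the three-way partitioning below are arbitrary choices for illustration.

```python
import statistics

data = [4, 8, 15, 16, 23, 42, 7, 1, 9]
partitions = [data[0:3], data[3:6], data[6:9]]

# Distributive: per-partition results merge into the global result.
total_sum = sum(sum(p) for p in partitions)
total_count = sum(len(p) for p in partitions)

# Algebraic: an algebraic function of distributive measures.
mean_from_parts = total_sum / total_count

# Holistic: combining per-partition medians does not give the true median.
median_of_medians = statistics.median([statistics.median(p) for p in partitions])
true_median = statistics.median(data)

print(total_sum == sum(data))          # True
print(median_of_medians, true_median)  # 8 9
```

The sum and the mean survive partitioning, while the median computed from partition medians (8) differs from the median of the full data set (9).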
• Approximation of the median
▪ For grouped data (values binned into intervals with known frequencies), the median can be approximated cheaply by interpolating within the interval that contains it
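One common form of this interpolation is median ≈ L + ((N/2 - F) / f) × width, where L is the lower bound of the median interval, N the total count, F the cumulative frequency below that interval, and f its own frequency. The notes do not spell the formula out, so treat the sketch below (including the function name and the sample intervals) as an assumption.

```python
def approximate_median(intervals):
    """Approximate the median of grouped data by linear interpolation.

    `intervals` is a list of (lower, upper, frequency) tuples sorted by
    lower bound (hypothetical input format chosen for this sketch).
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("intervals must contain at least one observation")
    cumulative = 0  # frequency of all intervals below the current one
    for lower, upper, freq in intervals:
        if cumulative + freq >= n / 2:
            # The median falls in this interval: interpolate inside it.
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq


grouped = [(0, 10, 5), (10, 20, 15), (20, 30, 20), (30, 40, 10)]
print(approximate_median(grouped))  # 22.5
```

With N = 50, the 25th value lies in the interval 20 to 30, a quarter of the way through its 20 observations, which gives 22.5.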
• Symmetric vs. skewed data
⇨ Symmetric (unimodal): mean = median = mode
⇨ Positively skewed (long right tail): typically mode < median < mean
⇨ Negatively skewed (long left tail): typically mean < median < mode
• Dispersion of data
▪ Quartiles
⇨ Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile)
⇨ IQR = Q3 - Q1
⇨ Five-number summary: min, Q1, Q2, Q3, max
⇨ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
▪ Variance and standard deviation
• How to compute quartiles
▪ First method
⇨ Exclude the median, split the data into two halves around it, and take the median of each half
▪ Second method
⇨ When the number of values is odd, so that the median is an actual data value (called a datum), include the median in both halves before computing. If the median is not a datum, use the first method (median excluded).
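Both quartile methods, together with the 1.5 × IQR outlier rule, can be sketched as follows. The function names and the `include_median` flag are hypothetical, and the sketch assumes at least two values.

```python
import statistics


def quartiles(values, include_median=False):
    """Return (Q1, Q2, Q3) by splitting the sorted data at the median.

    include_median=False -> first method (median excluded from both halves).
    include_median=True  -> second method (the median is included in both
    halves, but only when it is an actual data value, i.e. the count is odd).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    q2 = statistics.median(data)
    if n % 2 == 1 and include_median:
        lower, upper = data[:half + 1], data[half:]
    else:
        lower, upper = data[:half], data[n - half:]
    return statistics.median(lower), q2, statistics.median(upper)


def outliers(values, include_median=False):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(values, include_median)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


print(quartiles([1, 3, 5, 7, 9, 11, 13]))                       # (3, 7, 11)
print(quartiles([1, 3, 5, 7, 9, 11, 13], include_median=True))  # (4.0, 7, 10.0)
print(outliers([1, 3, 5, 7, 9, 11, 40]))                        # [40]
```

The two methods agree on Q2 but can give different Q1 and Q3 for odd-sized data, so it is worth stating which one is used.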
• Graphic displays of basic statistical descriptions
① Box plot
② Histogram ⭐
▸ Shows a visual impression of the distribution of the data
▸ Height of a histogram rectangle = frequency density of the interval = frequency / width of the interval
▸ The total area is equal to the number of data values
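The frequency-density rule is easy to verify numerically. In this sketch the bin edges and values are made up, and each bin is treated as half-open, [lower, upper), which is an assumption of the example.

```python
def frequency_density(values, edges):
    """Return (lower, upper, frequency, density) for each bin [lower, upper)."""
    bins = []
    for lower, upper in zip(edges, edges[1:]):
        frequency = sum(1 for v in values if lower <= v < upper)
        # Rectangle height = frequency / width of the interval.
        bins.append((lower, upper, frequency, frequency / (upper - lower)))
    return bins


values = [1, 2, 2, 3, 5, 6, 8, 12, 15, 19]
edges = [0, 5, 10, 20]  # unequal widths on purpose
bins = frequency_density(values, edges)

# Area of each rectangle = density * width; the areas add up to the count.
total_area = sum(density * (upper - lower) for lower, upper, _, density in bins)
print([frequency for _, _, frequency, _ in bins])  # [4, 3, 3]
```

With unequal bin widths, plotting density instead of raw frequency keeps the area of each rectangle proportional to the number of values it contains.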
③ Quantile plot
▸ With the data sorted in increasing order, each value x_i is paired with f_i, which indicates that approximately 100·f_i % of the data are less than or equal to x_i
▸ k-th q-quantile
⇨ 1 <= k <= (q-1)
⇨ q = 4: quartiles; the 1st 4-quantile is Q1, the 2nd 4-quantile is Q2, the 3rd 4-quantile is Q3
⇨ q = 100: percentiles
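The notes do not say how f_i is computed. A common convention is f_i = (i - 0.5) / N for the i-th smallest value, and the sketch below assumes exactly that; the function name is hypothetical.

```python
def quantile_plot_points(values):
    """Return (f_i, x_i) pairs for a quantile plot.

    Assumes the convention f_i = (i - 0.5) / N, with i counted from 1
    over the values sorted in increasing order.
    """
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]


points = quantile_plot_points([40, 10, 30, 20])
print(points)  # [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```

Plotting x_i against f_i gives the quantile plot; the quartiles are read off near f = 0.25, 0.5, and 0.75.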
④ QQ plot
▸ Compares two probability distributions by plotting their quantiles against each other
▸ If the two distributions being compared are identical, the Q-Q plot follows the line y = x
⑤ Scatter plot
▸ A type of mathematical diagram using Cartesian coordinates to display values of two variables for a set of data
▸ In a QQ plot, the x and y axes share the same unit (ex. # of items sold in branch 1, # of items sold in branch 2), whereas a scatter plot can pair two variables with different units
4. Data visualization techniques
• Thematic cartography
① Choropleth
② Proportional symbol
③ Isarithmic
④ Dot
⑤ Dasymetric
⇨ Similar to a choropleth map, except that the regions are not predefined
5. Data similarity and dissimilarity
• Dissimilarity between Nominal attributes
• Dissimilarity between Binary attributes
▸ Symmetric binary variable
▸ Asymmetric binary variable
⇨ Because the two states of an asymmetric variable are not equally important, t, the number of attributes for which both objects are negative, is ignored
▸ Asymmetric binary similarity: Jaccard coefficient
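The binary measures can be written out with the usual contingency counts. Only t appears in the notes; the names q (both 1), r (1 and 0), and s (0 and 1) follow the common textbook convention and are an assumption here, as are the sample vectors.

```python
def contingency(a, b):
    """Count attribute pairs: q = (1,1), r = (1,0), s = (0,1), t = (0,0)."""
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    t = sum(1 for x, y in zip(a, b) if x == 0 and y == 0)
    return q, r, s, t


def symmetric_dissimilarity(a, b):
    q, r, s, t = contingency(a, b)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(a, b):
    # t (both negative) is ignored. Assumes q + r + s > 0.
    q, r, s, _ = contingency(a, b)
    return (r + s) / (q + r + s)


def jaccard_similarity(a, b):
    # Equals 1 - asymmetric_dissimilarity. Assumes q + r + s > 0.
    q, r, s, _ = contingency(a, b)
    return q / (q + r + s)


a = [1, 0, 1, 0, 0, 0]
b = [1, 0, 1, 0, 1, 0]
print(contingency(a, b))  # (2, 0, 1, 3)
```

For these vectors the symmetric dissimilarity is 1/6, while the asymmetric one is 1/3, because the three shared negatives no longer count as agreement.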
• Distance on numeric attributes: Minkowski distance
⇨ h = 1: Manhattan distance
⇨ h = 2: Euclidean distance
⇨ h → ∞: supremum distance
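The three special cases follow from the general definition d(x, y) = (Σ |x_f - y_f|^h)^(1/h). A minimal sketch, with made-up points and a hypothetical function name:

```python
import math


def minkowski(x, y, h):
    """Minkowski distance of order h between two equal-length vectors."""
    if h == math.inf:
        # Supremum distance: the largest per-attribute difference.
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


x, y = (1, 2), (3, 5)
manhattan = minkowski(x, y, 1)         # |1-3| + |2-5| = 5
euclidean = minkowski(x, y, 2)         # sqrt(4 + 9), about 3.606
supremum = minkowski(x, y, math.inf)   # max(2, 3) = 3
```

As h grows, the largest per-attribute difference dominates the sum, which is why the limit is the supremum distance.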
• Common properties of a distance
⇨ Positive definiteness: d(i, j) > 0 if i ≠ j, and d(i, i) = 0
⇨ Symmetry: d(i, j) = d(j, i)
⇨ Triangle inequality: d(i, j) ≤ d(i, k) + d(k, j)
⇨ Dissimilarity
• Dissimilarity between Ordinal attributes
• Cosine similarity
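Cosine similarity is the dot product of two vectors divided by the product of their lengths, cos(x, y) = (x · y) / (||x|| ||y||), which makes it a natural fit for term vectors. A minimal sketch with illustrative term-frequency vectors (it assumes neither vector is all zeros):

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


x = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
y = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
similarity = cosine_similarity(x, y)  # 25 / (sqrt(42) * sqrt(17)), about 0.94
```

A value near 1 means the two documents use terms in similar proportions, regardless of document length; 0 means they share no terms.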
• Mixed types
→ Compute the dissimilarity for each attribute with the formula suited to its type, then combine the results.
→ Weight: the importance of the f-th attribute
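One way to read the two steps above is a weighted average of per-attribute dissimilarities, d(i, j) = Σ w_f · d_f(i, j) / Σ w_f. The sketch below is an assumption rather than the exact formula from the notes: it handles only nominal and numeric attributes, skips missing values (`None`), and uses hypothetical parameter names.

```python
def mixed_dissimilarity(a, b, attribute_types, weights, ranges):
    """Weighted average of per-attribute dissimilarities (sketch).

    attribute_types[f] is "nominal" or "numeric" (only these two are handled).
    weights[f] is the importance of attribute f (assumed interpretation).
    ranges[f] is max - min of numeric attribute f over the data set.
    """
    total, weight_sum = 0.0, 0.0
    for f, kind in enumerate(attribute_types):
        if a[f] is None or b[f] is None:
            continue  # a missing value contributes nothing
        if kind == "nominal":
            d = 0.0 if a[f] == b[f] else 1.0
        elif kind == "numeric":
            d = abs(a[f] - b[f]) / ranges[f]  # normalized to [0, 1]
        else:
            raise ValueError(f"unsupported attribute type: {kind}")
        total += weights[f] * d
        weight_sum += weights[f]
    return total / weight_sum


a = ("red", 10.0)
b = ("blue", 30.0)
result = mixed_dissimilarity(
    a, b,
    attribute_types=["nominal", "numeric"],
    weights=[1, 1],
    ranges=[None, 40.0],
)
print(result)  # 0.75
```

The nominal attribute contributes 1 (different colors) and the numeric one 20/40 = 0.5, so with equal weights the combined dissimilarity is 0.75.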