1. Types of data sets

• Relational Records
▪ collections of records, each of which consists of a fixed set of attributes

• Data matrix: record data
▪ If the data objects have the same fixed set of numeric attributes, the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute
▪ m by n matrix: m rows, n columns

• Document data: record data
▪ Each document becomes a term vector (word vector)
▪ Each term is a component of the vector
▪ The value of each component is the number of times the corresponding term occurs in the document
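
The term-vector idea can be sketched in a few lines. This is only an illustration: the vocabulary, the sample sentence, and the whitespace tokenization are assumptions, not part of the notes.

```python
from collections import Counter

def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    # Naive tokenization (lowercase + whitespace split) is an assumption.
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["data", "mining", "graph"]  # hypothetical vocabulary
vector = term_vector("Data mining finds patterns in data", vocabulary)
print(vector)
```

Each position of `vector` corresponds to one term of `vocabulary`, so documents sharing a vocabulary become rows of a data matrix.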

• Transaction data: record data
▪ A special type of record data, where each record involves a set of items
▪ Also called market basket data

• WWW data: graph data
▪ World Wide Web: HTML links between web pages can be represented as a graph

• Social network data: graph data
▪ Relationships (friends, followers) between users can be modeled as a graph

• Molecular structures: graph data
▪ Chemical compounds naturally have a graph structure

• Sequential data: ordered data
▪ Sequences of transactions

• Time series data: ordered data
▪ Sequences of data points, typically measured at successive points in time spaced at uniform intervals (day, month, year)

• Genetic sequence data: ordered data
▪ Nucleic acids consist of a chain of linked units called nucleotides

• Trajectory data: ordered data
▪ Sequences of the locations and timestamps of moving objects; locations are usually represented by latitude and longitude

• Text data: ordered data

2. Data objects and attribute types
• Data objects
▪ A data set can often be viewed as a collection of data objects
▪ A data object (or sample, example, instance, point, tuple) represents an entity

▪ Data objects are described by attributes
▪ In relational databases, rows are data objects and columns are attributes
• Attributes
▪ An attribute (or dimension, feature) is a property or characteristic of an object
▪ Types
▸ Categorical : qualitative → Nominal, Ordinal
▸ Numeric : quantitative → Interval, Ratio
① Categorical attribute types
1) Nominal : categories, states, or "names of things"

2) Binary : special type of nominal attribute
▸ Nominal attribute with only 2 states (0, 1)
▸ Symmetric binary : both outcomes are equally important
⇨ ex. female and male
▸ Asymmetric binary : outcomes are not equally important
⇨ ex. medical test (positive vs. negative: positive is more important)
▸ Convention : assign 1 to the more important outcome
⇨ ex. HIV positive
3) Ordinal : values have a meaningful order (ranking), but the magnitude between successive values is not known
▸ ex. sizes (= {small, medium, large}), grades, army rankings
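▸ CQA) The median of an ordinal data set with an even number of values may not be well defined, because two different middle values cannot be averaged. With an odd number of values it is always one of the observed values: e.g., 9 shirts with sizes from {XXS, XS, S, M, L, XL}, listed in ascending order as XXS, XS, S, S, S, M, M, L, XL, have median S (the 5th value).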
▸ CQA) An order is not meaningful for every attribute. For example, the lexicographic order of names is only a convenience, so names remain nominal.
② Numeric attribute types
1) Interval
▸ Values are measured on a scale of equal-sized units, so values have an order and differences between values are meaningful
⇨ ex. temperature in °C or °F, calendar dates
▸ No true zero point
⇨ ex. 0 °C is arbitrary in a physical sense (cf. 0 K)
2) Ratio
▸ Inherent zero point (true zero point)
▸ We can speak of a value as being a multiple (or ratio) of another value
⇨ ex. 10 K is twice as high as 5 K
⇨ ex. temperature in Kelvin, length, counts, monetary quantities
• Possible data analysis

⇨ For ordinal attributes, the median and percentiles can be computed (see the CQA notes under Ordinal).
⇨ For interval attributes, ratios and the coefficient of variation cannot be computed, because there is no true zero point.
• Continuous vs Discrete Attributes
▪ Discrete attribute: a finite or countably infinite set of possible values; can be represented as integer variables
▪ Continuous attribute: real-valued, with infinitely many possible values (ex. temperature, height, weight); typically represented as floating-point variables

3. Basic statistical description of data
• Central tendency
▪ Mean
▪ Median : the middle value of the ordered set if the number of values is odd; otherwise, the average of the two middle values
▪ Mode : the value that occurs most frequently in the set
⇨ Unimodal : only one mode
⇨ Bimodal : two modes
⇨ Multimodal : multiple modes
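
A quick check of the three measures with Python's standard `statistics` module. The sample values are illustrative only.

```python
import statistics

# Illustrative sample (12 values, two values tie for most frequent).
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)
median = statistics.median(values)    # even count: average of the two middle values
modes = statistics.multimode(values)  # all most-frequent values, so bimodal data shows up

print(mean, median, modes)
```

The single large value 110 pulls the mean above the median, which is the usual sign of a long right tail.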
• Types of Measures
▪ Distributive: A measure that can be computed for a given data set by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results to arrive at the measure's value for the original (entire) data set
⇨ ex. sum, count, min, max
▪ Algebraic: A measure that can be computed by applying an algebraic function to one or more distributive measures
⇨ ex. mean = sum/count

▪ Holistic: A measure that must be computed on the entire data set as a whole
⇨ ex. median : the exact value for the whole data set cannot be obtained by combining the medians of the subsets
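
The difference between the three kinds of measure can be seen on a small partitioned data set. The partitions below are made up for illustration.

```python
import statistics

partitions = [[1, 2, 9], [3, 4], [5, 6, 7, 8]]  # made-up partitioned data
all_values = [x for part in partitions for x in part]

# Distributive: per-partition sums and counts merge into the global values.
total = sum(sum(part) for part in partitions)
count = sum(len(part) for part in partitions)

# Algebraic: the mean is a function of two distributive measures.
mean = total / count

# Holistic: combining per-partition medians does not give the true median.
median_of_medians = statistics.median(statistics.median(p) for p in partitions)
true_median = statistics.median(all_values)

print(mean, median_of_medians, true_median)
```

Here the mean recovered from partial sums is exact, while the median of the partition medians differs from the median of the full data set.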
• Approximation of the Median
▪ For grouped data (values binned into intervals with known frequencies), the median can be approximated by interpolating within the interval that contains it
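
A minimal sketch of the interpolation, assuming the commonly used formula median ≈ L + ((N/2 − F) / f) × width, where L is the lower boundary of the median interval, N the total count, F the summed frequency of all lower intervals, and f the frequency of the median interval. The function name and the grouped data are hypothetical.

```python
def approximate_median(intervals):
    """Approximate the median of grouped data.

    intervals: list of (lower, upper, frequency), sorted by lower bound.
    """
    n = sum(freq for _, _, freq in intervals)
    cumulative = 0  # summed frequency of the intervals below the current one
    for lower, upper, freq in intervals:
        if freq > 0 and cumulative + freq >= n / 2:
            # Linear interpolation inside the median interval.
            return lower + (n / 2 - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("intervals contain no data")

grouped = [(0, 10, 5), (10, 20, 10), (20, 30, 5)]  # hypothetical grouped data
print(approximate_median(grouped))
```

The result is an estimate: it assumes values are spread evenly inside the median interval.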

• Symmetric vs Skewed data

⇨ Symmetric (unimodal) : mean = median = mode
⇨ Positively skewed (long right tail) : mode < median < mean
⇨ Negatively skewed (long left tail) : mean < median < mode
• Dispersion of data
▪ Quartiles
⇨ Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile)
⇨ IQR = Q3 - Q1
⇨ Five-number summary : min, Q1, Q2, Q3, max
⇨ Outlier : usually, a value more than 1.5 × IQR above Q3 or below Q1
▪ Variance and standard deviation
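
These dispersion measures can be sketched with the standard `statistics` module. The sample is made up, and `method="inclusive"` is just one of several quartile conventions, so other tools may report slightly different Q1 and Q3.

```python
import statistics

values = [1, 3, 5, 7, 9, 11, 13, 15, 40]  # made-up sample; 40 looks suspicious

# Quartiles under the "inclusive" convention (an assumption of this sketch).
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# 1.5 x IQR rule: flag values beyond the fences around Q1 and Q3.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]

five_number = (min(values), q1, q2, q3, max(values))

# Population variance and standard deviation (std_dev is its square root).
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

print(five_number, iqr, outliers)
```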

• How to compute quartiles
▪ Method 1
⇨ Exclude the median, split the data into two halves around it, and compute Q1 and Q3 as the medians of the lower and upper halves
▪ Method 2
⇨ If the number of values is odd, so that the median is an actual value in the data (called a datum), include the median in both halves before computing Q1 and Q3. If the median is not a datum, compute as in Method 1 (median not included).
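
Both methods can be sketched with one helper. The function name, the `include_median` flag, and the sample are hypothetical choices for illustration.

```python
import statistics

def quartiles(values, include_median=False):
    """Return (Q1, Q2, Q3) by splitting the sorted data at the median.

    include_median=False: Method 1 (median excluded from both halves).
    include_median=True:  Method 2 (for an odd count, the median datum
                          is included in both halves).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = data[: half + 1], data[half:]
    else:
        # Even count: the median is not a datum, so both methods agree.
        lower, upper = data[:half], data[n - half:]
    return (statistics.median(lower),
            statistics.median(data),
            statistics.median(upper))

sample = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]  # 11 values, median 40
print(quartiles(sample))                        # Method 1
print(quartiles(sample, include_median=True))   # Method 2
```

For an odd-sized sample the two methods can disagree on Q1 and Q3, which is why reported quartiles should state the convention used.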

• Graphic Displays of Basic Statistical Descriptions
① box plot


② histogram ⭐
▸ shows a visual impression of the distribution of the data
▸ Height of each rectangle = frequency density of the interval = frequency / width of the interval
▸ The total area is equal to the number of data points
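
A small sketch of the frequency-density rule with unequal interval widths. The data, the bin edges, and the half-open binning rule are assumptions made for this example.

```python
def frequency_density(values, edges):
    """Return the bar height (frequency / width) for each interval.

    Intervals are [edges[i], edges[i+1]); the last interval also includes
    its upper edge. This binning rule is an assumption of the sketch.
    """
    heights = []
    last = len(edges) - 2
    for i, (lo, hi) in enumerate(zip(edges, edges[1:])):
        freq = sum(1 for v in values if lo <= v < hi or (i == last and v == hi))
        heights.append(freq / (hi - lo))
    return heights

values = [1, 2, 2, 3, 5, 8, 9, 14, 18, 20]  # made-up data
edges = [0, 5, 10, 20]                      # interval widths 5, 5, 10
heights = frequency_density(values, edges)

# Summing height x width over all bars recovers the number of data points.
area = sum(h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:]))
print(heights, area)
```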

③ Quantile plot
▸ For a value x_i, with the data sorted in increasing order, f_i indicates that approximately 100·f_i % of the data are below or equal to x_i

▸ k-th q-quantile
⇨ 1 <= k <= (q-1)
⇨ q = 4 : quartiles; the 1st 4-quantile is Q1, the 2nd 4-quantile is Q2, the 3rd 4-quantile is Q3
⇨ q = 100 : percentiles
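
The points of a quantile plot can be computed as below, assuming the common convention f_i = (i − 0.5) / N for the i-th smallest of N values. Other conventions for f_i exist; the function name is hypothetical.

```python
def quantile_plot_points(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / N, for i = 1..N."""
    data = sorted(values)
    n = len(data)
    return [((i - 0.5) / n, x) for i, x in enumerate(data, start=1)]

points = quantile_plot_points([40, 10, 30, 20])
for f_i, x_i in points:
    print(f"{100 * f_i:.1f}% of the data are at or below {x_i}")
```

Plotting x_i against f_i gives the quantile plot; the quartiles are read off at f = 0.25, 0.50, and 0.75.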
④ QQ plot
▸ compares two probability distributions by plotting their quantiles against each other

▸ If the two distributions being compared are identical, the Q-Q plot follows the line y = x
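
A sketch of how Q-Q points could be computed from two samples, using quartiles from the standard `statistics` module. The branch data and the choice of the "inclusive" quantile convention are assumptions.

```python
import statistics

def qq_points(sample_x, sample_y, q=4):
    """Pair the k-th q-quantiles (k = 1..q-1) of two samples."""
    qx = statistics.quantiles(sample_x, n=q, method="inclusive")
    qy = statistics.quantiles(sample_y, n=q, method="inclusive")
    return list(zip(qx, qy))

branch1 = [40, 50, 60, 70, 80]  # hypothetical values measured at branch 1
branch2 = [45, 55, 65, 75, 85]  # hypothetical values measured at branch 2

print(qq_points(branch1, branch2))  # every point lies above the line y = x
print(qq_points(branch1, branch1))  # identical samples fall on y = x
```

Points above y = x mean the second sample has larger values at the same quantile.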
⑤ Scatter plot
▸ A type of mathematical diagram using Cartesian coordinates to display values for two variables for a set of data
▸ In a Q-Q plot the x and y axes share the same unit (ex. # of items sold in branch 1, # of items sold in branch 2), whereas a scatter plot can relate two variables with different units
4. Data visualization techniques
• Thematic Cartography
① Choropleth

② Proportional symbol

③ Isarithmic

④ Dot

⑤ Dasymetric

⇨ similar to a choropleth map, except that the regions are "not predefined"
5. Data similarity and dissimilarity
• Dissimilarity between Nominal attributes
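
A commonly used definition, assumed here, is the fraction of attributes on which two objects disagree. The symbols p (total number of nominal attributes) and m (number of attributes on which objects i and j match) are naming assumptions.

```latex
d(i, j) = \frac{p - m}{p},
\qquad
\mathrm{sim}(i, j) = 1 - d(i, j) = \frac{m}{p}
```

For example, two objects that agree on 1 of 3 nominal attributes have d = 2/3.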





• Dissimilarity between Binary attributes

▸ symmetric binary variable

▸ asymmetric binary variable

⇨ Because the two states of an asymmetric attribute are not equally important, t (the number of attributes for which both objects are negative) is ignored.
▸ asymmetric binary similarity : Jaccard coefficient
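
A sketch of the three binary measures, assuming the usual contingency counts: q (both 1), r (first 1, second 0), s (first 0, second 1), and t (both 0). The function names and the two example records are hypothetical.

```python
def binary_counts(x, y):
    """Contingency counts for two 0/1 vectors: q=(1,1), r=(1,0), s=(0,1), t=(0,0)."""
    pairs = list(zip(x, y))
    return (pairs.count((1, 1)), pairs.count((1, 0)),
            pairs.count((0, 1)), pairs.count((0, 0)))

def symmetric_dissimilarity(x, y):
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(x, y):
    q, r, s, _ = binary_counts(x, y)  # t (both negative) is ignored
    return (r + s) / (q + r + s)      # undefined if both vectors are all 0

def jaccard_similarity(x, y):
    q, r, s, _ = binary_counts(x, y)
    return q / (q + r + s)            # equals 1 - asymmetric dissimilarity

patient_a = [1, 0, 1, 0, 0, 0]  # hypothetical test results (1 = positive)
patient_b = [1, 0, 1, 0, 1, 0]

print(symmetric_dissimilarity(patient_a, patient_b))
print(asymmetric_dissimilarity(patient_a, patient_b))
print(jaccard_similarity(patient_a, patient_b))
```

Ignoring t makes the asymmetric dissimilarity larger here, because the three shared negatives no longer count as agreement.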


• Distance on Numeric Attributes: Minkowski Distance

⇨ h = 1 : Manhattan
⇨ h = 2 : Euclidean
⇨ h → ∞ : supremum
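
The three special cases can be checked with a short function, assuming the standard definition d(x, y) = (Σ |x_f − y_f|^h)^(1/h). The function name and the sample points are illustrative.

```python
import math

def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1); h = math.inf gives the supremum."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(h):
        return max(diffs)  # limit h -> infinity: largest coordinate difference
    return sum(d ** h for d in diffs) ** (1 / h)

x1, x2 = (1, 2), (3, 5)  # illustrative points
print(minkowski(x1, x2, 1))         # Manhattan
print(minkowski(x1, x2, 2))         # Euclidean
print(minkowski(x1, x2, math.inf))  # supremum
```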

• Common Properties of a Distance
⇨ Positive definiteness : d(i, j) > 0 if i ≠ j, and d(i, i) = 0
⇨ Symmetry : d(i, j) = d(j, i)
⇨ Triangle inequality : d(i, j) <= d(i, k) + d(k, j)
⇨ Dissimilarity
• Dissimilarity between Ordinal attributes
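
One common approach, assumed here, replaces each ordinal value by its rank r in 1..M and rescales it to z = (r − 1) / (M − 1), after which any numeric distance can be applied. The state names below are hypothetical.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map a state to z = (r - 1) / (M - 1), where r is its 1-based rank."""
    rank = ordered_states.index(value) + 1
    return (rank - 1) / (len(ordered_states) - 1)

states = ["fair", "good", "excellent"]  # hypothetical ordered states, M = 3
observed = ["excellent", "fair", "good", "excellent"]
z = [ordinal_to_unit_interval(v, states) for v in observed]

# Distance between the first two objects on this single attribute.
distance = abs(z[0] - z[1])
print(z, distance)
```

Rescaling to [0, 1] keeps attributes with many states from dominating those with few.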

• Cosine similarity
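
Cosine similarity is the dot product of two vectors divided by the product of their lengths, cos(x, y) = (x · y) / (‖x‖ ‖y‖). A minimal sketch on two illustrative term-frequency vectors:

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between x and y (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]  # illustrative term-frequency vectors
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
similarity = cosine_similarity(doc1, doc2)
print(round(similarity, 2))
```

Because only the angle matters, a long and a short document with the same term proportions get a similarity of 1.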

• Mixed types

→ Compute the dissimilarity for each attribute with the formula that fits its type, then combine the results.

→ Weight : the importance assigned to the f-th attribute
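
One way to write the combination, assuming a weighted average over the p attributes. Here w^{(f)} is the weight of the f-th attribute and d_{ij}^{(f)} is the per-attribute dissimilarity computed by the rule for its type and scaled to [0, 1]; both symbols are naming assumptions.

```latex
d(i, j) = \frac{\sum_{f=1}^{p} w^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} w^{(f)}}
```

A common special case uses an indicator weight: 0 when the f-th value is missing for either object (or both values are 0 for an asymmetric binary attribute), and 1 otherwise.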