Data Mining Preprocessing ①

1. GOAL of the course

• Basic concepts of data mining

• Data preprocessing

• Association, correlation, and frequent pattern analysis

• Classification

• Cluster and outlier analysis

• Data Mining: Industry efforts and social impacts

2. Technology Trend

• Explosive growth of data: from terabytes to petabytes

↪ Big data, Internet of Things, Web 2.0, scientific simulation

• Motivation of Data Mining

↪ Information is often hidden in data, but much of the data is not analyzed at all

↪ Unprecedented amounts of data are being generated and collected (thanks to computing and storage technologies), but most of it is just kept in storage

• Internet of Things, Machine-to-Machine, Smarter Planet

↪ Sensors in our daily life, connected by wireless networks

↪ Wireless Network > Sensor Streams > Data mining and knowledge discovery

• Web 2.0

↪ A Web 2.0 site allows users to interact and collaborate with each other in a social media dialogue

↪ Social networking sites (Facebook, Twitter), blogs, wikis, photo/video sharing sites

↪ Social networking sites → data is modeled using a graph
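
As a minimal sketch of the graph model mentioned above, a social network can be stored as an adjacency list in which users are nodes and friendships are undirected edges. The user names and the helpers `add_friendship` and `mutual_friends` are hypothetical, made up for illustration.

```python
from collections import defaultdict

# Adjacency list: each user (node) maps to the set of users they are connected to.
graph = defaultdict(set)

def add_friendship(graph, a, b):
    """Add an undirected edge between users a and b."""
    graph[a].add(b)
    graph[b].add(a)

# Hypothetical users and friendships, for illustration only.
add_friendship(graph, "alice", "bob")
add_friendship(graph, "alice", "carol")
add_friendship(graph, "bob", "dave")

def mutual_friends(graph, a, b):
    """Users connected to both a and b."""
    return graph[a] & graph[b]

print(sorted(graph["alice"]))                 # ['bob', 'carol']
print(mutual_friends(graph, "carol", "bob"))  # {'alice'}
```

Graph mining tasks such as friend recommendation typically start from neighborhood queries like `mutual_friends`.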

• Scientific Simulations

↪ Data-centric science: data is produced through scientific simulations and observations

↪ Empirical science → theoretical science → computational science → data science

↪ ex. NASA Center for Climate Simulation (NCCS; satellite observational data), particle accelerators, DNA sequencing

↪ It is very important to efficiently analyze the vast amount of data generated by observations and simulations to facilitate scientific research

3. Introduction to Data Mining

• Data mining

↪ knowledge discovery from data

↪ Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data

↪ Exploration & analysis, by automatic means, of large quantities of data in order to discover meaningful patterns

↪ Other names: Knowledge Discovery from Data (KDD), Machine learning, Knowledge extraction

• Not Data Mining

↪ ex1. Look up a phone number in a phone directory → simple searching

↪ ex2. Query a Web search engine for information about "Amazon" → simple keyword search

↪ Simple searching is not considered data mining

• Knowledge discovery process

↪ Data mining is one part of KDD (Knowledge Discovery from Data)

↪ (Data) → selection → (target data) → preprocessing → (preprocessed data) → transformation → (transformed data) → Data mining → (patterns) → interpretation / evaluation → knowledge

▸ Preprocessing : Data Cleaning (noise, error), Data Integration, Feature selection

▸ Transformation: some algorithms require a specific data format

▸ Interpretation/Evaluation: Analysis, Visualization by a human expert
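
The stages of the process above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records, the attribute names, and the functions `select`, `preprocess`, `transform`, and `mine` are all hypothetical, and the "mining" step is a plain frequency count rather than a real mining algorithm.

```python
from collections import Counter

# Hypothetical raw records: (customer_id, age, city). None marks a missing value.
raw = [
    (1, 25, "Seoul"),
    (2, None, "Busan"),
    (3, 31, "seoul "),
    (4, 47, "Busan"),
    (5, 52, "Seoul"),
]

def select(records):
    """Selection: keep only the attributes relevant to the task (age, city)."""
    return [(age, city) for _, age, city in records]

def preprocess(records):
    """Preprocessing: drop rows with missing values and clean up noisy text."""
    return [(age, city.strip().title()) for age, city in records if age is not None]

def transform(records):
    """Transformation: discretize age into the categorical format the mining step expects."""
    return [("under_40" if age < 40 else "40_plus", city) for age, city in records]

def mine(records):
    """Data mining (stand-in): count how often each (age group, city) combination occurs."""
    return Counter(records)

patterns = mine(transform(preprocess(select(raw))))

# Interpretation/evaluation is left to a human expert; here we only print the counts.
for pattern, count in patterns.most_common():
    print(pattern, count)
```

Keeping each stage as its own function mirrors the arrows in the process: the output of one stage is the input of the next.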

• Predictive data analytics

↪ One of the topics in data mining

↪ Progression from data to insights to decisions

↪ Art of building and using models that make predictions based on patterns extracted from historical data

• Beginning of Data mining

↪ Dr. Rakesh Agrawal's pioneering work in the mid-1990s: Association rules (e.g., market basket analysis), Apriori algorithm (how often sets of items occur together), Sequential patterns

• Confluence of multiple disciplines

↪ Tremendous amount of data: high-performance computing techniques are needed to handle it

↪ High dimensionality of data

↪ High complexity of data types: data streams and sensor data, time-series data, sequence data, trajectory data, social network data, graph-structured data → heterogeneous data types

• sub-fields in data mining

• Data mining and privacy

↪ Data mining should NOT violate the privacy of the data owners

↪ Privacy-preserving data mining is an important direction in data mining

• Journals/conferences in the data mining field

↪ Conferences: KDD, ICDM, SDM

↪ Journals: IEEE TKDE, DMKD, TKDD

4. Example

• Association Rule Discovery

↪ predict the occurrence of an item based on occurrences of other items

• Market Basket analysis

↪ An example of association rule discovery

↪ Goal: to identify items that are bought together by sufficiently many customers

↪ ex. diapers and beer
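
To make "bought together by sufficiently many customers" concrete, the sketch below computes support and confidence, the two standard measures for association rules, on a handful of made-up baskets. The transactions and the function names `support`, `confidence`, and `frequent_pairs` are assumptions for illustration; this is a brute-force pair count, not the Apriori algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (market baskets), for illustration only.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
    {"milk", "diapers"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def frequent_pairs(baskets, min_support):
    """Item pairs whose support is at least min_support."""
    counts = Counter(pair for b in baskets for pair in combinations(sorted(b), 2))
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

print(frequent_pairs(baskets, min_support=0.5))    # {('beer', 'diapers'): 0.6}
print(confidence({"diapers"}, {"beer"}, baskets))  # about 0.75
```

Here the rule diapers → beer holds in roughly three out of four baskets that contain diapers, which is the kind of pattern market basket analysis looks for.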

• Classification

• Direct marketing

↪ Goal: reducing the cost of mailing by targeting a set of consumers likely to buy a new product

↪ ex. Zigzag messages on KakaoTalk
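
One way to picture the targeting step is a classifier trained on past campaign responses. The sketch below uses a k-nearest-neighbor vote, chosen here only because it is short; the notes do not name a specific classifier. The customer features, the labels, and the `predict` function are hypothetical.

```python
import math
from collections import Counter

# Hypothetical labeled history: (age, purchases_last_year) -> responded to a past campaign?
history = [
    ((22, 1), False),
    ((25, 0), False),
    ((31, 6), True),
    ((35, 8), True),
    ((41, 7), True),
    ((48, 0), False),
    ((52, 1), False),
]

def predict(customer, history, k=3):
    """k-nearest-neighbor vote: label a customer like the k most similar past customers."""
    nearest = sorted(history, key=lambda row: math.dist(customer, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical new customers; only the predicted buyers receive the mailing.
prospects = {"A": (33, 7), "B": (24, 1), "C": (50, 2)}
targets = [name for name, features in prospects.items() if predict(features, history)]
print(targets)  # ['A']
```

Mailing only customer A instead of all three is where the cost reduction comes from. In practice the features would usually be scaled first, since raw age and purchase counts are not on comparable scales.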

• Clustering

↪ Grouping data to form new categories (clusters)

↪ Principle: maximizing intra-cluster (within) similarity and minimizing inter-cluster (between) similarity
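
The principle can be checked numerically by comparing two groupings of the same points. In this sketch, distance stands in for dissimilarity, so a small within-cluster distance means high intra-cluster similarity. The points, the two groupings, and the function names are assumptions for illustration.

```python
import math
from itertools import combinations

# Hypothetical 2-D points, for illustration only.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]

def avg_intra_distance(clusters):
    """Mean distance between pairs of points in the same cluster."""
    pairs = [math.dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)

def avg_inter_distance(clusters):
    """Mean distance between pairs of points in different clusters."""
    pairs = [math.dist(p, q) for a, b in combinations(clusters, 2) for p in a for q in b]
    return sum(pairs) / len(pairs)

good = [points[:3], points[3:]]     # groups nearby points together
bad = [points[0::2], points[1::2]]  # mixes the two natural groups

for name, clusters in [("good", good), ("bad", bad)]:
    intra = avg_intra_distance(clusters)
    inter = avg_inter_distance(clusters)
    print(f"{name}: intra={intra:.2f} inter={inter:.2f}")
```

The good grouping shows a small within-cluster distance and a large between-cluster distance, while the bad grouping shows the opposite trend.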

• Document Clustering

↪ Useful to automatically group retrieved documents into a list of meaningful categories

• Outlier analysis (↔ clustering)

↪ Finding data objects that do not comply with the general behavior or model of the data

↪ To find abnormal (suspicious) behavior from the data

↪ Ex. credit card fraud detection
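
A minimal sketch of the idea, assuming one customer's transaction amounts and a simple statistical rule: a value is flagged when its modified z-score (based on the median and the median absolute deviation, MAD) exceeds a threshold. The amounts, the `find_outliers` function, and the threshold of 3.5 are illustrative assumptions; real fraud detection uses far richer features and models.

```python
import statistics

# Hypothetical card transaction amounts for one customer, for illustration only.
amounts = [20, 25, 22, 30, 27, 24, 950]

def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against; this sketch flags nothing
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]

print(find_outliers(amounts))  # [950]
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value inflates the latter two and can hide itself.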
