Data Mining: Association Analysis

1. Basic Concepts

① Overview

• Motivation : finding inherent regularities in data

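↪ Which products are often purchased together?

↪ Which products are purchased right after buying a PC?

↪ Which kinds of DNA are sensitive to a new drug?

↪ Can web documents be classified automatically?

⇨ Let's look for association rules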

• Application

↪ Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

② Association Rule Mining

• Given a set of transactions, we find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

③ Definition

A. Frequent Itemset

↪ Itemset : collection of one or more items : {Milk, Bread, Diaper}

↪ k-itemset : An itemset that contains k items

↪ Support count (σ) or absolute support : Number of transactions that contain an itemset

↪ (Relative) support : Fraction of transactions that contain an itemset

↪ Frequent Itemset : An itemset whose support is greater than or equal to a minsup threshold

B. Association Rule

↪ Association Rule : An implication expression of the form X→Y, where X and Y are itemsets

↪ Support (s) : Fraction of transactions that contain both X and Y

↪ Confidence (c) : Measures how often items in Y appear in transactions that contain X
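The definitions above can be written as s(X→Y) = σ(X∪Y) / N and c(X→Y) = σ(X∪Y) / σ(X), where N is the number of transactions. A minimal sketch follows; the five transactions are assumed illustrative data, and the helper names `support_count`, `support`, and `confidence` are hypothetical.

```python
from fractions import Fraction

# Assumed toy market-basket data, for illustration only.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support: fraction of transactions containing the itemset."""
    return Fraction(support_count(itemset, transactions), len(transactions))

def confidence(X, Y, transactions):
    """c(X -> Y) = sigma(X ∪ Y) / sigma(X)."""
    return Fraction(support_count(X | Y, transactions),
                    support_count(X, transactions))

X, Y = {"Milk", "Diaper"}, {"Beer"}
print(support(X | Y, transactions))   # support of the rule X -> Y
print(confidence(X, Y, transactions)) # confidence of the rule X -> Y
```

Exact fractions are used so that the values can be compared with hand calculations without rounding.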

④ Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules such that support ≥ minsup and confidence ≥ minconf

• Brute-force approach

- List all possible association rules.

- Compute the support and confidence of each rule.

- Prune the rules that fail the minsup or minconf threshold.

⇨ Computationally very expensive!

⑤ Mining Association Rules

• Examples and observations

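↪ All of the rules above are binary partitions of the same itemset, {Milk, Diaper, Beer}.

↪ Rules originating from the same itemset have identical support but can have different confidence.

↪ The support and confidence requirements can therefore be decoupled.

↪ An infrequent itemset never appears in a valid association rule: every rule obtained by partitioning an infrequent itemset fails minsup. ⇨ This is why the support condition and the confidence condition are handled separately.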

• Steps

1. Frequent itemset generation

▸ Generating all itemsets whose support ≥ minsup

▸ Still computationally expensive

2. Rule generation

▸ Generate high-confidence rules from each frequent itemset, where each rule is a binary partition of a frequent itemset

▸ minsup was already checked in step 1, so only minconf needs to be checked in this step

2. Frequent itemset mining methods

① Introduction

• Candidate itemsets : with d unique items appearing in the transactions, there are 2^d candidate itemsets to consider

• Brute-force approach

↪ Each itemset in the lattice is a candidate frequent itemset

↪ We count the support of each candidate by scanning the database

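↪ Every candidate has to be matched against every transaction

↪ Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width ⇨ Expensive since M = 2^d !!

• Computational complexity

↪ Given d unique items, there are 2^d possible itemsets.

↪ The total number of possible association rules is R = 3^d − 2^(d+1) + 1 (for d = 6, R = 602).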
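The rule count has the known closed form R = 3^d − 2^(d+1) + 1. As a sanity check, the sketch below counts rules by brute force (every itemset of size k ≥ 2 contributes 2^k − 2 non-trivial binary partitions) and compares the total with the closed form. The function names are hypothetical.

```python
from itertools import combinations

def count_rules(d):
    """Count rules X -> Y with X, Y non-empty and disjoint over d items."""
    total = 0
    for k in range(2, d + 1):                    # size of the itemset X ∪ Y
        for _ in combinations(range(d), k):      # every k-itemset
            total += 2 ** k - 2                  # its non-trivial binary partitions
    return total

def closed_form(d):
    return 3 ** d - 2 ** (d + 1) + 1

for d in range(2, 8):
    print(d, count_rules(d), closed_form(d))
```

Even for d = 6 there are already several hundred candidate rules, which is why the brute-force approach does not scale.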

• Possible Improvements

a. Reducing the # of candidates (M)

⇨ Complete search: M=2^d

⇨ Using pruning techniques to reduce M : the Apriori algorithm

b. Reducing the number of transactions (N)

⇨ Reducing size of N as the size of itemset increases

⇨ Being used by DHP and vertical-based mining algorithms (not covered in class)

c. Reducing the number of comparisons (NM)

⇨ Using efficient data structures (e.g., hash tables) to store the candidates or transactions

⇨ No need to match every candidate against every transaction : the FP-growth algorithm

② Apriori

A method for reducing the number of candidates!

• Apriori principle or downward closure property

⇨ If an itemset is frequent, then all of its subsets must also be frequent

⇨ anti-monotone property of support : Support of an itemset never exceeds the support of its subsets

• Example

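↪ Among the 1-itemsets, Coke and Eggs fail minsup and are eliminated.

↪ 2-itemsets are built only from the remaining items (without Coke and Eggs), and any combination that fails minsup is eliminated in turn.

↪ 3-itemset combinations ... (repeat until no new frequent itemsets are found)

⇨ This reduces the number of candidates that have to be considered.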

• Algorithm Pseudo code

Let's go through it piece by piece!

↪ Lk : frequent k-itemsets

↪ Ck : candidate k-itemsets

↪ Ct : the candidates contained in transaction t

⇨ Ck = apriori-gen(L_k-1)

⇨ Lk = {c | c.count ≥ minsup}

⇨ Ct = subset(Ck,t)

• Candidate Generation (apriori-gen)

↪ 1. Self-join L_k-1 with itself

↪ 2. Pruning

↪ For example, let L3 = {abc, abd, acd, ace, bcd} be the frequent 3-itemsets, and suppose we self-join L3 to generate C4, the candidate 4-itemsets. Two 3-itemsets can be joined only if they share their first two items. Joining abc with abd gives abcd, and joining acd with ace gives acde. For abcd, all of its 3-subsets (abc, abd, acd, bcd) are in L3, so it is kept. For acde, acd and ace are in L3 but ade and cde are not, so by the anti-monotone property acde cannot be frequent and is pruned from C4. The result is C4 = {abcd}.

↪ To restate this with another example, when deriving the candidate set C3 from L2: {ABC} is excluded because {AB} is not in L2, {ABE} is excluded because {AE} is not in L2, and {ACE} is excluded because {AE} is not in L2 either. {BCE} is kept because {BC}, {BE}, and {CE} are all frequent itemsets in L2.

↪ The explanation of generating C3 from L2 applies here as well!

↪ The frequent itemsets L_k are found level by level in this way.
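The join and prune steps, and the level-wise loop around them, can be sketched as follows. This is a minimal, unoptimized illustration rather than the lecture's pseudocode: itemsets are represented as sorted tuples, support is counted by a plain scan, and the four-transaction database at the bottom is assumed example data.

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Build candidate k-itemsets C_k from the frequent (k-1)-itemsets."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    prev_set = set(prev)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            # Self-join: the first k-2 items must be identical.
            if a[:k - 2] == b[:k - 2]:
                c = tuple(sorted(set(a) | set(b)))
                # Prune: every (k-1)-subset must itself be frequent.
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            counts[(item,)] = counts.get((item,), 0) + 1
    L = {c: n for c, n in counts.items() if n >= min_count}
    frequent = dict(L)
    k = 2
    while L:
        C = apriori_gen(L.keys(), k)
        counts = {c: sum(1 for t in transactions if t.issuperset(c)) for c in C}
        L = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(L)
        k += 1
    return frequent

# The L3 -> C4 example from the text.
print(apriori_gen(["abc", "abd", "acd", "ace", "bcd"], 4))

# Assumed toy database, minimum support count 2.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(db, 2))
```

On the assumed database the frequent 2-itemsets come out as AC, BC, BE, and CE, so {BCE} is the only 3-candidate that survives pruning, in line with the L2-to-C3 reasoning above.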

③ Reducing # of comparisons

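A method for reducing the number of comparisons

• Support counting : the transaction database is scanned to determine the support of each candidate itemset. To reduce the number of comparisons, the candidates are stored in a hash structure. Instead of matching each transaction against every candidate, a transaction is matched only against the candidates in the hashed buckets it reaches.

• Generating hash tree

Suppose there are 15 candidate itemsets of length 3.

A hash function is given as in the figure, and a max leaf size, the maximum number of itemsets a leaf node may store, is chosen. When the number of candidate itemsets in a leaf exceeds the max leaf size, the node is split into child nodes.

The key idea is that level i of the hash tree looks at the i-th item of a candidate itemset and assigns the itemset to a branch according to the hash function.

For example, to insert the itemset {1 4 5}, the first level looks at the first item: 1 belongs to the bucket (1,4,7), so the itemset is placed under the leftmost branch. Whenever more than 3 itemsets land in a node, child nodes are created and the itemsets are redistributed one level down.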

• Subset operation using the hash tree

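The point of the hash tree is to make the subset operation efficient.

Given a transaction t, we want to find all candidate itemsets that are contained in t.

Given a transaction t, its possible subsets of size 3 are enumerated as follows.

Matching transaction t against the candidates through the hash tree can then be drawn as follows.

The yellow itemsets are the candidates that were placed in the tree by the hash function, and the mint ones are itemsets derived from the transaction. Let's use the hash tree to count support.

As in the subset operation, fix a prefix and route the possible combinations through the hash function. The nodes that the pink and blue arrows point to are the ones where matching candidates can reside.

The itemsets generated from the transaction are compared with the candidates in the visited leaves, and the support count of each matching candidate is incremented. Leaf nodes that are never visited cannot contain any subset of the transaction.

Here {4 5 8} is not contained in the transaction, so its support count is not incremented, but it sits in a visited leaf and therefore still counts as one of the candidates compared against the transaction!

In the end, only 5 of the 9 leaf nodes are visited, and only 9 of the 15 candidates are compared against the transaction.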
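The hash tree described above can be sketched in a few dozen lines. This is a simplified illustration, not a reconstruction of the figure: the hash function `item % 3` (buckets 1,4,7 / 2,5,8 / 3,6,9), the max leaf size of 3, the 15 candidates, and the transaction {1, 2, 3, 5, 6} are assumptions modeled on the example in the text, and the exact tree shape may differ from the original figure.

```python
MAX_LEAF = 3

def h(item):
    """Assumed hash function: 1,4,7 -> 1; 2,5,8 -> 2; 3,6,9 -> 0."""
    return item % 3

class Node:
    def __init__(self, depth):
        self.depth = depth
        self.children = None   # dict bucket -> Node once the node is split
        self.itemsets = []     # candidates stored while the node is a leaf

    def insert(self, itemset, k):
        if self.children is None:
            self.itemsets.append(itemset)
            # Split an overfull leaf, unless all k items were already hashed.
            if len(self.itemsets) > MAX_LEAF and self.depth < k:
                old, self.itemsets, self.children = self.itemsets, [], {}
                for s in old:
                    self._route(s, k)
        else:
            self._route(itemset, k)

    def _route(self, itemset, k):
        # Level i hashes on the i-th item of the candidate.
        bucket = h(itemset[self.depth])
        child = self.children.setdefault(bucket, Node(self.depth + 1))
        child.insert(itemset, k)

def visit(node, t, start, leaves):
    """Collect every leaf that could hold a subset of the sorted transaction t."""
    if node.children is None:
        leaves.add(node)
        return
    for i in range(start, len(t)):
        child = node.children.get(h(t[i]))
        if child is not None:
            visit(child, t, i + 1, leaves)

def count_supports(candidates, transactions, k):
    root = Node(0)
    for c in candidates:
        root.insert(tuple(sorted(c)), k)
    counts = {tuple(sorted(c)): 0 for c in candidates}
    for t in transactions:
        t = tuple(sorted(t))
        leaves = set()
        visit(root, t, 0, leaves)
        for leaf in leaves:            # compare only within visited leaves
            for c in leaf.itemsets:
                if set(c) <= set(t):
                    counts[c] += 1
    return counts

CANDIDATES = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
              (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
              (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
counts = count_supports(CANDIDATES, [(1, 2, 3, 5, 6)], 3)
print({c for c, n in counts.items() if n > 0})
```

Visited leaves are collected in a set first so that a leaf reached through several prefixes is compared against the transaction only once.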

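• Factors affecting the complexity of Apriori

a. Choice of minimum support threshold : a lower minsup yields more frequent itemsets, which can increase both the maximum length of the frequent itemsets and the number of candidate itemsets.

b. Dimensionality (number of items) of the data set : More space is needed to store the support count of each item. As the number of items grows, the computational cost grows as well.

c. Size of database : Apriori makes multiple passes over the database, so its run time grows with the number of transactions.

d. Average transaction width : Transaction width increases with denser data sets. This can increase the maximum size of the frequent itemsets.

⇨ In short: the minimum support threshold, the dimensionality of the data set (number of items), the size of the database, and the average transaction width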

③ FP-growth

• Motivation

> Scanning the database multiple times is costly. Mining long patterns also needs many scans and generates a large number of candidates. (# of whole database scans is proportional to the length of the longest pattern)

Bottleneck: candidate-generation-and-test

> Can the cost of candidate generation be avoided? ⇨ This is the motivation for the FP-growth algorithm

• Frequent-Pattern Mining without Candidate Generation

> Heuristic : let P be a frequent itemset, S the set of transactions that contain P, and x an item. If x is frequent in S, then {x}∪P must also be a frequent itemset.

> The FP-growth algorithm needs no candidate generation ⭐

> The FP-tree is a compact data structure that stores the information needed for frequent pattern mining.

> A recursive mining algorithm is used to enumerate the complete set of frequent patterns.

• Intuition

Let the universe be the whole data set, P a frequent itemset, and S the set of transactions that contain P. Let x be an individual item. If x is frequent within S, then P+x is frequent in the whole database. To check whether P+x is frequent there is no need to look at the whole database; looking at S alone is enough.

• FP-Tree

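To run the FP-growth algorithm, an FP-tree has to be built first.

1) Scan the database once to find the frequent 1-itemsets (single item patterns).

a. Discard the infrequent items.

2) Sort the frequent items in frequency-descending order to obtain the f-list.

3) Scan the DB once more to build the FP-tree. Each transaction's frequent items, ordered by the f-list, are inserted as a path from the root, with each item added as a child of the item before it.

Using the ordered frequent items above, the FP-tree is built transaction by transaction as follows.

Looking closely at the scan of TID = 200: when an item already exists as a node on the current path, its count is incremented by one, and when no such node exists, a new path is added.

The final FP-tree is as follows.

Be careful not to confuse an item's total count in the header table with the frequency stored on an individual node of the FP-tree!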

• FP-Tree Construction

As shown, building an FP-tree takes exactly two scans of the database: one to find the frequent 1-itemsets, and one to insert each transaction into the tree!

The subsequent mining works on the FP-tree alone (very compact, not using the database).

• How to mine an FP-Tree

(1) forming conditional pattern bases

For a frequent item α, follow the prefix paths of the FP-tree that lead to α and record each path together with its support.

For example, take α = {m}. {m} is frequent, and its conditional pattern base is computed as follows. The paths in the FP-tree that contain m are <f,c,a,m> and <f,c,a,b,m>, with supports 2 and 1 respectively (the frequency of the m node on each path is the support of that path). The conditional pattern base of m is therefore {fca : 2, fcab : 1}.

(2) constructing conditional FP-trees

The conditional FP-tree is built from the conditional pattern base above (α = {m}). Here b has a frequency of 1, which fails minsup, so it is dropped.

(3) recursively mining conditional FP-trees

Steps (1) and (2) are repeated recursively to find the frequent itemsets.
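The two-scan construction and the extraction of a conditional pattern base can be sketched as below. This is a minimal illustration under stated assumptions: the five transactions are assumed data in the style of the f, c, a, b, m, p example, the names `FPNode`, `build_fp_tree`, and `prefix_paths` are hypothetical, and ties in the f-list are broken alphabetically, so paths read c, f, a where the text writes f, c, a (any fixed order is valid).

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Scan 1: count single items and keep only the frequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: n for i, n in freq.items() if n >= min_count}
    # f-list: frequency-descending order (ties broken alphabetically here).
    f_list = sorted(freq, key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(f_list)}

    root = FPNode(None, None)
    header = {item: [] for item in f_list}    # item -> its nodes in the tree
    # Scan 2: insert each transaction's frequent items in f-list order.
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:                 # no such node yet: new branch
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1                  # existing node: bump its count
            node = child
    return root, header, f_list

def prefix_paths(item, header):
    """Conditional pattern base of `item` as (prefix path, count) pairs."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:
            path.append(p.item)
            p = p.parent
        base.append((path[::-1], node.count))
    return base

# Assumed example transactions; minimum support count 3.
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header, f_list = build_fp_tree(db, 3)
print(f_list)
print(prefix_paths("m", header))
```

The header table keeps one list of nodes per item, which is what makes it cheap to collect all prefix paths of an item without rescanning the database.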

• Pattern Growth

⇨ Let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B. Then α ∪ β is frequent in DB if and only if β is frequent in B.

⇨ Process of mining frequent patterns can be viewed as first mining frequent 1-itemsets and then progressively growing each such itemset by mining its conditional pattern base, which can in turn be done similarly

⇨ We successfully transform a frequent k-itemset mining problem into a sequence of k frequent 1-itemset mining problems via a set of conditional pattern bases

⇨ Every combination of {m} with the items of {f,c,a} is also a frequent itemset : {mf}, {mc}, {ma}, {mfc}, {mfa}, {mca}, {mfca}
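The pattern-growth recursion can be sketched directly on conditional pattern bases. Note the simplification: this version skips the compressed tree and works on lists of (ordered items, count) pairs, so it illustrates the recursion rather than a full FP-growth implementation. The function name `pattern_growth` is hypothetical, and the input is the conditional pattern base of m given in the text with an assumed minimum support count of 3.

```python
from collections import Counter

def pattern_growth(base, min_count, suffix=()):
    """Mine a (conditional) pattern base: [(ordered items, count), ...].

    Yields (pattern, support) for every frequent pattern extending `suffix`.
    """
    # Count each item inside this base only; no full-database scan is needed.
    freq = Counter()
    for items, n in base:
        for item in items:
            freq[item] += n
    for item, support in freq.items():
        if support < min_count:
            continue                      # infrequent here, so never grown
        pattern = (item,) + suffix
        yield pattern, support
        # Conditional pattern base of `item`: the prefix before it on each path.
        cond = [(items[:items.index(item)], n)
                for items, n in base if item in items]
        yield from pattern_growth(cond, min_count, pattern)

# Conditional pattern base of m from the text: <f,c,a>:2 and <f,c,a,b>:1.
m_base = [(("f", "c", "a"), 2), (("f", "c", "a", "b"), 1)]
patterns = dict(pattern_growth(m_base, 3, suffix=("m",)))
for p, s in sorted(patterns.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print("".join(p), s)
```

Each frequent pattern is produced exactly once because an item is only ever extended with items that precede it in the fixed ordering.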

• Single path Tree

- Suppose an FP-tree T has a single path P

- The complete set of the frequent patterns of T can be generated by the enumeration of all the combinations of the subpaths of P

- The support is the minimum support of the items contained in the subpath

• FP-growth algorithm

• FP-growth algorithm vs Apriori

→ A lower support threshold produces more frequent itemsets, so when the threshold is very low an algorithm may fail to finish within the given time.

→ FP-Growth : Divide and conquer (only the conditional pattern bases need to be examined), No candidate generation, no candidate test, compressed database, No repeated scan of the entire database, Cheap operations

④ Maximal, Closed frequent itemsets

• Definitions

▸ Closed : An itemset X is closed if X is frequent and there exists no superset Y⊃X with the same support as X : every proper superset has strictly lower support

▸ Maximal : An itemset X is maximal if X is frequent and there exists no frequent superset Y⊃X : the support values of the supersets do not matter, only that none of them is frequent

• Closed itemsets are a subset of the frequent itemsets: a closed itemset is a frequent itemset that has no proper superset with the same support.

• A maximal itemset is a frequent itemset that cannot be extended any further without becoming infrequent.

• Closed and maximal itemsets are both special cases of frequent itemsets, and every maximal itemset is also closed.

• superset : a superset of an itemset is a larger itemset that contains it. For example, given the itemset {milk, eggs}, {milk, eggs, bread} is a superset of {milk, eggs}.
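Given the full table of frequent itemsets and their support counts, both definitions reduce to a check over proper supersets. A minimal sketch follows; the support table is assumed illustrative data and `closed_and_maximal` is a hypothetical helper name.

```python
def closed_and_maximal(frequent):
    """frequent: {frozenset: support count} holding ALL frequent itemsets."""
    closed, maximal = set(), set()
    for x, sup in frequent.items():
        supersets = [y for y in frequent if x < y]   # proper frequent supersets
        if not supersets:
            maximal.add(x)                 # no frequent superset at all
        # A superset with the same support as a frequent itemset is itself
        # frequent, so checking the frequent supersets is sufficient.
        if all(frequent[y] < sup for y in supersets):
            closed.add(x)
    return closed, maximal

fs = frozenset
# Assumed support counts (from transactions abc, abc, ab, a; min count 2).
frequent = {
    fs("a"): 4, fs("b"): 3, fs("c"): 2,
    fs("ab"): 3, fs("ac"): 2, fs("bc"): 2,
    fs("abc"): 2,
}
closed, maximal = closed_and_maximal(frequent)
print(sorted("".join(sorted(x)) for x in closed))
print(sorted("".join(sorted(x)) for x in maximal))
```

In this example {b} is not closed because {a, b} has the same support, while {a, b, c} is both closed and maximal.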

3. Rule Generation

So far we have covered the algorithms that generate frequent itemsets (Apriori, FP-Growth). Now let's look at how association rules are built from those frequent itemsets.

• Rule generation

⇨ L : a frequent itemset

⇨ k : the number of items in L

⇨ 2^k - 2 : the number of candidate association rules (L → ∅ and ∅ → L are excluded)

• Generating rules efficiently

⇨ Unlike support, confidence is not anti-monotone in general. The confidence of (ABC → D) can be larger or smaller than the confidence of (AB → D).

⇨ However, the confidence of rules generated from the same itemset is anti-monotone with respect to the number of items on the right-hand side (RHS). For example, for the frequent itemset L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD). All of these rules contain the full itemset ABCD.

• Rule Generation for Apriori Algorithm

⇨ Because the anti-monotone property holds among rules from the same itemset as the RHS grows, if a rule near the top of the lattice in the figure has low confidence, then the rules below it (rules from the same set {ABCD} whose consequent contains its consequent) also have low confidence and can be pruned.

⇨ A new candidate rule is generated by merging two rules whose consequents (the part after the arrow) share the same prefix. For example, (CD → AB) and (BD → AC) share the prefix A in their consequents, so merging them gives the candidate rule D → ABC. By the anti-monotone property of confidence on the RHS, the candidate D → ABC is pruned if its subset rule AD → BC does not have high confidence.
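The level-wise idea can be sketched as follows: start from 1-item consequents, keep the rules that reach minconf, and grow the consequent only from survivors. This is a simplified illustration, not the lecture's exact procedure; the support counts are assumed toy values and `generate_rules` is a hypothetical name.

```python
from itertools import combinations

def generate_rules(itemset, support, min_conf):
    """Rules X -> Y from one frequent itemset, with confidence-based pruning.

    support: {frozenset: support count} for the itemset and all its subsets.
    """
    itemset = frozenset(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]   # 1-item consequents
    while consequents:
        survivors = []
        for Y in consequents:
            X = itemset - Y
            if not X:
                continue                               # skip "∅ -> itemset"
            conf = support[itemset] / support[X]
            if conf >= min_conf:
                rules.append((X, Y, conf))
                survivors.append(Y)
        # Merge surviving consequents into consequents one item larger, and
        # keep a merged consequent only if all of its sub-consequents survived
        # (anti-monotone confidence within one itemset).
        size = len(consequents[0]) + 1
        alive = set(survivors)
        merged = {a | b for a, b in combinations(survivors, 2)
                  if len(a | b) == size}
        consequents = [Y for Y in merged
                       if all(frozenset(s) in alive
                              for s in combinations(Y, size - 1))]
    return rules

fs = frozenset
# Assumed support counts over five toy transactions.
support = {
    fs({"Milk"}): 4, fs({"Diaper"}): 4, fs({"Beer"}): 3,
    fs({"Milk", "Diaper"}): 3, fs({"Milk", "Beer"}): 2,
    fs({"Diaper", "Beer"}): 3, fs({"Milk", "Diaper", "Beer"}): 2,
}
rules = generate_rules({"Milk", "Diaper", "Beer"}, support, min_conf=0.6)
for X, Y, conf in rules:
    print(sorted(X), "->", sorted(Y), round(conf, 2))
```

With minconf = 0.6, the rules Milk → {Diaper, Beer} and Diaper → {Milk, Beer} are evaluated and rejected at confidence 0.5, so no consequent of size 3 is ever generated.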

4. Interestingness measures

① Applications of interestingness measures

• interestingness measures

| A measure that evaluates how interesting, or how useful, the extracted information is.

② Lift

• Lift = confidence / probability of consequent

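| For a rule A→B, the ratio of how often B is purchased when A is purchased to how often B is purchased overall

For the rule A → 0, support = 3/7 (the fraction of all records in which A and 0 occur together) and confidence = 3/4 (the fraction of records containing A in which 0 also occurs). Support and confidence for the rule B → 1 are computed in the same way.

Lift is confidence divided by the probability of the consequent. For Rule 1 the probability of the consequent is the fraction of all records in which 0 occurs, 4/7, so its lift is (3/4) / (4/7) ≈ 1.31. For Rule 2 it is the fraction of all records in which 1 occurs, 3/7.

• A lift greater than 1 indicates that the antecedent and the consequent are positively correlated (dependent); a lift of 1 indicates independence, and a lift below 1 indicates negative correlation.

• Rule 1 (A→0) has higher confidence than Rule 2 (B→1), but lower lift.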

• Interestingness measure : Correlation (Lift)

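For the rule (plays basketball → eats cereal), the computed (support, confidence) of (40%, 66.7%) can be misleading. (high confidence is misleading) The overall share of students who eat cereal is already 75%, so the percentage of cereal eaters is high even without the condition of playing basketball.

For the rule (plays basketball → does not eat cereal), the computed (support, confidence) of (20%, 33.3%) is actually more accurate, even though the values are lower.

Computing lift gives 0.89 for the first rule (below 1, so playing basketball and eating cereal are negatively correlated) and 1.33 for the second rule (above 1, so students who play basketball are more likely not to eat cereal than students overall, a positive correlation).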
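The two lift values can be reproduced with a few lines. The absolute counts below (5000 students in total) are an assumption chosen only to match the stated percentages: 40% support, 66.7% confidence, and 75% cereal eaters. The helper name `lift` is hypothetical.

```python
def lift(n_both, n_antecedent, n_consequent, n_total):
    """lift(A -> B) = confidence(A -> B) / P(B)."""
    confidence = n_both / n_antecedent
    return confidence / (n_consequent / n_total)

# Assumed counts consistent with the percentages in the text.
N = 5000
basketball = 3000               # 60% play basketball
cereal = 3750                   # 75% eat cereal
both = 2000                     # 40% do both -> confidence 2000/3000 = 66.7%

lift_cereal = lift(both, basketball, cereal, N)
lift_no_cereal = lift(basketball - both, basketball, N - cereal, N)
print(round(lift_cereal, 2), round(lift_no_cereal, 2))
```

The first value falls below 1 and the second above 1, which is the opposite of what the raw confidence values suggest.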

• Good measures for correlation?

↪ If the rule (buys walnuts → buys milk) has (support, confidence) = (1%, 80%) but 85% of all customers buy milk, the result can be misleading.

↪ Support and confidence are not good measures for interpreting the correlation between antecedent and consequent. More than 20 interestingness measures have been proposed as alternatives.

③ Interestingness measures

• Null-invariant measures

↪ Among the many measures, let's look at those that have the null-invariance property.

• Comparison of interestingness measures

↪ Null-invariance is a very important property in correlation analysis. Lift and χ² are not null-invariant.

↪ null invariance : not dependent on the null transaction

↪ In the example above, the cell for no milk & no coffee is the number of null transactions. Chi-square and lift are affected by null transactions, so their results differ sharply between the data sets. AllConf, in contrast, has the same value for Data set 1 and Data set 2. A measure like this is not affected by null transactions and is therefore null-invariant.

↪ D1 and D2 : positively correlated (buying both together is more common than buying only one of them)

↪ D3 : negatively correlated (buying both together is far less common than buying only one of them)

↪ D4,D5,D6 : neutral (no particular pattern is visible)
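Null-invariance is easy to check numerically: change only the number of null transactions and see which measures move. In the sketch below the contingency counts are assumptions in the spirit of the milk/coffee comparison (the original table is not preserved here), and the measure definitions are the usual ones for all-confidence, max-confidence, Kulczynski, and cosine.

```python
from math import sqrt

def measures(n_ab, n_a_only, n_b_only, n_null):
    """n_ab: both items; n_a_only / n_b_only: exactly one; n_null: neither."""
    n = n_ab + n_a_only + n_b_only + n_null
    sup_a, sup_b = n_ab + n_a_only, n_ab + n_b_only
    p_b_given_a, p_a_given_b = n_ab / sup_a, n_ab / sup_b
    return {
        "lift": (n_ab / n) / ((sup_a / n) * (sup_b / n)),   # uses n
        "all_conf": n_ab / max(sup_a, sup_b),
        "max_conf": max(p_a_given_b, p_b_given_a),
        "kulc": (p_a_given_b + p_b_given_a) / 2,
        "cosine": n_ab / sqrt(sup_a * sup_b),
    }

# Two assumed data sets that differ only in the number of null transactions.
d1 = measures(10000, 1000, 1000, 100000)
d2 = measures(10000, 1000, 1000, 100)
for name in d1:
    print(f"{name:9s} {d1[name]:8.3f} {d2[name]:8.3f}")
```

Only lift depends on the total number of transactions, so it is the only value in the table that changes between the two columns.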

• Example : analysis of DBLP Coauthor relationships

This example uses a data set of paper publications in computer science.

Looking at several measures makes it possible to interpret different kinds of relationships.

↪ Kulc : in the top three pairs, sup(ab) equals sup(b), and sup(b) is far smaller than sup(a). This can be interpreted as an advisor (professor) - advisee (graduate student) relationship, which is a skewed relationship.

↪ Coherence : in the top three pairs, sup(b) is higher than sup(ab). This suggests that author b is an independent researcher (with many papers published on their own, without the professor's help).

↪ Kulc tends to give more credit to skewed patterns, coherence prefers balanced patterns, and cosine lies in between

• Are null-invariant measures always better?

It is hard to say so definitively.

↪ IR (imbalance ratio) : a measure of the imbalance between A and B
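The imbalance ratio is commonly defined as IR(A,B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A∪B)), and it is usually read together with Kulczynski. The sketch below uses two assumed sets of support counts that have the same Kulc value but very different balance; the function names are hypothetical.

```python
def kulczynski(sup_a, sup_b, sup_ab):
    """Average of the two conditional probabilities P(A|B) and P(B|A)."""
    return (sup_ab / sup_a + sup_ab / sup_b) / 2

def imbalance_ratio(sup_a, sup_b, sup_ab):
    """0 for perfectly balanced supports; approaches 1 as they diverge."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

# Assumed support counts: (sup(A), sup(B), sup(A∪B)).
balanced = (2000, 2000, 1000)
skewed = (11000, 1100, 1000)

for name, (a, b, ab) in (("balanced", balanced), ("skewed", skewed)):
    print(name, round(kulczynski(a, b, ab), 2), round(imbalance_ratio(a, b, ab), 2))
```

Both cases give Kulc = 0.5, which on its own looks neutral; IR separates the balanced case from the skewed one, which is why a null-invariant measure alone does not settle the question.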

⭐ Programming

5. Advanced association analysis

Pattern mining roadmap

① Rare or Negative

• Infrequent or rare patterns

ex. Buying a Rolex watch → occurs rarely, but is an important event for revenue

• Negative patterns

ex. Buying an SUV and buying a Toyota show a negatively correlated pattern → products can be negatively related to each other.

② Abstraction levels

• Multi-level association rules

↪ Items at the lower level are expected to have lower support

③ Flexible support and redundancy filtering

• Flexible min-support thresholds

Some item groups occur rarely but can be more valuable. In such cases it is better to use non-uniform, group-based min-support.

ex. {diamond, watch, camera} = 0.05%, {bread, milk} = 5%

↪ Expensive item groups tend to occur less often, so a lower support threshold suits them; cheaper item groups tend to occur more often, so a higher support threshold may be appropriate.

• Redundancy Filtering : some rules may be redundant because of their ancestor rules.

The rule for 2% milk and bread can be redundant with respect to the rule for milk and bread. If a rule's support is close to the expected support derived from its ancestor rule (above, 8% / 4 = 2%, an exact match), the rule is considered redundant. The 2% milk rule is therefore a redundant rule.

④ # of dimensions

• Multi-dimensional association rules

An association rule can involve two or more dimensions.

An example of a single-dimensional rule is the following. The rule below goes from buys to buys.

⑤ Types of values

• Quantitative association rules

Association rules between attributes with numeric values and attributes with categorical values

⑥ Constraints or Criteria

• Constraint-based

↪ Those satisfying a set of user-defined constraints

• Approximate, compressed, near-match

↪ Those that tally the support count of the near or almost matching itemsets

• Top-k

↪ The k most frequent itemsets for a user-specified value k

• Redundancy-aware top-k

↪ The top-k patterns with similar or redundant patterns excluded

⑦ Constraints or Criteria : Constraints in data mining

• Knowledge type constraint : Classification, association

• Data constraint - using SQL like queries : Finding product pairs sold together in stores in Chicago this year → Region area constraints (Chicago), temporal constraints (this year)

• Dimension/level constraint: In relevance to region, price, brand, customer category

• Rule (or pattern) constraint : Small sales (price < $10) trigger big sales (sum > $200)

• Interestingness constraint : Strong rules: min_support ≥ 3%, min_confidence ≥ 60%

⑧ Constraints or Criteria : Compressed Patterns

• we do not keep all frequent patterns

• When compressing with closed frequent patterns, P1, P2, P3, P4, and P5 are all closed, so there is nothing to compress.

• When compressing with maximal frequent patterns, only P3 is maximal, so P3 alone is kept out of the five patterns. In this case P1, P2, P4, and P5 are all discarded, so information is lost.

• A better compression method is to compute the distance between patterns and cluster them.

⑨ Constraints or Criteria : Redundancy-Aware Top-k patterns

• High significance & low redundancy : (b), which applies the redundancy-aware top-k approach, satisfies both conditions.

• (c) picks the patterns with the highest significance (for example, lift), and (d) picks one pattern per cluster so that information is not duplicated. (b) is a balanced mix of the two. Looking at the three patterns it selects: it takes two of the most significant patterns that lie far apart, and then, among the patterns at the next lift level (dark gray), the circle on the right that is farthest away.

⑩ Kinds of Data and Features

• Frequent itemset mining

• Sequential patterns : Frequent sequences of ordered events (ex. buying a PC first, then a digital camera, and then a memory card)

• Structural patterns : Frequent substructures (ex. chemical structures)

• Application-Domain specific

↪ The patterns that can be discovered differ from domain to domain.

• Data analysis Usage

↪ Frequent pattern mining can be used as an intermediate step in data analysis.
