๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ
1๏ธโƒฃ AI•DS/๐Ÿ“• ๋จธ์‹ ๋Ÿฌ๋‹

๋ฐ์ดํ„ฐ๋งˆ์ด๋‹ Association analysis

by isdawell 2023. 3. 29.
728x90

 

1. Basic Concepts 


 

โ‘   Overview 

 

•  Motivation : finding inherent regularities in data

   โ†ช ํ•จ๊ป˜ ๊ตฌ๋งค๋˜๋Š” ์ƒํ’ˆ์ด ๋ฌด์—‡์ด ์žˆ์„๊นŒ 

   โ†ช PC ๋ฅผ ์‚ฌ๊ณ ๋‚œ ์งํ›„์— ๊ตฌ๋งคํ•˜๋Š” ์ƒํ’ˆ์€ ๋ญ๊ฐ€ ์žˆ์„๊นŒ

   โ†ช ์ƒˆ๋กœ์šด ์•ฝ์— ๋ฏผ๊ฐํ•œ DNA๋Š” ์–ด๋–ค ์ข…๋ฅ˜๊ฐ€ ์žˆ์„๊นŒ 

   โ†ช ์›น๋ฌธ์„œ๋ฅผ ์ž๋™์œผ๋กœ ๋ถ„๋ฅ˜ํ•  ์ˆ˜ ์žˆ์„๊นŒ 

 

โ‡จ ์—ฐ๊ด€๋œ ๊ทœ์น™์„ ์ฐพ์•„๋ณด์ž

 

•  Application 

   โ†ช Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

 

 

 

โ‘ก  Association Rule Mining

 

•  Given a set of transactions, we find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

 

 

 

โ‘ข  Definition 

 

A.  Frequent Itemset 

 

   โ†ช Itemset : collection of one or more items : {Milk, Bread, Diaper}

   โ†ช k-itemset : An itemset that contains k items

   โ†ช Support count (σ) or absolute support

 

   โ†ช (Relative) support : Fraction of transactions that contain an itemset

 

   โ†ช Frequent Itemset : An itemset whose support is greater than or equal to a minsup threshold

 

 

 

B. Association Rule 

 

   โ†ช Association Rule : An implication expression of the form X→Y, where X and Y are itemsets

 

   โ†ช Support (s) : Fraction of transactions that contain both X and Y

   โ†ช Confidence (c) : Measures how often items in Y appear in transactions that contain X

 

 

โ‘ฃ  Association Rule Mining Task

 

•  Given a set of transactions T, the goal of association rule mining is to find all rules such that

•  Brute-force approach

   - ๋ชจ๋“  ์—ฐ๊ด€๊ทœ์น™์„ ๋‚˜์—ดํ•œ๋‹ค 

   - ๊ฐ ๊ทœ์น™์— ๋Œ€ํ•ด support ์™€ confidence ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค. 

   - minsup ๊ณผ minconf ๋ฅผ ๋„˜์ง€ ์•Š๋Š” ๊ทœ์น™์€ ์ œ๊ฑฐํ•œ๋‹ค. (prune) 

     โ‡จ ๊ณ„์‚ฐ๋Ÿ‰์ด ๋งค์šฐ ๋งŽ๋‹ค! 

 

 

 

 

โ‘ค  Mining Association Rules

 

•  Examples and observations

 

 

โ†ช  ์œ„์˜ ๋ชจ๋“  ๊ทœ์น™์€ {Milk, Diaper, Beer} ์˜ binary partition ์˜ ์กฐํ•ฉ์ด๋‹ค. 

โ†ช  ๊ฐ™์€ itemset ์œผ๋กœ๋ถ€ํ„ฐ ๋ฐœ์ƒ๋œ ๊ทœ์น™๋“ค์€ ๋™์ผํ•œ support ๊ฐ’์„ ๊ฐ€์ง€๊ณ  ์žˆ์ง€๋งŒ, confidence ๊ฐ’์€ ๋‹ค๋ฅด๋‹ค. 

โ†ช  ๋”ฐ๋ผ์„œ support ์™€ confidence ์˜ ์š”๊ตฌ์‚ฌํ•ญ์„ ๋ถ„๋ฆฌํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

โ†ช  ๋นˆ๋ฒˆํ•˜์ง€ ์•Š์€ item set ์€ ์—ฐ๊ด€๊ทœ์น™์—์„œ ์ฐพ์•„๋ณผ ์ˆ˜ ์—†๋‹ค. ๋นˆ๋ฐœํ•˜์ง€ ์•Š์€ itemset ์˜ partition ์— ํ•ด๋‹นํ•˜๋Š” ๊ทœ์น™์€ minsup ์„ ๋งŒ์กฑ์‹œํ‚ฌ ์ˆ˜ ์—†๋‹ค โ‡จ support ์™€ confidence ์˜ ์กฐ๊ฑด์„ ๋ถ„๋ฆฌ์‹œ์ผœ์„œ ๋ด์•ผ ํ•œ๋‹ค. 

 

 

•  Steps 

 

   1. Frequent itenset generation ๋นˆ๋ฐœ์ง‘ํ•ฉํ•ญ๋ชฉ ์ƒ์„ฑ 

       โ–ธ Generating all itemsets whose support ≥ minsup 

       โ–ธ ์—ฌ์ „ํžˆ ๊ณ„์‚ฐ์ ์œผ๋กœ ๋น„์šฉ์ด ๋†’๊ธด ํ•จ 

 

   2. Rule generation 

       โ–ธ ๊ฐ ๋นˆ๋ฐœํ•ญ๋ชฉ ์ง‘ํ•ฉ์—์„œ high confidence rule ์ƒ์„ฑ, ์—ฌ๊ธฐ์„œ ๊ฐ ๊ทœ์น™์€ ๋นˆ๋ฐœ ํ•ญ๋ชฉ์ง‘ํ•ฉ์˜ ์ด์ง„๋ถ„ํ• 

       โ–ธ 1๋ฒˆ์—์„œ minsup ์€ ํ™•์ธํ–ˆ์œผ๋‹ˆ, ์ด ๋‹จ๊ณ„์—์„œ๋Š” minconf ๋งŒ ํ™•์ธํ•˜๋ฉด ๋จ 

 

 

 

 

 

2. Frequent itemset mining methods 


 

โ‘  Introduce 

 

•  Candidate Itemsets : 2^d ๋งŒํผ์˜ ํ›„๋ณด itemset ๋“ค์„ ๊ณ ๋ คํ•ด๋ณผ ์ˆ˜ ์žˆ์Œ, d๋Š” transaction ์— ๋“ฑ์žฅํ•œ ๊ณ ์œ ํ•œ ์•„์ดํ…œ ์ˆ˜๋ฅผ ์˜๋ฏธํ•จ 

 

 

 

 

 

•  Brute-force approach 

โ†ช Each itemset in the lattice is a candidate frequent itemset

โ†ช We count the support of each candidate by scanning the database

 

โ†ช ๋ชจ๋“  ํ›„๋ณด์™€ ๊ฐ transaction ๋“ค์„ ๋งค์น˜ํ•ด๋ณด์•„์•ผ ํ•จ 

โ†ช Complexity ~ O(NMw) โ‡จ  Expensive since M = 2^d !! ๊ณ„์‚ฐ๋ณต์žก๋„๊ฐ€ ๋งค์šฐ ํผ 

 

 

•  Computational complexity 

โ†ช d unique items ์ด ์žˆ์„ ๋•Œ, ๊ฐ€๋Šฅํ•œ ์ด itemset ์€ 2^d ๊ฐœ์ด๋‹ค. 

โ†ช ์ด๋•Œ ๊ฐ€๋Šฅํ•œ ์ด association rule ๋“ค์€ R ๊ฐœ๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค. 

 

 

 

 

•  Possible Improvements 

 

   a. Reducing the # of candidates (M) 

       โ‡จ Complete search: M=2^d

       โ‡จ Using pruning techniques to reduce M : apriori ์•Œ๊ณ ๋ฆฌ์ฆ˜ 

 

   b. Reducing the number of transactions (N) 

       โ‡จ Reducing size of N as the size of itemset increases

       โ‡จ Being used by DHP and vertical-based mining algorithms (์ˆ˜์—…์—์„œ๋Š” ๋‹ค๋ฃจ์ง€ ์•Š์Œ) 

 

   c. Reducing the number of comparisons (NM)

       โ‡จ Using efficient data structures (e.g., hash tables) to store the candidates or transactions

       โ‡จ No need to match every candidate against every transaction : FP-growth ์•Œ๊ณ ๋ฆฌ์ฆ˜ 

 

 

 

โ‘ก Apriori 

 

Candidate ๊ฐœ์ˆ˜๋ฅผ ์ค„์ด๋Š” ๋ฐฉ๋ฒ•! 

 

 

•  Apriori principle or downward closure property 

 

   โ‡จ If an itemset is frequent, then all of its subsets must also be frequent

       : ํ•ญ๋ชฉ์ง‘ํ•ฉ์ด ๋นˆ๋ฐœํ•˜๊ฒŒ ์žˆ๋Š” ๊ฒฝ์šฐ, ํ•ญ๋ชฉ์ง‘ํ•ฉ์˜ ๋ชจ๋“  ๋ถ€๋ถ„์ง‘ํ•ฉ๋„ ๋นˆ๋ฐœํ•ด์•ผ ํ•œ๋‹ค. 

   โ‡จ anti-monotone property of support : Support of an itemset never exceeds the support of its subsets

 

 

 

 

 

 

•  Example 

 

 

โ†ช 1-itemset ์—์„œ minsup ์„ ๋งŒ์กฑํ•˜์ง€ ์•Š๋Š” Coke ์™€ Eggs ๋Š” ์ œ์™ธํ•œ๋‹ค. 

โ†ช coke ์™€ eggs ๋ฅผ ์ œ์™ธํ•˜์—ฌ ์กฐํ•ฉํ•ด ๋งŒ๋“  2-itemset ์—์„œ ์—ญ์‹œ minsup ์„ ๋งŒ์กฑํ•˜์ง€ ์•Š๋Š” ์กฐํ•ฉ์€ ์ œ์™ธํ•œ๋‹ค. 

โ†ช 3-itemset ์กฐํ•ฉ ... (์ƒˆ๋กœ์šด ๋นˆ๋ฐœ์ง‘ํ•ฉ์„ ์ฐพ์„ ๋•Œ๊นŒ์ง€ ๋ฐ˜๋ณต) 

 

โ‡จ ์ด๋กœ์จ ๊ฐ€๋Šฅํ•œ ํ›„๋ณด๋“ค์„ ์ค„์ผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

•  Algorithm Pseudo code 

 

์ „์ฒด์ฝ”๋“œ 

 

์‰ฌ์šด์„ค๋ช…๋ธ”๋กœ๊ทธ

 

 

๋ถ€๋ถ„์ ์œผ๋กœ ์‚ดํŽด๋ณด์ž !

 

 

โ†ช  Lk : frequent k-itemsets 

โ†ช  Ck : candidate k-itemsets 

โ†ช  Ct : transaction t ์—์„œ์˜ candidate 

      โ‡จ  Ck = apriori-gen(L_k-1) 

      โ‡จ  Lk = {c | c.count minsup} 

      โ‡จ  Ct = subset(Ck,t) 

 

 

 

 

•  Candidate Genration (apriori-gen) 

 

โ†ช 1. L_k-1 ์—์„œ self-joining ๊ณผ์ •์„ ์ˆ˜ํ–‰ 

โ†ช 2. Pruning 

 

 

 

โ†ช ์˜ˆ๋ฅผ๋“ค์–ด, L3 = {abc,abd,acd,ace,bcd} ๊ธธ์ด๊ฐ€ 3์ธ ๋นˆ๋ฐœ์ง‘ํ•ฉ์ด๋ผ๊ณ  ํ–ˆ์„ ๋•Œ, L3 ์˜ self joining ๊ฒฐ๊ณผ๋กœ C4 ๊ธธ์ด๊ฐ€ 4์ธ ํ›„๋ณด์ง‘ํ•ฉ์„ ์ƒ์„ฑํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•ด๋ณด์ž. ์ด๋•Œ self joining ์„ ํ•˜๋ ค๋ฉด, ์ตœ์†Œ 2๊ฐœ์˜ item ์ด ๊ฒน์ณ์•ผ joining ์ด ๊ฐ€๋Šฅํ•˜๋‹ค. ๊ฐ๊ฐ์˜ ์กฐํ•ฉ์œผ๋กœ ์ƒ์„ฑ๋œ ๊ฒƒ๋“ค ์ค‘์—, abc ์™€ abd ์˜ ์กฐํ•ฉ์œผ๋กœ abcd ๋ฅผ ๋งŒ๋“ค์—ˆ๊ณ , acd์™€ ace์˜ ์กฐํ•ฉ์œผ๋กœ acde ๋ฅผ ๋งŒ๋“ค์—ˆ๋‹ค๊ณ  ํ•ด๋ณด์ž. ์—ฌ๊ธฐ์„œ abcd ์˜ ๊ฒฝ์šฐ, abc, abd, acd, bcd  ๊ฐ€ L3 ์— ๋ชจ๋‘ ํฌํ•จ๋˜์–ด ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ acde ์˜ ๊ฒฝ์šฐ, acd,ace ๋Š” L3 ์— ํฌํ•จ๋˜์–ด ์žˆ์ง€๋งŒ, ade, cde ๋Š” ํฌํ•จ๋˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์— anti-monotone ์„ฑ์งˆ์— ์œ„๋ฐฐ๋˜์–ด, acde ๋Š” C4 ํ›„๋ณด์—์„œ ์ œ์™ธํ•  ์ˆ˜๋ฐ–์— ์—†๋‹ค. 

 

 

โ†ช ๋‹ค์‹œ ์ •ํ™•ํ•œ ์ดํ•ด๋ฅผ ์œ„ํ•ด ์„ค๋ช…ํ•ด๋ณด์ž๋ฉด, L2 ๋กœ๋ถ€ํ„ฐ C3 ํ›„๋ณด์ง‘ํ•ฉ์„ ๋„์ถœํ•ด๋‚ผ ๋•Œ, {ABC} ์˜ ๊ฒฝ์šฐ {AB} ์˜ ์กฐํ•ฉ์ด L2์— ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ œ์™ธ, {ABE}์˜ ๊ฒฝ์šฐ {AE} ์กฐํ•ฉ์ด ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ œ์™ธ, {ACE} ์˜ ๊ฒฝ์šฐ๋„ {AE} ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์— ์ œ์™ธํ•œ๋‹ค. {BCE} ๊ฐ™์€ ๊ฒฝ์šฐ๋Š” {BC},{BE},{CE} ๊ฐ€ ๋ชจ๋‘ L2 ์— ๋นˆ๋ฐœ์ง‘ํ•ฉ์œผ๋กœ ์กด์žฌํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€๋Šฅํ•˜๋‹ค. 

 

โ†ช ์—ฌ๊ธฐ์„œ๋„ L2์—์„œ C3 ํ›„๋ณด์ง‘ํ•ฉ์„ ์ƒ์„ฑํ•  ๋•Œ ์„ค๋ช…์„ ์ฐธ๊ณ ํ•ด๋ณด์ž! 

โ†ช ์•„๋ฌดํŠผ ์ด๋Ÿฐ์‹์œผ๋กœ ๋นˆ๋ฐœ์ง‘ํ•ฉ L ๋“ค์„ ์ฐพ์•„๊ฐ€๋ฉด ๋œ๋‹ค. 

 

 

 

โ‘ข  Reducing # of comparisons 

 

๋น„๊ต ๊ฐœ์ˆ˜๋ฅผ ์ค„์ด๋Š” ๋ฐฉ๋ฒ• 

 

  support counting : ํŠธ๋žœ์ ์…˜์˜ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ์‚ดํŽด๋ณด๋Š” ๊ฒƒ์€ ๊ฐ ํ›„๋ณด์ง‘ํ•ฉ์˜ support ๋ฅผ ๊ฒฐ์ •ํ•œ๋‹ค. ๋น„๊ตํšŸ์ˆ˜๋ฅผ ์ค„์ด๊ธฐ ์œ„ํ•ด์„œ, ํ›„๋ณด์ง‘ํ•ฉ๋“ค์„ ํ•ด์‹œ๊ตฌ์กฐ hash structure ๋กœ ์ €์žฅํ•œ๋‹ค. ๋ชจ๋“  ํ›„๋ณด์ง‘ํ•ฉ์— ๋Œ€ํ•ด ๊ฐ๊ฐ์˜ ํŠธ๋žœ์ ์…˜์„ ๋น„๊ตํ•  ํ•„์š” ์—†์ด, ํŠน์ • ํ›„๋ณด์ง‘ํ•ฉ์„ ํฌํ•จํ•˜๋Š” hased buckets ๊ณผ๋งŒ ๋น„๊ตํ•˜๋ฉด ๋œ๋‹ค. 

 

 

 

 

 

 

  Generating hash tree 

 

๊ธธ์ด๊ฐ€ 3์ธ 15๊ฐœ์˜ ํ›„๋ณด์ง‘ํ•ฉ๋“ค์ด ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•ด๋ณด์ž

 

 

ํ•ด์‹œํ•จ์ˆ˜๊ฐ€ ์•„๋ž˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ์ฃผ์–ด์ง€๊ณ , ๋ฆฌํ”„๋…ธ๋“œ์— ์ €์žฅํ•  ์ˆ˜ ์žˆ๋Š” ์ตœ๋Œ€ ์•„์ดํ…œ์ง‘ํ•ฉ์˜ ์ˆ˜์ธ Max leaf size ๋ฅผ ๊ฒฐ์ •ํ•ด์ค€๋‹ค. ํ›„๋ณด ์•„์ดํ…œ ์ง‘ํ•ฉ์˜ ์ˆ˜๊ฐ€ max size ๋ฅผ ๋„˜์–ด๊ฐ„๋‹ค๋ฉด ๋…ธ๋“œ๋ฅผ ์•„๋ž˜๋กœ ์ชผ๊ฐ ๋‹ค. 

 

 

 

ํ•ต์‹ฌ์€ ํ•ด์‹œํŠธ๋ฆฌ์˜ i ๋ฒˆ์งธ level ์—์„œ๋Š” i๋ฒˆ์งธ item ์„ ํ™•์ธํ•ด ํ•ด์‹œํ•จ์ˆ˜ ๊ทœ์น™์— ๋งž๊ฒŒ ํ›„๋ณด ์•„์ดํ…œ์…‹์„ ๋ฐฐ์ •ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. 

 

 

์˜ˆ๋ฅผ๋“ค์–ด {145} itemset ๋ฅผ ํ• ๋‹นํ•œ๋‹ค๊ณ  ํ•˜๋ฉด, ์ฒซ๋ฒˆ์งธ level ์—์„œ ์ฒซ๋ฒˆ์งธ item ์„ ๊ณ ๋ คํ•ด ํ• ๋‹นํ•˜๋ฏ€๋กœ 1์€ (1,4,7) ์ค‘์— ํฌํ•จ๋˜์–ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€์žฅ ์™ผ์ชฝ ๊ฐ€์ง€ ํ•˜์œ„๋กœ ์š”์†Œ๊ฐ€ ๋ฐฐ์ •๋œ๋‹ค. ๋…ธ๋“œ ์•„๋ž˜๋กœ ๋ฐฐ์ •๋˜๋Š” ์•„์ดํ…œ์…‹์ด 3๊ฐœ๊ฐ€ ์ดˆ๊ณผ๋˜๋ฉด ๊ณ„์† ํ•˜์œ„ ๋…ธ๋“œ๋“ค์„ ์ƒ์„ฑํ•ด ๋ฐฐ์ •์„ ์ง„ํ–‰ํ•˜๋ฉด ๋œ๋‹ค. 

 

 

 

 

  Subset operation using the hash tree

 

ํ•ด์‹œํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ ๋Š” subset operation ์„ ํšจ์œจ์ ์œผ๋กœ ํ•˜๊ธฐ ์œ„ํ•จ์ด๋‹ค. 

ํŠธ๋žœ์ ์…˜ t๋ฅผ ํฌํ•จํ•˜๋Š” ๋ชจ๋“  ๊ฐ€๋Šฅํ•œ ํ›„๋ณด์ง‘ํ•ฉ๋“ค์„ ์ฐพ๊ณ ์‹ถ๋‹ค. 

ํŠธ๋žœ์ ์…˜ t๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, ๊ฐ€๋Šฅํ•œ ํฌ๊ธฐ๊ฐ€ 3์ธ subset ๋“ค์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ณผ์ •์€ ์•„๋ž˜์™€ ๊ฐ™๋‹ค. 

 

 

 

์ด๋•Œ ํŠธ๋žœ์ ์…˜t์—์„œ, ํ•ด์‹œํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ candidates ๋“ค๊ณผ ๋น„๊ตํ•ด ๋นˆ๋ฐœ์ง‘ํ•ฉ์„ ์ฐพ์•„๊ฐ€๋Š” ๊ณผ์ •์„ ๊ทธ๋ฆฌ๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. 

 

 

๋…ธ๋ž€์ƒ‰์€ ์šฐ๋ฆฌ๊ฐ€ ํ•ด์‹œํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ํ• ๋‹นํ•ด๋†“์€ candidate ๋“ค์ด๊ณ , ๋ฏผํŠธ์ƒ‰์€ transaction ์—์„œ ํŒŒ์ƒ๋œ ์•„์ดํ…œ์ง‘ํ•ฉ๋“ค์ด๋ผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ํ•ด์‹œํŠธ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๋นˆ๋ฐœ์ง‘ํ•ฉ์„ ์ฐพ์•„๋ณด์ž. 

 

 

 

 

subset operation ์ฒ˜๋Ÿผ prefix ๋ฅผ ๊ณ ์ •ํ•˜๊ณ  ๊ฐ€๋Šฅํ•œ ์กฐํ•ฉ๋“ค์„ hash ํ•จ์ˆ˜์— ๋”ฐ๋ผ ํ• ๋‹นํ•ด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. ํ•‘ํฌ์ƒ‰๊ณผ ํŒŒ๋ž€์ƒ‰ ํ™”์‚ดํ‘œ๋“ค์ด ๊ฐ€๋ฆฌํ‚ค๊ณ  ์žˆ๋Š” ๋…ธ๋“œ์— ํ›„๋ณด๊ตฐ์ด ์œ„์น˜ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

 

 

cadidate ๋“ค๊ณผ transaction ์—์„œ ์ƒ์„ฑ๋˜๋Š” itemset ๋“ค์„ ๋น„๊ตํ•ด๋ณด๋ฉฐ support ๋ฅผ ๊ณ„์‚ฐํ•ด ๋นˆ๋ฐœ์ง‘ํ•ฉ๋“ค์„ ์ฐพ์•„๋‚ด๊ฐ„๋‹ค. ์ด๋•Œ ์„ ํƒ๋˜์ง€ ์•Š์€ ๋ฆฌํ”„๋…ธ๋“œ๋“ค ๊ฐ™์€ ๊ฒฝ์šฐ์—” ํŠธ๋žœ์ ์…˜์— ํฌํ•จ๋˜์ง€ ์•Š๋Š” subset ์ด๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค. 

์—ฌ๊ธฐ์„œ 458 ์€ ํŠธ๋žœ์ ์…˜์— ํฌํ•จ๋˜์ง€ ์•Š์•„ ํšŸ์ˆ˜๋ฅผ ๋”ํ•˜์ง„ ์•Š์ง€๋งŒ ์–ด์จ‹๋“  ์ „์ฒด ํ•จ์ˆ˜๊ฐ€ ํŠธ๋žœ์ ์…˜์˜ subset ๋“ค์— ์ ์šฉ์€ ๋˜๊ธฐ ๋•Œ๋ฌธ์— count๋ฅผ ํ•ด์ฃผ๊ธด ํ•ด์ค€๋‹ค! 

 

๊ทธ๋ž˜์„œ ์ตœ์ข…์ ์œผ๋กœ 15๊ฐœ์˜ ํ›„๋ณด๋“ค ์ค‘์— 9๊ฐœ๊ฐ€ ํŠธ๋žœ์ ์…˜๊ณผ ๋งค์นญ์ด ๋˜๊ณ , ์šฐ๋ฆฌ๋Š” 9๊ฐœ์˜ ํ›„๋ณด ์•„์ดํ…œ์…‹์ค‘์— 5๊ฐœ์˜ ๋ฆฌํ”„๋…ธ๋“œ์— ๋Œ€ํ•ด support ๋ฅผ ๊ฒŒ์‚ฐํ•ด๋‚ธ๋‹ค. 

 

 

 

  Apriori ์˜ Complexity ์— ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š” ์š”์†Œ๋“ค 

 

   a. Choice of minimum support threshold : ๋‚ฎ์€ minsup ์„ ๊ณ ๋ฅด๋ฉด ๋” ๋งŽ์€ ๋นˆ๋ฐœ์ง‘ํ•ฉ๋“ค์„ ์–ป์„ ์ˆ˜ ์žˆ๊ณ , ์ด๋Š” ๋นˆ๋ฐœ์ง‘ํ•ฉ์˜ ์ตœ๋Œ€ ๊ธธ์ด์™€ ๊ฐ€๋Šฅํ•œ ํ›„๋ณด์ง‘ํ•ฉ์˜ ์ˆ˜๋ฅผ ์ฆ๊ฐ€์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค. 

   b. Dimensionality (number of items) of the data set : More space is needed to store the support count of each item. ๋งŒ์•ฝ item ์ˆ˜๊ฐ€ ์ฆ๊ฐ€ํ•œ๋‹ค๋ฉด ๊ณ„์‚ฐ ๋ณต์žก๋„๋„ ์ฆ๊ฐ€ํ•  ๊ฒƒ์ด๋‹ค. 

   c. Size of database : apriori ๋Š” ๋‹ค์–‘ํ•œ ๊ฒฝ๋กœ๋กœ ๋งŒ๋“ค์–ด์ง€๊ธฐ ๋•Œ๋ฌธ์— ํŠธ๋žœ์ ์…˜์˜ ์ˆ˜๋งŒํผ ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ run time ๋„ ์ฆ๊ฐ€ํ•  ๊ฒƒ์ด๋‹ค. 

   d. Average transaction width : Transaction width increases with denser data sets. ์ด๋Š” ๋นˆ๋ฐœ์ง‘ํ•ฉ์˜ ์ตœ๋Œ€ ํฌ๊ธฐ๋ฅผ ์ฆ๊ฐ€์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค. 

 

โ‡จ ์ตœ์†Œ ์ง€์ง€๋„ ์ž„๊ณ„๊ฐ’ ์„ ํƒ, ๋ฐ์ดํ„ฐ์ง‘ํ•ฉ์˜ ์ฐจ์› (ํ•ญ๋ชฉ์˜์ˆ˜), ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์˜ ํฌ๊ธฐ, ํ‰๊ท  ํŠธ๋žœ์žญ์…˜ ๋„ˆ๋น„ 

 

 

 

 

 

โ‘ข FP-growth 

 

•  Motivation 

 

> ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ์‚ดํŽด๋ณด๋Š” ๊ฒƒ์€ ๋น„์šฉ์ด ๋งŽ์ด ๋“ ๋‹ค. ๋˜ํ•œ ๊ธด ํŒจํ„ด์„ ์ฐพ๋Š” ๊ฒƒ์€ ๋งŽ์€ scanning ์ด ํ•„์š”ํ•˜๊ณ  ๋งŽ์€ ํ›„๋ณด์ง‘ํ•ฉ๋“ค์„ ์ƒ์„ฑํ•ด๋‚ธ๋‹ค. (# of whole database scan is proportional to the longest pattern

 

Bottleneck: candidate-generation-and-test

 

> candidate generation ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ๋ณต์žก๋„๋ฅผ ํ”ผํ•ด๊ฐˆ ์ˆ˜๋Š” ์—†์„๊นŒ โ‡จ FP-growth ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ ๋“ฑ์žฅ ์ด์œ  

 

 

 

•  Frequent-Pattern Mining without Candidate Generation

 

> Heuristic : P ๋ฅผ ๋นˆ๋ฐœ์ง‘ํ•ฉ, S๋ฅผ P๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ๋Š” ํŠธ๋žœ์žญ์…˜ ์…‹, x๋ฅผ item ์ด๋ผ๊ณ  ํ•˜์ž. ๋งŒ์•ฝ x๊ฐ€ ๋นˆ๋ฐœ ์•„์ดํ…œ์— ํ•ด๋‹นํ•œ๋‹ค๋ฉด {x}∪P ๋„ ๋ฐ˜๋“œ์‹œ ๋นˆ๋ฐœ์ง‘ํ•ฉ์ด ๋œ๋‹ค. 

> FP grwoth ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ candidate generation ์ด ํ•„์š”์น˜ ์•Š๋‹ค โญ

> compact ํ•œ ๋ฐ์ดํ„ฐ๊ตฌ์กฐ์ธ FP-tree ๋Š” frequent pattern mining ์„ ์œ„ํ•œ ์ •๋ณด ์ €์žฅ ๊ตฌ์กฐ์ด๋‹ค. 

> ๋นˆ๋ฐœ ํŒจํ„ด์˜ set ์„ ์‚ดํŽด๋ณด๊ธฐ ์œ„ํ•ด Recursive mining algorithm ๊ตฌ์กฐ๋ฅผ ์ทจํ•œ๋‹ค. 

 

 

•  Intuition 

 

 

universe ๊ฐ€ ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์ด๊ณ , P ๋Š” ๋นˆ๋ฐœ์ง‘ํ•ฉ์ด๊ณ , S ๋Š” ํŠธ๋žœ์žญ์…˜์ด๊ณ , S ๋Š” P ๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ๋‹ค. x ๋Š” ๊ฐ ์•„์ดํ…œ element ๋ฅผ ์˜๋ฏธํ•˜๊ณ , S ์—์„œ ๋นˆ๋ฐœํ•œ ์•„์ดํ…œ์— ํ•ด๋‹นํ•˜๋Š” ๊ฒฝ์šฐ์ธ x๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•  ๋•Œ, P+x ๋Š” ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์—์„œ ๋นˆ๋ฐœํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. P+x ๊ฐ€ ๋นˆ๋ฐœํ•œ์ง€ ์•„๋‹Œ์ง€๋ฅผ ์ฒดํฌํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ์‚ดํŽด๋ณผ ํ•„์š”๋Š” ์—†๋‹ค. S ๋งŒ ์‚ดํŽด๋ณด๋ฉด ๋œ๋‹ค. 

 

 

 

•  FP-Tree 

 

FP-growth ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด์„  FP-tree ๋ฅผ ๋งŒ๋“ค์–ด์•ผ ํ•œ๋‹ค. 

 

1) ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ํ•œ ๋ฒˆ๋งŒ ํ›‘์–ด๋ณด๋ฉด์„œ, ๋นˆ๋ฐœํ•œ 1-itemset ๋“ค์„ ํ™•์ธํ•œ๋‹ค. (single item patterns) 

     a. ๋นˆ๋ฐœํ•˜์ง€ ์•Š์€ itemset ์€ ์ œ๊ฑฐํ•œ๋‹ค. 

2) frequency descending order ์ธ f list ์— ๊ทผ๊ฑฐํ•˜์—ฌ ๋นˆ๋ฐœ์ง‘ํ•ฉ๋“ค์„ ์ •๋ ฌํ•œ๋‹ค. 

3) DB ๋ฅผ ๋‹ค์‹œ ํ•œ๋ฒˆ ๋” ํ›‘์œผ๋ฉด์„œ FP tree ๋ฅผ ๋งŒ๋“ ๋‹ค. ๋ถ€๋ชจ๋…ธ๋“œ๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ํŠธ๋žœ์žญ์…˜์„ ์ž์‹๋…ธ๋“œ๋กœ ์ถ”๊ฐ€ํ•˜๋ฉฐ ํŠธ๋ฆฌ๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 

 

 

 

์œ„์™€ ๊ฐ™์€ ordered frequent items ๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ•˜์—ฌ, ๊ฐ ํŠธ๋žœ์žญ์…˜๋ณ„๋กœ ์‚ดํŽด๋ณด๋ฉฐ FP tree ๋ฅผ ๋งŒ๋“œ๋Š” ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. 

 

 

TID = 200 ์„ ์Šค์บ๋‹ ํ•˜๋Š” ๊ณผ์ •์„ ์ž์„ธํžˆ ์‚ดํŽด๋ณด๋ฉด, ์•„์ดํ…œ์ด ๊ธฐ์กด ๋…ธ๋“œ์— ์žˆ๋Š” ๊ฒฝ์šฐ์—๋Š” count ๋ฅผ ํ•˜๋‚˜์”ฉ ์ฆ๊ฐ€์‹œํ‚ค๊ณ , ๊ธฐ์กด ๋…ธ๋“œ๊ฐ€ ์—†๋Š” ๊ฒฝ์šฐ๋Š” ์ƒˆ๋กœ์šด path ๋กœ ์ถ”๊ฐ€์‹œํ‚ค๊ณ  ์žˆ์Œ์„ ํ™•์ธํ•ด ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

์ตœ์ข…์ ์œผ๋กœ ์ƒ์„ฑ๋œ FP tree ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. 

 

 

 

์ด๋•Œ Header table ์˜ ๋…ธ๋“œ ์ˆ˜์™€ FP tree ์—์„œ์˜ ๋…ธ๋“œ์˜ frequency ๋ฅผ ํ—ท๊ฐˆ๋ฆฌ์ง€ ์•Š๋„๋ก ์ฃผ์˜ํ•˜์ž! 

 

 

 

 

•  FP-Tree Construction

 

์ด์ฒ˜๋Ÿผ FP-tree ๋ฅผ ์ƒ์„ฑํ•˜๋Š”๋ฐ ์žˆ์–ด ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ๋”ฑ 2๋ฒˆ๋งŒ ๋ณด๋ฉด ๋œ๋‹ค. 1-iemset ๋ฅผ ์ฐพ๊ธฐ ์œ„ํ•ด ํ•œ ๋ฒˆ, ํŠธ๋ฆฌ์—์„œ ๊ฐ ํŠธ๋žœ์žญ์…˜์„ ์‚ฝ์ž…ํ•˜๊ธฐ ์œ„ํ•ด ํ•œ๋ฒˆ! 

 

์ด์–ด์ง€๋Š” mining ์ž‘์—…์€ FP-tree (very compact, not using the database) ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค. 

 

 

์‰ฌ์šด์„ค๋ช…๋ธ”๋กœ๊ทธ

 

 

•  How to mine an FP-Tree 

 

(1) forming conditional pattern bases

 

 

๋นˆ๋ฐœ์ง‘ํ•ฉ์ธ α ๋ฅผ ํฌํ•จํ•˜๋Š” FP-tree ์˜ prefix path ๋ฅผ ๋”ฐ๋ผ๊ฐ€๋ฉฐ support ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

 

์˜ˆ๋ฅผ๋“ค์–ด α = {m} ์ธ ๊ฒฝ์šฐ, {m} ์€ ๋นˆ๋ฐœ์ง‘ํ•ฉ์— ํ•ด๋‹นํ•˜๊ณ , {m} ์˜ conditional pattern base ๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. m ์„ ํฌํ•จํ•˜๋Š” path ๋Š” FP-tree ์—์„œ ์‚ดํŽด๋ดค์„ ๋•Œ <f,c,a,m> ์™€ <f,c,a,b,m> ์ด ์žˆ๊ณ , support ๋ฅผ ๊ณ„์‚ฐํ•˜๋ฉด (m ๋…ธ๋“œ์˜ frequency ๊ฐ€ support ๊ฐ€ ๋œ๋‹ค) ๊ฐ๊ฐ 2์™€ 1์ด ๋œ๋‹ค. 

 

 

 

(2) constructing conditional FP-trees

 

์•ž์—์„œ ์‚ดํŽด๋ณธ conditional pattern base (α = {m}) ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์กฐ๊ฑด๋ถ€ FP tree ๋ฅผ ์ƒ์„ฑํ•˜๋Š”๋ฐ, ์•„๋ž˜ ๊ทธ๋ฆผ์„ ์ฐธ๊ณ ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. ์ด๋•Œ b ๋Š” frequency ๊ฐ€ 1์ด๋ฏ€๋กœ minsup ์„ ์ถฉ์กฑ์‹œํ‚ค์ง€ ๋ชปํ•ด ์ œ์™ธ์‹œํ‚จ๋‹ค. 

 

 

 

 

 

(3) recursively mining conditional FP-trees

 

(1),(2) ์— ํ•ด๋‹นํ•˜๋Š” ๊ณผ์ •์„ ๋ฐ˜๋ณต์ ์œผ๋กœ ์ˆ˜ํ–‰ํ•˜๋ฉฐ ๋นˆ๋ฐœ์ง‘ํ•ฉ๋“ค์„ ์ฐพ์•„๊ฐ€๋ฉด ๋œ๋‹ค.

 

 

 

 

 

•  Pattern Growth 

 

 

 

โ‡จ α  ๊ฐ€ DB ์˜ ๋นˆ๋ฐœ์ง‘ํ•ฉ์ด๊ณ , B ๋Š” α ์˜ conditional pattern base ์ด๊ณ , β ๋Š” B ์˜ itemset ์— ํ•ด๋‹นํ•  ๋•Œ, αUβ ๋Š” DB ์—์„œ ๋นˆ๋ฐœํ•˜๊ณ  β ๋Š” B์—์„œ ๋นˆ๋ฐœํ•˜๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

โ‡จ Process of mining frequent patterns can be viewed as first mining frequent 1-itemsets and then progressively growing each such itemset by mining its conditional pattern base, which can in turn be done similarly

 

โ‡จ We successfully transform a frequent k-itemset mining problem into a sequence of k frequent 1-itemset mining problems via a set of conditional pattern bases 

 

 

 

 

 

โ‡จ {m}๊ณผ {f,c,a} ์˜ join ์˜ ๋ชจ๋“  ๊ฒฐ๊ณผ๋„ ๋‹ค ๋นˆ๋ฐœ์ง‘ํ•ฉ {mf}, {mc}, {ma}, {mfc}, {mfa}, {mca}, {mfca} 

 

 

 

 

•  Single path Tree 

 

- Suppose an FP-tree T has a single path P

- The complete set of the frequent patterns of T can be generated by the enumeration of all the combinations of the subpaths of P

- The support is the minimum support of the items contained in the subpath

 

 

•  FP-growth algorithm 

 

 

•  FP-growth algorithm vs Apriori 

 

 

→ support threshold ๊ฐ’์ด ๋‚ฎ์œผ๋ฉด ๋” ๋งŽ์€ ๋นˆ๋ฐœ์ง‘ํ•ฉ๋“ค์ด ์ƒ๊ฒจ๋‚˜๊ฒŒ ๋˜๊ณ , ๊ทธ๋ž˜์„œ ๊ฐ’์ด ๋งค์šฐ ๋‚ฎ์€ ๊ฒฝ์šฐ์—๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด ์ฃผ์–ด์ง„ ์‹œ๊ฐ„ ๋‚ด์— ๋™์ž‘ํ•˜์ง€ ๋ชปํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

FP-Growth : Divide and conquer (conditional pattern base ๋งŒ ๋ณด๋ฉด ๋œ๋‹ค), No candidate generation, no candidate test, compressed database, No repeated scan of the entire database, Cheap operations

 

 

 

 

 

โ‘ฃ Maximal, Closed frequent itemsets 

 

•  Definisions 

 

 

โ–ธ Closed : An itemset X is closed,  if X is frequent and there exists no superset Y⊃X, with the same support as X : ๊ฐ€์žฅ support ๊ฐ’์ด ํฌ๋ฉด์„œ ํ•ด๋‹น itemset ์ด ๊ฐ€์žฅ ์›์†Œ์˜ ๊ฐœ์ˆ˜๊ฐ€ ์ž‘์•„์•ผ ํ•จ 

 

โ–ธ Maximal : An itemset X is maximal if X is frequent and there exists no superset Y⊃X : support ๊ฐ’์ด ๊ฐ™์•„๋„ ๋จ 

 

 

 

 

 

Closed itemset์€ frequent itemset์˜ ํ•˜์œ„์ง‘ํ•ฉ์ด์ง€๋งŒ, ๋‹ค๋ฅธ frequent itemset์œผ๋กœ๋ถ€ํ„ฐ ๋…๋ฆฝ์ ์ž…๋‹ˆ๋‹ค. ์ฆ‰, closed itemset์€ ์ž์‹ ์˜ support์™€ ๊ฐ™์€ support๋ฅผ ๊ฐ–๋Š” ๋‹ค๋ฅธ frequent itemset์ด ์กด์žฌํ•˜์ง€ ์•Š๋Š” frequent itemset์ž…๋‹ˆ๋‹ค. 

 

•  maximal itemset์€ ๋” ์ด์ƒ ํ™•์žฅ๋  ์ˆ˜ ์—†๋Š” frequent itemset์ž…๋‹ˆ๋‹ค. 

 

 closed itemset๊ณผ maximal itemset์€ frequent itemset์˜ ํŠน์ˆ˜ํ•œ ๊ฒฝ์šฐ์ž…๋‹ˆ๋‹ค. closed itemset์€ frequent itemset ์ค‘์—์„œ ๋…๋ฆฝ์ ์ธ ๊ฒƒ์„ ์ฐพ๊ณ , maximal itemset์€ frequent itemset ์ค‘์—์„œ ๋” ์ด์ƒ ํ™•์žฅ๋  ์ˆ˜ ์—†๋Š” ๊ฒƒ์„ ์ฐพ์Šต๋‹ˆ๋‹ค.

 

superset : superset์€ ์–ด๋–ค ์•„์ดํ…œ ์ง‘ํ•ฉ์— ๋Œ€ํ•ด, ํ•ด๋‹น ์•„์ดํ…œ ์ง‘ํ•ฉ์„ ํฌํ•จํ•˜๋Š” ๋” ํฐ ์•„์ดํ…œ ์ง‘ํ•ฉ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, {์šฐ์œ , ๊ณ„๋ž€}์ด๋ผ๋Š” ์•„์ดํ…œ ์ง‘ํ•ฉ์ด ์žˆ์„ ๋•Œ, {์šฐ์œ , ๊ณ„๋ž€, ๋นต}์€ {์šฐ์œ , ๊ณ„๋ž€}์˜ superset์ด ๋ฉ๋‹ˆ๋‹ค.

 

 

 

3. Rule Generation 


 

๋นˆ๋ฐœ์ง‘ํ•ฉ์„ ์ƒ์„ฑํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ (Apriori, FP-Growth) ์— ๋Œ€ํ•ด ๋ฐฐ์› ๋‹ค. ์ด์ œ ํ•ด๋‹น ๋นˆ๋ฐœ์ง‘ํ•ฉ์…‹์„ ๊ฐ€์ง€๊ณ  ์—ฐ๊ด€๊ทœ์น™์„ ๋งŒ๋“œ๋Š” ๊ณผ์ •์— ๋Œ€ํ•ด ์‚ดํŽด๋ณด์ž. 

 

•  Rule generation 

 

 

โ‡จ  L : ๋นˆ๋ฐœ์ง‘ํ•ฉ ์•„์ดํ…œ์…‹ 

โ‡จ  k : L ์˜ ์›์†Œ์˜ ๊ฐœ์ˆ˜ 

โ‡จ  2^k - 2 : ๊ฐ€๋Šฅํ•œ ์—ฐ๊ด€๊ทœ์น™์˜ ๊ฐœ์ˆ˜ (ํ›„๋ณด ์—ฐ๊ด€๊ทœ์น™์˜ ๊ฐœ์ˆ˜) 

 

 

•  ํšจ์œจ์ ์œผ๋กœ ๊ทœ์น™ ์ƒ์„ฑํ•˜๋Š” ๋ฐฉ๋ฒ• 

 

 

 

โ‡จ confidence ๋Š” support ์™€ ๋‹ฌ๋ฆฌ, anti-monotone ํ•œ ์„ฑ๊ฒฉ์„ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š๋‹ค. (ABC → D) ์˜ confidence ๋Š” (AB → D) ์˜ confidence ๋ณด๋‹ค ํด ์ˆ˜๋„ ์ž‘์„์ˆ˜๋„ ์žˆ๋‹ค. 

 

โ‡จ ๊ทธ๋Ÿฌ๋‚˜ ๊ฐ™์€ itemset ์œผ๋กœ๋ถ€ํ„ฐ ์ƒ์„ฑ๋œ ๊ทœ์น™์˜ confidence ๋Š” antimonotone ํ•œ ์„ฑ๊ฒฉ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ์˜ˆ๋ฅผ๋“ค์–ด L = {A,B,C,D} ๋นˆ๋ฐœ์ง‘ํ•ฉ์—์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๊ทœ์น™๋“ค์€ anti-monotone ํ•œ  confidence ๊ด€๊ณ„๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋‹ค. (ABC → D) โ‰ฝ (AB → CD) โ‰ฝ (A → BCD) . ํ•ด๋‹น ๊ทœ์น™๋“ค์€ ๊ทœ์น™ ๋‚ด์—์„œ ๋ชจ๋‘ ABCD ์ „์ฒด๋ฅผ ํฌํ•จํ•˜๊ณ  ์žˆ๋‹ค. Right hand side (RHS) 

 

 

 

•  Rule Generation for Apriori Algorithm

 

 

 

 

โ‡จ RHS ๋ฅผ ๋งŒ์กฑํ•˜๋Š” ๊ทœ์น™๋“ค ์‚ฌ์ด์—์„  anti-monotone ์„ฑ์งˆ์ด ์ ์šฉ๋  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ์œ„์˜ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ์ œ์ผ ์ƒ๋‹จ์— ์žˆ๋Š” ๊ทœ์น™์˜ confidence ๊ฐ€ ๋‚ฎ๋‹ค๋ฉด, ๊ทธ ํ•˜์œ„ ๊ทœ์น™๋“ค ({ABCD} set ์„ ๋งŒ์กฑํ•˜๋Š” ๊ทœ์น™๋“ค) ๋„ confidence ๊ฐ€ ๋‚ฎ๊ธฐ ๋•Œ๋ฌธ์—, Prune ํ•  ์ˆ˜ ์žˆ๋‹ค. (๊ฐ€์ง€์น˜๊ธฐ, ์ œ๊ฑฐ) 

 

 

โ‡จ  consequent (ํ™”์‚ดํ‘œ ๋’ค์— ์žˆ๋Š” ๊ทœ์น™) ๊ทœ์น™์—์„œ ๊ฐ™์€ prefix ๋ฅผ ๊ณต์œ ํ•˜๊ณ  ์žˆ๋Š” ๋‘ ๊ทœ์น™์„ merge ํ•˜์—ฌ ์ƒˆ๋กœ์šด ํ›„๋ณด๊ทœ์น™์„ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค. ๊ฐ€๋ น (CD → AB) ์™€ (BD → AC) ๊ทœ์น™์˜ ๊ฒฝ์šฐ, consequent ์—์„œ A๋ผ๋Š” prefix ๋ฅผ ๊ณต์œ ํ•˜๊ณ  ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, D → ABC ๋ผ๋Š” ๊ทœ์น™์„ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ๊ณ , confidence ์˜ RHS ์˜ anti-monotone ํ•œ ์„ฑ์งˆ์— ์˜ํ•ด์„œ ๋งŒ์•ฝ AD → BC ์˜ confidence ๊ฐ’์ด ๋†’์ง€ ์•Š๋‹ค๋ฉด, ํ›„๋ณด ๊ทœ์น™์—์„œ ์ œ๊ฑฐ๋  ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

4. Interestingness measures 


 

โ‘  Applications of interestingness measures 

 

•  interestingness measures 

 

| ์ถ”์ถœ๋œ ์ •๋ณด๊ฐ€ ์–ผ๋งˆ๋‚˜ ํฅ๋ฏธ๋กœ์šด์ง€, ํ˜น์€ ์–ผ๋งˆ๋‚˜ ์œ ์šฉํ•œ์ง€ ํ‰๊ฐ€ํ•˜๋Š” ์ง€ํ‘œ์ด๋‹ค. 

 

 

 

 

โ‘ก  Lift  

 

•  Lift = confidence / probability of consequent 

 

| A→B ์˜ ์—ฐ๊ด€๊ทœ์น™์—์„œ ์ž„์˜๋กœ B๊ฐ€ ๊ตฌ๋งค๋˜๋Š” ๊ฒฝ์šฐ์— ๋น„ํ•ด, A์™€์˜ ๊ด€๊ณ„๊ฐ€ ๊ณ ๋ ค๋˜์–ด ๊ตฌ๋งค๋˜๋Š” ๊ฒฝ์šฐ์˜ ๋น„์œจ 

 

 

A์—์„œ 0์œผ๋กœ ๊ฐ€๋Š” ๊ทœ์น™์— ๋Œ€ํ•ด, support = 3/7 (์ „์ฒด์—์„œ A0 ๊ฐ€ ๋“ฑ์žฅํ•œ ๋น„์œจ), confidence = 3/4 (A๊ฐ€ ๋‚˜ํƒ€๋‚ฌ์„ ๋•Œ A0 ์ด ๋“ฑ์žฅํ•œ ๋น„์œจ) ๋กœ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ๊ณ  B์—์„œ 1๋กœ ๊ฐ€๋Š” ๊ทœ์น™์— ๋Œ€ํ•ด์„œ๋„ ๋™์ผํ•˜๊ฒŒ support ์™€ confidence ๋ฅผ ๊ณ„์‚ฐํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

Lift ๋Š” confidence ๋ฅผ probability of consequent ๋กœ ๋‚˜๋ˆˆ ๊ฐ’์ด๋‹ค. Rule 1์—์„œ probability of consequent ๋ฅผ ๊ตฌํ•˜๋ฉด, ์ „์ฒด์—์„œ 0์ด ๋“ฑ์žฅํ•œ ๋น„์œจ์ด๋ฏ€๋กœ 4/7์ด๋‹ค. Rule 2์—์„œ๋Š” ์ „์ฒด์—์„œ 1์ด ๋“ฑ์žฅํ•œ ๋น„์œจ๋กœ 3/7์ด๋‹ค. 

 

•  Lift ๊ฐ’์ด 1๋ณด๋‹ค ํฌ๋‹ค๋ฉด, Antecedent ์™€ Consequent ์˜ ๋ฐœ์ƒ์ด ์„œ๋กœ ์—ฐ๊ด€ (correlated / dependent) ๋˜์–ด ์žˆ๋Š”์ง€ ์•Œ๋ ค์ค€๋‹ค. 

๊ทœ์น™1 (A→0) ์€ ๊ทœ์น™2 (B→1) ๋ณด๋‹ค ๋†’์€ confidence ๋ฅผ ๊ฐ€์ง€์ง€๋งŒ, lift ๋Š” ๋” ๋‚ฎ๋‹ค. 

 

 

 

 

•  Interestingness measure : Correlation (Lift) 

 

 

 

(๋†๊ตฌ๋ฅผํ•œ๋‹ค → ์‹œ๋ฆฌ์–ผ์„ ๋จน๋Š”๋‹ค) ๋ผ๋Š” ๊ทœ์น™์ด ์žˆ์„ ๋•Œ, (support, confidence) ๊ฐ€ (40%, 66.7%) ๋กœ ๊ณ„์‚ฐ๋œ ๊ฒฐ๊ณผ๋Š” ์˜คํ•ด๋ฅผ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ๋‹ค. (high confidence is misleading) ์ด๋ฏธ ์‹œ๋ฆฌ์–ผ์„ ๋จน๋Š” ํ•™์ƒ์˜ ์ „์ฒด ๋น„์œจ์ด 75% ์ด๋ฏ€๋กœ, ๊ตณ์ด ๋†๊ตฌ๋ฅผ ํ•œ๋‹ค๋Š” ๋‹จ์„œ๊ฐ€ ๋ถ™์ง€ ์•Š์•„๋„ ์‹œ๋ฆฌ์–ผ์„ ๋จน๋Š” ํ•™์ƒ๋“ค์˜ %๊ฐ€ ๋†’๋‹ค. 

 

(๋†๊ตฌ๋ฅผํ•œ๋‹ค → ์‹œ๋ฆฌ์–ผ์„ ๋จน์ง€ ์•Š๋Š”๋‹ค) ๋ผ๋Š” ๊ทœ์น™์ด ์žˆ์„ ๋•Œ, (support, confidence) ๊ฐ€ (20%, 33.3%) ๋กœ ๊ณ„์‚ฐ๋œ ๊ฒฐ๊ณผ๊ฐ€, ๋น„๋ก ๊ฐ’์€ ๋‚ฎ๋”๋ผ๋„ ์ข€ ๋” ์ •ํ™•ํ•˜๋‹ค. 

 

Lift ๊ฐ’์„ ๊ณ„์‚ฐํ•ด๋ณด๋ฉด, ์ฒซ๋ฒˆ์งธ ๊ทœ์น™์€ 0.89 (1๋ณด๋‹ค ๊ฐ’์ด ์ž‘์œผ๋ฏ€๋กœ ๋†๊ตฌ๋ฅผ ํ•˜๋Š” ๊ฒƒ๊ณผ ์‹œ๋ฆฌ์–ผ์„ ๋จน๋Š” ๊ฒƒ์€ ์—ฐ๊ด€์„ฑ์ด ์—†์Œ์„ ์˜๋ฏธ), ๋‘๋ฒˆ์งธ ๊ทœ์น™์€ 1.33 (1๋ณด๋‹ค ํฌ๋ฏ€๋กœ ์‹œ๋ฆฌ์–ผ์„ ๋จน์ง€ ์•Š์„ ํ™•๋ฅ ๋ณด๋‹ค, ๋†๊ตฌ๋ฅผ ํ•˜๊ณ  ์‹œ๋ฆฌ์–ผ์„ ๋จน์ง€ ์•Š์„ ํ™•๋ฅ ์ด ๋” ๋†’์€, ๋‘˜ ์‚ฌ์ด์˜ ์—ฐ๊ด€์„ฑ์ด ๋งค์šฐ ๋†’์€) ์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜จ๋‹ค.

 

 

•  Good measures for correlation? 

 

โ†ช  (์›”๋„›์„ ์‚ฐ๋‹ค → ์šฐ์œ ๋ฅผ ์‚ฐ๋‹ค) ์˜ ๊ทœ์น™์ด (support, confidence) = (1%, 80%) ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์™”์„ ๋•Œ, ๋งŒ์•ฝ 85%์˜ ์†Œ๋น„์ž๋“ค์ด ์šฐ์œ ๋ฅผ ์‚ฐ๋‹ค๋ฉด, ํ•ด๋‹น ๊ฒฐ๊ณผ๋Š” ์˜คํ•ด๋ฅผ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ๋‹ค. 

 

โ†ช  support ์™€ confidence ๋Š” antecedent ์™€ consequent ์‚ฌ์ด์˜ correlation ์„ ํ•ด์„ํ•˜๋Š”๋ฐ ์ ์ ˆํ•œ ์ง€ํ‘œ๊ฐ€ ์•„๋‹ˆ๋‹ค. ์ ์ ˆํ•œ ์ง€ํ‘œ์— ๋Œ€ํ•ด 20๊ฐœ ์ด์ƒ์˜ interestingness measure ๊ฐ€ ์ œ์•ˆ๋œ ๋ฐ” ์žˆ๋‹ค. 

 

 

 

โ‘ข  Interestingness measures 

 

 

 

 

 

•  Null-invariant measures 

 

 

โ†ช  ์—ฌ๋Ÿฌ๊ฐ€์ง€ ์ง€ํ‘œ๋“ค ์ค‘์—, null variance ์„ฑ์งˆ์„ ๊ฐ€์ง„ ์ง€ํ‘œ์— ๋Œ€ํ•ด ์‚ดํŽด๋ณด์ž. 

 

 

•  Comparison of interestingness measures 

 

โ†ช ์ƒ๊ด€๋ถ„์„์—์„œ null-invariance ์„ฑ์งˆ์€ ๋งค์šฐ ์ค‘์š”ํ•˜๋‹ค. Lift ์™€ x2 ์€ null-invariant ๊ฐ€ ์•„๋‹ˆ๋‹ค. 

โ†ช null invariance : not dependent on the null transaction 

 

 

โ†ช ์œ„์˜ ์˜ˆ์‹œ์—์„œ, no milk & no coffee ์— ํ•ด๋‹นํ•˜๋Š” ์…€์ด null transaction ์˜ ๊ฐœ์ˆ˜์ด๋‹ค. chi-square ์™€ Lift ๊ฐ’์€ null transaction ์— ์˜ํ–ฅ์„ ๋ฐ›๋Š” ์ง€ํ‘œ๋“ค์ด๊ธฐ ๋•Œ๋ฌธ์—, ์„œ๋กœ ๊ฒฐ๊ณผ๊ฐ€ ๋งค์šฐ ์ƒ๋ฐ˜๋˜๊ฒŒ ๋‚˜์˜ด์„ ์‚ดํŽด๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋ฐ˜๋ฉด AllConf ์ง€ํ‘œ์˜ ๊ฒฝ์šฐ, Data set 1 ๊ณผ 2 ์— ๋Œ€ํ•ด ๊ฐ™์€ ๊ฐ’์„ ๊ฐ€์ง„๋‹ค. ์ด๋Ÿฌํ•œ ์ง€ํ‘œ์˜ ๊ฒฝ์šฐ๋Š” null transaction ์— ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š๋Š” null-invariant ํ•œ ์ง€ํ‘œ๋ผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

โ†ช D1์™€ D2 : positive correlated (๋‘๊ฐœ๋ฅผ ๋™์‹œ์— ๊ตฌ๋งคํ•œ ๊ฒฝ์šฐ๊ฐ€ ํ•œ์ชฝ๋งŒ ๊ตฌ๋งคํ•œ ๊ฒฝ์šฐ๋ณด๋‹ค ๋งŽ์Œ) 

โ†ช D3 : negative correlated (๋‘๊ฐœ๋ฅผ ๋™์‹œ์— ๊ตฌ๋งคํ•œ ๊ฒฝ์šฐ๊ฐ€ ํ•œ์ชฝ๋งŒ ๊ตฌ๋งคํ•œ ๊ฒฝ์šฐ๋ณด๋‹ค ํ›จ์”ฌ ์ ์Œ) 

โ†ช D4,D5,D6 : neutral (ํŠน์ •ํ•œ ํŒจํ„ด์ด ๋ณด์ด์ง€ ์•Š์Œ) 

 

 

 

•  Example : analysis of DBLP Coauthor relationships 

 

์ปดํ“จํ„ฐ๊ณผํ•™ ๋ถ„์•ผ์—์„œ ๋…ผ๋ฌธ ์ถœ๊ฐ„์— ๊ด€๋ จ๋œ ๋ฐ์ดํ„ฐ์…‹์— ๊ด€ํ•œ ์˜ˆ์‹œ์ด๋‹ค. 

 

 

๋‹ค์–‘ํ•œ ์ง€ํ‘œ๋ฅผ ์‚ดํŽด๋ด„์œผ๋กœ์จ ๋‹ค์–‘ํ•œ ๊ด€๊ณ„์— ๋Œ€ํ•ด ํ•ด์„ํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

โ†ช Kulc ์˜ ๊ฒฝ์šฐ, ์ƒ์œ„ 3๊ฐœ์˜ sup ์˜ ๋ถ„ํฌ๋ฅผ ๋ณด๋ฉด, sup(ab) ๊ฐ€ sup(b) ์™€ ๊ฐ™์Œ์„ ๋ณผ ์ˆ˜ ์žˆ๊ณ , sup(b) ๊ฐ€ sup(a) ๋ณด๋‹ค ํ˜„์ €ํžˆ ๋‚ฎ์€ ์ˆซ์ž์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ฒฐ๊ณผ๋ฅผ ํ•ด์„ํ•ด๋ณด๋ฉด, advisor (๊ต์ˆ˜) - advisee (๋Œ€ํ•™์›์ƒ) ๊ด€๊ณ„๋ฅผ ๋„์ถœํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ด€๊ณ„๋Š” skewed ํ•œ ๊ด€๊ณ„๋ผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

โ†ช coherence ์˜ ๊ฒฝ์šฐ, ์ƒ์œ„ 3๊ฐœ์˜ sup ๋ถ„ํฌ๋ฅผ ๋ณด๋ฉด, sup(b) ๊ฐ€ sup(ab) ๋ณด๋‹ค ๊ฐ’์ด ๋†’์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ์ด๋Š” ์ €์ž b ๊ฐ€ ๋…๋ฆฝ์ ์ธ ์—ฐ๊ตฌ์ž์— ํ•ด๋‹นํ•œ๋‹ค๊ณ  ์œ ์ถ”ํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. (๊ต์ˆ˜๋‹˜์˜ ๋„์›€์„ ๋ฐ›์ง€ ์•Š๊ณ  ์Šค์Šค๋กœ ์ถœ๊ฐ„ํ•œ ๋…ผ๋ฌธ์ด ๋งŽ๋‹ค) 

 

โ†ช kulc tends to more credit on skewed pattern , coherence prefers balanced pattern, cosine in between 

 

 

 

•  Null-invariant ์ง€ํ‘œ๊ฐ€ ๋” ์ข‹์€ ๊ฒƒ์ธ๊ฐ€? 

 

๋ช…ํ™•ํ•˜๊ฒŒ ๊ทธ๋ ‡๋‹ค๊ณ  ๋งํ•˜๊ธฐ๋Š” ์–ด๋ ต๋‹ค. 

 

โ†ช IR (imbalance ratio) : A์™€ B์˜ ๋ถˆ๊ท ํ˜•์„ ์ธก์ •ํ•˜๋Š” ์ง€ํ‘œ 

 

 

 

 

 

โญ Programming 

 

 

 

5. Advanced association analysis 


 

Pattern mining roadmap 

 

 

 

 

โ‘  Rare or Negative 

 

•  Infrequent or rare patterns 

ex. Rolex ์‹œ๊ณ„๋ฅผ ๊ตฌ๋งคํ•˜๋Š” ๊ฒƒ → ๋“œ๋ฌผ๊ฒŒ ๋‚˜ํƒ€๋‚˜์ง€๋งŒ, ๋งค์ถœ์— ์žˆ์–ด์„  ์ค‘์š”ํ•œ ๋ฐœ์ƒ

 

•  Negative patterns 

ex. SUV ์ฐจ๋ฅผ ๊ตฌ๋งคํ•˜๋Š” ๊ฒƒ๊ณผ Toyota ์ฐจ๋ฅผ ๊ตฌ๋งคํ•˜๋Š” ๊ฒƒ์€ ์Œ์˜ ์ƒ๊ด€ํŒจํ„ด์„ ๋ณด์ธ๋‹ค → ์ƒํ’ˆ ๊ฐ„์— ์Œ์˜ ๊ด€๊ณ„์„ฑ์ด ๋ณด์ผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

โ‘ก  Abstraction levels 

 

•  Multi-level association rules 

 

 

โ†ช ๋‚ฎ์€ ๋ ˆ๋ฒจ์— ์žˆ๋Š” ์•„์ดํ…œ์ผ์ˆ˜๋ก support ๊ฐ’ ๋˜ํ•œ ์ž‘์„ ์ˆ˜ ์žˆ๋‹ค. (Items at the lower level are expected to have lower support)

 

 

โ‘ข  Flexible support and redundancy filtering 

 

• Flexible min-support thresholds

 

๋ช‡๋ช‡ ์ƒํ’ˆ๊ตฐ์€ ์ ์€ ๋นˆ๋„๋กœ ๋‚˜ํƒ€๋‚˜์ง€๋งŒ, ๋” ๊ฐ€์น˜์žˆ์„ ์ˆ˜ ์žˆ๋‹ค. ์ด๋Ÿฐ ๊ฒฝ์šฐ์—๋Š” non-uniformํ•œ group-based min-support ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ข‹๋‹ค. 

 

ex. {๋‹ค์ด์•„๋ชฌ๋“œ,์‹œ๊ณ„,์นด๋ฉ”๋ผ} = 0.05%, {๋นต, ์šฐ์œ } = 5% 

โ†ช ๋น„์‹ผ ์•„์ดํ…œ ๊ตฐ์ง‘์€ ์ƒ๋Œ€์ ์œผ๋กœ ๋ฐœ์ƒ ๋นˆ๋„๊ฐ€ ๋‚ฎ์„ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋‚ฎ์€ support threshold ๋ฅผ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ข‹๊ณ , ๋น„๊ต์  ์ €๋ ดํ•œ ์•„์ดํ…œ ๊ตฐ์ง‘์€ ์ƒ๋Œ€์ ์œผ๋กœ ๋ฐœ์ƒ ๋นˆ๋„๊ฐ€ ๋†’์„ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋†’์€ support threshold ๋ฅผ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ์ ์ ˆํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

• Redundancy Filtering : ๋ช‡๋ช‡ ๊ทœ์น™์€ ancestor ์˜ ํŠน์„ฑ์œผ๋กœ ์ธํ•ด ์ค‘๋ณต๋  ์ˆ˜ ์žˆ๋‹ค. 

 

 

2% ์ง€๋ฐฉ ์šฐ์œ ์™€ ๋นต์˜ ๊ทœ์น™์€ ์šฐ์œ ์™€ ๋นต์˜ ๊ทœ์น™๊ณผ ์ค‘๋ณต๋  ์ˆ˜ ์žˆ๋‹ค. ๊ทœ์น™์˜ ancestor ์— ๊ธฐ๋ฐ˜ํ•œ expected support value (์œ„์—์„œ๋Š” 8%/4 = 2%๋กœ ์™„์ „ ์ผ์น˜) ๊ฐ€ ๋น„์Šทํ•˜๋‹ค๋ฉด, ํ•ด๋‹น ๊ทœ์น™์€ ์ค‘๋ณต (redundant) ๋ฌ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 2% milk ๊ทœ์น™์€ ์ค‘๋ณต๋œ ๊ทœ์น™์ด๋‹ค. 

 

 

 

 

โ‘ฃ  # of dimensions 

 

•  Multi-dimensional association rules 

 

2๊ฐœ ์ด์ƒ์˜ ์ฐจ์›์„ ๋‚˜ํƒ€๋‚ด๋Š” ์—ฐ๊ด€๊ทœ์น™์ด ์กด์žฌํ•  ์ˆ˜ ์žˆ๋‹ค.

 

single dimension ์„ ๊ฐ€์ง€๋Š” ๊ทœ์น™์˜ ์˜ˆ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. ์•„๋ž˜์˜ ๊ทœ์น™์€ buys ์—์„œ buys ๋กœ ๊ฐ€๋Š” ๊ทœ์น™์ด๋‹ค. 

 

 

โ‘ค  Types of values 

 

•  Quantitative association rules 

 

์ˆ˜์น˜ํ˜• ๊ฐ’์„ ๊ฐ€์ง„ ์ƒํ’ˆ๊ตฐ๊ณผ ๋ฒ”์ฃผํ˜• ๊ฐ’์„ ๊ฐ€์ง„ ์ƒํ’ˆ๊ตฐ ์‚ฌ์ด์˜ ์—ฐ๊ด€๊ทœ์น™

 

 

 

โ‘ฅ  Constraints or Criteria 

 

• Constraint-based 

โ†ช Those satisfying a set of user-defined constraints

 

• Approximate, compressed, near-match

โ†ช Those that tally the support count of the near or almost matching itemsets

 

• Top-k

โ†ช The k most frequent itemsets for a user-specified value k

 

• Redundancy-aware top-k

โ†ช The top-k patterns with similar or redundant patterns excluded

 

 

โ‘ฆ  Constraints or Criteria  : Constraints in data mining 

 

•  Knowledge type constraint : Classification, association

 

• Data constraint - using SQL like queries : Finding product pairs sold together in stores in Chicago this year → Region area constraints (Chicago), temporal constraints (this year) 

 

• Dimension/level constraint: In relevance to region, price, brand, customer category

 

• Rule (or pattern) constraint : Small sales (price < $10) trigger big sales (sum > $200)

 

• Interestingness constraint : Strong rules: min_support โ‰ฝ 3%, min_confidence โ‰ฝ 60%

 

 

 

 

โ‘ง  Constraints or Criteria  : Compressed Patterns 

•  we do not keep all frequent patterns 

 

 

 

•  closed frequent patterns ๋ฅผ ๊ธฐ์ค€์œผ๋กœ compression ์„ ํ•˜๊ณ ์ž ํ•  ๋•Œ๋Š”, P1, P2, P3, P4, P5 ๋ชจ๋‘๊ฐ€ closed form ์ด๊ธฐ ๋•Œ๋ฌธ์—, ์••์ถ•ํ•  ๊ฒƒ์ด ์—†๋‹ค. 

•  Maximal frequent patterns ๋ฅผ ๊ธฐ์ค€์œผ๋กœ compression ์„ ํ•˜๊ณ ์ž ํ•  ๋•Œ, maximal form ์— ํ•ด๋‹นํ•˜๋Š” ID ๋Š” P3 ์ด๊ธฐ ๋•Œ๋ฌธ์—, 5๊ฐ€์ง€ ID ์— ๋Œ€ํ•ด P3 ๋งŒ์„ ์„ ํƒํ•˜๊ฒŒ ๋œ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ด๋Ÿฌํ•œ ๊ฒฝ์šฐP1,P2,P4,P5 ๋ฅผ ๋ชจ๋‘ ๋ฒ„๋ฆฌ๊ธฐ ๋•Œ๋ฌธ์— ์ •๋ณด์˜ ์†์‹ค์ด ๋ฐœ์ƒํ•œ๋‹ค. 

•  ๊ฐ€์žฅ ์ข‹์€ compression ๋ฐฉ๋ฒ•์€, pattern ๊ฐ„์˜ distance ๋ฅผ ๊ตฌํ•ด์„œ clustering ์„ ํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. 

 

 

 

 

โ‘จ  Constraints or Criteria  : Redundancy-Award Top-k patterns 

 

•  ๋†’์€ significance & ๋‚ฎ์€ redundancy : redundancy award top-k ๋ฐฉ์‹์„ ์ ์šฉํ•œ (b) ๊ฐ€ ํ•ด๋‹น ์กฐ๊ฑด์„ ๋งŒ์กฑํ•œ๋‹ค. 

 

 

 

•  (c)์—์„œ๋Š” significance (๊ฐ€๋ น lift ์ง€ํ‘œ) ๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ๊ฒƒ๋“ค์„ ๊ณ ๋ฅด๊ณ , (d) ์—์„œ๋Š” ์ •๋ณด๊ฐ€ ์ค‘๋ณต๋˜์ง€ ์•Š๋„๋ก clustering ์—์„œ ํ•œ๊ฐœ์”ฉ ๊ณ ๋ฅธ๋‹ค. ์ด ๋‘˜์„ ์ ์ ˆํžˆ ์„ž์€ ๊ฒƒ์ด (b) ์ด๋‹ค. ์„ ํƒ๋œ 3๊ฐœ๋ฅผ ๋ณด๋ฉด, significance ๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ๊ฒƒ ๋“ค ์ค‘ ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ ๋‘ ๊ฐœ๋ฅผ ๊ณ ๋ฅด๊ณ , ๊ทธ ๋‹ค์Œ lift (์ง„ํšŒ์ƒ‰) ๋“ค ์ค‘ ๊ฐ€์žฅ ๊ฑฐ๋ฆฌ๊ฐ€ ์žˆ๋Š” ์˜ค๋ฅธ์ชฝ ๋™๊ทธ๋ผ๋ฏธ๋ฅผ ๊ณ ๋ฅธ๋‹ค. 

 

โ‘ฉ  Kinds of Data and Features 

•  Frequent itemset mining 

•  Sequential patterns : Frequent sequences of ordered events (ex. PC ๋ฅผ ๋จผ์ € ์‚ฌ๊ณ , ๋””์ง€ํ„ธ ์นด๋ฉ”๋ผ๋ฅผ ์‚ฌ๊ณ , ๊ทธ ๋‹ค์Œ์— ๋ฉ”๋ชจ๋ฆฌ์นด๋“œ๋ฅผ ์‚ฌ๊ณ )

•  Structural patterns : Frequent substructures (ex. ํ™”ํ•™ ๊ตฌ์กฐ๋ฌผ)

 

 

•  Application-Domain specific 

โ†ช ๋„๋ฉ”์ธ์— ๋”ฐ๋ผ ๋ฐœ๊ฒฌํ•  ์ˆ˜ ์žˆ๋Š” ํŒจํ„ด์€ ๋‹ค๋ฅด๋‹ค. 

 

•  Data analysis Usage 

โ†ช ๋นˆ๋ฐœํŒจํ„ด๋งˆ์ด๋‹์€ ๋ฐ์ดํ„ฐ ๋ถ„์„์˜ ์ค‘๊ฐ„๋‹จ๊ณ„๋กœ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

728x90

๋Œ“๊ธ€