Association Rules


1. Objectives
2. Definitions
3. Types of Association Rules
4. Frequent Itemset Generation
5. Apriori Algorithm: Mining Single-Dimension Boolean AR
5.1. Join Step
5.2. Prune Step
5.3. Example
5.4. Pseudo-code
5.5. Challenges
5.6. Improving the Efficiency of Apriori
6. Mining Frequent Itemsets without Candidate Generation
6.1. Mining Frequent Patterns Using FP-Tree
6.2. Major Steps to Mine FP-Trees
7. Multiple-Level Association Rules
7.1. Approach
7.2. Redundancy Filtering

1. Objectives





  • Increase sales and reduce costs

  • What products were often purchased together?

          • Beer and diapers?!

  • What are the subsequent purchases after buying a PC?

  • What kinds of DNA are sensitive to this new drug?

  • Can we automatically classify web documents?

  • Broad applications:

          • Basket data analysis, cross-marketing, catalog design, sale campaign analysis

          • Web log (click stream) analysis, DNA sequence analysis, etc.

      • Example: Items frequently purchased together:

      • Bread  PeanutButter



      • Why associations:

          • Placement

          • Advertising

          • Sales

          • Coupons


2. Definitions





  • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.






  • Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database.



  • Basic Concepts:

    • A set of items: I={x1, …, xk}




    • Transactions: D = {t1, t2, …, tn}, tj ⊆ I




  • A k-Itemset: {Ii1, Ii2, …, Iik} ⊆ I



  • Support of an itemset: Percentage of transactions that contain that itemset.






    • Large (Frequent) itemset: Itemset whose number of occurrences is above a threshold.




    • Example:


I = { Beer, Bread, Jelly, Milk, PeanutButter}

Support of {Bread,PeanutButter} = 3/5 = 60%






  • Association Rules

    • Implication: X  Y where X,Y  I and X Y = ;

    • Support of AR (s) X ⇒ Y:

      • Percentage of transactions that contain X ∪ Y.

      • Probability that a transaction contains X ∪ Y.




    • Confidence of AR (α) X ⇒ Y:

      • Ratio of the number of transactions that contain X ∪ Y to the number that contain X.

      • Conditional probability that a transaction having X also contains Y.



    • Example:







  • For rule A ⇒ C:

Support(A ⇒ C) = P(A ∪ C) = support({A} ∪ {C}) = 50%

confidence(A ⇒ C) = P(C|A) = support({A} ∪ {C}) / support({A}) = 66.6%








  • Another Example:





X Y

Support

Confidence

Bread  Peanutbutter

= 3/5 %= 60%

= (3/5)/(4/5)%=75%

Peanutbutter  Bread

60%

= (3/5)/(3/5)%=100%

Jelly  Milk

0%

0%

Jelly  Peanutbutter

=1/5 % = 20%

= (1/5)/(1/5) % = 100%
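
A quick way to check the support and confidence values above is a few lines of Python. The five transactions below are an assumption: a minimal basket data set chosen to be consistent with the numbers in the table (the original example transactions are not reproduced here), and the helper names support and confidence are illustrative.

# Hypothetical five-transaction basket data, consistent with the table above.
transactions = [
    {"Bread", "Jelly", "PeanutButter"},
    {"Bread", "PeanutButter"},
    {"Bread", "Milk", "PeanutButter"},
    {"Beer", "Bread"},
    {"Beer", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs): P(rhs | lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Bread", "PeanutButter"}))       # 0.6  -> 60%
print(confidence({"Bread"}, {"PeanutButter"}))  # 0.75 -> 75%
print(confidence({"PeanutButter"}, {"Bread"}))  # 1.0  -> 100%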




  • Association Rule Problem:

    • Given a set of items I = {I1, I2, …, Im} and a database of transactions D = {t1, t2, …, tn}, where ti = {Ii1, Ii2, …, Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.


    • NOTE: The support of X ⇒ Y is the same as the support of X ∪ Y.



  • Association Rules techniques:

      • Find all frequent itemsets.

      • Generate strong association rules from the frequent itemsets: those rules must satisfy minimum support and minimum confidence.
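
As a rough illustration of the second step, the sketch below generates strong rules from frequent itemsets that are assumed to have been mined already (the hard-coded supports match the Bread/PeanutButter example above); the dictionary and the function name strong_rules are illustrative, not part of the text.

from itertools import combinations

# Assumed input: frequent itemsets with their supports (output of step 1).
frequent = {
    frozenset({"Bread"}): 0.8,
    frozenset({"PeanutButter"}): 0.6,
    frozenset({"Bread", "PeanutButter"}): 0.6,
}

def strong_rules(frequent, min_conf):
    """Split each frequent itemset into X => Y and keep rules with enough confidence."""
    rules = []
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for lhs in combinations(itemset, r):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                conf = supp / frequent[lhs]   # support(X ∪ Y) / support(X)
                if conf >= min_conf:
                    rules.append((set(lhs), set(rhs), supp, conf))
    return rules

for lhs, rhs, supp, conf in strong_rules(frequent, min_conf=0.7):
    print(lhs, "=>", rhs, f"support={supp:.0%} confidence={conf:.0%}")
# {'Bread'} => {'PeanutButter'} support=60% confidence=75%
# {'PeanutButter'} => {'Bread'} support=60% confidence=100%

By the Apriori property, every left-hand side of a rule is itself a frequent itemset, so its support is always available in the frequent dictionary.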



3. Types of Association Rules





  • Boolean AR:

          • It is a rule that checks whether an item is present or absent.

          • All the examples we have seen so far are Boolean AR.




  • Quantitative AR:

          • It describes associations between quantitative items or attributes.

          • Generally, quantitative values are partitioned into intervals.

          • Example:

Age(X, ”30..39”) ∧ income(X, ”80K..100K”)

 ⇒ buys(X, ”high resolution TV”)


  • Single-Dimension AR:

          • It is a rule that references only one dimension.

          • Example:

buys(X, ”computer”)

 ⇒ buys(X, ”financial_software”)


The single dimension is “buys”


          • The following rule is a multi-dimensional AR:

Age(X, ”30..39”) ∧ income(X, ”80K..100K”)

 ⇒ buys(X, ”high resolution TV”)


  • Multi-level AR

          • It is a set of rules that reference different levels of abstraction.

          • Example:

Age(X,”30..39”)  buys(X, “desktop”)

Age(X,”20..29”)  buys(X, “laptop”)


Laptop  desktop  computer

4. Frequent Itemset Generation





  • Given d items, there are 2^d possible candidate itemsets






  • Brute-force approach:

    • Each itemset in the lattice is a candidate frequent itemset

    • Count the support of each candidate by scanning the database




    • Match each transaction against every candidate

    • Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width ⇒ expensive since M = 2^d!




  • Complexity:

    • Given d unique items:

    • Total number of itemsets = 2^d

    • Total number of possible association rules:

R = 3^d − 2^(d+1) + 1

    • If d = 6, R = 3^6 − 2^7 + 1 = 602 rules





  • Frequent Itemset Generation Strategies

    • Reduce the number of candidates (M)

    • Reduce the number of transactions (N)

      • Reduce size of N as the size of itemset increases

    • Reduce the number of comparisons (NM)

      • Use efficient data structures to store the candidates or transactions

      • No need to match every candidate against every transaction


5. Apriori Algorithm: Mining Single-Dimension Boolean AR





  • It is used to mine Boolean, single-level, and single-dimension ARs.

  • Apriori Principle: if an itemset is frequent, then all of its subsets must also be frequent; equivalently, if any subset of an itemset is not frequent, the itemset itself cannot be frequent.








  • Apriori algorithm:




          • Uses prior knowledge of frequent itemset properties.




          • It is an iterative algorithm known as level-wise search.




          • The search proceeds level-by-level as follows:

            • First determine the set of frequent 1-itemset; L1

            • Second determine the set of frequent 2-itemset using L1: L2

            • Etc.




          • The complexity of computing Li is O(n) where n is the number of transactions in the transaction database.




          • Reduction of search space:

            • In the worst case what is the number of itemsets in a level Li?

            • Apriori uses “Apriori Property”:




          • Apriori Property:

            • It is an anti-monotone property: if a set cannot pass a test, all of its supersets will fail the same test as well.

            • It is called anti-monotone because the property is monotonic in the context of failing a test.

            • All nonempty subsets of a frequent itemset must also be frequent.

            • An itemset I is not frequent if it does not satisfy the minimum support threshold:

P(I) < min_sup




            • If an item A is added to the itemset I, then the resulting itemset I ∪ A cannot occur more frequently than I:

I ∪ A is not frequent
Therefore, P(I ∪ A) < min_sup

        • How does the Apriori algorithm use the “Apriori property”?

          • In the computation of the itemsets in Lk using Lk-1

          • It is done in two steps:

            • Join

            • Prune

5.1. Join Step


        • The set of candidate k-itemsets Ck (from which Lk is selected) is generated by joining Lk-1 with itself:

Lk-1 ⋈ Lk-1




        • Given two itemsets li and lj of Lk-1:

li = li1, li2, li3, …, li(k-2), li(k-1)

lj = lj1, lj2, lj3, …, lj(k-2), lj(k-1)
where the items within li and lj are sorted.


        • li and lj are joined if they agree on their first k-2 items and differ only in their last item (this avoids generating duplicates):

li1 = lj1, li2 = lj2, …, li(k-2) = lj(k-2) and li(k-1) < lj(k-1)




  • The resulting itemset is:

li1,li2,li3,…,li(k-1),lj(k-1)




  • Example of Candidate-generation:

L3={abc, abd, acd, ace, bcd}


Self-joining: L3 ⋈ L3

abcd from abc and abd

acde from acd and ace

5.2. Prune Step





  • Ck is a superset of Lk ⇒ an itemset in Ck may or may not be frequent.

  • Lk: Test each generated itemset against the database:

      • Scan the database to determine the count of each generated itemset and include those that have a count no less than the minimum support count.

      • This may require intensive computation.




  • Use Apriori property to reduce the search space:

      • Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

      • Remove from Ck any k-itemset that has a (k-1)-subset not in Lk-1 (itemsets that are not frequent)

      • Efficient implementation: maintain a hash table of all frequent itemsets.



  • Example of Candidate-generation and Pruning:

L3={abc, abd, acd, ace, bcd}


Self-joining: L3 ⋈ L3


abcd from abc and abd

acde from acd and ace
Pruning:

acde is removed because ade is not in L3

C4={abcd}
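
The join and prune steps can be written as one candidate-generation routine. The sketch below is illustrative (the name apriori_gen and the use of sorted tuples are assumptions); run on the L3 above, it reproduces C4 = {abcd}.

from itertools import combinations

def apriori_gen(L_prev):
    """Join frequent (k-1)-itemsets that differ only in their last item,
    then prune candidates having an infrequent (k-1)-subset."""
    L_prev = [tuple(sorted(s)) for s in L_prev]
    prev_set = set(L_prev)
    k = len(L_prev[0]) + 1
    candidates = set()
    for li in L_prev:
        for lj in L_prev:
            if li[:-1] == lj[:-1] and li[-1] < lj[-1]:              # join step
                cand = li + (lj[-1],)
                # prune step: every (k-1)-subset must be in L_prev
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

L3 = ["abc", "abd", "acd", "ace", "bcd"]
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')}, i.e. C4 = {abcd}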

5.3. Example




5.4. Pseudo-code

Ck : candidate itemsets of size k

Lk : frequent itemsets of size k

L1 = {frequent items};

for (k = 1; Lk != ∅; k++) do

    Ck+1 = candidates generated from Lk;

    for each transaction t in database do

        increment the count of all candidates in Ck+1 that are contained in t;

    endfor;

    Lk+1 = candidates in Ck+1 with min_support;

endfor;

return ∪k Lk;
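
The pseudo-code translates into a small self-contained Python sketch (function and variable names are illustrative). Candidate generation repeats the join/prune logic of Sections 5.1–5.2, and the database is scanned once per level to count supports; the usage example runs it on the nine-transaction database of Section 5.6 below, with a minimum support count of 3 as used there.

from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_support_count):
    """Level-wise Apriori; returns {frequent itemset (frozenset): support count}."""
    # L1: frequent 1-itemsets
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    Lk = {s: c for s, c in counts.items() if c >= min_support_count}
    all_frequent = dict(Lk)

    k = 2
    while Lk:
        # Candidate generation (join + prune) from the frequent (k-1)-itemsets.
        prev = sorted(tuple(sorted(s)) for s in Lk)
        prev_set = set(prev)
        Ck = set()
        for i, li in enumerate(prev):
            for lj in prev[i + 1:]:
                if li[:-1] == lj[:-1]:                              # join
                    cand = li + (lj[-1],)
                    if all(sub in prev_set for sub in combinations(cand, k - 1)):
                        Ck.add(frozenset(cand))                     # survived pruning
        # Support counting: one database scan for this level.
        counts = defaultdict(int)
        for t in transactions:
            t = set(t)
            for cand in Ck:
                if cand <= t:
                    counts[cand] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_support_count}
        all_frequent.update(Lk)
        k += 1
    return all_frequent

# Nine-transaction database T100..T900 from Section 5.6, min support count = 3.
db = [{"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"}, {"I1", "I3"},
      {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"}]
for itemset, count in sorted(apriori(db, 3).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), count)
# ['I1'] 6, ['I2'] 7, ['I3'] 6, ['I1','I2'] 4, ['I1','I3'] 4, ['I2','I3'] 4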

5.5. Challenges





  • Multiple scans of transaction database

  • Huge number of candidates

  • Tedious workload of support counting for candidates

  • Improving Apriori:

          • general ideas

          • Reduce passes of transaction database scans

          • Shrink number of candidates

          • Facilitate support counting of candidates




  • Easily parallelized




5.6. Improving the Efficiency of Apriori





  • Several attempts have been introduced to improve the efficiency of Apriori:

    • Hash-based technique

      • Hashing itemset counts

      • Example:



        • Transaction DB:




TID     List of items
T100    I1, I2, I5
T200    I2, I4
T300    I2, I3
T400    I1, I2, I4
T500    I1, I3
T600    I2, I3
T700    I1, I3
T800    I1, I2, I3, I5
T900    I1, I2, I3


    • Create a hash table for candidate 2-itemsets:




      • Generate all 2-itemsets for each transaction in the transaction DB




      • H(x,y) = ((order of x) * 10 + (order of y)) mod 7




      • A 2-itemset whose corresponding bucket count is below the support threshold cannot be frequent.



Bucket address    Bucket count    Bucket contents
0                 2               {I1,I4}, {I3,I5}
1                 2               {I1,I5}, {I1,I5}
2                 4               {I2,I3}, {I2,I3}, {I2,I3}, {I2,I3}
3                 2               {I2,I4}, {I2,I4}
4                 2               {I2,I5}, {I2,I5}
5                 4               {I1,I2}, {I1,I2}, {I1,I2}, {I1,I2}
6                 4               {I1,I3}, {I1,I3}, {I1,I3}, {I1,I3}


  • Remember: support(xy) = percentage number of transactions that contain x and y. Therefore, if the minimum support is 3, then the itemsets in buckets 0, 1, 3, and 4 cannot be frequent and so they should not be included in C2.



    • Transaction reduction

      • Reduce the number of transactions scanned in future iterations.

      • A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets: do not include such transactions in subsequent scans.




    • Other techniques include:

      • Partitioning (partition the data to find candidate itemsets)

      • Sampling (Mining on a subset of the given data)

      • Dynamic itemset counting (Adding candidate itemsets at different points during a scan)

6. Mining Frequent Itemsets without Candidate Generation





  • Objectives:

    • The bottleneck of Apriori: candidate generation

    • Huge candidate sets:

      • For 10^4 frequent 1-itemsets, Apriori will generate 10^7 candidate 2-itemsets.

      • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate

2^100 ≈ 10^30 candidates.

    • Multiple scans of database:

      • Needs (n +1) scans, n is the length of the longest pattern.



  • Principle

    • Compress a large database into a compact, Frequent-Pattern tree (FP-tree) structure

      • Highly condensed, but complete for frequent pattern mining

      • Avoid costly database scans

    • Develop an efficient, FP-tree-based frequent pattern mining method

      • A divide-and-conquer methodology: decompose mining tasks into smaller ones

      • Avoid candidate generation: sub-database test only!

  • Algorithm:

  1. Scan DB once, find frequent 1-itemset (single item pattern)

  2. Order frequent items in descending order of frequency, called the L order (in the example below: f:4, c:4, a:3, etc.)

  3. Scan DB again and construct FP-tree

    1. Create the root of the tree and label it null or {}

    2. The items in each transaction are processed in the L order (sorted according to descending support count).

    3. Create a branch for each transaction

    4. Branches share common prefixes



  • Example: min_support = 0.5






TID    Items bought                   (Ordered) frequent items
100    {f, a, c, d, g, i, m, p}       {f, c, a, m, p}
200    {a, b, c, f, l, m, o}          {f, c, a, b, m}
300    {b, f, h, j, o}                {f, b}
400    {b, c, k, s, p}                {c, b, p}
500    {a, f, c, e, l, p, m, n}       {f, c, a, m, p}


  • Node Structure: each FP-tree node stores

    • item name

    • count

    • node-link (pointer to the next node in the tree carrying the same item)

    • child pointers

    • parent pointer
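
A minimal sketch of this node structure and of the construction steps listed in the algorithm above, assuming a simple FPNode class (class and field names are illustrative). It builds the FP-tree for the ordered transactions of the example, sharing common prefixes and chaining nodes that carry the same item through node-links reachable from a header table.

class FPNode:
    """FP-tree node: item, count, parent pointer, child pointers, node-link."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode
        self.node_link = None     # next node in the tree carrying the same item

def build_fp_tree(ordered_transactions):
    root = FPNode(None, None)     # root labelled null / {}
    header = {}                   # item -> first node of that item's node-link chain
    for trans in ordered_transactions:
        node = root
        for item in trans:        # items already sorted in L order
            child = node.children.get(item)
            if child is None:     # new branch: create the node and link it into the chain
                child = FPNode(item, node)
                node.children[item] = child
                child.node_link = header.get(item)
                header[item] = child
            child.count += 1      # shared prefixes only increment counts
            node = child
    return root, header

# Ordered frequent items from the example (L order: f, c, a, b, m, p).
ordered = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
root, header = build_fp_tree(ordered)
print(root.children["f"].count)                  # 4
print(root.children["f"].children["c"].count)    # 3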


6.1. Mining Frequent Patterns Using FP-Tree


  • General idea (divide-and-conquer)

    • Recursively grow frequent pattern path using the FP-tree

  • Method

    • For each item, construct its conditional pattern-base, and then its conditional FP-tree

    • Recursion: Repeat the process on each newly created conditional FP-tree

    • Until the resulting FP-tree is empty, or it contains only one path (single path will generate all the combinations of its sub-paths, each of which is a frequent pattern)


6.2. Major Steps to Mine FP-Trees





  • Main Steps:

  1. Construct conditional pattern base for each node in the FP-tree

  2. Construct conditional FP-tree from each conditional pattern-base

  3. Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far. If the conditional FP-tree contains a single path, simply enumerate all the patterns.



  • Step 1: From FP-tree to Conditional Pattern Base



      • Start at the frequent-item header table of the FP-tree.

      • Traverse the FP-tree by following the node-links of each frequent item, starting with the item with the highest frequency.

      • Accumulate all the transformed prefix paths of that item to form its conditional pattern base.


  • Example:



Conditional pattern bases:

Item    Conditional pattern base
c       f:3
a       fc:3
b       fca:1, f:1, c:1
m       fca:2, fcab:1
p       fcam:2, cb:1
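
The conditional pattern bases above can be reproduced with a short sketch. Rather than walking the tree's node-links, it reads each item's prefix directly off the ordered transactions, which gives the same result because every FP-tree path is a merged ordered transaction; the variable names are illustrative.

from collections import defaultdict

# Ordered transactions of the FP-tree example (L order: f, c, a, b, m, p).
ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

# For each item, collect the prefix paths that lead to it, with their counts.
pattern_bases = defaultdict(lambda: defaultdict(int))
for trans in ordered:
    for i, item in enumerate(trans):
        prefix = trans[:i]
        if prefix:                                # the first item has no prefix path
            pattern_bases[item][prefix] += 1

for item in "cabmp":
    print(item, dict(pattern_bases[item]))
# c {'f': 3}
# a {'fc': 3}
# b {'fca': 1, 'f': 1, 'c': 1}
# m {'fca': 2, 'fcab': 1}
# p {'fcam': 2, 'cb': 1}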



  • Properties of FP-tree for Conditional Pattern Base Construction:

    • Node-link property

      • For any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's head in the FP-tree header.

    • Prefix path property

      • To calculate the frequent patterns for a node ai in a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count should carry the same count as node ai.


  • Step 2: Construct Conditional FP-tree




    • For each pattern-base

      • Accumulate the count for each item in the base

      • Construct the FP-tree for the frequent items of the pattern base

    • Example:

m-conditional pattern base: fca:2, fcab:1




All frequent patterns concerning m:

m,

fm, cm, am,

fcm, fam, cam,

fcam
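
Because the m-conditional FP-tree is the single path f:3 – c:3 – a:3, the patterns above are simply every combination of the path items with the suffix m appended. A minimal sketch (the function name is illustrative):

from itertools import combinations

def patterns_from_single_path(path_items, suffix):
    """All combinations of the single-path items, each extended with the suffix."""
    patterns = [suffix]
    for r in range(1, len(path_items) + 1):
        for combo in combinations(path_items, r):
            patterns.append("".join(combo) + suffix)
    return patterns

# m-conditional FP-tree: the single path f:3 -> c:3 -> a:3
print(patterns_from_single_path("fca", "m"))
# ['m', 'fm', 'cm', 'am', 'fcm', 'fam', 'cam', 'fcam']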



  • Mining Frequent Patterns by Creating Conditional Pattern-Bases:


Item    Conditional pattern-base       Conditional FP-tree
p       {(fcam:2), (cb:1)}             {(c:3)}|p
m       {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)}|m
b       {(fca:1), (f:1), (c:1)}        Empty
a       {(fc:3)}                       {(f:3, c:3)}|a
c       {(f:3)}                        {(f:3)}|c
f       Empty                          Empty







  • Why is FP-Tree mining fast?

    • The performance study shows FP-growth is an order of magnitude faster than Apriori

    • Reasoning:

      • No candidate generation, no candidate test

      • Use compact data structure

      • Eliminate repeated database scan

      • Basic operation is counting and FP-tree building


  • FP-Growth vs. Apriori: Scalability with the support Threshold [Jiawei Han and Micheline Kamber]



7. Multiple-Level Association Rules





  • Items often form hierarchy.

  • Items at the lower level are expected to have lower support.

  • Rules regarding itemsets at appropriate levels could be quite useful.

  • Transaction database can be encoded based on dimensions and levels

  • We can explore shared multi-level mining






7.1. Approach





  • A top-down, progressive deepening approach:

      • First find high-level strong rules (shown as [support, confidence]):

milk ⇒ bread [20%, 60%].




      • Then find their lower-level “weaker” rules:

2% milk ⇒ wheat bread [6%, 50%].




  • Variations at mining multiple-level association rules.

      • Level-crossed association rules:

2% milk ⇒ Wonder wheat bread




  • Association rules with multiple, alternative hierarchies:

2% milk ⇒ Wonder bread




  • Two strategies for multiple-level association mining:

      • Uniform Support

      • Reduced support




  • Uniform Support: the same minimum support for all levels




      • One minimum support threshold.

      • No need to examine itemsets containing any item whose ancestors do not have minimum support.

      • Drawback:

          • Lower-level items do not occur as frequently. If the support threshold is

too high ⇒ miss low-level associations

too low ⇒ generate too many high-level associations



  • Reduced Support: reduced minimum support at lower levels

      • There are 4 search strategies:

          • Level-by-level independent

          • Level-cross filtering by k-itemset

          • Level-cross filtering by single item

          • Controlled level-cross filtering by single item




      • Level-by-Level independent:

          • Full-breadth search

          • No background knowledge is used.

          • Each node is examined regardless the frequency of its parent.




      • Level-cross filtering by single item:

          • An item at the ith level is examined if and only if its parent node at the (i-1)th level is frequent.

      • Level-cross filtering by k-itemset:

          • A k-itemset at the ith level is examined if and only if its corresponding parent k-itemset at the (i-1)th level is frequent.

          • This restriction is stronger than the one in level-cross filtering by single item

          • There are usually not many k-itemsets that, when combined, are also frequent

 ⇒ many valuable patterns may be filtered out (missed)




      • Controlled level-cross filtering by single item:

          • A variation of the level-cross filtering by single item: Relax the constraint in this approach

          • Allow the children of items that do not satisfy the minimum support threshold to be examined if these items satisfy the level passage threshold:


level_passage_sup


          • The level_passage_sup value is typically set between the min_sup value of the given level and the min_sup value of the next (lower) level.

          • Example:




7.2. Redundancy Filtering





  • Some rules may be redundant due to “ancestor” relationships between items.




  • Definition: A rule R1 is an ancestor of a rule, R2, if R1 can be obtained by replacing the items in R2 by their ancestors in a concept hierarchy.




  • Example

R1: milk  wheat bread [support = 8%, confidence = 70%]





R2: 2% milk  wheat bread [support = 2%, confidence = 72%]
Milk in R1 is an ancestor of 2% milk in R2.


  • We say the first rule is an ancestor of the second rule.

  • A rule is redundant if its support is close to the “expected” value, based on the rule’s ancestor:



      • R2 is redundant since its confidence is close to the confidence of R1 (as expected) and its support is about what we would expect from R1: 8% × ¼ = 2% (assuming roughly a quarter of the milk sold is 2% milk).

      • R2 does not add any additional information.
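
A minimal sketch of this redundancy test. The tolerance parameter and the assumption that roughly a quarter of the milk sold is 2% milk are illustrative choices, not values prescribed by the text.

def expected_support(ancestor_support, item_fraction):
    """Support expected for the specialised rule, e.g. 8% * 1/4 = 2% for 2% milk."""
    return ancestor_support * item_fraction

def is_redundant(rule_support, ancestor_support, item_fraction, tolerance=0.25):
    """Flag a rule whose support is close to the value expected from its ancestor."""
    expected = expected_support(ancestor_support, item_fraction)
    return abs(rule_support - expected) <= tolerance * expected

# R1: milk => wheat bread, support 8%; assume ~1/4 of the milk sold is 2% milk.
print(expected_support(0.08, 0.25))     # ~0.02: R2's 2% support is about what is expected
print(is_redundant(0.02, 0.08, 0.25))   # True: R2 adds no extra information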





