如何寻找最大频繁项集,从大的交易数据文件频繁、文件、数据、最大

2023-09-07 00:06:31 作者:梓陌

我的输入文件包含大量的交易像

I have the input file contains large amount of transactions like

事务ID项目

T1 Bread, milk, coffee, juice
T2 Juice, milk, coffee
T3 Bread, juice
T4 Coffee, milk
T5 Bread, Milk
T6 Coffee, Bread
T7 Coffee, Bread, Juice
T8 Bread, Milk, Juice
T9 Milk, Bread, Coffee, 
T10 Bread
T11 Milk
T12 Milk, Coffee, Bread, Juice

我想每一个独特的项目发生像

i want the occurrence of every unique item like

Item Name Count
Bread  9
Milk   8
Coffee 7
Juice  6

和距离,我想一个一个的FP-tree,现在通过遍历此树是我想要的最大频繁项集如下:

and from that i want an a fp-tree now by traversing this tree i want the maximal frequent itemsets as follows

方法的基本思想是从底部处置在每一个层节点上。 层的概念是不同的层的一个树的共同概念。在一个层的意思的节点中的节点对应相同的项目,并在一个链表中的头表。为在一个层NBN方法的节点将被用于处理节点从左沿链表权利。用NBN方法,两个额外的字段将被添加到在有序的FP树的每个节点。节点N存储的字段标识的N是否是最大的频繁项集信息,与外地伯爵专卖店的支持计数节点的信息在左边。

The basic idea of method is to dispose nodes in each "layer" from bottom to up. The concept of "layer" is different to the common concept of layer in a tree. Nodes in a "layer" mean the nodes correspond to the same item and be in a linked list from the "Head Table". For nodes in a "layer" NBN method will be used to dispose the nodes from left to right along the linked list. To use NBN method, two extra fields will be added to each node in the ordered FP-Tree. The field tag of node N stores the information of whether N is maximal frequent itemset, and the field count’ stores the support count information in the nodes at left.

在图中,以设置在所述第一节点是果汁:2。如果min_sup等于或小于2,那么面包,牛奶,咖啡,果汁是一个最大频繁项目集。首先输出汁:2和设置的咖啡:3的字段标识为假(每个节点的字段标记是真最初)。接着检查是否正确4项集汁:1是果汁的子集:2。如果项集的一个节点果汁:1对应的是果汁的子集:2设置节点的字段标识假。在下面的过程中,当设置节点的字段标识为FALSE我们可以在相同的标记后省略节点。如果min_sup超过2,然后检查是否正确4汁:1是果汁的子集:2。如果项集的一个节点果汁:1对应的是果汁的子集:2然后设置和2'与前一计的总和的节点的'扫描场数的所有节点之后果汁布置,开始处置节点咖啡:3。

In Figure, the first node to be disposed is "juice: 2". If the min_sup is equal to or less than 2 then "bread, milk, coffee, juice" is a maximal frequent itemset. Firstly output juice:2 and set the field tag of "coffee:3" as "false" (the field tag of each node is "true" initially ). Next check whether the right four itemsets juice:1 be the subset of juice:2. If the itemset one node "juice:1" corresponding to is the subset of juice:2 set the field tag of the node "false". In the following process when the field tag of the disposed node is FALSE we can omit the node after the same tagging. If the min_sup is more than 2 then check whether the right four juice:1 is the subset of juice:2. If the itemset one node "juice:1" corresponding to is the subset of juice:2 then set the field count’ of the node with the sum of the former count’ and 2 After all the nodes "juice" disposed ,begin to dispose the node "coffee:3".

任何建议或可用的源$ C ​​$ C,欢迎选购。

Any suggestions or available source code, welcome.

在此先感谢

推荐答案

这可以直接在SQL完成

This can be done directly in SQL

CREATE TABLE dbo.TestTable
( FIELD1 VARCHAR(256) )
GO

INSERT INTO dbo.TestTable(FIELD1) VALUES
('T1 Bread, milk, coffee, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T2 Juice, milk, coffee')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T3 Bread, juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T4 Coffee, milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T5 Bread, Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T6 Coffee, Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T7 Coffee, Bread, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T8 Bread, Milk, Juice')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T9 Milk, Bread, Coffee,')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T10 Bread')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T11 Milk')
INSERT INTO dbo.TestTable(FIELD1) VALUES
('T12 Milk, Coffee, Bread, Juice')
GO

--CREATE INDEX TestIndex ON dbo.TestTable(FIELD1)
--GO

;WITH Numbers AS
(
    SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
    FROM dbo.TestTable T1
    CROSS JOIN dbo.TestTable T2
),
Base AS
(
    SELECT SUBSTRING(FIELD1, 0, CHARINDEX(' ', FIELD1, 0)) AS TRANID,
    UPPER(REPLACE(SUBSTRING(FIELD1, CHARINDEX(' ', FIELD1, 0)+1, DATALENGTH(FIELD1)), ' ', '')) AS ITEMS
    FROM dbo.TestTable
),
Split AS
(
    SELECT TRANID, ITEMS, N, SUBSTRING(ITEMS, N, CHARINDEX(',', ITEMS + ',', N) - N) AS ELEMENT
    FROM Base 
    JOIN Numbers ON N <= DATALENGTH(Base.ITEMS) + 1
    AND SUBSTRING(',' + Base.ITEMS, N, 1) = ','
)
SELECT ELEMENT, COUNT(*) AS TOTAL
FROM Split
GROUP BY ELEMENT
ORDER BY TOTAL DESC

这将返回

BREAD   9
MILK    8
COFFEE  7
JUICE   6
        1  

空项来自逗号交易T9

The empty entry comes from the comma at the end of transaction T9

T9 Milk, Bread, Coffee,