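Data Mining: The Apriori Algorithm Explained, with a Python Implementation
Association rule mining is one of the most active research methods in data mining. It is used to discover relationships between things, and was first applied to finding relationships between different products in supermarket transaction databases (the classic beer-and-diapers example).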
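Basic Concepts
1. Support: support(X-->Y) = |X ∩ Y| / N = (number of records in which all items of X and all items of Y appear together) / (total number of records). For example: support({beer}-->{diapers}) = (records containing both beer and diapers) / (number of records) = 3/5 = 60%.
2. Confidence: confidence(X-->Y) = |X ∩ Y| / |X| = (number of records in which all items of X and all items of Y appear together) / (number of records containing X). For example: confidence({beer}-->{diapers}) = (records containing both beer and diapers) / (records containing beer) = 3/3 = 100%; confidence({diapers}-->{beer}) = (records containing both beer and diapers) / (records containing diapers) = 3/4 = 75%.
A rule that satisfies both the minimum support threshold (min_sup) and the minimum confidence threshold (min_conf) is called a strong rule. An itemset that satisfies the minimum support is called a frequent itemset.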
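The two definitions above can be checked with a few lines of Python. This is a minimal sketch: the five transactions are hypothetical, chosen only so that the counts match the beer-and-diapers figures in the text, and the helper names `count`, `support` and `confidence` are illustrative.

```python
# Hypothetical transactions, chosen so the counts match the example in the text:
# beer appears in 3 records, diapers in 4, and both together in 3.
transactions = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer"},
    {"milk", "eggs"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(x, y, transactions):
    """support(X --> Y): share of all transactions containing X and Y together."""
    return count(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions):
    """confidence(X --> Y): of the transactions containing X, the share that also contain Y."""
    return count(set(x) | set(y), transactions) / count(x, transactions)


print(support({"beer"}, {"diapers"}, transactions))     # 0.6
print(confidence({"beer"}, {"diapers"}, transactions))  # 1.0
print(confidence({"diapers"}, {"beer"}, transactions))  # 0.75
```

Note that support is symmetric in X and Y, while confidence is not, which is why the two directions of the rule give 100% and 75%.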
"How are association rules mined from a large database?" Mining association rules is a two-step process:
1. Find all frequent itemsets: by definition, these itemsets occur at least as often as the predefined minimum support count.
2. Generate strong association rules from the frequent itemsets: by definition, these rules must satisfy both the minimum support and the minimum confidence.
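Step 2 can be sketched as follows, assuming step 1 has already produced the support count of every frequent itemset. The function name `strong_rules`, the dictionary layout and the sample counts are assumptions made for illustration; thresholds are given as fractions between 0 and 1.

```python
from itertools import combinations


def strong_rules(support_counts, n_transactions, min_sup, min_conf):
    """Derive strong rules from frequent itemsets (step 2).

    support_counts maps frozenset itemsets to support counts and is assumed
    to contain every frequent itemset, so each antecedent can be looked up.
    """
    rules = []
    for itemset, n_both in support_counts.items():
        sup = n_both / n_transactions
        if len(itemset) < 2 or sup < min_sup:
            continue
        # Try every non-empty proper subset as the antecedent of a rule.
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                conf = n_both / support_counts[antecedent]
                if conf >= min_conf:
                    rules.append((antecedent, itemset - antecedent, sup, conf))
    return rules


# Hypothetical counts over 5 records, matching the beer-and-diapers example.
counts = {
    frozenset({"beer"}): 3,
    frozenset({"diapers"}): 4,
    frozenset({"beer", "diapers"}): 3,
}
rules = strong_rules(counts, n_transactions=5, min_sup=0.6, min_conf=0.8)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "-->", sorted(consequent), sup, conf)
```

With these thresholds only {beer} --> {diapers} survives: the reverse rule has a confidence of 75%, which falls below the 80% minimum.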
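The Apriori Properties
To reduce the time spent generating frequent itemsets, we should eliminate as early as possible any sets that cannot be frequent. The two Apriori properties do exactly this.
Apriori property 1: if a set is a frequent itemset, then all of its subsets are frequent itemsets. Example: suppose {A,B} is a frequent itemset, that is, the number of records in which A and B appear together is at least the minimum support count min_support. Every record that contains both A and B also contains A and contains B, so the subsets {A} and {B} must each appear at least min_support times, which makes them frequent itemsets as well.
Apriori property 2: if a set is not a frequent itemset, then none of its supersets are frequent itemsets. Example: suppose {A} is not a frequent itemset, that is, A appears fewer than min_support times. Any superset such as {A,B} can appear no more often than A does, so it must also appear fewer than min_support times and cannot be a frequent itemset.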
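Property 2 is what allows a candidate to be discarded without scanning the data: if any of its subsets one level down is not frequent, the candidate cannot be frequent either. A minimal sketch of that check, where the function name `has_infrequent_subset` and the sample itemsets over A, B, C, D are illustrative assumptions:

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of a k-item candidate is not frequent.

    frequent_prev is the set of frequent (k-1)-itemsets, stored as frozensets.
    """
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, k - 1))


# Hypothetical frequent 2-itemsets.
frequent_2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("AD")}

print(has_infrequent_subset(frozenset("ABC"), frequent_2))  # False: AB, AC, BC are all frequent
print(has_infrequent_subset(frozenset("ABD"), frequent_2))  # True: BD is not frequent
```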

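The figure above illustrates the Apriori algorithm at work. Note that when the level-3 candidate itemsets are generated from the level-2 frequent itemsets, {milk, bread, beer} is absent. That is because {bread, beer} is not a level-2 frequent itemset, so the Apriori property rules it out. Once the level-3 frequent itemsets have been generated there are no higher-level candidates, so the algorithm terminates, and {milk, bread, diapers} is the maximal frequent itemset.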
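The level-by-level process described here can be sketched in a few lines of Python. The transactions below are hypothetical (they are not the data behind the figure) and were chosen so that, with a minimum support count of 3, the run ends the same way: {bread, beer} is not frequent at level 2, and {milk, bread, diapers} is the only frequent itemset at level 3. The function name `apriori` and its return layout are assumptions for illustration, not the implementation listed below.

```python
from itertools import combinations

# Hypothetical transactions (not the data behind the figure).
transactions = [
    {"milk", "bread", "diapers"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"diapers", "beer"},
    {"milk", "eggs"},
]


def apriori(transactions, min_count):
    """Return {level: {itemset: support count}} for all frequent itemsets."""
    def frequent(candidates):
        # Count each candidate and keep those meeting the minimum support count.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {frozenset([item]) for t in transactions for item in t}
    level, k, result = frequent(items), 1, {}
    while level:
        result[k] = level
        k += 1
        prev = set(level)
        # Join: union pairs of frequent (k-1)-itemsets that yield exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = frequent(candidates)
    return result


levels = apriori(transactions, min_count=3)
for k, itemsets in levels.items():
    print(k, sorted(sorted(s) for s in itemsets))
# 1 [['beer'], ['bread'], ['diapers'], ['milk']]
# 2 [['beer', 'diapers'], ['bread', 'diapers'], ['bread', 'milk'], ['diapers', 'milk']]
# 3 [['bread', 'diapers', 'milk']]
```

The loop stops when a level produces no frequent itemsets, which here happens at level 4 because a single 3-itemset cannot be joined with anything.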
Python implementation:
Source: taizilongxu/datamining, apriori/apriori.py (branch: master)
# -*- coding: utf-8 -*-
import itertools


class Apriori(object):

    def __init__(self, filename, min_support, item_start, item_end):
        self.filename = filename
        self.min_support = min_support  # minimum support, in percent
        self.min_confidence = 50  # minimum confidence, in percent
        self.line_num = 0  # number of data rows
        self.item_start = item_start  # first item column (1-based)
        self.item_end = item_end  # last item column (inclusive)
        self.location = [[i] for i in range(self.item_end - self.item_start + 1)]
        self.support = self.sut(self.location)
        self.num = list(sorted(set([j for i in self.location for j in i])))  # item indices still in use
        self.pre_support = []  # support, location and num of the previous level
        self.pre_location = []
        self.pre_num = []
        self.item_name = []  # item names
        self.find_item_name()
        self.loop()
        self.confidence_sup()

    def deal_line(self, line):
        "Extract the item columns from one line"
        return [i.strip() for i in line.split(' ') if i][self.item_start - 1:self.item_end]

    def find_item_name(self):
        "Read item_name from the first line of the file"
        with open(self.filename, 'r') as F:
            for index, line in enumerate(F.readlines()):
                if index == 0:
                    self.item_name = self.deal_line(line)
                    break

    def sut(self, location):
        """
        Input:  [[1,2,3],[2,3,4],[1,3,5]...]
        Output: the support count of each position set, e.g. [123,435,234...]
        """
        with open(self.filename, 'r') as F:
            support = [0] * len(location)
            for index, line in enumerate(F.readlines()):
                if index == 0:
                    continue  # skip the header line
                # Extract the item flags of this row
                item_line = self.deal_line(line)
                for index_num, i in enumerate(location):
                    flag = 0
                    for j in i:
                        if item_line[j] != 'T':
                            flag = 1
                            break
                    if not flag:
                        support[index_num] += 1
                self.line_num = index  # total number of rows, excluding the item_name header
        return support

    def select(self, c):
        "Return the candidate position sets for level c"
        stack = []
        for i in self.location:
            for j in self.num:
                if j in i:
                    if len(i) == c:
                        stack.append(i)
                else:
                    stack.append([j] + i)
        # Remove duplicates from the list of lists
        s = sorted([sorted(i) for i in stack])
        location = [key for key, _ in itertools.groupby(s)]
        return location

    def del_location(self, support, location):
        "Remove candidates that do not meet the conditions"
        # Drop candidates below the minimum support
        for index, i in enumerate(support):
            if i < self.line_num * self.min_support / 100:
                support[index] = 0
        # Second Apriori rule: drop candidates that have an infrequent subset
        for index, j in enumerate(location):
            sub_location = [j[:index_loc] + j[index_loc + 1:] for index_loc in range(len(j))]
            flag = 0
            for k in sub_location:
                if k not in self.location:
                    flag = 1
                    break
            if flag:
                support[index] = 0
        # Delete the positions that are no longer needed
        location = [i for i, j in zip(location, support) if j != 0]
        support = [i for i in support if i != 0]
        return support, location

    def loop(self):
        "Iterate over the level-s frequent itemsets"
        s = 2
        while True:
            print('-' * 80)
            print('The', s - 1, 'loop')
            print('location', self.location)
            print('support', self.support)
            print('num', self.num)
            print('-' * 80)
            # Generate the next level of candidates
            location = self.select(s)
            support = self.sut(location)
            support, location = self.del_location(support, location)
            num = list(sorted(set([j for i in location for j in i])))
            s += 1
            if location and support and num:
                self.pre_num = self.num
                self.pre_location = self.location
                self.pre_support = self.support
                self.num = num
                self.location = location
                self.support = support
            else:
                break

    def confidence_sup(self):
        "Compute the confidence of each rule"
        if sum(self.pre_support) == 0:
            print('min_support error')  # the first iteration already failed
        else:
            for index_location, each_location in enumerate(self.location):
                # Build the subsets one level down by removing one item at a time
                del_num = [each_location[:index] + each_location[index + 1:] for index in range(len(each_location))]
                # Keep only subsets that are frequent itemsets of the previous level
                del_num = [i for i in del_num if i in self.pre_location]
                # Look up their support counts from the previous level
                del_support = [self.pre_support[self.pre_location.index(i)] for i in del_num if i in self.pre_location]
                for index, i in enumerate(del_num):  # support and confidence of each rule
                    support = float(self.support[index_location]) / self.line_num * 100  # support, in percent
                    s = [j for index_item, j in enumerate(self.item_name) if index_item in i]
                    if del_support[index]:
                        confidence = float(self.support[index_location]) / del_support[index] * 100
                        if confidence > self.min_confidence:
                            # The printed values are the rule's own support and confidence
                            print(','.join(s), '->>', self.item_name[each_location[index]], ' min_support: ', str(support) + '%', ' min_confidence:', str(confidence) + '%')


def main():
    c = Apriori('basket.txt', 14, 3, 13)
    d = Apriori('simple.txt', 50, 2, 6)


if __name__ == '__main__':
    main()
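Apriori algorithm
Apriori(filename, min_support, item_start, item_end)
Parameters
filename: file name (with path)
min_support: minimum support, as a percentage of the number of records
item_start: position of the first item column (counting from 1)
item_end: position of the last item column (inclusive)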
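How item_start and item_end select columns can be seen by mirroring the deal_line method from the listing in a standalone function. This is a sketch: the function name `select_items`, the header line and the data row are hypothetical, and the 'T' flag for a purchased item follows the check in the sut method.

```python
def select_items(line, item_start, item_end):
    """Split a line on spaces and keep fields item_start..item_end (1-based, inclusive)."""
    fields = [f.strip() for f in line.split(' ') if f]
    return fields[item_start - 1:item_end]


header = "id age beer diapers milk\n"  # hypothetical header line with item names
row = "1 34 T F T\n"                   # hypothetical record; 'T' marks a purchased item

print(select_items(header, 3, 5))  # ['beer', 'diapers', 'milk']
print(select_items(row, 3, 5))     # ['T', 'F', 'T']
```

With item_start=3 and item_end=5 the first two columns are skipped, and the remaining three are treated as items.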
Usage example:
import apriori
c = apriori.Apriori('basket.txt', 11, 3, 13)
Output:
--------------------------------------------------------------------------------
The 1 loop
location [[0], [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]]
support [299, 183, 177, 303, 204, 302, 293, 287, 184, 292, 276]
num [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 2 loop
location [[0, 9], [3, 5], [3, 6], [5, 6], [7, 10]]
support [145, 173, 167, 170, 144]
num [0, 3, 5, 6, 7, 9, 10]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
The 3 loop
location [[3, 5, 6]]
support [146]
num [3, 5, 6]
--------------------------------------------------------------------------------
frozenmeal,beer ->> cannedveg min_support: 14.6% min_confidence: 0.858823529412
cannedveg,beer ->> frozenmeal min_support: 14.6% min_confidence: 0.874251497006
cannedveg,frozenmeal ->> beer min_support: 14.6% min_confidence: 0.843930635838
--------------------------------------------------------------------------------

