Example Code for Implementing the K-Nearest Neighbors Algorithm in Python

Updated: 2022-09-07 15:57:09  Author: AI悦创

This article presents example code that implements the k-nearest neighbors (KNN) algorithm in Python, for readers who need a reference.

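1. Introduction

The k-nearest neighbors algorithm (K-Nearest Neighbour algorithm), also known as KNN, is one of the conceptually simplest algorithms in data mining.

How it works: given a training dataset whose class labels are known, when new unlabeled data arrives, find the k instances in the training set that are closest to the new data. If the majority of these k instances belong to one class, the new data is assigned to that class. Put simply, the k points closest to X vote on which class X belongs to.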

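2. Steps of the k-nearest neighbors algorithm

(1) Compute the distance between the current point and every point in the labeled dataset;

(2) Sort the points by increasing distance;

(3) Select the k points closest to the current point;

(4) Count how often each class occurs among these k points;

(5) Return the most frequent class among these k points as the predicted class of the current point.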
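The five steps above can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not the implementation used later in the article: the function name knn_predict and the toy points and labels are made up for this example.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote of its k nearest neighbors."""
    # (1) distance from the query to every labeled point
    distances = [math.dist(p, query) for p in train_points]
    # (2) indices of the training points, ordered by increasing distance
    order = sorted(range(len(distances)), key=distances.__getitem__)
    # (3) keep the k closest points
    nearest = order[:k]
    # (4) count how often each label occurs among them
    votes = Counter(train_labels[i] for i in nearest)
    # (5) return the most frequent label
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points of class "A" and two of class "B"
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_predict(points, labels, (0.2, 0.1), k=3))  # prints B
```

With k=3 the query's three nearest neighbors are the two "B" points and one "A" point, so the vote goes to "B".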

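3. Python Implementation

Predict the genre of a movie (comedy, action, or romance) from the number of scenes of each type it contains.

#	Movie title	Comedy scenes	Hugging scenes	Fight scenes	Genre
0	功夫熊貓	39	0	31	Comedy
1	葉問3	3	2	65	Action
2	倫敦陷落	2	3	55	Action
3	代理情人	9	38	2	Romance
4	新步步驚心	8	34	17	Romance
5	諜影重重	5	2	57	Action
6	功夫熊貓	39	0	31	Comedy
7	美人魚	21	17	5	Comedy
8	寶貝當家	45	2	9	Comedy
9	唐人街探案	23	3	17	?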

Euclidean distance
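For reference, the standard Euclidean distance between two points with n features can be written as follows. The symbols x, y, and n are notation introduced here for illustration; they do not appear in the original text.

```latex
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
```

For the movie data n = 3. As a worked example, the distance between 唐人街探案 (23, 3, 17) and 美人魚 (21, 17, 5) is \sqrt{2^2 + 14^2 + 12^2} = \sqrt{344} \approx 18.55.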

Build the dataset

import pandas as pd

rowdata = {
    "Movie title": ['功夫熊貓', '葉問3', '倫敦陷落', '代理情人', '新步步驚心', '諜影重重', '功夫熊貓', '美人魚', '寶貝當家'],
    "Comedy scenes": [39, 3, 2, 9, 8, 5, 39, 21, 45],
    "Hugging scenes": [0, 2, 3, 38, 34, 2, 0, 17, 2],
    "Fight scenes": [31, 65, 55, 2, 17, 57, 31, 5, 9],
    "Genre": ["Comedy", "Action", "Action", "Romance", "Romance", "Action", "Comedy", "Comedy", "Comedy"]
}
movie_data = pd.DataFrame(rowdata)  # one row per labeled movie

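Compute the distance between the current point and every point in the labeled dataset

new_data = [23, 3, 17]  # 唐人街探案: comedy, hugging, and fight scenes
# columns 1 to 3 hold the three numeric features
dist = list((((movie_data.iloc[:, 1:4] - new_data) ** 2).sum(axis=1)) ** 0.5)

Sort the distances in ascending order, then select the k points with the smallest distances (the choice of k affects how easily the model overfits; this will be covered in a later column)

k = 4
dist_l = pd.DataFrame({'dist': dist, 'labels': movie_data.iloc[:, 4]})  # column 4 is the genre
dr = dist_l.sort_values(by='dist')[:k]

Count how often each class occurs among the first k points

re = dr.loc[:, 'labels'].value_counts()  # sorted by count, most frequent first
re.index[0]

Select the most frequent class as the predicted class of the current point

result = []
result.append(re.index[0])
result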

4. Predicting match quality on a dating website

import pandas as pd

# Load the dataset (no header row)
datingTest = pd.read_table('datingTestSet.txt', header=None)
datingTest.head()

# Explore the data
%matplotlib inline
import matplotlib.pyplot as plt

# Use a different color for each label
Colors = []
for i in range(datingTest.shape[0]):
    m = datingTest.iloc[i, -1]  # label of row i
    if m == 'didntLike':
        Colors.append('black')
    if m == 'smallDoses':
        Colors.append('orange')
    if m == 'largeDoses':
        Colors.append('red')

# Draw a scatter plot for each pair of features
pl = plt.figure(figsize=(12, 8))  # create the figure

fig1 = pl.add_subplot(221)  # 2x2 grid, first cell
plt.scatter(datingTest.iloc[:, 1], datingTest.iloc[:, 2], marker='.', c=Colors)
plt.xlabel('Percentage of time spent playing video games')
plt.ylabel('Liters of ice cream consumed per week')

fig2 = pl.add_subplot(222)
plt.scatter(datingTest.iloc[:, 0], datingTest.iloc[:, 1], marker='.', c=Colors)
plt.xlabel('Frequent flyer miles earned per year')
plt.ylabel('Percentage of time spent playing video games')

fig3 = pl.add_subplot(223)
plt.scatter(datingTest.iloc[:, 0], datingTest.iloc[:, 2], marker='.', c=Colors)
plt.xlabel('Frequent flyer miles earned per year')
plt.ylabel('Liters of ice cream consumed per week')
plt.show()


# Normalize the data (min-max scaling of each column to [0, 1])
def minmax(dataSet):
    minDf = dataSet.min()
    maxDf = dataSet.max()
    normSet = (dataSet - minDf) / (maxDf - minDf)
    return normSet

# Normalize the three feature columns, then reattach the label column
datingT = pd.concat([minmax(datingTest.iloc[:, :3]), datingTest.iloc[:, 3]], axis=1)
datingT.head()
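Normalization matters here because Euclidean distance adds up squared differences across features, so a feature with a large numeric range can swamp the others. The sketch below illustrates this with made-up numbers; the column names and values are hypothetical and are not taken from datingTestSet.txt.

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the effect of scale
df = pd.DataFrame({
    "miles": [40000.0, 10000.0, 25000.0],
    "game_time": [8.0, 2.0, 5.0],
    "ice_cream": [1.0, 0.5, 1.5],
})


def minmax(data):
    """Rescale every column to the range [0, 1]."""
    return (data - data.min()) / (data.max() - data.min())


norm = minmax(df)

# Distance between rows 0 and 1, before and after normalization
raw_dist = ((df.iloc[0] - df.iloc[1]) ** 2).sum() ** 0.5
norm_dist = ((norm.iloc[0] - norm.iloc[1]) ** 2).sum() ** 0.5
print(raw_dist)   # about 30000: almost entirely due to the "miles" column
print(norm_dist)  # 1.5: all three columns now contribute on the same scale
```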

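# Split into a training set and a test set
# Note: despite its name, randSplit does not shuffle; it assumes the rows are already in random order
def randSplit(dataSet, rate=0.9):
    n = dataSet.shape[0]
    m = int(n * rate)
    train = dataSet.iloc[:m, :]
    test = dataSet.iloc[m:, :].reset_index(drop=True)  # new frame, index restarts at 0
    return train, test

train, test = randSplit(datingT)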
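Note that randSplit takes the first 90% of the rows in file order, which is only a fair split if the file is already in random order. When that is not guaranteed, the rows can be shuffled first. The sketch below uses pandas' DataFrame.sample for the shuffle; the function name shuffle_split, the seed parameter, and the demo data are hypothetical.

```python
import pandas as pd


def shuffle_split(data, rate=0.9, seed=0):
    """Shuffle the rows, then split them into a training part and a test part."""
    # sample(frac=1) returns all rows in random order; the seed makes it repeatable
    shuffled = data.sample(frac=1, random_state=seed).reset_index(drop=True)
    m = int(len(shuffled) * rate)
    train_part = shuffled.iloc[:m].copy()
    test_part = shuffled.iloc[m:].reset_index(drop=True)
    return train_part, test_part


# Hypothetical demo data with 10 rows
demo = pd.DataFrame({"x": range(10), "label": list("aabbaabbab")})
train_demo, test_demo = shuffle_split(demo, rate=0.8)
print(len(train_demo), len(test_demo))  # 8 2
```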


# Test the classifier on the dating-site data
def datingClass(train, test, k):
    n = train.shape[1] - 1  # number of feature columns (label column excluded)
    m = test.shape[0]  # number of test rows
    result = []
    for i in range(m):
        # Euclidean distance from test row i to every training row
        dist = list((((train.iloc[:, :n] - test.iloc[i, :n]) ** 2).sum(axis=1)) ** 0.5)
        dist_l = pd.DataFrame({'dist': dist, 'labels': train.iloc[:, n]})
        dr = dist_l.sort_values(by='dist')[:k]
        re = dr.loc[:, 'labels'].value_counts()
        result.append(re.index[0])
    result = pd.Series(result)
    test['predict'] = result  # add the predictions as a new column
    acc = (test.iloc[:, -1] == test.iloc[:, -2]).mean()
    print(f'Model prediction accuracy: {acc}')
    return test


datingClass(train, test, 5)  # the author reports about 95%

5. Handwritten digit recognition

import os
import pandas as pd


# Build the labeled training set
def get_train():
    path = 'digits/trainingDigits'
    trainingFileList = os.listdir(path)
    train = pd.DataFrame()
    img = []  # first column: each image flattened into one string of 0s and 1s
    labels = []  # second column: the label of each image
    for filename in trainingFileList:
        # each file has 32 rows; dtype=str keeps every row as text so leading zeros are not lost
        txt = pd.read_csv(f'{path}/{filename}', header=None, dtype=str)
        # join the 32 rows into a single string
        img.append(''.join(txt.iloc[:, 0]))
        filelable = filename.split('_')[0]  # the label is the part of the file name before the first '_'
        labels.append(filelable)
    train['img'] = img
    train['labels'] = labels
    return train

train = get_train()



# Build the labeled test set
def get_test():
    path = 'digits/testDigits'
    testFileList = os.listdir(path)
    test = pd.DataFrame()
    img = []  # first column: each image flattened into one string of 0s and 1s
    labels = []  # second column: the label of each image
    for filename in testFileList:
        # each file has 32 rows; dtype=str keeps every row as text so leading zeros are not lost
        txt = pd.read_csv(f'{path}/{filename}', header=None, dtype=str)
        # join the 32 rows into a single string
        img.append(''.join(txt.iloc[:, 0]))
        filelable = filename.split('_')[0]  # the label is the part of the file name before the first '_'
        labels.append(filelable)
    test['img'] = img
    test['labels'] = labels
    return test

test = get_test()

# Test the classifier on the handwritten digits
from Levenshtein import hamming

def handwritingClass(train, test, k):
    n = train.shape[0]
    m = test.shape[0]
    result = []
    for i in range(m):
        dist = []
        for j in range(n):
            # Hamming distance between two image strings, kept as a number so it sorts correctly
            d = hamming(train.iloc[j, 0], test.iloc[i, 0])
            dist.append(d)
        dist_l = pd.DataFrame({'dist': dist, 'labels': train.iloc[:, 1]})
        dr = dist_l.sort_values(by='dist')[:k]
        re = dr.loc[:, 'labels'].value_counts()
        result.append(re.index[0])
    result = pd.Series(result)
    test['predict'] = result  # add the predictions as a new column
    acc = (test.iloc[:, -1] == test.iloc[:, -2]).mean()
    print(f'Model prediction accuracy: {acc}')
    return test

handwritingClass(train, test, 3)  # the author reports about 97.8%
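The classifier above measures how different two images are with the Hamming distance, that is, the number of positions at which two equal-length strings differ. A pure-Python equivalent is sketched below for illustration; the function name hamming_distance is made up here and is not part of the Levenshtein package. The last two lines show why distances need to stay numeric when they are sorted: strings are compared character by character, not by value.

```python
def hamming_distance(a, b):
    """Return the number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(ch1 != ch2 for ch1, ch2 in zip(a, b))


print(hamming_distance("0110", "0011"))  # 2

# Numbers sort by value, strings sort lexicographically
print(sorted([9, 10, 2]))        # [2, 9, 10]
print(sorted(["9", "10", "2"]))  # ['10', '2', '9']
```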

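6. Advantages and disadvantages

Advantages

(1) Simple to use and easy to understand, often accurate, theoretically mature, and usable for both classification and regression;

(2) Works with both numeric and discrete data;

(3) Makes no assumptions about the input data;

(4) Well suited to classifying rare events.

Disadvantages

(1) High computational complexity and high space complexity;

(2) Heavy computation, so it is generally not used when the dataset is very large; yet the number of samples per class cannot be too small either, or misclassification becomes likely;

(3) Sensitive to class imbalance (some classes have many samples while others have very few);

(4) Poor interpretability: it does not reveal the underlying meaning of the data.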
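The class imbalance problem can be seen in a tiny example. The sketch below uses hypothetical one-dimensional data with 2 "rare" points and 8 "common" points; the names vote, points, and labels are made up for this illustration. A query sitting right between the two rare points is still outvoted once k is large enough to reach the majority class.

```python
from collections import Counter

# Hypothetical data: two "rare" points near 0, eight "common" points further away
points = [0.0, 0.2, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7]
labels = ["rare"] * 2 + ["common"] * 8


def vote(query, k):
    """Majority label among the k training points closest to `query`."""
    order = sorted(range(len(points)), key=lambda i: abs(points[i] - query))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]


print(vote(0.1, k=1))  # rare
print(vote(0.1, k=5))  # common: 2 rare votes against 3 common votes
```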
