Exploring Machine Learning with Python and Scikit-Learn

Updated: October 16, 2017, 17:18:41. Author: Alex


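Hello, %username%!

My name is Alex. I work on machine learning and network graph analysis (mostly theory), and I also develop big data products for a Russian mobile operator. This is my first article online, so please go easy on me.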

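These days many people want to develop efficient algorithms and take part in machine learning competitions, so they come to me and ask: "How do I get started?" Some time ago I led the development of big data analysis tools for media and social networks at an agency of the Russian federal government. I still have some of the documentation my team used, and I am happy to share it with you. It assumes the reader already has a solid background in mathematics and machine learning (my team consisted mainly of graduates of MIPT (Moscow Institute of Physics and Technology) and the School of Data Analysis).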

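This article is a short introduction to data science, a field that has become very popular lately. Machine learning competitions are also multiplying (for example, Kaggle and TudedIT), and their prize funds are usually substantial.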

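R and Python are the two tools data scientists use most often. Each has its strengths and weaknesses, but Python has lately been pulling ahead on all fronts (just my humble opinion, and I use both). Much of this is due to the Scikit-Learn library, which offers thorough documentation and a rich set of machine learning algorithms.
Note that this article focuses mainly on machine learning algorithms. Primary data analysis is usually better done with the Pandas package, and it is easy to do on your own. So let us concentrate on implementation. To be concrete, we assume the input is an object-feature matrix stored in a *.csv file.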

Loading the data

First, the data has to be loaded into memory before we can work with it. Scikit-Learn uses NumPy arrays in its implementation, so we will load the *.csv file with NumPy. Let us download one of the datasets from the UCI Machine Learning Repository.

import numpy as np
from urllib.request import urlopen  # Python 3 location of urlopen

# URL of the dataset
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
# download the file
raw_data = urlopen(url)
# load the CSV file as a NumPy matrix
dataset = np.loadtxt(raw_data, delimiter=",")
# separate the features (columns 0-7) from the target attribute (column 8)
X = dataset[:, 0:8]
y = dataset[:, 8]

We will use this dataset in all the examples below, that is, the feature array X and the target variable values y.
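
The loading and splitting step can also be tried without a network connection. The following is a minimal sketch, assuming a few sample rows in the same nine-column layout (eight features followed by a label); the names csv_text, X_demo and y_demo are hypothetical and used only for illustration.

```python
import io

import numpy as np

# Hypothetical stand-in for the downloaded file: 3 rows, 8 features + 1 label.
csv_text = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

# np.loadtxt accepts any file-like object, so an in-memory buffer works too.
dataset_demo = np.loadtxt(io.StringIO(csv_text), delimiter=",")

# All columns except the last are features; the last column is the target.
X_demo = dataset_demo[:, :-1]
y_demo = dataset_demo[:, -1]

print(X_demo.shape)  # (3, 8)
print(y_demo)        # [1. 0. 1.]
```

Slicing with `:-1` and `-1` avoids hard-coding the number of feature columns, which helps when the same code is reused on a file with a different width.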

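Data normalization

We all know that most gradient methods (on which many machine learning algorithms are based) are sensitive to the scale of the data. So before running the algorithms we should normalize or standardize the data. Normalization rescales every feature so that its values lie between 0 and 1. Standardization preprocesses the data so that each feature has a mean of 0 and a variance of 1. Scikit-Learn already provides functions for both.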
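
As a minimal sketch of the two transformations described above, assuming that scikit-learn's MinMaxScaler and StandardScaler are an acceptable choice here (the article does not name specific functions), and using a small made-up matrix X_demo rather than the real dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Tiny made-up feature matrix: two features on very different scales.
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [3.0, 600.0]])

# Normalization: rescale each feature (column) to the range [0, 1].
X_minmax = MinMaxScaler().fit_transform(X_demo)

# Standardization: shift and scale each feature to mean 0 and variance 1.
X_std = StandardScaler().fit_transform(X_demo)

print(X_minmax)
print(X_std.mean(axis=0), X_std.std(axis=0))
```

Both scalers learn their parameters in fit, so in a real workflow they would typically be fitted on the training data only and then applied to new data with transform.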


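Feature selection

Without a doubt, the most important part of solving a problem is the ability to select, and even create, the right features. These are called feature selection and feature engineering. Feature engineering is a rather creative process that relies more on intuition and domain expertise, but for feature selection there are many ready-made algorithms. Tree-based algorithms, for example, can compute how informative each feature is.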

from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier()
model.fit(X, y)
# display the relative importance of each attribute
print(model.feature_importances_)

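Other methods are based on an efficient search over subsets of features to find the best subset, meaning the one on which the trained model achieves the best quality. Recursive Feature Elimination (RFE) is one such search algorithm, and Scikit-Learn provides it as well.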

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
# create the RFE model and select 3 attributes
rfe = RFE(model, n_features_to_select=3)
rfe = rfe.fit(X, y)
# summarize the selection of the attributes
print(rfe.support_)
print(rfe.ranking_)

Developing algorithms

As I said, Scikit-Learn implements all the basic machine learning algorithms. Let us take a look at some of them.

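Logistic regression

It is most often used for classification problems (binary classification), but multiclass classification (the so-called one-vs-all method) is also possible. The advantage of this algorithm is that it outputs, for every object, the probability of belonging to each class.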

from sklearn import metrics
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
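
The per-class probabilities mentioned above are available through predict_proba. Here is a minimal sketch on synthetic data (an assumption made so the snippet runs on its own; the names X_demo, y_demo and clf are hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data, not the article's dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# max_iter is raised only to avoid convergence warnings on unscaled data.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_demo, y_demo)

# One row per object, one column per class; each row sums to 1.
proba = clf.predict_proba(X_demo[:5])
print(clf.classes_)
print(proba)
```

The column order of the result follows clf.classes_, so proba[:, 1] is the estimated probability of the second class listed there.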

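Naive Bayes

It is also one of the best-known machine learning algorithms; its main task is to recover the distribution density of the training sample. This method often performs well on multiclass classification problems.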

from sklearn import metrics
from sklearn.naive_bayes import GaussianNB

model = GaussianNB()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

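k-nearest neighbors

The kNN (k-nearest neighbors) method is often used as part of a more complex classification algorithm. For example, we can use its estimate as a feature of an object. Sometimes, a simple kNN ...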

from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
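
The idea of using a kNN estimate as a feature for another model could look roughly like the sketch below. It is one possible arrangement, not the author's method: the synthetic data, the choice of logistic regression as the second model, and the names knn_proba and X_stacked are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data, not the article's dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Out-of-fold kNN probability of class 1: each object's estimate comes from
# a model that was fitted without that object, which limits label leakage.
knn = KNeighborsClassifier(n_neighbors=5)
knn_proba = cross_val_predict(knn, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]

# Append the kNN estimate as an extra column and train a second model on it.
X_stacked = np.column_stack([X_demo, knn_proba])
final_model = LogisticRegression(max_iter=1000).fit(X_stacked, y_demo)

print(X_stacked.shape)  # (300, 9)
```

Computing the kNN column with cross_val_predict rather than fit followed by predict on the same rows matters here, because a kNN model queried on its own training points always counts the point itself among its neighbors.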

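Decision trees

Classification and Regression Trees (CART) are often used in problems where objects have categorical features, and they are applied to both regression and classification. Decision trees are well suited to multiclass classification.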

from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier

# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

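Support vector machines

SVM (support vector machine) is one of the most popular machine learning algorithms and is used mainly for classification. Like logistic regression, SVM can perform multiclass classification with the help of the one-vs-all method.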

from sklearn import metrics
from sklearn.svm import SVC
# fit a SVM model to the data
model = SVC()
model.fit(X, y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))
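
For the multiclass case, note that SVC on its own handles several classes with a one-vs-one scheme internally. To get the one-vs-all scheme mentioned above explicitly, one option is to wrap it in OneVsRestClassifier. A minimal sketch, assuming the three-class iris dataset bundled with scikit-learn (it is not used elsewhere in this article):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three-class dataset shipped with scikit-learn, so no download is needed.
X_iris, y_iris = load_iris(return_X_y=True)

# One binary SVM is trained per class: "this class" versus "all the others".
ovr = OneVsRestClassifier(SVC())
ovr.fit(X_iris, y_iris)

print(len(ovr.estimators_))  # 3
print(ovr.predict(X_iris[:3]))
```

The number of fitted binary models equals the number of classes, which is what distinguishes this from one-vs-one, where a model is trained for every pair of classes.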

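Besides classification and regression, Scikit-Learn offers a large number of more sophisticated algorithms, including clustering, as well as implementations of techniques for building composite algorithms such as Bagging and Boosting.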
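
As a rough illustration of the two ensemble techniques just named, here is a sketch that compares a bagging and a boosting classifier with 5-fold cross-validation. The synthetic data, the specific estimators (BaggingClassifier and GradientBoostingClassifier) and their settings are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data, not the article's dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    # Bagging: many trees fitted on bootstrap samples, predictions averaged.
    "bagging": BaggingClassifier(n_estimators=20, random_state=0),
    # Boosting: trees added one after another, each correcting the previous ones.
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 held-out folds rather than on the training data.
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(name, round(scores[name], 3))
```

Scoring with cross_val_score gives an estimate on data the model has not seen, which is usually a fairer comparison than predicting on the rows used for fitting.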

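How to optimize algorithm parameters

One of the hardest steps in building an efficient algorithm is choosing the right parameters. Experience usually makes it easier, but either way we have to search. Fortunately, Scikit-Learn provides many functions to help with this.

As an example, let us look at choosing the regularization parameter, where several values are tried in turn: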

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV  # was sklearn.grid_search before scikit-learn 0.20

# prepare a range of alpha values to test
alphas = np.array([1, 0.1, 0.01, 0.001, 0.0001, 0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))
grid.fit(X, y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)

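Sometimes it is more efficient to sample a parameter value at random from a given range, estimate the quality of the algorithm with that value, and then pick the best one.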

import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha': sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model, param_distributions=param_grid, n_iter=100)
rsearch.fit(X, y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)

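We have now walked through the whole process of working with the Scikit-Learn library, except for writing the results back out to a file. Consider that an exercise for you. One of Python's major advantages over R is its excellent documentation.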
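
For readers who want a starting point for that exercise, one possible approach is sketched below. It assumes the predictions are a one-dimensional NumPy array of integer class labels; the array predicted_demo and the file name predictions_demo.csv are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical predictions, standing in for the output of model.predict(X).
predicted_demo = np.array([1, 0, 1, 1, 0])

# Write one prediction per line, with a single header row and no "#" prefix.
out_path = os.path.join(tempfile.gettempdir(), "predictions_demo.csv")
np.savetxt(out_path, predicted_demo, fmt="%d", header="prediction", comments="")

# Read the file back to confirm that the round trip preserves the values.
restored = np.loadtxt(out_path, skiprows=1, dtype=int)
print(restored)
```

For tables with several columns or mixed types, the Pandas package mentioned earlier is often more convenient than np.savetxt.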
