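Python Object Detection: YOLOv2 Explained, with a Reimplementation of the Prediction Code

Updated: 2022-05-10 18:25:49  Author: Bubbliiiing

This article walks through YOLOv2 for object detection in Python and reimplements its prediction code. Readers who need it are welcome to use it as a reference.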

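Preface

I have recently been studying YOLOv1, YOLOv2 and YOLOv3. I wrote this post mainly to deepen my own understanding of the YOLOv2 architecture and to get clear on what prior boxes (anchors) actually mean.

It is easier to follow if you read it alongside the code.

Download link: https://pan.baidu.com/s/1NAqme8dD2Zoeo1Yd1xQVFw

Extraction code: oq05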

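Implementation approach

1. YOLOv2's prediction approach (how the network is built)

YOLOv2 uses a new classification network, DarkNet19, as its feature extractor. DarkNet19 contains 19 convolutional layers and 5 max-pooling layers. The network relies mostly on 3 x 3 kernels and doubles the number of channels after every pooling step. Borrowing the idea from Network in Network, it places 1 x 1 kernels between the 3 x 3 kernels to compress the features. Batch normalization is used to stabilize training, speed up convergence, and regularize the model.

In addition, it keeps a shortcut that holds an earlier, higher-resolution feature map.

The final output, conv_dec, has shape (13,13,425). The 13x13 means the whole image is divided into a 13x13 grid for prediction, and 425 factors as 85x5. Each 85 splits into 80 and 5: YOLOv2 is usually used with the COCO dataset, which has 80 classes, and the remaining 5 are x, y, w, h and the confidence. The x5 means each cell predicts 5 boxes, one for each of the 5 prior boxes.

In practice, feeding in N images of size 416x416 produces, after many layers, a tensor of shape (N,13,13,425). It holds the predictions for the 5 prior boxes in every cell of the 13x13 grid of each image.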
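As a quick check on the channel arithmetic above, the NumPy sketch below splits a stand-in tensor into its per-anchor parts. The array `raw` is random data standing in for the real network output, and all variable names are illustrative rather than taken from the original code.

```python
import numpy as np

num_anchors, num_classes = 5, 80      # values stated in the text (COCO)
per_anchor = 4 + 1 + num_classes      # x, y, w, h + confidence + class scores

# Random stand-in for the network output of a batch of 2 images.
raw = np.random.rand(2, 13, 13, num_anchors * per_anchor).astype(np.float32)

# Split the 425 channels into 5 anchors x 85 values.
pred = raw.reshape(2, 13, 13, num_anchors, per_anchor)
box_xywh = pred[..., 0:4]       # (2, 13, 13, 5, 4)
confidence = pred[..., 4]       # (2, 13, 13, 5)
class_scores = pred[..., 5:]    # (2, 13, 13, 5, 80)

print(raw.shape[-1])
print(box_xywh.shape, confidence.shape, class_scores.shape)
```

The only point here is that 425 = 5 x (4 + 1 + 80); the real values still have to be decoded against the prior boxes.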

# Excerpt of class methods (hence `self`). Assumes TensorFlow 1.x is imported as `tf`
# and that a `leaky_relu` activation function is defined elsewhere in the project.
def conv2d(self,x,filters_num,filters_size,pad_size=0,stride=1,batch_normalize=True,activation=leaky_relu,use_bias=False,name='conv2d'):
    # Pad explicitly if requested
    if pad_size > 0:
        x = tf.pad(x,[[0,0],[pad_size,pad_size],[pad_size,pad_size],[0,0]])
    # Convolve after padding
    out = tf.layers.conv2d(x,filters=filters_num,kernel_size=filters_size,strides=stride,padding='VALID',activation=None,use_bias=use_bias,name=name)
    # BN sits between the convolution and the activation.
    # A conv followed by BN does not need a bias, and the activation comes last.
    # Apply batch normalization if requested
    if batch_normalize:
        out = tf.layers.batch_normalization(out,axis=-1,momentum=0.9,training=False,name=name+'_bn')
    if activation:
        out = activation(out)
    return out
def maxpool(self,x, size=2, stride=2, name='maxpool'):
    return tf.layers.max_pooling2d(x, pool_size=size, strides=stride,name=name)
def passthrough(self,x, stride):
    # Trade spatial size for depth: smaller height/width, more channels
    return tf.space_to_depth(x, block_size=stride)
def darknet(self):
    x = tf.placeholder(dtype=tf.float32,shape=[None,416,416,3])
    # 416,416,3 -> 416,416,32
    net = self.conv2d(x, filters_num=32, filters_size=3, pad_size=1,
                 name='conv1')
    # 416,416,32 -> 208,208,32
    net = self.maxpool(net, size=2, stride=2, name='pool1')
    # 208,208,32 -> 208,208,64
    net = self.conv2d(net, 64, 3, 1, name='conv2')
    # 208,208,64 -> 104,104,64
    net = self.maxpool(net, 2, 2, name='pool2')
    # 104,104,64 -> 104,104,128
    net = self.conv2d(net, 128, 3, 1, name='conv3_1')
    net = self.conv2d(net, 64, 1, 0, name='conv3_2')
    net = self.conv2d(net, 128, 3, 1, name='conv3_3')
    # 104,104,128 -> 52,52,128
    net = self.maxpool(net, 2, 2, name='pool3')
    # 52,52,128 -> 52,52,256
    net = self.conv2d(net, 256, 3, 1, name='conv4_1')
    net = self.conv2d(net, 128, 1, 0, name='conv4_2')
    net = self.conv2d(net, 256, 3, 1, name='conv4_3')
    # 52,52,256 -> 26,26,256
    net = self.maxpool(net, 2, 2, name='pool4')
    # 26,26,256 -> 26,26,512
    net = self.conv2d(net, 512, 3, 1, name='conv5_1')
    net = self.conv2d(net, 256, 1, 0, name='conv5_2')
    net = self.conv2d(net, 512, 3, 1, name='conv5_3')
    net = self.conv2d(net, 256, 1, 0, name='conv5_4')
    net = self.conv2d(net, 512, 3, 1, name='conv5_5')
    # Keep this feature map for the passthrough layer further down
    shortcut = net
    # 26,26,512 -> 13,13,512
    net = self.maxpool(net, 2, 2, name='pool5')
    # 13,13,512 -> 13,13,1024
    net = self.conv2d(net, 1024, 3, 1, name='conv6_1')
    net = self.conv2d(net, 512, 1, 0, name='conv6_2')
    net = self.conv2d(net, 1024, 3, 1, name='conv6_3')
    net = self.conv2d(net, 512, 1, 0, name='conv6_4')
    net = self.conv2d(net, 1024, 3, 1, name='conv6_5')
    # The layers below are the ones added when training for detection
    net = self.conv2d(net, 1024, 3, 1, name='conv7_1')
    # 13,13,1024 -> 13,13,1024
    net = self.conv2d(net, 1024, 3, 1, name='conv7_2')
    # The shortcut gets an extra conv layer: 64 kernels of size 1*1, followed by passthrough
    # 26*26*512 -> 26*26*64 -> 13*13*256
    shortcut = self.conv2d(shortcut, 64, 1, 0, name='conv_shortcut')
    shortcut = self.passthrough(shortcut, 2)
    # After concatenation: 13*13*(256+1024)
    net = tf.concat([shortcut, net],axis=-1)
    # Merge the channels. The passthrough layer is similar to a ResNet shortcut: it takes the
    # earlier, higher-resolution feature map and concatenates it onto the later, lower-resolution one.
    net = self.conv2d(net, 1024, 3, 1, name='conv8')
    # Detection layer: a final 1*1 conv adjusts the channel count, with no BN and no activation.
    # The output is S*S*(B*(5+C)), here 13*13*425.
    output = self.conv2d(net, filters_num=self.f_num, filters_size=1, batch_normalize=False, activation=None,
                    use_bias=True, name='conv_dec')
    return output,x
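To make the passthrough step concrete, here is a minimal NumPy sketch of a space-to-depth rearrangement for NHWC tensors. The helper `space_to_depth` below is a hypothetical stand-in written for illustration, intended to follow the channel ordering documented for `tf.space_to_depth`; it is not the TensorFlow implementation.

```python
import numpy as np

def space_to_depth(x, block_size):
    """Move each block_size x block_size spatial patch into the channel axis (NHWC)."""
    n, h, w, c = x.shape
    b = block_size
    x = x.reshape(n, h // b, b, w // b, b, c)   # [n, h', dh, w', dw, c]
    x = x.transpose(0, 1, 3, 2, 4, 5)           # [n, h', w', dh, dw, c]
    return x.reshape(n, h // b, w // b, b * b * c)

# Tiny example: a 4x4 single-channel map becomes 2x2 with 4 channels.
tiny = np.arange(16).reshape(1, 4, 4, 1)
print(space_to_depth(tiny, 2)[0, 0, 0])   # the top-left 2x2 patch: [0 1 4 5]

# Shapes from the network above: 26x26x64 -> 13x13x256, then concat with 13x13x1024.
shortcut = np.zeros((1, 26, 26, 64), dtype=np.float32)
deep = np.zeros((1, 13, 13, 1024), dtype=np.float32)
merged = np.concatenate([space_to_depth(shortcut, 2), deep], axis=-1)
print(merged.shape)                        # (1, 13, 13, 1280)
```

No values are discarded by this rearrangement, which is why the higher-resolution detail survives the concatenation with the 13x13 map.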

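2. Generating the prior boxes

For YOLOv1, the final output has shape (7,7,30), which corresponds to two boxes plus the class scores. Training can gradually adjust the box positions, but if we hand the network a set of ready-made box sizes to start from, the result should in theory be better (and in practice it is). That is where prior boxes come from.

YOLOv2's prior boxes are not chosen arbitrarily; they are computed.

Ordinary k-means clusters with Euclidean distance, but that clearly does not suit prior boxes, because larger boxes produce larger Euclidean distances. YOLOv2 instead uses a distance derived from IOU, namely 1 - IOU.

The five cluster centers obtained at the end are the widths and heights of the prior boxes.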
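The difference between the two distances can be seen on a toy case. In the sketch below, a small box and a large box are each compared with a cluster center that is 20% larger in both dimensions. The helper `wh_iou` and the sample numbers are made up for illustration; boxes are (width, height) pairs with their top-left corners aligned.

```python
import numpy as np

def wh_iou(box, centers):
    """IoU between one (w, h) box and each (w, h) center, corners aligned."""
    inter = np.minimum(box[0], centers[:, 0]) * np.minimum(box[1], centers[:, 1])
    union = box[0] * box[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

small, small_center = np.array([0.10, 0.10]), np.array([[0.12, 0.12]])
large, large_center = np.array([0.50, 0.50]), np.array([[0.60, 0.60]])

# Euclidean distance grows with box size even though the relative mismatch is the same.
euclid_small = np.linalg.norm(small - small_center[0])
euclid_large = np.linalg.norm(large - large_center[0])

# 1 - IoU depends only on the relative mismatch.
iou_dist_small = 1 - wh_iou(small, small_center)[0]
iou_dist_large = 1 - wh_iou(large, large_center)[0]

print(euclid_small, euclid_large)       # the large box is about 5x "farther"
print(iou_dist_small, iou_dist_large)   # both are about 0.306
```

With Euclidean distance the large boxes would dominate the clustering, whereas 1 - IOU treats the two cases as equally well matched.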

import numpy as np
import xml.etree.ElementTree as ET
import glob
def cas_iou(box,cluster):
    # IOU between one (w, h) box and every cluster center, with corners aligned
    x = np.minimum(cluster[:,0],box[0])
    y = np.minimum(cluster[:,1],box[1])
    intersection = x * y
    area1 = box[0] * box[1]
    area2 = cluster[:,0] * cluster[:,1]
    iou = intersection / (area1 + area2 -intersection)
    return iou
def avg_iou(box,cluster):
    # Mean, over all boxes, of the best IOU against any cluster center
    return np.mean([np.max(cas_iou(box[i],cluster)) for i in range(box.shape[0])])
def kmeans(box,k):
    # Total number of boxes
    row = box.shape[0]
    # Distance from every box to every cluster center
    distance = np.empty((row,k))
    # Cluster assignment from the previous iteration
    last_clu = np.zeros((row,))
    np.random.seed()
    # Pick k random boxes as the initial cluster centers
    cluster = box[np.random.choice(row,k,replace = False)]
    while True:
        # Distance (1 - IOU) from each box to the k centers
        for i in range(row):
            distance[i] = 1 - cas_iou(box[i],cluster)
        # Index of the nearest center for each box
        near = np.argmin(distance,axis=1)
        # Stop once the assignment no longer changes
        if (last_clu == near).all():
            break
        # Move each center to the median of its members
        for j in range(k):
            cluster[j] = np.median(
                box[near == j],axis=0)
        last_clu = near
    return cluster
def load_data(path):
    data = []
    # Look for boxes in every xml file
    for xml_file in glob.glob('{}/*xml'.format(path)):
        tree = ET.parse(xml_file)
        height = int(tree.findtext('./size/height'))
        width = int(tree.findtext('./size/width'))
        # Read the normalized corner coordinates of every object
        for obj in tree.iter('object'):
            xmin = int(float(obj.findtext('bndbox/xmin'))) / width
            ymin = int(float(obj.findtext('bndbox/ymin'))) / height
            xmax = int(float(obj.findtext('bndbox/xmax'))) / width
            ymax = int(float(obj.findtext('bndbox/ymax'))) / height
            xmin = np.float64(xmin)
            ymin = np.float64(ymin)
            xmax = np.float64(xmax)
            ymax = np.float64(ymax)
            # Store the width and height
            data.append([xmax-xmin,ymax-ymin])
    return np.array(data)
if __name__ == '__main__':
    anchors_num = 5
    # Load the dataset; the VOC xml annotations can be used
    path = '../SSD-Tensorflow-master/VOC2012/Annotations'
    # Load all xml files
    # Boxes are stored as width, height normalized to the image size
    data = load_data(path)
    # Run k-means clustering
    out = kmeans(data,anchors_num)
    print('acc:{:.2f}%'.format(avg_iou(data,out) * 100))
    print(out)
    print('box',out[:,0] * 13,out[:,1] * 13)
    ratios = np.around(out[:,0]/out[:,1],decimals=2).tolist()
    print('ratios:',sorted(ratios))

The result is:

acc:61.32%
[[0.044      0.07733333]
 [0.106      0.17866667]
 [0.408      0.616     ]
 [0.816      0.83      ]
 [0.2        0.38933333]]
box [ 0.572  1.378  5.304 10.608  2.6  ] [ 1.00533333  2.32266667  8.008      10.79        5.06133333]
ratios: [0.51, 0.57, 0.59, 0.66, 0.98]
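The `box` and `ratios` lines in this output follow directly from the five cluster centers. The sketch below recomputes them from the printed values: multiplying by 13 expresses each normalized width and height in grid-cell units, which appears to be the unit the decoding step expects for its anchor sizes (an inference from the division by 13 in that code, not something the source states). The variable names are illustrative.

```python
import numpy as np

# Cluster centers copied from the printed output (normalized width, height).
centers = np.array([
    [0.044, 0.07733333],
    [0.106, 0.17866667],
    [0.408, 0.616],
    [0.816, 0.83],
    [0.2,   0.38933333],
])

grid = 13
anchors_in_cells = centers * grid      # width, height measured in grid cells
ratios = sorted(np.around(centers[:, 0] / centers[:, 1], decimals=2).tolist())

print(anchors_in_cells[:, 0])   # widths in cells
print(anchors_in_cells[:, 1])   # heights in cells
print(ratios)
```

Because k-means starts from randomly chosen boxes, a rerun on the same data will generally produce somewhat different centers.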

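3. Decoding the network output with the prior boxes

YOLOv2's decoding is similar to SSD's but not identical. Of the two, YOLOv2's is easier to follow because it works on a single feature layer.

1. Reshape the network output to [-1, 13 * 13, 5, 80 + 5], i.e. the 5 prior boxes at each of the 169 grid cells.

2. Split out the box values from the 5 in 80+5: indices 0 and 1 are the xy offsets relative to the top-left corner of the grid cell; 2 and 3 describe the width and height; 4 is the confidence.

3. Build a 13x13 grid that gives the position of every cell once the image has been divided 13x13.

4. Apply the decoding formula to compute the actual bbox positions.

The decoding code is as follows: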
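The four steps above can be sketched for a single prediction in plain NumPy. The function `decode_one`, the cell position and the anchor size (2.6, 5.2) are hypothetical values chosen for illustration; anchors are assumed to be expressed in grid-cell units.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_one(t_xywh, cell_xy, anchor_wh, grid=13):
    """Decode one raw prediction into a normalized (xmin, ymin, xmax, ymax) box."""
    tx, ty, tw, th = t_xywh
    cx, cy = cell_xy          # column and row index of the grid cell
    pw, ph = anchor_wh        # prior box width and height, in grid cells
    bx = (cx + sigmoid(tx)) / grid    # center x, squashed to stay inside the cell
    by = (cy + sigmoid(ty)) / grid    # center y
    bw = pw * np.exp(tw) / grid       # the prior width, rescaled
    bh = ph * np.exp(th) / grid       # the prior height, rescaled
    return bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2

# All-zero raw outputs give the prior box itself, centered in its cell.
box = decode_one((0.0, 0.0, 0.0, 0.0), cell_xy=(6, 6), anchor_wh=(2.6, 5.2))
print(box)   # approximately (0.4, 0.3, 0.6, 0.7)
```

The sigmoid keeps the predicted center inside the cell that is responsible for it, and the exponential keeps the width and height positive.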

def decode(self,net):
    self.anchor_size = tf.constant(self.anchor_size,tf.float32)
    # net has shape [batch,169,5,85]
    net = tf.reshape(net, [-1, 13 * 13, self.num_anchors, self.num_class + 5])
    # Within the 85 values: 0 and 1 are the xy offsets, 2 and 3 are the wh offsets,
    # 4 is the confidence, and 5 -> 84 are the per-class probabilities
    # Offsets, confidence, classes
    # The center offset is relative to the top-left corner of its cell; sigmoid maps it into (0,1)
    # [batch,169,5,2]
    xy_offset = tf.nn.sigmoid(net[:, :, :, 0:2])
    wh_offset = tf.exp(net[:, :, :, 2:4])
    obj_probs = tf.nn.sigmoid(net[:, :, :, 4])
    class_probs = tf.nn.softmax(net[:, :, :, 5:])
    # Build the cell coordinates of the 13 x 13 feature map
    height_index = tf.range(self.feature_map_size[0], dtype=tf.float32)
    width_index = tf.range(self.feature_map_size[1], dtype=tf.float32)
    # meshgrid takes the x (width) axis first, so x_cell varies along the columns
    x_cell, y_cell = tf.meshgrid(width_index, height_index)
    x_cell = tf.reshape(x_cell, [1, -1, 1])  # matches the [H*W,num_anchors,num_class+5] layout above
    y_cell = tf.reshape(y_cell, [1, -1, 1])
    # x_cell and y_cell are the top-left corners of the grid cells
    # xy_offset is the offset from that corner
    bbox_x = (x_cell + xy_offset[:, :, :, 0]) / 13
    bbox_y = (y_cell + xy_offset[:, :, :, 1]) / 13
    bbox_w = (self.anchor_size[:, 0] * wh_offset[:, :, :, 0]) / 13
    bbox_h = (self.anchor_size[:, 1] * wh_offset[:, :, :, 1]) / 13
    # Convert center/size to normalized corners: xmin, ymin, xmax, ymax
    bboxes = tf.stack([bbox_x - bbox_w / 2, bbox_y - bbox_h / 2, bbox_x + bbox_w / 2, bbox_y + bbox_h / 2],
                      axis=3)
    return bboxes, obj_probs, class_probs
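The grid construction in decode depends on how meshgrid orders its outputs, which is easy to get backwards. The NumPy sketch below uses a deliberately non-square 2 x 3 grid to show the ordering; `np.meshgrid` and `tf.meshgrid` both default to 'xy' indexing, so the same reasoning should carry over. The sizes and names are illustrative.

```python
import numpy as np

H, W = 2, 3   # non-square on purpose, so a swapped axis would be visible

# The first argument is the x (column) axis, the second the y (row) axis.
x_cell, y_cell = np.meshgrid(np.arange(W), np.arange(H))

x_flat = x_cell.reshape(-1)
y_flat = y_cell.reshape(-1)
print(x_flat)   # [0 1 2 0 1 2]
print(y_flat)   # [0 0 0 1 1 1]

# A feature map of shape (H, W, ...) flattened to H*W puts cell (row, col)
# at index row * W + col, which is exactly the order produced above.
flat_index = y_flat * W + x_flat
print(flat_index)   # [0 1 2 3 4 5]
```

On a square 13 x 13 grid the two index ranges are identical, so the argument order does not change the values; it only matters once height and width differ.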

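4. Score sorting and non-maximum suppression

This part is common to essentially all object detectors.

1. Map every box back to its real position in the image.

2. For each box, find the class with the highest predicted probability.

3. Multiply that highest class probability by the confidence to get the box's score.

4. Filter the scores by threshold and sort them.

5. Apply non-maximum suppression to remove boxes that overlap too much.

The implementation is as follows: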
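Steps 2 to 4 come down to a few array operations. The sketch below runs them on three made-up boxes with three classes; the probabilities and the 0.5 threshold are illustrative values only.

```python
import numpy as np

obj_probs = np.array([0.9, 0.8, 0.3])          # confidence per box
class_probs = np.array([[0.1, 0.7, 0.2],       # class probabilities per box
                        [0.6, 0.3, 0.1],
                        [0.2, 0.2, 0.6]])

# Step 2: most likely class for each box.
class_ids = np.argmax(class_probs, axis=1)
best_class_prob = class_probs[np.arange(len(obj_probs)), class_ids]

# Step 3: score = confidence * best class probability.
scores = obj_probs * best_class_prob

# Step 4: keep scores above the threshold, highest first.
keep = scores > 0.5
order = np.argsort(-scores[keep])
kept_ids = class_ids[keep][order]
kept_scores = scores[keep][order]

print(class_ids)              # [1 0 2]
print(scores)                 # approximately [0.63 0.48 0.18]
print(kept_ids, kept_scores)  # only the first box survives
```

Note that the second box is dropped even though its confidence is 0.8, because its best class probability pulls the combined score below the threshold.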

def bboxes_cut(self,bbox_min_max, bboxes):
    bboxes = np.copy(bboxes)
    bboxes = np.transpose(bboxes)
    bbox_min_max = np.transpose(bbox_min_max)
    # cut the box
    bboxes[0] = np.maximum(bboxes[0], bbox_min_max[0])  # xmin
    bboxes[1] = np.maximum(bboxes[1], bbox_min_max[1])  # ymin
    bboxes[2] = np.minimum(bboxes[2], bbox_min_max[2])  # xmax
    bboxes[3] = np.minimum(bboxes[3], bbox_min_max[3])  # ymax
    bboxes = np.transpose(bboxes)
    return bboxes
def bboxes_sort(self,classes, scores, bboxes, top_k=400):
    index = np.argsort(-scores)
    classes = classes[index][:top_k]
    scores = scores[index][:top_k]
    bboxes = bboxes[index][:top_k]
    return classes, scores, bboxes
def bboxes_iou(self,bboxes1, bboxes2):
    # Boxes are laid out as [xmin, ymin, xmax, ymax]
    bboxes1 = np.transpose(bboxes1)
    bboxes2 = np.transpose(bboxes2)
    int_xmin = np.maximum(bboxes1[0], bboxes2[0])
    int_ymin = np.maximum(bboxes1[1], bboxes2[1])
    int_xmax = np.minimum(bboxes1[2], bboxes2[2])
    int_ymax = np.minimum(bboxes1[3], bboxes2[3])
    int_w = np.maximum(int_xmax - int_xmin, 0.)
    int_h = np.maximum(int_ymax - int_ymin, 0.)
    # Compute the IOU
    int_vol = int_h * int_w  # intersection area
    vol1 = (bboxes1[2] - bboxes1[0]) * (bboxes1[3] - bboxes1[1])  # area of bboxes1
    vol2 = (bboxes2[2] - bboxes2[0]) * (bboxes2[3] - bboxes2[1])  # area of bboxes2
    IOU = int_vol / (vol1 + vol2 - int_vol)  # IOU = intersection / union
    return IOU
# NMS; tf.image.non_max_suppression could be used instead
def bboxes_nms(self,classes, scores, bboxes, nms_threshold=0.2):
    # np.bool was removed in recent NumPy releases; the builtin bool is equivalent
    keep_bboxes = np.ones(scores.shape, dtype=bool)
    for i in range(scores.size - 1):
        if keep_bboxes[i]:
            overlap = self.bboxes_iou(bboxes[i], bboxes[(i + 1):])
            # Keep a box if its IOU is below nms_threshold or it belongs to a different class
            keep_overlap = np.logical_or(overlap < nms_threshold,
                                         classes[(i + 1):] != classes[i])
            keep_bboxes[(i + 1):] = np.logical_and(keep_bboxes[(i + 1):], keep_overlap)
    idxes = np.where(keep_bboxes)
    return classes[idxes], scores[idxes], bboxes[idxes]
def postprocess(self,bboxes, obj_probs, class_probs, image_shape=(416, 416), threshold=0.5):
    bboxes = np.reshape(bboxes, [-1, 4])
    # Map every box back to its real position in the image
    bboxes[:, 0:1] *= float(image_shape[1])
    bboxes[:, 1:2] *= float(image_shape[0])
    bboxes[:, 2:3] *= float(image_shape[1])
    bboxes[:, 3:4] *= float(image_shape[0])
    bboxes = bboxes.astype(np.int32)  # convert to int
    bbox_min_max = [0, 0, image_shape[1] - 1, image_shape[0] - 1]
    # Clip the boxes so they stay inside the image
    bboxes = self.bboxes_cut(bbox_min_max, bboxes)
    # Flatten to 13*13*5
    obj_probs = np.reshape(obj_probs, [-1])
    # Flatten to (13*13*5, 80)
    class_probs = np.reshape(class_probs, [len(obj_probs), -1])
    # Index of the class with the highest probability
    class_max_index = np.argmax(class_probs, axis=1)
    class_probs = class_probs[np.arange(len(obj_probs)), class_max_index]
    # confidence * highest class probability = class-specific score
    scores = obj_probs * class_probs
    # Keep only the boxes whose score exceeds the threshold
    keep_index = scores > threshold
    class_max_index = class_max_index[keep_index]
    scores = scores[keep_index]
    bboxes = bboxes[keep_index]
    # Sort and keep the top_k (400 by default)
    class_max_index, scores, bboxes = self.bboxes_sort(class_max_index, scores, bboxes)
    # NMS
    class_max_index, scores, bboxes = self.bboxes_nms(class_max_index, scores, bboxes)
    return bboxes, scores, class_max_index
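To see the suppression rule in action, here is a small self-contained NMS run on four made-up boxes. It is a sketch of the same per-class greedy idea rather than a copy of the methods above: the functions `iou_one_to_many` and `nms` and all the sample data are hypothetical, and boxes are assumed to be [xmin, ymin, xmax, ymax].

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all as [xmin, ymin, xmax, ymax]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(x2 - x1, 0.0) * np.maximum(y2 - y1, 0.0)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(classes, scores, boxes, iou_threshold=0.2):
    """Greedy per-class NMS: a box is dropped if a higher-scoring box of the
    same class overlaps it by at least iou_threshold."""
    order = np.argsort(-scores)
    classes, scores, boxes = classes[order], scores[order], boxes[order]
    keep = np.ones(len(scores), dtype=bool)
    for i in range(len(scores) - 1):
        if not keep[i]:
            continue
        overlap = iou_one_to_many(boxes[i], boxes[i + 1:])
        suppress = (overlap >= iou_threshold) & (classes[i + 1:] == classes[i])
        keep[i + 1:] &= ~suppress
    return classes[keep], scores[keep], boxes[keep]

boxes = np.array([[10, 10, 110, 110],     # best box of class 0
                  [20, 20, 120, 120],     # overlaps the first, same class
                  [200, 200, 300, 300],   # far away, same class
                  [15, 15, 115, 115]],    # overlaps the first, different class
                 dtype=float)
scores = np.array([0.9, 0.8, 0.7, 0.6])
classes = np.array([0, 0, 0, 1])

kept_classes, kept_scores, kept_boxes = nms(classes, scores, boxes)
print(kept_classes)   # [0 0 1]
print(kept_scores)    # [0.9 0.7 0.6]
```

Only the second box is removed: it shares a class with the first and their IOU is about 0.68, while the fourth box overlaps heavily but belongs to another class.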