Backpropagation (BP) #3: Cat Classifier with an N-Layer Neural Network

2019-04-25 / 2020-04-03 · Andy Wang · Backpropagation, Gradient Descent, Logistic Regression, Machine Learning, Neural Network, Optimization Algorithm

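This post builds on the previous one (a 2-layer neural network) and raises both the difficulty and the practicality: this time we implement a neural network with a configurable number of layers and optimize it with backpropagation (BP). As before, the model is a cat classifier built on logistic regression. You are of course free to swap in a different kind of image.

The focus here is on computing and understanding the forward pass and the backward pass in a multi-layer neural network. If you are more interested in the other parts, or something is unclear, read the previous post first: Backpropagation (BP) #2: Cat Classifier with a 2-Layer Neural Network. If the derivation still does not click, start with: Backpropagation (BP) #1: How It Works.

Download the example from GitHub first! The article is easier to follow if you run the code as you read 😀
If you would like more detail on the theory, or think something is wrong, feel free to leave a comment at the end of the post!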

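Algorithm flow

The algorithm for an N-layer neural network follows largely the same flow as the 2-layer network from the previous post. The only difference is that the forward pass and the backward pass must repeat their per-layer computation automatically, as many times as the configured number of layers requires.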

Initialize parameters
$\downarrow$
Forward pass (run N times)
$\downarrow$
Compute cost
$\downarrow$
Backward pass (run N times)
$\downarrow$
Update parameters
$\downarrow$
Check whether iteration has finished (no: return to the forward pass)
$\downarrow$ (yes)
End

Figure (1): Flow of the N-layer backpropagation algorithm
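The flow in Figure (1) can be sketched as a plain loop. This is only an illustration of the control flow: the `train` function, the `steps` dictionary, and the fixed iteration count are assumptions of this sketch, and a one-parameter least-squares model stands in for the neural network so the block stays short.

```python
def train(X, Y, steps, num_iterations):
    """Run the loop from Figure (1) with caller-supplied step functions."""
    params = steps["init"]()                       # initialize parameters
    costs = []
    for _ in range(num_iterations):                # "finished?" check: a fixed count (assumption)
        AL, caches = steps["forward"](X, params)   # forward pass
        costs.append(steps["cost"](AL, Y))         # compute cost
        grads = steps["backward"](AL, Y, caches)   # backward pass
        params = steps["update"](params, grads)    # update parameters
    return params, costs

# Toy stand-in model (not the cat classifier): prediction = w * x, squared error.
toy_steps = {
    "init": lambda: {"w": 0.0},
    "forward": lambda X, p: ([p["w"] * x for x in X], X),   # cache the inputs
    "cost": lambda AL, Y: sum((a - y) ** 2 for a, y in zip(AL, Y)) / len(Y),
    "backward": lambda AL, Y, caches: {
        "dw": sum(2 * (a - y) * x for a, y, x in zip(AL, Y, caches)) / len(Y)
    },
    "update": lambda p, g: {"w": p["w"] - 0.1 * g["dw"]},   # learning rate 0.1 (assumption)
}

params, costs = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], toy_steps, num_iterations=50)
print(params["w"], costs[0], costs[-1])
```

In the N-layer network, the forward and backward steps each loop over the layers internally; the outer loop stays the same.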

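Traditional flow

This is the traditional approach, where the number of layers is fixed. The example in Figure (2) below fixes the network at 4 layers and shows the computation flow of the forward pass and the backward pass, along with how the variables relate to each other. For the model input, each image of size $64 \times 64 \times 3$ is flattened into a vector of size $12288 \times 1$. The training set has $209$ images, so the input matrix $X$ during training has size $12288 \times 209$. The number of neurons in layers 1 through 4 is set to 20, 7, 5, and 1 respectively.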
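As a quick check of the sizes quoted above, here is a minimal sketch. It assumes the images arrive as an array of shape (209, 64, 64, 3) and uses random values in place of the real pictures; the `reshape(m, -1).T` flattening order is one common convention, not necessarily the one used in the repository.

```python
import numpy as np

m = 209
images = np.random.rand(m, 64, 64, 3)   # stand-in for the training pictures

# Flatten each image into one column: (m, 64, 64, 3) -> (12288, m).
X = images.reshape(m, -1).T
print(X.shape)  # (12288, 209)

# Layer sizes from the text: input layer, then layers 1 to 4.
layer_dims = [12288, 20, 7, 5, 1]

# W of layer l has shape (neurons in l, neurons in l-1); b has shape (neurons in l, 1).
shapes = {}
for l in range(1, len(layer_dims)):
    shapes["W" + str(l)] = (layer_dims[l], layer_dims[l - 1])
    shapes["b" + str(l)] = (layer_dims[l], 1)
print(shapes["W1"], shapes["b1"])  # (20, 12288) (20, 1)
```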

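Two items in the forward pass deserve a mention: Linear_cache and Activation_cache, which are also the variable names used in the implementation. They exist because the backward pass reuses some of the values computed during the forward pass, so we store those values ahead of time. A few terms in the backward-pass formulas are drawn in red; these are the variables read back from Linear_cache and Activation_cache. Looking at the backward pass in the figure, you can see that the computation follows a fixed pattern, which is why backpropagation extends naturally to optimizing an N-layer neural network.

Figure (2): Computation flow for optimizing a 4-layer neural network with backpropagation

In Figure (2), the superscript $[1]$ marks layer $1$ of the model, which is also the layer-$1$ computation in the backward pass. The model's input layer is called layer $0$; it is not drawn because of the limited image width. Here is a brief walk through the forward pass: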

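Take the layer-$1$ forward pass as an example: each column of $X$ is one flattened image $[x_0, x_1, \ldots, x_{12287}]^T$. $X$ is multiplied by the weight matrix $W^{[1]}$ of size $20 \times 12288$, and the bias $b^{[1]}$ is added to give $Z^{[1]}$.
$Z^{[1]}$ is then passed through the ReLU activation function, which gives, for each image, the vector $A^{[1]}=[a_0^{[1]}, a_1^{[1]}, \ldots, a_{19}^{[1]}]^T$. Over the whole training set, $A^{[1]}$ has shape $20 \times 209$.
The same process repeats until layer $3$ is done. Layer $4$ is computed in essentially the same way as the first $3$ layers, except that the activation function is Sigmoid.
Finally, you take the sigmoid of the result. If it is greater than 0.5, the image is classified as a cat. The variables Linear_cache and Activation_cache are Python dictionaries; their purpose is to store the values that the backward pass will use.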
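The layer-1 step and the final threshold can be tried in isolation. This is a sketch under stated assumptions: random numbers stand in for the images and for trained weights, the 0.01 weight scale is arbitrary, and the three `ZL` values are made up to show the 0.5 cut-off.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((12288, 209))                     # stand-in for the flattened images
W1 = rng.standard_normal((20, 12288)) * 0.01     # stand-in weights (assumed scale)
b1 = np.zeros((20, 1))

# Layer 1: linear step, then ReLU. b1 is broadcast across the 209 columns.
Z1 = W1 @ X + b1
A1 = np.maximum(0, Z1)
print(Z1.shape, A1.shape)  # (20, 209) (20, 209)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Output layer: sigmoid, then "greater than 0.5 means cat".
ZL = np.array([[-2.0, 0.0, 3.0]])                # made-up pre-activations for 3 images
AL = sigmoid(ZL)
predictions = (AL > 0.5).astype(int)
print(predictions)  # [[0 0 1]]
```

Note that a score of exactly 0.5 is not classified as a cat, because the rule in the text is strictly "greater than".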

Forward pass

Concept

Every layer of the network computes its own $W, b, Z, A$, and these must be stored in order, because the backward pass will use them again.

Implementation

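Read the Notes below before studying Figure (3). The figure shows that the loop only covers layers $1$ through $N-1$ (where $N$ is the number of layers in the network); the last layer's computation sits outside the loop in both the forward pass and the backward pass. The reason is that layers $1$ through $N-1$ all use ReLU as the activation function while the last layer uses Sigmoid, so the calculation differs slightly and cannot be reused as-is. Keeping it separate is also a deliberate choice, to make the whole computation easier to follow.

Notes
$N$ is the number of layers in the neural network.
$i$ is the for-loop index; a superscript $[i]$ on a variable in the figure marks a variable that belongs to layer $i$.
$m$ is shape[1] of the variable $A$ from the previous layer, that is, the number of examples.
The layers are numbered from $1$ to $4$; the images that serve as the model input are called layer $0$.
The number of neurons per layer in this N-layer network is again $[12288, 20, 7, 5, 1]$, which maps onto layers $0$ through $4$ above.
$Y$ holds the label of the image at the corresponding position: $0$ means the image is not a cat and $1$ means it is a cat. It serves as the ground truth during training and testing.

Figure (3): Computation flow for optimizing an N-layer neural network with backpropagation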
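The layer sizes in the Notes fix the shape of every parameter. The sketch below builds a `parameters` dictionary with the `'W1'`, `'b1'`, ... keys that the code in this post reads. The function name `init_parameters` and the $1/\sqrt{n^{[l-1]}}$ weight scaling are assumptions of this sketch; the post's own initializer is not shown here.

```python
import numpy as np

def init_parameters(layer_dims, seed=1):
    """Return {'W1': ..., 'b1': ..., ...} for the given layer sizes (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    parameters = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_cur = layer_dims[l - 1], layer_dims[l]
        # Random weights scaled by 1/sqrt(n_prev) (assumed scheme); zero biases.
        parameters["W" + str(l)] = rng.standard_normal((n_cur, n_prev)) / np.sqrt(n_prev)
        parameters["b" + str(l)] = np.zeros((n_cur, 1))
    return parameters

parameters = init_parameters([12288, 20, 7, 5, 1])
L = len(parameters) // 2   # two entries (W and b) per layer
print(L)  # 4
```

Because each layer contributes exactly one `W` and one `b`, the number of layers can be recovered as `len(parameters) // 2`.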

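The NL_forwardpass function below computes the forward pass. It differs from Figure (3) in one respect: the variable that holds the number of layers is named L (see the line `L = len(parameters) // 2`) rather than $N$. Also note that linear_cache and activation_cache are packed into a tuple and appended to the list caches, which is returned when the function finishes.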

import numpy as np

def NL_forwardpass(X, parameters):
    """
    Implement forward propagation for the [LINEAR->RELU]*(L-1)->LINEAR->SIGMOID computation

    Arguments:
    X -- data, numpy array of shape (input size, number of examples)
    parameters -- dictionary with keys 'W1', 'b1', ..., 'WL', 'bL'
                  (output of initialize_parameters_deep())

    Returns:
    AL -- last post-activation value
    caches -- list of (linear_cache, activation_cache) tuples:
        one per LINEAR->RELU layer (there are L-1 of them, indexed from 0 to L-2)
        one for the LINEAR->SIGMOID layer (indexed L-1)
    """
    caches = []
    A = X
    L = len(parameters) // 2   # number of layers: one W and one b per layer

    # Layers 1 to L-1: LINEAR -> RELU
    for l in range(1, L):
        linear_cache = {}
        activation_cache = {}
        A_prev = A
        W = parameters['W' + str(l)]
        b = parameters['b' + str(l)]
        # Linear calculation (store what the backward pass needs)
        linear_cache["A_prev" + str(l)] = A_prev
        linear_cache["W" + str(l)] = W
        linear_cache["b" + str(l)] = b
        Z = W.dot(A_prev) + b
        # Activation: ReLU
        A = np.maximum(0, Z)
        activation_cache["Z" + str(l)] = Z
        cache = (linear_cache, activation_cache)
        caches.append(cache)

    ## The final layer: LINEAR -> SIGMOID
    linear_cache = {}
    activation_cache = {}
    W = parameters['W' + str(L)]
    b = parameters['b' + str(L)]
    # Linear calculation
    linear_cache["A_prev" + str(L)] = A
    linear_cache["W" + str(L)] = W
    linear_cache["b" + str(L)] = b
    ZL = W.dot(A) + b
    # Activation: Sigmoid
    AL = 1/(1+np.exp(-ZL))
    activation_cache["Z" + str(L)] = ZL
    cache = (linear_cache, activation_cache)
    caches.append(cache)

    assert(AL.shape == (1, X.shape[1]))
    return AL, caches
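The next step in the flow is computing the cost from `AL`. The post does not list that function, so the following is a sketch rather than the repository's code: it assumes the binary cross-entropy cost, which is the cost whose derivative matches the `dAL` expression used in NL_backwardpass. The name `compute_cost` and the `eps` clipping (to avoid `log(0)`) are additions of this sketch.

```python
import numpy as np

def compute_cost(AL, Y, eps=1e-12):
    """Binary cross-entropy averaged over the m examples (hypothetical helper)."""
    m = Y.shape[1]
    AL = np.clip(AL, eps, 1 - eps)   # keep log() finite when AL hits 0 or 1
    cost = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    return float(cost)

AL = np.array([[0.9, 0.2, 0.6]])   # made-up predictions for three images
Y = np.array([[1, 0, 1]])          # their labels
print(compute_cost(AL, Y))
```

Confident, correct predictions push the cost toward 0, while a confident wrong prediction makes it grow quickly.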


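Backward pass

Concept

The concept is essentially unchanged. During implementation you simply need to test repeatedly to make sure each parameter belongs to the layer you intend.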
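One way to do that kind of testing is a numerical gradient check: nudge each parameter slightly, measure how the cost changes, and compare with the gradient from the backward pass. The sketch below is a condensed stand-in, not the post's NL_forwardpass and NL_backwardpass: it stores caches as plain tuples, uses the combined sigmoid and cross-entropy shortcut `dZ = AL - Y`, and all function names, the tiny layer sizes, and the random data are assumptions.

```python
import numpy as np

def forward(X, params):
    L = len(params) // 2
    A, caches = X, []
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b
        caches.append((A, W, Z))                       # (A_prev, W, Z) of layer l
        A = np.maximum(0, Z) if l < L else 1 / (1 + np.exp(-Z))
    return A, caches

def cost(AL, Y):
    return float(-np.mean(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)))

def backward(AL, Y, caches):
    grads, L, m = {}, len(caches), Y.shape[1]
    dZ = AL - Y                                        # equals dAL * s * (1 - s)
    for l in range(L, 0, -1):
        A_prev, W, Z = caches[l - 1]                   # layer l lives at index l-1
        if l < L:
            dZ = dA * (Z > 0)                          # ReLU backward
        grads["dW" + str(l)] = dZ @ A_prev.T / m
        grads["db" + str(l)] = dZ.sum(axis=1, keepdims=True) / m
        dA = W.T @ dZ
    return grads

def max_gradient_error(X, Y, params, eps=1e-6):
    AL, caches = forward(X, params)
    grads = backward(AL, Y, caches)
    worst = 0.0
    for name, value in params.items():
        for idx in np.ndindex(value.shape):
            old = value[idx]
            value[idx] = old + eps
            c_plus = cost(forward(X, params)[0], Y)
            value[idx] = old - eps
            c_minus = cost(forward(X, params)[0], Y)
            value[idx] = old                           # restore the parameter
            numeric = (c_plus - c_minus) / (2 * eps)   # central difference
            worst = max(worst, abs(numeric - grads["d" + name][idx]))
    return worst

rng = np.random.default_rng(0)
dims = [4, 3, 2, 1]                                    # tiny network, for speed
params = {}
for l in range(1, len(dims)):
    params["W" + str(l)] = rng.standard_normal((dims[l], dims[l - 1])) * 0.5
    params["b" + str(l)] = rng.standard_normal((dims[l], 1)) * 0.5
X = rng.standard_normal((4, 5))
Y = np.array([[1, 0, 1, 0, 1]])
worst = max_gradient_error(X, Y, params)
print(worst)
```

If a gradient is stored under the wrong layer index, the mismatch typically shows up as a large error for that layer's parameters.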

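Implementation

The thing to watch in the backward-pass implementation is the set of lines in the code below that read values back from linear_cache and activation_cache (highlighted in the original post). This is the point made repeatedly above: the partial derivatives in the backward pass are computed from variables produced during the forward pass. Because the forward pass and the backward pass are split into two separate functions, take particular care with the index arithmetic when using these variables.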
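For reference while reading the code, the per-layer rules it applies can be written out as follows. The notation is an addition of this note: $\odot$ is the element-wise product, the fractions are element-wise, $\sigma$ is the sigmoid function, $\mathbb{1}[\cdot]$ is 1 where the condition holds and 0 elsewhere, $dZ^{[l](i)}$ is the $i$-th column of $dZ^{[l]}$, and $A^{[0]} = X$.

```latex
\begin{aligned}
dA^{[L]} &= -\left(\frac{Y}{A^{[L]}} - \frac{1-Y}{1-A^{[L]}}\right) \\
dZ^{[L]} &= dA^{[L]} \odot \sigma\left(Z^{[L]}\right) \odot \left(1-\sigma\left(Z^{[L]}\right)\right) \\
dZ^{[l]} &= dA^{[l]} \odot \mathbb{1}\left[Z^{[l]} > 0\right], \quad l = L-1, \ldots, 1 \\
dW^{[l]} &= \frac{1}{m}\, dZ^{[l]} \left(A^{[l-1]}\right)^T \\
db^{[l]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \\
dA^{[l-1]} &= \left(W^{[l]}\right)^T dZ^{[l]}
\end{aligned}
```

The last three lines are identical for every layer; only the $dZ$ line depends on whether the layer uses Sigmoid or ReLU.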

import numpy as np

def NL_backwardpass(AL, X, Y, caches):
    """
    Implement the backward propagation for the [LINEAR->RELU] * (L-1) -> LINEAR -> SIGMOID group

    Arguments:
    AL -- probability vector, output of the forward propagation (NL_forwardpass())
    X -- input data (not used in the computation; kept for the original call signature)
    Y -- true "label" vector (containing 0 if non-cat, 1 if cat)
    caches -- list of (linear_cache, activation_cache) tuples:
        one per LINEAR->RELU layer (there are L-1 of them, indexed from 0 to L-2)
        one for the LINEAR->SIGMOID layer (indexed L-1)

    Returns:
    grads -- A dictionary with the gradients
        grads["dA" + str(l)] = ...
        grads["dW" + str(l)] = ...
        grads["db" + str(l)] = ...
    """
    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    ## Initializing the backpropagation
    ## Computes the gradient of AL. (AL means the y-hat of model.)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    ## L-th layer gradients.
    ## (Sigmoid -> Linear)
    current_cache = caches[L-1]   # The index of caches is in the range of 0 to L-1.
    linear_cache, activation_cache = current_cache
    # dZL (Sigmoid backward), using Z stored by the forward pass
    s = 1/(1+np.exp( -activation_cache["Z" + str(L)] ))
    dZL = dAL * s * (1-s)
    # dA_prev (Linear backward), using A_prev, W, b stored by the forward pass
    A_prev = linear_cache["A_prev" + str(L)]
    W = linear_cache["W" + str(L)]
    b = linear_cache["b" + str(L)]
    m = A_prev.shape[1]
    dW = 1./m * np.dot(dZL, A_prev.T)
    db = 1./m * np.sum(dZL, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZL)
    ## Save grads.(The L-th layer)
    grads["dA" + str(L-1)], grads["dW" + str(L)], grads["db" + str(L)] = dA_prev, dW, db

    ## The value of l is decreased from L-1 to 1.
    for l in reversed(range(1, L)):
        ## l-th layer gradients.
        ## (ReLU -> Linear), Example: caches[2] contains: A_prev3, W3, b3
        current_cache = caches[l-1]
        linear_cache, activation_cache = current_cache
        # dZ (ReLU backward): the gradient passes only where Z > 0
        Z = activation_cache["Z" + str(l)]
        dZ = np.array(grads["dA" + str(l)], copy=True)
        dZ[Z<=0] = 0
        assert (dZ.shape == Z.shape)  # check shape
        # dA (Linear backward)
        A_prev = linear_cache["A_prev" + str(l)]
        W = linear_cache["W" + str(l)]
        b = linear_cache["b" + str(l)]
        m = A_prev.shape[1]
        dW = 1./m * np.dot(dZ, A_prev.T)
        db = 1./m * np.sum(dZ, axis=1, keepdims=True)
        dA_prev = np.dot(W.T, dZ)
        assert (dA_prev.shape == A_prev.shape)  # check shape
        assert (dW.shape == W.shape)
        assert (db.shape == b.shape)
        ## Save grads.(the l-th layer)
        grads["dA" + str(l-1)], grads["dW" + str(l)], grads["db" + str(l)] = dA_prev, dW, db

    return grads
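Once `grads` is available, the remaining step in the flow is the parameter update. The post does not list that function, so this is a minimal sketch assuming plain gradient descent; the name `update_parameters`, the learning rate of 0.1, and the one-layer example values are all made up for illustration.

```python
import numpy as np

def update_parameters(parameters, grads, learning_rate):
    """Return new parameters after one gradient-descent step (hypothetical helper)."""
    L = len(parameters) // 2
    updated = {}
    for l in range(1, L + 1):
        # W := W - learning_rate * dW, and the same for b.
        updated["W" + str(l)] = parameters["W" + str(l)] - learning_rate * grads["dW" + str(l)]
        updated["b" + str(l)] = parameters["b" + str(l)] - learning_rate * grads["db" + str(l)]
    return updated

parameters = {"W1": np.array([[1.0, 2.0]]), "b1": np.array([[0.5]])}
grads = {"dW1": np.array([[1.0, 2.0]]), "db1": np.array([[0.5]])}
updated = update_parameters(parameters, grads, learning_rate=0.1)
print(updated["W1"], updated["b1"])
```

The `grads["dA..."]` entries are intermediate values for the backward pass and play no part in the update.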


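Comparing results

We compare the recognition accuracy of the 4-layer neural network from this post with the 2-layer neural network from the previous post. Training accuracy is nearly the same for both, but the 4-layer network does better on the test set (82% versus 74%). This is consistent with a remark attributed to Yann LeCun: "moderately deepening a neural network can improve model performance."

Accuracy	Training phase	Testing phase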
2-layer	99.99%	74%
4-layer	99.52%	82%
