Backpropagation (BP) #2: Cat Classifier with a 2-Layer Neural Network
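This post walks through the algorithm and implementation of backpropagation (Back Propagation / Backpropagation, BP), a method widely used in machine learning and deep learning: the forward pass, the backward pass, and logistic regression. It then uses a simple 2-layer neural network to build a "cat classifier".
Start by downloading this example from GitHub! The article is easier to follow if you run the code as you read 😀
If the derivation is still unclear, read this post first: Backpropagation (BP) #1: How It Works. If you want to know how to optimize a multi-layer neural network, read this one: Backpropagation (BP) #3: Cat Classifier with an N-Layer Neural Network.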
Initialize parameters $\rightarrow$ Forward pass $\rightarrow$ Compute cost $\rightarrow$ Backward pass $\rightarrow$ Update parameters $\rightarrow$ Check whether iteration is finished: (yes) $\rightarrow$ End; (no) $\rightarrow$ return to Forward pass
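The loop in the flowchart can be sketched on a much smaller problem. The example below is only an illustration under stated assumptions: it uses a single-layer logistic regression instead of the 2-layer network, and the toy data, `learning_rate`, and `num_iterations` are hypothetical values chosen for the sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: 2 features, 4 examples (one example per column).
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [1.0, 0.0, 1.0, 0.0]])
Y = np.array([[0, 0, 1, 1]])
m = X.shape[1]

# Step 1: initialize parameters.
w = np.zeros((1, 2))
b = 0.0
learning_rate = 0.5      # assumed value for this sketch
num_iterations = 2000    # assumed value for this sketch

toy_costs = []
for i in range(num_iterations):   # Step 6: stop after a fixed number of iterations
    a = sigmoid(np.dot(w, X) + b)                               # Step 2: forward pass
    cost = -np.mean(Y * np.log(a) + (1 - Y) * np.log(1 - a))    # Step 3: compute cost
    dz = a - Y                                                  # Step 4: backward pass
    dw = np.dot(dz, X.T) / m
    db = np.sum(dz) / m
    w = w - learning_rate * dw                                  # Step 5: update parameters
    b = b - learning_rate * db
    toy_costs.append(cost)

print("first cost:", toy_costs[0], "last cost:", toy_costs[-1])
```

The first recorded cost equals $\ln 2$ because every prediction starts at $0.5$, and the cost should shrink as the loop repeats.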
```python
import time
import numpy as np
import h5py
import matplotlib.pyplot as plt
import scipy
from PIL import Image
from scipy import ndimage
from dnn_app_utils_v3 import *

# Jupyter notebook magics
%matplotlib inline
plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

%load_ext autoreload
%autoreload 2

np.random.seed(1)
```
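The dataset used in the Coursera course consists of $64\times 64$ color images: $209$ images in the training set and $50$ images in the test set. The following code lets us inspect it.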
```python
# Load dataset.
train_x_orig, train_y, test_x_orig, test_y, classes = load_data()

# Example of a picture
index = 166
plt.figure(num=1, figsize=(3,3))
plt.imshow(train_x_orig[index])
print ("y = " + str(train_y[0,index]) + ". It's a " + classes[train_y[0,index]].decode("utf-8") + " picture.")

# Explore your dataset
m_train = train_x_orig.shape[0]
num_px = train_x_orig.shape[1]
m_test = test_x_orig.shape[0]

print ("train_x_orig.shape: " + str(train_x_orig.shape))
print("")
print ("Number of training examples: " + str(m_train))
print ("Number of testing examples: " + str(m_test))
print ("Each image is: (" + str(num_px) + ", " + str(num_px) + ", 3)")
print("")
print ("train_x_orig shape: " + str(train_x_orig.shape))
print ("train_y shape: " + str(train_y.shape))
print ("test_x_orig shape: " + str(test_x_orig.shape))
print ("test_y shape: " + str(test_y.shape))
```
```
y = 1. It's a cat picture.
train_x_orig.shape: (209, 64, 64, 3)

Number of training examples: 209
Number of testing examples: 50
Each image is: (64, 64, 3)

train_x_orig shape: (209, 64, 64, 3)
train_y shape: (1, 209)
test_x_orig shape: (50, 64, 64, 3)
test_y shape: (1, 50)
```
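This code shows that the dimensions of train_x_orig are ordered as image index, height, width, RGB channel. Next we preprocess the image data, first flattening each image into a $12288\times1$ vector. Why $12288$? Because height $\times$ width $\times$ RGB $=64\times64\times3=12288$, as shown in Figure (2).
As the code below shows, train_x_orig is first reshaped to $12288\times209$. To help the model train well and converge faster, the values of all $209$ reshaped images are then normalized (each value in the dataset $\div$ the maximum value in the dataset; the code divides by $255$), so every value lies between $0$ and $1$. The same steps are applied to test_x_orig. That completes the preprocessing of the dataset.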
```python
# Preprocess input data(images)
# Reshape the training and test examples
train_x_flatten = train_x_orig.reshape(train_x_orig.shape[0], -1).T
test_x_flatten = test_x_orig.reshape(test_x_orig.shape[0], -1).T

# Normalization data to have feature values between 0 and 1.
train_x = train_x_flatten/255.
test_x = test_x_flatten/255.

print ("train_x's shape: " + str(train_x.shape))
print ("test_x's shape: " + str(test_x.shape))
```
```
train_x's shape: (12288, 209)
test_x's shape: (12288, 50)
```
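To see what `reshape(..., -1).T` does to each image, here is a small sketch on made-up data. The array `fake_images` is a hypothetical stand-in for train_x_orig (3 tiny images of $2\times2$ pixels with 3 channels), not part of the original dataset.

```python
import numpy as np

# Hypothetical stand-in for train_x_orig: shape (images, height, width, channels).
fake_images = np.arange(3 * 2 * 2 * 3).reshape(3, 2, 2, 3)

# Same reshape as in the article: one row per image, then transpose
# so that every image becomes one column.
flat = fake_images.reshape(fake_images.shape[0], -1).T
scaled = flat / 255.0

print("flat shape:", flat.shape)   # 2 * 2 * 3 = 12 rows, 3 columns

# Column j of `flat` is image j unrolled in row-major order.
column_matches_image = np.array_equal(flat[:, 1], fake_images[1].ravel())
print("column 1 equals image 1 unrolled:", column_matches_image)
```

The transpose matters: without it each image would be a row, and the matrix product $w^{[1]}x$ used later would not line up.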
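- The superscript $[1]$ denotes layer 1 of the neural network.
- $n^{[1]}$ is the number of neuron outputs in layer 1 of the model.
- $a_5^{[1]}$ is the $(5+1)$-th output of the layer-1 neurons (output indices start at $0$, hence the $+1$; because there is only $1$ neuron, no neuron index is written).
- $a^{[2]}$ denotes all outputs of layer 2 of the network as a whole.
- The large circle in the figure labeled Linear Relu indicates that the layer uses ReLU as its activation function. In this architecture, layer 1 uses ReLU and layer 2 uses Sigmoid.
- The computation inside the neurons of each layer is as follows; for layer 1 it is given by equations (1) and (2).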
$$Z^{[1]}=w^{[1]}x+b^{[1]} \tag{1}$$
$$a^{[1]}=\mathrm{ReLU}(Z^{[1]}) \tag{2}$$
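Layer 1 of the network produces $n^{[1]}$ neuron outputs, namely $a_{0}^{[1]}\cdots a_{n^{[1]}-1}^{[1]}$.
Finally, this figure does not yet include the cost function. The $0.73$ should be replaced by $a_0^{[2]}$, with an arrow pointing to the cost function $L(y,a^{[2]})$ to show that it is the input to $L$. Here $a^{[2]}$ has no subscript because it refers to the entire output of layer 2. Like this: $a_0^{[2]}\rightarrow L(y, a^{[2]})$
The $0.73$ computed by the layer-2 neuron in Figure (3) is the value before it is fed into the cost function, so it is only $a_0^{[2]}$. What is a cost function? It is the measure used to judge how well a given set of parameters works for the model. The cost function used in this post is given in equation (3). The variable $m$ in it is the number of samples: for example, $m=209$ for the $209$ training images and $m=50$ for the $50$ test images. If this is confusing, you can ignore $m$ for now. (The $\log$ in this cost function is the natural log, $\ln$, with base $e$. The reason is simple: the derivative of $\ln$ is simpler. This is also a convention when using backpropagation, so authors of papers often do not mention it.)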
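Equations (1) and (2) can be tried on a few numbers. The sizes and values below (3 input features, 2 hidden units, 2 examples) are hypothetical and chosen only so the result is easy to check by hand.

```python
import numpy as np

# Hypothetical layer-1 parameters: 2 hidden units, 3 input features.
w1 = np.array([[1.0, -1.0, 0.5],
               [-2.0, 0.0, 1.0]])
b1 = np.array([[0.0],
               [0.5]])

# Two examples, one per column.
x = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])

Z1 = np.dot(w1, x) + b1     # equation (1); b1 is broadcast across the columns
A1 = np.maximum(0, Z1)      # equation (2); ReLU zeroes the negative entries

print(Z1)
print(A1)
```

Here `Z1` is `[[-1.0, 0.5], [-1.5, 3.5]]`, so ReLU turns the first column into zeros and leaves the second column unchanged.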
$$L(y,a^{[2]})=-\frac{1}{m}\sum\limits_{i=0}^{m-1}\left(y^{(i)}\log\left(a^{[2](i)}\right)+(1-y^{(i)})\log\left(1-a^{[2](i)}\right)\right) \tag{3}$$
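As a small worked example of equation (3), the sketch below evaluates the cost for three made-up predictions. The labels and the values in `a2` are hypothetical; only the $0.73$ echoes the figure.

```python
import numpy as np

# Hypothetical labels and layer-2 outputs for m = 3 examples.
y = np.array([[1, 0, 1]])
a2 = np.array([[0.73, 0.2, 0.9]])
m = y.shape[1]

# Per-example loss: only one of the two terms is active for each example,
# depending on whether the label is 1 or 0.
loss_per_example = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))
cost = np.sum(loss_per_example) / m

print(loss_per_example)
print(cost)
```

A confident correct prediction (such as $0.9$ for a label of $1$) contributes a small loss, and the cost is the average over the $m$ examples.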
| Parameter / variable | Shape |
|---|---|
| train_x ($x$) | $12288\times209$ |
| W1 ($w^{[1]}$) | n_h$\times12288$ |
| b1 ($b^{[1]}$) | n_h$\times1$ |
| W2 ($w^{[2]}$) | n_y$\times$n_h |
| b2 ($b^{[2]}$) | n_y$\times1$ |
| Layer | Step | Computation |
|---|---|---|
| Layer-1 | Step 1 | $Z^{[1]}=w^{[1]}x+b^{[1]}$ |
| | Step 1 shape | $[$n_h$\times12288]\cdot[12288\times209]+b^{[1]}\Rightarrow[$n_h$\times209]$ |
| | Step 2 | $a^{[1]}=\mathrm{ReLU}(Z^{[1]})$ |
| | Step 2 shape | $[$n_h$\times209]$ (ReLU does not change the shape) |
| Layer-2 | Step 1 | $Z^{[2]}=w^{[2]}a^{[1]}+b^{[2]}$ |
| | Step 1 shape | $[$n_y$\times$n_h$]\cdot[$n_h$\times209]+b^{[2]}\Rightarrow[$n_y$\times209]$ |
| | Step 2 | $a^{[2]}=\mathrm{Sigmoid}(Z^{[2]})$ |
| | Step 2 shape | $[$n_y$\times209]$ (Sigmoid does not change the shape) |
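The shapes in the table can be checked with random arrays. This is a sketch under assumptions: `n_h = 7` is an arbitrary hidden-layer size picked for the illustration, and the input is random noise rather than real images.

```python
import numpy as np

# n_x and m follow the article; n_h = 7 is an arbitrary choice for this sketch.
n_x, n_h, n_y, m = 12288, 7, 1, 209

rng = np.random.default_rng(0)
x = rng.random((n_x, m))                      # random stand-in for train_x
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

Z1 = np.dot(W1, x) + b1       # (n_h, n_x) . (n_x, m) -> (n_h, m)
A1 = np.maximum(0, Z1)        # shape unchanged
Z2 = np.dot(W2, A1) + b2      # (n_y, n_h) . (n_h, m) -> (n_y, m)
A2 = 1 / (1 + np.exp(-Z2))    # shape unchanged

shapes = {"Z1": Z1.shape, "A1": A1.shape, "Z2": Z2.shape, "A2": A2.shape}
print(shapes)
```

The bias vectors have a single column; NumPy broadcasting adds them to every one of the $209$ columns.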
```python
def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    parameters -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """
    np.random.seed(1)

    W1 = np.random.randn(n_h, n_x)*0.01
    b1 = np.zeros([n_h, 1])
    W2 = np.random.randn(n_y, n_h)*0.01
    b2 = np.zeros([n_y, 1])

    parameters = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

    return parameters
```
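The forward pass is the process that runs from feeding the images into the model up to computing the cost. Here, however, the cost computation is split out into its own function, so this function does not compute the cost. Besides returning the layer-2 result ($a^{[2]}$), forwardpass also returns a variable cache, which holds the intermediate values that the backward pass needs.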
```python
def forwardpass(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)

    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    Z1 = np.dot(W1, X) + b1
    A1 = np.maximum(0,Z1) # ReLU
    Z2 = np.dot(W2, A1) + b2
    A2 = 1/(1+np.exp(-Z2)) # Sigmoid

    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}

    return A2, cache
```
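Equation (3) is the cost function of this logistic regression model. It contains an annoying $\sum$ and a mysterious variable $m$. (Equation (3) is the same equation shown earlier, repeated here for convenience.) In fact $m$ is the number of input images: $209$ during training and $50$ during testing. Equation (4) is a way of getting rid of the $\sum$ by writing it as a vector product. Both equations (3) and (4) contain $m$; the cost is divided by $m$ to obtain the average cost. Finally, what is $y$? $y$ is the ground truth: train_y is a $1\times209$ vector and test_y is a $1\times50$ vector, and their values are only $1$ or $0$, where $1$ means the image is a cat and $0$ means it is not.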
$$L(y,a^{[2]})=-\frac{1}{m}\sum\limits_{i=0}^{m-1}\left(y^{(i)}\log\left(a^{[2](i)}\right)+(1-y^{(i)})\log\left(1-a^{[2](i)}\right)\right) \tag{3}$$
$$L(y,a^{[2]})=\left(-y\left(\log a^{[2]}\right)^{T}-(1-y)\left(\log(1-a^{[2]})\right)^{T}\right)/m \tag{4}$$
```python
def compute_cost(A2, Y, parameters):
    m = Y.shape[1] # number of examples

    # Equivalent element-wise form of equation (3):
    #cost = - np.sum(np.multiply(np.log(A2), Y) + np.multiply(1-Y, np.log(1-A2))) / m
    # Vectorized form of equation (4):
    cost = (1./m) * (-np.dot(Y, np.log(A2).T) - np.dot(1-Y, np.log(1-A2).T))

    cost = np.squeeze(cost) # makes sure cost is the dimension we expect.
    return cost
```
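The summation form and the dot-product form of the cost should give the same number. The sketch below compares them on random values; `Y` and `A2` here are made-up arrays of the same layout as in the article, with `A2` kept away from $0$ and $1$ so the logarithms stay finite.

```python
import numpy as np

rng = np.random.default_rng(1)
m = 5
Y = rng.integers(0, 2, size=(1, m)).astype(float)   # hypothetical 0/1 labels
A2 = rng.uniform(0.05, 0.95, size=(1, m))           # hypothetical sigmoid outputs

# Element-wise form: multiply, then sum over the examples.
cost_sum = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m

# Dot-product form: a (1, m) row times an (m, 1) column performs the same sum.
cost_dot = (-np.dot(Y, np.log(A2).T) - np.dot(1 - Y, np.log(1 - A2).T)) / m
cost_dot = np.squeeze(cost_dot)   # (1, 1) array -> scalar

print(cost_sum, cost_dot)
```

The dot product returns a $1\times1$ array, which is why `np.squeeze` is applied before the value is used as a scalar.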
The backward pass is the hardest part of backpropagation!
However, using the chain rule from the previous post, we can work step by step to these gradients: $\frac{\partial L}{\partial w^{[1]}}$, $\frac{\partial L}{\partial b^{[1]}}$, $\frac{\partial L}{\partial w^{[2]}}$, $\frac{\partial L}{\partial b^{[2]}}$
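$$\frac{\partial L}{\partial a^{[2]}}=-\left(\frac{y}{a^{[2]}}-\frac{1-y}{1-a^{[2]}}\right) \tag{5}$$
$$\frac{\partial L}{\partial Z^{[2]}}=\frac{\partial L}{\partial a^{[2]}}\left(\frac{1}{1+e^{-Z^{[2]}}}\right)\left(1-\frac{1}{1+e^{-Z^{[2]}}}\right) \tag{6}$$
$$\frac{\partial L}{\partial w^{[2]}}=\left(\frac{\partial L}{\partial Z^{[2]}}a^{[1]T}\right)/m \tag{7}$$
$$\frac{\partial L}{\partial b^{[2]}}=\frac{1}{m}\sum\limits_{i=0}^{m-1}\frac{\partial L}{\partial Z^{[2](i)}} \tag{8}$$
$$\frac{\partial L}{\partial a^{[1]}}=w^{[2]T}\frac{\partial L}{\partial Z^{[2]}} \tag{9}$$
$$\frac{\partial L}{\partial Z^{[1]}}=\frac{\partial L}{\partial a^{[1]}}\odot\mathbb{1}\left[Z^{[1]}>0\right] \tag{10}$$
That is, $\frac{\partial L}{\partial Z^{[1]}}$ equals $\frac{\partial L}{\partial a^{[1]}}$ with every entry set to $0$ wherever the entry at the same position in $Z^{[1]}$ is $\leq0$.
$$\frac{\partial L}{\partial w^{[1]}}=\left(\frac{\partial L}{\partial Z^{[1]}}x^{T}\right)/m \tag{11}$$
$$\frac{\partial L}{\partial b^{[1]}}=\frac{1}{m}\sum\limits_{i=0}^{m-1}\frac{\partial L}{\partial Z^{[1](i)}} \tag{12}$$
In equations (8) and (12) the sum runs over the $m$ examples, that is, over the columns of $\frac{\partial L}{\partial Z}$, which matches np.sum(..., axis=1, keepdims=True) in the code.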
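Multiplying equations (5) and (6) together simplifies algebraically: with $a^{[2]}$ equal to the sigmoid of $Z^{[2]}$, the product reduces to $a^{[2]}-y$. The sketch below checks this numerically on random values; the arrays are hypothetical, and the shortcut is offered as an observation, not as what the article's code uses.

```python
import numpy as np

rng = np.random.default_rng(2)
Z2 = rng.standard_normal((1, 6))                    # hypothetical layer-2 pre-activations
Y = rng.integers(0, 2, size=(1, 6)).astype(float)   # hypothetical 0/1 labels

A2 = 1 / (1 + np.exp(-Z2))                          # sigmoid

dA2 = -(Y / A2 - (1 - Y) / (1 - A2))                # equation (5)
dZ2_chain = dA2 * A2 * (1 - A2)                     # equation (6)
dZ2_short = A2 - Y                                  # simplified form

print(np.max(np.abs(dZ2_chain - dZ2_short)))
```

The simplified form also avoids dividing by values of $a^{[2]}$ that are very close to $0$ or $1$.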
```python
def backwardpass(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.

    Arguments:
    parameters -- python dictionary containing our parameters
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (n_x, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)

    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]

    W1 = parameters["W1"]
    W2 = parameters["W2"]
    A1 = cache["A1"]
    A2 = cache["A2"]
    Z1 = cache["Z1"]
    Z2 = cache["Z2"]

    dA2 = - (np.divide(Y, A2) - np.divide(1 - Y, 1 - A2))   # equation (5)

    temp_s = 1/(1+np.exp(-Z2))
    dZ2 = dA2 * temp_s * (1-temp_s) # Sigmoid (back propagation), equation (6)

    dW2 = 1/m * np.dot(dZ2, A1.T)                        # equation (7)
    db2 = 1/m * np.sum(dZ2, axis=1, keepdims=True)       # equation (8)
    dA1 = np.dot(W2.T, dZ2)                              # equation (9)

    # ReLU (back propagation), equation (10)
    dZ1 = np.array(dA1, copy=True) # just converting dz to a correct object.
    dZ1[Z1 <= 0] = 0 # When z <= 0, you should set dz to 0 as well.

    dW1 = 1/m * np.dot(dZ1, X.T)                         # equation (11)
    db1 = 1/m * np.sum(dZ1, axis=1, keepdims=True)       # equation (12)

    grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

    return grads
```
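One common way to gain confidence in a backward pass is a gradient check: compare the analytic gradients with finite-difference estimates of the cost. The sketch below does this for a tiny 2-layer network of the same structure (ReLU then Sigmoid). The helper names (`forward`, `cost_fn`, `analytic_grads`, `numeric_grads`), the network sizes, and `eps` are assumptions made for this illustration and are not part of the article's code.

```python
import numpy as np

def forward(params, X):
    Z1 = np.dot(params["W1"], X) + params["b1"]
    A1 = np.maximum(0, Z1)                       # ReLU
    Z2 = np.dot(params["W2"], A1) + params["b2"]
    A2 = 1 / (1 + np.exp(-Z2))                   # Sigmoid
    return Z1, A1, A2

def cost_fn(params, X, Y):
    A2 = forward(params, X)[2]
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def analytic_grads(params, X, Y):
    # Same steps as equations (5) to (12).
    m = X.shape[1]
    Z1, A1, A2 = forward(params, X)
    dA2 = -(Y / A2 - (1 - Y) / (1 - A2))
    dZ2 = dA2 * A2 * (1 - A2)
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dA1 = np.dot(params["W2"].T, dZ2)
    dZ1 = np.where(Z1 <= 0, 0.0, dA1)
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}

def numeric_grads(params, X, Y, eps=1e-6):
    # Central differences: nudge one parameter entry at a time.
    grads = {}
    for name, value in params.items():
        grad = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            original = value[idx]
            value[idx] = original + eps
            cost_plus = cost_fn(params, X, Y)
            value[idx] = original - eps
            cost_minus = cost_fn(params, X, Y)
            value[idx] = original                # restore the entry
            grad[idx] = (cost_plus - cost_minus) / (2 * eps)
        grads[name] = grad
    return grads

# Hypothetical tiny problem: 4 features, 3 hidden units, 5 examples.
rng = np.random.default_rng(3)
X = rng.standard_normal((4, 5))
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
params = {
    "W1": rng.standard_normal((3, 4)),
    "b1": rng.standard_normal((3, 1)),
    "W2": rng.standard_normal((1, 3)),
    "b2": rng.standard_normal((1, 1)),
}

analytic = analytic_grads(params, X, Y)
numeric = numeric_grads(params, X, Y)
max_diff = max(np.max(np.abs(analytic[k] - numeric[k])) for k in params)
print("largest difference:", max_diff)
```

If the two sets of gradients disagree by much more than the finite-difference error, the backward pass probably has a bug. The check is slow, so it is usually run on a very small network like this one.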
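Parameter updates follow a single rule: the new parameter is the current parameter minus the gradient computed by the backward pass multiplied by the learning rate.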
$$w^{[1]}=w^{[1]}-\alpha\left(\frac{\partial L}{\partial w^{[1]}}\right) \tag{13}$$
$$b^{[1]}=b^{[1]}-\alpha\left(\frac{\partial L}{\partial b^{[1]}}\right) \tag{14}$$
$$w^{[2]}=w^{[2]}-\alpha\left(\frac{\partial L}{\partial w^{[2]}}\right) \tag{15}$$
$$b^{[2]}=b^{[2]}-\alpha\left(\frac{\partial L}{\partial b^{[2]}}\right) \tag{16}$$
```python
def update_parameters(parameters, grads, learning_rate = 1.2):
    """
    Updates parameters using the gradient descent update rule given above

    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]

    W1 = W1 - learning_rate*dW1
    b1 = b1 - learning_rate*db1
    W2 = W2 - learning_rate*dW2
    b2 = b2 - learning_rate*db2

    parameters = {"W1": W1, "b1": b1, "W2": W2, "b2": b2}

    return parameters
```
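The effect of the learning rate in equations (13) to (16) is easiest to see on a one-parameter problem. The sketch below applies the same update rule to the toy function $f(w)=(w-3)^2$; the function, the helper `run_gradient_descent`, and both values of `alpha` are hypothetical choices for illustration.

```python
def run_gradient_descent(alpha, steps, w_start=0.0):
    """Minimize f(w) = (w - 3)**2 with the update w = w - alpha * df/dw."""
    w = w_start
    for _ in range(steps):
        grad = 2 * (w - 3)      # derivative of (w - 3)**2
        w = w - alpha * grad    # same form as equations (13) to (16)
    return w

w_small_step = run_gradient_descent(alpha=0.1, steps=50)
w_large_step = run_gradient_descent(alpha=1.1, steps=50)

print("alpha = 0.1:", w_small_step)   # moves toward the minimum at w = 3
print("alpha = 1.1:", w_large_step)   # overshoots further on every step
```

With a small learning rate the distance to the minimum shrinks on every step; with one that is too large each update overshoots and the parameter diverges. This is one reason the learning rate is worth tuning per problem.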
```python
def nn_model(X, Y, n_h, num_iterations = 5000, learning_rate=0.08, print_cost = False):
    """
    Arguments:
    X -- dataset of shape (n_x, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    learning_rate -- learning rate of the gradient descent update rule
    print_cost -- if True, print the cost every 500 iterations

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """
    costs = []
    np.random.seed(1)
    n_x = X.shape[0]
    n_y = Y.shape[0]

    # Initialize W1, b1, W2, b2
    parameters = initialize_parameters(n_x, n_h, n_y)
    #print("W1.shape: " + str(parameters["W1"].shape))
    #print("W2.shape: " + str(parameters["W2"].shape))

    for i in range(0, num_iterations):
        A2, cache = forwardpass(X, parameters)
        #print("Z1.shape: " + str(cache["Z1"].shape))
        #print("A1.shape: " + str(cache["A1"].shape))
        #print("Z2.shape: " + str(cache["Z2"].shape))
        #print("A2.shape: " + str(cache["A2"].shape))

        cost = compute_cost(A2, Y, parameters)
        grads = backwardpass(parameters, cache, X, Y)
        parameters = update_parameters(parameters, grads, learning_rate)

        if i % 500 == 0:
            costs.append(cost)
            if print_cost:
                print("Cost after iteration {}: {}".format(i, cost))

    # The latest iteration.
    print("Cost after iteration {}: {}".format(i, cost))
    costs.append(cost)

    plt.figure(num=1, figsize=(8,5))
    plt.semilogy(costs)
    plt.xlabel("Iterations")
    plt.ylabel("Cost")
    plt.title("Learning Rate = " + str(learning_rate))

    return parameters
```
The accuracy function relies on the forward pass: pass the trained parameters and the vectorized images (each of dimension $12288\times1$) to forwardpass, which returns probas ($a^{[2]}$) and cache (cache is not used when measuring accuracy).
```python
def predict(X, y, parameters):
    """
    This function is used to predict the results of the trained neural network.

    Arguments:
    X -- data set of examples you would like to label
    y -- true labels, used only to report the accuracy
    parameters -- parameters of the trained model

    Returns:
    p -- predictions for the given dataset X
    """
    m = X.shape[1]
    n = len(parameters) // 2 # number of layers in the neural network
    p = np.zeros((1,m))

    # Forward propagation
    probas, caches = forwardpass(X, parameters)

    # convert probas to 0/1 predictions
    for i in range(0, probas.shape[1]):
        if probas[0,i] > 0.5:
            p[0,i] = 1
        else:
            p[0,i] = 0

    #print results
    #print ("predictions: " + str(p))
    #print ("true labels: " + str(y))
    print("Accuracy: " + str(np.sum((p == y)/m)))

    return p
```
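The thresholding loop in predict can also be written as a single array comparison. The sketch below shows both versions side by side on made-up probabilities; `probas` and `y` here are hypothetical values, and the vectorized form is an alternative, not the article's implementation.

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels for 4 examples.
probas = np.array([[0.9, 0.2, 0.5, 0.73]])
y = np.array([[1, 0, 1, 0]])

# Loop version, as in predict().
p_loop = np.zeros((1, probas.shape[1]))
for i in range(probas.shape[1]):
    p_loop[0, i] = 1 if probas[0, i] > 0.5 else 0

# Vectorized version: compare the whole array at once.
p_vec = (probas > 0.5).astype(float)

# Fraction of predictions that match the labels.
accuracy = np.mean(p_vec == y)

print(p_vec, accuracy)
```

Note that a probability of exactly $0.5$ is classified as $0$ in both versions, because the comparison is strict. Taking `np.mean` of the matches also sidesteps the accumulated rounding that summing many `1/m` terms can introduce.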
```python
parameters = nn_model(train_x, train_y, 12, num_iterations=2500, learning_rate=0.007, print_cost=True)
```
```
Cost after iteration 0: 0.6933973875299138
Cost after iteration 500: 0.5054817305127275
Cost after iteration 1000: 0.3024003130312214
Cost after iteration 1500: 0.10870519536443567
Cost after iteration 2000: 0.05241476625572783
Cost after iteration 2499: 0.030590797593466793
```
Measuring accuracy is left to the predict function. This should need no further explanation 🙂
```python
print("Training accuracy:")
predictions_train = predict(train_x, train_y, parameters)
print("Testing accuracy:")
predictions_test = predict(test_x, test_y, parameters)
```
```
Training accuracy:
Accuracy: 0.9999999999999998
Testing accuracy:
Accuracy: 0.74
```
```python
## START CODE HERE ##
# cat.jpg my_image.jpg people.jpeg
my_image = "cat.jpg" # change this to the name of your image file
my_label_y = [1] # the true class of your image (1 -> cat, 0 -> non-cat)
## END CODE HERE ##

fname = "images/" + my_image
# Note: ndimage.imread and scipy.misc.imresize were removed from newer SciPy releases,
# so this cell needs an older SciPy version to run as written.
image = np.array(ndimage.imread(fname, flatten=False))
my_image = scipy.misc.imresize(image, size=(num_px,num_px)).reshape((num_px*num_px*3,1))
my_image = my_image/255.
my_predicted_image = predict(my_image, my_label_y, parameters)

plt.figure(num=1, figsize=(3,3))
plt.imshow(image)
print ("y = " + str(np.squeeze(my_predicted_image)) + ", your L-layer model predicts a \"" + classes[int(np.squeeze(my_predicted_image)),].decode("utf-8") + "\" picture.")
```
```
Accuracy: 1.0
y = 1.0, your L-layer model predicts a "cat" picture.
```