
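Overview of Error Backpropagation

Error backpropagation (BP) is the core algorithm for training multilayer neural networks. Using a computational graph and the chain rule, it propagates the gradient of the loss backward from the output through the network to the parameters of every layer, which makes gradient computation and parameter updates efficient.

Goals of this chapter:

  • Derive the mathematics of backpropagation from computational graphs and local derivatives;
  • Build reusable forward/backward modules around the "layer" abstraction;
  • Implement full backpropagation and gradient checking for a two-layer neural network;
  • Train a classifier with Softmax + cross-entropy, with code examples.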

Environment and Dependencies

The examples in this article use NumPy, which must be imported first:

import numpy as np

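Computational Graphs and the Chain Rule

Computational Graph

A computational graph represents the evaluation of a composite function as a directed acyclic graph. Each node is a basic operation (addition, multiplication, activation, affine transform, and so on), and each edge carries data from one node to the next.

By recording the "local derivative" at each node (the derivative of the node's output with respect to its inputs), the chain rule can be applied along the graph to propagate gradients efficiently.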

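Chain Rule

Let $y = f(g(x))$ be a composite function with loss $L(y)$. Then
$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x} = \underbrace{\frac{\partial L}{\partial y}}_{\text{upstream gradient}} \cdot \underbrace{\frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}}_{\text{local derivative}}. $$

In a network, $\partial L / \partial y$ is propagated backward layer by layer from the output toward the input. Each layer's backward function only needs to multiply the gradient it receives from the layer after it (the upstream gradient) by its own local derivative.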
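As a small sanity check of the chain rule, the sketch below compares the analytic derivative of a composite function with a central-difference estimate. The choices $g(x) = x^2$ and $f(u) = \sin u$ are illustrative assumptions, not functions used elsewhere in this article.

```python
import math

def g(x):
    return x ** 2          # inner function g(x) = x^2 (illustrative choice)

def f(u):
    return math.sin(u)     # outer function f(u) = sin(u) (illustrative choice)

def dy_dx_chain(x):
    # Chain rule: dy/dx = f'(g(x)) * g'(x) = cos(x^2) * 2x
    return math.cos(g(x)) * 2 * x

def dy_dx_numeric(x, h=1e-5):
    # Central-difference approximation of d/dx f(g(x))
    return (f(g(x + h)) - f(g(x - h))) / (2 * h)

x0 = 1.3
analytic = dy_dx_chain(x0)
numeric = dy_dx_numeric(x0)
print(analytic, numeric)   # the two values should agree to several digits
```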

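Layer Design and Local Gradients

We wrap common operations as layers that share a uniform interface, forward(x) and backward(dout):

  • forward(x): runs the forward computation, returns the output, and caches any intermediate values needed for the backward pass;
  • backward(dout): runs the backward computation. It receives the upstream gradient dout, returns the gradient with respect to the input, and stores the gradients of the layer's own parameters.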
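To make the interface concrete, here is a minimal sketch of a layer that follows it. `SquareLayer` is a hypothetical layer invented for this illustration; it is not one of the layers used later in the article.

```python
import numpy as np

class SquareLayer:
    """Hypothetical layer computing y = x**2 (for illustration only)."""

    def forward(self, x):
        self.x = x               # cache the input; backward needs it
        return x ** 2

    def backward(self, dout):
        # local derivative dy/dx = 2x, multiplied by the upstream gradient
        return dout * 2 * self.x

layer = SquareLayer()
x = np.array([1.0, -2.0, 3.0])
y = layer.forward(x)                   # [1., 4., 9.]
dx = layer.backward(np.ones_like(x))   # [2., -4., 6.]
print(y, dx)
```

The pattern is the same for every layer: forward caches what the local derivative depends on, and backward multiplies that local derivative by dout.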

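Multiplication and Addition Layers

The multiplication layer (MulLayer) and the addition layer (AddLayer) illustrate the basic mechanics of backpropagation on a computational graph. Because each takes two inputs, forward accepts two arguments and backward returns two gradients.

Local derivatives:

  • Multiplication: $y = x_1 x_2 \Rightarrow \partial y/\partial x_1 = x_2,\ \partial y/\partial x_2 = x_1$
  • Addition: $y = x_1 + x_2 \Rightarrow \partial y/\partial x_1 = 1,\ \partial y/\partial x_2 = 1$

class MulLayer:
    def forward(self, x, y):
        # Cache both inputs: each one is the local derivative of the other.
        self.x = x
        self.y = y
        return x * y

    def backward(self, dout):
        dx = dout * self.y  # dL/dx = dout * y
        dy = dout * self.x  # dL/dy = dout * x
        return dx, dy


class AddLayer:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # Addition passes the upstream gradient through unchanged.
        return dout, dout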
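A short usage sketch of the two layers on the graph $z = (a + b) \cdot c$. The layer definitions are repeated in compact form so the snippet runs on its own, and the values of a, b, and c are arbitrary.

```python
class MulLayer:
    def forward(self, x, y):
        self.x, self.y = x, y
        return x * y

    def backward(self, dout):
        return dout * self.y, dout * self.x

class AddLayer:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        return dout, dout

add, mul = AddLayer(), MulLayer()
a, b, c = 2.0, 3.0, 4.0

# Forward: z = (a + b) * c
s = add.forward(a, b)
z = mul.forward(s, c)

# Backward: start from dz/dz = 1 and walk the graph in reverse order
ds, dc = mul.backward(1.0)
da, db = add.backward(ds)
print(z, da, db, dc)   # 20.0 4.0 4.0 5.0
```

Here da and db both equal c because the addition node forwards the gradient unchanged, while dc equals a + b, the other input of the multiplication node.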

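Activation Layers: Sigmoid and ReLU

Local derivatives of the nonlinear activations:

  • Sigmoid: $\sigma(x) = 1/(1+e^{-x}) \Rightarrow \sigma'(x) = \sigma(x)(1-\sigma(x))$
  • ReLU: $\text{ReLU}(x) = \max(0, x) \Rightarrow \text{ReLU}'(x) = \begin{cases}1, & x>0 \\ 0, & x\le 0\end{cases}$ (the derivative at $x = 0$ is taken to be 0 by convention)

class Sigmoid:
    def forward(self, x):
        # Clip the input so np.exp cannot overflow for large |x|.
        self.out = 1 / (1 + np.exp(-np.clip(x, -500, 500)))
        return self.out

    def backward(self, dout):
        # sigma'(x) = sigma(x) * (1 - sigma(x)), using the cached output
        return dout * self.out * (1 - self.out)


class ReLU:
    def forward(self, x):
        self.mask = (x <= 0)  # positions where the unit is inactive
        out = x.copy()
        out[self.mask] = 0
        return out

    def backward(self, dout):
        # Copy first so the caller's gradient array is not modified in place.
        dx = dout.copy()
        dx[self.mask] = 0
        return dx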
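The two local derivatives can be checked numerically. The sketch below is self-contained and uses plain functions rather than the layer classes; the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.5])
h = 1e-5

# Sigmoid: analytic derivative vs. central difference
s = sigmoid(x)
analytic = s * (1 - s)
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.max(np.abs(analytic - numeric)))   # a very small number

# ReLU: the upstream gradient passes where x > 0 and is blocked elsewhere
dout = np.array([10.0, 20.0, 30.0, 40.0])
relu_grad = np.where(x > 0, dout, 0.0)
print(relu_grad)                            # [ 0.  0.  0. 40.]
```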

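Affine Layer and Softmax with Cross-Entropy

The affine layer computes $y = xW + b$. Writing $\delta = \partial L/\partial y$ for the upstream gradient, its gradients are:

  • $\partial L/\partial W = x^\top \delta$
  • $\partial L/\partial b = \sum_i \delta_i$ (summed over the samples in the batch)
  • $\partial L/\partial x = \delta W^\top$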
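A quick way to build confidence in the three affine gradient formulas is to check shapes and compare against a finite difference. In the sketch below, δ denotes the upstream gradient ∂L/∂y; the array sizes and the linear test loss defined through `R` are assumptions made only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 4, 3, 2                     # batch size, input dim, output dim (arbitrary)
x = rng.normal(size=(N, D))
W = rng.normal(size=(D, H))
b = rng.normal(size=H)
R = rng.normal(size=(N, H))           # fixed weights defining a scalar test loss

def loss(W, b):
    # L = sum(R * (xW + b)), so the upstream gradient dL/dy equals R
    return np.sum(R * (x @ W + b))

delta = R
dW = x.T @ delta          # (D, N) @ (N, H) -> (D, H), same shape as W
db = delta.sum(axis=0)    # sum over the batch -> (H,), same shape as b
dx = delta @ W.T          # (N, H) @ (H, D) -> (N, D), same shape as x

# Central-difference check on one entry of W and one entry of b
h = 1e-5
E = np.zeros_like(W); E[1, 0] = h
dW_num = (loss(W + E, b) - loss(W - E, b)) / (2 * h)
e = np.zeros_like(b); e[1] = h
db_num = (loss(W, b + e) - loss(W, b - e)) / (2 * h)
print(dW[1, 0], dW_num, db[1], db_num)
```

Each gradient has the same shape as the quantity it differentiates with respect to, which is a useful first check when an implementation misbehaves.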

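The combination of softmax and cross-entropy has an important property: for each sample,
$$ \frac{\partial L}{\partial z} = \hat{y} - y, $$
where $z$ is the vector of unnormalized logits, $\hat{y} = \text{softmax}(z)$, and $y$ is the one-hot true label.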
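A brief derivation sketch of this identity for a single sample with $K$ classes, using only the definitions above and assuming $\sum_k y_k = 1$ (which holds for one-hot labels):

```latex
\begin{aligned}
L &= -\sum_{k=1}^{K} y_k \log \hat{y}_k,
\qquad \hat{y}_k = \frac{e^{z_k}}{\sum_{j} e^{z_j}}
\;\Rightarrow\; \log \hat{y}_k = z_k - \log \sum_{j} e^{z_j}, \\
\frac{\partial \log \hat{y}_k}{\partial z_i} &= \mathbb{1}[k = i] - \hat{y}_i, \\
\frac{\partial L}{\partial z_i}
  &= -\sum_{k} y_k \left( \mathbb{1}[k = i] - \hat{y}_i \right)
   = -y_i + \hat{y}_i \sum_{k} y_k
   = \hat{y}_i - y_i .
\end{aligned}
```

When the loss is averaged over a mini-batch of $N$ samples, each row of the gradient is additionally divided by $N$.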

def softmax(x):
    # Subtract the row-wise max for numerical stability; the result is unchanged.
    x = x - np.max(x, axis=1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)


def cross_entropy(y_true, y_pred):
    # Mean cross-entropy over the batch; y_true is one-hot.
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))


class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b

    def forward(self, x):
        self.x = x  # cached for the weight gradient
        return np.dot(x, self.W) + self.b

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)
        return dx


class SoftmaxWithLoss:
    def forward(self, x, y_true):
        self.y_true = y_true
        self.y_pred = softmax(x)
        return cross_entropy(y_true, self.y_pred)

    def backward(self, dout=1.0):
        # The loss is a batch mean, so divide by the batch size.
        batch = self.y_true.shape[0]
        return dout * (self.y_pred - self.y_true) / batch

Implementing Backpropagation for a Two-Layer Network

The two-layer network has the structure Affine -> ReLU -> Affine -> SoftmaxWithLoss.

class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_scale=0.01):
        self.params = {
            'W1': np.random.randn(input_size, hidden_size) * weight_scale,
            'b1': np.zeros(hidden_size),
            'W2': np.random.randn(hidden_size, output_size) * weight_scale,
            'b2': np.zeros(output_size)
        }
        # The layers hold references to the same arrays as self.params.
        self.layers = [
            Affine(self.params['W1'], self.params['b1']),
            ReLU(),
            Affine(self.params['W2'], self.params['b2'])
        ]
        self.last = SoftmaxWithLoss()

    def predict(self, X):
        # Returns the unnormalized scores (logits).
        out = X
        for layer in self.layers:
            out = layer.forward(out)
        return out

    def loss(self, X, y):
        scores = self.predict(X)
        return self.last.forward(scores, y)

    def accuracy(self, X, y):
        scores = self.predict(X)
        y_pred = np.argmax(scores, axis=1)
        y_true = np.argmax(y, axis=1)
        return np.mean(y_pred == y_true)

    def gradient(self, X, y):
        # Forward pass: fills the caches each layer needs for backward
        self.loss(X, y)
        # Backward through the loss layer
        dout = self.last.backward(1.0)
        # Backward through the remaining layers in reverse order
        for layer in self.layers[::-1]:
            dout = layer.backward(dout)
        # Collect the parameter gradients
        grads = {
            'W1': self.layers[0].dW,
            'b1': self.layers[0].db,
            'W2': self.layers[2].dW,
            'b2': self.layers[2].db,
        }
        return grads
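To show how shapes and gradients flow through the same Affine -> ReLU -> Affine -> SoftmaxWithLoss pipeline, here is a compact functional sketch that does not use the classes. The sizes N, D, H, C, the random data, and the weight scale are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, C = 5, 4, 6, 3                      # illustrative sizes
X = rng.normal(size=(N, D))
Y = np.eye(C)[rng.integers(0, C, size=N)]    # one-hot labels
W1 = rng.normal(size=(D, H)) * 0.5; b1 = np.zeros(H)
W2 = rng.normal(size=(H, C)) * 0.5; b2 = np.zeros(C)

def forward(W1, b1, W2, b2):
    a1 = X @ W1 + b1                         # Affine 1: (N, D) -> (N, H)
    h1 = np.maximum(a1, 0)                   # ReLU
    z = h1 @ W2 + b2                         # Affine 2: (N, H) -> (N, C)
    z = z - z.max(axis=1, keepdims=True)     # stabilize softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(Y * np.log(p), axis=1))
    return loss, (a1, h1, p)

loss, (a1, h1, p) = forward(W1, b1, W2, b2)

# Backward pass, visiting the layers in reverse order
dz = (p - Y) / N                             # SoftmaxWithLoss
dW2 = h1.T @ dz; db2 = dz.sum(axis=0)        # Affine 2
dh1 = dz @ W2.T
da1 = dh1 * (a1 > 0)                         # ReLU
dW1 = X.T @ da1; db1 = da1.sum(axis=0)       # Affine 1

# Spot-check one entry of W1 with a central difference
h = 1e-5
E = np.zeros_like(W1); E[2, 3] = h
num = (forward(W1 + E, b1, W2, b2)[0] - forward(W1 - E, b1, W2, b2)[0]) / (2 * h)
print(dW1[2, 3], num)
```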

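Numerical Gradients and Gradient Checking

To verify that the backpropagation implementation is correct, we approximate the gradient numerically with central differences.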
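For reference, the central-difference approximation for the $i$-th parameter can be written as follows, where $h$ is a small step. The symbols $\theta$ (the parameter vector) and $e_i$ (the $i$-th unit vector) are notation introduced here for illustration.

```latex
\frac{\partial L}{\partial \theta_i}
  \approx \frac{L(\theta + h\,e_i) - L(\theta - h\,e_i)}{2h},
\qquad \text{truncation error } O(h^2).
```

For a smooth loss, the one-sided forward difference $(L(\theta + h\,e_i) - L(\theta))/h$ has error $O(h)$, which is why the central form is usually preferred for gradient checks.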

def numerical_gradient_array(f, x, h=1e-4):
    # Central-difference gradient of the scalar function f at x.
    # x must be a float array; it is perturbed in place and then restored.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old_val = x[idx]
        x[idx] = old_val + h
        fxh1 = f(x)  # f(x + h)
        x[idx] = old_val - h
        fxh2 = f(x)  # f(x - h)
        x[idx] = old_val  # restore the original value
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        it.iternext()
    return grad


def gradient_check(net, X, y):
    # Run this only on a small network and a small batch:
    # it needs two forward passes per parameter.
    def loss_W1(W):
        net.layers[0].W = W
        return net.loss(X, y)

    def loss_b1(b):
        net.layers[0].b = b
        return net.loss(X, y)

    def loss_W2(W):
        net.layers[2].W = W
        return net.loss(X, y)

    def loss_b2(b):
        net.layers[2].b = b
        return net.loss(X, y)

    grads_bp = net.gradient(X, y)
    # Pass the parameter arrays themselves (not copies) so the layers
    # keep sharing storage with net.params after the check.
    grads_num = {
        'W1': numerical_gradient_array(loss_W1, net.layers[0].W),
        'b1': numerical_gradient_array(loss_b1, net.layers[0].b),
        'W2': numerical_gradient_array(loss_W2, net.layers[2].W),
        'b2': numerical_gradient_array(loss_b2, net.layers[2].b)
    }

    def rel_error(a, b):
        return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b) + 1e-12)

    # One relative error per parameter; values close to 0 indicate agreement.
    return {k: rel_error(grads_bp[k], grads_num[k]) for k in grads_bp}
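To see what the relative error looks like for a correct and an incorrect gradient, the sketch below applies the same idea to a function whose gradient is known exactly. The test function $f(x) = \sum x^2$ and the compact helper `numerical_gradient` (which uses `np.ndindex` instead of `np.nditer`) are assumptions made for this illustration.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-4):
    # Compact central-difference gradient; x is perturbed in place and restored.
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

def rel_error(a, b):
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b) + 1e-12)

x = np.array([[1.0, -2.0], [0.5, 3.0]])
f = lambda v: np.sum(v ** 2)                        # analytic gradient is 2x

good = rel_error(2 * x, numerical_gradient(f, x))   # correct gradient
bad = rel_error(3 * x, numerical_gradient(f, x))    # deliberately wrong gradient
print(good, bad)
```

A correct gradient gives a relative error many orders of magnitude below 1 (the exact value depends on h and floating-point precision), while the deliberately wrong gradient gives about 0.2.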

Training Examples and Applications

Computational Graph Example: Buying Apples with Consumption Tax

def apple_tax_example():
    mul_apple = MulLayer()
    mul_tax = MulLayer()
    add_total = AddLayer()
    apple_price = 100
    apple_num = 2
    tax = 1.1
    # Forward pass
    apple_cost = mul_apple.forward(apple_price, apple_num)
    # Adding 0 leaves the subtotal unchanged; it only demonstrates AddLayer.
    total = mul_tax.forward(add_total.forward(apple_cost, 0), tax)
    # Backward pass (gradient of the total with respect to each input)
    dtotal = 1
    dadd, dtax = mul_tax.backward(dtotal)
    dapple_cost, dzero = add_total.backward(dadd)
    dprice, dnum = mul_apple.backward(dapple_cost)
    return total, dprice, dnum, dtax
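Worked by hand with the values used in the example (price 100, quantity 2, tax factor 1.1), the forward and backward passes give:

```latex
\begin{aligned}
\text{total} &= (100 \times 2 + 0) \times 1.1 = 220, \\
\frac{\partial\,\text{total}}{\partial\,\text{price}} &= \text{num} \times \text{tax} = 2 \times 1.1 = 2.2, \\
\frac{\partial\,\text{total}}{\partial\,\text{num}} &= \text{price} \times \text{tax} = 100 \times 1.1 = 110, \\
\frac{\partial\,\text{total}}{\partial\,\text{tax}} &= \text{price} \times \text{num} + 0 = 200 .
\end{aligned}
```

In floating-point arithmetic the returned values may differ from these in the last digits, since 1.1 has no exact binary representation.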

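Classification Training Example

A binary classification task on two-dimensional Gaussian data with one-hot labels. Parameters are updated with the SGD rule, here applied to the full dataset at every step.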

def one_hot(y, num_classes):
    # y is a 1-D array of integer class indices.
    m = y.shape[0]
    out = np.zeros((m, num_classes))
    out[np.arange(m), y] = 1
    return out


def make_toy_data(n=200):
    # Two unit-variance Gaussian clusters with different means.
    np.random.seed(42)
    c0 = np.random.randn(n // 2, 2) + np.array([-1.0, 0.5])
    c1 = np.random.randn(n // 2, 2) + np.array([1.0, -0.5])
    X = np.vstack([c0, c1])
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, one_hot(y, 2)


def train_demo():
    X, y = make_toy_data(200)
    net = TwoLayerNet(input_size=2, hidden_size=8, output_size=2, weight_scale=0.1)
    lr = 0.1
    loss_hist = []
    for epoch in range(1000):
        grads = net.gradient(X, y)
        for k in ['W1', 'b1', 'W2', 'b2']:
            net.params[k] -= lr * grads[k]  # in-place update
        # Re-bind the layer parameters to net.params. The in-place update
        # already keeps them in sync; this guards against earlier rebinding.
        net.layers[0].W, net.layers[0].b = net.params['W1'], net.params['b1']
        net.layers[2].W, net.layers[2].b = net.params['W2'], net.params['b2']
        loss = net.loss(X, y)
        loss_hist.append(loss)
        if epoch % 100 == 0:
            acc = net.accuracy(X, y)
            print(f"Epoch {epoch}, Loss {loss:.4f}, Acc {acc:.3f}")
    return net, loss_hist
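For readers who want to run the update loop in isolation, here is a self-contained sketch of the same training pattern on similar toy data. It is a deliberate simplification: it trains a single affine layer followed by softmax (softmax regression) instead of the two-layer network, and the seed, step count, and learning rate are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two Gaussian clusters, similar to the toy data described above
X = np.vstack([rng.normal(size=(n // 2, 2)) + [-1.0, 0.5],
               rng.normal(size=(n // 2, 2)) + [1.0, -0.5]])
labels = np.array([0] * (n // 2) + [1] * (n // 2))
Y = np.eye(2)[labels]                        # one-hot labels

W = np.zeros((2, 2)); b = np.zeros(2)        # single affine layer (simplification)
lr = 0.1

def loss_and_probs(W, b):
    z = X @ W + b
    z = z - z.max(axis=1, keepdims=True)     # stabilize softmax
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(Y * np.log(p), axis=1)), p

first_loss, _ = loss_and_probs(W, b)
for step in range(200):
    _, p = loss_and_probs(W, b)
    dz = (p - Y) / n                         # softmax + cross-entropy gradient
    W -= lr * (X.T @ dz)                     # gradient-descent update
    b -= lr * dz.sum(axis=0)
last_loss, p = loss_and_probs(W, b)
acc = np.mean(p.argmax(axis=1) == labels)
print(first_loss, last_loss, acc)
```

With zero-initialized weights the initial loss equals $\log 2$, the loss of a uniform guess over two classes, and it should decrease from there as training proceeds.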

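Summary

Error backpropagation applies the chain rule on a computational graph to pass the gradient of the loss efficiently to the parameters of every layer. Using layers as the unit of abstraction cleanly separates forward and backward logic and makes it easy to compose more complex networks. Numerical gradient checking helps confirm that a backpropagation implementation is correct, and Softmax + cross-entropy yields a simple, numerically stable gradient for classification. With these points in hand, you can implement a trainable neural network in pure Python/NumPy and build a foundation for more complex models.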