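Overview of Error Backpropagation
Error backpropagation (BP) is the core algorithm for training multilayer neural networks. Building on computational graphs and the chain rule, it propagates the gradient of the loss with respect to the output backward through the network to the parameters of every layer, which makes gradient computation and parameter updates efficient.
Goals of this chapter:
- Derive the mathematics of backpropagation, starting from computational graphs and local derivatives;
- Use the "layer" as the unit of abstraction to build reusable forward/backward modules;
- Implement full backpropagation and gradient checking for a two-layer neural network;
- Train a classifier with Softmax + cross-entropy, with code examples.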
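Environment and dependencies
The examples in this article are based on NumPy, which must be imported first with `import numpy as np`.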
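Computational graphs and the chain rule
Computational graph
A computational graph represents the evaluation of a composite function as a directed acyclic graph. Each node is a basic operation (addition, multiplication, activation, affine transformation, and so on), and each edge represents the flow of data.
By recording the "local derivative" at each node (the derivative of the node's output with respect to its input), the chain rule can be applied along the graph to propagate gradients efficiently.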
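Chain rule
Consider the composite function \( y = f(g(x)) \) with loss \( L(y) \). Then
$$ \frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x} = \underbrace{\frac{\partial L}{\partial y}}_{\text{from upstream}} \cdot \underbrace{\frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}}_{\text{local derivative}}. $$
In a network, \( \partial L / \partial y \) is propagated backward layer by layer from the output toward the input. The backward function of each layer only has to multiply the layer's local derivative by the upstream gradient.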
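As a small worked instance of the chain rule above, the sketch below picks concrete functions purely for illustration (the choices \( g(x) = 3x + 1 \), \( f(u) = u^2 \) and \( L(y) = 0.5\,y \) are assumptions, not taken from the text) and compares the chained product with a central-difference estimate.

```python
# Hypothetical toy functions, chosen only to make the chain rule concrete.
def g(x):
    return 3.0 * x + 1.0   # inner function, dg/dx = 3

def f(u):
    return u ** 2          # outer function, df/du = 2u

def loss(y):
    return 0.5 * y         # toy loss, dL/dy = 0.5

x = 2.0
u = g(x)                   # forward pass: u = 7
y = f(u)                   # forward pass: y = 49

dL_dy = 0.5                # gradient arriving from upstream
dy_du = 2.0 * u            # local derivative of f at u
du_dx = 3.0                # local derivative of g at x
dL_dx = dL_dy * dy_du * du_dx   # chain rule: product of the three factors

# Central-difference estimate of the same derivative
h = 1e-5
dL_dx_num = (loss(f(g(x + h))) - loss(f(g(x - h)))) / (2 * h)

print(dL_dx, dL_dx_num)
```

The two printed values should agree to many digits, which is the same idea the gradient check later in the article applies to a whole network.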
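Layer design and local gradients
Common operations are wrapped as layers that share a uniform forward(x) / backward(dout) interface:
- forward(x): runs the forward computation, returns the output, and caches any intermediate values the backward pass will need;
- backward(dout): runs the backward computation, receives the upstream gradient dout, returns the gradient with respect to the input, and stores the gradients of the layer's own parameters.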
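To make the interface concrete, here is a minimal sketch of a layer that follows it. The `Square` layer is hypothetical and does not appear elsewhere in this article; it was chosen only because its local derivative, \( 2x \), is easy to verify by hand.

```python
import numpy as np

class Square:
    """Hypothetical example layer computing y = x**2 element-wise."""

    def forward(self, x):
        self.x = x                   # cache the input for the backward pass
        return x ** 2

    def backward(self, dout):
        # upstream gradient times the local derivative dy/dx = 2x
        return dout * 2.0 * self.x

layer = Square()
x = np.array([1.0, -2.0, 3.0])
out = layer.forward(x)                 # element-wise squares of x
dx = layer.backward(np.ones_like(x))   # upstream gradient of all ones
print(out, dx)
```

This layer has no parameters, so backward only returns the input gradient; a layer with weights would also store their gradients as attributes.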
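Multiplication and addition layers
The multiplication layer (MulLayer) and the addition layer (AddLayer) demonstrate the basic mechanics of backpropagation on a computational graph.
Local derivatives:
- Multiplication: \( y = x_1 x_2 \Rightarrow \partial y/\partial x_1 = x_2,\ \partial y/\partial x_2 = x_1 \)
- Addition: \( y = x_1 + x_2 \Rightarrow \partial y/\partial x_1 = 1,\ \partial y/\partial x_2 = 1 \)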
```python
class MulLayer:
    def forward(self, x, y):
        # Cache both inputs: each one is the local derivative of the other
        self.x = x
        self.y = y
        return x * y

    def backward(self, dout):
        dx = dout * self.y  # dL/dx = dout * y
        dy = dout * self.x  # dL/dy = dout * x
        return dx, dy


class AddLayer:
    def forward(self, x, y):
        return x + y

    def backward(self, dout):
        # Addition passes the upstream gradient through unchanged
        return dout, dout
```
Activation layers: Sigmoid and ReLU
Local derivatives of the nonlinear activations:
- Sigmoid: \( \sigma(x) = 1/(1+e^{-x}) \Rightarrow \sigma'(x) = \sigma(x)(1-\sigma(x)) \)
- ReLU: \( \text{ReLU}(x) = \max(0, x) \Rightarrow \text{ReLU}'(x) = \begin{cases}1, & x>0 \\ 0, & x\le 0\end{cases} \)
```python
class Sigmoid:
    def forward(self, x):
        # Clip the input so that np.exp cannot overflow
        self.out = 1 / (1 + np.exp(-np.clip(x, -500, 500)))
        return self.out

    def backward(self, dout):
        # sigma'(x) = sigma(x) * (1 - sigma(x)), using the cached output
        return dout * self.out * (1 - self.out)


class ReLU:
    def forward(self, x):
        self.mask = (x <= 0)  # positions where the gradient is blocked
        out = x.copy()
        out[self.mask] = 0
        return out

    def backward(self, dout):
        dx = dout.copy()  # copy so the caller's array is not modified
        dx[self.mask] = 0
        return dx
```
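The two derivative formulas for the activations can be sanity-checked numerically. The following is a minimal standalone sketch (it defines its own `sigmoid` and `relu` helpers rather than using the classes above, and the sample points are arbitrary values away from zero, where ReLU is not differentiable).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.5, 2.0])   # arbitrary points, none at the kink x = 0
h = 1e-5

# Sigmoid: closed-form derivative vs. central difference
sig_analytic = sigmoid(x) * (1.0 - sigmoid(x))
sig_numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# ReLU: the derivative is an indicator of x > 0
relu_analytic = (x > 0).astype(float)
relu_numeric = (relu(x + h) - relu(x - h)) / (2 * h)

print(np.max(np.abs(sig_analytic - sig_numeric)))
print(np.max(np.abs(relu_analytic - relu_numeric)))
```

Both printed differences should be very small.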
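Affine layer and Softmax with cross-entropy
The affine layer computes \( y = xW + b \). With \( \delta = \partial L/\partial y \) denoting the upstream gradient, its gradients are:
- \( \partial L/\partial W = x^\top \cdot \delta \)
- \( \partial L/\partial b = \sum \delta \) (summed over the samples in the batch)
- \( \partial L/\partial x = \delta W^\top \)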
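A quick way to trust these three formulas is to check their shapes and compare one of them against a numerical estimate. The sketch below is illustrative only: the batch size, feature sizes and the toy loss \( L = \sum (y \odot \delta) \) are assumptions, chosen so that \( \partial L/\partial y \) equals \( \delta \) exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))       # batch of 4 samples, 3 features
W = rng.standard_normal((3, 2))
b = rng.standard_normal(2)
delta = rng.standard_normal((4, 2))   # stand-in for the upstream gradient dL/dy

def toy_loss(W_):
    # Linear in y, so dL/dy is exactly delta
    return np.sum((x @ W_ + b) * delta)

dW = x.T @ delta          # (3, 4) @ (4, 2) -> (3, 2), same shape as W
db = delta.sum(axis=0)    # sum over the batch -> (2,), same shape as b
dx = delta @ W.T          # (4, 2) @ (2, 3) -> (4, 3), same shape as x

# Central-difference estimate of dL/dW, one entry at a time
h = 1e-5
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        dW_num[i, j] = (toy_loss(Wp) - toy_loss(Wm)) / (2 * h)

print(dW.shape, db.shape, dx.shape)
print(np.max(np.abs(dW - dW_num)))
```

Each gradient has the same shape as the quantity it differentiates with respect to, which is a useful first check when an implementation misbehaves.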
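The combination of softmax and cross-entropy has an important property: for each sample,
$$ \frac{\partial L}{\partial z} = \hat{y} - y, $$
where \( z \) denotes the unnormalized logits, \( \hat{y} = \text{softmax}(z) \), and \( y \) is the one-hot ground-truth label.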
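The identity \( \partial L/\partial z = \hat{y} - y \) can be confirmed numerically for a single sample. This is a minimal sketch with hypothetical helper names (`softmax_1d`, `ce_loss`) and arbitrary logits; it is a check of the formula, not part of the network code.

```python
import numpy as np

def softmax_1d(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

def ce_loss(z, y):
    # cross-entropy of softmax(z) against a one-hot label y
    return -np.sum(y * np.log(softmax_1d(z)))

z = np.array([2.0, 1.0, -1.0])   # arbitrary logits
y = np.array([0.0, 1.0, 0.0])    # one-hot label for class 1

grad_analytic = softmax_1d(z) - y

# Central-difference estimate, one logit at a time
h = 1e-5
grad_numeric = np.zeros_like(z)
for i in range(z.size):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    grad_numeric[i] = (ce_loss(zp, y) - ce_loss(zm, y)) / (2 * h)

print(grad_analytic)
print(grad_numeric)
```

The components of the gradient sum to zero, because both \( \hat{y} \) and \( y \) sum to one.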
```python
def softmax(x):
    # Subtract the row-wise max for numerical stability; expects a 2D batch
    x = x - np.max(x, axis=1, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=1, keepdims=True)


def cross_entropy(y_true, y_pred):
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))


class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b

    def forward(self, x):
        self.x = x
        return np.dot(x, self.W) + self.b

    def backward(self, dout):
        dx = np.dot(dout, self.W.T)
        self.dW = np.dot(self.x.T, dout)
        self.db = np.sum(dout, axis=0)  # sum over the batch dimension
        return dx


class SoftmaxWithLoss:
    def forward(self, x, y_true):
        self.y_true = y_true
        self.y_pred = softmax(x)
        return cross_entropy(y_true, self.y_pred)

    def backward(self, dout=1.0):
        batch = self.y_true.shape[0]
        # Divide by the batch size because the loss is a mean over samples
        return dout * (self.y_pred - self.y_true) / batch
```
Backpropagation in a two-layer neural network
Structure of the two-layer network: Affine -> ReLU -> Affine -> SoftmaxWithLoss
```python
class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_scale=0.01):
        self.params = {
            'W1': np.random.randn(input_size, hidden_size) * weight_scale,
            'b1': np.zeros(hidden_size),
            'W2': np.random.randn(hidden_size, output_size) * weight_scale,
            'b2': np.zeros(output_size)
        }
        # The Affine layers hold references to the arrays in self.params
        self.layers = [
            Affine(self.params['W1'], self.params['b1']),
            ReLU(),
            Affine(self.params['W2'], self.params['b2'])
        ]
        self.last = SoftmaxWithLoss()

    def predict(self, X):
        out = X
        for layer in self.layers:
            out = layer.forward(out)
        return out

    def loss(self, X, y):
        scores = self.predict(X)
        return self.last.forward(scores, y)

    def accuracy(self, X, y):
        scores = self.predict(X)
        y_pred = np.argmax(scores, axis=1)
        y_true = np.argmax(y, axis=1)
        return np.mean(y_pred == y_true)

    def gradient(self, X, y):
        # Forward pass (caches intermediate values in every layer)
        self.loss(X, y)
        # Backward pass, from the loss back to the first layer
        dout = self.last.backward(1.0)
        for layer in self.layers[::-1]:
            dout = layer.backward(dout)
        grads = {
            'W1': self.layers[0].dW,
            'b1': self.layers[0].db,
            'W2': self.layers[2].dW,
            'b2': self.layers[2].db,
        }
        return grads
```
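Numerical gradients and gradient checking
To verify that the backpropagation implementation is correct, approximate the gradient numerically with central differences.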
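Central differences are preferred over one-sided differences because their truncation error shrinks with \( h^2 \) rather than \( h \). The short sketch below illustrates this on a scalar function; the choice of \( \sin \) and the evaluation point are assumptions made only for the demonstration.

```python
import numpy as np

f = np.sin            # test function with a known derivative, cos
x0 = 1.0              # arbitrary evaluation point
h = 1e-4
exact = np.cos(x0)

forward = (f(x0 + h) - f(x0)) / h             # one-sided difference, error O(h)
central = (f(x0 + h) - f(x0 - h)) / (2 * h)   # central difference, error O(h^2)

err_forward = abs(forward - exact)
err_central = abs(central - exact)
print(err_forward, err_central)
```

With the same step size, the central estimate should be several orders of magnitude closer to the exact derivative.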
```python
def numerical_gradient_array(f, x, h=1e-4):
    """Central-difference gradient of f with respect to every entry of x."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        old_val = x[idx]
        x[idx] = old_val + h
        fxh1 = f(x)
        x[idx] = old_val - h
        fxh2 = f(x)
        x[idx] = old_val  # restore the original value
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        it.iternext()
    return grad


def gradient_check(net, X, y):
    """Return the relative error between backprop and numerical gradients."""
    affine1, affine2 = net.layers[0], net.layers[2]

    def loss_W1(W):
        affine1.W = W
        return net.loss(X, y)

    def loss_b1(b):
        affine1.b = b
        return net.loss(X, y)

    def loss_W2(W):
        affine2.W = W
        return net.loss(X, y)

    def loss_b2(b):
        affine2.b = b
        return net.loss(X, y)

    grads_bp = net.gradient(X, y)
    grads_num = {
        'W1': numerical_gradient_array(loss_W1, affine1.W.copy()),
        'b1': numerical_gradient_array(loss_b1, affine1.b.copy()),
        'W2': numerical_gradient_array(loss_W2, affine2.W.copy()),
        'b2': numerical_gradient_array(loss_b2, affine2.b.copy())
    }

    # The closures rebound the layers to temporary copies; re-attach them to
    # the shared parameter arrays so that later updates still reach the layers
    affine1.W, affine1.b = net.params['W1'], net.params['b1']
    affine2.W, affine2.b = net.params['W2'], net.params['b2']

    def rel_error(a, b):
        return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b) + 1e-12)

    return {k: rel_error(grads_bp[k], grads_num[k]) for k in grads_bp}
```
Training examples and applications
Computational graph example: buying apples with consumption tax
```python
def apple_tax_example():
    mul_apple = MulLayer()
    mul_tax = MulLayer()
    add_total = AddLayer()

    apple_price = 100
    apple_num = 2
    tax = 1.1

    # Forward pass: (price * count + 0) * tax
    apple_cost = mul_apple.forward(apple_price, apple_num)
    total = mul_tax.forward(add_total.forward(apple_cost, 0), tax)

    # Backward pass, in the reverse order of the forward pass
    dtotal = 1
    dadd, dtax = mul_tax.backward(dtotal)
    dapple_cost, dzero = add_total.backward(dadd)
    dprice, dnum = mul_apple.backward(dapple_cost)
    return total, dprice, dnum, dtax
```
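Classification training example
A binary classification task on two-dimensional Gaussian data, with one-hot labels and SGD-style updates (each step uses the full data set).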
```python
def one_hot(y, num_classes):
    m = y.shape[0]
    out = np.zeros((m, num_classes))
    out[np.arange(m), y] = 1
    return out


def make_toy_data(n=200):
    np.random.seed(42)
    # Two Gaussian clusters with unit variance and different means
    c0 = np.random.randn(n // 2, 2) + np.array([-1.0, 0.5])
    c1 = np.random.randn(n // 2, 2) + np.array([1.0, -0.5])
    X = np.vstack([c0, c1])
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, one_hot(y, 2)


def train_demo():
    X, y = make_toy_data(200)
    net = TwoLayerNet(input_size=2, hidden_size=8, output_size=2, weight_scale=0.1)
    lr = 0.1
    loss_hist = []
    for epoch in range(1000):
        grads = net.gradient(X, y)
        for k in ['W1', 'b1', 'W2', 'b2']:
            net.params[k] -= lr * grads[k]  # in-place gradient step
        # Keep the layers bound to the updated parameter arrays
        net.layers[0].W, net.layers[0].b = net.params['W1'], net.params['b1']
        net.layers[2].W, net.layers[2].b = net.params['W2'], net.params['b2']
        loss = net.loss(X, y)
        loss_hist.append(loss)
        if epoch % 100 == 0:
            acc = net.accuracy(X, y)
            print(f"Epoch {epoch}, Loss {loss:.4f}, Acc {acc:.3f}")
    return net, loss_hist
```
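Summary
Error backpropagation applies the chain rule on a computational graph to pass the gradient of the loss efficiently to the parameters of every layer. Using the layer as the unit of abstraction cleanly separates forward and backward logic and makes it easy to compose more complex network structures. Checking against numerical gradients verifies that a backpropagation implementation is correct, and Softmax + cross-entropy gives a stable and compact gradient for classification tasks. With these points in hand, a trainable neural network can be implemented in pure Python/NumPy, laying the groundwork for more complex models.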