Deep-Learning-Oriented Python Crash Course (6): Learning-Related Techniques

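Overview of Learning-Related Techniques

This chapter covers the key engineering techniques for training multi-layer neural networks: weight initialization, Batch Normalization, Dropout, weight decay (L2 regularization), learning rates and schedules, early stopping with a validation set, and hyperparameter optimization. Each one improves gradient flow or generalization from a different angle, making training more stable and more efficient.

This post continues the layer-based (forward/backward) implementation style of the earlier posts and gives runnable code examples in pure Python/NumPy, together with the supporting mathematical derivations.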

Environment and Dependencies

All examples use NumPy only:

    import numpy as np

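Weight Initialization Strategies

In a deep network, poorly chosen initial weights make the variance of the activations grow or shrink exponentially from layer to layer, which leads to exploding or vanishing gradients. The goal of a good initialization is to keep the variance of both activations and gradients roughly constant across layers.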

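Xavier Initialization (suited to tanh/sigmoid)

Let a layer have input dimension \(n_{\text{in}}\) and output dimension \(n_{\text{out}}\). Under the idealized assumptions that the weights are i.i.d. with zero mean and the inputs are approximately zero-mean, each output satisfies \(\operatorname{Var}(y) = n_{\text{in}}\operatorname{Var}(W)\operatorname{Var}(x)\). Requiring \(\operatorname{Var}(y) = \operatorname{Var}(x)\) in the forward pass gives the weight variance:
$$ \operatorname{Var}(W) = \frac{1}{n_{\text{in}}}. $$

Common sampling choices are \( W \sim \mathcal{N}(0, \frac{1}{n_{\text{in}}}) \), or the uniform distribution \( U\big(-\sqrt{\frac{6}{n_{\text{in}}+n_{\text{out}}}},\ \sqrt{\frac{6}{n_{\text{in}}+n_{\text{out}}}}\big) \). The uniform variant has variance \(\frac{2}{n_{\text{in}}+n_{\text{out}}}\), a compromise between the forward condition above and the analogous backward condition \(\operatorname{Var}(W) = \frac{1}{n_{\text{out}}}\).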
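The forward variance condition can be checked numerically. The sketch below is only an illustration: the layer sizes, batch size, and seed are arbitrary assumptions, not values from the original post. It pushes unit-variance inputs through a bias-free linear layer with \(\operatorname{Var}(W) = 1/n_{\text{in}}\), and it also compares the empirical variance of the uniform variant with \(2/(n_{\text{in}}+n_{\text{out}})\).

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily for reproducibility
n_in, n_out, batch = 512, 256, 2000     # illustrative sizes (assumptions)

# Zero-mean, unit-variance inputs through a bias-free linear layer
x = rng.standard_normal((batch, n_in))
W = rng.standard_normal((n_in, n_out)) * np.sqrt(1.0 / n_in)
y = x @ W

var_ratio = y.var() / x.var()           # expected to be close to 1

# Uniform variant: U(-a, a) has variance a^2 / 3 = 2 / (n_in + n_out)
a = np.sqrt(6.0 / (n_in + n_out))
W_u = rng.uniform(-a, a, size=(n_in, n_out))
target_var = 2.0 / (n_in + n_out)

print(f"Var(y) / Var(x)      = {var_ratio:.3f}")
print(f"uniform Var(W)       = {W_u.var():.5f}")
print(f"2 / (n_in + n_out)   = {target_var:.5f}")
```

A ratio near 1 means a single layer neither amplifies nor attenuates the signal, which is the property that has to hold at every layer for the variance to stay stable in depth.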

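He Initialization (suited to ReLU)

ReLU zeroes the negative half of its input, so for a symmetric zero-mean pre-activation about half of the outputs are zero and the second moment of the activation is halved. To compensate for this halving of the "effective number of channels", the initialization variance is doubled:
$$ \operatorname{Var}(W) = \frac{2}{n_{\text{in}}}. $$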
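The effect of the factor 2 shows up quickly in a deep ReLU stack. The following sketch is a hypothetical experiment (the helper name relu_stack_rms, the width, depth, batch size, and seed are all assumptions made for illustration): it records the root-mean-square activation after each layer of a random ReLU network, once with \(\operatorname{Var}(W) = 1/n_{\text{in}}\) and once with \(\operatorname{Var}(W) = 2/n_{\text{in}}\).

```python
import numpy as np

def relu_stack_rms(scale, depth=10, width=512, batch=256, seed=0):
    """RMS activation after each layer of a random ReLU MLP with Var(W) = scale / width."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    rms = []
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * np.sqrt(scale / width)
        h = np.maximum(0.0, h @ W)                  # affine (no bias) followed by ReLU
        rms.append(float(np.sqrt(np.mean(h ** 2))))
    return rms

xavier_rms = relu_stack_rms(scale=1.0)   # Var(W) = 1 / n_in
he_rms = relu_stack_rms(scale=2.0)       # Var(W) = 2 / n_in

for layer in (0, 4, 9):
    print(f"layer {layer + 1:2d}: 1/n_in -> {xavier_rms[layer]:.4f}, 2/n_in -> {he_rms[layer]:.4f}")
```

With \(1/n_{\text{in}}\) the mean square is halved at every layer, so the RMS shrinks by roughly \(\sqrt{2}\) per layer, while \(2/n_{\text{in}}\) keeps it on the order of 1. Because both runs share a seed and ReLU is positively homogeneous, the two curves differ by a factor of \(2^{l/2}\) at layer \(l\).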

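Python Implementation

    def xavier_init(n_in, n_out):
        # Xavier (forward-only form): Var(W) = 1 / n_in
        std = np.sqrt(1.0 / n_in)
        return np.random.randn(n_in, n_out) * std

    def he_init(n_in, n_out):
        # He: Var(W) = 2 / n_in, compensating for ReLU zeroing half of its inputs
        std = np.sqrt(2.0 / n_in)
        return np.random.randn(n_in, n_out) * std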

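Batch Normalization

BN standardizes the activations of each layer over the mini-batch. This stabilizes the distribution of activations and gradients, mitigates vanishing/exploding gradients, and permits a higher learning rate.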

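Forward Pass

For a mini-batch \(\{x_i\}_{i=1}^m\), BN applies a standardization followed by an affine transform:
$$ \mu_B = \frac{1}{m}\sum_i x_i,\quad \sigma_B^2 = \frac{1}{m}\sum_i (x_i - \mu_B)^2 $$
$$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}},\quad y_i = \gamma\hat{x}_i + \beta $$
Here \(\gamma\) and \(\beta\) are learnable parameters. At inference time, moving averages of \(\mu\) and \(\sigma^2\) collected during training replace the batch statistics.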
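A small numerical example makes the two steps concrete. This is a sketch with made-up data (the batch shape, the input mean and scale, and the values of gamma and beta are assumptions): after standardization each feature has mean 0 and standard deviation close to 1, and the affine transform then sets the mean to \(\beta\) and the standard deviation to \(|\gamma|\).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(128, 4))   # batch of 128 samples, 4 features

eps = 1e-5
mu = x.mean(axis=0)                  # per-feature batch mean
var = x.var(axis=0)                  # biased (1/m) variance, as in the formula
x_hat = (x - mu) / np.sqrt(var + eps)

gamma = np.array([1.0, 2.0, 0.5, 1.0])   # illustrative scale values
beta = np.array([0.0, 1.0, -1.0, 3.0])   # illustrative shift values
y = gamma * x_hat + beta

print("x_hat mean:", np.round(x_hat.mean(axis=0), 6))
print("x_hat std :", np.round(x_hat.std(axis=0), 6))
print("y mean    :", np.round(y.mean(axis=0), 6))
print("y std     :", np.round(y.std(axis=0), 6))
```

The standard deviation of x_hat is slightly below 1 because of the \(\varepsilon\) in the denominator.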

Backward Pass

Let the upstream gradient be \(dY\). In training mode, the BN gradients are:
$$ d\beta = \sum_i dY_i,\quad d\gamma = \sum_i dY_i\,\hat{x}_i $$
$$ d\hat{x}_i = dY_i\,\gamma $$
$$ dx_i = \frac{1}{m}\frac{1}{\sqrt{\sigma_B^2+\varepsilon}}\Big(m\,d\hat{x}_i - \sum_j d\hat{x}_j - \hat{x}_i\sum_j d\hat{x}_j\,\hat{x}_j\Big) $$

Python Implementation (Training/Inference)

    class BatchNorm:
        def __init__(self, dim, eps=1e-5, momentum=0.9):
            self.gamma = np.ones(dim)     # learnable scale
            self.beta = np.zeros(dim)     # learnable shift
            self.eps = eps
            self.momentum = momentum
            self.running_mean = np.zeros(dim)
            self.running_var = np.ones(dim)

        def forward(self, x, training=True):
            self.training = training
            if training:
                self.mu = x.mean(axis=0)
                self.var = x.var(axis=0)  # biased (1/m) variance, matching the formula
                self.x_hat = (x - self.mu) / np.sqrt(self.var + self.eps)
                out = self.gamma * self.x_hat + self.beta
                # Update the moving averages used at inference time
                self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * self.mu
                self.running_var = self.momentum * self.running_var + (1 - self.momentum) * self.var
            else:
                x_hat = (x - self.running_mean) / np.sqrt(self.running_var + self.eps)
                out = self.gamma * x_hat + self.beta
            self.x = x
            return out

        def backward(self, dout):
            # Relies on the batch statistics cached by the latest training-mode forward pass
            m = dout.shape[0]
            dgamma = np.sum(dout * self.x_hat, axis=0)
            dbeta = np.sum(dout, axis=0)
            dxhat = dout * self.gamma
            inv_std = 1.0 / np.sqrt(self.var + self.eps)
            sum_dxhat = np.sum(dxhat, axis=0)
            sum_dxhat_xhat = np.sum(dxhat * self.x_hat, axis=0)
            dx = (inv_std / m) * (m * dxhat - sum_dxhat - self.x_hat * sum_dxhat_xhat)
            self.dgamma, self.dbeta = dgamma, dbeta
            return dx
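One way to gain confidence in the backward formula is a finite-difference gradient check. The sketch below restates the forward and backward passes as two standalone functions (bn_forward, bn_backward, and numerical_grad are hypothetical helper names, and the scalar test loss sum(y * w) with a random weight matrix w is an assumption chosen so that the upstream gradient is simply w) and compares the analytic gradients with central differences.

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, (x_hat, var, gamma, eps)

def bn_backward(dout, cache):
    x_hat, var, gamma, eps = cache
    m = dout.shape[0]
    dxhat = dout * gamma
    inv_std = 1.0 / np.sqrt(var + eps)
    dx = (inv_std / m) * (m * dxhat - dxhat.sum(axis=0)
                          - x_hat * (dxhat * x_hat).sum(axis=0))
    return dx, (dout * x_hat).sum(axis=0), dout.sum(axis=0)

def numerical_grad(f, arr, h=1e-5):
    """Central-difference gradient of the scalar function f() with respect to arr."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + h
        fp = f()
        arr[idx] = old - h
        fm = f()
        arr[idx] = old                      # restore the original entry
        grad[idx] = (fp - fm) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 3))
gamma = rng.standard_normal(3)
beta = rng.standard_normal(3)
w = rng.standard_normal((8, 3))             # fixed weights defining the test loss

def loss():
    y, _ = bn_forward(x, gamma, beta)
    return float(np.sum(y * w))

_, cache = bn_forward(x, gamma, beta)
dx, dgamma, dbeta = bn_backward(w, cache)   # upstream gradient of sum(y * w) is w

max_err_x = float(np.max(np.abs(dx - numerical_grad(loss, x))))
max_err_gamma = float(np.max(np.abs(dgamma - numerical_grad(loss, gamma))))
max_err_beta = float(np.max(np.abs(dbeta - numerical_grad(loss, beta))))
print(max_err_x, max_err_gamma, max_err_beta)
```

All three maximum errors should be tiny compared with the gradient magnitudes. A large discrepancy usually points to a missing term in dx or to a mismatch between the biased and unbiased variance.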

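Dropout

During training, Dropout randomly "masks" a fraction of the neurons, which reduces co-adaptation and improves generalization. To keep the expected activation unchanged, the surviving activations are divided by the keep probability during training (inverted dropout), so no rescaling is needed at inference time.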

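Forward/Backward and Inference

    class Dropout:
        def __init__(self, keep_prob=0.5):
            self.keep_prob = keep_prob

        def forward(self, x, training=True):
            if training:
                # Inverted dropout: keep with probability keep_prob, rescale by 1 / keep_prob
                self.mask = (np.random.rand(*x.shape) < self.keep_prob) / self.keep_prob
                return x * self.mask
            else:
                self.mask = 1.0  # identity at inference, so backward passes gradients through
                return x

        def backward(self, dout):
            # Gradients flow only through the units that were kept
            return dout * self.mask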
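The claim that scaling by the keep probability preserves the expectation is easy to verify empirically. In this sketch the input is an all-ones array and the keep probability, array shape, and seed are illustrative assumptions. The mask takes only the values 0 and 1/keep_prob, and the mean output stays close to the mean input.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.8
x = np.ones((1000, 100))

# Inverted-dropout mask: 0 for dropped units, 1 / keep_prob for kept units
mask = (rng.random(x.shape) < keep_prob) / keep_prob
y = x * mask

kept_fraction = float(np.mean(mask > 0))
mean_out = float(y.mean())

print("mask values  :", np.unique(mask))
print("kept fraction:", round(kept_fraction, 4))
print("mean output  :", round(mean_out, 4))
```

The kept fraction approaches keep_prob and the mean output approaches the mean input as the number of units grows, which is why the inference path can return x unchanged.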

Weight Decay (L2 Regularization)

The total loss is \( L = L_{\text{data}} + \frac{\lambda}{2}\sum \|W\|^2 \). Its gradient with respect to \(W\) is:
$$ \frac{\partial L}{\partial W} = \frac{\partial L_{\text{data}}}{\partial W} + \lambda W. $$
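The name "weight decay" comes from what this gradient does inside a plain SGD step: \(W - \eta(g + \lambda W) = (1 - \eta\lambda)W - \eta g\), that is, the weights are shrunk by a constant factor before the data gradient is applied. The sketch below checks this identity, and the penalty gradient \(\lambda W\), on random arrays. The array g_data is a stand-in for the data gradient, and the learning rate and \(\lambda\) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
g_data = rng.standard_normal((4, 3))   # stand-in for dL_data/dW
lr, lam = 0.1, 1e-2

# (a) SGD step on the penalized loss
W_a = W - lr * (g_data + lam * W)
# (b) shrink the weights first, then apply the data gradient
W_b = (1.0 - lr * lam) * W - lr * g_data

def penalty(W):
    return 0.5 * lam * np.sum(W * W)

# Central-difference check of d(penalty)/dW at entry (0, 0)
h = 1e-5
Wp, Wm = W.copy(), W.copy()
Wp[0, 0] += h
Wm[0, 0] -= h
num_grad_00 = (penalty(Wp) - penalty(Wm)) / (2 * h)

print("updates match:", np.allclose(W_a, W_b))
print("numerical vs lam * W:", num_grad_00, lam * W[0, 0])
```

The equivalence holds for plain SGD. For adaptive optimizers such as Adam, adding \(\lambda W\) to the gradient and shrinking the weights directly are no longer the same update.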

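Python Integration

    def l2_regularization(params, lam):
        # Penalty term (lam / 2) * sum of squared weights
        return 0.5 * lam * sum(np.sum(W * W) for W in params)

    def add_l2_to_grads(grads, params_dict, lam):
        # Add lam * W to the gradient of every weight matrix (keys starting with 'W');
        # biases are left unregularized
        for k in params_dict:
            if k.startswith('W'):
                grads[k] += lam * params_dict[k]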

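Learning Rate and Optimizers

Learning Rate Schedules

  • Constant learning rate: lr = lr0
  • Step decay: multiply the learning rate by a factor \(\gamma\) once every step epochs, where step is the decay interval
  • Exponential decay: \( lr_t = lr_0 \cdot \gamma^t \)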

    def step_decay(epoch, lr0=0.1, drop=0.5, step=200):
        return lr0 * (drop ** (epoch // step))

    def exp_decay(epoch, lr0=0.1, gamma=0.99):
        return lr0 * (gamma ** epoch)
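To compare the two schedules it helps to translate the exponential factor into a half-life: the learning rate halves after \(\ln 0.5 / \ln \gamma\) epochs. The sketch below redefines the two schedules so that it runs on its own, evaluates step decay around its first boundary, and computes the value of gamma (called gamma_match here, a name introduced only for this example) that makes exponential decay halve every 200 epochs like the step schedule.

```python
import math

def step_decay(epoch, lr0=0.1, drop=0.5, step=200):
    return lr0 * (drop ** (epoch // step))

def exp_decay(epoch, lr0=0.1, gamma=0.99):
    return lr0 * (gamma ** epoch)

# Step decay only changes at multiples of `step`
step_values = [step_decay(e) for e in (0, 199, 200, 400)]

# Epochs needed for exponential decay with gamma = 0.99 to halve the learning rate
half_life = math.log(0.5) / math.log(0.99)

# gamma for which exponential decay halves every 200 epochs
gamma_match = 0.5 ** (1.0 / 200)

print("step decay at epochs 0/199/200/400:", step_values)
print("half-life for gamma = 0.99:", round(half_life, 2), "epochs")
print("matched gamma:", round(gamma_match, 6))
print("lr at epoch 200:", exp_decay(200, gamma=gamma_match), "vs", step_decay(200))
```

With gamma = 0.99 the learning rate halves roughly every 69 epochs, which is much faster than halving every 200 epochs, so the default values of the two schedules are not interchangeable.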

Optimizer Implementations

    class SGD:
        def __init__(self, lr=0.1):
            self.lr = lr

        def step(self, params, grads):
            for k in params:
                params[k] -= self.lr * grads[k]   # in-place update

    class SGDMomentum:
        def __init__(self, lr=0.1, momentum=0.9):
            self.lr = lr
            self.m = momentum   # momentum coefficient
            self.v = {}         # per-parameter velocity

        def step(self, params, grads):
            for k in params:
                v = self.v.get(k, np.zeros_like(params[k]))
                v = self.m * v - self.lr * grads[k]
                params[k] += v
                self.v[k] = v

    class Adam:
        def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
            self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
            self.m, self.v, self.t = {}, {}, 0

        def step(self, params, grads):
            self.t += 1
            for k in params:
                m = self.m.get(k, np.zeros_like(params[k]))
                v = self.v.get(k, np.zeros_like(params[k]))
                g = grads[k]
                # Exponential moving averages of the gradient and the squared gradient
                m = self.b1 * m + (1 - self.b1) * g
                v = self.b2 * v + (1 - self.b2) * (g * g)
                # Bias correction for the zero initialization of m and v
                m_hat = m / (1 - self.b1 ** self.t)
                v_hat = v / (1 - self.b2 ** self.t)
                params[k] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
                self.m[k], self.v[k] = m, v
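A useful way to read the Adam update is to work out its very first step by hand. With m and v initialized to zero, the bias correction at t = 1 gives m_hat = g and v_hat = g^2, so the update is \(-\eta\, g / (|g| + \varepsilon) \approx -\eta\,\operatorname{sign}(g)\) regardless of the gradient's magnitude. The sketch below (adam_first_step is a hypothetical helper written for this illustration, and the gradient values are made up) evaluates that step for gradients spanning several orders of magnitude.

```python
import numpy as np

def adam_first_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Parameter update produced by Adam at t = 1, starting from m = v = 0."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)      # bias correction undoes the (1 - beta1) factor
    v_hat = v / (1 - beta2)      # likewise for the second moment
    return -lr * m_hat / (np.sqrt(v_hat) + eps)

g = np.array([1e-3, -0.5, 200.0])   # gradients of very different scale
update = adam_first_step(g)
print(update)
```

Every component moves by about lr in the direction opposite to its gradient. This per-parameter normalization is why Adam's learning rate is typically set much smaller than the 0.1 used for SGD above.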

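Early Stopping and the Validation Set

During training, the validation set is used to estimate the generalization error. If the validation loss has not improved for a given number of epochs (the patience), training stops early.

    class EarlyStopping:
        def __init__(self, patience=20, min_delta=0.0):
            self.patience = patience      # epochs to wait without improvement
            self.min_delta = min_delta    # minimum decrease that counts as improvement
            self.best = np.inf
            self.wait = 0
            self.stop = False

        def step(self, val_loss):
            if val_loss < self.best - self.min_delta:
                self.best = val_loss
                self.wait = 0
            else:
                self.wait += 1
                if self.wait >= self.patience:
                    self.stop = True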
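The stopping rule is easiest to follow on a short, hand-made loss curve. The sketch below re-expresses the same logic as a standalone function (early_stop_epoch is a hypothetical name, and the validation losses are invented for the example) and reports the epoch at which training would stop.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (epoch at which to stop, best loss seen); epoch is None if never triggered."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, wait = loss, 0          # improvement: reset the counter
        else:
            wait += 1                     # no (sufficient) improvement this epoch
            if wait >= patience:
                return epoch, best
    return None, best

# Invented validation curve: improves until epoch 3, then drifts upward
val_losses = [1.00, 0.80, 0.65, 0.60, 0.61, 0.63, 0.62, 0.70]

stop_epoch, best = early_stop_epoch(val_losses, patience=3)
print(stop_epoch, best)    # 6 0.6

# Requiring an improvement of at least 0.1 stops one epoch earlier
stop_strict, best_strict = early_stop_epoch(val_losses, patience=3, min_delta=0.1)
print(stop_strict, best_strict)    # 5 0.65
```

Note that the rule only tells you when to stop. To actually benefit from it, the parameters from the best epoch have to be saved and restored, which the EarlyStopping class above leaves to the caller.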

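Hyperparameter Optimization (Random Search)

Compared with grid search, random search tries more distinct values of each individual hyperparameter for the same number of trials, so it covers the few hyperparameters that really matter more densely. The learning rate is commonly sampled log-uniformly.

    def sample_log_uniform(low=-4, high=0):
        # 10^U[low, high]
        return 10 ** (np.random.uniform(low, high))

    def random_search(train_fn, n_trials=10):
        # train_fn(hp) trains a model with hyperparameters hp and returns its validation loss
        results = []
        for _ in range(n_trials):
            hp = {
                'lr': sample_log_uniform(-4, -1),
                'hidden': int(np.random.choice([16, 32, 64])),
                'keep_prob': float(np.random.uniform(0.5, 0.9)),
                'use_bn': bool(np.random.choice([True, False])),
                'lam': float(np.random.choice([0.0, 1e-4, 1e-3]))
            }
            val_loss = train_fn(hp)
            results.append((hp, val_loss))
        # Return the (hyperparameters, validation loss) pair with the lowest loss
        results.sort(key=lambda x: x[1])
        return results[0]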
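The reason for sampling the learning rate log-uniformly becomes clear when counting how many samples land in each decade. The sketch below (the sample count, the seed, and the helper name decade_fractions are assumptions made for the illustration) compares \(10^{U[-4,-1]}\) with a naive uniform draw over the same interval.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

log_uniform = 10 ** rng.uniform(-4, -1, size=n)   # 10^U[-4, -1]
uniform = rng.uniform(1e-4, 1e-1, size=n)         # naive uniform over the same range

def decade_fractions(samples):
    """Fraction of samples in [1e-4, 1e-3), [1e-3, 1e-2), [1e-2, 1e-1]."""
    counts, _ = np.histogram(samples, bins=[1e-4, 1e-3, 1e-2, 1e-1])
    return counts / len(samples)

lf = decade_fractions(log_uniform)
uf = decade_fractions(uniform)
print("log-uniform:", np.round(lf, 3))
print("uniform    :", np.round(uf, 3))
```

Log-uniform sampling spends about a third of the trials in each decade, whereas uniform sampling puts roughly 90% of them between 0.01 and 0.1 and almost never explores learning rates below 0.001.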

Combined Network: an MLP with BN/Dropout/L2 Support

    def softmax(x):
        # Subtract the row-wise maximum for numerical stability
        x = x - np.max(x, axis=1, keepdims=True)
        exp_x = np.exp(x)
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)

    def cross_entropy(y_true, y_pred):
        # y_true is one-hot; clip the probabilities to avoid log(0)
        eps = 1e-15
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

    class Affine:
        def __init__(self, W, b):
            self.W, self.b = W, b

        def forward(self, x):
            self.x = x
            return np.dot(x, self.W) + self.b

        def backward(self, dout):
            self.dW = np.dot(self.x.T, dout)
            self.db = np.sum(dout, axis=0)
            return np.dot(dout, self.W.T)

    class ReLU:
        def forward(self, x):
            self.mask = (x <= 0)
            out = x.copy()
            out[self.mask] = 0
            return out

        def backward(self, dout):
            dout = dout.copy()   # do not mutate the caller's array
            dout[self.mask] = 0
            return dout

    class SoftmaxWithLoss:
        def forward(self, x, y_true):
            self.y_true = y_true
            self.y_pred = softmax(x)
            return cross_entropy(y_true, self.y_pred)

        def backward(self, dout=1.0):
            # Gradient of the mean cross-entropy with respect to the scores
            m = self.y_true.shape[0]
            return dout * (self.y_pred - self.y_true) / m

    class MLP:
        def __init__(self, input_dim, hidden_dim, output_dim, use_bn=True, keep_prob=0.8,
                     init='he', lam=0.0):
            init_fn = he_init if init == 'he' else xavier_init
            self.params = {
                'W1': init_fn(input_dim, hidden_dim), 'b1': np.zeros(hidden_dim),
                'W2': init_fn(hidden_dim, output_dim), 'b2': np.zeros(output_dim)
            }
            self.affine1 = Affine(self.params['W1'], self.params['b1'])
            self.relu = ReLU()
            self.bn = BatchNorm(hidden_dim) if use_bn else None
            self.drop = Dropout(keep_prob) if keep_prob < 1.0 else None
            self.affine2 = Affine(self.params['W2'], self.params['b2'])
            self.last = SoftmaxWithLoss()
            self.use_bn = use_bn
            self.keep_prob = keep_prob
            self.lam = lam
            if use_bn:
                # Register the BN scale/shift so the optimizer trains them as well.
                # The arrays are shared with the layer and the optimizers update in place.
                self.params['gamma'] = self.bn.gamma
                self.params['beta'] = self.bn.beta

        def predict(self, X, training=True):
            # Layer order: Affine -> ReLU -> BN -> Dropout -> Affine.
            # Pass training=False for inference.
            out = self.affine1.forward(X)
            out = self.relu.forward(out)
            if self.use_bn:
                out = self.bn.forward(out, training=training)
            if self.drop is not None:
                out = self.drop.forward(out, training=training)
            out = self.affine2.forward(out)
            return out

        def loss(self, X, y, training=True):
            scores = self.predict(X, training=training)
            data_loss = self.last.forward(scores, y)
            reg = l2_regularization([self.params['W1'], self.params['W2']], self.lam)
            return data_loss + reg

        def accuracy(self, X, y):
            scores = self.predict(X, training=False)
            y_pred = np.argmax(scores, axis=1)
            y_true = np.argmax(y, axis=1)
            return np.mean(y_pred == y_true)

        def gradient(self, X, y):
            # Forward pass (caches the intermediate values needed by backward)
            self.loss(X, y, training=True)
            # Backward pass, in reverse layer order
            dout = self.last.backward(1.0)
            dout = self.affine2.backward(dout)
            if self.drop is not None:
                dout = self.drop.backward(dout)
            if self.use_bn:
                dout = self.bn.backward(dout)
            dout = self.relu.backward(dout)
            dout = self.affine1.backward(dout)
            grads = {
                'W1': self.affine1.dW, 'b1': self.affine1.db,
                'W2': self.affine2.dW, 'b2': self.affine2.db
            }
            if self.use_bn:
                grads['gamma'], grads['beta'] = self.bn.dgamma, self.bn.dbeta
            # L2 regularization (applied to keys starting with 'W' only)
            add_l2_to_grads(grads, self.params, self.lam)
            return grads
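The backward pass of SoftmaxWithLoss relies on the fact that the gradient of the mean cross-entropy with respect to the scores is (y_pred - y_true) / m. The sketch below checks this against central differences on a small random problem. The problem size, the seed, and the helper name mean_cross_entropy are assumptions, and the clipping used in cross_entropy is omitted because the scores here are moderate.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x, axis=1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=1, keepdims=True)

def mean_cross_entropy(scores, y_true):
    return -np.mean(np.sum(y_true * np.log(softmax(scores)), axis=1))

rng = np.random.default_rng(0)
m, k = 5, 3                                        # 5 samples, 3 classes
scores = rng.standard_normal((m, k))
y_true = np.eye(k)[rng.integers(0, k, size=m)]     # random one-hot labels

# Analytic gradient used by the backward pass
analytic = (softmax(scores) - y_true) / m

# Central-difference gradient, one score entry at a time
numerical = np.zeros_like(scores)
h = 1e-5
for idx in np.ndindex(*scores.shape):
    old = scores[idx]
    scores[idx] = old + h
    fp = mean_cross_entropy(scores, y_true)
    scores[idx] = old - h
    fm = mean_cross_entropy(scores, y_true)
    scores[idx] = old                              # restore the entry
    numerical[idx] = (fp - fm) / (2 * h)

max_err = float(np.max(np.abs(analytic - numerical)))
print("max abs difference:", max_err)
```

The same finite-difference loop can be pointed at MLP.gradient to check the whole network, provided Dropout is disabled (keep_prob=1.0) so that the loss is deterministic between evaluations.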

Putting It Together: Training Comparison (He vs. Xavier, BN/Dropout)

    def one_hot(y, num_classes):
        m = y.shape[0]
        out = np.zeros((m, num_classes))
        out[np.arange(m), y] = 1
        return out

    def make_spiral(n=300, k=3):
        # k interleaved spiral arms with n points each
        np.random.seed(0)
        X = np.zeros((n * k, 2))
        y = np.zeros(n * k, dtype=int)
        for j in range(k):
            ix = np.arange(n * j, n * (j + 1))
            r = np.linspace(0.0, 1, n)                                        # radius
            t = np.linspace(j * 4, (j + 1) * 4, n) + np.random.randn(n) * 0.2   # angle plus noise
            X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
            y[ix] = j
        return X, one_hot(y, k)

    def train_mlp_tips(init='he', use_bn=True, keep_prob=0.8, lam=1e-4,
                       optimizer='adam', epochs=600, schedule='exp'):
        X, y = make_spiral(300, 3)
        # Train/validation split (80/20)
        idx = np.arange(X.shape[0])
        np.random.shuffle(idx)
        split = int(0.8 * len(idx))
        tr_idx, va_idx = idx[:split], idx[split:]
        Xtr, ytr = X[tr_idx], y[tr_idx]
        Xva, yva = X[va_idx], y[va_idx]

        net = MLP(input_dim=2, hidden_dim=64, output_dim=3,
                  use_bn=use_bn, keep_prob=keep_prob, init=init, lam=lam)
        opt = {'sgd': SGD(lr=0.1), 'mom': SGDMomentum(lr=0.1), 'adam': Adam(lr=1e-3)}[optimizer]
        es = EarlyStopping(patience=50, min_delta=1e-4)
        lr0 = 1e-3 if optimizer == 'adam' else 0.1

        loss_hist = []
        for epoch in range(epochs):
            # Learning rate schedule
            if schedule == 'exp':
                lr = exp_decay(epoch, lr0=lr0, gamma=0.995)
            else:
                lr = step_decay(epoch, lr0=lr0)
            opt.lr = lr

            # Full-batch training (can be changed to mini-batches)
            grads = net.gradient(Xtr, ytr)
            opt.step(net.params, grads)
            # Sync parameters back to the layers (a no-op for the in-place optimizers above)
            net.affine1.W, net.affine1.b = net.params['W1'], net.params['b1']
            net.affine2.W, net.affine2.b = net.params['W2'], net.params['b2']

            # Training loss is measured in training mode, so it includes dropout noise
            tr_loss = net.loss(Xtr, ytr, training=True)
            va_loss = net.loss(Xva, yva, training=False)
            loss_hist.append((tr_loss, va_loss))
            if epoch % 50 == 0:
                acc_tr = net.accuracy(Xtr, ytr)
                acc_va = net.accuracy(Xva, yva)
                print(f"Epoch {epoch:03d} | LR {lr:.5f} | Train {tr_loss:.4f} | Val {va_loss:.4f} | Acc {acc_tr:.3f}/{acc_va:.3f}")
            es.step(va_loss)
            if es.stop:
                print(f"Early stopping at epoch {epoch}")
                break
        return net, loss_hist