cs231n 学习之旅（一）

Q1: k-Nearest Neighbor classifier¶

目标： 完成 knn.ipynb 以及 cs231n/classifiers/k_nearest_neighbor.py 中的距离计算、预测与交叉验证相关实现。

课程默认在 Colab + Google Drive 上完成，但笔者的 Colab 常出现模块导入异常、改动不同步等问题，于是改在本机完成作业——虚拟环境可固定依赖版本，IDE 也能单步调试，排查问题更高效。

环境与数据¶

1. 下载 CIFAR-10

在仓库根目录执行：

cd cs231n/datasets
chmod +x ./get_datasets.sh
./get_datasets.sh

脚本会自动下载并解压 CIFAR-10 数据集到当前目录。

2. Python 虚拟环境与依赖

cd /path/to/assignment1
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -U pip
pip install -r requirements.txt # 具体依赖可问 ai, 懒得贴了

之后在 knn.ipynb 里选用该 kernel 即可。

kNN 实现要点¶

1. 距离计算的三种方式¶

两层循环（compute_distances_two_loops）¶

最基础的实现，逐对计算 L2 距离：

dists[i, j] = np.sqrt(((X[i] - self.X_train[j]) ** 2).sum())

关键点：

** 2 逐元素平方
.sum() 对所有维度求和（这里输入是一维向量，结果标量）
np.sqrt 开方得到欧氏距离

一层循环（compute_distances_one_loop）¶

向量化优化，一次处理一个测试点与所有训练点：

dists[i] = np.sqrt(((self.X_train - X[i]) ** 2).sum(axis=-1))

关键点：

广播：X[i] 形状 (D,) 广播到 (num_train, D)
axis=-1 在最后一维（特征维）求和，得到每个训练点的距离

无循环（compute_distances_no_loops）¶

完全向量化，利用矩阵乘法恒等式：

x2 = np.sum(X * X, axis=1, keepdims=True)           # (num_test, 1)
t2 = np.sum(self.X_train ** 2, axis=1).reshape(1, -1) # (1, num_train)
cross = X @ self.X_train.T                          # (num_test, num_train)
dists = np.sqrt(x2 + t2 - 2.0 * cross)

数学原理：

\[\|x - y\|^2 = \|x\|^2 + \|y\|^2 - 2x^\top y\]

关键点：

keepdims=True 保持二维便于广播
np.maximum(dists_sq, 0.0) 防止浮点误差产生负数

2. predict_labels 实现¶

找 k 个最近邻¶

idx = np.argsort(dists[i])[:k]  # 距离从小到大的 k 个索引
closest_y = self.y_train[idx]   # 这 k 个邻居的标签

np.argsort 返回排序后的索引（不是值），取前 k 个就是最近的 k 个。

统计标签出现次数¶

dic = {}
for label in closest_y:
    dic[label] = dic.get(label, 0) + 1

找最多数标签（平票时取较小标签）¶

y_pred[i] = max(dic, key=lambda k: (dic[k], -k))

关键点：

max(dic, key=...) 按 key 函数返回值比较，但返回的是原字典的键
(dic[k], -k) 是元组比较：先比票数，再比负的标签值（越大表示标签越小）
这样实现「平票时取较小标签」

NumPy 复习¶

1. reshape 与维度操作¶

# -1 表示自动计算该维大小
a.reshape(2, -1)      # 2 行，列数自动
a.reshape(-1)       # 拉成一维

# keepdims 保持维度便于广播
np.sum(a, axis=1, keepdims=True)  # 结果 shape (n, 1) 而非 (n,)

2. 广播（Broadcasting）¶

从最后一维开始对齐，规则：

维度相等 → 直接运算
其中一个为 1 → 沿该维复制扩展
都不相等且都不是 1 → 报错

示例：

(num_test, 1, D) - (1, num_train, D)  # 广播为 (num_test, num_train, D)
(num_test, D) - (D,)                  # 广播为 (num_test, D)

3. 逐元素运算 vs 矩阵乘法¶

运算符	含义	示例
`*`	逐元素乘法（广播后）	`a * b`
`@`	矩阵乘法	`A @ B.T`
`**`	逐元素幂运算	`a ** 2`

注意：NumPy 数组没有 .sqrt() 方法，要用 np.sqrt(a)。

4. argsort 与排序¶

idx = np.argsort(a)      # 返回升序排列的索引
a[idx]                   # 等价于 np.sort(a)

# 二维数组按行排序
idx = np.argsort(dists, axis=1)  # 每行内部排序

Inline Question 参考答案¶

Q1: 距离矩阵的亮行/亮列¶

亮行：某测试点与几乎所有训练点距离都大 → 该测试点是离群点
亮列：某训练点与几乎所有测试点距离都大 → 该训练点是离群点

Q2: 不影响 L1 kNN 的预处理¶

答案：1、2、3

(1) 全局减均值：L1 距离不变（两边减相同常数）
(2) 逐像素减均值：L1 距离不变（每位置减相同值）
(3) 全局标准化：L1 距离同比例缩放，排序不变
(4) 逐像素标准化：变成加权 L1，排序可能变
(5) 旋转：改变像素对应关系，距离会变

Q2: Implement a Softmax classifier¶

Inline Question 1

Why do we expect our loss to be close to -log(0.1)? Explain briefly.

\(\color{blue}{\textit Your Answer:}\)

因为在随机初始化权重（权重值很小且随机）时，每个类别的得分大致相等。对于 10 类分类问题，这意味着每个类别的预测概率约为 \(1/10 = 0.1\)。Softmax 损失是负对数似然 \(L_i = -\log(p_{y_i})\)，其中 \(p_{y_i}\) 是正确类别的概率。当概率均匀分布时，\(p_{y_i} = 0.1\)，因此损失约为 \(-\log(0.1) \approx 2.302\)。

softmax 中梯度计算的实现¶

数学推导¶

1. 定义与符号

输入数据 \(X \in \mathbb{R}^{N \times D}\)，\(N\) 个样本，每个样本 \(D\) 维特征
权重矩阵 \(W \in \mathbb{R}^{D \times C}\)，\(C\) 个类别
标签 \(y \in \mathbb{R}^N\)，\(y_i \in \{0, 1, ..., C-1\}\)
得分矩阵 \(S = XW \in \mathbb{R}^{N \times C}\)
Softmax 概率：\(P_{ij} = \frac{e^{S_{ij}}}{\sum_{k=0}^{C-1} e^{S_{ik}}}\)

2. 损失函数

交叉熵损失（加正则化）：

\[ L = -\frac{1}{N} \sum_{i=0}^{N-1} \log(P_{i, y_i}) + \lambda \|W\|^2 \]

其中 \(P_{i, y_i}\) 是第 \(i\) 个样本正确类别的预测概率。

3. 梯度推导

我们需要计算 \(\frac{\partial L}{\partial W}\)。利用链式法则：

\[\frac{\partial L}{\partial W} = X^\top \frac{\partial L}{\partial S}\]

因此关键是求 \(\frac{\partial L}{\partial S}\)（记为 \(dS \in \mathbb{R}^{N \times C}\)）。

4. 计算 \(dS_{ij} = \frac{\partial L}{\partial S_{ij}}\)

考虑第 \(i\) 个样本的损失：\(L_i = -\log(P_{i, y_i})\)

分两种情况讨论：

情况一：\(j = y_i\)（正确类别）

\[P_{i, y_i} = \frac{e^{S_{i, y_i}}}{\sum_k e^{S_{ik}}}\]

\[\frac{\partial \log(P_{i, y_i})}{\partial S_{i, y_i}} = \frac{\partial}{\partial S_{i, y_i}} \left( S_{i, y_i} - \log\sum_k e^{S_{ik}} \right) = 1 - P_{i, y_i}\]

因此：

\[dS_{i, y_i} = -\frac{\partial \log(P_{i, y_i})}{\partial S_{i, y_i}} = -(1 - P_{i, y_i}) = P_{i, y_i} - 1\]

情况二：\(j \neq y_i\)（错误类别）

\[ \frac{\partial \log(P_{i, y_i})}{\partial S_{ij}} = \frac{\partial}{\partial S_{ij}} \left( S_{i, y_i} - \log\sum_k e^{S_{ik}} \right) = -\frac{e^{S_{ij}}}{\sum_k e^{S_{ik}}} = -P_{ij} \]

因此：

\[ dS_{ij} = -\frac{\partial \log(P_{i, y_i})}{\partial S_{ij}} = P_{ij} \]

5. 统一表达式

综上，对于第 \(i\) 个样本：

\[ dS_{ij} = \begin{cases} P_{ij} - 1 & \text{if } j = y_i \\ P_{ij} & \text{if } j \neq y_i \end{cases} \]

可以简洁地写成：

\[ dS_{ij} = P_{ij} - \mathbb{1}_{[j = y_i]} \]

其中 \(\mathbb{1}_{[j = y_i]}\) 是指示函数（条件成立时为1，否则为0）。

6. 完整的梯度

\[ \frac{\partial L}{\partial S} = P - M \]

其中 \(M\) 是掩码矩阵，\(M_{ij} = \mathbb{1}_{[j = y_i]}\)。

然后：

\[ \frac{\partial L}{\partial W} = X^\top (P - M) / N + 2\lambda W \]

代码实现对照¶

Naive 版本（对应代码 43-46 行）：

p_copy = p.copy()           # P: softmax 概率 (C,)
p_copy[y[i]] -= 1          # P - M: 正确类别位置减 1
dW += np.outer(X[i], p_copy)  # X[i]^T (P - M), 外积得到 (D, C)

p_copy[y[i]] -= 1 实现了 \(P - M\)（只有正确类别位置减1）
np.outer(X[i], p_copy) 计算 \(X[i]^\top (P_{i\cdot} - M_{i\cdot})\)

Vectorized 版本（对应代码 102-105 行）：

p_copy = p.copy()                     # P: (N, C)
p_copy[np.arange(N), y] -= 1          # 对所有样本的正确类别位置减 1
dW = X.T @ p_copy / N + 2 * reg * W   # X^T (P-M) / N + 正则化梯度

p_copy[np.arange(N), y] -= 1 向量化地实现了对所有 \(N\) 个样本构造 \(P - M\)
X.T @ p_copy 计算矩阵乘法 \(X^\top (P - M)\)，结果形状 \((D, C)\)
除以 \(N\) 求平均，加上正则化梯度 \(2\lambda W\)

Vectorized 版本的 loss 实现较简单，此处略过。

关键理解¶

为什么正确类别要减1？

损失函数只关心正确类别的概率 \(P_{i, y_i}\)，但通过 softmax 归一化，所有类别的得分都会影响它。对于正确类别，增大其得分会同时降低其他类别的相对概率，因此梯度是 \(P_{ij} - 1\)（鼓励增大）；对于错误类别，梯度是 \(P_{ij}\)（鼓励减小）。

为什么外积/矩阵乘法？

\(\frac{\partial L}{\partial W_{dc}} = \sum_i X_{id} \cdot dS_{ic}\)，这正是矩阵乘法 \(X^\top \cdot dS\) 的定义。

数值稳定性

代码中 scores -= np.max(scores) 是为了防止指数爆炸，这是 softmax 的常用技巧（减去最大值不改变概率分布，也不改变梯度）。

实现 SGD¶

随机采样使用 np.random.choice()

chosen_idx = np.random.choice(num_train, batch_size, replace=True)
X_batch = X[chosen_idx]
y_batch = y[chosen_idx]

Hint： 有放回的采样要比无放回的采样快，而 replace 对 SGD 的收敛性无影响。

SGD 本来就是一个样本可能被多次选中（多轮 epoch）
一个 batch 内有少量重复样本不影响梯度估计的无偏性
当 batch_size << num_train 时，重复概率很低

用 self.loss() 得到的 grad 和 learning_rate 更新 self.W：

self.W -= grad * learning_rate

接着实现 LinearClassifier 的 predict 方法，输入 X 乘以打分矩阵 self.W，然后返回每个图片得分最大位置的索引：

y_pred = np.argmax(X @ self.W, axis=1)

选择合适的 learning_rate 和 reg：

learning_rates = [1e-7, 1e-6]
regularization_strengths = [2.5e4, 1e4]

for learning_rate in learning_rates:
    for reg in regularization_strengths:
        softmax = Softmax()
        softmax.train(X_train, y_train, learning_rate=learning_rate, reg=reg,
        num_iters=1000, verbose=False)
        train_accuracy = np.sum(softmax.predict(X_train) == y_train) / float(y_train.size)
        val_accuracy = np.sum(softmax.predict(X_val) == y_val) / float(y_val.size)
        if val_accuracy > best_val:
            best_val = val_accuracy
            best_softmax = softmax
        results[(learning_rate, reg)] = (train_accuracy, val_accuracy)

Inline question 3

Describe what your visualized Softmax classifier weights look like, and offer a brief explanation for why they look the way they do.

\(\color{blue}{\textit Your Answer:}\) The visualized weights look like blurry "templates" or "prototypes" of each class. For example, the "car" class weight looks like a blurry car shape, the "horse" class looks like a blurry horse, and so on. This happens because the Softmax classifier learns a weight vector for each class that represents the "average" appearance of that class across all training examples. The classification is done by computing the dot product between an input image and each class's weight vector, which measures how similar the input is to the learned template. The weights appear blurry because they are essentially an average over all training images of that class, smoothing out the variations between individual instances.

权重外观：可视化后的 Softmax 分类器权重看起来像每个类别的模糊"模板"或"原型"（template/prototype）。具体来说：

"汽车"类别的权重看起来像一个模糊的汽车轮廓
"马"类别的权重看起来像一个模糊的马的形状
"飞机"类别的权重看起来有类似飞机翅膀的结构
每个类别的权重都能隐约看出该类物体的基本形态

为什么看起来是这样：

线性分类器的本质：Softmax 分类器为每个类别学习一个权重向量，这个向量代表了该类所有训练样本的平均外观。
分类原理：预测时，分类器计算输入图像与每个类别权重向量的点积（dot product），点积越大表示越相似。所以权重向量本质上就是一个"理想模板"。
模糊的原因：
- 权重是通过对所有训练样本取平均得到的
- 同一类别的不同训练图像在细节上各不相同（比如不同的马有不同的姿态、颜色、背景）
- 取平均后，共同的特征保留下来，差异的细节被平滑掉了
- 结果就是看起来模糊的、但保留该类主要视觉特征的模板

这是线性分类器的特性，与更复杂的神经网络（如 CNN）学到的特征不同——CNN 可以学到更精细、层次化的特征，而线性分类器只能学到这种简单的平均模板。

Inline Question 4 — True or False

Suppose the overall training loss is defined as the sum of the per-datapoint loss over all training examples. It is possible to add a new datapoint to a training set that would change the softmax loss, but leave the SVM loss unchanged.

\(\color{blue}{\textit Your Answer:}\) True

\(\color{blue}{\textit Your Explanation:}\) This is possible because SVM (hinge loss) can have zero loss for correctly classified examples with sufficient margin, while Softmax (cross-entropy) loss is always positive. When adding a new datapoint that is correctly classified with high confidence: (1) For SVM, if \(s_{y_i} > s_j + \Delta\) for all \(j \neq y_i\), the loss is 0, so total SVM loss remains unchanged. (2) For Softmax, even with correct classification, \(e^{s_{y_i}}/\sum_j e^{s_j} < 1\), so \(-\log(\text{probability}) > 0\), meaning Softmax loss always increases. Therefore, such a datapoint changes Softmax loss but leaves SVM loss unchanged.

问题： 假设总体训练损失定义为所有训练样本的每个数据点损失之和。是否可能添加一个新的数据点到训练集中，使得 Softmax 损失改变，但 SVM 损失保持不变？

这种情况是可能的。原因在于 SVM 和 Softmax 损失函数的本质区别：

SVM 损失函数为：

\[ L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + \Delta) \]

关键点：SVM 使用 max-margin 机制。
当一个样本被 正确分类 且置信度足够高（即 \(s_{y_i} > s_j + \Delta\) 对所有 \(j \ne y_i\)）时，该样本的 损失为 0。
因此，添加一个被完美分类的新样本 不会改变 SVM 的总体损失。

Softmax 损失函数为：

\[ L_i = -\log\left(\frac{e^{s_{y_i}}}{\sum_j e^{s_j}}\right) \]

关键点：Softmax 损失 永远不会为 0（除非某个类的概率为 1，其他为 0，这在现实中几乎不可能）。
任何数据点都会贡献正的交叉熵损失，因为：
即使分类正确，\(e^{s_{y_i}} / \sum_j e^{s_j} < 1\)（总存在其他类的非零概率）。
所以 \(-\log(\text{概率}) > 0\)。

具体例子：

考虑一个二类分类问题，假设：

当前模型参数使得对新样本 \(x\)，正确类别的分数 \(s_1 = 10\)，错误类别的分数 \(s_2 = 1\)。
对于 SVM（设 \(\Delta = 1\)）：\(\max(0, 1 - 10 + 1) = \max(0, -8) = 0\)，损失为 0。
对于 Softmax：\(-\log(e^{10} / (e^{10} + e^1)) \approx -\log(0.9999) \approx 0.0001 > 0\)，损失为正。

因此，添加这个新样本后：

SVM 损失不变（仍为 0）
Softmax 损失增加（增加了约 0.0001）

结论： 只要添加的数据点被 SVM 正确分类且满足 margin 条件，SVM 损失不变；但 Softmax 一定会增加损失，因为它永远不会输出概率为 1 的预测。

Q3: Two-Layer Neural Network¶

Affine layer 的 forward 和 backward 实现¶

forward:

x_reshaped = x.reshape(x.shape[0], -1)
out = x_reshaped @ w + b

先将 X 展平，然后直接线性输出即可

backward:

    Inputs:
    - dout: Upstream derivative, of shape (N, M)
    - cache: Tuple of:
      - x: Input data, of shape (N, d_1, ... d_k)
      - w: Weights, of shape (D, M)
      - b: Biases, of shape (M,)

    Returns a tuple of:
    - dx: Gradient with respect to x, of shape (N, d1, ..., d_k)
    - dw: Gradient with respect to w, of shape (D, M)
    - db: Gradient with respect to b, of shape (M,)

要根据前一层中传递回来的 Upstream derivative：dout (\(\frac{\partial L}{\partial out}\)) 计算

dx \((\frac{ \partial L }{ \partial x })\), dw \(\left( \frac{ \partial L }{ \partial w } \right)\), db \((\frac{ \partial L }{ \partial b })\)

\[ \begin{align} \frac{ \partial L }{ \partial x_{i,j} } &= \sum_{k=1}^{M} \frac{ \partial L }{ \partial out_{i,k} } \cdot \frac{ \partial out_{i,k} }{ \partial x_{i,j} } \\\ &= \sum_{k=1}^{M} \frac{ \partial L }{ \partial out_{i,k} } \cdot (w_{j,k}) \end{align} \]

即为 w 第 j 个行向量点乘 dout 第 i 个行向量，也即 dout 第 i 个行向量点乘 w 的转置的第 j 个列向量，故有：

\[ \frac{ \partial L }{ \partial x } = \frac{ \partial L }{ \partial out } \cdot w^T \]

同理分析可得 dw, db

    x_reshape = x.reshape(x.shape[0], -1)
    dx = (dout @ w.T).reshape(x.shape)
    dw = x_reshape.T @ dout
    db = np.sum(dout, axis=0)

ReLU 的 forward 和 backward 实现¶

\[ ReLU(x) = max(0, x) \]

forward:

    out = np.maximum(0, x)

backward:

    dx = dout * (x > 0)

~~宝宝巴士~~

Inline Question 1:¶

We've only asked you to implement ReLU, but there are a number of different activation functions that one could use in neural networks, each with its pros and cons. In particular, an issue commonly seen with activation functions is getting zero (or close to zero) gradient flow during backpropagation. Which of the following activation functions have this problem? If you consider these functions in the one dimensional case, what types of input would lead to this behaviour? 1. Sigmoid 2. ReLU 3. Leaky ReLU

\(\color{blue}{\textit Your Answer:}\)

1. Sigmoid

Sigmoid 函数在输入值非常大（正或负）时会出现梯度消失问题。

Sigmoid 公式：\(\sigma(x) = \frac{1}{1 + e^{-x}}\)
导数：\(\sigma'(x) = \sigma(x)(1 - \sigma(x))\)
问题区域：当 \(x\) 很大（\(x \gg 0\)）时，\(\sigma(x) \approx 1\)，导数趋近于 0；当 \(x\) 很小（\(x \ll 0\)）时，\(\sigma(x) \approx 0\)，导数也趋近于 0
最大梯度仅为 0.25（在 x=0 处），多层叠加后梯度会指数级衰减

2. ReLU

ReLU 在输入为负数时会出现梯度消失（"神经元死亡"问题）。

ReLU 公式：\(f(x) = \max(0, x)\)
导数：\(f'(x) = 1\)（当 x > 0），\(f'(x) = 0\)（当 x ≤ 0）
问题区域：当 \(x \leq 0\) 时，梯度完全为 0，神经元"死亡"，无法更新权重

3. Leaky ReLU

Leaky ReLU 不存在 梯度完全消失的问题。

Leaky ReLU 公式：\(f(x) = \max(\alpha x, x)\)，通常 \(\alpha = 0.01\)
导数：\(f'(x) = 1\)（当 x > 0），\(f'(x) = \alpha\)（当 x < 0）
特点：即使在负数区域，也有一个小的正梯度 \(\alpha\)，保证信息能够流动

总结对比：

激活函数	梯度消失区域	严重程度
Sigmoid	两端饱和区（\|x\| 很大）	高（最大梯度仅0.25）
ReLU	负半轴（x ≤ 0）	中（完全死亡）
Leaky ReLU	无	低（最小梯度为α）

"Sandwich" layers¶

实践中 Affine layer 往往接着就是 ReLU 等激活函数，故将 Affine layer 和 ReLU 结合成一层会获得很多 convenience。

已经实现，看看就行。

Loss layers: Softmax¶

直接 copy 前一个问题中的实现即可

    p = np.exp(x)
    p /= p.sum(axis=-1, keepdims=True)
    logp = np.log(p)

    N = x.shape[0]
    loss = -np.sum(logp[np.arange(N), y]) / N

    dx = p.copy()
    dx[np.arange(N), y] -= 1
    dx /= N

Two-layer network¶

初始化：注意 w1 和 w2 的维度

        self.w1 = np.random.normal(loc=0.0, scale=weight_scale, size=(input_dim, hidden_dim))
        self.w2 = np.random.normal(loc=0.0, scale=weight_scale, size=(hidden_dim, num_classes))
        self.b1 = np.zeros(hidden_dim)
        self.b2 = np.zeros(num_classes)
        self.params["W1"] = self.w1
        self.params["b1"] = self.b1
        self.params["W2"] = self.w2
        self.params["b2"] = self.b2

计算 loss：

如果 y = None, 直接返回 scores 即可。直接套用前面的实现：

        aff1_out, aff1_cache = affine_forward(X, self.params['W1'], self.params['b1'])
        relu_out, relu_cache = relu_forward(aff1_out)
        scores, aff2_cache = affine_forward(relu_out, self.params['W2'], self.params['b2'])

注意用字典来访问参数，而不是直接 self.w1 这样

如果 y 不为 None, 需要返回 loss 和 grads, 注意添加 L2 正则化, 为了方便检查以及简化 grads, 在前面乘以一个 0.5, 剩下的也是直接套用之前的实现即可

        loss, dscores = softmax_loss(scores, y)

        loss += 0.5 * self.reg * (np.sum(self.params['W1'] ** 2) + np.sum(self.params['W2'] ** 2))
        dx2, dw2, db2 = affine_backward(dscores, aff2_cache)
        drelu = relu_backward(dx2, relu_cache)
        dx1, dw1, db1 = affine_backward(drelu, aff1_cache)

        dw2 += self.reg * self.params['W2']
        dw1 += self.reg * self.params['W1']

        grads = {
            'W1': dw1,
            'b1': db1,
            'W2': dw2,
            'b2': db2
        }

Solver¶

使用 model = TwoLayerNet 达到约 36% 的正确率。抄作业即可：

solver = Solver(model, data,
                update_rule="sgd",
                optim_config={
                    'learning_rate': 1e-4,
                },
                lr_decay=0.95,
                num_epochs=5, batch_size=200,
                print_every=100)
solver.train()

注意 X_train, y_train, X_dev, y_dev 都包含在 data 中（之前的代码块中已经导入）

Output：

(Iteration 1 / 1225) loss: 2.303922
(Epoch 0 / 5) train acc: 0.069000; val_acc: 0.063000
(Iteration 101 / 1225) loss: 2.280164
(Iteration 201 / 1225) loss: 2.176490
(Epoch 1 / 5) train acc: 0.211000; val_acc: 0.257000
(Iteration 301 / 1225) loss: 2.104870
(Iteration 401 / 1225) loss: 1.997602
(Epoch 2 / 5) train acc: 0.294000; val_acc: 0.293000
(Iteration 501 / 1225) loss: 1.929971
(Iteration 601 / 1225) loss: 2.006990
(Iteration 701 / 1225) loss: 1.811825
(Epoch 3 / 5) train acc: 0.345000; val_acc: 0.332000
(Iteration 801 / 1225) loss: 1.792892
(Iteration 901 / 1225) loss: 1.808718
(Epoch 4 / 5) train acc: 0.370000; val_acc: 0.363000
(Iteration 1001 / 1225) loss: 1.892665
(Iteration 1101 / 1225) loss: 1.767647
(Iteration 1201 / 1225) loss: 1.700919
(Epoch 5 / 5) train acc: 0.376000; val_acc: 0.372000

Debug and Training¶

With the default parameters we provided above, you should get a validation accuracy of about 0.36 on the validation set. This isn't very good.

One strategy for getting insight into what's wrong is to plot the loss function and the accuracies on the training and validation sets during optimization.

Another strategy is to visualize the weights that were learned in the first layer of the network. In most neural networks trained on visual data, the first layer weights typically show some visible structure when visualized.

Tune your hyperparameters¶

What's wrong?

Loss 在或多或少地线性下降，这似乎表明学习率可能太低了。
训练准确率和验证准确率之间没有差距，这表明我们使用的模型容量较低，应该增加它的尺寸。
另一方面，对于非常大的模型，我们预期会看到更多的过拟合，这会表现为训练准确率和验证准确率之间存在非常大的差距。

Tuning

Tuning the hyperparameters and developing intuition for how they affect the final performance is a large part of using Neural Networks.

you should experiment with different values of the various hyperparameters, including

hidden layer size
learning rate
number of training epochs
regularization strength

You might also consider tuning the learning rate decay, but you should be able to get good performance using the default value.

Approximate results

You should be aim to achieve a classification accuracy of greater than 48% on the validation set. Our best network gets over 52% on the validation set.

Experiment

Your goal in this exercise is to get as good of a result on CIFAR-10 as you can (52% could serve as a reference), with a fully-connected Neural Network. Feel free implement your own techniques (e.g. PCA to reduce dimensionality, or adding dropout, or adding features to the solver, etc.).

input_size = 32 * 32 * 3
num_classes = 10
hidden_size_range = [50, 80, 100]
lr_range = [1e-4, 2e-4, 3e-4]
num_epochs_range = [5]
reg_range = [0.0, 0.2, 0.9]
best_hidden_size = 100
best_lr = 1e-4
best_num_epochs = 5
best_reg = 0.2
best_val_acc = 0.0

for hidden_size in hidden_size_range:
    for lr in lr_range:
        for num_epochs in num_epochs_range:
            for reg in reg_range:
                model = TwoLayerNet(input_size, hidden_size, num_classes, reg=reg)

                solver = Solver(model, data,
                                update_rule="sgd",
                                optim_config={
                                    'learning_rate': lr,
                                },
                                lr_decay=0.95,
                                num_epochs=num_epochs, batch_size=200,
                                print_every=100)
                solver.train()
                val_acc = max(solver.val_acc_history)
                if val_acc > best_val_acc :
                    best_val_acc = val_acc
                    best_model = model
                    best_hidden_size = hidden_size
                    best_lr = lr
                    best_num_epochs = num_epochs
                    best_reg = reg

Validation set accuracy:  0.48
Test set accuracy:  0.481
best_hidden_size, best_lr, best_reg: 100, 0.0003, 0.0
best_two_layer_net.npy saved.

~~给我训的 mac 都发烫了~~

What's PCA?¶

PCA（Principal Component Analysis，主成分分析）是一种 降维技术，也是最常用的无监督学习方法之一。

PCA 的目标：找到数据中方差最大的方向，将高维数据投影到低维空间，同时尽可能保留信息。

想象你有一堆三维空间中的点，但从某个角度看，它们基本上分布在一个平面上。PCA 就是找到这个"最佳观察角度"，把三维数据降到二维。

原始数据 (2D)          PCA 找到主成分          降维后 (1D)

    *  *                    *  *                    *
   *    *                  /    \                  *
  *      *       →        *      *        →       *
   *    *                /        \                 *
    *  *                  *  *  *  *                  *
                         ↑
                    第一主成分
                  （方差最大方向）

import numpy as np

def pca(X, n_components=2):
    """
    X: (N, D) 数据矩阵，N 个样本，D 维特征
    n_components: 降维后的维度
    """
    # 1. 中心化
    X_mean = np.mean(X, axis=0)
    X_centered = X - X_mean

    # 2. SVD（比特征值分解更数值稳定）
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

    # 3. 主成分就是 Vt 的前 n_components 行
    W = Vt[:n_components].T  # (D, k)

    # 4. 投影
    X_transformed = X_centered @ W  # (N, k)

    return X_transformed, W, X_mean

# 使用
X_pca, components, mean = pca(X, n_components=2)

Inline Question 2:¶

Now that you have trained a Neural Network classifier, you may find that your testing accuracy is much lower than the training accuracy. In what ways can we decrease this gap? Select all that apply.

Train on a larger dataset.
Add more hidden units.
Increase the regularization strength.
None of the above.

\(\color{blue}{\textit Your Answer:}\) 1, 3

\(\color{blue}{\textit Your Explanation:}\) A larger dataset provides more diverse examples, helping the model generalize better and reducing overfitting (1). Increasing regularization strength penalizes large weights, forcing the model to learn simpler patterns that generalize better to test data (3). Adding more hidden units would increase model capacity and complexity, which typically makes overfitting worse rather than better, so this would increase the gap between training and testing accuracy.

选项 1（在更大的数据集上训练）：更大的数据集提供更多样化的样本，帮助模型更好地泛化，减少过拟合，从而缩小训练/测试准确率差距。
选项 3（增加正则化强度）：L2 正则化通过惩罚大的权重来约束模型复杂度，迫使模型学习更简单、更通用的模式，提高泛化能力。
为什么不选 2：添加更多隐藏单元会增加模型容量，使模型更容易过拟合训练数据，反而会扩大训练/测试准确率差距。