Deep Learning
Created by Arwin Yu
Tutorial 01 – Neural Networks
Agenda
- Perceptron Model
- Multi-Layer Perceptron
- Forward Calculation
- Backpropagation
- Neural-Network-Based House Price Regression Model
- Weight Initialization
- Deep Double Descent
The Perceptron
- One of the first and simplest linear models.
- Based on a linear threshold unit (LTU): the inputs and output are numbers, and each connection is associated with a weight.
- The LTU computes a weighted sum of its inputs, $z = w_1x_1 + w_2x_2 + \dots + w_nx_n = w^Tx$, then applies a step function to that sum and outputs the result: $$ h_w(x) = \text{step}(z) = \text{step}(w^Tx) $$
- Illustration:
- Pseudocode:
  - Require: Learning rate $\eta$
  - Require: Initial parameter $w$
  - While stopping criterion not met do
    - For $i=1,\dots,m$:
      - $ w_{t+1} \leftarrow w_t + \eta(y_i - \text{sign}(w_t^Tx_i))x_i $
      - $t \leftarrow t + 1$
  - end while
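The pseudocode above can be sketched in NumPy as follows. This is a minimal illustration, not the tutorial's own implementation: the toy AND-style dataset, the constant bias feature in the first column, the labels in {-1, +1}, the convention sign(0) = +1, and the stopping rule "one full pass without mistakes" are all assumptions made for the example.

```python
import numpy as np

def sign(z):
    # Assumed convention: sign(0) = +1
    return np.where(z >= 0, 1.0, -1.0)

def train_perceptron(X, y, eta=0.5, max_epochs=100):
    """Perceptron rule: w <- w + eta * (y_i - sign(w^T x_i)) * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            pred = sign(w @ x_i)
            # The update is zero whenever the prediction is already correct
            w = w + eta * (y_i - pred) * x_i
            mistakes += int(pred != y_i)
        if mistakes == 0:  # stopping criterion: a full pass with no errors
            break
    return w

# Toy linearly separable data (logical AND); first column is a bias feature
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)

w = train_perceptron(X, y)
preds = sign(X @ w)
print("weights:", w)
print("predictions:", preds)
```

Because the toy data are linearly separable, the loop stops after a finite number of updates; on non-separable data it would simply run until `max_epochs`.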
Multi-Layer Perceptron (MLP)
- An MLP consists of an input layer, one or more hidden layers, and a final output layer.
- When a network has many hidden layers (a common rule of thumb is more than two), it is usually called a deep neural network (DNN); with only one or two hidden layers, it is often simply called an MLP (a common convention, not a strict definition).
What is the purpose of a layered structure?
The answer is simple: this is how the algorithm analyzes data. To understand it, consider an everyday analogy. When we see an image, can we instantly obtain all the information it contains? Not really. We need some time to think and to analyze the information conveyed by the image from multiple angles. This is similar to the multiple layers in a neural network: the model relies on these layers to extract information from the raw data from different perspectives.
From a mathematical perspective, each layer of a deep learning model contains a different number of perceptrons. This is equivalent to increasing or decreasing the dimensionality of the raw data and extracting its features in different dimensional spaces. What does a different dimensional space mean? For example, suppose we use a simple linear classifier to try to perfectly classify cats and dogs. We can start with one feature, such as “round eyes”; the classification result is shown below.
Because both cats and dogs have round eyes, we cannot obtain perfect classification at this point. Therefore, we may decide to add another feature, such as “pointed ears”; the classification result is shown below.
At this point, we can see that the data distributions for cats and dogs are gradually becoming more separated. Finally, we add a third feature, such as “long nose”, to obtain a three-dimensional feature space, as shown below.
At this stage, the model can already fit a classification decision boundary that separates cats and dogs well. This naturally raises a question: if we continue increasing the number of features and map the raw data into a higher-dimensional space, will that make classification easier?
In fact, not necessarily. Note that as the dimensionality of the problem increases, the density of training samples decreases exponentially. Suppose 10 training instances cover the full one-dimensional feature space, whose width is 5 unit intervals. In the one-dimensional case, the sample density is therefore 10/5=2 (samples/interval).
In the two-dimensional case, there are still 10 training instances, but now they cover a two-dimensional feature space with an area of 5×5=25 unit squares. Therefore, the sample density in two dimensions is 10/25=0.4 (samples/unit square).
Finally, in the three-dimensional case, the 10 samples cover a feature-space volume of 5×5×5=125 unit cubes. Therefore, the sample density in three dimensions is 10/125=0.08 (samples/unit cube).
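The arithmetic above follows one rule: with a fixed number of samples and a feature range of width 5 per dimension, the density is 10 / 5^d. A short sketch (the variable names are illustrative only):

```python
n_samples = 10   # fixed number of training instances
width = 5        # extent of the feature space along each dimension

# Density = samples per unit of d-dimensional volume
densities = [n_samples / width**d for d in (1, 2, 3)]
for d, rho in zip((1, 2, 3), densities):
    print(f"{d}D: volume = {width**d}, density = {rho}")
```

Each added dimension divides the density by another factor of 5, which is the exponential decrease described in the text.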
If we keep adding features, the dimensionality of the feature space keeps growing and the space becomes increasingly sparse. Because of this sparsity, finding a separating hyperplane becomes very easy. However, when the high-dimensional classification result is mapped back into a lower-dimensional space, the serious problem with this approach becomes apparent. The classification result for cats and dogs in a high-dimensional feature space is shown below. Note that because high-dimensional feature spaces are difficult to draw on paper, the figure maps the high-dimensional classification result into two dimensions. In this case, the learned decision boundary can very easily and perfectly separate every individual sample.
Question: Isn’t it good to perfectly separate the training data?
Not really. Training data is sampled from the real world, and no training set can contain every possible situation. For example, when building a cat-and-dog dataset, it is impossible to photograph every cat and dog in the world. A model that perfectly separates this training set can become too rigid, so its generalization ability in the real world is very poor. In everyday terms, this is like getting stuck in a narrow line of reasoning. For example, suppose we painstakingly devise one hundred features to define cattle in China. This strict definition can easily distinguish cattle from other species. But one day, a British dairy cow travels across the ocean to China. Because this foreign cow satisfies only 90 of China’s cattle-defining features, we refuse to classify it as cattle. This is clearly unreasonable. The underlying cause is that the feature space has too many dimensions for the amount of data available. This phenomenon is called the “curse of dimensionality”: when the dimensionality of a problem becomes large relative to the number of training samples, classifier performance deteriorates.
Forward calculation
- During the forward pass, for each training instance, the algorithm feeds it into the network and computes the output of every neuron in each successive layer.
- Using the network for prediction is simply performing a forward pass.
An example is shown below:
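As a rough code counterpart, the forward pass can be written as a loop over layers. The sketch below is an assumption-laden illustration: the two-layer network, its weight values, and the choice of ReLU on hidden layers with a linear output are made up for the example and are not taken from the tutorial.

```python
import numpy as np

def forward(x, layers):
    """Feed x through each (W, b) pair in turn and keep every layer's output."""
    a = x
    activations = [a]
    for i, (W, b) in enumerate(layers):
        z = W @ a + b
        # ReLU on hidden layers, identity on the last (output) layer
        a = np.maximum(z, 0.0) if i < len(layers) - 1 else z
        activations.append(a)
    return activations

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, 0.0])),
    (np.array([[1.0, 2.0]]), np.array([0.5])),
]
x = np.array([1.0, 2.0])
acts = forward(x, layers)
print("hidden:", acts[1])   # [0.  1.5]
print("output:", acts[2])   # [3.5]
```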
Backpropagation
Backpropagation is an efficient method for computing gradients: it calculates the partial derivative of the loss with respect to every parameter in the network. Backpropagation first computes the network output through a forward pass, then propagates the error backward from the output layer toward the input layer, computing the partial derivatives along the way. The core idea is to pass the error backward through the chain rule and compute each parameter’s contribution to the error.
An example is shown below:
Initialize the network and construct a neural network with only one hidden layer
(1) Initialize the network parameters:
Assume the neural network inputs and output are initialized as: $x_1=0.5, x_2=1.0, y=0.8$.
The parameters are initialized as: $w_1=1.0, w_2=0.5, w_3=0.5, w_4=0.7, w_5=1.0, w_6=2.0$.
(2) Forward calculation, as shown below
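The first hidden unit is $h_1^{(1)} = w_1 \cdot x_1 + w_2 \cdot x_2 = 1.0 \cdot 0.5 + 0.5 \cdot 1.0 = 1.0$. Similarly, $h_2^{(1)} = w_3 \cdot x_1 + w_4 \cdot x_2 = 0.5 \cdot 0.5 + 0.7 \cdot 1.0 = 0.95$. Weighting $h_1^{(1)}$ and $h_2^{(1)}$ by $w_5$ and $w_6$ and summing gives the forward-propagation result, as shown below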
$$
\begin{aligned}
y^{\prime} & =w_5 \cdot h_1^{(1)}+w_6 \cdot h_2^{(1)} \\
& =1.0 \cdot 1.0+2.0 \cdot 0.95 \\
& =2.9
\end{aligned}
$$
(3) Compute the loss: calculate the loss using the ground-truth value $y=0.8$ and the squared-error loss function
$$
\begin{aligned}
\delta & =\frac{1}{2}\left(y-y^{\prime}\right)^2 \\
& =0.5(0.8-2.9)^2 \\
& =2.205
\end{aligned}
$$
(4) Compute the gradient: this process is essentially the calculation of partial derivatives. Take the partial derivative with respect to parameter $w_5$ as an example, as shown below
According to the chain rule:
$$
\frac{\partial \delta}{\partial w_5}=\frac{\partial \delta}{\partial y^{\prime}} \cdot \frac{\partial y^{\prime}}{\partial w_5}
$$
where:
$$
\begin{aligned}
\frac{\partial \delta}{\partial y^{\prime}} & =2 \cdot \frac{1}{2} \cdot\left(y-y^{\prime}\right)(-1) \\
& =y^{\prime}-y \\
& =2.9-0.8 \\
& =2.1 \\
y^{\prime} & =w_5 \cdot h_1^{(1)}+w_6 \cdot h_2^{(1)} \\
\frac{\partial y^{\prime}}{\partial w_5} & =h_1^{(1)}+0 \\
& =1.0
\end{aligned}
$$
therefore:
$$
\frac{\partial \delta}{\partial w_5}=\frac{\partial \delta}{\partial y^{\prime}} \cdot \frac{\partial y^{\prime}}{\partial w_5}=2.1 \times 1.0=2.1
$$
Similarly, if we take parameter $w_1$ as an example, its partial derivative calculation also uses the chain rule, as shown below.
$$
\begin{gathered}
\frac{\partial \delta}{\partial w_1}=\frac{\partial \delta}{\partial y^{\prime}} \cdot \frac{\partial y^{\prime}}{\partial h_1^{(1)}} \cdot \frac{\partial h_1^{(1)}}{\partial w_1} \\
y^{\prime}=w_5 \cdot h_1^{(1)}+w_6 \cdot h_2^{(1)} \\
\frac{\partial y^{\prime}}{\partial h_1^{(1)}}=w_5+0 \\
=1.0 \\
h_1^{(1)}=w_1 \cdot x_1+w_2 \cdot x_2 \\
\frac{\partial h_1^{(1)}}{\partial w_1}=x_1+0 \\
\frac{\partial \delta}{\partial w_1}=\frac{\partial \delta}{\partial y^{\prime}} \cdot \frac{\partial y^{\prime}}{\partial h_1^{(1)}} \cdot \frac{\partial h_1^{(1)}}{\partial w_1}=2.1 \times 1.0 \times 0.5=1.05
\end{gathered}
$$
(5) Update the network parameters with gradient descent: assume the initial value of the hyperparameter “learning rate” is 0.1. According to the gradient descent update rule, the update for parameter $w_1$ is calculated as follows:
$$
w_1^{\text {(update) }}=w_1-\eta \cdot \frac{\partial \delta}{\partial w_1}=1.0-0.1 \times 1.05=0.895
$$
Similarly, the other updated parameters can be calculated as:
$$
w_1=0.895, w_2=0.29, w_3=0.29, w_4=0.28, w_5=0.79, w_6=1.8005
$$
At this point, we have completed the full parameter-iteration process. We can compute the loss to see whether it has decreased, as follows:
$$
\begin{aligned}
\delta & =\frac{1}{2}\left(y-y^{\prime}\right)^2 \\
& =0.5(0.8-1.3478)^2 \\
& =0.15
\end{aligned}
$$
Compared with the previously computed forward-propagation loss of 2.205, this result is clearly smaller.
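The whole worked example can be replayed in a few lines of plain Python. The sketch assumes the wiring implied by the numbers above ($h_1$ uses $w_1, w_2$; $h_2$ uses $w_3, w_4$; no activation function and no biases); the dictionary layout and helper name are illustrative only.

```python
x1, x2, y = 0.5, 1.0, 0.8
w = {"w1": 1.0, "w2": 0.5, "w3": 0.5, "w4": 0.7, "w5": 1.0, "w6": 2.0}
eta = 0.1  # learning rate

def forward(w):
    h1 = w["w1"] * x1 + w["w2"] * x2
    h2 = w["w3"] * x1 + w["w4"] * x2
    y_pred = w["w5"] * h1 + w["w6"] * h2
    loss = 0.5 * (y - y_pred) ** 2
    return h1, h2, y_pred, loss

h1, h2, y_pred, loss_before = forward(w)

# Chain rule: d(loss)/d(y_pred) = y_pred - y, then multiply along each path
d = y_pred - y
grads = {
    "w1": d * w["w5"] * x1, "w2": d * w["w5"] * x2,
    "w3": d * w["w6"] * x1, "w4": d * w["w6"] * x2,
    "w5": d * h1,           "w6": d * h2,
}

# One gradient-descent step on every parameter
w_new = {k: w[k] - eta * grads[k] for k in w}
_, _, y_pred_new, loss_after = forward(w_new)

print("loss before:", loss_before)
print("updated weights:", w_new)
print("new prediction:", y_pred_new, "loss after:", loss_after)
```

Running it is a quick way to check each hand-computed number, including the prediction of about 1.3478 after the update.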
Common Layers
- Linear layer (a linear combination of the inputs).
- Activation layer (usually used together with a linear layer; applies a function to the weighted linear combination of inputs): ReLU, Binary Step, Sigmoid, TanH, etc…
- Softmax layer (a sigmoid-like layer for more than 2 classes; outputs the probability of each class) for classification tasks.
- Loss-function layer (for example, MSE and cross-entropy).
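To make the softmax description concrete, here is a small NumPy sketch. The logits are arbitrary example values, and subtracting the maximum before exponentiating is a standard numerical-stability trick rather than something stated in the list above.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw scores for three classes
probs = softmax(logits)
print(probs, probs.sum())            # positive values that sum to 1
```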
Example – Regression Neural Network – House Prices
- House price dataset:
- Two input features: Size and Floor
- One output: house price
- Loss function: MSE
- Network architecture: 2 hidden layers and one output layer
Layout:

$$ F(X,W) = W_3^T \phi_2(W_2^T\phi_1(W_1^TX + b_1) + b_2) + b_3 $$
Where: $$ X \in \mathbb{R}^2 $$ $$ W_1 \in \mathbb{R}^{2 \times 4} $$ $$ W_2 \in \mathbb{R}^{4 \times 3} $$ $$ W_3 \in \mathbb{R}^{3 \times 1} $$ $$ b_1 \in \mathbb{R}^4 $$ $$ b_2 \in \mathbb{R}^3 $$ $$ b_3 \in \mathbb{R} $$
Step-by-Step Solution
- The MSE loss function and corresponding training objective over all training examples $x_i$: $$ Error = \frac{1}{N} \sum_{i=1}^N (F(x_i, W) - y_i)^2 = \frac{1}{N} ||F(X, W) - Y||_2^2 $$
- Linear layer: $$ u_{out} = W^Tu_{in} + b $$
- Activation layer:
  - $\phi_1$ and $\phi_2$ are nonlinear functions applied element-wise to a vector, so: $$ \phi(U) = \phi\left(\begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}\right) = \begin{bmatrix} \phi(u_1) \\ \vdots \\ \phi(u_n) \end{bmatrix} $$
  - For ReLU: $$ \begin{bmatrix} \phi(u_1) \\ \vdots \\ \phi(u_n) \end{bmatrix} = \begin{bmatrix} \max(0, u_1) \\ \vdots \\ \max(0, u_n) \end{bmatrix} $$
Forward Pass
$$ F(X,W) = W_3^T \phi_2(W_2^T\phi_1(W_1^TX + b_1) + b_2) + b_3 $$
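The formula can be transcribed almost literally into NumPy. In the sketch below the weights are random placeholders, the biases are set to zero, and the six-sample batch is synthetic; only the shapes ($W_1$: 2×4, $W_2$: 4×3, $W_3$: 3×1) and the use of ReLU come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes follow the text: W1 is 2x4, W2 is 4x3, W3 is 3x1
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(3, 1)), np.zeros(1)

def relu(u):
    return np.maximum(0.0, u)

def F(x):
    """Forward pass for a single input vector x of shape (2,)."""
    u1 = relu(W1.T @ x + b1)       # first hidden layer, shape (4,)
    u2 = relu(W2.T @ u1 + b2)      # second hidden layer, shape (3,)
    return (W3.T @ u2 + b3)[0]     # scalar prediction

# Synthetic batch, used only to exercise the MSE formula
X = rng.normal(size=(6, 2))
Y = rng.normal(size=6)
preds = np.array([F(x) for x in X])
mse = np.mean((preds - Y) ** 2)
print("predictions:", preds)
print("MSE:", mse)
```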
Backward Pass
The following illustration depicts the backpropagation process:
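In code, the backward pass is usually delegated to automatic differentiation. The sketch below compares a hand-derived gradient for the output layer with the one PyTorch's autograd computes; the random data, the seed, and the layer variable names are assumptions for illustration, while the 2-4-3-1 architecture and MSE loss follow the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1, fc2, out = nn.Linear(2, 4), nn.Linear(4, 3), nn.Linear(3, 1)
X = torch.randn(16, 2)   # synthetic inputs
y = torch.randn(16)      # synthetic targets

# Forward pass
a1 = torch.relu(fc1(X))
a2 = torch.relu(fc2(a1))
pred = out(a2).view(-1)
loss = ((pred - y) ** 2).mean()

# Backward pass via autograd
loss.backward()

# Manual chain rule for the output layer only:
# dLoss/dpred_i = 2 (pred_i - y_i) / N, and pred_i = W3 a2_i + b3
with torch.no_grad():
    dloss_dpred = 2.0 * (pred - y) / len(y)          # shape (16,)
    manual_grad_W3 = dloss_dpred.unsqueeze(0) @ a2    # shape (1, 3)
    manual_grad_b3 = dloss_dpred.sum()

match_W = torch.allclose(manual_grad_W3, out.weight.grad, atol=1e-6)
match_b = torch.allclose(manual_grad_b3, out.bias.grad.squeeze(), atol=1e-6)
print("weight gradient matches:", match_W)
print("bias gradient matches:", match_b)
```

The gradients of the earlier layers follow the same pattern, with one more chain-rule factor per layer.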
Building a Neural Network with PyTorch
Now we will use PyTorch to implement a neural network for regression. We will use the “California housing” dataset (loaded with `fetch_california_housing`) and the architecture described above.
import torch
import torch.nn as nn

# define our neural network model
# this approach provides easier access to the layers (e.g., `model.fc_1` returns the first layer)
class HousePricesMLP(nn.Module):
    # notice that we inherit from nn.Module
    def __init__(self, input_dim, output_dim):
        super(HousePricesMLP, self).__init__()
        # here we initialize the building blocks of our network
        # each building block is a linear (fully-connected) layer
        self.fc_1 = nn.Linear(input_dim, 4)
        self.fc_2 = nn.Linear(4, 3)
        self.output_layer = nn.Linear(3, output_dim)

    def forward(self, x):
        # here we define what happens to the input x in the forward pass
        # that is, the order in which x goes through the building blocks
        x = torch.relu(self.fc_1(x))
        x = torch.relu(self.fc_2(x))
        return self.output_layer(x)

# alternative method - more readable, easier to code, less convenient access to weights
# e.g., to access the first layer -- `model.hidden[0]`
class HousePricesMLP(nn.Module):
    # notice that we inherit from nn.Module
    def __init__(self, input_dim, output_dim):
        super(HousePricesMLP, self).__init__()
        # here we initialize the building blocks of our network
        # the hidden layers are grouped in one nn.Sequential container
        self.hidden = nn.Sequential(nn.Linear(input_dim, 4),
                                    nn.ReLU(),
                                    nn.Linear(4, 3),
                                    nn.ReLU())
        self.output_layer = nn.Linear(3, output_dim)

    def forward(self, x):
        # here we define what happens to the input x in the forward pass
        # that is, the order in which x goes through the building blocks
        return self.output_layer(self.hidden(x))

# NOTE: in this example we are using a very simple NN model
# We usually use wider and deeper networks such as this one:
class HousePricesMLP(nn.Module):
    # notice that we inherit from nn.Module
    def __init__(self, input_dim, output_dim, hidden_dim=256):
        super(HousePricesMLP, self).__init__()
        # here we initialize the building blocks of our network
        # three hidden layers of width hidden_dim, each followed by ReLU
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                                    nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim),
                                    nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim),
                                    nn.ReLU(),)
        self.output_layer = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # here we define what happens to the input x in the forward pass
        # that is, the order in which x goes through the building blocks
        return self.output_layer(self.hidden(x))
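As a quick sanity check on the small 2-4-3-1 architecture, the sketch below rebuilds it as a standalone `nn.Sequential` (so the snippet runs on its own) and counts the trainable parameters; the all-zero batch is a placeholder input used only to inspect the output shape.

```python
import torch
import torch.nn as nn

# Same layer sizes as the small model: 2 -> 4 -> 3 -> 1
model = nn.Sequential(
    nn.Linear(2, 4), nn.ReLU(),
    nn.Linear(4, 3), nn.ReLU(),
    nn.Linear(3, 1),
)

# (2*4 + 4) + (4*3 + 3) + (3*1 + 1) = 31 weights and biases
n_params = sum(p.numel() for p in model.parameters())

batch = torch.zeros(5, 2)              # placeholder batch of 5 samples
out_shape = tuple(model(batch).shape)  # one prediction per sample
print(n_params, out_shape)             # 31 (5, 1)
```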
from sklearn.datasets import fetch_california_housing
import pandas as pd
# Load data
california_housing = fetch_california_housing()
# Convert to DataFrame
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
data['target'] = california_housing.target
# Print description of the features
print(california_housing.DESCR)
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the
:func:sklearn.datasets.fetch_california_housing function.
.. topic:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297
# Convert to DataFrame
# NOTE: despite the variable name `boston`, this DataFrame holds the California housing data
boston = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
boston['MEDV'] = california_housing.target
# Sample 10 rows
sampled_data = boston.sample(10)
print(sampled_data)
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
11143 3.1884 25.0 5.188630 1.073643 2166.0 2.798450 33.84
9961 5.8150 34.0 7.670412 1.183521 780.0 2.921348 38.33
12213 6.9930 13.0 6.428571 1.000000 120.0 2.857143 33.51
4354 8.9440 30.0 7.170455 1.087500 1776.0 2.018182 34.10
4629 2.2708 18.0 2.571135 1.108755 3296.0 2.254446 34.07
11026 5.8622 30.0 6.456164 1.038356 2271.0 3.110959 33.80
20185 5.9181 24.0 5.700000 1.034375 1049.0 3.278125 34.27
17427 2.3333 32.0 5.816976 1.140584 1074.0 2.848806 34.65
4080 3.1373 23.0 3.752241 1.074980 2391.0 1.948655 34.15
13890 2.2612 12.0 5.235714 1.024405 11139.0 6.630357 34.45
Longitude MEDV
11143 -117.94 1.35400
9961 -122.26 3.39200
12213 -117.18 5.00001
4354 -118.39 5.00001
4629 -118.30 1.75000
11026 -117.83 2.21000
20185 -119.16 2.21100
17427 -120.47 1.30200
4080 -118.37 2.63100
13890 -116.14 1.37500
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Use 2 features
x = boston[['AveRooms', 'AveOccup']].values # AveRooms - average number of rooms, AveOccup - average number of household members
y = boston['MEDV'].values
# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=5)
# Scaling
x_scaler = StandardScaler()
x_scaler.fit(x_train)
x_train = x_scaler.transform(x_train)
x_test = x_scaler.transform(x_test)
print("total training samples: {}, total test samples: {}".format(len(x_train), len(x_test)))
total training samples: 16512, total test samples: 4128
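The scaler above is fitted on the training split only and then reused for the test split, so no test-set statistics leak into training. The sketch below shows what that does numerically on synthetic data (the normal distribution with mean 5 and scale 2 is an arbitrary stand-in for the real features).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 2))  # synthetic stand-in
x_test = rng.normal(loc=5.0, scale=2.0, size=(200, 2))

scaler = StandardScaler().fit(x_train)   # statistics come from train only
x_train_s = scaler.transform(x_train)
x_test_s = scaler.transform(x_test)

print("train mean:", x_train_s.mean(axis=0))  # ~0 by construction
print("train std: ", x_train_s.std(axis=0))   # ~1 by construction
print("test mean: ", x_test_s.mean(axis=0))   # close to 0, but not exactly
```

The test split is standardized with the training mean and standard deviation, so its own mean and standard deviation are only approximately 0 and 1.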
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn
# Convert to tensor dataset for PyTorch
boston_tensor_train_ds = TensorDataset(torch.tensor(x_train, dtype=torch.float), torch.tensor(y_train, dtype=torch.float))
boston_tensor_test_ds = TensorDataset(torch.tensor(x_test, dtype=torch.float), torch.tensor(y_test, dtype=torch.float))
# Check
print(f'sample 0: features: {boston_tensor_train_ds[0][0]}, target: {boston_tensor_train_ds[0][1]}')
# Define hyper-parameters and create our model
num_features = 2
output_dim = 1
batch_size = 128
learning_rate = 0.01
num_epochs = 200
# Device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Loss criterion
criterion = nn.MSELoss()
# Model
model = HousePricesMLP(num_features, output_dim).to(device)
# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# DataLoader
train_loader = DataLoader(boston_tensor_train_ds, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(boston_tensor_test_ds, batch_size=batch_size, shuffle=False)
sample 0: features: tensor([-0.7866, -0.0164]), target: 2.8459999561309814
# Training loop
for epoch in range(num_epochs):
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        # Zero the parameter gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(inputs).view(-1)
        loss = criterion(outputs, targets)
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
    # Print the loss of the last mini-batch every 10 epochs
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluation
model.eval()
with torch.no_grad():
    test_loss = 0
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs).view(-1)
        loss = criterion(outputs, targets)
        test_loss += loss.item()
    # Average of the per-batch mean losses
    test_loss /= len(test_loader)
print(f'Test Loss: {test_loss:.4f}')
Epoch [10/200], Loss: 1.0621
Epoch [20/200], Loss: 1.0323
Epoch [30/200], Loss: 0.8559
Epoch [40/200], Loss: 1.3087
Epoch [50/200], Loss: 1.1804
Epoch [60/200], Loss: 1.0741
Epoch [70/200], Loss: 1.0675
Epoch [80/200], Loss: 0.9341
Epoch [90/200], Loss: 0.6055
Epoch [100/200], Loss: 1.0619
Epoch [110/200], Loss: 1.0063
Epoch [120/200], Loss: 0.9453
Epoch [130/200], Loss: 0.9202
Epoch [140/200], Loss: 0.9076
Epoch [150/200], Loss: 0.8739
Epoch [160/200], Loss: 1.0152
Epoch [170/200], Loss: 0.7826
Epoch [180/200], Loss: 0.8854
Epoch [190/200], Loss: 1.0572
Epoch [200/200], Loss: 0.9545
Test Loss: 1.0264
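One detail of the evaluation above: averaging per-batch mean losses gives every batch equal weight, so when the last batch is smaller the result differs slightly from the MSE over all test samples. The sketch below shows a sample-weighted alternative on synthetic numbers; the tensors stand in for model outputs and targets and are not the tutorial's data.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)
preds = torch.randn(10)     # stand-in for model outputs
targets = torch.randn(10)   # stand-in for ground truth
loader = DataLoader(TensorDataset(preds, targets), batch_size=4, shuffle=False)
criterion = torch.nn.MSELoss()

sum_sq, n, batch_means = 0.0, 0, []
for p, t in loader:                       # batches of size 4, 4, 2
    batch_loss = criterion(p, t).item()
    batch_means.append(batch_loss)
    sum_sq += batch_loss * len(t)         # weight each batch by its size
    n += len(t)

weighted = sum_sq / n                             # exact MSE over all samples
mean_of_means = sum(batch_means) / len(batch_means)
full = criterion(preds, targets).item()
print(f"weighted: {weighted:.6f}  full: {full:.6f}  mean of means: {mean_of_means:.6f}")
```

With thousands of test samples and a batch size of 128 the difference is small, but the weighted form is the one that matches the MSE definition.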
Weight Initialization
- As we have learned, neural networks are trained using stochastic optimization algorithms such as gradient descent, RMSprop, Adam, and so on…
- Recall that these algorithms need a starting point: the parameters must be initialized to some values before training begins. Randomness in this starting point helps the search find a sufficiently good set of weights for the particular mapping function from inputs to outputs being learned from the data.
- These algorithms require the network weights to be initialized to small random values (random, but close to zero).
- Randomness is also used in the search process by shuffling the training dataset before each epoch, which in turn leads to different gradient estimates for each batch.
- Training deep models is a fairly difficult task, and most algorithms are strongly affected by the choice of initialization (p. 301, Deep Learning, 2016).
Why Not Simply Initialize with Zeros?
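A commonly cited answer is symmetry: if every weight in a layer starts at the same value, all hidden units compute the same output and receive the same gradient, so gradient descent cannot make them different. The sketch below is an illustration under assumed settings (a 2-4-1 network with Tanh, random data, and a constant value of 0.5), not code from the tutorial. It also shows that with exactly zero weights the first layer receives no gradient at all in this architecture.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 2)   # synthetic inputs
y = torch.randn(8)      # synthetic targets

def first_layer_grad(init_value):
    """Set every weight and bias to init_value, return dLoss/dW of layer 1."""
    net = nn.Sequential(nn.Linear(2, 4), nn.Tanh(), nn.Linear(4, 1))
    for p in net.parameters():
        nn.init.constant_(p, init_value)
    loss = nn.functional.mse_loss(net(x).view(-1), y)
    loss.backward()
    return net[0].weight.grad   # shape (4, 2): one row per hidden unit

# Constant init: every hidden unit gets exactly the same gradient row
grad_const = first_layer_grad(0.5)
rows_identical = torch.allclose(grad_const, grad_const[0].expand_as(grad_const))

# All-zero init: the output weights are zero, so no error flows back
grad_zero = first_layer_grad(0.0)
all_zero = bool((grad_zero == 0).all())

print("identical rows with constant init:", rows_identical)
print("zero first-layer gradient with zero init:", all_zero)
```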
Recent Trend: Non-Random Initializations
- Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural Network Initialization? In this paper, the authors construct a deep network with identical features by initializing almost all weights to 0. The architecture not only achieves perfect signal propagation and stable gradients, but also obtains high accuracy on standard benchmarks, suggesting that randomly diverse initialization is not necessary for training neural networks.
- ZerO Initialization: Initializing Neural Networks with only Zeros and Ones. In this paper, random weight initialization is replaced by a fully deterministic initialization scheme that uses zeros and ones (after normalization) to initialize network weights, based on identity and Hadamard transforms. The authors show encouraging results on various benchmarks, paving the way for simple initialization schemes that work as well as random initialization.
These studies suggest that neural networks do not necessarily need randomly initialized weights to train well.
Types of Weight Initialization
- Initializing neural-network weights is an active research area, because careful initialization can speed up the learning process.
- There is no single best method for initializing neural-network weights.
- We will review several popular initialization methods.
- Uniform – Initialize using values drawn from the uniform distribution $\mathcal{U}(a, b)$.
  - In PyTorch – torch.nn.init.uniform_(tensor, a=0.0, b=1.0)
- Normal – Initialize using values drawn from the normal distribution $\mathcal{N}(\text{mean}, \text{std}^2)$.
  - In PyTorch – torch.nn.init.normal_(tensor, mean=0.0, std=1.0)
- Constant – Initialize using the value $val$.
  - In PyTorch – torch.nn.init.constant_(tensor, val)
- Ones – Initialize with the scalar value 1.
  - In PyTorch – torch.nn.init.ones_(tensor)
- Zeros – Initialize with the scalar value 0.
  - In PyTorch – torch.nn.init.zeros_(tensor)
- Xavier Uniform – according to the method described in Understanding the difficulty of training deep feedforward neural networks – Glorot, X. & Bengio, Y. (2010), initialize values using a uniform distribution. The resulting tensor will contain values sampled from $\mathcal{U}(-a, a)$ where $$ a = \text{gain} \times \sqrt{\frac{6}{\text{fan}_{in} + \text{fan}_{out}}} $$
  - fan_in is the number of input units in the weight tensor, fan_out is the number of output units in the weight tensor, and the main role of gain is to adjust the scale of the initialized weights so that signals do not suffer from vanishing or exploding gradients as they propagate through the network.
  - In PyTorch – torch.nn.init.xavier_uniform_(tensor, gain=1.0)
- Xavier Normal – according to the method described in Understanding the difficulty of training deep feedforward neural networks – Glorot, X. & Bengio, Y. (2010), initialize values using a normal distribution. The resulting tensor will contain values sampled from $\mathcal{N}(0,\text{std}^2)$ where $$ \text{std} = \text{gain} \times \sqrt{\frac{2}{\text{fan}_{in} + \text{fan}_{out}}} $$
  - fan_in is the number of input units in the weight tensor, and fan_out is the number of output units in the weight tensor.
  - In PyTorch – torch.nn.init.xavier_normal_(tensor, gain=1.0)
- Kaiming (He) Uniform – according to the method described in Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification – He, K. et al. (2015), initialize values using a uniform distribution. The resulting tensor will contain values sampled from $\mathcal{U}(-\text{bound}, \text{bound})$ where $$ \text{bound} = \text{gain} \times \sqrt{\frac{3}{\text{fan\_mode}}} $$
  - In PyTorch – torch.nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
  - a – the negative slope used by leaky_relu (only used with leaky_relu).
  - gain: scaling factor, $\sqrt{2}$ for ReLU and $\sqrt{2/(1+a^2)}$ for Leaky ReLU with negative slope $a$ (approximately $\sqrt{2}$ for small $a$).
  - fan_mode: can be fan_in or fan_out.
  - fan_in: represents the number of input units in the weight tensor (the number of neurons in the previous layer). This means that during initialization, the number of inputs to each neuron in the forward pass is considered to ensure that signals do not become too large or too small during propagation.
  - fan_out: represents the number of output units in the weight tensor (the number of neurons in the next layer). This means that during initialization, the number of outputs of each neuron in backpropagation is considered to ensure that gradients do not become too large or too small during propagation.
  - During initialization, both fan_in and fan_out are computed from the shape (dimensions) of the weight tensor, which is already determined when the network architecture is defined. For example, for a fully connected layer, the shape of its weight tensor is $[ \text{fan\_out}, \text{fan\_in} ]$.
- Kaiming (He) Normal – according to the method described in Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification – He, K. et al. (2015), initialize values using a normal distribution. The resulting tensor will contain values sampled from $\mathcal{N}(0,\text{std}^2)$, where $$ \text{std} = \frac{\text{gain}}{\sqrt{\text{fan\_mode}}} $$
  - In PyTorch – torch.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
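The formulas above can be checked empirically. The sketch below fills a weight tensor with Xavier-uniform and Kaiming-normal values and compares the result with the stated bound and standard deviation; the tensor size (200×300) and the seed are arbitrary choices, and `nonlinearity='relu'` is used so that the gain is $\sqrt{2}$.

```python
import math
import torch

torch.manual_seed(0)
fan_in, fan_out = 300, 200
w = torch.empty(fan_out, fan_in)   # weight shape is [fan_out, fan_in]

# Xavier uniform: values lie in U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out))
torch.nn.init.xavier_uniform_(w, gain=1.0)
a = math.sqrt(6.0 / (fan_in + fan_out))
within_bound = bool(w.abs().max() <= a + 1e-6)

# Kaiming normal (fan_in mode, ReLU): std = sqrt(2) / sqrt(fan_in)
torch.nn.init.kaiming_normal_(w, a=0, mode='fan_in', nonlinearity='relu')
expected_std = math.sqrt(2.0) / math.sqrt(fan_in)
empirical_std = w.std().item()

print("Xavier bound respected:", within_bound)
print(f"Kaiming std: expected {expected_std:.4f}, empirical {empirical_std:.4f}")
```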
PyTorch has default initialization schemes that usually work well. For example, kaiming_uniform is the default initialization for Linear layers in PyTorch.
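One way to see the default in action is to inspect a freshly created layer. To the best of my knowledge, current PyTorch versions call `kaiming_uniform_` with `a=sqrt(5)` inside `nn.Linear`, which works out to weights drawn from $\mathcal{U}(-1/\sqrt{\text{fan\_in}}, 1/\sqrt{\text{fan\_in}})$; treat that detail as an assumption to verify against the source of your installed version. The layer size below is arbitrary.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(256, 128)   # freshly constructed, default initialization

# With a = sqrt(5): gain = sqrt(2 / (1 + 5)), bound = gain * sqrt(3 / fan_in) = 1 / sqrt(fan_in)
bound = 1.0 / math.sqrt(layer.in_features)
max_abs_weight = layer.weight.abs().max().item()
within_default_bound = max_abs_weight <= bound + 1e-6

print(f"bound = {bound:.4f}, largest |weight| = {max_abs_weight:.4f}")
print("within default bound:", within_default_bound)
```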
Interactive Demo
Different Initializations Demo
Initializing Neural Network Weights with PyTorch
- Since PyTorch 1.0, most layers are initialized by default using the Kaiming Uniform method.
- Let’s see how to change the model initialization.
- Official PyTorch initialization documentation.
# define hyper-parameters and create our model
num_features = 2
output_dim = 1
batch_size = 128
learning_rate = 0.01
num_epochs = 500
# device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# loss criterion
criterion = nn.MSELoss()
# model
model = HousePricesMLP(num_features, output_dim).to(device)
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# use a different initialization for the model
def weights_init(m):
    # apply Xavier-normal initialization to every linear layer
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_normal_(m.weight, gain=1.0)

# `apply` calls weights_init on every sub-module and returns the model
model.apply(weights_init)
HousePricesMLP(
(hidden): Sequential(
(0): Linear(in_features=2, out_features=256, bias=True)
(1): ReLU()
(2): Linear(in_features=256, out_features=256, bias=True)
(3): ReLU()
(4): Linear(in_features=256, out_features=256, bias=True)
(5): ReLU()
)
(output_layer): Linear(in_features=256, out_features=1, bias=True)
)
# another way to do that
class HousePricesMLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(HousePricesMLP, self).__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 4),
                                    nn.ReLU(),
                                    nn.Linear(4, 3),
                                    nn.ReLU())
        self.output_layer = nn.Linear(3, output_dim)
        # NEW: init weights here
        self.init_weights()

    def forward(self, x):
        return self.output_layer(self.hidden(x))

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                torch.nn.init.xavier_normal_(m.weight, gain=1.0)
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)
import numpy as np
boston_tensor_train_dataloader = DataLoader(boston_tensor_train_ds, batch_size=batch_size, shuffle=True)
# training loop for the model
for epoch in range(num_epochs):
    epoch_losses = []
    for features, targets in boston_tensor_train_dataloader:
        # send data to device
        features = features.to(device)
        targets = targets.to(device)
        # forward pass
        output = model(features)
        # loss
        loss = criterion(output.view(-1), targets)
        # backward pass
        optimizer.zero_grad()  # clean the gradients from previous iteration
        loss.backward()  # autograd backward to calculate gradients
        optimizer.step()  # apply update to the weights
        epoch_losses.append(loss.item())
    if epoch % 50 == 0:
        print(f'epoch: {epoch} loss: {np.mean(epoch_losses)}')

# test error
model.eval()
with torch.no_grad():
    test_outputs = model(torch.tensor(x_test, dtype=torch.float, device=device))
    test_error = criterion(test_outputs.view(-1), torch.tensor(y_test, dtype=torch.float, device=device))
print(f'test MSE error: {test_error.item()}')
epoch: 0 loss: 1.3302481183710024
epoch: 50 loss: 0.9436245888702629
epoch: 100 loss: 0.9447847960531249
epoch: 150 loss: 0.9419918522354245
epoch: 200 loss: 0.9397756309472314
epoch: 250 loss: 0.9362258513768514
epoch: 300 loss: 0.9383858072665311
epoch: 350 loss: 0.9368446367655614
epoch: 400 loss: 0.936691609926002
epoch: 450 loss: 0.9354286706724833
test MSE error: 0.9552490711212158
Deep Double Descent
- Double descent in machine-learning training: as model size, data size, or training time increases, performance first improves, then gets worse, and then improves again.
- This effect can often be avoided through careful regularization or early stopping.
- Although this behavior appears to be fairly common, we still do not fully understand why it happens.
Deep double descent challenges the traditional view of the bias-variance tradeoff, which holds that increasing model complexity usually leads to overfitting and higher test error.
In modern deep learning models, especially those with large-scale datasets and architectures, this phenomenon highlights the nontrivial behavior of test error:
- Initial descent: as model complexity increases, the model fits the data better, reducing bias.
- Intermediate ascent: further increases in complexity lead to overfitting; the model begins to capture noise in the training data, increasing variance and test error.
- Second descent: after a certain complexity threshold is exceeded, the model becomes very powerful and can generalize better by effectively using large datasets and regularization techniques, thereby reducing test error again.
This insight has practical implications for model training and architecture design. It suggests that in some cases, increasing model complexity and data can ultimately lead to better generalization performance, contrary to traditional expectations. Regularization techniques and careful monitoring during training are essential for guiding this behavior effectively.
Viewing Double Descent from the Model Perspective
- There are cases where bigger models are worse.
- Model-wise double descent can lead to worse results when training with more data.
- Classical Regime: this is the regime described by the traditional bias-variance tradeoff theory. As model complexity increases, error first decreases and then rises again due to overfitting.
- Critical Regime: in this regime, model error exhibits relatively sharp fluctuations. The peak in test error appears near the interpolation threshold, where the model is just large enough to fit the training set.
- Modern Regime: in this region, as model size increases further, error decreases again and the model improves.
- Reality (blue curve): this shows the behavior observed in practice: test error first decreases, then rises in the critical regime, and finally decreases again in the modern regime.
- Train error (green curve): as model complexity increases, training error continues to decrease, indicating that the model’s ability to fit the training set keeps improving.
Summary:
Critical regime: the double-descent phenomenon mainly appears in the critical regime, where model error fluctuates significantly.
Modern regime: with sufficiently large models, error decreases again, showing that more complex models can perform better when handling large-scale data.
Relationship between data and model complexity: double descent reminds us that when designing and training models, we cannot simply rely on increasing data and model complexity to improve performance. More factors and strategies must be considered, such as regularization and appropriate early stopping.
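Model-wise double descent is often illustrated with a much simpler model than a deep network. The sketch below is such a toy experiment, not the setup behind the figures discussed above: it fits minimum-norm least squares on random ReLU features and varies the number of features p around the number of training samples (40). The data-generating function, the noise level, the seed, and the list of widths are all assumptions. In experiments of this kind the test error typically peaks near p ≈ 40 (the interpolation threshold) and falls again for larger p, but the exact numbers depend on the seed, so no particular output is claimed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 5

def make_data(n):
    # Synthetic regression task: smooth target plus a little noise
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

def random_relu_features(X, V):
    # Fixed random projection followed by ReLU; only the readout is trained
    return np.maximum(X @ V, 0.0)

results = []
for p in [5, 10, 20, 30, 40, 50, 80, 160, 640]:
    V = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi_train = random_relu_features(X_train, V)
    Phi_test = random_relu_features(X_test, V)
    # lstsq returns the minimum-norm solution when p > n_train
    coef, *_ = np.linalg.lstsq(Phi_train, y_train, rcond=None)
    train_mse = np.mean((Phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((Phi_test @ coef - y_test) ** 2)
    results.append((p, train_mse, test_mse))
    print(f"p={p:4d}  train MSE={train_mse:.2e}  test MSE={test_mse:.3f}")
```

Plotting the test MSE against p (for example on a log scale) is the easiest way to look for the classical, critical, and modern regimes in the output.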
Sample-wise Non-monotonicity
- Sample-wise non-monotonicity: this phenomenon means that increasing the number of training samples can sometimes harm model performance, contrary to the usual expectation that more data improves model accuracy.
Training Epochs and Model Size
Credits
- Icons made by Becris from www.flaticon.com
- Icons from Icons8.com – https://icons8.com
- Datasets from Kaggle – https://www.kaggle.com/
- Jason Brownlee – Why Initialize a Neural Network with Random Weights?
- OpenAI – Deep Double Descent
- Tal Daniel