Neural Networks (NN)

20240729221434927 Deep Learning


Created by Arwin Yu

Tutorial 01 – Neural Networks

Agenda


  • Perceptron Model
  • Multi-Layer Perceptron
  • Forward Calculation
  • Backpropagation
  • Neural-Network-Based House Price Regression Model
  • Weight Initialization
  • Deep Double Descent

The Perceptron


  • One of the first and simplest linear models.

  • Based on a linear threshold unit (LTU): the inputs and output are numbers, and each connection is associated with a weight.

  • The LTU computes a weighted sum of its inputs: $z = w_1x_1 + w_2x_2 + \dots + w_nx_n = w^Tx$, then applies a step function to that sum and outputs the result: $$ h_w(x) = \text{step}(z) = \text{step}(w^Tx) $$

  • Illustration:

    20240729221535924

  • Pseudocode:

    • Require: Learning rate $\eta$
    • Require: Initial parameter $w$
    • While stopping criterion not met do
      • For $i=1,\dots,m$:
        • $ w_{t+1} \leftarrow w_t +\eta(y_i - \text{sign}(w_t^Tx_i))x_i $
        • $t \leftarrow t + 1$
    • end while
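
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: it assumes labels in {-1, +1}, treats a weighted sum of exactly zero as +1, uses "no mistakes in a full pass, or a maximum number of epochs" as the stopping criterion, and trains on a hypothetical toy dataset (logical AND, with a constant 1 appended to each input as a bias term).

```python
import numpy as np

def sign(z):
    # Map the weighted sum to a label in {-1, +1}; z == 0 is treated as +1 (assumption).
    return np.where(z >= 0, 1.0, -1.0)

def train_perceptron(X, y, eta=0.5, max_epochs=100):
    """Perceptron rule: w <- w + eta * (y_i - sign(w^T x_i)) * x_i."""
    w = np.zeros(X.shape[1])              # initial parameter w
    for _ in range(max_epochs):           # stopping criterion (assumed): epoch cap or no mistakes
        mistakes = 0
        for x_i, y_i in zip(X, y):
            error = y_i - sign(w @ x_i)   # 0 if correct, +2 or -2 if wrong
            if error != 0:
                w = w + eta * error * x_i
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Hypothetical toy data: logical AND; the last column is a constant 1 acting as the bias input.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)

w = train_perceptron(X, y)
predictions = sign(X @ w)
print("weights:", w)
print("predictions:", predictions)
```

Because this toy problem is linearly separable, the loop stops as soon as one full pass produces no updates.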

Multi-Layer Perceptron (MLP)


  • An MLP consists of an input layer, one or more hidden layers, and a final output layer.

  • When a network has more than two hidden layers, it is usually called a deep neural network (DNN); with two or fewer hidden layers, it is often simply called an MLP (a common convention, not a strict definition).

20240729221633250

What is the purpose of a layered structure?

The answer is simple: this is how the algorithm analyzes data. To understand it, consider an everyday analogy. When we see an image, can we instantly obtain all the information it contains? Not really. We need some time to think and to analyze the information conveyed by the image from multiple angles. This is similar to the multiple layers in a neural network: the model relies on these layers to extract information from the raw data from different perspectives.

From a mathematical perspective, each layer of a deep learning model contains a different number of perceptrons. This is equivalent to increasing or decreasing the dimensionality of the raw data and extracting its features in different dimensional spaces. What does a different dimensional space mean? For example, suppose we use a simple linear classifier to try to perfectly classify cats and dogs. We can start with one feature, such as “round eyes”; the classification result is shown below.

20240729221722727

Because both cats and dogs have round eyes, we cannot obtain perfect classification at this point. Therefore, we may decide to add another feature, such as “pointed ears”; the classification result is shown below.

20240729221812199

At this point, we can see that the data distributions for cats and dogs are gradually becoming more separated. Finally, we add a third feature, such as “long nose”, to obtain a three-dimensional feature space, as shown below.

20240729221851138

At this stage, the model can already fit a classification decision boundary that separates cats and dogs well. This naturally raises a question: if we continue increasing the number of features and map the raw data into a higher-dimensional space, will that make classification easier?

In fact, not necessarily. Note that as the dimensionality of the problem increases, the density of training samples decreases exponentially. Suppose 10 training instances cover the full one-dimensional feature space, whose width is 5 unit intervals. In the one-dimensional case, the sample density is therefore 10/5=2 (samples/interval).

In the two-dimensional case, there are still 10 training instances, but now they cover a two-dimensional feature space with an area of 5×5=25 unit squares. Therefore, the sample density in two dimensions is 10/25=0.4 (samples/unit square).

Finally, in the three-dimensional case, the 10 samples cover a feature-space volume of 5×5×5=125 unit cubes. Therefore, the sample density in three dimensions is 10/125=0.08 (samples/unit cube).
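
The arithmetic above is easy to reproduce. The short sketch below recomputes the three densities from the same numbers (10 samples, 5 unit intervals per axis); the variable names are only illustrative.

```python
n_samples = 10   # number of training instances (from the example above)
width = 5        # unit intervals along each feature axis

densities = []
for dim in (1, 2, 3):
    volume = width ** dim            # 5, 25, 125 unit cells
    density = n_samples / volume     # samples per unit cell
    densities.append(density)
    print(f"{dim}D: {volume} unit cells, density = {density}")
```

Each added dimension multiplies the volume by 5, so the density shrinks by the same factor every time.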

If we keep adding features, the dimensionality of the feature space keeps growing and the space becomes increasingly sparse. Because of this sparsity, finding a separating hyperplane becomes very easy. However, when the high-dimensional classification result is projected back into a lower-dimensional space, a serious problem with this approach becomes apparent. The figure below shows the classification result for cats and dogs in a high-dimensional feature space; because high-dimensional spaces are difficult to draw on paper, the result is projected into two dimensions. In this case, the learned decision boundary separates every individual training sample perfectly.

20240729221930485

Question: Isn’t it good to perfectly separate the training data?

Not really. Training data is sampled from the real world, and no training set can contain every possible situation. For example, when building a cat-and-dog dataset, it is impossible to photograph every cat and dog in the world. A model that perfectly separates this particular training set has usually memorized its quirks (overfitting), so it generalizes poorly to real-world data. In everyday terms, this is like getting stuck in a narrow line of reasoning. Suppose we painstakingly devise one hundred features to define cattle in China. This strict definition easily distinguishes those cattle from other species. But one day, a British dairy cow travels across the ocean to China. Because this foreign cow satisfies only 90 of the 100 cattle-defining features, we refuse to classify it as cattle. This is clearly unreasonable, and the cause is that the feature space has too many dimensions. This phenomenon is called the “curse of dimensionality”: with a fixed amount of training data, classifier performance deteriorates once the dimensionality of the problem becomes too large.

Forward Calculation


  • During the forward pass, for each training instance, the algorithm feeds it into the network and computes the output of every neuron in each successive layer.

  • Using the network for prediction is simply performing a forward pass.

An example is shown below:

20240729222116112


Backpropagation


Backpropagation is an efficient method for computing gradients: it calculates the partial derivative of the loss with respect to every weight in the network. It first computes the network output through a forward pass, then propagates the error backward from the output layer toward the input layer, and uses that error to compute the partial derivatives layer by layer. The core idea is to apply the chain rule, so that each parameter’s contribution to the error is obtained from quantities already computed for the layers after it.

An example is shown below:

Construct a neural network with a single hidden layer:

20240729222340577

(1) Initialize the network parameters:

Assume the neural network inputs and output are initialized as: $x_1=0.5, x_2=1.0, y=0.8$.

The parameters are initialized as: $w_1=1.0, w_2=0.5, w_3=0.5, w_4=0.7, w_5=1.0, w_6=2.0$.

(2) Forward calculation, as shown below

20240729222426797

Similarly, $h_2$ is calculated to be 0.95. Weight $h_1$ and $h_2$ by $w_5$ and $w_6$ and sum them to obtain the forward-propagation result, as shown below

20240729222505497

$$
\begin{aligned}
y^{\prime} & =w_5 \cdot h_1^{(1)}+w_6 \cdot h_2^{(1)} \\
& =1.0 \cdot 1.0+2.0 \cdot 0.95 \\
& =2.9
\end{aligned}
$$

(3) Compute the loss: calculate the loss using the ground-truth value $y=0.8$ and the squared-error loss function

$$
\begin{aligned}
\delta & =\frac{1}{2}\left(y-y^{\prime}\right)^2 \\
& =0.5(0.8-2.9)^2 \\
& =2.205
\end{aligned}
$$

(4) Compute the gradient: this process is essentially the calculation of partial derivatives. Take the partial derivative with respect to parameter $w_5$ as an example, as shown below

20240729222554509

According to the chain rule:
$$
\frac{\partial \delta}{\partial w_5}=\frac{\partial \delta}{\partial y^{\prime}} \cdot \frac{\partial y^{\prime}}{\partial w_5}
$$

where:
$$
\begin{aligned}
\frac{\partial \delta}{\partial y^{\prime}} & =2 \cdot \frac{1}{2} \cdot\left(y-y^{\prime}\right)(-1) \\
& =y^{\prime}-y \\
& =2.9-0.8 \\
& =2.1 \\
y^{\prime} & =w_5 \cdot h_1^{(1)}+w_6 \cdot h_2^{(1)} \\
\frac{\partial y^{\prime}}{\partial w_5} & =h_1^{(1)}+0 \\
& =1.0
\end{aligned}
$$

therefore:
$$
\frac{\partial \delta}{\partial w_5}=\frac{\partial \delta}{\partial y^{\prime}} \cdot \frac{\partial y^{\prime}}{\partial w_5}=2.1 \times 1.0=2.1
$$

Similarly, if we take parameter $w_1$ as an example, its partial derivative calculation also uses the chain rule, as shown below.

$$
\begin{gathered}
\frac{\partial \delta}{\partial w_1}=\frac{\partial \delta}{\partial y^{\prime}} \cdot \frac{\partial y^{\prime}}{\partial h_1^{(1)}} \cdot \frac{\partial h_1^{(1)}}{\partial w_1} \\
y^{\prime}=w_5 \cdot h_1^{(1)}+w_6 \cdot h_2^{(1)} \\
\frac{\partial y^{\prime}}{\partial h_1^{(1)}}=w_5+0 \\
=1.0 \\
h_1^{(1)}=w_1 \cdot x_1+w_2 \cdot x_2 \\
\frac{\partial h_1^{(1)}}{\partial w_1}=x_1+0 \\
\frac{\partial \delta}{\partial w_1}=\frac{\partial \delta}{\partial y^{\prime}} \cdot \frac{\partial y^{\prime}}{\partial h_1^{(1)}} \cdot \frac{\partial h_1^{(1)}}{\partial w_1}=2.1 \times 1.0 \times 0.5=1.05
\end{gathered}
$$

(5) Update the network parameters with gradient descent: assume the initial value of the hyperparameter “learning rate” is 0.1. According to the gradient descent update rule, the update for parameter $w_1$ is calculated as follows:
$$
w_1^{\text {(update) }}=w_1-\eta \cdot \frac{\partial \delta}{\partial w_1}=1.0-0.1 \times 1.05=0.895
$$

Similarly, the other updated parameters can be calculated as:
$$
w_1=0.895, w_2=0.29, w_3=0.29, w_4=0.28, w_5=0.79, w_6=1.8005
$$

At this point, we have completed one full parameter update. Running the forward pass again with the updated parameters gives $h_1^{(1)}=0.7375$, $h_2^{(1)}=0.425$, and $y^{\prime} \approx 1.3478$, so we can recompute the loss to see whether it has decreased:
$$
\begin{aligned}
\delta & =\frac{1}{2}\left(y-y^{\prime}\right)^2 \\
& =0.5(0.8-1.3478)^2 \\
& =0.15
\end{aligned}
$$

Compared with the previously computed forward-propagation loss of 2.205, this result is clearly smaller.
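
The whole worked example can be checked with a few lines of plain Python. The sketch below assumes the wiring implied by the equations above ($h_1 = w_1x_1 + w_2x_2$, $h_2 = w_3x_1 + w_4x_2$, no biases, no activation function); the dictionary layout and function name are illustrative choices.

```python
# Inputs, target, initial weights and learning rate from the worked example.
x1, x2, y = 0.5, 1.0, 0.8
w = {"w1": 1.0, "w2": 0.5, "w3": 0.5, "w4": 0.7, "w5": 1.0, "w6": 2.0}
eta = 0.1

def forward(w):
    # Assumed wiring: h1 uses (w1, w2), h2 uses (w3, w4), output uses (w5, w6).
    h1 = w["w1"] * x1 + w["w2"] * x2
    h2 = w["w3"] * x1 + w["w4"] * x2
    y_pred = w["w5"] * h1 + w["w6"] * h2
    loss = 0.5 * (y - y_pred) ** 2
    return h1, h2, y_pred, loss

h1, h2, y_pred, loss_before = forward(w)
d_y = y_pred - y  # d(loss)/d(y'), the term shared by every gradient

# Chain rule for each weight.
grads = {
    "w1": d_y * w["w5"] * x1,
    "w2": d_y * w["w5"] * x2,
    "w3": d_y * w["w6"] * x1,
    "w4": d_y * w["w6"] * x2,
    "w5": d_y * h1,
    "w6": d_y * h2,
}

# One gradient-descent step, then a second forward pass.
w_new = {name: w[name] - eta * grads[name] for name in w}
_, _, y_pred_new, loss_after = forward(w_new)

print("updated weights:", w_new)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

Note that every gradient reuses the same factor `d_y`, which is what makes backpropagation cheaper than differentiating each weight from scratch.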

Common Layers


  • Linear layer (a linear combination of the inputs).
  • Activation layer (usually used together with a linear layer; applies a nonlinear function element-wise to the output of the linear layer): ReLU, Binary Step, Sigmoid, Tanh, etc.
  • Softmax layer (generalizes the sigmoid to more than 2 classes; outputs a probability for each class) for classification tasks.
  • Loss-function layer (for example, MSE and cross-entropy).
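
To make these layers concrete, here is a small NumPy sketch of ReLU, sigmoid and softmax. The input values are hypothetical, and the max-subtraction inside `softmax` is a common numerical-stability trick rather than part of the definition.

```python
import numpy as np

def relu(u):
    return np.maximum(0.0, u)           # element-wise max(0, u)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def softmax(u):
    shifted = u - np.max(u)             # stability trick; does not change the result
    exp_u = np.exp(shifted)
    return exp_u / exp_u.sum()

logits = np.array([2.0, 1.0, 0.1])      # hypothetical scores for 3 classes
probs = softmax(logits)
print("softmax:", probs, "sum =", probs.sum())

# With two classes, softmax over [z, 0] reduces to sigmoid(z).
z = 1.3
two_class = softmax(np.array([z, 0.0]))[0]
print(two_class, sigmoid(z))
```

The last two printed values agree, which is the sense in which softmax is a multi-class counterpart of the sigmoid.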

Example – Regression Neural Network – House Prices


  • House price dataset:
  • Two input features: Size and Floor
  • One output: house price
  • Loss function: MSE
  • Network architecture: 2 hidden layers and one output layer

Layout:

20240729222811653

$$ F(X,W) = W_3^T \phi_2(W_2^T\phi_1(W_1^TX + b_1) + b_2) + b_3 $$

Where: $$ X \in \mathbb{R}^2, \quad W_1 \in \mathbb{R}^{2 \times 4}, \quad W_2 \in \mathbb{R}^{4 \times 3}, \quad W_3 \in \mathbb{R}^{3 \times 1}, \quad b_1 \in \mathbb{R}^4, \quad b_2 \in \mathbb{R}^3, \quad b_3 \in \mathbb{R} $$
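
A quick way to check that these shapes fit together is to evaluate $F(X, W)$ once with random parameters. The NumPy sketch below assumes ReLU for both $\phi_1$ and $\phi_2$ and uses a hypothetical input (size and floor); the random weights are placeholders, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Parameter shapes as listed above; values are random placeholders.
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(3, 1)), 0.0

def relu(u):
    return np.maximum(0.0, u)

def F(x):
    h1 = relu(W1.T @ x + b1)    # shape (4,)
    h2 = relu(W2.T @ h1 + b2)   # shape (3,)
    return W3.T @ h2 + b3       # shape (1,): the predicted price

x = np.array([120.0, 3.0])      # hypothetical house: size and floor
out = F(x)
print(out.shape, out)
```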

Step-by-Step Solution


  • The MSE loss function and corresponding training objective for all training examples $x_i$: $$ Error = \frac{1}{N} \sum_{i=1}^N (F(x_i, W) - y_i)^2 = \frac{1}{N} \|F(X, W) - Y\|_2^2 $$

  • Linear layer: $$ u_{out} = W^Tu_{in} + b $$

  • Activation layer:

  • $\phi_1$ and $\phi_2$ are nonlinear functions applied element-wise to a vector, so: $$ \phi(U) = \phi\left(\begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}\right) = \begin{bmatrix} \phi(u_1) \\ \vdots \\ \phi(u_n) \end{bmatrix} $$

  • For ReLU: $$ \begin{bmatrix} \phi(u_1) \\ \vdots \\ \phi(u_n) \end{bmatrix} = \begin{bmatrix} \max(0, u_1) \\ \vdots \\ \max(0, u_n) \end{bmatrix} $$
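
The two forms of the MSE above, and the element-wise ReLU, can be verified numerically. The predictions and targets below are made-up numbers used only to exercise the formulas.

```python
import numpy as np

preds = np.array([2.5, 0.0, 2.0, 8.0])      # hypothetical F(x_i, W)
targets = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical y_i

# Sum form: (1/N) * sum_i (F(x_i, W) - y_i)^2
mse_sum = np.mean((preds - targets) ** 2)

# Norm form: (1/N) * ||F(X, W) - Y||_2^2
mse_norm = np.linalg.norm(preds - targets) ** 2 / len(targets)

# ReLU applied element-wise to a vector.
u = np.array([-1.5, 0.0, 2.0])
relu_u = np.maximum(0.0, u)

print(mse_sum, mse_norm, relu_u)
```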

Forward Pass


$$ F(X,W) = W_3^T \phi_2(W_2^T\phi_1(W_1^TX + b_1) + b_2) + b_3 $$

20240729225110480

Backward Pass


The following illustration depicts the backpropagation process:

20240729225214237

Building a Neural Network with PyTorch


Now we will use PyTorch to implement a neural network for regression. We will use the California housing dataset (loaded below with fetch_california_housing) and the architecture described above.

import torch
import torch.nn as nn
# define our neural network model
# this approach provides easier access to the layers (e.g., 'model.fc_1' returns the first layer, and 'model.fc_1.weight' its weights)
class HousePricesMLP(nn.Module):
    # notice that we inherit from nn.Module
    def __init__(self, input_dim, output_dim):
        super(HousePricesMLP, self).__init__()
        # here we initialize the building blocks of our network
        # single neuron is just one linear (fully-connected) layer
        self.fc_1 = nn.Linear(input_dim, 4) 
        self.fc_2 = nn.Linear(4, 3)
        self.output_layer = nn.Linear(3, output_dim)

    def forward(self, x):
        # here we define what happens to the input x in the forward pass
        # that is, the order in which x goes through the building blocks
        x = torch.relu(self.fc_1(x))
        x = torch.relu(self.fc_2(x))
        return self.output_layer(x)
# alternative method - more readable, easier to code, less convenient access to weights
# e.g., to access the first layer weights -- `model.hidden[0].weight`
class HousePricesMLP(nn.Module):
    # notice that we inherit from nn.Module
    def __init__(self, input_dim, output_dim):
        super(HousePricesMLP, self).__init__()
        # here we initialize the building blocks of our network
        # single neuron is just one linear (fully-connected) layer
        self.hidden = nn.Sequential(nn.Linear(input_dim, 4),
                                    nn.ReLU(),
                                    nn.Linear(4, 3),
                                    nn.ReLU())
        self.output_layer = nn.Linear(3, output_dim)

    def forward(self, x):
        # here we define what happens to the input x in the forward pass
        # that is, the order in which x goes through the building blocks
        return self.output_layer(self.hidden(x))
# NOTE: in this example we are using a very simple NN model
# We usually use wider and deeper networks such as this one:
class HousePricesMLP(nn.Module):
    # notice that we inherit from nn.Module
    def __init__(self, input_dim, output_dim, hidden_dim=256):
        super(HousePricesMLP, self).__init__()
        # here we initialize the building blocks of our network
        # single neuron is just one linear (fully-connected) layer
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim),
                                    nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim),
                                    nn.ReLU(),
                                    nn.Linear(hidden_dim, hidden_dim),
                                    nn.ReLU(),)
        self.output_layer = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # here we define what happens to the input x in the forward pass
        # that is, the order in which x goes through the building blocks
        return self.output_layer(self.hidden(x))
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load data
california_housing = fetch_california_housing()

# Convert to DataFrame
data = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
data['target'] = california_housing.target

# Print description of the features
print(california_housing.DESCR)
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:sklearn.datasets.fetch_california_housing function.

.. topic:: References

    - Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
      Statistics and Probability Letters, 33 (1997) 291-297
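# Convert to DataFrame
# NOTE: the variable is called `boston` and the target column 'MEDV', but the data is the California housing dataset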
boston = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
boston['MEDV'] = california_housing.target

# Sample 10 rows
sampled_data = boston.sample(10)
print(sampled_data)
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
11143  3.1884      25.0  5.188630   1.073643      2166.0  2.798450     33.84   
9961   5.8150      34.0  7.670412   1.183521       780.0  2.921348     38.33   
12213  6.9930      13.0  6.428571   1.000000       120.0  2.857143     33.51   
4354   8.9440      30.0  7.170455   1.087500      1776.0  2.018182     34.10   
4629   2.2708      18.0  2.571135   1.108755      3296.0  2.254446     34.07   
11026  5.8622      30.0  6.456164   1.038356      2271.0  3.110959     33.80   
20185  5.9181      24.0  5.700000   1.034375      1049.0  3.278125     34.27   
17427  2.3333      32.0  5.816976   1.140584      1074.0  2.848806     34.65   
4080   3.1373      23.0  3.752241   1.074980      2391.0  1.948655     34.15   
13890  2.2612      12.0  5.235714   1.024405     11139.0  6.630357     34.45   

       Longitude     MEDV  
11143    -117.94  1.35400  
9961     -122.26  3.39200  
12213    -117.18  5.00001  
4354     -118.39  5.00001  
4629     -118.30  1.75000  
11026    -117.83  2.21000  
20185    -119.16  2.21100  
17427    -120.47  1.30200  
4080     -118.37  2.63100  
13890    -116.14  1.37500  
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Use 2 features
x = boston[['AveRooms', 'AveOccup']].values  # AveRooms - average number of rooms, AveOccup - average number of household members
y = boston['MEDV'].values

# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=5)

# Scaling
x_scaler = StandardScaler()
x_scaler.fit(x_train)
x_train = x_scaler.transform(x_train)
x_test = x_scaler.transform(x_test)

print("total training samples: {}, total test samples: {}".format(len(x_train), len(x_test)))
total training samples: 16512, total test samples: 4128
import torch
from torch.utils.data import TensorDataset, DataLoader
import torch.nn as nn

# Convert to tensor dataset for PyTorch
boston_tensor_train_ds = TensorDataset(torch.tensor(x_train, dtype=torch.float), torch.tensor(y_train, dtype=torch.float))
boston_tensor_test_ds = TensorDataset(torch.tensor(x_test, dtype=torch.float), torch.tensor(y_test, dtype=torch.float))

# Check
print(f'sample 0: features: {boston_tensor_train_ds[0][0]}, target: {boston_tensor_train_ds[0][1]}')

# Define hyper-parameters and create our model
num_features = 2
output_dim = 1
batch_size = 128
learning_rate = 0.01
num_epochs = 200

# Device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Loss criterion
criterion = nn.MSELoss()

# Model
model = HousePricesMLP(num_features, output_dim).to(device)

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# DataLoader
train_loader = DataLoader(boston_tensor_train_ds, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(boston_tensor_test_ds, batch_size=batch_size, shuffle=False)
sample 0: features: tensor([-0.7866, -0.0164]), target: 2.8459999561309814
# Training loop
for epoch in range(num_epochs):
    model.train()
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)

        # Zero the parameter gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(inputs).view(-1)
        loss = criterion(outputs, targets)

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

    # Print the loss of the last mini-batch every 10 epochs
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

# Evaluation
model.eval()
with torch.no_grad():
    test_loss = 0
    for inputs, targets in test_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        outputs = model(inputs).view(-1)
        loss = criterion(outputs, targets)
        test_loss += loss.item()

    test_loss /= len(test_loader)
    print(f'Test Loss: {test_loss:.4f}')
Epoch [10/200], Loss: 1.0621
Epoch [20/200], Loss: 1.0323
Epoch [30/200], Loss: 0.8559
Epoch [40/200], Loss: 1.3087
Epoch [50/200], Loss: 1.1804
Epoch [60/200], Loss: 1.0741
Epoch [70/200], Loss: 1.0675
Epoch [80/200], Loss: 0.9341
Epoch [90/200], Loss: 0.6055
Epoch [100/200], Loss: 1.0619
Epoch [110/200], Loss: 1.0063
Epoch [120/200], Loss: 0.9453
Epoch [130/200], Loss: 0.9202
Epoch [140/200], Loss: 0.9076
Epoch [150/200], Loss: 0.8739
Epoch [160/200], Loss: 1.0152
Epoch [170/200], Loss: 0.7826
Epoch [180/200], Loss: 0.8854
Epoch [190/200], Loss: 1.0572
Epoch [200/200], Loss: 0.9545
Test Loss: 1.0264

Weight Initialization


  • As we have learned, neural networks are trained using stochastic optimization algorithms such as stochastic gradient descent (SGD), RMSprop, Adam, and so on.
  • Recall that these algorithms need a starting point: the parameters must be initialized to some values before the first update. They also rely on randomness to find a sufficiently good set of weights for the particular mapping from inputs to outputs being learned from the data.
  • In practice, the network weights are usually initialized to small random values (random, but close to zero).
  • Randomness is also used in the search process by shuffling the training dataset before each epoch, which in turn leads to different gradient estimates for each batch.
  • Training deep models is a fairly difficult task, and most algorithms are strongly affected by the choice of initialization (p. 301, Deep Learning, 2016).

Why Not Simply Initialize with Zeros?
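
One standard answer is symmetry: if every parameter starts at zero, all hidden units compute the same output and receive the same gradient, so they cannot learn different features. The sketch below illustrates this on a small ReLU MLP shaped like the one used earlier. The batch and targets are hypothetical, and this is only a demonstration of the effect, not a general proof.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small MLP with the same layer sizes as the house-price example.
model = nn.Sequential(
    nn.Linear(2, 4), nn.ReLU(),
    nn.Linear(4, 3), nn.ReLU(),
    nn.Linear(3, 1),
)

# Set every weight and bias to zero.
for param in model.parameters():
    nn.init.zeros_(param)

x = torch.randn(8, 2)   # hypothetical batch of inputs
y = torch.ones(8)       # hypothetical targets

loss = nn.MSELoss()(model(x).view(-1), y)
loss.backward()

# Sum of absolute gradient values for each parameter tensor.
grad_norms = {name: p.grad.abs().sum().item() for name, p in model.named_parameters()}
for name, value in grad_norms.items():
    print(f"{name}: {value}")
```

In this setup only the output bias receives a non-zero gradient; every weight matrix gets exactly zero, so gradient descent cannot move the hidden layers away from their starting point.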


Recent Trend: Non-Random Initializations

  • Beyond Signal Propagation: Is Feature Diversity Necessary in Deep Neural Network Initialization? In this paper, the authors construct a deep network with identical features by initializing almost all weights to 0. The architecture not only achieves perfect signal propagation and stable gradients, but also obtains high accuracy on standard benchmarks, suggesting that randomly diverse initialization is not necessary for training neural networks.

  • ZerO Initialization: Initializing Neural Networks with only Zeros and Ones. In this paper, random weight initialization is replaced by a fully deterministic initialization scheme that uses zeros and ones (after normalization) to initialize network weights, based on identity and Hadamard transforms. The authors show encouraging results on various benchmarks, paving the way for simple initialization schemes that work as well as random initialization.

These studies show that neural networks do not necessarily need randomly initialized weights to train well.

Types of Weight Initialization


  • Initializing neural-network weights is an active research area, because careful initialization can speed up the learning process.

  • There is no single best method for initializing neural-network weights.

  • We will review several popular initialization methods.

  • Uniform – Initialize using values drawn from the uniform distribution $\mathcal{U}(a, b)$

  • In PyTorch – torch.nn.init.uniform_(tensor, a=0.0, b=1.0)

  • Normal – Initialize using values drawn from the normal distribution $\mathcal{N}(\text{mean}, \text{std}^2)$

  • In PyTorch – torch.nn.init.normal_(tensor, mean=0.0, std=1.0)

  • Constant – Initialize using the value $val$.

  • In PyTorch – torch.nn.init.constant_(tensor, val)

  • Ones – Initialize with the scalar value 1.

  • In PyTorch – torch.nn.init.ones_(tensor)

  • Zeros – Initialize with the scalar value 0.

  • In PyTorch – torch.nn.init.zeros_(tensor)

  • Xavier Uniform – according to the method described in Understanding the difficulty of training deep feedforward neural networks – Glorot, X. & Bengio, Y. (2010), initialize values using a uniform distribution. The resulting tensor will contain values sampled from $\mathcal{U}(-a, a)$ where $$ a = \text{gain} \times \sqrt{\frac{6}{\text{fan}_{in} + \text{fan}_{out}}} $$

  • fan_in is the number of input units in the weight tensor, fan_out is the number of output units in the weight tensor, and the main role of gain is to adjust the scale of the initialized weights so that signals do not suffer from vanishing or exploding gradients as they propagate through the network.

  • In PyTorch – torch.nn.init.xavier_uniform_(tensor, gain=1.0)

  • Xavier Normal – according to the method described in Understanding the difficulty of training deep feedforward neural networks – Glorot, X. & Bengio, Y. (2010), initialize values using a normal distribution. The resulting tensor will contain values sampled from $\mathcal{N}(0,\text{std}^2)$ where $$ \text{std} = \text{gain} \times \sqrt{\frac{2}{\text{fan}_{in} + \text{fan}_{out}}} $$

  • fan_in is the number of input units in the weight tensor, and fan_out is the number of output units in the weight tensor

  • In PyTorch – torch.nn.init.xavier_normal_(tensor, gain=1.0)

  • Kaiming (He) Uniform – according to the method described in Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification – He, K. et al. (2015), initialize values using a uniform distribution. The resulting tensor will contain values sampled from $\mathcal{U}(-\text{bound}, \text{bound})$ where $$ \text{bound} = \text{gain} \times \sqrt{\frac{3}{\text{fan\_mode}}} $$

  • In PyTorch – torch.nn.init.kaiming_uniform_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')

  • a – the negative slope used by leaky_relu (only used with leaky_relu)

  • gain: scaling factor determined by the nonlinearity, $\sqrt{2}$ for ReLU and $\sqrt{2/(1+a^2)}$ for Leaky ReLU with negative slope $a$.

  • fan_mode: can be fan_in or fan_out.

  • fan_in: represents the number of input units in the weight tensor (the number of neurons in the previous layer). This means that during initialization, the number of inputs to each neuron in the forward pass is considered to ensure that signals do not become too large or too small during propagation.

  • fan_out: represents the number of output units in the weight tensor (the number of neurons in the next layer). This means that during initialization, the number of outputs of each neuron in backpropagation is considered to ensure that gradients do not become too large or too small during propagation.

During initialization, both fan_in and fan_out are computed from the shape (dimensions) of the weight tensor, which is already determined when the network architecture is defined. For example, for a fully connected layer, the shape of its weight tensor is $[\text{fan\_out}, \text{fan\_in}]$.

  • Kaiming (He) Normal – according to the method described in Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification – He, K. et al. (2015), initialize values using a normal distribution. The resulting tensor will contain values sampled from $\mathcal{N}(0,\text{std}^2)$, where $$ \text{std} = \frac{\text{gain}}{\sqrt{\text{fan\_mode}}} $$
  • In PyTorch – torch.nn.init.kaiming_normal_(tensor, a=0, mode='fan_in', nonlinearity='leaky_relu')
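
The formulas in this list can be compared against what PyTorch actually samples. The sketch below uses a hypothetical weight tensor with fan_in = 256 and fan_out = 128 and checks the Xavier uniform bound, the Kaiming uniform bound and the Kaiming normal standard deviation; the sizes and the seed are arbitrary choices.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 256, 128               # hypothetical layer size
weight = torch.empty(fan_out, fan_in)    # fully connected layout: [fan_out, fan_in]

# Xavier uniform: a = gain * sqrt(6 / (fan_in + fan_out))
gain = 1.0
a = gain * math.sqrt(6.0 / (fan_in + fan_out))
nn.init.xavier_uniform_(weight, gain=gain)
xavier_max = weight.abs().max().item()

# Kaiming uniform with mode='fan_in' and ReLU: bound = gain * sqrt(3 / fan_in)
relu_gain = nn.init.calculate_gain('relu')   # sqrt(2)
bound = relu_gain * math.sqrt(3.0 / fan_in)
nn.init.kaiming_uniform_(weight, mode='fan_in', nonlinearity='relu')
kaiming_max = weight.abs().max().item()

# Kaiming normal with mode='fan_in' and ReLU: std = gain / sqrt(fan_in)
expected_std = relu_gain / math.sqrt(fan_in)
nn.init.kaiming_normal_(weight, mode='fan_in', nonlinearity='relu')
empirical_std = weight.std().item()

print(f"Xavier uniform:  max |w| = {xavier_max:.4f}, bound a = {a:.4f}")
print(f"Kaiming uniform: max |w| = {kaiming_max:.4f}, bound = {bound:.4f}")
print(f"Kaiming normal:  std = {empirical_std:.4f}, expected = {expected_std:.4f}")
```

The sampled maxima stay within the computed bounds, and the empirical standard deviation should land close to the expected value because the tensor holds many samples.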

PyTorch has default initialization schemes that usually work well. For example, nn.Linear layers initialize their weights with kaiming_uniform_ by default.
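
As a rough check of this default, the sketch below creates an nn.Linear layer and compares its weights with the bound implied by kaiming_uniform_ with a = sqrt(5). That value of `a` is how the PyTorch versions this sketch was written against set up nn.Linear; treat it as an assumption, since the default could differ in other versions. Under that assumption the bound simplifies to 1/sqrt(fan_in).

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(256, 128)     # hypothetical layer size
fan_in = layer.in_features

# Assumed default: kaiming_uniform_ with a = sqrt(5) (leaky_relu gain).
gain = nn.init.calculate_gain('leaky_relu', math.sqrt(5))   # sqrt(2 / (1 + 5))
bound = gain * math.sqrt(3.0 / fan_in)                      # simplifies to 1 / sqrt(fan_in)
limit = 1.0 / math.sqrt(fan_in)

max_weight = layer.weight.abs().max().item()
print(f"max |w| = {max_weight:.4f}, bound = {bound:.4f}, 1/sqrt(fan_in) = {limit:.4f}")
```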


Initializing Neural Network Weights with PyTorch


# define hyper-parameters and create our model
num_features = 2
output_dim = 1
batch_size = 128
learning_rate = 0.01
num_epochs = 500
# device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# loss criterion
criterion = nn.MSELoss()
# model
model = HousePricesMLP(num_features, output_dim).to(device)
# optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# use a different initialization for the model
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Linear') != -1:
        torch.nn.init.xavier_normal_(m.weight, gain=1.0)
model.apply(weights_init)
HousePricesMLP(
  (hidden): Sequential(
    (0): Linear(in_features=2, out_features=256, bias=True)
    (1): ReLU()
    (2): Linear(in_features=256, out_features=256, bias=True)
    (3): ReLU()
    (4): Linear(in_features=256, out_features=256, bias=True)
    (5): ReLU()
  )
  (output_layer): Linear(in_features=256, out_features=1, bias=True)
)
# another way to do that
class HousePricesMLP(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(HousePricesMLP, self).__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 4),
                                    nn.ReLU(),
                                    nn.Linear(4, 3),
                                    nn.ReLU())
        self.output_layer = nn.Linear(3, output_dim)
        # NEW: init weights here
        self.init_weights()

    def forward(self, x):
        return self.output_layer(self.hidden(x))

    def init_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                torch.nn.init.xavier_normal_(m.weight, gain=1.0)
                if m.bias is not None:
                    torch.nn.init.constant_(m.bias, 0)
import numpy as np
boston_tensor_train_dataloader = DataLoader(boston_tensor_train_ds, batch_size=batch_size, shuffle=True)

# training loop for the model
for epoch in range(num_epochs):
    epoch_losses = []
    for features, targets in boston_tensor_train_dataloader:
        # send data to device
        features = features.to(device)
        targets = targets.to(device)
        # forward pass
        output = model(features)
        # loss
        loss = criterion(output.view(-1), targets)
        # backward pass
        optimizer.zero_grad()  # clean the gradients from previous iteration
        loss.backward()  # autograd backward to calculate gradients
        optimizer.step()  # apply update to the weights
        epoch_losses.append(loss.item())
    if epoch % 50 == 0:
        print(f'epoch: {epoch} loss: {np.mean(epoch_losses)}')

# test error
model.eval()
with torch.no_grad():
    test_outputs = model(torch.tensor(x_test, dtype=torch.float, device=device))
    test_error = criterion(test_outputs.view(-1), torch.tensor(y_test, dtype=torch.float, device=device))
print(f'test MSE error: {test_error.item()}')
epoch: 0 loss: 1.3302481183710024
epoch: 50 loss: 0.9436245888702629
epoch: 100 loss: 0.9447847960531249
epoch: 150 loss: 0.9419918522354245
epoch: 200 loss: 0.9397756309472314
epoch: 250 loss: 0.9362258513768514
epoch: 300 loss: 0.9383858072665311
epoch: 350 loss: 0.9368446367655614
epoch: 400 loss: 0.936691609926002
epoch: 450 loss: 0.9354286706724833
test MSE error: 0.9552490711212158

Deep Double Descent


  • Double descent in machine-learning training: as model size, data size, or training time increases, performance first improves, then gets worse, and then improves again.
  • This effect can often be avoided through careful regularization or early stopping.
  • Although this behavior appears to be fairly common, we still do not fully understand why it happens.

20240729225716791

Deep double descent challenges the traditional view of the bias-variance tradeoff, which holds that increasing model complexity usually leads to overfitting and higher test error.

In modern deep learning models, especially those with large-scale datasets and architectures, this phenomenon highlights the nontrivial behavior of test error:

  • Initial descent: as model complexity increases, the model fits the data better, reducing bias.
  • Intermediate ascent: further increases in complexity lead to overfitting; the model begins to capture noise in the training data, increasing variance and test error.
  • Second descent: after a certain complexity threshold is exceeded, the model becomes very powerful and can generalize better by effectively using large datasets and regularization techniques, thereby reducing test error again.

This insight has practical implications for model training and architecture design. It suggests that in some cases, increasing model complexity and data can ultimately lead to better generalization performance, contrary to traditional expectations. Regularization techniques and careful monitoring during training are essential for guiding this behavior effectively.

Viewing Double Descent from the Model Perspective


  • There are cases where bigger models are worse.

  • Model-wise double descent can lead to worse results when training with more data.

    20240729225811323

  • Classical Regime:

This shows the traditional bias-variance tradeoff theory.
In this regime, as model complexity increases, error first decreases and then rises again due to overfitting.

  • Critical Regime:

In this regime, model error exhibits relatively sharp fluctuations.
The peak in test error appears near the interpolation threshold, where the model is just large enough to fit the training set.

  • Modern Regime:

In this region, as model size increases further, error decreases again and the model improves.

  • Reality (blue curve):

This shows the behavior observed in practice: test error first decreases, then rises in the critical regime, and finally decreases again in the modern regime.

  • Train error (green curve):

As model complexity increases, training error continues to decrease, indicating that the model’s ability to fit the training set keeps improving.

Summary:

Critical regime: the double-descent phenomenon mainly appears in the critical regime, where model error fluctuates significantly.

Modern regime: with sufficiently large models, error decreases again, showing that more complex models can perform better when handling large-scale data.

Relationship between data and model complexity: double descent reminds us that when designing and training models, we cannot simply rely on increasing data and model complexity to improve performance. More factors and strategies must be considered, such as regularization and appropriate early stopping.
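
Model-wise double descent can sometimes be reproduced on a toy problem. The sketch below is an assumption-laden illustration, not the deep-network experiments shown in the figures: it fits a minimum-norm least-squares model on random ReLU features of synthetic data and sweeps the number of features p, which plays the role of model size. The data generator, the noise level and the feature counts are all hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 5
w_true = rng.normal(size=d)                    # hypothetical ground-truth linear model

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)  # noisy targets
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

def random_relu_features(X, V):
    return np.maximum(0.0, X @ V)              # fixed random hidden layer

results = {}
for p in (5, 10, 20, 40, 80, 400):             # model size = number of random features
    V = rng.normal(size=(d, p)) / np.sqrt(d)
    phi_train = random_relu_features(X_train, V)
    phi_test = random_relu_features(X_test, V)
    # Minimum-norm least squares; it can interpolate the training set once p >= n_train.
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    train_mse = np.mean((phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    results[p] = (train_mse, test_mse)
    print(f"p={p:4d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

In runs of this kind, the training error falls to essentially zero once p reaches the number of training samples, and the test error often peaks near that interpolation threshold before dropping again for larger p. The exact numbers depend on the seed and the noise level.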

Sample-wise Non-monotonicity


  • Sample-wise non-monotonicity:

This phenomenon means that increasing the number of training samples can sometimes harm model performance, contrary to the usual expectation that more data improves model accuracy.

20240729225857427

Training Epochs and Model Size


20240729230000779

20240729230029494
