How does a neural net really work

In this notebook I reimplement core parts of fast.ai's Kaggle notebook "How does a neural net really work" to build intuition step by step. The flow is: start with regression, use gradient descent to optimize coefficients, apply the same ideas to a shallow neural network on Titanic data, and then extend that model into a deeper network.

  1. Revising Regressions
    • Plot a generic quadratic function ($ax^2 + bx + c$)
    • Generate some random data points
    • Learn the step-by-step process to find the values of a, b, and c that make our function fit the random data generated in step 2
    • Use the mean absolute error to manually adjust a, b, and c.
  2. Understand and break down the Gradient Descent algorithm
  3. The Basics of a Neural-Network using the Titanic Survival dataset from Kaggle
    • Explore ReLUs and how they differ from a simple linear function
    • Build a single-layer neural network using a weighted-sum function $f(x)=m_{Age}x_{Age}+m_{SibSp}x_{SibSp}+m_{Parch}x_{Parch}+m_{Fare}x_{Fare}$
    • Do deep learning by stacking layers of weights/neurons into a multi-layer neural network
In [1]:
# Dependencies are managed with uv.
# Run `uv sync` in the project root before executing this notebook.

1. Revising Regressions

This section, from the fast.ai course, sets the stage for understanding how neural networks learn "weights". We'll plot some data points and use visualizations to see how changing the coefficients helps the function fit those points better.

1.1 Plot a generic quadratic function ($ax^2+bx+c$)

In [2]:
from fastai.basics import torch, plt
import numpy as np, pandas as pd

# Wider display settings make notebook tables and tensors easier to inspect
np.set_printoptions(linewidth=140)
torch.set_printoptions(linewidth=140, sci_mode=False, edgeitems=7)
pd.set_option('display.width', 140)

# Set the figure DPI to 90 for better resolution
plt.rc('figure', dpi=90)

# Function to plot a mathematical function over a range
def plot_function(f, title=None, min=-2.1, max=2.1, color='r', ylim=None):
    # Create evenly spaced x values as a column vector
    x = torch.linspace(min,max, 100)[:,None]
    # Set y-axis limits if specified
    if ylim: plt.ylim(ylim)
    # Plot the function
    plt.plot(x, f(x), color)
    # Add title if provided
    if title is not None: plt.title(title)
In [3]:
# Function with quadratic expression ax^2 + bx + c
def quad(a, b, c, x): return a*x**2 + b*x + c

from functools import partial
# Creates a new function with fixed a,b,c parameters, leaving only x variable
# This allows us to create specific quadratic functions by "fixing" the coefficients
def mk_quad(a,b,c): return partial(quad, a,b,c)

def demo_plot_basic_quadratic():
    a = 3
    b = 2
    c = 1
    f = mk_quad(a, b ,c)
    plot_function(f, title=f'${a}x^2 + {b}x + {c}$')

demo_plot_basic_quadratic()

1.2. Generate some random data points

In [4]:
# Add both multiplicative and additive noise to input x
def add_noise(x, mult, add): return x * (1+torch.randn(x.shape) * mult) + torch.randn(x.shape) * add

def generate_noisy_data(f, x_start=-2, x_end=2, num_datapoints=20, noise_mult=0.15, noise_add=0.15, seed=42):
    # Define a static seed, so that the random data is always the same every time we run this
    torch.manual_seed(seed)
    # Create evenly spaced x values and add a dimension of size 1 at position 1,
    # transforming shape from (20,) to (20,1) to make it a column vector
    # Example: tensor([1,2,3]) shape=(3,)  ->  tensor([[1],[2],[3]]) shape=(3,1)
    x = torch.linspace(x_start, x_end, steps=num_datapoints).unsqueeze(1) 
    # Generate noisy y values by applying noise to function output
    # noise_mult controls the multiplicative noise, noise_add the additive noise (both default to 0.15)
    y = add_noise(f(x), noise_mult, noise_add)
    return x, y

def demo_plot_random_data():
    x,y = generate_noisy_data(mk_quad(3,2,1))
    plt.scatter(x,y);

demo_plot_random_data()

1.3. Fit the function to the data

In this section, we will explore the step-by-step process to find the values of a, b, and c that allow our function to accurately reflect the random data generated in 1.2. The interactive plot below shows how adjusting a, b, and c influences the function's shape so it better aligns with our data.

In [5]:
from ipywidgets import interact
@interact(a=1.5, b=1.5, c=1.5)
def demo_interactive_plot_quad(a, b, c):
    plt.close('all')  # Close all existing figures
    plt.figure()      # Create a new figure
    x,y = generate_noisy_data(mk_quad(3,2,1))
    
    plt.scatter(x,y)
    plot_function(mk_quad(a,b,c), ylim=(0,13))

1.4 Measure the error

This is cool and works, but we need to know how close we are to our ideal solution. In regression, we can use standard metrics to estimate this, like the "Mean Absolute Error" (MAE), which averages the absolute distance between predicted and actual values.

The fastai library has a wrapper for some of the most common methods (from scikit-learn).

Here's a quick demo where we can see how it is calculated.

In [6]:
def mean_absolute_error(preds, acts): return (torch.abs(preds-acts)).mean()

def demo_mean_absolute_error():
    # Create some example predictions and actual values
    preds = torch.tensor([1.0, 2.0, 3.0, 4.0])
    actuals = torch.tensor([1.1, 2.1, 2.8, 4.2])

    # Calculate and print the mean absolute error
    error = mean_absolute_error(preds, actuals)
    print(f"Mean Absolute Error: {error:.3f}")

    # Let's break down what's happening:
    print("\nAbsolute differences between predictions and actuals:")
    abs_diffs = torch.abs(preds-actuals)
    for i, (p, a, d) in enumerate(zip(preds, actuals, abs_diffs)):
        print(f"Prediction: {p:.1f}, Actual: {a:.1f}, Absolute Difference: {d:.3f}")

demo_mean_absolute_error()
Mean Absolute Error: 0.150

Absolute differences between predictions and actuals:
Prediction: 1.0, Actual: 1.1, Absolute Difference: 0.100
Prediction: 2.0, Actual: 2.1, Absolute Difference: 0.100
Prediction: 3.0, Actual: 2.8, Absolute Difference: 0.200
Prediction: 4.0, Actual: 4.2, Absolute Difference: 0.200
In [7]:
@interact(a=1.5, b=1.5, c=1.5)
def plot_quad(a, b, c):
    x,y = generate_noisy_data(mk_quad(3,2,1))

    f = mk_quad(a, b ,c)
    plt.scatter(x,y)
    loss = mean_absolute_error(f(x), y)
    plot_function(f, ylim=(0,13), title=f"MAE: {loss:.2f}")

2. Understand and break down the Gradient Descent algorithm

Now that we can calculate the mean absolute error, the next step is to understand how to adjust our parameters a, b, and c to reduce this error. To do this, we can think about the gradients of the error with respect to each of the parameters a, b, and c.

👉 If you were walking on a hill (representing the error surface), the partial derivative with respect to one direction (say, the 'a' direction) tells you the slope of the hill in that specific direction. A steep slope/gradient means a large change in error for a small change in 'a'.

For example, if we consider the partial derivative of the mean absolute error with respect to a (while keeping b and c fixed), a negative value would indicate that increasing a will lead to a decrease in the error (like walking forward-downhill in the 'a' direction). Conversely, a positive value would suggest that decreasing a would reduce the error (awkwardly walking backwards-downhill in the 'a' direction 😄).
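This hill intuition can be checked numerically with a finite difference: nudge one coefficient while holding the others fixed and see how the MAE moves. Here is an illustrative sketch (the toy data and the step size `eps` are my own choices for the example, not values from the notebook):

```python
import torch

def mae(preds, acts): return (preds - acts).abs().mean()

# Toy data following y = 3x^2 + 2x + 1 exactly (no noise, to keep it simple)
x = torch.linspace(-2, 2, 20)
y = 3*x**2 + 2*x + 1

def loss_at(a, b, c): return mae(a*x**2 + b*x + c, y).item()

# Finite-difference slope of the loss in the 'a' direction at (a, b, c) = (1, 1, 1)
eps = 1e-4
slope_a = (loss_at(1 + eps, 1, 1) - loss_at(1 - eps, 1, 1)) / (2 * eps)
print(slope_a)  # negative, so increasing 'a' walks downhill
```

The sign matches the intuition above: a = 1 is too small (the true value is 3), so the slope in the 'a' direction is negative and increasing a reduces the error.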


Using AI, we can plot this hill. The following plot shows the "error surface" (MAE) for the function $f(x) = m*x + b$. By fixing one variable (e.g., b=2), we can visualize the error as a function of m alone and determine whether to increase or decrease m to move "downhill."

In [8]:
import plotly
plotly.offline.init_notebook_mode(connected=True)

import plotly.graph_objects as go

def demo_mae_surface():
    # Actual data points
    x_actual = torch.tensor([1.0, 2.0])
    y_actual = torch.tensor([2.0, 4.0])

    # Range of m and b values to explore
    m_vals = torch.linspace(-1, 5, 100) # Explore m from -1 to 5
    b_vals = torch.linspace(-1, 5, 100) # Explore b from -1 to 5

    # Create a meshgrid of m and b values
    M, B = torch.meshgrid(m_vals, b_vals, indexing='ij')

    # Initialize an array to store MAE values for the surface plot
    mae_values_surface = torch.zeros_like(M)

    # Calculate MAE for each combination of m and b for the surface plot
    for i in range(M.shape[0]):
        for j in range(M.shape[1]):
            m = M[i, j]
            b = B[i, j]
            preds = m * x_actual + b
            mae = mean_absolute_error(preds, y_actual)
            mae_values_surface[i, j] = mae.item() # Store the scalar value

    # --- Calculate MAE for fixed b to show gradient of m ---
    fixed_b = 2.0 # Fixed b value
    mae_values_fixed_b = []
    for m in m_vals:
        preds = m * x_actual + fixed_b
        mae = mean_absolute_error(preds, y_actual)
        mae_values_fixed_b.append(mae.item())


    # Create the surface plot using Plotly
    fig = go.Figure(data=[go.Surface(z=mae_values_surface.numpy(), x=b_vals.numpy(), y=m_vals.numpy(), colorscale='RdBu', name='MAE Surface')])

    # Add the line plot for fixed b, offsetting z values slightly
    z_offset_line = 0.1  # Offset to lift the line above the surface
    fig.add_trace(go.Scatter3d(
        x=[fixed_b] * len(m_vals), # Fixed b value for all m values
        y=m_vals.numpy(),
        z=[z + z_offset_line for z in mae_values_fixed_b], # Add offset to z values
        mode='lines',
        line=dict(color='green', width=4),
        name=f'MAE with fixed b={fixed_b}'
    ))

    # Add annotation - Adjusted for better readability, offsetting z value slightly
    annotation_m = 2.0
    annotation_mae = mean_absolute_error(annotation_m * x_actual + fixed_b, y_actual).item()
    z_offset_point = 0.1 # Offset to lift the point above the surface
    fig.add_trace(go.Scatter3d(
        x=[fixed_b],
        y=[annotation_m],
        z=[annotation_mae + z_offset_point], # Add offset to z value
        mode='markers+text',
        marker=dict(size=8, color='red'),
        text=["At m=2, decrease m to go downhill"], # Shortened text
        textposition="top center", # Changed text position to top center
        textfont=dict(size=10) # Reduced text size
    ))


    fig.update_layout(
        title='3D Error Surface: Mean Absolute Error (MAE) for f(x) = mx + b<br>with line showing gradient of m at fixed b=2', # Added <br> for title wrapping
        scene=dict(
            xaxis_title='b (y-intercept)',
            yaxis_title='m (slope)',
            zaxis_title='MAE',
            camera=dict(eye=dict(x=1.5, y=-1.5, z=0.8)) # Rotate perspective
        ),
        width=700,
        height=700
    )


    fig.show()

demo_mae_surface()

With PyTorch, we can set requires_grad=True on tensors, which automatically handles differentiation and the calculation of the gradients for us. To be honest, it looks a bit like dark magic. Let's look at some examples where we can see this functionality from PyTorch being applied to our function $f(x) = m*x + b$, along with two different inputs (x1 and x2) 🕵️‍♂️.

In [9]:
#########################################
# Example 1: Using x1 = [1.0, 2.0, 3.0]
#########################################
x1 = torch.tensor([1.0, 2.0, 3.0])
m1 = torch.tensor(2.0, requires_grad=True)
b1 = torch.tensor(1.0, requires_grad=True)

# Compute f(x1) = m1 * x1 + b1
y1 = m1 * x1 + b1
total1 = y1.sum()  # total1 = m1*(1+2+3) + b1*3 = 6*m1 + 3*b1

# Compute gradients for m1 and b1.
total1.backward()

print(f'Example 1 | x: {x1}, m: {m1}, b: {b1}')
print("Example 1 | Gradient with respect to m1 (m1.grad):", m1.grad)  # Expected: sum(x1) = 6.0
print("Example 1 | Gradient with respect to b1 (b1.grad):", b1.grad)  # Expected: len(x1) = 3
print("--------------")

#########################################
# Example 2: Using x2 = [1.0, 4.0] (different size and values)
#########################################
x2 = torch.tensor([1.0, 4.0])
m2 = torch.tensor(2.0, requires_grad=True)
b2 = torch.tensor(1.0, requires_grad=True)

# Compute f(x2) = m2 * x2 + b2
y2 = m2 * x2 + b2
total2 = y2.sum()  # total2 = m2*(1+4) + b2*2 = 5*m2 + 2*b2

# Compute gradients for m2 and b2.
total2.backward()

print(f'Example 2 | x: {x2}, m: {m2}, b: {b2}')
print("Example 2 | Gradient with respect to m2 (m2.grad):", m2.grad)  # Expected: sum(x2) = 5.0
print("Example 2 | Gradient with respect to b2 (b2.grad):", b2.grad)  # Expected: len(x2) = 2
print("--------------")

# Print an explanation that details the differentiation steps.
explanation = """
Explanation:
- **Example 1:**
   - We start with a list of numbers: x1 = [1.0, 2.0, 3.0].
   - We plug each number into our function, which means we multiply it by m (let's say m = 2) and then add b (let's say b = 1):
     - For 1.0: f(1.0) = 2 * 1 + 1 = 3
     - For 2.0: f(2.0) = 2 * 2 + 1 = 5
     - For 3.0: f(3.0) = 2 * 3 + 1 = 7
   - So, we end up with y1 = [3.0, 5.0, 7.0].
   - Then, we add these numbers together: 3 + 5 + 7 = 15.
   - If we look at our equation, we can say: total1 = m1*(1.0+2.0+3.0) + b1*3 = 6 * m1 + 3 * b1.
   - Now, when we want to see how changing m1 and b1 affects total1, we find:
     - Changing m1 gives us a "gradient" of 6.
     - Changing b1 gives us a "gradient" of 3.
   - So, m1.grad = 6.0 and b1.grad = 3.

- **Example 2:**
   - Now we have a different list: x2 = [1.0, 4.0].
   - We use the same m and b:
     - For 1.0: f(1.0) = 3 (same as before).
     - For 4.0: f(4.0) = 2 * 4 + 1 = 9.
   - So, now we have y2 = [3.0, 9.0].
   - We add these: 3 + 9 = 12.
   - In terms of our equation, total2 = m2*(1.0+4.0) + b2*2 = 5 * m2 + 2 * b2.
   - For the gradients:
     - Changing m2 gives us a "gradient" of 5.
     - Changing b2 gives us a "gradient" of 2.
   - So, m2.grad = 5.0 and b2.grad = 2.
"""
print(explanation)
Example 1 | x: tensor([1., 2., 3.]), m: 2.0, b: 1.0
Example 1 | Gradient with respect to m1 (m1.grad): tensor(6.)
Example 1 | Gradient with respect to b1 (b1.grad): tensor(3.)
--------------
Example 2 | x: tensor([1., 4.]), m: 2.0, b: 1.0
Example 2 | Gradient with respect to m2 (m2.grad): tensor(5.)
Example 2 | Gradient with respect to b2 (b2.grad): tensor(2.)
--------------

Explanation:
- **Example 1:**
   - We start with a list of numbers: x1 = [1.0, 2.0, 3.0].
   - We plug each number into our function, which means we multiply it by m (let's say m = 2) and then add b (let's say b = 1):
     - For 1.0: f(1.0) = 2 * 1 + 1 = 3
     - For 2.0: f(2.0) = 2 * 2 + 1 = 5
     - For 3.0: f(3.0) = 2 * 3 + 1 = 7
   - So, we end up with y1 = [3.0, 5.0, 7.0].
   - Then, we add these numbers together: 3 + 5 + 7 = 15.
   - If we look at our equation, we can say: total1 = m1*(1.0+2.0+3.0) + b1*3 = 6 * m1 + 3 * b1.
   - Now, when we want to see how changing m1 and b1 affects total1, we find:
     - Changing m1 gives us a "gradient" of 6.
     - Changing b1 gives us a "gradient" of 3.
   - So, m1.grad = 6.0 and b1.grad = 3.

- **Example 2:**
   - Now we have a different list: x2 = [1.0, 4.0].
   - We use the same m and b:
     - For 1.0: f(1.0) = 3 (same as before).
     - For 4.0: f(4.0) = 2 * 4 + 1 = 9.
   - So, now we have y2 = [3.0, 9.0].
   - We add these: 3 + 9 = 12.
   - In terms of our equation, total2 = m2*(1.0+4.0) + b2*2 = 5 * m2 + 2 * b2.
   - For the gradients:
     - Changing m2 gives us a "gradient" of 5.
     - Changing b2 gives us a "gradient" of 2.
   - So, m2.grad = 5.0 and b2.grad = 2.

Now we can create an interactive plot that shows the gradients with respect to a, b, and c. If the slope for a parameter is negative, we want to increase that parameter to move downhill.

In [10]:
@interact(a=1.5, b=1.5, c=1.5)
def demo_quadratic_plot_with_gradients(a, b, c):
    x,y = generate_noisy_data(mk_quad(3,2,1))
    plt.scatter(x,y)

    a_tensor = torch.tensor([float(a)], requires_grad=True)
    b_tensor = torch.tensor([float(b)], requires_grad=True)
    c_tensor = torch.tensor([float(c)], requires_grad=True)

    f = mk_quad(a_tensor, b_tensor, c_tensor)

    loss =  torch.abs(f(x) - y).mean() 
    loss.backward()

    a_grad = a_tensor.grad.item()
    b_grad = b_tensor.grad.item()
    c_grad = c_tensor.grad.item()

    plot_function(lambda x: f(x).detach(), ylim=(0,13), title=f"MAE: {loss:.2f}, dLoss/da: {a_grad:.2f}, dLoss/db: {b_grad:.2f}, dLoss/dc: {c_grad:.2f}")
In [11]:
from fastai.metrics import mae

def demo_auto_fit(steps=20):
    x, y = generate_noisy_data(mk_quad(3,2,1))

    abc = torch.tensor([1.0,1.0,1.0], requires_grad=True)
    min_loss = float('inf')
    best_abc = abc.clone().detach() # Initialize best_abc with the initial abc

    for i in range(steps):
        f = mk_quad(*abc)
        loss = mae(f(x), y)
        loss.backward()

        with torch.no_grad():
            abc -= abc.grad*0.1
            abc.grad.zero_() # Clear gradients after update

        print(f'step={i}; loss={loss.item():.2f}; abc={abc}')

        if loss < min_loss:
            min_loss = loss
            best_abc = abc.clone().detach() # Update best_abc when a lower loss is found

    return best_abc

best_abc_params = demo_auto_fit()
print(f"Best abc parameters: {best_abc_params}")
step=0; loss=3.15; abc=tensor([1.1463, 1.0042, 1.0800], requires_grad=True)
step=1; loss=2.87; abc=tensor([1.2925, 1.0084, 1.1600], requires_grad=True)
step=2; loss=2.59; abc=tensor([1.4388, 1.0126, 1.2400], requires_grad=True)
step=3; loss=2.32; abc=tensor([1.5634, 1.0347, 1.2900], requires_grad=True)
step=4; loss=2.14; abc=tensor([1.6881, 1.0568, 1.3400], requires_grad=True)
step=5; loss=1.96; abc=tensor([1.7993, 1.0905, 1.3800], requires_grad=True)
step=6; loss=1.81; abc=tensor([1.9016, 1.1337, 1.4100], requires_grad=True)
step=7; loss=1.69; abc=tensor([1.9975, 1.1811, 1.4200], requires_grad=True)
step=8; loss=1.57; abc=tensor([2.0933, 1.2284, 1.4300], requires_grad=True)
step=9; loss=1.46; abc=tensor([2.1864, 1.2705, 1.4300], requires_grad=True)
step=10; loss=1.36; abc=tensor([2.2794, 1.3126, 1.4300], requires_grad=True)
step=11; loss=1.26; abc=tensor([2.3725, 1.3547, 1.4300], requires_grad=True)
step=12; loss=1.15; abc=tensor([2.4656, 1.3968, 1.4300], requires_grad=True)
step=13; loss=1.05; abc=tensor([2.5587, 1.4389, 1.4300], requires_grad=True)
step=14; loss=0.94; abc=tensor([2.6517, 1.4811, 1.4300], requires_grad=True)
step=15; loss=0.84; abc=tensor([2.7448, 1.5232, 1.4300], requires_grad=True)
step=16; loss=0.76; abc=tensor([2.7889, 1.5358, 1.4100], requires_grad=True)
step=17; loss=0.74; abc=tensor([2.8330, 1.5484, 1.3900], requires_grad=True)
step=18; loss=0.71; abc=tensor([2.8771, 1.5611, 1.3700], requires_grad=True)
step=19; loss=0.69; abc=tensor([2.9212, 1.5737, 1.3500], requires_grad=True)
Best abc parameters: tensor([2.9212, 1.5737, 1.3500])

3. The Basics of a Neural-Network

3.1 Introducing Non-Linearity with ReLU

We've seen that simple functions like quadratics can model some data, but real-world data is rarely so straightforward. Imagine trying to predict something complex, like whether a picture is a cat or a dog, based on many pixel values (our 'dimensions'). A simple quadratic or even a single linear function just won't be flexible enough to capture the intricate patterns in such high-dimensional data.

To handle this complexity, we need to build more powerful functions. Simply combining linear functions won't solve the problem, because any combination of linear functions is still just a linear function! Linear functions can only model linear relationships in the data, while real-world data, like images of cats and dogs, is highly non-linear.
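We can verify that claim directly: stacking two linear layers without an activation collapses into a single linear layer, whereas putting a ReLU between them does not. A small sketch with arbitrary random weights (the shapes are made up for illustration):

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)   # 5 samples, 3 features
W1 = torch.randn(3, 4)  # weights of a first "layer"
W2 = torch.randn(4, 2)  # weights of a second "layer"

# Two stacked linear layers...
two_layers = (x @ W1) @ W2
# ...are exactly one linear layer with the combined weight matrix W1 @ W2
one_layer = x @ (W1 @ W2)
print(torch.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the collapse no longer holds
with_relu = torch.relu(x @ W1) @ W2
print(torch.allclose(with_relu, one_layer))
```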

To introduce non-linearity, we use activation functions. ReLU (Rectified Linear Unit) is a simple yet powerful activation function that introduces non-linearity. By applying ReLU to the output of linear functions, we can create models that can learn complex, non-linear patterns in the data. This non-linearity is what allows neural networks to model intricate relationships that simple linear models cannot. This will lead us to the idea of a ReLU, a simple activation function, and the simplest "neural network" we can build with it.

In [12]:
def rectified_linear(m,b,x):
    y = m*x+b
    return torch.clip(y, 0.)

plot_function(partial(rectified_linear, 1,1))

Combining two ReLUs allows us to create more complex, piecewise linear functions, as illustrated in the interactive plot below. This combination increases the flexibility of our model, enabling it to capture more intricate relationships in the data.

In [13]:
def double_relu(m1,b1,m2,b2,x):
    return rectified_linear(m1,b1,x) + rectified_linear(m2,b2,x)

@interact(m1=-1.5, b1=-1.5, m2=1.5, b2=1.5)
def plot_double_relu(m1, b1, m2, b2):
    plot_function(partial(double_relu, m1,b1,m2,b2), ylim=(-1,6))

3.2 Building a Neural Network from Scratch

From this point forward, we will be following this notebook: Linear model and neural net from scratch.

Important ⚠️: For simplicity, I'm skipping all the steps that involve data cleanup and preparation. This of course means that the model will most likely not perform very well.

We are using the Titanic competition from Kaggle. I have made a copy in my Hugging Face workspace, which, to be honest, I did to experiment with how Datasets work on Hugging Face.


The goal is to create a model to predict whether a passenger Survived, which is provided in our dataset.

In essence, we will now combine functions like those we've explored above, such as ReLUs, to construct a simple neural network. This network will receive passenger features as input, apply weights (similar to $m$ in our previous examples), and hopefully predict whether the passenger Survived with the lowest possible loss/error.

In [14]:
from datasets import load_dataset
dataset = load_dataset("paulopontesm/titanic", data_files={"train": "train.csv", "test": "test.csv"})
train_dataset_df = dataset["train"].to_pandas()
test_dataset_df = dataset["test"].to_pandas()

train_dataset_df
Out[14]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.250000 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.283302 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.925000 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.099998 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.050000 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.000000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.000000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.450001 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.000000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.750000 NaN Q

891 rows × 12 columns

Since we need numerical data for our model, we'll just use the columns that already contain numbers as predictors.

In [15]:
import numpy as np

train_dataset_df.describe(include=(np.number))
Out[15]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693432
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329224

Now that we have numbers for the features, we can create tensors/arrays for our features (also known as independent variables) and target (also known as the dependent variable).

Even though I mentioned above that I didn't want to do a lot of data transformations, I think we really need to remove the NaNs and normalize the numbers.

In [16]:
from torch import tensor
from fastai.data.transforms import RandomSplitter

indep_cols = ['Age', 'SibSp', 'Parch', 'Fare']

t_dep = tensor(train_dataset_df.Survived)
trn_split, val_split = RandomSplitter(seed=42)(train_dataset_df)

train_set_targets, validation_set_targets = t_dep[trn_split], t_dep[val_split]

train_df = train_dataset_df.iloc[trn_split].copy()
validation_df = train_dataset_df.iloc[val_split].copy()

# We need to do 2 things before proceeding:
# 1. Replace all the nans by the mode of that column (fit on train split).
for col in indep_cols:
    mode_val = train_df[col].mode()[0]
    train_df[col] = train_df[col].fillna(mode_val)
    validation_df[col] = validation_df[col].fillna(mode_val)

# 2. Scale columns by the max value from the train split.
for col in indep_cols:
    max_val = train_df[col].max()
    if max_val == 0:
        max_val = 1
    train_df[col] = train_df[col] / max_val
    validation_df[col] = validation_df[col] / max_val

train_set_features = tensor(train_df[indep_cols].values, dtype=torch.float)
validation_set_features = tensor(validation_df[indep_cols].values, dtype=torch.float)
len(train_set_features), len(validation_set_features)
Out[16]:
(713, 178)

Before training, let's sanity-check the split sizes.

We keep a validation split to estimate generalization during development, while the test set stays untouched for final evaluation.

In [17]:
len(train_set_features),len(validation_set_features)
Out[17]:
(713, 178)

Now we can generate random weights (ms) for each of our features. We're using a linear model, effectively calculating a weighted sum of the features: $f(x) = m_{Age}*x_{Age} + m_{SibSp}*x_{SibSp} + m_{Parch}*x_{Parch} + m_{Fare}*x_{Fare}$. We will adjust these weights to predict passenger survival based on the features.

In [18]:
def generate_random_coefficients(num_coeffs):
    torch.manual_seed(42)
    coeffs = torch.rand(num_coeffs)-0.5
    return coeffs.requires_grad_()

nn_coeffs=generate_random_coefficients(num_coeffs=train_set_features.shape[1])
nn_coeffs
Out[18]:
tensor([ 0.3823,  0.4150, -0.1171,  0.4593], requires_grad=True)
In [19]:
def calc_preds(coeffs, features): return (features*coeffs).sum(axis=1)

predictions = calc_preds(nn_coeffs, train_set_features)
predictions.topk(3)
Out[19]:
torch.return_types.topk(
values=tensor([0.6401, 0.6401, 0.6258], grad_fn=<TopkBackward0>),
indices=tensor([183,  94, 462]))
In [20]:
def calc_loss(coeffs, features, targets): return torch.abs(calc_preds(coeffs, features)-targets).mean()

loss = calc_loss(coeffs=nn_coeffs, features=train_set_features, targets=train_set_targets)
loss
Out[20]:
tensor(0.4230, grad_fn=<MeanBackward0>)
In [21]:
loss.backward()
nn_coeffs.grad
Out[21]:
tensor([ 0.1071,  0.0191,  0.0035, -0.0091])
In [22]:
def one_epoch(coeffs, lr, train_set_features_set, train_set_targets_set):
    loss = calc_loss(coeffs, train_set_features_set, train_set_targets_set)
    loss.backward()
    with torch.no_grad():
        coeffs.sub_(coeffs.grad * lr)
        coeffs.grad.zero_()
    print(f"{loss:.3f}", end="; ")
    
def train_model(train_set_features_set, train_set_targets_set, epochs=60, lr=4):
    torch.manual_seed(442)
    coeffs = generate_random_coefficients(num_coeffs=train_set_features_set.shape[1])
    for i in range(epochs): one_epoch(coeffs, lr=lr, train_set_features_set=train_set_features_set, train_set_targets_set=train_set_targets_set)
    return coeffs

final_weights = train_model(train_set_features, train_set_targets)

def show_coeffs(coeffs): return dict(zip(indep_cols, coeffs.requires_grad_(False)))

show_coeffs(final_weights)
0.423; 0.385; 0.470; 0.421; 0.374; 0.461; 0.501; 0.442; 0.391; 0.424; 0.510; 0.440; 0.389; 0.426; 0.495; 0.432; 0.381; 0.444; 0.495; 0.429; 0.378; 0.447; 0.489; 0.422; 0.371; 0.452; 0.485; 0.416; 0.366; 0.453; 0.480; 0.412; 0.363; 0.394; 0.475; 0.407; 0.364; 0.379; 0.428; 0.476; 0.406; 0.365; 0.427; 0.371; 0.437; 0.478; 0.405; 0.368; 0.438; 0.377; 0.421; 0.477; 0.404; 0.370; 0.442; 0.381; 0.412; 0.477; 0.403; 0.371; 
Out[22]:
{'Age': tensor(0.7012),
 'SibSp': tensor(-0.3255),
 'Parch': tensor(0.4457),
 'Fare': tensor(1.9605)}

We have weights, let's do predictions then.

In [23]:
calc_preds(final_weights, validation_set_features)
Out[23]:
tensor([0.2747, 0.2570, 0.2551, 0.4518, 0.2866, 0.3618, 0.1635, 0.3150, 0.1341, 0.3466, 0.3335, 0.3193, 0.1141, 0.2576, 0.3470, 0.5280,
        0.7175, 0.4107, 0.4690, 0.3270, 0.2576, 0.6322, 0.8274, 0.8114, 0.2165, 0.2791, 0.0857, 0.6890, 0.1942, 0.2571, 0.3689, 0.2762,
        0.2102, 0.3198, 0.3315, 0.1244, 0.5755, 0.4926, 0.2582, 0.2265, 0.2202, 0.2582, 0.5939, 0.5951, 0.2203, 0.9118, 0.4625, 0.1391,
        0.2540, 0.2362, 0.6799, 0.2099, 0.1912, 0.2639, 0.2865, 0.3434, 0.2571, 0.3599, 0.2298, 0.3316, 0.3904, 0.1474, 0.8728, 0.2293,
        0.4757, 0.5690, 0.2771, 0.4619, 0.4719, 0.4780, 0.2766, 0.4343, 0.3339, 0.2638, 0.3266, 0.1814, 0.2582, 0.6638, 0.2298, 0.5027,
        0.1516, 1.2702, 0.2481, 0.2551, 0.5139, 0.5039, 0.5466, 0.6997, 0.3164, 0.2677, 0.3352, 0.2582, 0.6915, 0.3119, 0.0737, 0.3904,
        0.5353, 0.2350, 0.2381, 0.2059, 0.2342, 0.2571, 0.3334, 0.3145, 0.4712, 0.4477, 0.3206, 0.5618, 0.5447, 0.2610, 0.5607, 0.3219,
        0.3422, 0.6164, 0.2582, 0.8158, 0.7685, 0.2571, 0.4272, 1.4952, 0.2150, 0.9052, 0.2571, 0.1533, 0.2550, 0.3896, 0.2666, 0.2764,
        0.1880, 0.4757, 0.4826, 0.2483, 0.3999, 0.3577, 0.7474, 0.2108, 0.6107, 0.3111, 0.3823, 0.4120, 0.3340, 0.7515, 0.4531, 0.3329,
        0.2771, 0.8546, 0.3932, 0.4615, 0.3767, 0.5560, 0.1580, 0.4056, 0.2392, 0.0988, 0.2350, 0.3166, 0.2582, 0.1864, 0.3719, 0.2829,
        0.4168, 0.3402, 0.2631, 0.9082, 0.5921, 1.2904, 0.3853, 0.4477, 1.8693, 0.6378, 0.1522, 0.2355, 0.4226, 0.2638, 0.4472, 0.7614,
        0.3441, 0.1797])

It's hard not to notice that we should be predicting a 0 or 1 value, but instead we're getting a continuous range of values, some even greater than 1.

For simplicity, let's ignore this and say that everything above 0.5 survived.

In [24]:
preds = calc_preds(final_weights, validation_set_features)

print(f"True count was {torch.sum(preds>0.5)} should have been {torch.sum(validation_set_targets.bool())}")
print(f"False count was {torch.sum(preds<=0.5)} should have been {len( validation_set_targets.bool()) - torch.sum(validation_set_targets.bool())}")
True count was 41 should have been 72
False count was 137 should have been 106

With this we can use our validation set and calculate the % of predictions that we are getting correctly.

In [25]:
def calc_accuracy(predictions, validation_set_targets):
    bool_predictions = predictions > 0.5
    bool_validation_set_targets = validation_set_targets.bool()
    correct_predictions = bool_validation_set_targets == bool_predictions
    accuracy_float = correct_predictions.float()
    accuracy_val = accuracy_float.mean()
    return accuracy_val

accuracy_result = calc_accuracy(predictions=preds, validation_set_targets=validation_set_targets)
print(accuracy_result)
tensor(0.6348)

Looks like we're doing only slightly better than flipping a coin. I say this is a success 🤔❓🤔 I don't think so...

From the counts above in the current run, the model predicts "not survived" much more often than "survived". That can still produce moderate accuracy on this split, while missing many passengers who actually survived.

The Linear model and neural net from scratch notebook goes further on this exercise. It uses stronger preprocessing, encodes non-numeric columns, and applies a sigmoid so outputs stay between 0 and 1.
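As a small taste of that sigmoid fix, squashing the weighted sum through `torch.sigmoid` keeps every prediction strictly between 0 and 1, so the 0.5 threshold is always meaningful. A sketch (the coefficients and features below are made-up numbers, not our trained weights):

```python
import torch

def calc_preds_sigmoid(coeffs, features):
    # Same weighted sum as calc_preds, then squashed into (0, 1)
    return torch.sigmoid((features * coeffs).sum(dim=1))

coeffs = torch.tensor([0.7, -0.3, 0.4, 2.0])
features = torch.tensor([[0.2, 2.0, 0.0, 0.05],
                         [0.9, 0.5, 0.2, 1.00]])
preds = calc_preds_sigmoid(coeffs, features)
print(preds)        # both values lie strictly between 0 and 1
print(preds > 0.5)  # first row falls below the threshold, second above
```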

For me, this was enough to get a better overview of what's happening inside a neural network.

3.3 Do Deep Learning

In the section above we implemented a simple Neural Network. Now let's explore Deep Learning, which is what truly unlocks the power of Neural Networks.

Deep Learning involves creating Neural Networks with multiple layers. Instead of a single layer, we stack layers of neurons, allowing the network to learn more complex patterns and representations from the data.
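Concretely, a forward pass through such stacked layers is a chain of matrix multiplications with a ReLU between them. Here is a sketch of what that could look like (the layer shapes are made up, and leaving the last layer without an activation is my assumption, not necessarily what the notebook does later):

```python
import torch
import torch.nn.functional as F

def calc_preds_deep(layers, features):
    # Push the features through every hidden layer, applying ReLU in between
    x = features
    for layer in layers[:-1]:
        x = F.relu(x @ layer)
    # Final layer: plain weighted sum producing one output per row
    return (x @ layers[-1]).squeeze(-1)

# Tiny example: 4 features -> 10 neurons -> 10 neurons -> 1 output
torch.manual_seed(42)
layers = [torch.rand(4, 10) - 0.5,
          torch.rand(10, 10) - 0.5,
          torch.rand(10, 1) - 0.5]
features = torch.rand(6, 4)  # 6 hypothetical passengers
print(calc_preds_deep(layers, features).shape)  # torch.Size([6])
```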

In [26]:
def generate_random_coefficients_for_deep_learning(n_coeff, num_neurons_per_hidden_layer=[10, 10]):
    torch.manual_seed(42)
    # Define the number of neurons for each layer, including input, hidden, and output layers.
    # The input layer size is n_coeff, hidden layers sizes are from num_neurons_per_hidden_layer, and output layer size is 1.
    num_neurons = [n_coeff] + num_neurons_per_hidden_layer + [1]
    layers = []
    for i in range(len(num_neurons)-1):
        # Determine the size of the input for the current layer from the previous layer's neuron count
        layer_input_size = num_neurons[i]
        # Determine the size of the output for the current layer from the current layer's neuron count
        layer_output_size = num_neurons[i+1]
        # Initialize a layer with random weights between -0.5 and 0.5.
        # torch.rand generates uniform random numbers between 0 and 1, then we shift and scale to get range [-0.5, 0.5].
        # requires_grad_() is set to True to enable gradient tracking for these tensors, which is needed for backpropagation.
        layer = (torch.rand(layer_input_size, layer_output_size)-0.5).requires_grad_()
        layers.append(layer)
    return layers

dnn_layers_coeffs = generate_random_coefficients_for_deep_learning(n_coeff=train_set_features.shape[1], num_neurons_per_hidden_layer=[10, 10])
dnn_layers_coeffs
Out[26]:
[tensor([[ 0.3823,  0.4150, -0.1171,  0.4593, -0.1096,  0.1009, -0.2434,  0.2936,  0.4408, -0.3668],
         [ 0.4346,  0.0936,  0.3694,  0.0677,  0.2411, -0.0706,  0.3854,  0.0739, -0.2334,  0.1274],
         [-0.2304, -0.0586, -0.2031,  0.3317, -0.3947, -0.2305, -0.1412, -0.3006,  0.0472, -0.4938],
         [ 0.4516, -0.4247,  0.3860,  0.0832, -0.1624,  0.3090,  0.0779,  0.4040,  0.0547, -0.1577]], requires_grad=True),
 tensor([[ 0.1343, -0.1356,  0.2104,  0.4464,  0.2890, -0.2186,  0.2886,  0.0895,  0.2539, -0.3048],
         [-0.4950, -0.1932, -0.3835,  0.4103,  0.1440,  0.2071,  0.1581, -0.0087,  0.3913, -0.3553],
         [ 0.0315, -0.3413,  0.1542, -0.1722,  0.1532, -0.1042,  0.4147, -0.2964, -0.2982, -0.2982],
         [ 0.4497,  0.1666,  0.4811, -0.4126, -0.4959, -0.3912, -0.3363,  0.2025,  0.1790,  0.4155],
         [-0.2582, -0.3409,  0.2653, -0.2021,  0.3035, -0.1187,  0.2860, -0.3885, -0.2523,  0.1524],
         [ 0.1057, -0.1275,  0.2980,  0.3399, -0.3626, -0.2669,  0.4578, -0.1687, -0.1773, -0.4838],
         [-0.2863,  0.1249, -0.0660, -0.3629,  0.0117, -0.3415, -0.4242, -0.2753, -0.4376, -0.3184],
         [ 0.4998,  0.0944,  0.1541, -0.4663, -0.3284, -0.1664,  0.0782, -0.4400, -0.2154, -0.2993],
         [ 0.0014, -0.1861, -0.0346, -0.3388, -0.3432, -0.2917, -0.1711, -0.3946,  0.4192, -0.0992],
         [ 0.4302,  0.1558, -0.4234,  0.3460, -0.1376, -0.1917, -0.4150, -0.4971,  0.1431, -0.1092]], requires_grad=True),
 tensor([[ 0.1947],
         [-0.4103],
         [ 0.3712],
         [-0.3670],
         [-0.0863],
         [ 0.1044],
         [ 0.2581],
         [ 0.4037],
         [ 0.4555],
         [-0.3965]], requires_grad=True)]

We can test how we do without any training.

In [27]:
def calc_preds_for_deep_learning(coeffs, features):
    # @ is matrix multiplication in Python
    # It was introduced in Python 3.5 as part of [PEP 465](https://peps.python.org/pep-0465/)
    layer_features = features
    for layer in coeffs[:-1]:
        layer_features = layer_features @ layer
    layer_features = layer_features @ coeffs[-1]
    return layer_features.squeeze()

def calc_loss_for_deep_learning(coeffs, features, targets): return torch.abs(calc_preds_for_deep_learning(coeffs, features)-targets).mean()

dnn_preds = calc_preds_for_deep_learning(coeffs=dnn_layers_coeffs, features=validation_set_features)

print(f"True count was {torch.sum(dnn_preds>0.5)} should have been {torch.sum(validation_set_targets.bool())}")
print(f"False count was {torch.sum(dnn_preds<=0.5)} should have been {len(validation_set_targets.bool()) - torch.sum(validation_set_targets.bool())}")
True count was 16 should have been 72
False count was 162 should have been 106

And we need to do gradient descent on the coefficients of every layer.

In [28]:
def one_epoch_for_deep_learning(coeffs, lr, train_set_features_set, train_set_targets_set):
    loss = calc_loss_for_deep_learning(coeffs, train_set_features_set, train_set_targets_set)
    loss.backward()
    with torch.no_grad():
        for layer in coeffs:
            layer -= layer.grad * lr
            layer.grad.zero_()
    
def train_model_for_deep_learning(train_set_features_set, train_set_targets_set, num_neurons_per_hidden_layer=[10, 10], epochs=60, lr=4):
    torch.manual_seed(442)
    coeffs = generate_random_coefficients_for_deep_learning(n_coeff=train_set_features_set.shape[1], num_neurons_per_hidden_layer=num_neurons_per_hidden_layer)
    for i in range(epochs): one_epoch_for_deep_learning(coeffs, lr=lr, train_set_features_set=train_set_features_set, train_set_targets_set=train_set_targets_set)
    return coeffs # Returns the trained coefficients, which have the same structure as generate_random_coefficients_for_deep_learning

Let's test it then with different combinations of hidden layers and neurons per layer...

In [29]:
for num_neurons in [[10, 10], [20, 20], [5, 5, 5], [30], [], [2], [50], [2, 2], [50, 50], [5, 10, 5], [2, 2, 2, 2]]:
    dnn_final_weights = train_model_for_deep_learning(train_set_features, train_set_targets, num_neurons_per_hidden_layer=num_neurons)
    dnn_preds = calc_preds_for_deep_learning(coeffs=dnn_final_weights, features=validation_set_features)
    accuracy = calc_accuracy(predictions=dnn_preds, validation_set_targets=validation_set_targets)

    print(f"Hidden layers: {num_neurons}")
    print(f"True count was {torch.sum(dnn_preds>0.5)} should have been {torch.sum(validation_set_targets.bool())}")
    print(f"False count was {torch.sum(dnn_preds<=0.5)} should have been {len(validation_set_targets.bool()) - torch.sum(validation_set_targets.bool())}")
    print(f"Accuracy: {accuracy}")
    print("-" * 20) # Separator for readability
Hidden layers: [10, 10]
True count was 0 should have been 72
False count was 0 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [20, 20]
True count was 0 should have been 72
False count was 0 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [5, 5, 5]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [30]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: []
True count was 41 should have been 72
False count was 137 should have been 106
Accuracy: 0.6348314881324768
--------------------
Hidden layers: [2]
True count was 2 should have been 72
False count was 176 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [50]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [2, 2]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [50, 50]
True count was 0 should have been 72
False count was 0 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [5, 10, 5]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [2, 2, 2, 2]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------

Not a lot has changed...
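
A big part of why the extra layers didn't help: without a nonlinearity between them, a stack of matrix multiplications collapses into a single matrix multiplication, so the "deep" network can only represent the same linear functions as the single-layer one. A minimal sketch of the collapse (with made-up shapes and weights):

```python
import torch

torch.manual_seed(0)
features = torch.rand(5, 4)        # 5 samples, 4 features
w1 = torch.rand(4, 10) - 0.5       # "hidden" layer weights
w2 = torch.rand(10, 1) - 0.5       # output layer weights

# What the activation-free network computes: (x @ w1) @ w2
two_layer = (features @ w1) @ w2
# An equivalent single-layer network: x @ (w1 @ w2)
one_layer = features @ (w1 @ w2)

print(torch.allclose(two_layer, one_layer, atol=1e-6))  # True
```

Matrix multiplication is associative, so the two-layer model is exactly one linear layer in disguise.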

Just for fun, we can see how adding a sigmoid and ReLU would affect the results... The code is Ctrl+C Ctrl+V from above, but with a smarter_calc_preds_for_deep_learning.

In [30]:
import torch.nn.functional as F

def smarter_calc_preds_for_deep_learning(coeffs, features):
    # @ is matrix multiplication in Python
    # It was introduced in Python 3.5 as part of [PEP 465](https://peps.python.org/pep-0465/)
    layer_features = features
    for layer in coeffs[:-1]:
        layer_features = F.relu(layer_features @ layer)
    layer_features = layer_features @ coeffs[-1]
    return torch.sigmoid(layer_features.squeeze())

def smarter_calc_loss_for_deep_learning(coeffs, features, targets):
    predictions = smarter_calc_preds_for_deep_learning(coeffs, features)
    return F.binary_cross_entropy(predictions, targets) # Changed loss to Binary Cross Entropy

def smarter_one_epoch_for_deep_learning(coeffs, lr, train_set_features_set, train_set_targets_set):
    loss = smarter_calc_loss_for_deep_learning(coeffs, train_set_features_set, train_set_targets_set)
    loss.backward()
    with torch.no_grad():
        for layer in coeffs:
            layer -= layer.grad * lr
            layer.grad.zero_()
    
def smarter_train_model_for_deep_learning(train_set_features_set, train_set_targets_set, num_neurons_per_hidden_layer=[10, 10], epochs=60, lr=4):
    torch.manual_seed(442)
    coeffs = generate_random_coefficients_for_deep_learning(n_coeff=train_set_features_set.shape[1], num_neurons_per_hidden_layer=num_neurons_per_hidden_layer)
    for i in range(epochs): smarter_one_epoch_for_deep_learning(coeffs, lr=lr, train_set_features_set=train_set_features_set, train_set_targets_set=train_set_targets_set)
    return coeffs # Returns the trained coefficients, which have the same structure as generate_random_coefficients_for_deep_learning
In [31]:
for num_neurons in [[10, 10], [20, 20], [5, 5, 5], [30], [], [2], [50], [2, 2], [50, 50], [5, 10, 5], [2, 2, 2, 2]]:
    dnn_final_weights = smarter_train_model_for_deep_learning(train_set_features, train_set_targets.float(), num_neurons_per_hidden_layer=num_neurons)
    dnn_preds = smarter_calc_preds_for_deep_learning(coeffs=dnn_final_weights, features=validation_set_features)
    accuracy = calc_accuracy(predictions=dnn_preds, validation_set_targets=validation_set_targets)

    print(f"Hidden layers: {num_neurons}")
    print(f"True count was {torch.sum(dnn_preds>0.5)} should have been {torch.sum(validation_set_targets.bool())}")
    print(f"False count was {torch.sum(dnn_preds<=0.5)} should have been {len(validation_set_targets.bool()) - torch.sum(validation_set_targets.bool())}")
    print(f"Accuracy: {accuracy}")
    print("-" * 20) # Separator for readability
Hidden layers: [10, 10]
True count was 26 should have been 72
False count was 152 should have been 106
Accuracy: 0.6853932738304138
--------------------
Hidden layers: [20, 20]
True count was 11 should have been 72
False count was 167 should have been 106
Accuracy: 0.6123595237731934
--------------------
Hidden layers: [5, 5, 5]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [30]
True count was 29 should have been 72
False count was 149 should have been 106
Accuracy: 0.6910112500190735
--------------------
Hidden layers: []
True count was 11 should have been 72
False count was 167 should have been 106
Accuracy: 0.6348314881324768
--------------------
Hidden layers: [2]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [50]
True count was 33 should have been 72
False count was 145 should have been 106
Accuracy: 0.6910112500190735
--------------------
Hidden layers: [2, 2]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------
Hidden layers: [50, 50]
True count was 37 should have been 72
False count was 141 should have been 106
Accuracy: 0.7134831547737122
--------------------
Hidden layers: [5, 10, 5]
True count was 17 should have been 72
False count was 161 should have been 106
Accuracy: 0.6460674405097961
--------------------
Hidden layers: [2, 2, 2, 2]
True count was 0 should have been 72
False count was 178 should have been 106
Accuracy: 0.5955055952072144
--------------------

Interesting. That definitely improved.

I will stop here for now. However, the next step will likely be to add the boolean variables like is_male, is_female, is_class_1, etc. If I understood that correctly and I'm not making any mistakes, it should bring me to around 80% accuracy, like we see on the fast.ai notebook.
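
As a preview of that next step, pandas can build those boolean indicator columns with pd.get_dummies. A small sketch on a hypothetical mini-frame standing in for the raw Titanic columns:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the raw Titanic columns
df = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 2]})

# pd.get_dummies turns each category into its own boolean indicator column
encoded = pd.get_dummies(df, columns=["Sex", "Pclass"])
print(encoded.columns.tolist())
# ['Sex_female', 'Sex_male', 'Pclass_1', 'Pclass_2', 'Pclass_3']
```

Each resulting column can then be fed to the network as just another numeric feature.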

4. Binary Classification Metrics

4.1 What BCE Means (From Simple Examples to the Model)

Let's start with BCE itself before talking about NN internals. BCE (Binary Cross-Entropy) is the loss that scores predicted probabilities in binary classification.

For one point with target y in {0,1} and predicted probability p for class 1: $$-\left(y\log(p) + (1-y)\log(1-p)\right)$$

Intuition: confident correct predictions get small loss, confident wrong predictions get large loss.

In [32]:
eps = 1e-7

def bce_for_one_point(y, p, eps=1e-7):
    y_t = torch.tensor(float(y))
    p_t = torch.tensor(float(p)).clamp(eps, 1 - eps)
    return float((-(y_t * torch.log(p_t) + (1 - y_t) * torch.log(1 - p_t))).item())

single_examples = [
    ("true=1, p=0.95 (good)", 1, 0.95),
    ("true=1, p=0.55 (uncertain)", 1, 0.55),
    ("true=1, p=0.10 (confident wrong)", 1, 0.10),
    ("true=0, p=0.10 (good)", 0, 0.10),
]

for label, y, p in single_examples:
    print(f"{label:<34} -> BCE = {bce_for_one_point(y, p):.4f}")

print("\nHand-picked mini-batch (not random):")
# These values are chosen on purpose: mostly good predictions plus one confident wrong case.
mini_targets = torch.tensor([1, 1, 1, 0, 0, 0], dtype=torch.float32)
mini_probs = torch.tensor([0.95, 0.70, 0.40, 0.10, 0.30, 0.85], dtype=torch.float32)

mini_losses = -(
    mini_targets * torch.log(mini_probs.clamp(eps, 1 - eps))
    + (1 - mini_targets) * torch.log((1 - mini_probs).clamp(eps, 1 - eps))
)

for i, (y, p, l) in enumerate(zip(mini_targets, mini_probs, mini_losses), start=1):
    print(f"row {i}: y={int(y.item())} p={p.item():.2f} -> loss={l.item():.4f}")

print(f"Mean BCE over mini-batch: {mini_losses.mean().item():.4f}")
true=1, p=0.95 (good)              -> BCE = 0.0513
true=1, p=0.55 (uncertain)         -> BCE = 0.5978
true=1, p=0.10 (confident wrong)   -> BCE = 2.3026
true=0, p=0.10 (good)              -> BCE = 0.1054

Hand-picked mini-batch (not random):
row 1: y=1 p=0.95 -> loss=0.0513
row 2: y=1 p=0.70 -> loss=0.3567
row 3: y=1 p=0.40 -> loss=0.9163
row 4: y=0 p=0.10 -> loss=0.1054
row 5: y=0 p=0.30 -> loss=0.3567
row 6: y=0 p=0.85 -> loss=1.8971
Mean BCE over mini-batch: 0.6139
In [33]:
prob_axis = torch.linspace(0.001, 0.999, 200)
loss_if_true_1 = -torch.log(prob_axis)
loss_if_true_0 = -torch.log(1 - prob_axis)

plt.figure(figsize=(6, 4))
plt.plot(prob_axis, loss_if_true_1, label='If true label is 1')
plt.plot(prob_axis, loss_if_true_0, label='If true label is 0')
plt.title('BCE penalty curve')
plt.xlabel('Predicted probability of class 1')
plt.ylabel('Loss')
plt.legend()
plt.grid(alpha=0.25)
plt.show()

# Hand-picked one-feature example to show where probabilities come from in a model.
toy_x = torch.tensor([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5], dtype=torch.float32)
toy_y = torch.tensor([0, 0, 0, 1, 1, 1], dtype=torch.float32)

toy_w, toy_b = 1.30, -0.20
toy_logits = toy_w * toy_x + toy_b
toy_probs = torch.sigmoid(toy_logits)
toy_bce = -(toy_y * torch.log(toy_probs.clamp(eps, 1 - eps)) + (1 - toy_y) * torch.log((1 - toy_probs).clamp(eps, 1 - eps))).mean()

print(f"Toy sigmoid: p(x) = sigmoid({toy_w:.2f} * x + {toy_b:.2f})")
print(f"Toy mean BCE from model probabilities: {toy_bce.item():.4f}")

x_grid = torch.linspace(float(toy_x.min().item()) - 0.5, float(toy_x.max().item()) + 0.5, 300)
p_grid = torch.sigmoid(toy_w * x_grid + toy_b)

plt.figure(figsize=(7, 4))
plt.scatter(toy_x[toy_y == 0], toy_y[toy_y == 0], color='tomato', s=70, label='Class 0')
plt.scatter(toy_x[toy_y == 1], toy_y[toy_y == 1], color='seagreen', s=70, label='Class 1')
plt.plot(x_grid, p_grid, color='black', linewidth=2, label='Sigmoid probability')

for xi, yi, pi in zip(toy_x, toy_y, toy_probs):
    plt.vlines(float(xi.item()), float(yi.item()), float(pi.item()), colors='gray', alpha=0.35, linewidth=1)

plt.title('Toy model: feature x to probability')
plt.xlabel('Feature x (hand-picked values)')
plt.ylabel('Predicted probability of class 1')
plt.ylim(-0.05, 1.05)
plt.grid(alpha=0.25)
plt.legend()
plt.show()
[figure: BCE penalty curve]
Toy sigmoid: p(x) = sigmoid(1.30 * x + -0.20)
Toy mean BCE from model probabilities: 0.1995
[figure: Toy model: feature x to probability]

In the NN case, the math is exactly the same:

  1. Replace w*x+b with NN logits.
  2. Apply sigmoid to get probabilities.
  3. Compute BCE on those probabilities.

Only the source of probabilities changes; BCE itself does not.
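
As a sanity check on that claim, computing sigmoid-then-BCE by hand matches PyTorch's fused F.binary_cross_entropy_with_logits on the same inputs. A small sketch with made-up logits standing in for NN outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(6)                          # stand-in for NN outputs (pre-sigmoid)
targets = torch.tensor([1., 0., 1., 1., 0., 0.])

probs = torch.sigmoid(logits)                    # step 2: logits -> probabilities
bce_on_probs = F.binary_cross_entropy(probs, targets)            # step 3: BCE on probabilities
bce_fused = F.binary_cross_entropy_with_logits(logits, targets)  # same math, one fused call

print(torch.allclose(bce_on_probs, bce_fused, atol=1e-6))  # True
```

The fused version is also numerically safer, which is why libraries prefer it in practice.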

4.2 Binary Classification Terms in Plain English

Now let's run the same ideas on our NN validation probabilities. The goal is to make terms like BCE, confusion matrix, ROC, AUC, precision, recall, and F1 feel concrete.

Quick reminder for the next cells: we are using the same BCE formula, now on NN validation probabilities.

In [34]:
example_layers = [50, 50]
example_weights = smarter_train_model_for_deep_learning(
    train_set_features,
    train_set_targets.float(),
    num_neurons_per_hidden_layer=example_layers
)
example_probs = smarter_calc_preds_for_deep_learning(coeffs=example_weights, features=validation_set_features).detach().float()
example_targets = validation_set_targets.float()

print(f"Example model layers: {example_layers}")
print(f"Validation rows: {len(example_targets)}")
Example model layers: [50, 50]
Validation rows: 178
In [35]:
def manual_bce(probs, targets, eps=1e-7):
    probs = probs.clamp(eps, 1 - eps)
    targets = targets.float()
    return -(targets * torch.log(probs) + (1 - targets) * torch.log(1 - probs)).mean()

toy_bce_from_helper = manual_bce(toy_probs, toy_y)
nn_bce_manual = manual_bce(example_probs, example_targets)
nn_bce_library = torch.nn.functional.binary_cross_entropy(example_probs, example_targets)

print(f"Toy BCE (manual helper): {toy_bce_from_helper.item():.6f}")
print(f"NN validation BCE (manual): {nn_bce_manual.item():.6f}")
print(f"NN validation BCE (PyTorch): {nn_bce_library.item():.6f}")

prob_axis = torch.linspace(0.001, 0.999, 200)
loss_if_true_1 = -torch.log(prob_axis)
loss_if_true_0 = -torch.log(1 - prob_axis)

plt.figure(figsize=(6, 4))
plt.plot(prob_axis, loss_if_true_1, label='If true label is 1')
plt.plot(prob_axis, loss_if_true_0, label='If true label is 0')
plt.title('BCE penalty curve')
plt.xlabel('Predicted probability of survival')
plt.ylabel('Loss')
plt.legend()
plt.grid(alpha=0.25)
plt.show()
Toy BCE (manual helper): 0.199508
NN validation BCE (manual): 0.614086
NN validation BCE (PyTorch): 0.614086
[figure: BCE penalty curve]

A confusion matrix is a 2x2 table that counts prediction outcomes:

  1. TP (true positive): predicted survived, and actually survived.
  2. TN (true negative): predicted not survived, and actually not survived.
  3. FP (false positive): predicted survived, but actually not survived.
  4. FN (false negative): predicted not survived, but actually survived.

From those counts:

  1. Precision: out of predicted survivors, how many were truly survivors?
  2. Recall: out of all true survivors, how many did we catch?
  3. F1: a balance score between precision and recall.
In [36]:
def preds_from_threshold(probs, threshold):
    return (probs >= threshold).int()

def confusion_counts(preds, targets):
    targets = targets.int()
    tp = int(((preds == 1) & (targets == 1)).sum().item())
    tn = int(((preds == 0) & (targets == 0)).sum().item())
    fp = int(((preds == 1) & (targets == 0)).sum().item())
    fn = int(((preds == 0) & (targets == 1)).sum().item())
    return tp, tn, fp, fn

def metrics_from_counts(tp, tn, fp, fn, eps=1e-9):
    accuracy = (tp + tn) / (tp + tn + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return accuracy, precision, recall, f1
In [37]:
default_threshold = 0.5
default_preds = preds_from_threshold(example_probs, default_threshold)
tp, tn, fp, fn = confusion_counts(default_preds, example_targets)
acc, prec, rec, f1 = metrics_from_counts(tp, tn, fp, fn)

print(f"threshold={default_threshold:.2f}")
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
print(f"check total rows: {tp + tn + fp + fn} == {len(example_targets)}")

cm = torch.tensor([[tn, fp], [fn, tp]], dtype=torch.float)
plt.figure(figsize=(4.5, 4.2))
plt.imshow(cm, cmap='Blues')
plt.title('Confusion matrix (threshold = 0.50)')
plt.xticks([0, 1], ['Pred: not survived', 'Pred: survived'], rotation=20)
plt.yticks([0, 1], ['Actual: not survived', 'Actual: survived'])
for i in range(2):
    for j in range(2):
        plt.text(j, i, int(cm[i, j].item()), ha='center', va='center', color='black')
plt.colorbar(fraction=0.046, pad=0.04)
plt.tight_layout()
plt.show()
threshold=0.50
TP=29 TN=98 FP=8 FN=43
accuracy=0.7135 precision=0.7838 recall=0.4028 f1=0.5321
check total rows: 178 == 178
[figure: Confusion matrix (threshold = 0.50)]

The threshold is the cutoff used to convert a probability into a class label: if the probability is above the threshold, we predict "survived".

Lower threshold usually increases recall (we catch more true survivors), but also increases false positives. Higher threshold usually increases precision, but also increases false negatives.

In [38]:
threshold_grid = torch.linspace(0.05, 0.95, 19)
precision_vals, recall_vals, f1_vals = [], [], []

for thr in threshold_grid:
    preds = preds_from_threshold(example_probs, float(thr.item()))
    tp, tn, fp, fn = confusion_counts(preds, example_targets)
    _, precision, recall, f1 = metrics_from_counts(tp, tn, fp, fn)
    precision_vals.append(precision)
    recall_vals.append(recall)
    f1_vals.append(f1)

best_idx = int(torch.tensor(f1_vals).argmax().item())
best_threshold = float(threshold_grid[best_idx].item())
best_preds = preds_from_threshold(example_probs, best_threshold)
best_tp, best_tn, best_fp, best_fn = confusion_counts(best_preds, example_targets)
best_acc, best_prec, best_rec, best_f1 = metrics_from_counts(best_tp, best_tn, best_fp, best_fn)

plt.figure(figsize=(7, 4))
plt.plot(threshold_grid, precision_vals, marker='o', label='Precision')
plt.plot(threshold_grid, recall_vals, marker='o', label='Recall')
plt.plot(threshold_grid, f1_vals, marker='o', label='F1')
plt.axvline(0.5, linestyle='--', color='gray', label='Default threshold (0.50)')
plt.axvline(best_threshold, linestyle='--', color='green', label=f'Best F1 threshold ({best_threshold:.2f})')
plt.title('Threshold vs precision, recall, and F1')
plt.xlabel('Threshold')
plt.ylabel('Score')
plt.ylim(0, 1.05)
plt.grid(alpha=0.25)
plt.legend()
plt.show()

print(f"Best threshold by validation F1: {best_threshold:.2f}")
print(f"TP={best_tp} TN={best_tn} FP={best_fp} FN={best_fn}")
print(f"accuracy={best_acc:.4f} precision={best_prec:.4f} recall={best_rec:.4f} f1={best_f1:.4f}")
[figure: Threshold vs precision, recall, and F1]
Best threshold by validation F1: 0.40
TP=41 TN=88 FP=18 FN=31
accuracy=0.7247 precision=0.6949 recall=0.5694 f1=0.6260

ROC curve (Receiver Operating Characteristic) shows model behavior across all thresholds. Each point is:

  1. TPR (true positive rate): same idea as recall.
  2. FPR (false positive rate): how often we wrongly flag non-survivors as survivors.

AUC (Area Under the ROC Curve) is one summary number from 0 to 1. Closer to 1 means better ranking quality; near 0.5 is close to random ranking.

In [39]:
def roc_points_from_thresholds(probs, targets, thresholds):
    fpr_vals, tpr_vals = [], []
    for thr in thresholds:
        preds = preds_from_threshold(probs, float(thr.item()))
        tp, tn, fp, fn = confusion_counts(preds, targets)
        tpr = tp / (tp + fn + 1e-9)
        fpr = fp / (fp + tn + 1e-9)
        fpr_vals.append(fpr)
        tpr_vals.append(tpr)

    fpr_tensor = torch.tensor(fpr_vals, dtype=torch.float)
    tpr_tensor = torch.tensor(tpr_vals, dtype=torch.float)
    order = torch.argsort(fpr_tensor)
    return fpr_tensor[order], tpr_tensor[order]

def manual_auc_trapezoid(fpr, tpr):
    auc_val = torch.trapz(tpr, fpr).item()
    return float(max(0.0, min(1.0, auc_val)))

roc_thresholds = torch.linspace(0.0, 1.0, 201)
roc_fpr, roc_tpr = roc_points_from_thresholds(example_probs, example_targets, roc_thresholds)
manual_auc = manual_auc_trapezoid(roc_fpr, roc_tpr)

plt.figure(figsize=(5.5, 5))
plt.plot(roc_fpr, roc_tpr, label=f'Model ROC (AUC={manual_auc:.3f})')
plt.plot([0, 1], [0, 1], '--', color='gray', label='Random baseline')
plt.title('ROC curve (manual thresholds)')
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.grid(alpha=0.25)
plt.legend()
plt.show()

print(f"Manual ROC-AUC: {manual_auc:.6f}")
[figure: ROC curve (manual thresholds)]
Manual ROC-AUC: 0.709513

Libraries automate the same math shown above, so in practice we often call a helper instead of coding each step by hand.

In [40]:
try:
    from sklearn.metrics import roc_auc_score
    sklearn_auc = roc_auc_score(example_targets.int().numpy(), example_probs.numpy())
    print(f"Manual ROC-AUC : {manual_auc:.6f}")
    print(f"sklearn ROC-AUC: {sklearn_auc:.6f}")
except Exception as e:
    print(f"sklearn comparison skipped: {e}")
Manual ROC-AUC : 0.709513
sklearn ROC-AUC: 0.709119

So now we have a clearer checklist:

  1. BCE tells us probability-quality error.
  2. Confusion matrix tells us what kinds of mistakes we make.
  3. Threshold tuning lets us pick the precision/recall tradeoff we want.
  4. ROC-AUC tells us ranking quality across all thresholds.
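
Assuming scikit-learn is available (it was already used for the AUC comparison above), each item on the checklist maps to one standard helper. A sketch on the same hand-picked mini-batch from section 4.1:

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, log_loss,
                             precision_recall_fscore_support, roc_auc_score)

# Same hand-picked mini-batch as in section 4.1
y_true = np.array([1, 1, 1, 0, 0, 0])
y_prob = np.array([0.95, 0.70, 0.40, 0.10, 0.30, 0.85])
y_pred = (y_prob >= 0.5).astype(int)

print("1. BCE (log loss):", log_loss(y_true, y_prob))              # probability quality
print("2. Confusion matrix:\n", confusion_matrix(y_true, y_pred))  # kinds of mistakes
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"3. precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")     # tradeoff at threshold 0.5
print("4. ROC-AUC:", roc_auc_score(y_true, y_prob))                # ranking quality
```

The log loss here should match the mean BCE of 0.6139 we computed by hand for this mini-batch earlier.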