Module 03 Model_2 Not Improving (training) #353

toddsp22 · 2023-03-16T14:31:45Z

When I train model_2 I get the following accuracies:

Epoch: 0

Train loss: 2.30229 | Train acc: 10.00%
Test loss: 2.30231 | Test acc: 9.99%

Epoch: 1

Train loss: 2.30228 | Train acc: 10.00%
Test loss: 2.30231 | Test acc: 9.99%

Epoch: 2

Train loss: 2.30228 | Train acc: 10.00%
Test loss: 2.30231 | Test acc: 9.99%

Train time on cpu: 147.086 seconds

So there is something wrong in my training function, my model, or my function call.
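
A note on the numbers reported above: a loss stuck near 2.3026 with about 10% accuracy is what a 10-class classifier produces when it assigns equal probability to every class, which suggests the weights are not being updated at all. A minimal sketch of that arithmetic, assuming 10 classes as in FashionMNIST (the variable names are illustrative only):

```python
import math

num_classes = 10  # assumption: FashionMNIST has 10 classes

# Cross-entropy of a uniform prediction: -ln(1 / num_classes) = ln(num_classes)
chance_loss = -math.log(1.0 / num_classes)

# Accuracy of guessing uniformly over balanced classes, in percent
chance_acc = 100.0 / num_classes

print(f"Chance loss: {chance_loss:.5f} | Chance acc: {chance_acc:.2f}%")
```

The reported train loss of 2.30229 sits within about 0.0003 of this value in every epoch.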

Here is my train_function:

def train_step(model: torch.nn.Module,
               data_loader: torch.utils.data.DataLoader,
               loss_fn: torch.nn.Module,
               optimizer: torch.optim.Optimizer,
               accuracy_fn,
               device: torch.device = device):
    """Performs a training with model trying to learn on data_loader."""
    train_loss, train_acc = 0, 0

    # put model into training mode
    model.train()
    for batch, (X, y) in enumerate(data_loader):

        # X, y = X.to(device), y.to(device)

        y_pred = model(X)

        loss = loss_fn(y_pred, y)
        train_loss += loss
        train_acc += accuracy_fn(y_true=y,
                                 y_pred=y_pred.argmax(dim=1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    train_loss /= len(data_loader)
    train_acc /= len(data_loader)
    print(f"Train loss: {train_loss:.5f} | Train acc: {train_acc:.2f}%")
-------------------------------------------------------------------

This code works on model_1. The to(device) call is commented out because I'm working only on a CPU; it didn't make a difference anyway.

Here is my MNIST V2:

class FashionMNISTModelV2(nn.Module):
    def __init__(self, input_shape: int, hidden_units: int, output_shape: int):
        super().__init__()
        self.conv_block_1 = nn.Sequential(
            nn.Conv2d(in_channels=input_shape,
                      out_channels=hidden_units,
                      kernel_size=3,
                      stride=1,
                      padding=1),
            nn.ReLU(),
            nn.Conv2d(in_channels=hidden_units,
                      out_channels=hidden_units,
                      kernel_size=3,
                      stride=1,
                      padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.conv_block_2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_features=hidden_units*7*7,
                      out_features=output_shape)
        )

    def forward(self, x: torch.Tensor):
        x = self.conv_block_1(x)
        # print(x.shape)
        x = self.conv_block_2(x)
        # print(x.shape)
        x = self.classifier(x)
        return x
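
For readers checking the model itself: the classifier's in_features is hidden_units times 7 times 7 because each conv block keeps the spatial size (3x3 kernel, stride 1, padding 1) and then halves it with MaxPool2d(kernel_size=2). The sketch below walks through that arithmetic in plain Python, assuming 28x28 FashionMNIST inputs; conv_out is an illustrative helper, not a PyTorch function, and hidden_units=10 is only an example value.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial dimension for a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28  # assumption: FashionMNIST images are 28x28
for block in range(2):
    size = conv_out(size, kernel=3, stride=1, padding=1)  # first Conv2d keeps the size
    size = conv_out(size, kernel=3, stride=1, padding=1)  # second Conv2d keeps the size
    size = conv_out(size, kernel=2, stride=2)             # MaxPool2d(2) halves it

hidden_units = 10  # example value only
flattened_features = hidden_units * size * size
print(f"Spatial size after both blocks: {size}x{size}")
print(f"Flattened features: {flattened_features}")
```

The size goes 28 to 14 after the first block and 14 to 7 after the second, so the Linear layer's input size in the model above is consistent with the conv blocks.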

Maybe someone could come up with a bug that I don't see.

Here is my epoch loop:

torch.manual_seed(42)
torch.cuda.manual_seed(42)

from timeit import default_timer as timer
train_time_start_model_2 = timer()

# train and test model

epochs = 3
for epoch in tqdm(range(epochs)):
    print(f"Epoch: {epoch}\n---------")
    train_step(model=model_2,
               data_loader=train_dataloader,
               loss_fn=loss_fn,
               optimizer=optimizer,
               accuracy_fn=accuracy_fn,
               device=device)
    test_step(model=model_2,
              data_loader=test_dataloader,
              loss_fn=loss_fn,
              accuracy_fn=accuracy_fn,
              device=device)

train_time_end_model_2 = timer()
total_train_time_model_2 = print_train_time(start=train_time_start_model_2,
                                            end=train_time_end_model_2,
                                            device=device)

Perhaps someone could find a problem there.

Also I noticed something I thought was interesting. While trying to see if my functions were the problem, I noticed this:

If you are training model_1 and use model_2's parameters it won't learn. Here is what I used though:

loss_fn= nn.CrossEntropyLoss()
optimizer == torch.optim.SGD(params=model_2.parameters(),
lr=0.1)

Please I've spent hours combing through my code and I can't find an answer.

A version problem?

toddsp22 (Author) · 2023-03-16T17:53:49Z

optimizer == torch.optim.SGD(params=model_2.parameters(),   <-- The bug is here.

The problem is the ==. It is a comparison, not an assignment, so the optimizer for model_2's parameters was never stored. The optimizer that actually gets used is still the earlier one, which holds model_1's parameters(). So model_2 never trains.
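
The effect can be reproduced on a small scale. The sketch below is not the course code: it uses two tiny nn.Linear layers as stand-ins for model_1 and model_2, random data, and an illustrative helper named one_step. It assumes PyTorch is installed.

```python
import torch
from torch import nn

torch.manual_seed(42)

model_1 = nn.Linear(4, 3)  # stand-in for the earlier model
model_2 = nn.Linear(4, 3)  # stand-in for FashionMNISTModelV2
loss_fn = nn.CrossEntropyLoss()

# Optimizer left over from training model_1
optimizer = torch.optim.SGD(params=model_1.parameters(), lr=0.1)

# The bug: `==` builds a new SGD object, compares it, and throws it away.
# `optimizer` still refers to the model_1 optimizer afterwards.
optimizer == torch.optim.SGD(params=model_2.parameters(), lr=0.1)

X = torch.randn(8, 4)          # random inputs, illustration only
y = torch.randint(0, 3, (8,))  # random class labels

def one_step(model: nn.Module, optimizer: torch.optim.Optimizer) -> bool:
    """Run one training step; return True if the model's weights changed."""
    before = model.weight.detach().clone()
    loss = loss_fn(model(X), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return not torch.equal(before, model.weight.detach())

changed_with_wrong_optimizer = one_step(model_2, optimizer)

# The fix: a single `=` actually rebinds the name to the new optimizer.
optimizer = torch.optim.SGD(params=model_2.parameters(), lr=0.1)
changed_with_right_optimizer = one_step(model_2, optimizer)

print(changed_with_wrong_optimizer, changed_with_right_optimizer)
```

With the leftover optimizer, loss.backward() still computes gradients for model_2, but optimizer.step() only touches model_1's parameters, so model_2's weights stay where they started.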

If I could make a suggestion: I would suggest calling the optimizer optimizer_m2 or something like it. If it had been named that, I would have found the bug right away.
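
One reason a distinct name helps: if the name has never been assigned, the mistaken == line fails immediately with a NameError instead of silently reusing the old optimizer. A minimal sketch of that behaviour, using a hypothetical make_optimizer() stand-in rather than torch.optim.SGD so it runs without PyTorch:

```python
def make_optimizer() -> object:
    """Hypothetical stand-in for torch.optim.SGD(...); returns a fresh object."""
    return object()

# Reused name: an optimizer already exists, so the typo runs without any error.
optimizer = make_optimizer()                      # pretend this belongs to model_1
comparison_result = optimizer == make_optimizer() # just a comparison, nothing is rebound

# Distinct name: nothing called optimizer_m2 exists yet, so the same typo fails loudly.
try:
    optimizer_m2 == make_optimizer()
    error_message = None
except NameError as err:
    error_message = str(err)

print(comparison_result)
print(error_message)
```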

Yossi-Hd · 2023-03-18T12:59:06Z

I had the same issue, and as @toddsp22 suggested, the problem was in the 'optimizer'. I misspelled the optimizer's name when defining it for 'model_2', so training fell back to the existing 'optimizer', which was the one for model_1.

gaspardringuenet · 2023-03-28T08:23:09Z

Hello,
I have the exact same issue (even the loss is the same), but my optimizer seems to be fine. My code is below; if someone could take a look, that would be great. What I don't get is that everything works fine when I write the loops out inline instead of using the train_step() and test_step() functions.

My functions and setup:

loss_fn = nn.CrossEntropyLoss()

optimizer = torch.optim.SGD(params=model.parameters(), lr=0.1)

def train_step(model: nn.Module,
               data_loader: DataLoader,
               loss_fn: nn.Module,
               optimizer: torch.optim.Optimizer,
               device: torch.device = device):
  train_loss = 0
  
  model.train()

  for batch, (X, y) in tqdm(enumerate(data_loader)):
    X, y = X.to(device), y.to(device)

    y_logits = model(X)

    loss = loss_fn(y_logits, y)
    train_loss += loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

  train_loss /= len(data_loader)

  return train_loss

def test_step(model: nn.Module,
              data_loader: DataLoader,
              loss_fn: nn.Module,
              optimizer: torch.optim.Optimizer,
              device: torch.device = device):
  test_loss = 0

  model.eval()
  with torch.inference_mode():
    for batch, (X, y) in enumerate(data_loader):
      X, y = X.to(device), y.to(device)

      y_logits = model(X)

      loss = loss_fn(y_logits, y)
      test_loss += loss
    
    test_loss /= len(data_loader)

  return test_loss

The training and testing code:

%%time
from tqdm.auto import tqdm

model = model.to("cpu")

epochs = 5

for epoch in tqdm(range(epochs)):
  print(f"Epoch: {epoch}\n-------")

  train_loss = train_step(model=model,
                          data_loader=train_dataloader,
                          loss_fn=loss_fn,
                          optimizer=optimizer,
                          device="cpu")
  
  test_loss = test_step(model=model,
                        data_loader=test_dataloader,
                        loss_fn=loss_fn,
                        optimizer=optimizer,
                        device="cpu")
  print(f"Loss: {train_loss:.3f} | Test_loss: {test_loss:.3f}")
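
The thread does not say what caused this second case. One thing that may be worth ruling out, given the earlier replies, is an optimizer that was created for a different model object, for example because the model cell was re-run after the optimizer cell. The sketch below is only a diagnostic idea under that assumption: optimizer_tracks_model is a hypothetical helper, and the nn.Linear layers stand in for the real model. It assumes PyTorch is installed.

```python
import torch
from torch import nn

def optimizer_tracks_model(optimizer: torch.optim.Optimizer, model: nn.Module) -> bool:
    """Return True if every parameter of `model` is held by `optimizer`."""
    tracked = {id(p) for group in optimizer.param_groups for p in group["params"]}
    return all(id(p) in tracked for p in model.parameters())

model = nn.Linear(4, 3)  # stand-in for the real model
optimizer = torch.optim.SGD(params=model.parameters(), lr=0.1)
tracked_before = optimizer_tracks_model(optimizer, model)

# Re-creating the model without re-creating the optimizer leaves the
# optimizer pointing at the old, now unused, parameters.
model = nn.Linear(4, 3)
tracked_after = optimizer_tracks_model(optimizer, model)

print(tracked_before, tracked_after)
```

Calling a check like this right before the epoch loop shows whether the optimizer passed to train_step() actually holds the parameters of the model being trained.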
