Raw logits or softmax probability outputs in nn.CrossEntropyLoss()? #252
-
In section 4, we have code for multiclass classification. I was experimenting with the code and tried to pass both the raw logits and the prediction probabilities (after passing the raw logits through softmax()) to the loss function. I was wondering which one is recommended for cross-entropy, raw logits or softmax probabilities?

CASE 1 - training loop where I am passing logits to cross-entropy
CASE 2 - training loop where I am passing prediction probabilities (after passing logits to softmax()) to cross-entropy

In both cases, my model reaches 97% test accuracy.
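The notebook's actual training loops aren't shown here, so below is a minimal sketch of the two cases with a toy model and data (`model`, `X_train` and `y_train` are stand-ins, not the section 4 code):

```python
import torch
from torch import nn

torch.manual_seed(42)

# Toy stand-ins for the notebook's model and data (assumed, not the original code)
NUM_FEATURES, NUM_CLASSES, NUM_SAMPLES = 2, 4, 32
model = nn.Sequential(nn.Linear(NUM_FEATURES, 8), nn.ReLU(), nn.Linear(8, NUM_CLASSES))
X_train = torch.randn(NUM_SAMPLES, NUM_FEATURES)
y_train = torch.randint(0, NUM_CLASSES, (NUM_SAMPLES,))

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    model.train()
    logits = model(X_train)          # raw logits, shape [NUM_SAMPLES, NUM_CLASSES]

    # CASE 1: pass raw logits straight to CrossEntropyLoss
    loss = loss_fn(logits, y_train)

    # CASE 2: pass softmax probabilities instead (swap these two lines in)
    # probs = torch.softmax(logits, dim=1)
    # loss = loss_fn(probs, y_train)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```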
-
This is an interesting observation! If you changed the loss_fn parameters like below, the backward pass fails: the reason is that y_pred doesn't have any grad_fn, so the loss cannot do the backward step. Could you please restart the kernel of the notebook and re-run it again? Also, if possible, please provide more information and pieces of code. I am curious about this topic that you raised.
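A rough sketch (not the snippet the reply refers to) of when a prediction tensor does or doesn't have a grad_fn: passing logits through softmax keeps them attached to the autograd graph, while detaching them (or computing them under torch.no_grad()) removes grad_fn and makes loss.backward() raise an error.

```python
import torch
from torch import nn

model = nn.Linear(2, 4)
X = torch.randn(8, 2)
y = torch.randint(0, 4, (8,))
loss_fn = nn.CrossEntropyLoss()

logits = model(X)
probs = torch.softmax(logits, dim=1)

print(logits.grad_fn)   # e.g. <AddmmBackward0 ...>   -> loss.backward() works
print(probs.grad_fn)    # e.g. <SoftmaxBackward0 ...> -> still in the graph

# A tensor that has left the graph (detached or built under torch.no_grad())
# has no grad_fn, so a loss computed from it cannot backpropagate into the model
y_pred = logits.detach()
print(y_pred.grad_fn)   # None
loss = loss_fn(y_pred, y)
# loss.backward()  # would raise: element 0 of tensors does not require grad ...
```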
-
As I mentioned, it is indeed an interesting observation. So, thank you for sharing this.
But back to your question. According to the PyTorch documentation:
The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general).
Therefore it is recommended to pass the raw logits instead of probabilities.
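A quick illustration of that point with made-up numbers: nn.CrossEntropyLoss applies log_softmax internally, so feeding it probabilities applies softmax twice and produces a different (flatter) loss value than feeding it the raw logits.

```python
import torch
from torch import nn

loss_fn = nn.CrossEntropyLoss()

# Made-up logits for 3 samples and 4 classes, plus integer class targets
logits = torch.tensor([[ 2.0, -1.0,  0.5, 0.0],
                       [ 0.1,  3.0, -2.0, 0.3],
                       [-0.5,  0.2,  1.5, 0.0]])
targets = torch.tensor([0, 1, 2])

probs = torch.softmax(logits, dim=1)

print(loss_fn(logits, targets))  # intended usage: raw logits
print(loss_fn(probs, targets))   # softmax applied twice -> a different, flatter loss
```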
But it is beyond that. Have you noticed the shape of the inputs, `logits` and `y_train` (or `probs` and `y_train`)? If you have a closer look, you will notice that `logits` and `probs` are in the shape of [number_of_samples, number_of_classes], but `y_train` is in [number_of_samples]! So, how does cross-entropy loss calculate the loss? …
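A small sketch of how those shapes fit together (a generic example, not the continuation of the reply): the input has shape [number_of_samples, number_of_classes], the target has shape [number_of_samples] and holds class indices, and nn.CrossEntropyLoss is equivalent to taking log_softmax over the class dimension and averaging the negative log-probability of each sample's target class.

```python
import torch
from torch import nn

torch.manual_seed(0)
number_of_samples, number_of_classes = 5, 3
logits = torch.randn(number_of_samples, number_of_classes)           # [number_of_samples, number_of_classes]
y_train = torch.randint(0, number_of_classes, (number_of_samples,))  # [number_of_samples], class indices

# Built-in loss
ce = nn.CrossEntropyLoss()(logits, y_train)

# Manual equivalent: log_softmax over the class dimension, then take the
# negative log-probability of each sample's target class and average
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(number_of_samples), y_train].mean()

print(ce, manual)  # the two values match (up to floating-point error)
```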