Why is softmax not applied explicitly when cross-entropy loss is used as the loss function during neural network training in PyTorch?
Quick answer:
Cross-entropy loss (CELoss) combined with softmax can be rewritten as a single, simplified equation. This simplified form is computationally more efficient than computing softmax and CELoss separately. PyTorch’s nn.CrossEntropyLoss() uses this simplified equation, so we can say that “CrossEntropyLoss() in PyTorch internally computes softmax.”
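As a minimal sketch (with made-up logits and targets), the built-in F.cross_entropy applied to raw logits matches applying log-softmax followed by negative log-likelihood:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw scores for 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # ground-truth class indices

# F.cross_entropy / nn.CrossEntropyLoss expects raw logits (no softmax applied)
loss_builtin = F.cross_entropy(logits, targets)

# Equivalent two-step computation: log-softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss_builtin.item(), loss_manual.item())  # the two values match
```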
Detailed answer:
Cross Entropy Loss: MultiClass
The multi-class cross-entropy loss for a single input is

$$CELoss = -\sum_{c=1}^{N} y_c \log(P_c)$$

Here $y_c$ is the label/ground truth for class $c$; it can take only the values 1 or 0. $P_c$ is the predicted probability for class $c$, which ranges between 0 and 1. $N$ is the total number of classes ($c = 1$ to $N$).
Let’s take an example and go through the equations. The equation above calculates the loss for a single input/image (no batching). The image belongs to exactly one class, so among all the $y_c$ labels only one is 1 and the rest are 0. The sum therefore collapses to a single term:

$$CELoss = -\log(P_c)$$
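As a quick numeric sketch (with made-up probabilities), if the true class has predicted probability 0.7, the loss is simply $-\log(0.7) \approx 0.357$; the full sum and the collapsed single-term form agree:

```python
import torch

# Hypothetical predicted probabilities for 3 classes (already passed through softmax)
P = torch.tensor([0.2, 0.7, 0.1])
y = torch.tensor([0.0, 1.0, 0.0])   # one-hot ground truth: class 1 is the true class

loss_full = -(y * torch.log(P)).sum()   # full sum over all classes
loss_single = -torch.log(P[1])          # collapsed single-term form
print(loss_full.item(), loss_single.item())  # both ≈ 0.357
```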
Here, $P_c$ is the probability output of the neuron whose label is 1. All $P_c$ values are obtained by applying softmax to the last layer of the neural network. $P_c = \sigma(x_i)$ is computed as

$$\sigma(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$

where $x_i$ is the raw output (logit) of neuron $i$.
Here $\sigma(x_i)$ is the softmax output of a single neuron; it is the $P_c$ value of the neuron whose $y_c$ is 1. I am repeating and stressing this because it is important to understand it clearly.
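A small sketch (again with made-up logits) showing softmax turning raw logits into the $P_c$ values used above, and the resulting single-term loss:

```python
import torch

x = torch.tensor([1.5, 0.3, -0.8])       # hypothetical logits from the last layer
P = torch.exp(x) / torch.exp(x).sum()    # softmax written out explicitly
print(P, torch.softmax(x, dim=0))        # matches torch.softmax

true_class = 0                           # suppose y_c = 1 for class 0
print(-torch.log(P[true_class]).item())  # the single-term cross-entropy loss
```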
Substituting the softmax expression into the cross-entropy loss, we get:

$$CELoss = -\log\left(\frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}\right)$$
Simplifying the above equation using $\log(a/b) = \log a - \log b$:

$$CELoss = -x_i + \log\left(\sum_{j=1}^{N} e^{x_j}\right)$$
This is the CELoss equation PyTorch evaluates: it combines cross-entropy and softmax into one expression that operates directly on the logits, which is simpler to compute and numerically more stable than computing softmax first and then taking its log.
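A final sketch (hypothetical logits again) confirming that evaluating this combined equation directly on the logits reproduces PyTorch’s built-in loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(5)            # logits for a single input with 5 classes
i = 3                         # index of the true class (y_c = 1 for c = 3)

# Combined equation: -x_i + log(sum_j exp(x_j))
loss_combined = -x[i] + torch.logsumexp(x, dim=0)

# PyTorch's built-in cross-entropy on the same logits
loss_pytorch = F.cross_entropy(x.unsqueeze(0), torch.tensor([i]))

print(loss_combined.item(), loss_pytorch.item())  # identical values
```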
I was curious to understand why softmax is not applied explicitly when cross-entropy loss is used during training in PyTorch; digging into that question led to this post. I hope it was useful. If anything here needs more clarification or could be improved, let me know and I will be happy to look into it.