You’re building a Keras model. If you haven’t been doing deep learning for long, getting the output activations and cost functions right may take a certain amount of memorization (or lookup). Here are some key points you might want to bear in mind as you explore your options.
Or:
Understanding why can make things easier. Why are output activations and cost functions coupled the way they are? Do they always have to be?
In a nutshell
We choose the output activation that makes the network predict what we want it to predict.
The cost function is then determined by the model.
This is because neural networks are normally optimized using maximum likelihood, and depending on the distribution we assume for the output units, maximum likelihood yields different optimization objectives. All of these objectives then minimize the cross-entropy (read: mismatch) between the true and predicted distributions.
Let’s begin with the simplest case: the linear one.
Regression
Here’s a no-frills model for predicting sepal width from sepal length (botanists, please forgive the simplicity).
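A minimal sketch of such a model in Keras (Python) might look as follows. The data below is a synthetic stand-in for the iris measurements, and all names and hyperparameters are illustrative:

```python
import numpy as np
from tensorflow import keras

# Synthetic stand-in for the iris measurements: one predictor (sepal
# length), one target (sepal width). Shapes: (n_samples, 1).
x = np.random.uniform(4.0, 8.0, size=(150, 1)).astype("float32")
y = (0.5 * x + np.random.normal(0.0, 0.3, size=(150, 1))).astype("float32")

# A single linear unit: the network's raw output directly estimates the
# mean of the conditional Gaussian.
model = keras.Sequential([
    keras.layers.Dense(1, activation="linear", input_shape=(1,))
])

# Mean squared error: the maximum-likelihood cost under the Gaussian
# assumption (mean absolute error would target the conditional median instead).
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=10, verbose=0)
```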
The underlying assumption of our model is that sepal width is normally distributed, conditional on sepal length. In other words, we are trying to predict the mean of a conditional Gaussian distribution.
In that case, the cost function that minimizes cross-entropy (equivalently: maximizes likelihood) is mean squared error.
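To spell the reasoning out (this is the standard maximum-likelihood derivation, cf. Goodfellow et al. 2016): assuming $y_i \sim \mathcal{N}(\hat{y}_i, \sigma^2)$, the negative log-likelihood of the data is

$$-\log \prod_{i=1}^{n} \mathcal{N}(y_i \mid \hat{y}_i, \sigma^2) \;=\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \;+\; \frac{n}{2}\log(2\pi\sigma^2),$$

so, up to constants that do not depend on the predictions, maximizing likelihood is the same as minimizing the sum of squared errors.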
And that’s exactly why mean squared error is the loss we use here.
Alternatively, we might wish to predict the median of that conditional distribution. In that case, we’d change the cost function to mean absolute error.
Nonlinearity beckons.
Binary classification
As avid backyard birders, we’d like an app that notifies us when there’s a bird in the garden, but not when a passing airplane triggers the alert. We’ll therefore train a network to distinguish between two classes: birds and airplanes.
What we usually describe as binary classification is, from the model’s point of view, a Bernoulli probability distribution conditioned on the input. So:
A Bernoulli random variable takes on exactly two values: 0 with probability $1 - p$, and 1 with probability $p$. So $p$, a value between 0 and 1, is what our network should produce.
We could simply clip all raw output values to the interval [0, 1]. But if we did that, the gradient in the clipped regions would be 0: the network could not learn.
A better approach is to squash the entire incoming range into the interval (0, 1), using the logistic (sigmoid) function:
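$$\sigma(x) \;=\; \frac{1}{1 + e^{-x}}$$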

As the formula suggests, the sigmoid saturates when its input gets very large or very small. Is that a problem?
It depends. What ultimately matters is whether the cost function saturates. If we chose mean squared error here, as in the regression task above, that is indeed what would happen.
However, if we stick with the general principle of maximum likelihood / cross-entropy, the resulting loss, for a positive example with raw output (logit) $z$, is
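$$-\log \sigma(z) \;=\; \log\bigl(1 + e^{-z}\bigr) \;=\; \zeta(-z),$$

the softplus of the negated logit (the case of a negative example is analogous).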
Here, the log undoes the exp responsible for the sigmoid’s plateau, so the loss saturates only when the network is already giving the right answer.
In Keras, the corresponding loss function is called binary_crossentropy. For a single item, the loss is:
- when the ground truth is 1: $-\log(\hat{y})$
- when the ground truth is 0: $-\log(1 - \hat{y})$

where $\hat{y}$ is the network’s predicted probability.
Here we can see that when an individual example is confidently misclassified, it contributes very strongly to the overall loss.
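Putting the pieces together, a minimal Keras sketch of the bird-versus-airplane classifier might look like this (the data and shapes are illustrative stand-ins):

```python
import numpy as np
from tensorflow import keras

# Illustrative stand-in data: 28x28 grayscale images, label 1 = bird,
# label 0 = airplane.
x = np.random.rand(100, 28, 28).astype("float32")
y = np.random.randint(0, 2, size=(100, 1)).astype("float32")

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(32, activation="relu"),
    # The sigmoid squashes the raw output into (0, 1): the Bernoulli
    # parameter p = P(bird | image).
    keras.layers.Dense(1, activation="sigmoid"),
])

# Cross-entropy rather than squared error: the log in the loss undoes
# the exp responsible for the sigmoid's saturation.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=5, verbose=0)
```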

Things change once we distinguish between more than two classes.
Multi-class classification
The CIFAR-10 dataset comprises 60,000 32×32-pixel colour images in 10 classes, with each class representing a distinct object category, such as animals or vehicles.
Not much changes compared to the setup above; but take a look at the activation and the cost function.
Now we have softmax combined with categorical cross-entropy. Why?
To guarantee a valid probability distribution: the probabilities of all mutually exclusive events have to sum to 1.
And CIFAR-10 has exactly one object per image, so the classes are mutually exclusive.
The single-draw multinomial distribution, also known as “Multinoulli” in reference to Murphy’s work, can be modeled using the softmax activation function.
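For raw outputs (logits) $z_1, \dots, z_K$, softmax computes

$$\mathrm{softmax}(z)_i \;=\; \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}},$$

which is non-negative and sums to 1 over the $K$ classes.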
Like the sigmoid, softmax can saturate: in this case, it happens when differences between the raw outputs become very large.
And just as with the sigmoid, a log in the cost function undoes the exp responsible for the saturation.
Moreover, the logit of the class whose probability we’re estimating enters the loss linearly, so that term can never saturate.
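Concretely (cf. Goodfellow et al. 2016): for correct class $i$, the cross-entropy loss is

$$-\log \mathrm{softmax}(z)_i \;=\; -z_i + \log \sum_{j=1}^{K} e^{z_j},$$

and the first term, $-z_i$, is linear in the logit.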
The loss function in Keras that accomplishes this is called categorical_crossentropy. In the sketch below, we use sparse_categorical_crossentropy instead, which is the same as categorical_crossentropy except that it spares us the conversion of integer labels to one-hot vectors.
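A minimal sketch (a real CIFAR-10 model would use convolutional layers; the dense stack and stand-in data here only serve to illustrate the output activation and the loss):

```python
import numpy as np
from tensorflow import keras

# Illustrative stand-in for CIFAR-10: 32x32 colour images, 10 classes,
# integer labels (which is why we use the "sparse" loss variant).
x = np.random.rand(100, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=(100,))

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(32, 32, 3)),
    keras.layers.Dense(128, activation="relu"),
    # Softmax turns the 10 logits into a probability distribution over
    # 10 mutually exclusive classes.
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x, y, epochs=5, verbose=0)
```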
So what does softmax actually do to our raw outputs, the logits? And what does the resulting probability distribution look like?
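Here is a minimal numeric sketch; the logits are made up for illustration:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Made-up raw outputs (logits) for a five-class problem.
logits = np.array([2.0, 1.0, 0.5, -0.5, -1.0])
print(softmax(logits))
# ~ [0.58, 0.21, 0.13, 0.05, 0.03]: the largest logit claims more than
# half of the probability mass, while the others shrink toward 0.
```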

Do you see where the “winner takes all” in the title comes from? Softmax doesn’t just normalize the raw outputs into a probability distribution; it also exaggerates the differences between them, pulling the largest value toward 1. It’s important to keep in mind that activation functions don’t merely guarantee we end up with the kind of distribution we want; they can also change the relationships among values.
Conclusion
This post started from common heuristics, such as “for multi-class classification, we use softmax activation combined with categorical cross-entropy as the loss function.” We aimed to show the reasoning behind them, and hopefully we have succeeded.
Knowing the why also helps you recognize when these rules do not apply. Say we want to detect and count several kinds of objects in an image. In that case, the classes are not mutually exclusive, so softmax is not useful: we do not want to exaggerate differences between candidates. Instead, we’d equip every output unit with a sigmoid activation, to estimate, for each class independently, the probability of its presence.
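A sketch of what such a multi-label output could look like in Keras (five hypothetical object types; names, shapes, and data are illustrative):

```python
import numpy as np
from tensorflow import keras

# Illustrative multi-label setup: each image may contain any subset of
# five object types, so the target is five independent 0/1 indicators.
x = np.random.rand(100, 32, 32, 3).astype("float32")
y = np.random.randint(0, 2, size=(100, 5)).astype("float32")

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(32, 32, 3)),
    keras.layers.Dense(64, activation="relu"),
    # One sigmoid per class: the outputs need not sum to 1, because the
    # classes are not mutually exclusive.
    keras.layers.Dense(5, activation="sigmoid"),
])

# Binary cross-entropy is applied to each output unit independently.
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, epochs=5, verbose=0)
```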
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press.
Murphy, Kevin P. 2012. Machine Learning: A Probabilistic Perspective. MIT Press.