Multi-class
Multi-class classification refers to classification problems where you can have more than just 2 possible output labels, so not just zero or one. Some examples include:
Reading postal or zip codes on an envelope
Classifying whether a patient may have any of 3 or more possible diseases
Visual defect inspection of parts manufactured in a factory
So, multi-class classification is still a classification problem, in that y can take on only a small number of discrete categories rather than any number, but now y can take on more than just 2 possible values. The plot may look something like this:
Softmax
The softmax regression algorithm is a generalization of logistic regression, which is a binary classification algorithm, to the multi-class classification context.
Take a look at the difference between the formula for logistic regression, with only 2 possible outputs, and the formula for softmax regression, with N possible outputs, on the left side of the image below:
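The image itself isn't reproduced here, but as a sketch in the course's usual notation (with w_j and b_j as the parameters for class j), the two sets of formulas are:

```latex
% Logistic regression (2 possible outputs):
a_1 = g(z) = \frac{1}{1 + e^{-z}} = P(y = 1 \mid \vec{x})
a_2 = 1 - a_1 = P(y = 0 \mid \vec{x})

% Softmax regression (N possible outputs), for j = 1, \dots, N:
z_j = \vec{w}_j \cdot \vec{x} + b_j
a_j = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} = P(y = j \mid \vec{x})
```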
The right side of the image above shows the computation of softmax regression with 4 possible outputs. Note how the possible outputs sum to 1 (see the sketch after the example below).
For example:
a1 = probability of the input being a picture of a dog
a2 = probability of the input being a picture of a cat
a3 = probability of the input being a picture of a bird
a4 = probability of the input being neither a dog, a cat, nor a bird.
In the example image above:
a1 = 0.30
a2 = 0.20
a3 = 0.15
a4 = 0.35
The total a1 + a2 + a3 + a4 equals 1. Based on these outputs, the highest probability is a4, which means the input is most likely neither a dog, a cat, nor a bird.
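As a concrete check, here is a minimal NumPy sketch of the softmax computation. The logits `z` are hypothetical values chosen so the outputs roughly match the probabilities above:

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits z into probabilities that sum to 1."""
    # Subtract the max for numerical stability; it does not change the result.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Hypothetical logits for the 4 classes: dog, cat, bird, none of these.
z = np.array([1.10, 0.69, 0.40, 1.25])
a = softmax(z)

print(a)           # approximately [0.30 0.20 0.15 0.35]
print(a.sum())     # 1.0 -- the probabilities always sum to 1
print(a.argmax())  # 3 -> index of a4, the most likely class
```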
Let's take a look at the cost and loss functions for logistic regression (2 possible outputs) vs. softmax regression (N possible outputs):
In logistic regression, y can only be either 1 or 0 (per the image above).
In softmax regression, y can be 1 or 2 or 3 or ... N.
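Spelled out (a sketch of the standard formulas, since the image itself isn't reproduced here):

```latex
% Logistic regression loss (y is 0 or 1):
L(a_1, y) = -y \log(a_1) - (1 - y) \log(1 - a_1)

% Softmax regression (cross-entropy) loss:
L(a_1, \dots, a_N, y) = -\log(a_j) \quad \text{if } y = j
```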
Take a look at the cross-entropy loss plot:
If aj is very close to 1, then the loss will be very small.
But if aj is only 0.5, then the loss gets a little bigger.
The smaller aj is, the bigger the loss.
This incentivizes the algorithm to make aj as large as possible (as close to 1 as possible), because whatever the actual value of y is, you want the algorithm to say that the chance of y being that value is pretty large.
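To make this concrete, here is a minimal NumPy sketch that evaluates -log(aj) for a few probabilities (the values of aj are just illustrative):

```python
import numpy as np

# Cross-entropy loss for the true class j is simply -log(aj).
for aj in [0.99, 0.5, 0.1, 0.01]:
    print(f"aj = {aj:4}: loss = {-np.log(aj):.3f}")

# aj = 0.99: loss = 0.010  (aj close to 1 -> very small loss)
# aj =  0.5: loss = 0.693  (a little bigger)
# aj =  0.1: loss = 2.303  (smaller aj -> bigger loss)
# aj = 0.01: loss = 4.605
```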