K-Nearest Neighbors: Part 3 More about KNN
The goal of this lecture is to tackle more advanced topics related to KNN and dive into more details. I recommend you to read Part 1 if you haven’t. The important points that you need to know about KNN are summarized there.
One of the drawbacks of KNN seen in Part 1 is that it is not built on any probabilistic framework: no posterior probabilities of class membership and no way to infer number of neighbours or metric parameters probabilistically.
A probabilistic classifier returns a probability distribution over outputs given an input, what we call the likelihood p(y|x; parameters). When the probality to belong to a class is around 0.5, sometimes we need human intervention to make a decision or at least combine different classes… This is not possible when working with the classic KNN. There are different modification of KNN to make a probabilistic output, we are going to see some of them. You could design your own modification, especially if you think it’s more suitable for your problem.
1. KNN classifier from a Bayesian perspective
In this section, the goal is to understand how the KNN would perform from a Bayesian point of view. That means we are going to define the Likelihood, the Prior and the Posterior.
Let’s remind that
In a Bayesian Approach, we need to define the likelihood that is the probability of having a class given a vector x (and a training set). One way we could think of, is given a point, we find the K nearest neighbors based on the distance function defined, given the list of these nearest neighbors, we build a likelihood by counting the number of elements in each class.
Let’s take the example in the figure, we have three classes, and K = 6. The goal is to decide the class of the green point. So we could write p(y=1|x_green) = 4/6 = 2/3
p(y=2|x_green) = 2/6 = 1/3
p(y=3|x_green) = 0/6 = 0
This is the very first approach that we could think of. Some of the problems we notice already:
- We assign a null probability for classes not within the sphere defined by the K nearest neighbors
- The probability does not take into consideration the distance information, we would want to have high probability when the points are closer.
Okay, let’s solve that. First, let’s define the context in a mathematical framework.
Let x be the point with the unknown label. The likelihood probability could be writen as:
If we want to add the some weights ( like the distance, or weight according to features, of a combination), we could write the likelihood probability as:
Now, to fix the problem that points outside the sphere defined by the K nearest neighbors have a null class, and also to take advatanges of the Softmax function, we could write the likelihood probability as:
With Beta is the inverse of the temperature of the Softmax function and theta is the set of parameters related to the weights. All set for the likelihood.
For the point x, one way to estimate the unconditional probability could be:
p(x) = (nxK)/V where V is the volume of the sphere contaning the K nearest neighbors. This an estimation of the probability of having a point x in a volume V.
And we could also estimate the probability of having a class by:
p(C_i) = n_i/n.
And using the Bayes rule we can calculate the posterior.
2. Visualization and Experiments
The goal here is to do some visualizations of the likelihood based on a random sample of data and see the effect of changes made of the likelihood.
The data have two dimensions and three possibles classes. For this modification
Now, considering this likelihood:
We can see by comparing with weights and without weights that the likelihood is more spread and soft.
And now with the final likelihood modification:
We can see that the Softmax made the likelihood more spread and deleted null probabalities. The effect of the inverse of the temperature is as follow:
- For small values of beta, we tend to have a more uniform distribution
- For high values of beta, we have more spiky distribution.
This is it for this lecture, keep in mind that you can build your own probabilistic appraoch, Machine Learning models are not always meant to be used in the standard format, make them more customized so they can be more powerful on your data.