K-Nearest Neighbors, Part 2: More About KNN
The goal of this lecture is to tackle more advanced topics related to KNN and dive into more detail. I recommend reading Part 1 first if you haven’t already.
The topics addressed in this lecture are:
- KNN and the curse of dimensionality
- KNN time and space complexity
- KNN with feature weighting
1. KNN and the curse of dimensionality
Let us recall a general fact: all machine learning algorithms need a dense dataset in order to predict accurately over the entire data space. Errors arise in every algorithm where there are gaps in the data (sparse data). But sparsity is even trickier when it comes to KNN, and here is why:
As seen in Part 1, KNN is based on the assumption that similar/close inputs give similar outputs, and two points are considered similar only if they are close in every single dimension. Let us study how the data becomes sparser as the dimension of the input space increases.
For the KNN algorithm, take a test point P: the edge length of the smallest cube around P that contains its k nearest neighbors grows quickly toward the full extent of the space as the dimension increases. For example, set k = 10 and let the number of training samples be n = 1000; the edge length of this smallest cube changes as follows:
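Where do these numbers come from? If n points are spread uniformly in the unit hypercube [0, 1]^d, a sub-cube of edge length l contains roughly n*l^d points, so capturing k neighbors requires l ≈ (k/n)^(1/d). The following minimal sketch computes this estimate for k = 10 and n = 1000:

```python
# Estimate of the edge length l of the cube expected to contain the
# k nearest neighbors of a point, assuming the n training points are
# uniformly distributed in the unit hypercube [0, 1]^d:
#   n * l^d ~ k  =>  l ~ (k / n)^(1/d)
k, n = 10, 1000
for d in [1, 2, 10, 100, 1000]:
    edge = (k / n) ** (1 / d)
    print(f"d = {d:4d} -> edge length ~ {edge:.3f}")

# d =    1 -> edge length ~ 0.010
# d =    2 -> edge length ~ 0.100
# d =   10 -> edge length ~ 0.631
# d =  100 -> edge length ~ 0.955
# d = 1000 -> edge length ~ 0.995
```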
It is very surprising that when d is large we need almost the entire space to find the 10 nearest neighbors. This is the curse of dimensionality…
We might think we can get around this problem by increasing the number of training examples, but this is rarely feasible when d is large, because the number of samples required grows exponentially with d.
The points to keep in mind are:
1. Higher dimensional spaces are empty
2. Points in high dimensional spaces are isolated
3. Neighborhoods are no longer local
4. The notion of nearest neighbors vanishes
2. KNN space and time complexity
Space complexity
KNN stores the entire training set, so its space complexity is O(n*d). We can reduce it by using dimensionality reduction or by keeping fewer training examples. Some of the strategies are:
Editing: we remove data points that do not affect the decision boundary. For example, consider a single data point (an “outlier”) surrounded by many data points from a different class. If we perform a KNN prediction, this single data point will not influence the class label under plurality voting; hence, we can safely remove it. (We could also think of relabeling this point.)
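As an illustration, here is a minimal sketch of one possible editing rule (sometimes called Wilson editing), not necessarily the only one: drop every training point whose label disagrees with the plurality vote of its k nearest neighbors among the other training points.

```python
import numpy as np
from collections import Counter

def edit_training_set(X, y, k=3):
    """Editing sketch: remove points misclassified by their
    k nearest neighbors among the other training points."""
    keep = []
    for i in range(len(X)):
        # distances from point i to every training point
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                       # exclude the point itself
        nn = np.argsort(d)[:k]              # indices of the k nearest neighbors
        majority = Counter(y[nn]).most_common(1)[0][0]
        if majority == y[i]:                # keep only points their neighbors agree with
            keep.append(i)
    return X[keep], y[keep]
```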
Prototyping: The idea is to replace selected data points by prototypes that summarize multiple data points in dense regions. This could be done using a clustering algorithm like K-Means.
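A minimal sketch of prototyping, assuming scikit-learn’s KMeans: run K-Means separately on each class and keep only the cluster centers (labeled with that class) as the reduced training set. The number of prototypes per class is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(X, y, prototypes_per_class=20):
    """Replace each class's points by its K-Means cluster centers."""
    proto_X, proto_y = [], []
    for label in np.unique(y):
        Xc = X[y == label]
        k = min(prototypes_per_class, len(Xc))   # cannot have more centers than points
        km = KMeans(n_clusters=k, n_init=10).fit(Xc)
        proto_X.append(km.cluster_centers_)
        proto_y.append(np.full(k, label))
    return np.vstack(proto_X), np.concatenate(proto_y)
```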
Time complexity
The time complexity depends on how we perform the search for the k nearest points and on the data structure used to store the training examples.
Brute force:
Suppose the training examples are stored in some kind of list (a matrix is a list of rows), where each element of the list is an instance. We have two versions of the brute-force method:
In the first version, we iterate k times over the training examples: at each of the k iterations we recompute the distances between the test point and the training examples and pick the closest element that has not yet been added to the list of the k nearest neighbors, which costs O(k*n*d). A different version calculates the distances between the test point and the training examples only once and stores them; this adds O(n) of extra space to the space complexity seen before. This is the second version:
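A minimal sketch of this second version, assuming NumPy and a plurality vote over the k nearest labels:

```python
import numpy as np
from collections import Counter

def knn_predict_brute_force(X_train, y_train, x_test, k=10):
    """Brute-force KNN (second version): compute all distances once,
    keep them in memory, then pick the k smallest."""
    # O(n*d): Euclidean distances from the test point to every training point
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # O(n*log(n)) with argsort (np.argpartition would bring this down to O(n))
    nearest = np.argsort(distances)[:k]
    # plurality vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```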
Max-Heap
There is an improvement we can make on this second version once we have the list of distances. Instead of iterating k times over the training examples, we can use a max-heap (you don’t know what a max-heap is? check this). First, put the distances of the first k training examples into the max-heap. Then, in a single pass over the remaining training examples, compare the current distance with the root of the max-heap: if it is larger, skip it; otherwise, replace the root with the new element, which takes O(log(k)). After this single pass, the k nearest neighbors are exactly the elements of the max-heap. So the time complexity is O(n*d + n*log(k)).
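A minimal sketch of this max-heap variant, using Python’s heapq module (a min-heap, so distances are stored negated to simulate a max-heap):

```python
import heapq
import numpy as np
from collections import Counter

def knn_predict_max_heap(X_train, y_train, x_test, k=10):
    """KNN with a max-heap of size k: O(n*d) for the distances
    plus O(n*log(k)) for maintaining the heap."""
    heap = []  # stores (-distance, label); the root holds the largest distance
    for x, label in zip(X_train, y_train):
        dist = np.linalg.norm(x - x_test)            # O(d) per training point
        if len(heap) < k:
            heapq.heappush(heap, (-dist, label))     # fill the heap with the first k points
        elif dist < -heap[0][0]:
            heapq.heapreplace(heap, (-dist, label))  # closer than the current worst: replace it
    return Counter(label for _, label in heap).most_common(1)[0][0]
```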
A smarter data structure (a k-d tree, for instance)
Instead of storing the training examples in a simple array, we can store them in a smarter data structure that makes the search for nearest neighbors easier. That means making the prediction step less complex (faster) by adding some complexity to the training process. One of the data structures we can use is the k-d tree. I am not going to give too many details about this part; all you need to know is that there are data structures that make the search for nearest neighbors during prediction easier.
The first step is the construction of the k-d tree, which has a time complexity of O(d*n*log(n)); once it is built, we can use it to find the nearest neighbor of a given point, which can be done in O(log(n)) on average. K-d trees do not perform well in high-dimensional spaces, in which case we can think about using Locality-Sensitive Hashing. I am going to cover this algorithm in a separate lecture.
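A minimal sketch of the k-d tree workflow, assuming scikit-learn’s KDTree (any k-d tree implementation would do):

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X_train = rng.random((1000, 3))        # 1000 training points in 3 dimensions
x_test = rng.random((1, 3))            # one query point

tree = KDTree(X_train)                 # construction: roughly O(d*n*log(n))
dist, ind = tree.query(x_test, k=10)   # query the 10 nearest neighbors
print(ind[0])                          # indices of the 10 nearest training points
```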
3. KNN with weighted features
For this part, we will consider the Euclidean distance as our distance metric.
The idea is to assign a weight to each feature in the distance computation. We use the training examples to decide the weights; this way we give more importance to the most significant features, and we can even perform feature selection by making predictions using only the top-weighted features. This method also gives us information about the importance of each feature.
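A minimal sketch of a weighted Euclidean distance, with the weight vector taken as given (how exactly the weights are learned from the training examples is left open here):

```python
import numpy as np

def weighted_euclidean(x, z, w):
    """Weighted Euclidean distance: sqrt(sum_j w_j * (x_j - z_j)^2).
    Equivalent to rescaling feature j by sqrt(w_j) and using the plain
    Euclidean distance; w_j = 0 drops feature j entirely."""
    return np.sqrt(np.sum(w * (x - z) ** 2))

x = np.array([1.0, 2.0, 3.0])
z = np.array([2.0, 0.0, 3.0])
w = np.array([0.5, 2.0, 0.0])        # illustrative weights; the third feature is ignored
print(weighted_euclidean(x, z, w))   # sqrt(0.5*1 + 2*4 + 0) = sqrt(8.5)
```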
This is it for part 2 of the series dedicated to the KNN algorithm.