K-Nearest Neighbors (KNN) Algorithm Explained with Easy-to-Understand Examples
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression problems. It's simple and easy to understand, which makes it a good starting point for learning about machine learning.
The basic idea behind KNN is to find the K closest training examples to a new example, and to use the majority class among those K examples as the prediction for the new example. For example, if K=3 and two of the three closest examples are class A while one is class B, we would predict that the new example is class A.
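To make the vote itself concrete, here is a minimal Python sketch of that majority step (the labels are just the A/B classes from the sentence above):

```python
from collections import Counter

# With K=3, suppose the three nearest neighbors have labels A, A, and B.
neighbor_labels = ["A", "A", "B"]

# most_common(1) returns the single most frequent label and its count.
prediction = Counter(neighbor_labels).most_common(1)[0][0]
print(prediction)  # -> "A": the majority class among the neighbors wins
```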
Here's an example to help illustrate how KNN works:
Imagine that we want to classify different types of fruits based on their weight and color. We have a dataset of 10 fruits, each with a weight, a color, and a corresponding label (Apple or Orange):
| Fruit    | Weight (grams) | Color  | Label  |
|----------|----------------|--------|--------|
| Fruit 1  | 100            | Red    | Apple  |
| Fruit 2  | 120            | Red    | Apple  |
| Fruit 3  | 130            | Red    | Apple  |
| Fruit 4  | 140            | Orange | Orange |
| Fruit 5  | 150            | Orange | Orange |
| Fruit 6  | 160            | Orange | Orange |
| Fruit 7  | 170            | Orange | Orange |
| Fruit 8  | 180            | Orange | Orange |
| Fruit 9  | 190            | Red    | Apple  |
| Fruit 10 | 200            | Red    | Apple  |
We want to predict the label of a new fruit that weighs 165 grams and is orange in color. To do this using KNN, we would take four steps (implemented in the sketch after this list):

1. Calculate the distance between the new fruit and each of the 10 fruits in our dataset, using a distance metric such as Euclidean distance.
2. Sort the distances in ascending order and keep the K closest fruits. Let's say we set K=3 for this example.
3. Count the number of fruits of each label among the K closest fruits. In this case, the three closest fruits (Fruits 6 and 7, plus Fruit 5 or Fruit 8, which tie for third) are all Oranges.
4. Predict the label of the new fruit to be the most common label among the K closest fruits. Here, we would predict that the new fruit is an Orange.
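Putting those four steps together on the fruit table gives the following sketch. Encoding color as a number (Red = 0, Orange = 1) and leaving weight unscaled in grams are simplifications chosen for this toy dataset, not part of the algorithm itself:

```python
from collections import Counter
import math

# The ten fruits from the table above, with color encoded numerically
# (Red = 0, Orange = 1; this encoding is our own simplification).
fruits = [
    (100, 0, "Apple"), (120, 0, "Apple"), (130, 0, "Apple"),
    (140, 1, "Orange"), (150, 1, "Orange"), (160, 1, "Orange"),
    (170, 1, "Orange"), (180, 1, "Orange"),
    (190, 0, "Apple"), (200, 0, "Apple"),
]

new_fruit = (165, 1)  # 165 grams, orange
k = 3

# Step 1: compute the Euclidean distance from the new fruit to every fruit.
distances = [
    (math.dist((weight, color), new_fruit), label)
    for weight, color, label in fruits
]

# Step 2: sort the distances in ascending order and keep the K closest.
k_nearest = sorted(distances)[:k]

# Steps 3 and 4: count the labels among the K closest and take the majority.
votes = Counter(label for _, label in k_nearest)
print(votes)                        # label counts among the 3 nearest fruits
print(votes.most_common(1)[0][0])  # -> "Orange"
```

Note that because weight is measured in grams, it dominates the distance here, which happens to be harmless for this toy dataset; on real data you would usually scale features to comparable ranges before computing distances.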
That's the basic idea behind KNN! The algorithm is simple, but it can be very effective for certain problems. One drawback of KNN is that it can be slow and memory-intensive on large datasets, since it must store the entire training set and compute the distance between the new example and every example in that set at prediction time.
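In practice you would rarely hand-roll the loop above. As a sketch, scikit-learn's KNeighborsClassifier implements the same idea (and can use tree-based data structures to speed up the neighbor search on larger datasets), assuming scikit-learn is installed:

```python
from sklearn.neighbors import KNeighborsClassifier

# Same fruit data as above: [weight_in_grams, color] with Red = 0, Orange = 1.
X = [[100, 0], [120, 0], [130, 0], [140, 1], [150, 1],
     [160, 1], [170, 1], [180, 1], [190, 0], [200, 0]]
y = ["Apple", "Apple", "Apple", "Orange", "Orange",
     "Orange", "Orange", "Orange", "Apple", "Apple"]

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
print(model.predict([[165, 1]]))  # -> ['Orange']
```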