k-Nearest Neighbors Algorithm
k-Nearest Neighbors (k-NN) is a simple and widely used supervised learning algorithm that can be applied to both classification and regression tasks. The algorithm classifies or predicts a data point based on how closely it resembles its neighbours. k-NN has no explicit training phase; instead, it stores the entire dataset and, for a given input, finds the k nearest neighbours and uses their majority class (for classification) or average value (for regression) to make a prediction.
Characteristics:
- Instance-Based Learning: k-NN is a lazy learner, meaning it stores all training instances and delays computation until prediction time.
- Non-Parametric: It makes no assumptions about the underlying data distribution, making it highly flexible but sensitive to noisy data.
- Distance-Based: The algorithm relies on a distance metric (e.g., Euclidean distance) to measure the proximity or similarity of data points.
How k-NN Works:
- Data Collection: k-NN requires a labelled dataset of examples, where each example consists of feature values and a corresponding label (for classification) or continuous target (for regression).
- Prediction: To predict the label or value for a new, unseen data point:
  - Measure the distance between the new point and all points in the training set using a distance metric.
  - Select the k nearest neighbours based on the shortest distances.
  - For classification, assign the most frequent class (majority voting) among the k neighbours. For regression, calculate the average of the k neighbours' values.
Distance Metrics:
The most commonly used distance metrics in k-NN are:
1. Euclidean Distance

Euclidean Distance is the most common distance metric, defined as the straight-line distance between two points in Euclidean space:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where:

- $x_i$ and $y_i$ are the coordinates of the points in $n$-dimensional space.

2. Manhattan Distance

Manhattan Distance is the sum of the absolute differences between the coordinates of the points. It is also known as the L1 distance or taxicab distance:

$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$

where:

- $x_i$ and $y_i$ are the coordinates of the points in $n$-dimensional space.

3. Minkowski Distance

Minkowski Distance is a generalized metric that includes both the Euclidean and Manhattan distances as special cases:

$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

where:

- $p$ is a parameter that defines the distance metric:
  - $p = 1$ gives the Manhattan distance.
  - $p = 2$ gives the Euclidean distance.
- $x_i$ and $y_i$ are the coordinates of the points in $n$-dimensional space.

4. Hamming Distance

Hamming Distance is used for categorical data and is defined as the number of positions at which the corresponding elements are different:

$$d(x, y) = \sum_{i=1}^{n} \mathbb{1}(x_i \neq y_i)$$

where:

- $x_i$ and $y_i$ are the elements of the points, and $\mathbb{1}(x_i \neq y_i)$ equals 1 when the elements differ and 0 otherwise.
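As a quick illustration, these metrics can be computed directly with NumPy. The following is a minimal sketch, assuming two numeric feature vectors (and two categorical vectors for the Hamming case); all values are invented for demonstration.

```python
import numpy as np

x = np.array([150, 8])
y = np.array([130, 6])

# Euclidean (L2) distance: straight-line distance between the points
euclidean = np.sqrt(np.sum((x - y) ** 2))

# Manhattan (L1) distance: sum of absolute coordinate differences
manhattan = np.sum(np.abs(x - y))

# Minkowski distance with parameter p (p=1 -> Manhattan, p=2 -> Euclidean)
def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

# Hamming distance: number of positions where categorical elements differ
a = np.array(['red', 'round', 'sweet'])
b = np.array(['green', 'round', 'sweet'])
hamming = np.sum(a != b)

print(euclidean, manhattan, minkowski(x, y, 3), hamming)
```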
Choosing k:
- Small k: A smaller k (e.g., k=1 or k=3) makes the model sensitive to noise and can lead to overfitting, because it considers fewer neighbours.
- Large k: A larger k produces a more generalized prediction but may lead to underfitting if it includes too many neighbours from other classes.
- Optimal k: The ideal value of k is often found through experimentation or techniques such as cross-validation, as sketched below.
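One common approach is to score several candidate values of k with cross-validation and keep the best one. This sketch assumes scikit-learn and uses its bundled Iris dataset purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Example dataset (Iris), used only to demonstrate the search over k
X, y = load_iris(return_X_y=True)

# Evaluate candidate values of k with 5-fold cross-validation
scores = {}
for k in range(1, 16, 2):  # odd values of k help avoid ties in voting
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"Best k: {best_k} (mean accuracy {scores[best_k]:.3f})")
```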
Classification with k-NN:
In classification tasks, k-NN assigns the class label based on the majority class among the k-nearest neighbours. Each neighbour votes for its class, and the most frequent class becomes the prediction.
Example:
Suppose we are classifying an unknown data point based on three nearest neighbours (k=3), and the classes of these neighbours are:
- Neighbor 1: Class A
- Neighbor 2: Class A
- Neighbor 3: Class B
Since Class A occurs more frequently, the new point is assigned to Class A.
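In code, majority voting over the neighbours' labels can be done with `collections.Counter`; the labels below are simply the three votes from this example.

```python
from collections import Counter

# Labels of the k=3 nearest neighbours from the example above
neighbour_labels = ['Class A', 'Class A', 'Class B']

# Majority voting: the most common label becomes the prediction
predicted_class = Counter(neighbour_labels).most_common(1)[0][0]
print(predicted_class)  # Class A
```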
Regression with k-NN:
In regression tasks, k-NN predicts the target value based on the average of the values of its k nearest neighbours.
Example:
To predict the price of a house, k-NN will find k houses with similar features (square footage, number of rooms) and return the average price of those k houses as the predicted price.
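A minimal sketch of this idea with scikit-learn's `KNeighborsRegressor`, using an invented toy dataset of houses (square footage, number of rooms, price) for illustration:

```python
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Toy dataset (invented): [square footage, number of rooms] -> price
X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [1100, 2]])
y = np.array([245000, 312000, 279000, 308000, 199000])

# k-NN regression averages the prices of the k nearest houses
knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X, y)

new_house = np.array([[1500, 3]])
print(f"Predicted price: {knn_reg.predict(new_house)[0]:.0f}")
```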
Steps in k-NN Algorithm:
- Data Storage: k-NN stores the entire dataset of training examples.
- Distance Calculation: For a new input data point, compute the distance between the input and every point in the training set using a chosen distance metric.
- Identify Neighbors: Sort the distances and identify the k nearest neighbours.
- Prediction (see the sketch after this list):
  - For classification, count the occurrences of each class among the k neighbours and assign the class with the majority of votes.
  - For regression, take the average of the k neighbours' target values.
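These steps can be condensed into a short from-scratch sketch using NumPy and Euclidean distance; the function name `knn_predict` is purely illustrative, and the data is the fruit example used later in this article.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Illustrative k-NN classifier following the steps above."""
    # Distance Calculation: Euclidean distance from x_new to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Identify Neighbors: indices of the k nearest neighbours
    nearest = np.argsort(distances)[:k]
    # Prediction: majority vote among the k neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Data Storage: "store" the training data
X_train = np.array([[150, 8], [170, 7], [130, 6], [140, 5]])
y_train = np.array(['Apple', 'Apple', 'Orange', 'Orange'])

print(knn_predict(X_train, y_train, np.array([160, 7]), k=3))  # Apple
```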
Problem Statement:
Given a dataset with labelled examples (for classification) or continuous targets (for regression), the goal is to classify or predict new input data points by finding the k most similar data points in the training set and using them to infer the output.
Key Concepts:
- Lazy Learning: k-NN is called a lazy learner because it doesn't explicitly learn a model during training but simply stores the training dataset.
- Similarity: The similarity between data points is quantified by calculating the distance between their feature vectors.
- Majority Voting: For classification, the class of a new data point is determined by the majority class among its k nearest neighbours.
- Averaging: For regression, the predicted value is the average of the k nearest neighbours' target values.
Time Complexity:

- Training Time Complexity: k-NN doesn't require a training phase; fitting simply stores the data, so training takes constant time, O(1).
- Prediction Time Complexity: Predicting the class or value for a new data point requires computing the distance between the new point and all n training points, each of which has d dimensions, giving O(n · d) per query.
Space Complexity:

- Space Complexity: The algorithm stores the entire dataset, which consists of n points in d dimensions, requiring O(n · d) space.
Example:
Consider a simple k-NN classification example for predicting whether a fruit is an apple or an orange based on its features (weight and color):
- Dataset:
| Weight (g) | Color (scale 1-10) | Fruit |
|------------|--------------------|---------|
| 150 | 8 | Apple |
| 170 | 7 | Apple |
| 130 | 6 | Orange |
| 140 | 5 | Orange |
Step-by-Step Execution:

- Store Dataset: Store the dataset as-is.
- Calculate Distances: For a new fruit with a weight of 160 g and a color value of 7, compute the distance from this point to all existing data points.
- Find k Nearest Neighbors: If k=3, identify the 3 closest fruits to the new one based on the shortest distances.
- Make Prediction: Count the class occurrences (Apple or Orange) among the 3 nearest neighbours and assign the new fruit to the most frequent class.
Python Implementation:
Here's a simple implementation of the k-NN algorithm using scikit-learn:

```python
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Sample data
X = np.array([[150, 8], [170, 7], [130, 6], [140, 5]])  # Features (Weight, Color)
y = np.array(['Apple', 'Apple', 'Orange', 'Orange'])    # Target (Fruit)

# Create the k-NN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)

# "Train" the model (k-NN simply stores the training data)
knn.fit(X, y)

# Make a prediction for a new fruit
new_fruit = np.array([[160, 7]])  # New fruit with weight 160 g and color 7
predicted_fruit = knn.predict(new_fruit)

print(f"The predicted fruit is: {predicted_fruit[0]}")
```
Summary:
The k-Nearest Neighbors Algorithm is a straightforward and intuitive method for both classification and regression tasks. It works by finding the k most similar examples in the training dataset and using them to predict the class or value of a new data point. While easy to implement, k-NN can be computationally expensive, especially on large datasets, as it requires calculating the distance to every training point at prediction time.