Principal Component Analysis (PCA)
Definition:
Principal Component Analysis (PCA) is an unsupervised learning algorithm for dimensionality reduction. It transforms data into a new coordinate system where the greatest variance by any projection lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
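In matrix form (a standard formulation, with symbols introduced here for illustration): if X is the mean-centered n × d data matrix, PCA eigendecomposes the sample covariance matrix and projects the data onto its top eigenvectors:

```latex
C = \frac{1}{n-1} X^{\top} X, \qquad C w_i = \lambda_i w_i, \qquad T = X W_k
```

Here W_k stacks the k eigenvectors with the largest eigenvalues, and the columns of T are the principal-component scores.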
Characteristics:
- Dimensionality Reduction: PCA reduces the number of features in a dataset while retaining as much variability as possible.
- Variance Maximization: The principal components are chosen to maximize the variance of the projected data (see the sketch after this list).
- Linear Transformation: PCA is a linear transformation technique that projects the data onto a lower-dimensional space.
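As a quick check of the variance-maximization property, here is a minimal NumPy sketch (the toy data and variable names are illustrative, not from the article): the variance of the data projected onto the first principal component is at least as large as along any other unit direction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))           # toy data: 200 samples, 4 features
X = X - X.mean(axis=0)                  # center the data

cov = np.cov(X, rowvar=False)           # 4 x 4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
pc1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue

direction = rng.normal(size=4)
direction /= np.linalg.norm(direction)  # arbitrary unit direction

print(np.var(X @ pc1))        # variance along the first principal component...
print(np.var(X @ direction))  # ...is never smaller than along this direction
```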
Steps Involved:
- Standardize the Data: Center the data by subtracting the mean, and scale to unit variance if necessary.
- Compute the Covariance Matrix: Calculate the covariance matrix to understand how the variables relate to one another.
- Calculate Eigenvalues and Eigenvectors: Determine the eigenvalues and eigenvectors of the covariance matrix to identify the principal components.
- Sort Eigenvalues and Eigenvectors: Sort the eigenvalues in descending order and select the top k eigenvectors corresponding to the largest eigenvalues.
- Transform the Data: Project the original data onto the new feature space defined by the selected eigenvectors (a from-scratch sketch of all five steps follows this list).
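The five steps map directly to a few lines of NumPy. The following is a from-scratch sketch (the function name and structure are my own; scikit-learn's PCA, shown later, is the practical choice):

```python
import numpy as np

def pca_from_scratch(X, k):
    """Reduce X (n samples x d features) to k dimensions. Illustrative only."""
    # 1. Standardize: subtract the mean (scaling to unit variance is optional).
    X_centered = X - X.mean(axis=0)
    # 2. Covariance matrix of the features (d x d).
    cov = np.cov(X_centered, rowvar=False)
    # 3. Eigenvalues and eigenvectors of the symmetric covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Sort eigenvalues in descending order; keep the top-k eigenvectors.
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:k]]
    # 5. Project the centered data onto the selected eigenvectors.
    return X_centered @ W
```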
Problem Statement:
Given a high-dimensional dataset, PCA aims to reduce its dimensionality while preserving as much information (variance) as possible. This is particularly useful for visualization and reducing computational costs in subsequent analyses.
Key Concepts:
- Eigenvalue: A scalar that indicates how much variance is captured by each principal component.
- Eigenvector: A direction along which a transformation acts; in PCA, each eigenvector represents a principal component.
- Explained Variance Ratio: The proportion of variance explained by each principal component; it is useful for determining how many components to retain (see the sketch after this list).
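For instance, here is a short sketch that uses scikit-learn's explained_variance_ratio_ attribute to choose the number of components (the 95% threshold is an illustrative convention, not from the article):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
pca = PCA().fit(X)  # keep all components so every ratio can be inspected

cumulative = np.cumsum(pca.explained_variance_ratio_)
k = int(np.searchsorted(cumulative, 0.95)) + 1  # smallest k reaching >= 95%
print(pca.explained_variance_ratio_)
print(f"components to retain: {k}")
```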
Split Criteria:
PCA does not involve splitting data like supervised learning; instead, it focuses on transforming all available data into a lower-dimensional space based on variance maximization.
Time Complexity:
- Training Complexity: Computing PCA via the covariance matrix typically involves matrix operations with a time complexity of roughly O(nd² + d³), where n is the number of samples and d is the number of features (the d³ term comes from the eigendecomposition).
- Prediction Complexity: Projecting a new data point onto the components costs O(dk), where k is the number of principal components retained.
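These costs are why approximate solvers matter on large matrices; for example, scikit-learn's PCA accepts svd_solver="randomized" (a usage sketch on an arbitrary toy matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(10_000, 500))  # toy: large n and d
# The randomized solver trades a little accuracy for much lower runtime.
pca = PCA(n_components=10, svd_solver="randomized", random_state=0)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # (10000, 10)
```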
Space Complexity:
- Space Complexity: The space required is dominated by storing the covariance matrix and the eigenvectors, which is O(d²) for the d × d covariance matrix plus O(dk) for the k retained eigenvectors.
Example:
Consider a scenario where we want to reduce features from a dataset containing measurements of different attributes of flowers (e.g., sepal length, sepal width, petal length, petal width).
Dataset Example:

| Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) |
|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 |
| 4.9 | 3.0 | 1.4 | 0.2 |
| 4.7 | 3.2 | 1.3 | 0.2 |
| 4.6 | 3.1 | 1.5 | 0.2 |
Step-by-Step Execution:
- Input Data: The model receives training data with multiple features (sepal length, sepal width, etc.).
- Standardize Data: Center and scale each feature to zero mean and unit variance.
- Compute Covariance Matrix: Calculate how the features vary together.
- Calculate Eigenvalues/Eigenvectors: Find the eigenvalues and eigenvectors of the covariance matrix.
- Select Principal Components: Choose the top k components based on explained variance.
- Transform Data: Project the original data onto the selected principal components for a reduced representation (a worked run on the four samples above follows this list).
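Running these steps end to end on the four samples from the table above (a minimal sketch; with only four rows the printed numbers are purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# The four flower samples from the table above.
X = np.array([[5.1, 3.5, 1.4, 0.2],
              [4.9, 3.0, 1.4, 0.2],
              [4.7, 3.2, 1.3, 0.2],
              [4.6, 3.1, 1.5, 0.2]])

X_std = StandardScaler().fit_transform(X)  # zero mean, unit variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)       # covariance, eigendecomposition,
                                           # sorting, and projection in one call
print(X_reduced)                           # 4 samples x 2 components
print(pca.explained_variance_ratio_)
```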
Python Implementation:
Here's a basic implementation of PCA using scikit-learn:
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load dataset
iris = load_iris()
X = iris.data
# Create PCA model
pca = PCA(n_components=2)
# Fit model and transform data
X_reduced = pca.fit_transform(X)
# Visualize results
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=iris.target)
plt.title('PCA of Iris Dataset')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.colorbar()
plt.show()
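To check how much variance the two components retain, and to map reduced points back to the original feature space, the example can be extended (a sketch that reuses pca and X_reduced from above; inverse_transform is a standard PCA method):

```python
import numpy as np

# Continues from the example above.
print(pca.explained_variance_ratio_)        # variance captured per component
print(pca.explained_variance_ratio_.sum())  # total variance retained

X_restored = pca.inverse_transform(X_reduced)  # back to the 4 iris features
print(np.abs(X - X_restored).max())            # worst-case reconstruction error
```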