Gaussian Mixture Models (GMM)
Definition:
A Gaussian Mixture Model (GMM) is a probabilistic model that represents a dataset as a mixture of multiple Gaussian distributions. GMMs are commonly applied in clustering tasks where the data is assumed to be generated from several Gaussian-distributed subpopulations, but the subpopulation to which a particular data point belongs is unknown.
Characteristics:
- Probabilistic Model: GMM represents each data point as belonging to one of several Gaussian distributions, each with its own mean, covariance, and weight (prior probability).
- Soft Clustering: Unlike methods such as K-Means, which assign each data point to exactly one cluster, GMM assigns each data point a probability of belonging to each cluster.
- Multivariate Gaussian Distributions: GMMs can model data in multiple dimensions, with each Gaussian component having its own covariance matrix, allowing for flexibility in cluster shapes.
Components of GMM:
- Gaussian Components: Each component in a GMM is a Gaussian distribution, defined by a mean vector, a covariance matrix, and a mixing coefficient. The model assumes that data points are drawn from these Gaussian distributions.
- Means: The mean vector defines the center of each Gaussian distribution, representing the location of the cluster in the feature space.
- Covariance Matrix: This matrix defines the shape and orientation of each Gaussian distribution. Depending on the covariance type (full, diagonal, tied, spherical), GMM can model clusters with different sizes and orientations.
- Mixing Coefficients: Mixing coefficients (or weights) represent the proportion of data points that belong to each Gaussian component, summing to 1 across all components.
- Expectation-Maximization (EM) Algorithm: GMM parameters are typically estimated with the EM algorithm, which iteratively assigns probabilities to each data point and updates the model parameters to maximize the likelihood of the data (the mixture density being maximized is written out after this list).
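Taken together, these pieces define the density a GMM assigns to a point x. In standard notation (this is the textbook form of the model, not a formula quoted from a specific library):

$$
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \; \pi_k \ge 0
$$

Here $\mu_k$ is the mean vector, $\Sigma_k$ the covariance matrix, and $\pi_k$ the mixing coefficient of the k-th of K components.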
GMM Architecture:
- Input Data: The input to a GMM is a set of multi-dimensional data points, such as 2D or 3D points, where the goal is to identify underlying subgroups or clusters within the data.
- Gaussian Components: The model assumes that the data comes from a mixture of Gaussian distributions, each representing a potential cluster in the data.
- EM Algorithm: The Expectation-Maximization algorithm alternates between two steps:
  - Expectation (E-step): Calculates the probability of each data point belonging to each Gaussian component (soft assignments); the responsibility formula used here is given after this list.
  - Maximization (M-step): Updates the parameters of each Gaussian component (mean, covariance, and weights) based on the responsibilities computed in the E-step.
- Cluster Assignment: After convergence, data points are assigned to clusters based on their highest probability of belonging to a particular Gaussian component.
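For reference, the soft assignment (responsibility) computed in the E-step follows from Bayes' rule; in standard notation, for data point $x_i$ and component $k$:

$$
\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}
$$

The M-step then re-estimates the weights, means, and covariances as responsibility-weighted averages of the data.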
Covariance Types in GMM:
- Full Covariance: Each Gaussian component has its own general covariance matrix, allowing clusters to take on any elliptical shape and orientation.
- Diagonal Covariance: Each Gaussian component has its own diagonal covariance matrix, implying that features are independent within each component but can have different variances, so clusters are axis-aligned ellipsoids.
- Tied Covariance: All Gaussian components share the same covariance matrix, limiting the flexibility of cluster shapes.
- Spherical Covariance: Each Gaussian component has a covariance matrix that is a scaled identity matrix, meaning each cluster is spherical; each component keeps its own single variance, so radii can differ between components. The snippet after this list shows how these options map to scikit-learn.
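In scikit-learn these four options correspond directly to the covariance_type parameter of GaussianMixture (note that the diagonal option is spelled "diag"). A minimal sketch; the placeholder data X and the choice of 3 components are illustrative only:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder data for illustration: 300 points in 2 dimensions.
X = np.random.RandomState(0).randn(300, 2)

for cov_type in ["full", "tied", "diag", "spherical"]:
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type, random_state=0).fit(X)
    # covariances_ has a different shape per covariance type:
    #   full -> (k, d, d), tied -> (d, d), diag -> (k, d), spherical -> (k,)
    print(cov_type, gmm.covariances_.shape)
```

The shape of the fitted covariances_ attribute makes the trade-off concrete: full covariance stores a d-by-d matrix per component, while spherical stores a single variance per component.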
Problem Statement:
Given a set of multi-dimensional data points, the goal of GMM is to model the data as being generated from a mixture of several Gaussian distributions. This helps in clustering the data points into different subpopulations based on their probability of belonging to each Gaussian component.
Key Concepts:
- Latent Variables: In a GMM, each data point is assumed to be associated with a hidden (latent) variable indicating which Gaussian component it comes from.
- Likelihood: GMM maximizes the likelihood of the data, i.e., the probability that the observed data was generated by the mixture of Gaussian distributions (written out after this list).
- Posterior Probability: The posterior probability gives the likelihood that a data point belongs to a particular Gaussian component; it is computed during the E-step of the EM algorithm.
- Convergence: The EM algorithm iterates between the E-step and M-step until the parameters converge, i.e., until the likelihood no longer changes significantly between iterations.
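For reference, the quantity EM maximizes is usually handled in log form; in standard notation, the log-likelihood of the data under the current parameters $\theta$ is

$$
\log L(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)
$$

and convergence is typically declared once the change in this value between iterations drops below a small tolerance.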
Steps Involved:
- Initialize Parameters: Initialize the means, covariance matrices, and mixing coefficients for each Gaussian component, either randomly or using a method like K-Means.
- Expectation Step: For each data point, calculate the probability that it belongs to each Gaussian component based on the current parameters.
- Maximization Step: Update the means, covariance matrices, and mixing coefficients of each Gaussian component based on the probabilities calculated in the E-step.
- Repeat: Repeat the E-step and M-step until the model parameters converge (a minimal NumPy sketch of this loop follows the list).
- Cluster Assignment: Assign each data point to the Gaussian component with the highest posterior probability, or retain the soft assignments for probabilistic clustering.
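A minimal NumPy sketch of this loop, assuming diagonal covariances for simplicity; the function name em_gmm_diag, the initialization scheme, and the tolerance are illustrative choices, not part of any library:

```python
import numpy as np

def em_gmm_diag(X, k, n_iter=100, tol=1e-6, seed=0):
    """Fit a GMM with diagonal covariances to X of shape (n, d) using EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # 1. Initialize: random data points as means, unit variances, uniform weights.
    means = X[rng.choice(n, size=k, replace=False)]            # (k, d)
    variances = np.ones((k, d))                                 # (k, d)
    weights = np.full(k, 1.0 / k)                               # (k,)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # 2. E-step: log N(x_i | mu_k, diag(var_k)) for every point/component pair.
        diff = X[:, None, :] - means[None, :, :]                # (n, k, d)
        log_pdf = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=2)
        log_weighted = np.log(weights) + log_pdf                # (n, k)
        log_norm = np.logaddexp.reduce(log_weighted, axis=1)    # (n,)
        resp = np.exp(log_weighted - log_norm[:, None])         # responsibilities (n, k)
        # 3. M-step: responsibility-weighted updates of weights, means, variances.
        nk = resp.sum(axis=0)                                   # effective counts (k,)
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        diff = X[:, None, :] - means[None, :, :]
        variances = np.einsum('nk,nkd->kd', resp, diff**2) / nk[:, None] + 1e-6
        # 4. Check convergence via the log-likelihood.
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, variances, resp
```

Hard cluster labels can then be obtained as resp.argmax(axis=1), mirroring the Cluster Assignment step above.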
Example:
```python
from sklearn.mixture import GaussianMixture
import numpy as np

# Generate synthetic data from two Gaussian clusters
np.random.seed(42)
data = np.vstack([
    np.random.normal(loc=0, scale=1, size=(100, 2)),
    np.random.normal(loc=5, scale=1.5, size=(100, 2))
])

# Fit a GMM with 2 components
gmm = GaussianMixture(n_components=2, covariance_type='full')
gmm.fit(data)

# Predict the cluster for each data point (hard assignments)
labels = gmm.predict(data)

# Probabilities of each data point belonging to each Gaussian component (soft assignments)
probs = gmm.predict_proba(data)

# Inspect the learned parameters
print("Means:", gmm.means_)
print("Covariances:", gmm.covariances_)
```
Summary & Applications of GMM:​
-
Clustering:
GMMs are widely used in clustering tasks, particularly when clusters are not well-separated or when clusters have different shapes and sizes. -
Anomaly Detection:
GMMs can be used to detect anomalies or outliers in the data by modeling the normal data as a mixture of Gaussians and flagging points with low likelihood under this model. -
Density Estimation:
GMMs are used for density estimation, allowing for probabilistic modeling of the data distribution, useful in scenarios where data is continuous. -
Speech Recognition:
GMMs are a key component in traditional speech recognition systems, where they model the distribution of feature vectors corresponding to different phonemes. -
Image Segmentation:
GMMs can be used for segmenting images into regions by modeling pixel intensities as a mixture of Gaussians.
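For the anomaly-detection use mentioned above, a common recipe is to threshold the per-sample log-likelihood returned by score_samples. A small sketch; the synthetic data and the 1st-percentile threshold are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(500, 2))    # "normal" data for illustration

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
log_likelihood = gmm.score_samples(X)                 # log p(x) for each sample

# Flag the lowest-likelihood points as candidate anomalies (threshold is illustrative).
threshold = np.percentile(log_likelihood, 1)
anomalies = X[log_likelihood < threshold]
print("Flagged", len(anomalies), "candidate outliers")
```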
Time Complexity:
- Training Complexity: Fitting a GMM with EM costs roughly O(n · k · d²) per iteration with full covariance matrices (plus O(k · d³) for factorizing the covariances), for an overall cost of about O(n · k · d² · i), where n is the number of data points, d is the dimensionality, k is the number of components, and i is the number of iterations until convergence.
- Prediction Complexity: For inference, the complexity is approximately O(n · k · d²), since each of the n data points must be evaluated under each of the k Gaussian components.
Space Complexity:
- Space Complexity: A GMM stores k mean vectors, k covariance matrices, and k mixing weights, which is O(k · d²) for full covariances (O(k · d) for diagonal or spherical). During training, the n × k responsibility matrix adds O(n · k), where n is the number of data points, d is the dimensionality, and k is the number of components.