Histogram-Based Outlier Score (HBOS)
Overview
Histogram-Based Outlier Score (HBOS) is an unsupervised anomaly detection technique that assumes the features are independent and therefore analyzes each one separately. It builds a histogram for each feature and evaluates the "outlierness" of a data point based on the histogram bins its feature values fall into.
The main idea is that points falling in bins with low frequencies (rare occurrences) are more likely to be considered outliers.
How HBOS Works
- Feature Independence Assumption:
  - HBOS assumes that all features are independent, which simplifies the computation.
  - This allows the detection process to analyze each feature individually without considering multivariate dependencies.
- Creating Histograms for Each Feature:
  - For each feature in the dataset, a histogram is created by dividing the range of the feature into several bins.
  - The number of bins and the binning strategy can be set manually or determined automatically (a common choice is the square root of the number of samples).
- Scoring Data Points:
  - The outlier score for a data point is based on the inverse of the bin frequency for each feature: if a data point falls into a bin with a low frequency (i.e., few samples), it receives a higher outlier score.
  - The final outlier score is the product of the per-feature scores (assuming independence); equivalently, the sum of negative logarithmic scores can be used for numerical stability.
- Normalization of Scores:
  - Scores are typically normalized to a fixed range, such as [0, 1], to facilitate interpretation (see the sketch after this list).
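To make these steps concrete, here is a minimal NumPy sketch (not pyod's implementation): it builds one equal-width histogram per feature, scores each point by summing negative log bin probabilities, and min-max normalizes the result to [0, 1]. The bin count and the epsilon guard are illustrative choices.

import numpy as np

def hbos_scores(X, n_bins=10, eps=1e-6):
    """Minimal HBOS sketch: per-feature histograms, summed negative log probabilities."""
    n_samples, n_features = X.shape
    scores = np.zeros(n_samples)
    for i in range(n_features):
        # Step 1: equal-width histogram over the observed range of feature i
        counts, edges = np.histogram(X[:, i], bins=n_bins)
        # Step 2: index of the bin each value falls into (0 .. n_bins - 1)
        bin_idx = np.digitize(X[:, i], edges[1:-1])
        prob = counts[bin_idx] / n_samples
        # Step 3: accumulate -log P_i(x_i); eps guards against empty bins
        scores += -np.log(prob + eps)
    # Step 4: min-max normalize to [0, 1] for easier interpretation
    return (scores - scores.min()) / (scores.max() - scores.min())

# Example: 200 standard-normal points plus one appended, obvious outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), [[8.0, 8.0, 8.0]]])
print(hbos_scores(X)[-1])  # the appended point is scored as a strong outlier (near 1)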
Mathematical Formulation
Let x = (x₁, x₂, ..., xₙ) be a data point in an n-dimensional feature space. The outlier score for the data point x, denoted as HBOS(x), is computed based on the frequency of each feature value in the histogram bins.
Step 1: Construct Histograms
For each feature i, construct a histogram with bᵢ bins. The frequency of data points falling within each bin is used to estimate the probability distribution for the feature.
Step 2: Calculate the Probability for Each Feature
The probability for a feature value xᵢ to fall within a particular bin is given by: Pᵢ(xᵢ) = (count of data points in the bin containing xᵢ) / (total number of data points)
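For example, if 5 of 1,000 data points fall in the bin containing xᵢ, then Pᵢ(xᵢ) = 5 / 1000 = 0.005.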
Step 3: Compute the HBOS Score
The HBOS score for the data point x is then calculated as the product of the inverse probabilities for each feature (assuming feature independence):
HBOS(x) = ∏ (1 / Pᵢ(xᵢ)), for i = 1 to n
Alternatively, a logarithmic version can be used to improve numerical stability:
HBOS(x) = ∑ -log(Pᵢ(xᵢ)), for i = 1 to n
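As a small worked example (using the natural logarithm): with two features where P₁(x₁) = 0.02 and P₂(x₂) = 0.5, the product form gives (1 / 0.02) × (1 / 0.5) = 50 × 2 = 100, while the logarithmic form gives -ln(0.02) - ln(0.5) ≈ 3.91 + 0.69 ≈ 4.61 = ln(100). Since the logarithm is monotonic, both forms rank points identically.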
Advantages of HBOS
- Computational Efficiency: Since each feature is processed independently, HBOS is computationally efficient and can handle large datasets with high dimensions.
- Scalability: It scales well with the number of data points and features, making it suitable for big data applications.
- Interpretability: The algorithm is straightforward to understand and interpret, especially in cases where the distributions of features are known.
Limitations of HBOS
- Independence Assumption: The primary limitation is the assumption that features are independent. This may not hold in many real-world datasets, leading to less effective anomaly detection.
- Binning Sensitivity: The choice of the number of bins and binning strategy can significantly affect the results. Improper binning may lead to missed or false detections of anomalies.
- Univariate Approach: As it handles each feature separately, it may not detect anomalies that only manifest in the interaction between multiple features (for example, a point whose individual feature values are each common, but whose combination is rare).
Use Cases for HBOS
- Network Intrusion Detection: Identifying unusual patterns in network traffic based on individual features like packet size, duration, or frequency.
- Credit Card Fraud Detection: Detecting unusual transactions by evaluating independent features such as transaction amount or frequency.
- Sensor Data Monitoring: Identifying anomalies in IoT sensor data by assessing readings for individual sensors.
Implementation Example (Python)
Here's a basic example of how HBOS can be implemented using Python with the pyod library:
from pyod.models.hbos import HBOS
from sklearn.datasets import make_classification
# Generate synthetic, imbalanced data (the small minority class stands in for outliers)
X, _ = make_classification(n_samples=1000, n_features=10, weights=[0.9], random_state=42)
# Initialize the HBOS detector; contamination is the expected fraction of outliers
hbos = HBOS(n_bins=10, alpha=0.1, contamination=0.1)
# Fit the model
hbos.fit(X)
# Predict anomaly scores
anomaly_scores = hbos.decision_function(X)
# Get binary labels (0: inlier, 1: outlier)
anomaly_labels = hbos.predict(X)
print("Anomaly Scores:", anomaly_scores)
print("Anomaly Labels:", anomaly_labels)