Histogram-Based Outlier Score (HBOS)
Overview
Histogram-Based Outlier Score (HBOS) is an unsupervised anomaly detection technique that treats features as independent and scores each one separately. It builds a histogram for each feature and then evaluates the "outlierness" of a data point from the bins its feature values fall into.
The main idea is that points falling in bins with low frequencies (rare occurrences) are more likely to be considered outliers.
How HBOS Works
- Feature Independence Assumption:
  - HBOS assumes that all features are independent, which simplifies the computation.
  - This allows the detection process to analyze each feature individually without considering multivariate dependencies.
- Creating Histograms for Each Feature:
  - For each feature in the dataset, a histogram is created by dividing the range of the feature into several bins.
  - The number of bins and the binning strategy can be set manually or determined automatically.
- Scoring Data Points:
  - The outlier score for a data point is calculated based on the inverse of the bin frequency for each feature. If a data point falls into a bin with a low frequency (i.e., fewer samples), it is assigned a higher outlier score.
  - The final outlier score for a data point is often the product of the individual scores for each feature (assuming independence). Alternatively, the sum of logarithmic scores can be used for numerical stability.
- Normalization of Scores:
  - Scores are typically normalized to fall within a certain range, such as [0, 1], to facilitate interpretation.
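The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `hbos_scores`, the `n_bins` parameter, the equal-width binning, and the min-max scaling are all assumptions chosen for the example.

```python
import numpy as np


def hbos_scores(X, n_bins=10):
    """Illustrative HBOS sketch: one score in [0, 1] per row, higher = more anomalous."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    log_scores = np.zeros(n_samples)

    for j in range(n_features):
        # Equal-width histogram for feature j (an assumed binning strategy).
        counts, edges = np.histogram(X[:, j], bins=n_bins)
        # Locate each value's bin using the interior edges; the maximum value
        # lands in the last bin, matching np.histogram's closed final bin.
        bin_idx = np.digitize(X[:, j], edges[1:-1])
        # Rare bins have a low relative frequency, hence a high -log score.
        prob = counts[bin_idx] / n_samples
        log_scores += -np.log(prob)

    # Min-max normalization to [0, 1] (one possible choice of scaling).
    span = log_scores.max() - log_scores.min()
    if span == 0:
        return np.zeros(n_samples)
    return (log_scores - log_scores.min()) / span


# Hypothetical data: 50 inliers packed into [0, 0.9] plus one distant point.
inliers = np.column_stack([np.linspace(0.0, 0.9, 50), np.linspace(0.9, 0.0, 50)])
X = np.vstack([inliers, [[10.0, 10.0]]])
scores = hbos_scores(X)
print(scores.argmax())  # index of the highest-scoring row
```

Because every scored point was also used to build the histograms, each point's bin contains at least that point, so the logarithm never receives zero. Scoring unseen data would need a rule for empty bins, which this sketch does not cover.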
Mathematical Formulation
Let x = (x₁, x₂, ..., xₙ) be a data point in an n-dimensional feature space. The outlier score for the data point x, denoted as HBOS(x), is computed based on the frequency of each feature value in the histogram bins.
Step 1: Construct Histograms
For each feature i, construct a histogram with bᵢ bins. The frequency of data points falling within each bin is used to estimate the probability distribution for the feature.
Step 2: Calculate the Probability for Each Feature
The probability that a feature value xᵢ falls within its bin is estimated by the bin's relative frequency:
Pᵢ(xᵢ) = (count of data points in the bin containing xᵢ) / (total number of data points)
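As a small worked case of this estimate, the sketch below computes Pᵢ for every value of a single feature. The helper name `bin_probabilities` and the six sample values are made up for illustration, and equal-width bins from `np.histogram` are an assumed binning choice.

```python
import numpy as np


def bin_probabilities(values, n_bins):
    """Relative frequency of the bin that contains each value (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=n_bins)
    # Interior edges map each value to a bin index in 0..n_bins-1.
    bin_idx = np.digitize(values, edges[1:-1])
    return counts[bin_idx] / len(values)


# Made-up feature: five values near 1.0 and one far away at 5.0.
feature = [1.0, 1.2, 1.1, 0.9, 1.3, 5.0]
probs = bin_probabilities(feature, n_bins=5)
print(probs)
```

With five bins spanning 0.9 to 5.0, the five clustered values share the first bin (5 of 6 points) while 5.0 sits alone in the last bin (1 of 6 points), so it receives the lowest probability.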
Step 3: Compute the HBOS Score
The HBOS score for the data point x is then calculated as the product of the inverse probabilities for each feature (assuming feature independence):
HBOS(x) = ∏ (1 / Pᵢ(xᵢ)), for i = 1 to n
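Alternatively, the logarithm of this product can be used to improve numerical stability. Because the logarithm is monotonic, it ranks data points in the same order as the product:
log HBOS(x) = ∑ -log(Pᵢ(xᵢ)), for i = 1 to n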
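To make the stability argument concrete, the following sketch compares the two forms. The function names and the probability values are hypothetical; the only claims are the arithmetic identity log ∏ (1 / Pᵢ) = ∑ -log(Pᵢ) and the finite range of 64-bit floats.

```python
import numpy as np


def hbos_product(probs):
    """Product of inverse per-feature bin probabilities."""
    return np.prod(1.0 / np.asarray(probs, dtype=float))


def hbos_log_sum(probs):
    """Sum of negative log per-feature bin probabilities."""
    return np.sum(-np.log(np.asarray(probs, dtype=float)))


# Hypothetical bin probabilities for a point with three features.
small = [0.5, 0.1, 0.02]
product_small = hbos_product(small)   # 2 * 10 * 50
log_small = hbos_log_sum(small)       # equals the log of the product

# 400 features, each in a bin holding 0.1% of the data (made-up values).
many = np.full(400, 1e-3)
with np.errstate(over="ignore"):
    product_many = hbos_product(many)  # exceeds the float64 range
log_many = hbos_log_sum(many)          # stays finite
```

For the three-feature case the two forms agree up to the logarithm. For the 400-feature case the product overflows to infinity, so all such points would tie, while the log-sum remains an ordinary finite number.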