Improving performance of Local outlier factor with KD-Trees
Tweet
Local outlier factor (LOF) is an outlier detection algorithm, that detects outliers based on comparing local density of data instance with its neighbors. It does so to decide if data instance belongs to region of similar density. It can detect an outlier in a dataset, for which number of clusters is unknown, and clusters are of different density and size. It's inspired from KNN (K-Nearest Neighbors) algorithm, and is widely used. There is a R implemantation available.
The naive approach to do this is to form all pair euclidan distance matrix, and then run knn query to proceed further. But this approach just sucks, as it is $\Theta(n^2)$ in terms of both space and time complexity. But, this can be improvd with KDTrees., and already its implementation exists in python, thanks to scipy, so lets use this to find outliers.
Synthetic dataset
%pylab inline
import numpy as np
np.random.seed(2) # to reproduce the result
Populating the interactive namespace from numpy and matplotlib
dim = 2 # number of dimensions of dataset = 2
# cluster of normal random variable moderately dense
data1 = np.random.np.random.multivariate_normal([0, 1500], [[100000, 0], [0, 100000]], 2000)
# very dense
data2 = np.random.np.random.multivariate_normal([2000, 0], [[10000, 0], [0, 10000]], 2500)
# sparse
data3 = np.random.np.random.multivariate_normal([2500, 2500], [[100000, 0], [0, 100000]], 500)
# mix the three dataset and shuffle
data = np.vstack((np.vstack((data1, data2)), data3))
np.random.shuffle(data)
# add some noise : zipf is skewed distribution and can have extreme values(outliers)
zipf_alpha = 2.25
noise = np.random.zipf(zipf_alpha, (5000,dim)) * np.sign((np.random.randint(2, size = (5000, dim)) - 0.5))
data += noise
Naive approach to LOF
Pairwise Euclidean distance calculation with DistanceMetric implementation in scikit-learn. In this, we just compute all-pair euclidean distance, i.e. $d(i, j) = |x(i)-x(j)|_2$.
from sklearn.neighbors import DistanceMetric
# distance between points
import time
tic = time.time()
dist = DistanceMetric.get_metric('euclidean').pairwise(data)
print '++ took %g msecs for Distance computation' % ((time.time() - tic)* 1000)
++ took 740 msecs for Distance computation
Performing KNN query.In this step, the nearest k neighbors are identified $N_k(i)$, and radius is the distance of k-th rearest neighbor of a datapoint.
tic = time.time()
k = 17 # number of neighbors to consider
# get the radius for each point in dataset (distance to kth nearest neighbor)
# radius is the distance of kth nearest point for each point in dataset
idx_knn = np.argsort(dist, axis=1)[:,1 : k + 1] # by row' get k nearest neighbour
radius = np.linalg.norm(data - data[idx_knn[:, -1]], axis = 1) # radius
print '+++ took %g msecs for KNN Querying' % ((time.time() - tic)* 1000)
+++ took 4800 msecs for KNN Querying
Then LRD(Local Reachability distance) is calculated. For this, first reach distance $rd(i, j)$ is computed between point concern $x(i)$ and its neighbors $ j:j\in N_k(i)$, which is the maximum of euclidean distance or radius $r(i)$ of point concerned. Then, LRD is the inverse of mean of reach distance of all k-neighbors of each point.
# calculate the local reachability density
tic = time.time()
LRD = []
for i in range(idx_knn.shape[0]):
LRD.append(np.mean(np.maximum(dist[i, idx_knn[i]], radius[idx_knn[i]])))
print '++++ took %g msecs for LRD computation' % ((time.time() - tic)* 1000)
++++ took 429 msecs for LRD computation
finally, the outlier score $LOF$ is calsulated.
# calculating the outlier score
tic = time.time()
rho = 1. / np.array(LRD) # inverse of density
outlier_score = np.sum(rho[idx_knn], axis = 1)/ np.array(rho, dtype = np.float16)
outlier_score *= 1./k
print '+++++ took %g msecs for Outlier scoring' % ((time.time() - tic)* 1000)
+++++ took 9.99999 msecs for Outlier scoring
Now lets se the histogram of Outlier score, to choose the optimal threshold to decid weather a data-point is outlier is not.
weights = np.ones_like(outlier_score)/outlier_score.shape[0] # to normalize the histogram to probability plot
hist(outlier_score, bins = 50, weights = weights, histtype = 'stepfilled', color = 'cyan')
title('Distribution of outlier score')
<matplotlib.text.Text at 0x36030588>
It can be observd that, the optimal outlier score threshold to decide weather a data-point is outlier is outlier or not is around 2 for most of the cases, so lets use it to see our sesults.
threshold = 2.
# plot non outliers as green
scatter(data[:, 0], data[:, 1], c = 'green', s = 10, edgecolors='None', alpha=0.5)
# find the outliers and plot te outliers
idx = np.where(outlier_score > threshold)
scatter(data[idx, 0], data[idx, 1], c = 'red', s = 10, edgecolors='None', alpha=0.5)
<matplotlib.collections.PathCollection at 0x3640e6a0>
We have seen the results of LOF with naive approachfor KNN queries. Now lets see optimisations with KD-Trees.
Using KD Trees
KD-Trees insertion and KNN query.
from sklearn.neighbors import KDTree as Tree
tic = time.time()
BT = Tree(data, leaf_size=5, p=2)
# Query for k nearest, k + 1 because one of the returnee is self
dx, idx_knn = BT.query(data[:, :], k = k + 1)
print '++ took %g msecs for Tree KNN Querying' % ((time.time() - tic)* 1000)
++ took 122 msecs for Tree KNN Querying
LRD computation.
tic = time.time()
dx, idx_knn = dx[:, 1:], idx_knn[:, 1:]
# get the radius for each point in dataset
# radius is the distance of kth nearest point for each point in dataset
radius = dx[:, -1]
# calculate the local reachability density
LRD = np.mean(np.maximum(dx, radius[idx_knn]), axis = 1)
print '++ took %g msecs for LRD computation' % ((time.time() - tic)* 1000)
++ took 8.99982 msecs for LRD computation
Now, rest is same, so, i'm just replicating the rsult for completion.
# calculating the outlier score
tic = time.time()
rho = 1. / np.array(LRD) # inverse of density
outlier_score = np.sum(rho[idx_knn], axis = 1)/ np.array(rho, dtype = np.float16)
outlier_score *= 1./k
print '+++++ took %g msecs for Outlier scoring' % ((time.time() - tic)* 1000)
# plotiing the histogram of outlier score
weights = np.ones_like(outlier_score)/outlier_score.shape[0] # to normalize the histogram to probability plot
hist(outlier_score, bins = 50, weights = weights, histtype = 'stepfilled', color = 'cyan')
title('Distribution of outlier score')
#plotting the result
threshold = 2.
# plot non outliers as green
figure()
scatter(data[:, 0], data[:, 1], c = 'green', s = 10, edgecolors='None', alpha=0.5)
# find the outliers and plot te outliers
idx = np.where(outlier_score > threshold)
scatter(data[idx, 0], data[idx, 1], c = 'red', s = 10, edgecolors='None', alpha=0.5)
+++++ took 4.00019 msecs for Outlier scoring
<matplotlib.collections.PathCollection at 0x36ad0b38>
The results are same, and should be.
Putting everything together
Lets create a class, to combine evrything together. It will be important in evaluating performance. From above results, we note that the most time is spent for KNN querying. Also, i will create a method to compare the performances and see the results.
Now, lets compare the performace of 2 methods- Naive and KDTree implementations.
perf_test(methods = ['Tree', 'Naive'], n_list = [2 ** i for i in range(4, 14)], plot = True)
We see that KDTree outperforms Naive method for narge $n$, but it may not do well for small number of datasets. In my PC, i cannot run Naive method beyond $2^{13}$ datapoints, or else i receie MemoryError. So, lets evauate te performance of KDTrees upto 1Million datapoints.
perf_test(methods = ['Tree'], n_list = [2 ** i for i in range(4, 21)], plot = True)
We can see, algorithm is scaling well with data-set size $n$. If we analyse the complexity of algorithm, its linearithmic , i.e. $\Theta (n\log{n})$.
Download this post as:
Discuss this post via e-mail