Improving performance of Local outlier factor with KD-Trees


Local outlier factor (LOF) is an outlier detection algorithm, that detects outliers based on comparing local density of data instance with its neighbors. It does so to decide if data instance belongs to region of similar density. It can detect an outlier in a dataset, for which number of clusters is unknown, and clusters are of different density and size. It's inspired from KNN (K-Nearest Neighbors) algorithm, and is widely used. There is a R implemantation available.

The naive approach to do this is to form all pair euclidan distance matrix, and then run knn query to proceed further. But this approach just sucks, as it is $\Theta(n^2)$ in terms of both space and time complexity. But, this can be improvd with KDTrees., and already its implementation exists in python, thanks to scipy, so lets use this to find outliers.

Synthetic dataset

%pylab inline
import numpy as np
np.random.seed(2) # to reproduce the result

Populating the interactive namespace from numpy and matplotlib

dim = 2 # number of dimensions of dataset = 2
# cluster of normal random variable moderately dense
data1 = np.random.np.random.multivariate_normal([0, 1500], [[100000, 0], [0, 100000]], 2000)

# very dense
data2 = np.random.np.random.multivariate_normal([2000, 0], [[10000, 0], [0, 10000]], 2500)

# sparse
data3 = np.random.np.random.multivariate_normal([2500, 2500], [[100000, 0], [0, 100000]], 500)

# mix the three dataset and shuffle
data = np.vstack((np.vstack((data1, data2)), data3))
np.random.shuffle(data)

# add some noise : zipf is skewed distribution and can have extreme values(outliers)
zipf_alpha = 2.25
noise = np.random.zipf(zipf_alpha, (5000,dim)) * np.sign((np.random.randint(2, size = (5000, dim)) - 0.5))
data += noise

Naive approach to LOF

Pairwise Euclidean distance calculation with DistanceMetric implementation in scikit-learn. In this, we just compute all-pair euclidean distance, i.e. $d(i, j) = |x(i)-x(j)|_2$.

from sklearn.neighbors import DistanceMetric
 # distance between points
import time
tic = time.time()
dist = DistanceMetric.get_metric('euclidean').pairwise(data)
print '++ took %g msecs for Distance computation' %  ((time.time() - tic)* 1000)
++ took 740 msecs for Distance computation

Performing KNN query.In this step, the nearest k neighbors are identified $N_k(i)$, and radius is the distance of k-th rearest neighbor of a datapoint.

tic = time.time()
k = 17 # number of neighbors to consider
# get the radius for each point in dataset (distance to kth nearest neighbor)
# radius is the distance of kth nearest point for each point in dataset        
idx_knn = np.argsort(dist, axis=1)[:,1 : k + 1] # by row' get k nearest neighbour   
radius = np.linalg.norm(data - data[idx_knn[:, -1]], axis = 1) # radius
print '+++ took %g msecs for KNN Querying' %  ((time.time() - tic)* 1000)
+++ took 4800 msecs for KNN Querying

Then LRD(Local Reachability distance) is calculated. For this, first reach distance $rd(i, j)$ is computed between point concern $x(i)$ and its neighbors $ j:j\in N_k(i)$, which is the maximum of euclidean distance or radius $r(i)$ of point concerned. Then, LRD is the inverse of mean of reach distance of all k-neighbors of each point.

# calculate the local reachability density
tic = time.time()
LRD = []
for i in range(idx_knn.shape[0]):
    LRD.append(np.mean(np.maximum(dist[i, idx_knn[i]], radius[idx_knn[i]])))
print '++++ took %g msecs for LRD computation' %  ((time.time() - tic)* 1000)
++++ took 429 msecs for LRD computation

finally, the outlier score $LOF$ is calsulated.

# calculating the outlier score
tic = time.time()
rho = 1. / np.array(LRD) # inverse of density
outlier_score = np.sum(rho[idx_knn], axis = 1)/ np.array(rho, dtype = np.float16)
outlier_score *= 1./k
print '+++++ took %g msecs for Outlier scoring' %  ((time.time() - tic)* 1000)
+++++ took 9.99999 msecs for Outlier scoring

Now lets se the histogram of Outlier score, to choose the optimal threshold to decid weather a data-point is outlier is not.

weights = np.ones_like(outlier_score)/outlier_score.shape[0] # to normalize the histogram to probability plot
hist(outlier_score, bins = 50, weights = weights, histtype = 'stepfilled', color = 'cyan')
title('Distribution of outlier score')
<matplotlib.text.Text at 0x36030588>

center

It can be observd that, the optimal outlier score threshold to decide weather a data-point is outlier is outlier or not is around 2 for most of the cases, so lets use it to see our sesults.

threshold = 2.
# plot non outliers as green
scatter(data[:, 0], data[:, 1], c = 'green', s = 10, edgecolors='None', alpha=0.5)
# find the outliers and plot te outliers
idx = np.where(outlier_score > threshold)
scatter(data[idx, 0], data[idx, 1], c = 'red', s = 10, edgecolors='None', alpha=0.5)
<matplotlib.collections.PathCollection at 0x3640e6a0>

center

We have seen the results of LOF with naive approachfor KNN queries. Now lets see optimisations with KD-Trees.

Using KD Trees

KD-Trees insertion and KNN query.

from sklearn.neighbors import KDTree as Tree
tic = time.time()
BT = Tree(data, leaf_size=5, p=2)
# Query for k nearest, k + 1 because one of the returnee is self
dx, idx_knn = BT.query(data[:, :], k = k + 1)

print '++ took %g msecs for Tree KNN Querying' %  ((time.time() - tic)* 1000)
++ took 122 msecs for Tree KNN Querying

LRD computation.

tic = time.time()
dx, idx_knn = dx[:, 1:], idx_knn[:, 1:]
# get the radius for each point in dataset
# radius is the distance of kth nearest point for each point in dataset        
radius = dx[:, -1]
# calculate the local reachability density
LRD = np.mean(np.maximum(dx, radius[idx_knn]), axis = 1)

print '++ took %g msecs for LRD computation' %  ((time.time() - tic)* 1000)
++ took 8.99982 msecs for LRD computation

Now, rest is same, so, i'm just replicating the rsult for completion.

# calculating the outlier score
tic = time.time()
rho = 1. / np.array(LRD) # inverse of density
outlier_score = np.sum(rho[idx_knn], axis = 1)/ np.array(rho, dtype = np.float16)
outlier_score *= 1./k
print '+++++ took %g msecs for Outlier scoring' %  ((time.time() - tic)* 1000)

# plotiing the histogram of outlier score
weights = np.ones_like(outlier_score)/outlier_score.shape[0] # to normalize the histogram to probability plot
hist(outlier_score, bins = 50, weights = weights, histtype = 'stepfilled', color = 'cyan')
title('Distribution of outlier score')

#plotting the result
threshold = 2.
# plot non outliers as green
figure()
scatter(data[:, 0], data[:, 1], c = 'green', s = 10, edgecolors='None', alpha=0.5)
# find the outliers and plot te outliers
idx = np.where(outlier_score > threshold)
scatter(data[idx, 0], data[idx, 1], c = 'red', s = 10, edgecolors='None', alpha=0.5)
+++++ took 4.00019 msecs for Outlier scoring

<matplotlib.collections.PathCollection at 0x36ad0b38>

center

center

The results are same, and should be.

Putting everything together

Lets create a class, to combine evrything together. It will be important in evaluating performance. From above results, we note that the most time is spent for KNN querying. Also, i will create a method to compare the performances and see the results.

Now, lets compare the performace of 2 methods- Naive and KDTree implementations.

perf_test(methods = ['Tree', 'Naive'], n_list = [2 ** i for i in range(4, 14)], plot = True)

center

We see that KDTree outperforms Naive method for narge $n$, but it may not do well for small number of datasets. In my PC, i cannot run Naive method beyond $2^{13}$ datapoints, or else i receie MemoryError. So, lets evauate te performance of KDTrees upto 1Million datapoints.

perf_test(methods = ['Tree'], n_list = [2 ** i for i in range(4, 21)], plot = True)

center

We can see, algorithm is scaling well with data-set size $n$. If we analyse the complexity of algorithm, its linearithmic , i.e. $\Theta (n\log{n})$.

Download this post as:

Ipython Notebook | PDF

Discuss this post via e-mail

comments powered by Disqus

umb, powered by jekyll, bootstrap, gh-pages