RODHA: Robust Outlier Detection using Hybrid Approach

The task of outlier detection is to find the small groups of data objects that are exceptional to the inherent behavior of the rest of the data. Detection of such outliers is fundamental to a variety of database and analytic tasks such as fraud detection and customer migration. There are several approaches [10] to outlier detection employed across many study areas, among which distance-based and density-based techniques have gathered the most attention from researchers. In information theory, entropy is a core concept that measures the uncertainty of a stochastic event; in other words, entropy describes the distribution of an event. Because of this ability to describe the distribution of data, entropy has been applied in clustering applications in data mining. In this paper, we develop a robust supervised outlier detection algorithm using a hybrid approach (RODHA) which incorporates the concepts of both distance and density, along with an entropy measure, while determining an outlier. We provide an empirical study of different existing outlier detection algorithms and establish the effectiveness of the proposed RODHA in comparison to them.


Introduction
The majority of the earlier research in data mining focussed on the general patterns applicable to the larger portion of the data. Outlier detection, on the other hand, focuses on the smaller portion of the data that exhibits exceptional behaviour compared to the rest. A well-quoted definition of outliers was first given by Hawkins [12]: "An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism". Since its inception, outlier detection has been regarded as an important aspect of data mining research, as it uncovers valuable knowledge hidden in the data and aids decision makers in improving profit or service quality. Outlier detection has several applications. For example, it can be employed as a pre-processing step to clean a data set of erroneous measurements and noisy data points. It can also be used to isolate suspicious or interesting patterns in the data. Examples include fraud detection, customer relationship management, network intrusion detection, clinical diagnosis and biological data analysis.
In this paper we provide an empirical study of some existing outlier detection techniques. We have carried out a detailed theoretical study and implementation of the Locality Sensitive Hashing (LSH)-based outlier detection technique proposed by Wang et al. [20]. Apart from this, we propose a robust outlier detection algorithm using a hybrid approach (RODHA), based on both the distance-based and density-based approaches and incorporating an entropy measure to determine the outliers. The proposed RODHA is significant in view of the following points.
• Free from the restriction of using a specific proximity measure.
• Takes the benefit of the distance-based, density-based as well as information-theoretic approaches while identifying an outlier.
• Sensitive and scalable.
• Performance is independent of dimensionality and number of clusters.
The rest of the paper is organized as follows: Section 2 reports related research. In Section 3, we provide the background of our work. In Section 4, the LSH-based outlier detection technique is described in brief. Section 5 presents the proposed RODHA approach, and the empirical evaluation of the method is reported in detail in Section 6. Finally, concluding remarks and future directions of research are given in Section 7.

Related Research
There are two kinds of outlier detection methods: formal tests and informal tests [22]. Formal and informal tests are usually called tests of discordance and outlier labelling methods, respectively.
Most formal tests need test statistics for hypothesis testing. They are usually based on assuming some well-behaved distribution, and test whether the target extreme value is an outlier of that distribution, i.e., whether or not it deviates from the assumed distribution. Some tests are for a single outlier and others for multiple outliers. Selection of these tests mainly depends on the number and type of target outliers and the type of data distribution. Even though formal tests are quite powerful under well-behaved statistical assumptions such as a distribution assumption, most distributions of real-world data may be unknown or may not follow specific distributions such as the normal, gamma, or exponential. Another limitation is that they are susceptible to masking or swamping problems.
On the other hand, most outlier labelling methods, the informal tests, generate an interval or criterion for outlier detection instead of performing hypothesis testing, and any observation beyond the interval or criterion is considered an outlier. There are two reasons for using an outlier labelling method. One is to find possible outliers as a screening device before conducting a formal test. The other is to find the extreme values away from the majority of the data regardless of the distribution. Some very popular outlier labelling parameters are the Z-score [22], the standard deviation (SD) method [22], Tukey's method, the MADe method [22] and the Median Rule [22].
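As an illustration, two of the labelling rules mentioned above can be sketched in a few lines. The threshold of 3 for the Z-score and k = 1.5 for Tukey's fences are the conventional textbook choices, not values prescribed by this paper:

```python
import statistics

def zscore_outliers(data, threshold=3.0):
    """Label points whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) / sd > threshold]

def tukey_outliers(data, k=1.5):
    """Tukey's method: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
print(tukey_outliers(data))  # the extreme value 95 falls outside the fences
```

Note that on this small sample the extreme value inflates the standard deviation itself, so the Z-score rule needs a lower threshold to flag it; this is exactly the masking problem mentioned above for formal tests.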
In data mining, the problem of outlier detection has been addressed through several approaches [10] in different problem domains. The solutions range from statistical methods to geometric methods, and from density-based to distance-based approaches. Statistical methods are appropriate if one has a good sense of the background distribution, but they typically do not scale well to large datasets or datasets of even moderate dimensionality. Geometric methods essentially rely on variants of the convex hull algorithm, whose complexity is exponential in the dimensionality of the data, and they are often impractical. The distance-based approach [15] was originally proposed by Knorr and Ng. They define a point to be a distance-based outlier if at least a user-defined fraction of the points in the dataset lie further away than some user-defined minimum distance from that point. In their experiments, they primarily focus on datasets containing only continuous attributes. This can be expensive to compute, particularly in higher dimensions. A standard distance-based approach called ORCA [3], proposed by Stephen D. Bay, employs a pruning rule to optimize processing time on large multi-dimensional datasets; because of this pruning rule the algorithm scales to near-linear time on high-dimensional datasets. Another distance-based approach, in conjunction with a ranking scheme, is the Locality Sensitive Hashing (LSH)-based outlier detection proposed by Wang et al. [20]. Here the outlier ranking scheme is based on a hashing concept called Locality Sensitive Hashing (LSH). The basic idea of LSH is to convert the data into manageable fingerprints and hash them so that similar data points are mapped to the same buckets with high probability. Density-based approaches [4] to outlier detection rely on the computation of the local neighbourhood density of a point.
In one such technique, a local outlier factor (LOF) is computed for each point. The LOF of a point is based on the ratio of the local density of the area around the point and the local densities of its neighbours. The size of the neighbourhood of a point is determined by the area containing a user-supplied minimum number of points (MinPts). Pang-Ning Tan proposed OutRank-b [16], a graph-based outlier detection algorithm. In this technique the graph representation of the data is based on two approaches: object similarity and the number of shared neighbours between objects. A Markov chain model is then built upon this graph, which assigns an outlier score to each object. Agrwal [21] has suggested a local-subspace-based outlier detection method which uses a different subspace for each object. This approach essentially adapts local density-based outlier detection by defining a Local Subspace based Outlier Factor (LSOF) for high-dimensional datasets. A. Ghoting et al. [23] proposed LOADED, an outlier detection algorithm for evolving datasets containing both continuous and categorical attributes. LOADED is a tuneable algorithm, wherein one can trade off computation for accuracy so that domain-specific response times are achieved. S. Wu et al. [24] incorporated the concept of entropy to propose an information-theoretic outlier detection technique for large-scale categorical data. This strategy first adopts a deviation-based approach, avoiding the use of statistical tests and proximity-based measures to identify exceptional objects. Secondly, it combines entropy and total correlation with attribute weighting to define the concept of weighted holo-entropy, where the entropy measures the global disorder of a data set and the total correlation measures the attribute relationships.

Discussion and Motivation
From the inception of research on outlier detection in data mining, researchers have moved from the most straightforward distance-based approaches to the most recent ranking-driven approach [20]. In the course of time, several contextual modifications have been made to density-based, graph-based and statistical outlier detection approaches, but none has been able to provide a broadly acceptable, high-accuracy solution to the outlier detection problem. To summarize, based on our survey we observe the following.
• Although the distance-based approach is a straightforward criterion for outlier detection, it alone is not suitable for datasets having clusters of different distributions.
• In distance-based outlier detection [15], the main overhead is the selection of the user-defined fraction of data points that must lie further away than another user-defined threshold distance.
• The statistical approaches require either the construction of a probabilistic data model based on empirical data, which is a rather complicated computational task, or a priori knowledge of the distribution laws. Even if the model is parameterized, complex computational procedures are needed to find these parameters. Moreover, it is not guaranteed that the data being examined match the assumed distribution law if there is no estimate of the density distribution based on the empirical data.
• The density-based approach to outlier detection considers the neighbourhood density of points to declare them outliers or non-outliers. This approach provides better detection results if the required input parameter ε is selected accurately.
• The performance of the existing outlier detection algorithms is dataset dependent. Therefore, development of a robust, sensitive outlier detection technique that is free from the limitations of the aforesaid algorithms is of utmost importance.

Background of the Work
In this section, we discuss the background concepts that provide the basis of our work. The proposed outlier detection technique is a combination of the distance- and density-based outlier detection approaches, along with the concept of entropy as a measure for outlier detection.

Outlier
Outliers are those observations in the data that do not conform to the inherent patterns of the data. Several definitions of outlier have been given from different viewpoints. An example of outliers in a two-dimensional dataset is illustrated in Figure 1. Outliers may be induced by a variety of causes such as malicious activity (e.g., credit card fraud, cyber attacks, novelty, or breakdown of a system), but all these causes have a common characteristic: they are interesting to the analyst. The interestingness or real-life relevance of outliers is a key feature of outlier detection [19]. Outlier detection is related to, but distinct from, noise removal or noise accommodation, which deals with unwanted noise in the data. Noise does not have any real-life significance and acts as a hindrance to data analysis.

Distance-based Outlier
Distance-based method was originally proposed by Knorr and Ng [15].

It states that "An object O in a dataset T is a DB(p, D)-outlier if at least fraction p of the objects in T lies at a distance greater than D from O".
This notion is further extended based on the distance of a point from its k-th nearest neighbour. Alternatively, the outlier factor of each data point is computed as the sum of the distances to its k nearest neighbours. Here the distance can be given by any dissimilarity measure: Euclidean distance, Lp norm, cosine distance, etc.
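A naive O(n²) sketch of both notions, the DB(p, D) test and the k-nearest-neighbour outlier factor, might look as follows (the point set and the parameter values are illustrative only):

```python
import math

def db_outliers(points, p, D):
    """Knorr-Ng DB(p, D)-outliers: O is an outlier if at least a fraction p
    of the dataset lies at distance greater than D from O. Naive O(n^2) scan."""
    n = len(points)
    outliers = []
    for o in points:
        far = sum(1 for q in points if math.dist(o, q) > D)
        if far / n >= p:
            outliers.append(o)
    return outliers

def knn_outlier_factor(points, k):
    """k-NN variant: score each point by the sum of the distances to its k
    nearest neighbours; larger scores indicate more isolated points."""
    scores = {}
    for o in points:
        dists = sorted(math.dist(o, q) for q in points if q is not o)
        scores[o] = sum(dists[:k])
    return scores

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(pts, p=0.7, D=5))  # the isolated point (10, 10)
```

In practice both computations are accelerated with pruning (as in ORCA) or spatial indexing rather than the quadratic scan shown here.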

Density-based Outlier
The density-based approach was proposed by Breunig et al. [4]. It relies on the local outlier factor (LOF) of each point, which depends on the local density of its neighbourhood. In our work, we consider the local neighbourhood density in terms of the number of points lying in the ε-neighbourhood of the object. In this view, an outlier is a point lying so sparsely that no more than a threshold number MinPts of other points lie in its ε-neighbourhood.
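Under this definition, a minimal density check can be sketched as follows. Here `eps` and `min_pts` play the roles of ε and MinPts; the point set and values are illustrative:

```python
import math

def is_density_outlier(point, data, eps, min_pts):
    """Flag a point when fewer than min_pts other points lie within its
    eps-neighbourhood, i.e. it sits in a sparse region of the data."""
    neighbours = sum(1 for q in data if q != point and math.dist(point, q) <= eps)
    return neighbours < min_pts

data = [(0, 0), (0.5, 0), (0, 0.5), (0.4, 0.4), (8, 8)]
print([p for p in data if is_density_outlier(p, data, eps=1.0, min_pts=2)])
# only (8, 8) lacks enough eps-neighbours
```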

Entropy
In information theory, entropy is a core concept that measures uncertainty about a stochastic event; that is, entropy describes the distribution of an event [13]. Entropy is a measure of disorder, or more precisely of unpredictability, in a system. In entropy-based clustering, an object is added to the cluster for which the resulting increase in intra-cluster entropy is minimum among all clusters. Since an outlier is an observation that deviates from the inherent pattern of the data, the addition of such a point to any cluster increases the entropy much more than the addition of a non-outlier point; this notion is an important criterion for entropy-based outlier detection. Shannon defined the entropy H of a discrete random variable X with possible values {x_1, x_2, ..., x_n} and probability mass function p(X) [8] as

H(X) = − Σ_{i=1}^{n} p(x_i) log_b p(x_i),

where b is the base of the logarithm used. Previously, entropy has been a metric difficult to evaluate without imposing unrealistic assumptions about the data distributions [13]. Renyi proposed an entropy measure that lends itself to nonparametric estimation directly from data [13]. The mathematical formula for Renyi's entropy is briefly described in Section 5.4.
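The effect described above, an outlying value inflating entropy more than a conforming one, can be illustrated with a small Shannon-entropy sketch over categorical values (the sample cluster is invented for illustration):

```python
import math
from collections import Counter

def shannon_entropy(values, base=2):
    """H(X) = -sum_i p(x_i) * log_b p(x_i), estimated from value frequencies."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

# A homogeneous cluster has low entropy; adding a deviating value raises it.
cluster = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
print(shannon_entropy(cluster))          # 1 bit for a uniform two-value split
print(shannon_entropy(cluster + ['z']))  # larger: the odd value adds disorder
```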

Locality Sensitive Hashing (LSH)-based Outlier Detection
This is a distance based approach in conjunction with a ranking scheme based on the concept of Locality Sensitive Hashing (LSH) [5]. The basic idea for LSH is to convert the data into manageable fingerprints and hash them so that similar data points are mapped to the same buckets with high probability.
Definition: A family H is called (R, c, P_1, P_2)-sensitive if for any two points p, q:
• if ‖p − q‖ ≤ R, then Pr[h(p) = h(q)] ≥ P_1;
• if ‖p − q‖ ≥ cR, then Pr[h(p) = h(q)] ≤ P_2.
The first condition guarantees that similar points are hashed to the same bucket with high probability, whereas the second says that distant points are hashed to the same bucket with small probability. A family is useful only when P_1 > P_2. In order to improve the efficiency of the outlier detection process, some pruning techniques are used, viz. PPSO and ANNS [20]. The whole framework of LSOD can be divided into a number of modules. The initial step is effectively a pre-processing step in which the dataset is divided into a number of clusters; the outlier detection framework is independent of the exact clustering technique employed. The further steps are briefly described below.

Outlier Likelihood Ranking
The points in the database are first ranked based on their likelihood of being an outlier. The resulting rank-ordered list is then processed in the detection phase, where the actual outliers are found. The intuition behind this outlier ranking order is: the lower the rank, the higher the likelihood of being an outlier. The outlier likelihood rank of any object is given by a ranking function, the LSH function h(v), that leverages a p-stable distribution [20][5]:

h_i(v) = ⌊(a_i · v + b_i) / w⌋,   (5)

where w is a parameter of the hashing procedure that denotes the size of the window onto which the database points are projected; w = 4 is generally recommended [20]. a_i is a d-dimensional vector whose every component is drawn from the standard normal distribution, and b_i denotes a random bias whose value is drawn from the uniform distribution Unif(0, w). The probability p_q(d) that a point p at distance d from another point q is hashed to the same bucket is given by

p_q(d) = ∫_0^w (1/d) f_p(t/d) (1 − t/w) dt,   (6)

where f_p(·) denotes the probability density function of the absolute value of the p-stable distribution. For a fixed parameter w in Equation 5, p_q(d) decreases monotonically with d. In other words, the collision probability between points p and q decreases as the distance ‖p − q‖ between them increases. The performance of the locality sensitive hashing (i.e. the hash family H) depends on the parameter R, which is an estimate of the distance between a normal point and one of its neighbours. Several research efforts focus on issues related to LSH parameter tuning [2][7]. The LSH-based outlier detection [20] relies on ranking for efficiency, not correctness; therefore, a slightly less accurate ranking will not significantly impact the performance of outlier detection. An efficient approach for the estimation of R is employed in [20] based on the already generated clusters: first some pairs of points are sampled, where both points of each pair lie in the same cluster; then the distances between these pairs are calculated; finally, the median of these distances is set as the estimated value of R.

Ranking Methodology
For a given point q, let N_q denote the number of points that hash to the same bucket as q. We define rank(q) as follows [20]:

rank(q) = E[N_q],

i.e. the rank is the expected number of points in the database that hash to the same bucket as q. We can formally define E[N_q] as follows [20]:

E[N_q] = Σ_{p ∈ D, p ≠ q} p_q(‖p − q‖).
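A toy sketch of this ranking idea follows. For determinism the projection vector a and bias b are fixed by hand; in the actual scheme each component of a is drawn from N(0, 1) and b from Unif(0, w), and the expected bucket count E[N_q] is approximated here by the observed count N_q:

```python
import math
from collections import Counter

W = 4.0  # window size; w = 4 is the setting recommended in [20]

def lsh_hash(v, a, b, w=W):
    """p-stable LSH function: h(v) = floor((a . v + b) / w)."""
    return math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / w)

def lsh_rank(points, a, b, w=W):
    """Rank points by bucket occupancy N_q: points falling in sparsely
    populated buckets are processed first as likely outlier candidates."""
    keys = {p: lsh_hash(p, a, b, w) for p in points}
    sizes = Counter(keys.values())
    return sorted(points, key=lambda p: sizes[keys[p]])

pts = [(0, 0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (50, 50)]
print(lsh_rank(pts, a=[1.0, 0.5], b=1.0)[0])  # (50, 50), alone in its bucket
```

In the real scheme several independent hash functions are combined, which sharpens the gap between P_1 and P_2 described in the definition above.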

Outlier Detection
After the ranking is over, the objects are processed in increasing order of rank. This ordered ranking scheme has the advantage of processing the most probable outlier candidates first. Then, based on the weakest outlier score and a user-defined parameter k, the first L outliers are returned as output.
Apart from the LSH-based outlier detection technique [20], we have compared our proposed outlier detection technique, RODHA, with three other outlier detection techniques, viz. LOF [4], ORCA [3] and OutRank-b [16]. Table 1 shows a general comparison of these four existing outlier detection techniques.

RODHA: The Proposed Outlier Detection Technique
RODHA (Robust Outlier Detection using Hybrid Approach) is designed using a combination of the distance- and density-based outlier detection approaches in conjunction with an entropy measure from information theory. The basic framework of RODHA is shown in Figure 2. It requires clustering of the data as a pre-processing step. The distance-based approach then defines an object to be an outlier when its minimum distance from all the cluster profiles is greater than the maximum intra-cluster distance over all the clusters in the data. The density-based approach to outlier detection relies on the computation of the local neighbourhood density of a point; implicit in this approach is the notion of distance, but an additional criterion is that of neighbourhood and the determination of the number of points lying within a neighbourhood of interest. Finally, the notion of an entropy-based outlier is that a candidate outlier sample, when added to its nearest cluster, would increase the intra-cluster entropy by an amount much higher than a non-outlier sample added to the same cluster. As a pre-processing step, the dataset D, whose points are of dimension d, is divided into two parts: a large training set D_trainset and a smaller test set D_testset. The overall framework of the proposed technique consists of four phases, as follows.

Clustering the Training Set
The given training set D_trainset is clustered to produce k clusters C_1, C_2, ..., C_k, and the objects are labelled accordingly. Although clustering is a pre-requisite of the outlier detection framework, the final outlier detection is independent of the exact choice of clustering method. We have used the k-means clustering algorithm [11]; one can employ other popular clustering methods such as k-medoid, DBSCAN or fuzzy c-means. The performance of k-means clustering depends heavily on the selection of the initial cluster centroids, so we have employed a routine that selects the farthest k objects as the initial cluster centroids.

Initial Centroid Selection
The farthest k objects from the training set D_trainset are selected as the initial centroids in the k-means clustering algorithm. The first centroid is selected randomly from D_trainset. The point farthest from the first selected point becomes the second initial cluster centroid. Then, each subsequent centroid is the point in D_trainset for which the sum of its distances from all the already selected centroids is maximum. This process continues until the user-defined number of centroids has been selected.
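The centroid-seeding routine just described can be sketched as follows. The paper picks the first seed at random; index 0 is used here so the example is deterministic:

```python
import math

def farthest_k_centroids(points, k):
    """Pick k mutually distant seeds: start from one point, then repeatedly
    add the point maximising the sum of its distances to the centroids
    already chosen."""
    centroids = [points[0]]
    while len(centroids) < k:
        best = max((p for p in points if p not in centroids),
                   key=lambda p: sum(math.dist(p, c) for c in centroids))
        centroids.append(best)
    return centroids

pts = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20)]
print(farthest_k_centroids(pts, 3))  # seeds spread across the point set
```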
For datasets of very high dimension, the farthest-k-objects selection using the above distance-based approach suffers from the curse of dimensionality. For those datasets, it is preferable to employ spatial index structures such as the R-tree or its family members, because they can significantly reduce (a) the cost of neighbourhood computation, as the average-case complexity of searching in an R-tree is O(log_m n), where n is the total number of nodes in the R-tree and m is the number of entries in a node, and (b) the cost of finding the farthest point.

Distance-based Outlier Detection
Let us consider that the dataset D has the spatial distribution shown in Figure 3. For each cluster we compute the maximum intra-cluster distance, and the threshold d_thres is the maximum of these values over all clusters; an object is declared a distance-based outlier when its minimum distance d_min from the cluster profiles satisfies d_min > d_thres. This distance-based approach can detect outliers where the dataset is convex in nature, but it fails for datasets of concave nature (as shown in Figure 4) or where the outlier objects lie marginally away from the boundary of the clusters. In Figure 4, the two clusters C_1 and C_2 are of concave nature. The object O_1 is supposed to be an outlier. By the distance-based approach we find d_thres = d_max1 and d_min = d_OC1, but the condition for being an outlier, i.e. d_min > d_thres, is not satisfied by O_1. So the distance-based approach fails to detect O_1 as an outlier. The same situation arises for object O_2, lying close to the boundary of cluster C_2. To handle such situations our proposed technique employs a density-based approach to detect outliers.
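A minimal sketch of this phase follows. The cluster "profile" is taken here to be the centroid, and the intra-cluster distance as the member-to-centroid distance; the paper's exact profile definition may differ:

```python
import math

def distance_based_outlier(x, clusters):
    """Phase-1 test: x is a distance-based outlier if its minimum distance
    d_min to any cluster centroid exceeds d_thres, the maximum intra-cluster
    (member-to-centroid) distance over all clusters."""
    centroids = []
    d_thres = 0.0
    for members in clusters:
        c = tuple(sum(coord) / len(members) for coord in zip(*members))
        centroids.append(c)
        d_thres = max(d_thres, max(math.dist(m, c) for m in members))
    d_min = min(math.dist(x, c) for c in centroids)
    return d_min > d_thres

c1 = [(0, 0), (1, 0), (0, 1), (1, 1)]
c2 = [(10, 10), (11, 10), (10, 11), (11, 11)]
print(distance_based_outlier((5, 30), [c1, c2]))     # True: far from both
print(distance_based_outlier((0.5, 0.5), [c1, c2]))  # False: inside C1
```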

Density-based Outlier Detection
The density-based approach requires the selection of a parameter ε, which is the distance within which to check for the availability of any training samples, i.e. the ε-neighbourhood of a candidate point. Within the ε-neighbourhoods of O_1 and O_2 there are some points, but none are labelled, i.e. none of them belong to any of the clusters in the data set. One very important aspect of this approach is the selection of the ε-value, to which its performance is very sensitive. If we take a larger ε-value, then some of the candidate outlier points might not be detected. Conversely, the ε-value should not be so small that an object finds no labelled points within its ε-neighbourhood and is erroneously declared an outlier.
So, the value of the parameter ε should be selected experimentally. In Section 6 we provide a heuristic method of selecting the ε-value for some UCI Machine Learning datasets [6] on which we have applied our technique.
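A sketch of this check, assuming the labelled points are those assigned to some cluster in the pre-processing step (point set and ε are illustrative):

```python
import math

def density_check(x, labelled_points, eps, min_pts=1):
    """Phase-2 test: x remains an outlier candidate if fewer than min_pts
    labelled training points fall inside its eps-neighbourhood."""
    inside = sum(1 for p in labelled_points if math.dist(x, p) <= eps)
    return inside < min_pts

labelled = [(0, 0), (0.3, 0.2), (0.1, 0.4), (5, 5), (5.2, 5.1)]
print(density_check((0.2, 0.1), labelled, eps=0.6))  # False: dense region
print(density_check((2.5, 2.5), labelled, eps=0.6))  # True: nothing nearby
```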

Entropy as a Parameter for Outlier Detection
In information theory, entropy is a core concept that measures uncertainty about a stochastic event; that is, entropy describes the distribution of an event [13]. Entropy is a measure of disorder, or more precisely of unpredictability, in a system. In the field of data mining, entropy has been used in clustering applications [13], where an object is included in the cluster whose intra-cluster entropy (disorder) increases by the minimum amount upon its inclusion. The notion of entropy in this perspective provides an important criterion for outlier detection. By definition, an outlier is an instance that differs greatly from the inherent pattern of the data. Such an object, when added even to its nearest, most similar cluster, increases the intra-cluster entropy by an amount much higher than a non-outlier object added to the same cluster.
Several mathematical formulations exist to measure the entropy of a system; one very popular one is Shannon entropy. We have employed an effective entropy measure known as Renyi's entropy [13][17], a generalized form of Shannon entropy developed by Alfred Renyi.
Definition (Renyi's Entropy): Renyi's entropy of order α for a stochastic variable X with probability density function (pdf) f_X is given by

H_α(X) = (1 / (1 − α)) log ∫ f_X^α(x) dx.

Specifically, for α = 2 we obtain

H_2(X) = − log ∫ f_X^2(x) dx,

which is called Renyi's quadratic entropy. This expression can easily be estimated directly from data by the use of Parzen window estimation with a multidimensional Gaussian window function. Assume that cluster C_k consists of the set of discrete data points x_i, i = 1, 2, ..., N_k. The probability density estimate based on the data points of C_k is then given by [13][26]

p_k(x) = (1 / N_k) Σ_{i=1}^{N_k} G(x − x_i, σ²I),   (11)

where N_k is the number of data points in C_k and we have used a symmetric Gaussian kernel with covariance matrix Σ = σ²I. Substituting this estimate into H_2, and using the fact that the integral of the product of two Gaussians is again a Gaussian, yields

H_2(C_k) = − log( (1 / N_k²) Σ_{i=1}^{N_k} Σ_{j=1}^{N_k} G(x_i − x_j, 2σ²I) ).   (12)

Since this entropy is calculated from the points assigned to the same cluster, we refer to (12) as the within-cluster entropy. We have used entropy-based clustering as a support to detect outliers, and have found a very large increase in the within-cluster entropy for the points that are detected as outliers.
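The within-cluster entropy (12) and the outlier criterion built on it can be sketched directly from the double-sum form (the cluster points and the choice σ = 1 are illustrative):

```python
import math

def gaussian(d2, sigma2, dim):
    """Isotropic Gaussian kernel value for squared distance d2."""
    return math.exp(-d2 / (2 * sigma2)) / ((2 * math.pi * sigma2) ** (dim / 2))

def renyi_quadratic_entropy(cluster, sigma=1.0):
    """H2(C) = -log( (1/N^2) * sum_i sum_j G(x_i - x_j, 2*sigma^2 I) ):
    plugging the Parzen estimate into -log integral(f^2) collapses the
    integral to a double sum over pairwise Gaussians of variance 2*sigma^2."""
    n, dim = len(cluster), len(cluster[0])
    total = sum(
        gaussian(sum((a - b) ** 2 for a, b in zip(xi, xj)), 2 * sigma ** 2, dim)
        for xi in cluster for xj in cluster)
    return -math.log(total / n ** 2)

cluster = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 0.1)]
h_before = renyi_quadratic_entropy(cluster)
h_outlier = renyi_quadratic_entropy(cluster + [(6.0, 6.0)])
h_inlier = renyi_quadratic_entropy(cluster + [(0.1, 0.15)])
print(h_outlier - h_before > h_inlier - h_before)  # outlier raises entropy more
```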

Complexity Analysis
The framework of the proposed outlier detection approach uses clustering as a pre-processing step, for which we have used k-means. For a large dataset with a small number of clusters, we can ignore k, and the overall complexity becomes O(N*I + N*dim), where N is the number of data points, I the number of clustering iterations and dim the dimensionality.

Environment Used
The proposed RODHA algorithm is implemented on a computer system with an Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.4 GHz processor and 2 GB of RAM, running the 32-bit Windows 7 operating system. The algorithms are programmed in C (Borland C++ version 4.5) and Matlab (version 7.6.0, R2008a).

Datasets Used
Several synthetic and real-life datasets are used for testing the performance of the proposed RODHA and the LSH-based outlier detection algorithm. The real-life datasets are downloaded from the UCI Machine Learning Repository website. Table 2 provides the basic information about the data sets considered in our experiments. For the data sets having missing values, prior to running the outlier detection algorithms, we estimated the missing values using two very popular estimation techniques: KNN imputation [18] and LLS imputation [14]. The imputation accuracy is measured by the normalized root mean squared error

NRMSE = sqrt( mean( (y_guess − y_ans)² ) ) / std(y_ans),   (14)

where y_guess and y_ans are vectors whose elements are the estimated values and the known answer values, respectively, for all missing entries. The mean and the standard deviation are calculated over the missing entries in the entire matrix.
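Assuming the standard form of NRMSE used in the imputation literature (root mean squared error over the missing entries, normalized by the standard deviation of the true values), the measure can be computed as:

```python
import math

def nrmse(y_guess, y_ans):
    """NRMSE = sqrt(mean((y_guess - y_ans)^2)) / std(y_ans), computed over
    the missing entries; lower values indicate better imputation."""
    n = len(y_ans)
    mse = sum((g - a) ** 2 for g, a in zip(y_guess, y_ans)) / n
    mean = sum(y_ans) / n
    std = math.sqrt(sum((a - mean) ** 2 for a in y_ans) / n)
    return math.sqrt(mse) / std

print(nrmse([1.1, 2.0, 2.9, 4.2], [1.0, 2.0, 3.0, 4.0]))  # small error ratio
```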

Experimental Results
As mentioned earlier, we have implemented the proposed outlier detection algorithm (RODHA) and the LSH-based outlier detection [20] on the datasets from the UCI Machine Learning archives tabulated in Table 2. The RODHA algorithm is compared with four other outlier detection algorithms: LSH-OD [20], ORCA [3], LOF [4], and OutRank-b [16]. These techniques are compared based on their outlier detection rates. The detection rate (DR) is calculated based on the ROC analysis proposed in [9]. We have downloaded the executable versions of LOF [1] and ORCA [3]; results of LOF and ORCA are reported for these datasets in columns 2 and 3, respectively. The detection rate values for OutRank-b are taken from the paper [16]. The comparison of these five different techniques over 14 different synthetic and benchmark datasets is shown in Table 3.

Discussion
The effectiveness of the proposed technique depends upon the value of the user-defined parameter ε, and that of the LSH-based technique depends upon the user-defined parameter k. So, in the last four columns of Table 3, we report the detection rates of LSH-OD and the proposed technique RODHA along with the values of the respective user-defined thresholds (k and ε). Out of the 24 datasets (1 synthetic and 23 benchmark UCI datasets), except for Vehicle and Hill Valley, where the detection rate is slightly lower, the proposed outlier detection algorithm shows excellent performance over all the other datasets.

Entropy Measure as an Outlier Detector
In our proposed outlier detection technique, we have used entropy from information theory as a support to detect outliers. When a candidate outlier sample is added to its nearest cluster, it increases the within-cluster entropy significantly more than a non-outlier sample does. So, we have used this significant increase of entropy as a weightage to declare an object an outlier. Figure 8 shows the effectiveness of entropy for outlier detection on the synthetic data set of Figure 7. The bar diagram (Figure 8) shows the change (i.e. increase) in intra-cluster entropy for the objects in the test set; the longer bars are the entropy increases for outlier points, which are much higher than those for the remaining non-outlier points in the test dataset. In order to test how sensitive the proposed algorithm is towards detecting outliers, we have constructed a synthetic dataset, shown in Figure 9, in which the points are distributed in several different ways. Here N_1 and N_2 are two normal clusters, O_1 is a distinct outlier, O_2 a distinct inlier, O_3 an equidistant outlier, O_4 a border inlier, O_5 a chain of outliers, O_6 a compact group of objects too small in number to form a cluster, and O_7 an outlier of "stay together" nature. To test the effectiveness of the algorithm, we label these as candidate outliers and run the outlier detection algorithm. The outliers returned by the technique are compared with the candidate outliers, and we find the accuracy of the detection equal to 0.98, i.e. the algorithm is sensitive to such outliers with an accuracy level of 98%.

Selection of the ε Threshold in the Proposed RODHA Algorithm
The effectiveness of the proposed outlier detection algorithm depends on the choice of the value of the threshold ε, which is a neighbourhood distance. The threshold value varies from data set to data set, and this variation is large since the datasets differ in the number, type and range of attribute values. So, in order to provide a relatively common range for selecting the threshold value across different data sets, we normalize the dataset attribute values within a range of 1.0 to 5.0 as a pre-processing step. While normalizing data within this range, we discard the attributes that take binary values or some fixed number of constant values. In Figure 10 we provide a heuristic method of selecting the ε-value, where the detection rate is plotted against the ε-value for different datasets; the accuracy of detection is maximum for threshold (ε) values within the range 0.4 to 1.5. As mentioned earlier, the proposed outlier detection technique employs entropy as a weightage to declare an object an outlier. A candidate outlier detected in the distance- and density-based phases of the algorithm is made to pass through another, entropy-based outlier detection phase, in which the object is added to its nearest cluster and the resulting increase in the intra-cluster entropy is compared with an entropy-difference threshold τ. The proposed outlier detection technique thus depends on both user-defined thresholds, ε and τ. Figure 11 shows the variation of the detection rate against the value of the user-defined threshold τ for some datasets. The detection rate of the algorithm stays almost steady between 0.93 and 0.97 for values of τ between 0.2 and 3.5, beyond which, although it shows a higher detection rate, the τ-value becomes so large that practically no outliers get detected.
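The [1.0, 5.0] normalization described above amounts to a per-attribute min-max rescaling. A sketch follows; the binary-value test here is a simplification of the paper's rule for discarding binary and fixed-valued attributes:

```python
def normalise_1_to_5(column):
    """Min-max rescale a numeric attribute into [1.0, 5.0], the range used
    before selecting the eps threshold. Binary or constant-valued attributes
    are discarded rather than rescaled."""
    lo, hi = min(column), max(column)
    if hi == lo or set(column) <= {0, 1}:
        return None  # discard binary / constant attributes
    return [1.0 + 4.0 * (x - lo) / (hi - lo) for x in column]

print(normalise_1_to_5([2, 4, 6, 10]))  # [1.0, 2.0, 3.0, 5.0]
print(normalise_1_to_5([0, 1, 1, 0]))   # None: binary attribute is discarded
```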

Selection of Threshold k in the LSH-based Outlier Detection Technique
The performance of the LSH-based outlier detection scheme largely depends on the selection of k. Again, since the datasets differ in the number of objects and in the type and range of attribute values, the value of k also varies accordingly. We have employed a heuristic method to find the most probable range of k leading to better results. In Figure 12 we show the variation of the accuracy (detection rate) of the algorithm with different values of k. We see that for the standard datasets, namely Iris, Breast Cancer Wisconsin, Statlog Heart, e-Coli, Yeast, Housing and Wine, the accuracy remains almost steady between 96% and 99% for values of k from 10 to 25. Beyond k = 25, the performance decreases significantly. Although for values of k between 1 and 10 the graph shows an accuracy level of around 98%, in that range the number of outliers detected is quite low. So, we consider the range of k values for which the accuracy level is relatively better and the algorithm also detects the probable outliers. In our experiments, the better results of the outlier detection algorithm are found with k in the range 10 to 25.

Conclusions and Future Work
We have developed a hybrid supervised outlier detection algorithm based on both distance- and density-based approaches. The effectiveness of the algorithm results from combining the two: the distance-based approach alone is able to detect outliers for datasets where objects are uniformly distributed among the data clusters, and its weakness on non-uniformly distributed datasets is compensated by the density-based approach, which considers the local density around a candidate outlier object. Furthermore, the incorporation of entropy for outlier detection makes the method more robust and sensitive than other existing outlier detection techniques. The computation of the within-cluster entropy using Renyi's entropy measure has the advantage that it lends itself nicely to nonparametric estimation directly from data [25] and that it considers how the data are distributed within the cluster. Moreover, the proposed RODHA has linear time complexity.
The algorithm is tested on synthetic and real-life datasets from the UCI ML Repository, and its detection performance is excellent compared with the other existing algorithms. In the present work, the datasets on which the proposed technique is tested are of integer or real type, so work is ongoing to extend the algorithm to mixed-type datasets. Apart from this, the performance of the algorithm will be tested on network intrusion datasets.