Statistical Score Calculation of Information Retrieval Systems using Data Fusion Technique

Effective information retrieval is measured by the number of relevant documents retrieved with respect to a user query. In this paper, we present a novel data fusion approach in IR to enhance the performance of the retrieval system. The approach unites the retrieval results of numerous systems using various data fusion algorithms. The study shows that our approach is more efficient than traditional approaches.


Introduction
A retrieval system is a machine that receives a user query and generates a relevance score for each query-document pair. The process of finding the needed information in a repository is a non-trivial task [1][2][3], and it is necessary to formulate a process that effectively returns the pertinent documents. The process of retrieving germane articles [4] is termed Information Retrieval (IR). It deals with the representation, storage, organization of, and access to information items [3]. Fusion is a technique that merges results retrieved by different systems to form a single list of documents. Document clustering, in contrast, operates on one particular ranked list and does not take advantage of multiple ranked lists. The fusion function accepts these scores as its inputs for the query-document pair. A static fusion function has only the relevance scores for a single query-document pair as its inputs. A dynamic fusion function can have more inputs: it can adjust the way it fuses multiple retrieval systems' relevance scores for a query-document pair using additional input features such as the query, the retrieved documents, and the joint distribution of the retrieval systems' relevance scores for the query. Various models, schemes, and systems have been proposed to represent and organize the document collection in order to reduce users' effort in finding relevant information [5]. In this study we present three different data fusion methods, namely the Rank Position, Borda Count, and Condorcet methods, for ranking retrieval systems. Four feature selection techniques are used: the Fisher Criterion, the Golub Signal-to-Noise ratio, the traditional t-test, and the Mann-Whitney rank sum statistic.

Related Work
Fox and Shaw presented five combination functions for combining scores [6]. They are as follows:
CombMIN = minimum of individual similarities
CombMAX = maximum of individual similarities
CombSUM = summation of individual similarities
CombANZ = CombSUM ÷ number of non-zero similarities
CombMNZ = CombSUM × number of non-zero similarities
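As a minimal sketch of these five functions (the function name `comb_scores` and the dict-based input format are illustrative, not from the paper), assuming each system reports a similarity score per retrieved document:

```python
def comb_scores(score_lists):
    """Compute Fox and Shaw's five Comb functions for one query.

    score_lists: one dict per retrieval system, mapping document id to
    that system's similarity score (documents a system did not retrieve
    are simply absent from its dict).
    """
    docs = set().union(*score_lists)
    fused = {}
    for d in docs:
        scores = [s[d] for s in score_lists if d in s]
        nonzero = sum(1 for x in scores if x != 0)
        comb_sum = sum(scores)
        fused[d] = {
            "CombMIN": min(scores),
            "CombMAX": max(scores),
            "CombSUM": comb_sum,
            "CombANZ": comb_sum / nonzero if nonzero else 0.0,
            "CombMNZ": comb_sum * nonzero,
        }
    return fused
```

Note that CombANZ and CombMNZ differ only in dividing versus multiplying by the count of systems that returned the document with a non-zero score.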
Fusion functions that differ from the Comb-functions with respect to the generation of answer sets are also found in the literature [8]. These functions assign ranks to the documents in the answer set, in contrast to the relevance-score assignment mechanism adopted in the Comb-functions. Two such fusion techniques, which emulate social voting schemes, are the Borda and Condorcet fusions [8]. The authors of [8] introduced Borda-fuse and Condorcet-fuse, and showed that the use of social welfare functions (Roberts, 1976) as the merging algorithms in data fusion generally outperforms the CombMNZ algorithm. Extensive work on Comb-functions has been carried out by Lee [9][10][11], and based on the results he proposed a few new rationales and indicators for data fusion. He concluded that CombMNZ is the best performing of these functions. The probabilistic approach [12] differs from the Comb-functions in that it selects a best performing strategy from a pool based on a predetermined probability value. The probabilistic model selects only one strategy from the pool while all other strategies remain unused. Hence, evolutionary algorithms have been used to select the best performing strategies [13]. Meng and his co-workers (2002) indicate that metasearch software involves four components:
1. Database (search engine) selector: the search engines to be merged are selected using some system selection method.
2. Query dispatcher: the queries are submitted to the underlying search engines.
3. Document selector: the documents to be used from each search engine are determined. The simplest way is to use the top-ranked documents.
4. Result merger: the results of the search engines are combined using merging techniques.

Lee's Overlap Measure
The overlap measures are defined as:

Roverlap = 2 × |R1 ∩ R2| / (R1 + R2),  Noverlap = 2 × |N1 ∩ N2| / (N1 + N2)  (1)

where Ri is the number of relevant documents and Ni is the number of nonrelevant documents returned by system i, respectively. The ratio of the two overlaps was found to be an important predictive factor for the improvement of the combination: the similarity of the two systems on nonrelevant documents is less important than their similarity on relevant ones. After normalizing the scores for each system on each query by dividing by their respective means, we found the optimal combination for each possible pair. For each feature, we use one of the statistical methods such as the traditional t-test. A large score suggests that the corresponding feature has different expression levels in the relevant and irrelevant documents, and is thus an important feature to be selected for further analysis. Besides that, some researchers have used variations of the correlation coefficient to select features, for example the Fisher Criterion [13] and the Golub Signal-to-Noise ratio.
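A small sketch of the overlap computation, assuming the relevant (or nonrelevant) documents of the two systems are given as Python sets (the names here are illustrative):

```python
def overlap(set_a, set_b):
    # Lee's overlap: 2 * |A ∩ B| / (|A| + |B|); applied separately to
    # the relevant-document sets and the nonrelevant-document sets.
    if not set_a and not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))
```

The ratio of the relevant-document overlap to the nonrelevant-document overlap can then serve as the predictive factor described above.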

Rank position method
The rank positions of the retrieved documents are used to merge the documents into a single list. The rank position is determined by the retrieval system. We call d the original document, while its counterparts in all the other document lists are called the reference documents of d. The following equation shows the statistical score calculation of document di using the position information of this document in all the systems (j = 1, 2, 3, ... n):

S(di) = 1 / Σj (1 / posj(di))

where posj(di) is the rank position of document di in system j.
In this summation, systems that do not rank a document are omitted. The union of the top documents is treated as the combined result.
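A sketch of this fusion, assuming the rank-position score is the reciprocal of the summed reciprocal ranks (one reading of the description above; the list-based input format is illustrative):

```python
def rank_position_fuse(rankings):
    # rankings: one ranked list of document ids per system.
    # Systems that do not rank a document are omitted from its sum.
    docs = set().union(*rankings)
    scores = {}
    for d in docs:
        inv_rank_sum = sum(1.0 / (r.index(d) + 1)
                           for r in rankings if d in r)
        scores[d] = 1.0 / inv_rank_sum
    # A lower score means the document sat near the top of more lists.
    return sorted(docs, key=lambda d: scores[d])
```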

Borda Count Method
The Borda count and Condorcet methods are based on democratic election strategies. In a Borda count over n candidates, the top-ranked candidate gets n points and each successor gets one point less than its predecessor, i.e. (n − 1), (n − 2), and so on. If a voter leaves some candidates unranked, the remaining points are divided evenly among the unranked candidates. Finally, the points from all voters are summed, and the alternative with the highest total score wins the election.
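A minimal sketch of Borda-count fusion under these rules, with each system acting as one voter (the function name and input format are illustrative):

```python
def borda_fuse(rankings):
    docs = set().union(*rankings)
    n = len(docs)
    points = {d: 0.0 for d in docs}
    for r in rankings:
        # The top-ranked document gets n points, the next n-1, and so on.
        for i, d in enumerate(r):
            points[d] += n - i
        # Documents a system leaves unranked share the remaining
        # points (those of the unfilled positions) evenly.
        unranked = docs - set(r)
        if unranked:
            leftover = sum(range(1, n - len(r) + 1))
            for d in unranked:
                points[d] += leftover / len(unranked)
    return sorted(docs, key=lambda d: -points[d])
```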

Condorcet method
In the Condorcet election method, voters rank the candidates in order of preference. It is a distinctive method that names as the winner the candidate who prevails over each of the other candidates in pairwise comparison. To rank the documents we use their win and loss values.
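A sketch of Condorcet fusion on ranked lists, ordering documents by their pairwise win counts (treating documents a system leaves unranked as below all its ranked ones is an assumption):

```python
def condorcet_fuse(rankings):
    docs = sorted(set().union(*rankings))

    def pos(r, d):
        return r.index(d) if d in r else len(r)  # unranked -> worst

    wins = {d: 0 for d in docs}
    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            # Pairwise vote: which document do more systems rank higher?
            a_votes = sum(1 for r in rankings if pos(r, a) < pos(r, b))
            b_votes = sum(1 for r in rankings if pos(r, b) < pos(r, a))
            if a_votes > b_votes:
                wins[a] += 1
            elif b_votes > a_votes:
                wins[b] += 1
    # Rank documents by the number of pairwise contests they win.
    return sorted(docs, key=lambda d: -wins[d])
```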

Selection of Information Retrieval Systems for Data Fusion Technique
We consider three approaches for the selection of information retrieval systems to be used in data fusion.
Best: the best performing retrieval systems, those that retrieve a high percentage of the relevant documents, are employed for the statistical score calculation.
Normal: all systems to be ranked are used in data fusion.
Bias: the dissimilarity measure of the retrieval systems is used in data fusion.
The Fisher Criterion, the Golub Signal-to-Noise ratio, the traditional t-test, and the Mann-Whitney rank sum statistic were applied to calculate the statistical score, S, for the IR systems. In these techniques, each system was measured for correlation with the class according to the measuring criteria in the formulas. The systems were ranked according to the score, S, and the top-ranked relevant documents in the IRs were selected. The Fisher Criterion, fisher, is a measure that indicates how well the class distributions are separated. The coefficient has the following formula:

fisher = (µ1 − µ2)² / (σ1² + σ2²)

where µi is the mean and σi² is the variance of the given IR in class i. There were two IR classes in this experiment, i.e. the relevant documents in the IRs and the non-relevant documents in the IRs. The statistic gives higher scores to an IR system that returns relevant documents, with respect to the user query, whose mean differs greatly between the two classes relative to their variances.
Golub used a measure of correlation that emphasizes the signal-to-noise ratio, signaltonoise, to rank the relevant documents retrieved from the IRs. It is very similar to the Fisher Criterion but uses a related coefficient formula, as shown below:

signaltonoise = (µ1 − µ2) / (σ1 + σ2)

where µi is the mean and σi is the standard deviation of the relevant documents retrieved in class i.
The traditional t-test, ttest, assumes that the two class variances are equal. The formula is as follows:

ttest = (µ1 − µ2) / (σp × sqrt(1/n1 + 1/n2))

where µi is the mean of the relevant documents in class i and σp² is the pooled variance. The Mann-Whitney rank sum statistic, mann, has the following formula:

mann = n1 n2 + n1(n1 + 1)/2 − R1

where ni is the size of class i and R1 is the sum of the ranks in class 1. The score, S, for each relevant document retrieved in the IR is thus calculated using the formulas of these statistical techniques.
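The parametric scores above can be sketched directly from their formulas (the helper names are illustrative; the t-test version uses the equal-variance pooled form, as the text assumes):

```python
import math

def fisher(mu1, var1, mu2, var2):
    # Fisher Criterion: squared mean difference over summed variances.
    return (mu1 - mu2) ** 2 / (var1 + var2)

def signal_to_noise(mu1, sd1, mu2, sd2):
    # Golub's signal-to-noise: mean difference over summed std. deviations.
    return (mu1 - mu2) / (sd1 + sd2)

def t_score(x, y):
    # Two-sample t statistic with pooled variance (equal-variance form).
    n1, n2 = len(x), len(y)
    mu1, mu2 = sum(x) / n1, sum(y) / n2
    pooled_var = (sum((v - mu1) ** 2 for v in x)
                  + sum((v - mu2) ** 2 for v in y)) / (n1 + n2 - 2)
    return (mu1 - mu2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
```

In each case a larger absolute score marks a system whose relevant and non-relevant classes are better separated, so the scores can rank systems directly.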
The bias concept is used for the selection of IR systems for data fusion. The cosine similarity measure between two vectors a and b is given by the following equation:

sim(a, b) = (a · b) / (‖a‖ ‖b‖)

The bias between these two vectors is defined by subtracting the similarity value from 1.
bias(a, b) = 1 − sim(a, b)  (9)

We may use any combination of the above measures to calculate the statistical score of the information retrieval systems.
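A sketch of the bias computation over two score vectors (representing each system's output as a numeric vector is an assumption here):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def bias(a, b):
    # Equation (9): bias is one minus the cosine similarity.
    return 1.0 - cosine_similarity(a, b)
```

A high bias (low similarity) flags a pair of systems as dissimilar, which is the property exploited by the Bias selection strategy.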

Discussion
So far, our study suggests that, for our choice of retrieval systems, there is an opportunity to improve retrieval performance by fusing the above-mentioned approaches. Our preferred design for effective statistical score calculation of information retrieval systems is a multilayer technique that maximizes precision and improves the retrieval performance that satisfies user needs. In this paper we have summarized various methods used in different published articles; incorporating and integrating a few of these approaches may lead to better precision and recall values. Our main contribution is the consolidation of these techniques from various research articles, so that they will be useful to researchers in their future work.