An Empirical Investigation of Attribute Selection Techniques based on Shannon, Rényi and Tsallis Entropies for Network Intrusion Detection

Intrusion Detection Systems for computer networks perform their detection tasks by monitoring a set of attributes from network traffic. Since some attributes may be irrelevant, redundant or even noisy, their use can decrease the intrusion detection efficiency as well as increase the processing time. In this context, selecting optimal attributes is a difficult task, considering that the set of all attributes can assume a huge variety of data formats (for example: symbol set, e.g. binary, alphanumeric, real number, etc., types, lengths, among others). In this work, an empirical investigation of attribute selection techniques based on Shannon, Rényi and Tsallis entropies is presented, in order to obtain optimal attribute subsets that increase the capability of classifying network traffic as either normal or suspicious. Simulation experiments have been carried out, and the obtained results show that when Rényi or Tsallis entropy is applied, the number of attributes and the processing time are reduced and, in addition, the classification efficiency is increased.


Introduction
According to [1], a computer network intrusion is defined as any set of actions that attempt to compromise the integrity, confidentiality or availability of a network resource. In general, intrusion attempts are external malicious actions whose purpose is to intentionally violate the system security properties. Complete or partial intrusion is the result of successful attacks, which exploit system vulnerabilities. Since invulnerable computer networks are practically impossible to achieve, it is more reasonable to assume that intrusions can happen. In this way, the main challenge in network security is to determine whether any network action is normal or suspicious.
In complex domains, such as network Intrusion Detection Systems (IDS), a huge amount of activity data is collected from the network, generating large log files and raw network traffic data in which human inspection is impossible. Thus, these activity data must be compressed into high-level events, called attributes. After that, a set of attributes is obtained and monitored by the IDS in order to detect intrusion attempts.
However, there are some attributes with false correlations, hiding the underlying process, and others that may be either irrelevant or redundant (their information is somehow included in other attributes). In this way, removing these attributes, or rather, selecting an optimal attribute set that adequately describes the network environment, is essential in order to achieve a fast and effective response against attack attempts, reduce the complexity and the computation time, and increase the precision of the IDS [2]. Hence, the development of methods for selecting optimal attributes is welcome.
In this work, some attribute selection approaches are investigated through a comprehensive comparison of the C4.5 decision-tree model based on Shannon entropy [3] with three other attribute selection methods (proposed by the authors in previous papers), namely, C4.5 based on Rényi entropy [4], C4.5 based on Tsallis entropy [5], and an approach that combines the Shannon [6], Rényi and Tsallis entropies.
In order to evaluate the classification performance of these methods, four attack categories (DoS, Probing, R2L and U2R) based on the KDD Cup 1999 data [7] were considered, along with the following classification models: the CLONal selection ALGorithm (CLONALG) [8], the Clonal Selection Classification Algorithm (CSCA) [9] and the Artificial Immune Recognition System (AIRS) [10].
Experimental results show that the classification efficiency of the methods based on optimal attribute subsets is comparable to that based on the complete attribute set for the CLONALG and CSCA classification models.
The paper is organized as follows. Section II provides more detailed information about the attribute selection methods. The data set, classifiers and performance metrics used in the experiments are described in Section III. Results are reported in Section IV and conclusions are drawn in Section V.

Attribute Selection
Attribute selection is a strategy of removing irrelevant and redundant attributes in order to avoid performance degradation (for instance, in speed, detection precision, etc.) of algorithms for data characterization, rule extraction, design of predictive models, and others.
Considering a given dataset that can be characterized by N attributes, the objective of any attribute selection process is to find a minimum number M of optimal attributes that are capable of describing the dataset as well as the N attributes do, in such a way that the characteristic space is reduced according to some criterion [11].
Attribute selection can be categorized into the filter or the wrapper model. The filter model consists of selecting attributes independently of the chosen learning algorithm, by examining intrinsic characteristics of the data and by estimating the quality of each attribute considering only the available data. In contrast, the wrapper model consists of evaluating the attribute subset performance by applying a predetermined learning algorithm to the selected attribute subset. In this way, for each new attribute subset, the wrapper model needs to train the classification algorithm and, based on its performance, to evaluate and determine which attributes should be selected. In general, this model finds the best attributes for the predetermined classification algorithm, resulting in better learning performance, but it is more computationally expensive than the filter model [12].
Since there are 2^N possible subsets of N attributes, an exhaustive search for an optimal attribute subset may be impracticable, especially when N and the number of data classes increase. Therefore, heuristic methods that explore a reduced search space are commonly used for attribute selection. These methods are typically greedy in the sense that they make a locally optimal choice in the hope that this choice will lead to a globally optimal solution. In practice, such greedy methods are effective in estimating the optimal solution [11].
In turn, Decision Trees are known to be effective classifiers in a large variety of domains. Most decision tree algorithms use a standard top-down greedy approach. The learning process of decision trees is based on an induction process that uses a training dataset described in terms of attributes. The resulting decision tree is a directed graph where each internal node denotes a test on the selected attribute, each branch represents an outcome of the respective test, and each leaf node corresponds to a class label, as shown in Figure 1.
Initially, considering the complete set of attributes, the decision tree algorithm selects an optimal attribute based on some criterion and partitions the data into subsets according to the attribute values. Next, this process is recursively applied to each partitioned subset, and it finishes when a leaf node is obtained, i.e., when the data in the current subset belong to the same class.
In our attribute selection approach, decision tree induction is used for attribute selection. In this way, the attributes that do not appear in the designed decision tree are considered irrelevant. Consequently, the attributes that correspond to the internal nodes are selected to form the optimal attribute subset. The most popular decision tree algorithms are ID3 (Iterative Dichotomiser 3) [13] and its successor, the C4.5 algorithm [3]. Using a top-down process, both algorithms design decision trees by selecting an appropriate attribute for each decision node based on the Shannon entropy measure [6].
Specifically, in the ID3 algorithm, the best attribute in each iteration step is the one with the highest mutual information among all candidates. Although it achieves good results, it presents a high bias in favor of attributes with a large span of values. To try to solve this problem, Quinlan proposed the C4.5 algorithm, which comprises a normalization stage, called gain ratio, that adjusts the apparent gain assigned to large-span attributes [3]. For more details about C4.5 decision trees, see [14].
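The gain-ratio criterion described above can be sketched in a few lines. This is a minimal, illustrative Python sketch, not the WEKA/J48 implementation the authors modified; all function and variable names here are ours:

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """H(C) = -sum p log2 p over the empirical class frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """C4.5 criterion: information gain normalized by split information."""
    n = len(labels)
    # Partition the labels by attribute value to estimate H(C | A)
    partitions = {}
    for v, y in zip(attr_values, labels):
        partitions.setdefault(v, []).append(y)
    h_cond = sum(len(p) / n * shannon_entropy(p) for p in partitions.values())
    gain = shannon_entropy(labels) - h_cond
    # Split information: entropy of the partition sizes themselves,
    # which penalizes attributes with many distinct values
    split_info = -sum((len(p) / n) * math.log2(len(p) / n)
                      for p in partitions.values())
    return gain / split_info if split_info > 0 else 0.0

# Toy example: an attribute that perfectly separates the two classes
attrs = ["a", "a", "b", "b"]
labels = ["normal", "normal", "attack", "attack"]
print(gain_ratio(attrs, labels))  # 1.0: gain = 1 bit, split info = 1 bit
```

In a decision-tree induction loop, this score would be computed for every remaining attribute and the highest-scoring one chosen for the current node.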
As is known, there are other entropy measures, such as the Rényi and Tsallis entropies. In theory, they can be applied in attribute selection schemes. Hence, this paper describes an empirical investigation to assess whether these entropy measures are adequate for designing attribute selection schemes. In the next sections, the Shannon, Rényi and Tsallis entropies are duly described.

Shannon Entropy
Entropy is a statistical measure related to the amount of information carried by a random variable. For a class variable C that assumes values c_j with probabilities p(c_j), the Shannon entropy is defined as:

H(C) = -\sum_j p(c_j) \log_2 p(c_j)   (1)

Now considering a set of N attributes A_i, where i = 1, ..., N, and each attribute A_i can assume v_i finite values, Shannon defined another basic concept of information theory, the mutual information I(C; A_i), which measures the dependence between two random variables, in our case C and A_i. I(C; A_i) is expressed in terms of the Shannon entropy as:

I(C; A_i) = H(C) - H(C | A_i)   (2)

where H(C | A_i) stands for the conditional entropy of C given A_i. The mutual information can be interpreted as the amount of uncertainty about C that is decreased by the knowledge of A_i.
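As a concrete illustration, the Shannon entropy and the mutual information above can be estimated directly from paired samples of a class variable and an attribute. A minimal Python sketch (helper names are ours, not from the paper):

```python
import math
from collections import Counter

def shannon_entropy(xs):
    """H(X) in bits, estimated from the empirical distribution of xs."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(cls, attr):
    """I(C; A) = H(C) - H(C | A), estimated from paired samples."""
    n = len(cls)
    # Group class labels by attribute value to estimate H(C | A)
    groups = {}
    for a, c in zip(attr, cls):
        groups.setdefault(a, []).append(c)
    h_cond = sum(len(g) / n * shannon_entropy(g) for g in groups.values())
    return shannon_entropy(cls) - h_cond

cls = ["normal", "attack", "normal", "attack"]
print(mutual_information(cls, [0, 1, 0, 1]))  # 1.0: attribute determines the class
print(mutual_information(cls, [0, 1, 1, 0]))  # 0.0: attribute carries no information
```

The first attribute removes all uncertainty about the class (one full bit), while the second leaves it unchanged, which is exactly the property ID3 exploits when ranking attributes.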
Other entropy measures have been proposed, for instance, by Rényi [4] and Tsallis [5]. The Rényi and Tsallis entropies are based on an additional parameter α, used to make them more or less sensitive to the shape of the considered probability distribution.

Rényi Entropy
The Rényi entropy constitutes a measure of information of order α, having the Shannon entropy as a limit case (α → 1), and is defined by:

H_α(C) = \frac{1}{1 - α} \log_2 \sum_j p(c_j)^α   (3)

Using the Rényi entropy of order α ∈ (0, 1), the mutual information can be given as:

I_α(C; A_i) = H_α(C) - H_α(C | A_i)   (4)
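A short numerical sketch of the Rényi entropy defined above, showing the limit behavior toward the Shannon entropy (illustrative Python; the function name is ours):

```python
import math

def renyi_entropy(probs, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1), in bits."""
    return math.log2(sum(p ** alpha for p in probs)) / (1.0 - alpha)

# For a uniform distribution over K outcomes every order gives log2(K) bits:
print(renyi_entropy([0.5, 0.5], 0.5))  # ~1.0
print(renyi_entropy([0.5, 0.5], 2.0))  # ~1.0

# As alpha -> 1, the value approaches the Shannon entropy:
q = [0.9, 0.1]
shannon = -sum(p * math.log2(p) for p in q)
print(abs(renyi_entropy(q, 1.0001) - shannon) < 1e-3)  # True
```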

Tsallis Entropy
Another generalized entropy, defined by Constantino Tsallis [5], is given by:

S_α(C) = \frac{1}{α - 1} \left( 1 - \sum_j p(c_j)^α \right)   (5)

For α > 1, the Tsallis mutual information is defined as [15]:

I_α(C; A_i) = S_α(C) - S_α(C | A_i)   (6)

Using the Shannon entropy, events with high or low probability have no different weights in the entropy computation. However, using the Tsallis entropy with α > 1, high-probability events contribute more than low-probability ones. Hence, the higher the value of α, the higher the contribution of high-probability events. In the same way, as the value of α increases (α → ∞), the Rényi entropy is increasingly determined by the events with higher probabilities, and as α decreases (α → 0), the events are weighted more equally, regardless of their probabilities [16].
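The α-sensitivity described above can be made concrete: for α > 1, a skewed distribution (dominated by one high-probability event) scores well below a uniform one under the Tsallis entropy. An illustrative Python sketch (the function name is ours):

```python
def tsallis_entropy(probs, alpha):
    """Tsallis entropy of order alpha (alpha != 1)."""
    return (1.0 - sum(p ** alpha for p in probs)) / (alpha - 1.0)

skewed, uniform = [0.9, 0.1], [0.5, 0.5]
# For alpha > 1 the high-probability event dominates the sum of p^alpha,
# so the skewed distribution's entropy stays below the uniform one:
for alpha in (1.2, 2.0, 5.0):
    print(alpha, tsallis_entropy(skewed, alpha), tsallis_entropy(uniform, alpha))
```

For example, at α = 2 the uniform case gives (1 − 0.5)/1 = 0.5 while the skewed case gives (1 − 0.82)/1 = 0.18.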

Proposed Attribute Selection Schemes
In this work, aiming to select an optimal attribute subset, four different approaches to identify the four attack categories have been considered, and the filter model of attribute selection was adopted. In this way, C4.5-based decision trees (i.e., using the gain ratio) were designed, taking into account Rényi entropy versus Shannon entropy and Tsallis entropy versus Shannon entropy. Moreover, C4.5-based decision trees were designed considering a combination (ensemble) of the Shannon, Rényi and Tsallis entropies. The proposed attribute selection schemes are shown in Figure 2. The ensemble approach combines the results of the individual attribute selection schemes in order to improve the selection of the optimal attribute subset by avoiding reliance on a single approach.

Simulation Environment
In order to evaluate the proposed attribute selection schemes and to design the classification models, the WEKA toolkit (Waikato Environment for Knowledge Analysis) [17] was used.
In WEKA, the source code of the class J48, which generates standard C4.5-based decision trees, was modified by the authors using the Java programming language, replacing the Shannon entropy with the α-dependent Rényi and/or Tsallis entropies.

Data Set Description
In general, dataset benchmarks are used to evaluate IDS schemes; for instance, the intrusion dataset available from the Knowledge Discovery and Data Mining Competition (KDD Cup 99) [7] is used for both training and testing. This dataset is still used by researchers because it makes it possible to compare different intrusion detection techniques on a common dataset.
In the KDD99 database, any network connection (or instance) comprises 41 attributes, and each instance is labeled either as normal or as a specific attack type. These attributes are shown in Table 1 and their meanings can be found in [7].
In the KDD99 database, there are 494,021 instances, of which 97,278 are considered normal and 396,743 are labeled as attacks of 22 different types, which can be classified into 4 main categories as follows:

• Denial of Service (DoS) - attacks from this category lead to the denial of legitimate requests, usually by network flooding, which is defined as a very large number of connections to the same host in a very short time.
• Probing - an attack category based on scanning the network in order to obtain information or find vulnerabilities. Probing actions are based on sending a huge number of packets to different hosts in a short time with very short durations.
• Remote to Local (R2L) - attacks from this category can be characterized by attempts of a remote-machine user to gain access to a local server.
• User to Root (U2R) - an attack category characterized by an authorized user trying to gain super-user (root) access.
Usually, network traffic data samples need to be collected in advance to design an intrusion detection system. However, complete attack information is very difficult to obtain because, in the real world, intruders constantly develop new attack methods in order to exploit system security vulnerabilities. Since, in general, the collected samples always present some uncertainty, as only limited information about intrusive activities is available, a subset of attacks from each category was randomly selected from the KDD99 database in order to simulate the uncertainty problem and to decrease the computational cost without compromising the research results. As shown in Table 2, each category contains instances corresponding to attack types and normal behavior, and their individual amounts are shown in brackets.

Performance Metrics
In a binary classification problem aiming to distinguish normal behavior patterns (positive) from suspicious attack patterns (negative), any classifier is supposed to label instances as either positive or negative. The classifier decisions can be represented in a structure known as a confusion matrix. The confusion matrix has four categories: true positives (TP) (i.e., positive instances correctly classified as normal), false positives (FP) (i.e., negative instances misclassified as normal), true negatives (TN) (i.e., negative instances correctly classified as attacks), and false negatives (FN) (i.e., positive instances misclassified as attacks).
The number of instances from the database in each of these categories (outcomes) forms the basis for several other performance measures that are well known and commonly used for classifier evaluation. Therefore, the analysis of the proposed attribute selection approaches described previously was carried out by means of the performance measures explained below.
The Area Under the Receiver Operating Characteristic (ROC) Curve, called AUC, is a single-value measurement that originated in the signal detection field and has been widely used to measure classification model performance [18]. The value of the AUC ranges from 0 to 1. The ROC curve is used to characterize the trade-off between the true positive rate and the false positive rate. It provides an effective way to compare the performance of classifiers on imbalanced datasets. A classifier that gives a large area under the ROC curve is preferable over a classifier with a smaller area under the curve. A perfect classifier provides an AUC equal to 1.
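The AUC can equivalently be computed as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. A small illustrative Python sketch of this equivalence (not the WEKA implementation; names are ours):

```python
def auc(pos_scores, neg_scores):
    """AUC as P(score of a positive > score of a negative); ties count 1/2."""
    wins = 0.0
    for sp in pos_scores:
        for sn in neg_scores:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8], [0.2, 0.1]))  # 1.0: perfect separation
print(auc([0.6, 0.4], [0.6, 0.4]))  # 0.5: scores indistinguishable from chance
```

Because this pairwise formulation compares every positive against every negative, it is insensitive to the class imbalance that plagues intrusion datasets, which is why AUC suits the comparison in Table 5.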
In turn, the Kappa statistic is a method that compensates for random hits [19]. It is originally a measure of agreement between two classifiers. However, it is employed as a classifier performance measurement because it takes random successes into account as a baseline [20].
The value of the Kappa statistic ranges from 0 (total disagreement) to 1 (perfect agreement), and it is less expressive than ROC curves when applied to binary classification. However, for multiple-class problems, the Kappa statistic is very useful for measuring the accuracy of the classifier while compensating for random successes.
The main difference between the classification rate (CR, the fraction of correctly classified instances) and the Kappa statistic is the scoring of the correct classifications. CR scores all the successes over all classes, whereas the Kappa statistic scores the successes independently of each class. The latter is less sensitive to the randomness caused by the different number of instances in each class.
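This difference can be demonstrated from a binary confusion matrix: Cohen's kappa subtracts the agreement expected by chance, which CR does not. An illustrative Python sketch (the function name and example counts are ours):

```python
def kappa(tp, fp, fn, tn):
    """Cohen's kappa from a binary confusion matrix."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n                        # observed accuracy (CR)
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)  # chance agreement on positives
    p_neg = ((fn + tn) / n) * ((fp + tn) / n)  # chance agreement on negatives
    p_e = p_pos + p_neg                        # total chance agreement
    return (p_o - p_e) / (1 - p_e)

# A classifier that always predicts the majority class on a 90/10 split
# scores CR = 0.9 but kappa = 0, exposing the chance-level performance:
print(kappa(tp=90, fp=10, fn=0, tn=0))  # 0.0
```

This is precisely why kappa is the more honest metric on the heavily imbalanced attack categories (e.g. U2R) used in the experiments.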

Classifiers -AIS algorithms
An Artificial Immune System (AIS) is a class of adaptive computational algorithms that emulate processes and mechanisms inspired by biological immune systems. These algorithms use the learning, memory, and optimization capabilities of the immune system to develop computational tools for classification, optimization, pattern recognition, novelty detection, process control, among others. AIS is intended to produce adaptive systems capable of solving problems in different domains [21].
In this work, the classification performance is obtained considering the following classification models: the clonal selection algorithm (CLONALG) [8], the clonal selection classification algorithm (CSCA) [9] and the artificial immune recognition system (AIRS) [10]. These algorithms simulate the antigen-antibody recognition process by evolving a population of B-cells in order to recognize antigens (suspicious attack patterns from the training set). They are applied to the attribute subsets selected by the four proposed attribute selection approaches.
CLONALG is based on the clonal selection theory as proposed in [8]. Its goal is to develop a memory set of antibodies that represents a solution for a specific problem. It describes the basic feature of an immune response against an antigenic stimulus, namely that only those cells that recognize the antigens are selected to proliferate. The selected cells are subject to an affinity maturation process, which improves their affinity with the antigens. CLONALG was implemented by Brownlee [9] in the WEKA toolkit.
CSCA was developed by Brownlee [9] and is formulated around a fitness function that maximizes the number of correctly classified patterns and minimizes misclassification. In CSCA, many generations are carried out and, in each generation, the entire set of antibodies is exposed to all antigens.
Finally, AIRS, a supervised learning algorithm used for classification problems, was proposed in 2001 [22]. AIRS [10] is a clonal-selection-inspired procedure that performs cloning and somatic hypermutation to mature a set of recognition cells (or memory cells) that are representative of the training data the model was exposed to. It is suitable for classifying unobserved cases, and it uses a single iteration over the training dataset.
In the AIRS algorithm, any B-cell is defined as an Artificial Recognition Ball (ARB) that consists of an antibody indicating: the class it belongs to, the number of resources held by the cell, and the current stimulation value of the cell (defined as the similarity between the ARB and an antigen). The ARB population is trained during several cycles of competition for limited resources. The best ARBs receive the highest number of resources, and ARBs without resources are eliminated from the population. In each training cycle, the best ARB classifiers generate mutated clones that enhance the antigen recognition process, whereas the ARBs with insufficient resources are removed from the population. After training, the best classified ARBs are selected as memory cells, and they are used to classify novel antigens.
The so-called AIRS1, the first version of AIRS, performs its tasks using data reduction. This means that it does not use the complete training data for generalization, and the resulting classifier represents the training data with a reduced or minimum number of instances. It was adopted in this work. Other versions have been presented (e.g. AIRS2, parallel AIRS2), but they were not tested in this work due to the high volume of the datasets, which would greatly increase the overall runtime.

Experimental Results
Considering the experimental simulation results previously obtained by the authors, shown in [14], where decision trees based on the Shannon, Rényi and Tsallis entropies were designed, the best decision trees in terms of classification efficiency and tree size were chosen here. For example, the decision tree designed with Rényi entropy with α = 0.5 and the decision tree designed with Tsallis entropy with α = 1.2 were selected for the DoS category.
After choosing the decision trees, a subset of attributes was individually selected for each dataset according to the individual category of attacks. Moreover, a new attribute subset was selected based on the ensemble approach, extracted by using the Shannon, Rényi and Tsallis entropies. Since different attack categories may have different optimal attribute subsets, four experiments have been performed in order to evaluate which attribute subsets are more suitable for detecting each category of attacks according to a given entropy formulation. The experimental results are shown in Table 3. The attribute subsets selected by the ensemble approach are shown in Table 4. As can be seen in Table 3, and as expected, some selected attributes differ between attack categories, because different types of attack evidently have their own patterns. In addition, it is important to notice that attributes 20 and 21 do not show any variation in the data set. Thus, they have no relevance to intrusion detection.

Experimental Result Analysis
In the experimental procedures, the three classification models are first applied to the original data sets (with 41 attributes) in order to obtain the classification performance on the testing instances. Next, the classification results of these algorithms are used to compare the effectiveness of the four proposed attribute selection techniques. The criteria used to evaluate the effectiveness of the selected attributes are the Kappa statistic [19] and the AUC. The results are shown in Table 5.
The experiments were carried out using a ten-fold cross-validation approach. The user-defined parameters of each algorithm have been optimized to achieve the best possible classification accuracy. The experimental results were obtained considering the network-traffic training data sets described in Table 2.
Analyzing the experimental results on the performance of the attribute selection schemes, it is observed that they are statistically significantly different at the 1% level. Furthermore, the performance values varied depending on both the classifiers and the performance metric used to evaluate the models.
The attribute selection significantly decreased the number of attributes and the data dimensionality, leading to a better performance of the AIS algorithms and resulting in a shorter running time compared to the situation in which the complete attribute set of the original database was used.
From Table 5, the detection results on the KDD99 dataset indicate that the performance remains almost the same, or even becomes better, for the CLONALG and CSCA classification models designed on the DoS, R2L and U2R datasets with any attribute selection technique, compared to when the complete data set (with 41 attributes) is used.
In particular, Tsallis entropy achieves no improvement in performance (see the kappa and AUC values in Table 5) for the CLONALG/CSCA and AIRS1 algorithms on the Probing/R2L datasets. However, it achieves the best result when models are designed using the U2R dataset and the CLONALG algorithm.
For the DoS, R2L and U2R attack categories with the AIRS1 algorithm, the classification efficiency, in terms of the kappa statistic and considering the attributes selected by the four attribute selection approaches, was significantly worse compared with the complete data set.
Based on Table 5, the preliminary results point out that when an attribute selection scheme performs best in terms of one performance metric, this may not be true when another performance metric is used to evaluate the model. For example, using the DoS dataset, Tsallis entropy performed best on AUC for any AIS algorithm, whereas the ensemble approach performed best (excluding the complete attribute set) in terms of the kappa metric when models were designed using the AIRS1 algorithm. Another relevant result is that, compared with Shannon entropy, Tsallis entropy achieved the same or even a smaller number of attributes to detect attacks for all attack categories, and Rényi entropy achieved the same or a smaller number of attributes for the Probing, R2L and U2R attack categories.

Conclusions
In this paper, an evaluation of the Shannon, Rényi and Tsallis entropies and their application to intrusion detection systems was presented. Additionally, an ensemble approach that combines the attributes selected by the Rényi, Tsallis and Shannon information measures was proposed. In general, the experimental results have shown that selecting attributes based on the Rényi and Tsallis entropies and on the ensemble approach achieves better results for the individual categories. Moreover, the attribute selection approaches based on the Rényi or Tsallis entropy have reduced the number of attributes and the computational time. For future research, more detailed attributes from real network traffic will be used, which are supposedly able to better characterize packet contents as well as header data.