Using Data Mining Technique to Predict Cause of Accident and Accident Prone Locations on Highways

Road accidents are a special case of trauma and a major cause of disability, untimely death and the loss of loved ones and family breadwinners. Predicting the likelihood of road accidents on highways, with particular emphasis on the Lagos-Ibadan expressway in Nigeria, is therefore important for accident prevention. Various attempts have been made to identify the causes of accidents on highways using different techniques and systems, and to reduce accidents on the roads, but the accident rate keeps increasing. In this study, the techniques used to analyse the causes of accidents along this route, and the effects of those accidents, were examined. A technique that uses a data mining tool to predict the likely occurrence of accidents on highways, their likely causes and accident-prone locations was proposed, using the Lagos-Ibadan highway as a case study. WEKA software was used to analyse accident data gathered along this road. The results showed that the causes of accidents, the specific times and conditions that could trigger accidents, and accident-prone areas could be effectively identified.


Introduction
Road accidents are a special case of trauma and a major cause of disability and untimely death. It has been estimated that over 300,000 persons die and 10 to 15 million persons are injured every year in road accidents throughout the world. Statistics have also shown that mortality in road accidents is very high among young adults, who constitute the major part of the workforce. In fact, accidents kill faster than AIDS and give no preparatory time to their victims. To combat this problem, various road safety strategies have been proposed and used. These methods mainly involve conscious planning, design and operation of roads. One important feature of these methods is the identification and treatment of accident-prone locations, commonly called black spots; however, black spots are not the only cause of accidents on the highway. In addition, various organizations such as the Police Highway Patrol, Vehicle Inspection Officers (VIO) and the Federal Road Safety Commission (FRSC), among others, are charged with the responsibility of maintaining safety and thereby reducing road accidents. However, the lack of good forecasting techniques has been a major hindrance to these organizations in achieving their objectives.
Decision trees have emerged as a powerful technique for modelling general input/output relationships. They are tree-shaped structures that represent a series of rules leading to sets of decisions. They generate rules for the classification of a dataset, together with a logical model, represented as a binary (two-way split) tree, that shows how the value of a target variable can be predicted from the values of a set of predictor variables. Decision trees used for regression analysis problems are called regression trees. Thus, the decision tree represents a logical model of the regularities of the phenomenon under study.

Accidents along Lagos-Ibadan Express Way
The Lagos to Ibadan Express road is one of the busiest roads in Africa. This is because Lagos was the capital of Nigeria until the seat of government moved to the Federal Capital Territory, Abuja, and is also the headquarters of many national institutions, while Ibadan is said to be the largest city in black Africa. The traffic along this route is very heavy because it is a gateway for the heavy traffic going from the Northern, Eastern and the majority of the Western states.
Several works have been carried out by different researchers on both road accident analysis and forecasting, using Decision Trees and Artificial Neural Networks. Martin, Grandal and Pilkey (2000) analysed the relationship between road infrastructure and safety using a cross-sectional time-series database collected for all 50 U.S. states over 14 years. The result suggested that fatalities are reduced as highway facilities are upgraded. Gelfand (1991) studied the effect of new pavement on traffic safety in Sweden. The result of that study showed that traffic accidents increased by 12% one year after resurfacing on all types of roads. Akomolafe (2004) employed an Artificial Neural Network, using a multilayer perceptron, to predict the likelihood of an accident happening at a particular location within the first 40 kilometers along the Lagos-Ibadan Express road, and discovered that location 2 recorded the highest number of road accident occurrences and that tyre burst was the major cause of accidents along the route. Ossenbruggen (2005) used a logistic regression model to identify statistically significant factors that predict the probabilities of crashes and injury crashes, with the aim of using these models to perform a risk assessment of a given region. The study illustrated that village sites are less hazardous than residential and shopping sites. Abdalla et al. (1987) studied the relationship between casualty frequencies and the distance of the accidents from the zones of residence. As might have been anticipated, the casualty frequencies were higher nearer to the zones of residence, possibly due to higher exposure. Akomolafe et al. (2009) used geospatial technology to identify various positions along major roads in Nigeria. The study revealed that the casualty rates amongst residents from areas classified as relatively deprived were significantly higher than those from relatively affluent areas.

Process of Data Mining
The process of data mining consists of three steps which are:

Data Preparation
This includes data collection, data cleaning and data transformation.

Data Modeling
This research considers accident records for the first 40 km from Ibadan to Lagos. The data were organized into a relational database.
The unknown causes in Table 3.2 may include other factors such as law enforcement agent problems, the attitude of other road users, inadequate road traffic signs, traffic congestion and general vehicle condition. The sample data covered a period of 24 months, from January 2002 to December 2003, as indicated in Fig. 3.1.
The output variable is the location. The locations are divided into three distinct regions, tagged regions A, B and C, giving three outputs: the first 1-10 km is region A (Location 1), above 10 km up to 20 km is region B (Location 2), and above 20 km is region C (Location 3). The data sample covered a period of twenty-four months, from January 2002 to December 2003. The data were collected by Akomolafe (2004) and are presented in Table 3.3.
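The region split described above can be sketched as a small lookup function. This is only an illustration: the function name is hypothetical, and it assumes that distances are measured in kilometers from the start of the studied 40 km stretch and that each boundary value (10 km, 20 km) belongs to the lower region.

```python
def region_for_distance(km: float) -> str:
    """Map a distance along the studied stretch (in km) to a region label.

    Assumption: 0 < km <= 40, with boundary values assigned to the
    lower region (10 km -> A, 20 km -> B).
    """
    if km <= 0 or km > 40:
        raise ValueError("distance must lie within the studied 40 km stretch")
    if km <= 10:
        return "A"  # Location 1: first 10 km
    if km <= 20:
        return "B"  # Location 2: above 10 km up to 20 km
    return "C"      # Location 3: above 20 km


if __name__ == "__main__":
    for km in (3.5, 10, 10.1, 27):
        print(km, region_for_distance(km))
```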

Deployment
In this stage, new sets are applied to the model selected in the previous stage to generate predictions or estimates of the expected outcome.

Analysis
The major step required to obtain the results of this research was the analysis of the data using WEKA. WEKA is a collection of machine learning algorithms and data processing tools. It contains tools for data pre-processing, classification, regression, clustering, association rules and visualization. Many learning algorithms are implemented in WEKA, including Bayesian classifiers, trees, rules, functions, lazy classifiers and miscellaneous classifiers. The algorithms can be applied directly to a data set. WEKA is data mining software developed in Java; it has a GUI Chooser from which any one of the four major WEKA applications can be selected. For the purpose of this study, the Explorer application was used.
The Explorer window of WEKA has six tabs. The first tab, Preprocess, enables the formatted data to be loaded into the WEKA environment. Once the data have been loaded, the Preprocess panel shows a variety of information, as shown in Figure 4.3.
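WEKA typically reads data in its ARFF text format, so the formatted data mentioned above could be produced along the following lines. This is a minimal sketch and not the procedure used in the study: the attribute names are taken from the rule fragment reported later in the paper, the full attribute list of the original dataset is not given, and the relation name and sample record are hypothetical.

```python
from typing import Dict, List

# Attribute names as they appear in the reported rule fragment; the
# complete attribute list of the study's dataset is not known, so this
# subset is illustrative only.
CAUSE_ATTRIBUTES = ["BRAKE-FAILURE", "ROAD-PROBLEM", "UNKNOWN-CAUSES", "ROBBERY-ATTACK"]
CLASS_VALUES = ["LOCATION1", "LOCATION2", "LOCATION3"]


def to_arff(relation: str, records: List[Dict[str, str]]) -> str:
    """Serialize records into ARFF text with nominal TRUE/FALSE attributes."""
    lines = ["@relation " + relation, ""]
    for name in CAUSE_ATTRIBUTES:
        lines.append("@attribute " + name + " {TRUE,FALSE}")
    # The class attribute (the location) is listed last.
    lines.append("@attribute LOCATION {" + ",".join(CLASS_VALUES) + "}")
    lines += ["", "@data"]
    for rec in records:
        row = [rec[name] for name in CAUSE_ATTRIBUTES] + [rec["LOCATION"]]
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    sample = [{"BRAKE-FAILURE": "FALSE", "ROAD-PROBLEM": "FALSE",
               "UNKNOWN-CAUSES": "FALSE", "ROBBERY-ATTACK": "FALSE",
               "LOCATION": "LOCATION2"}]
    print(to_arff("accidents", sample))
```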

Weka Classifiers
Several classifiers are available in WEKA, but Function Tree (FT) and Id3 were the decision tree classifiers used in this study. A Prism rule-based learner was also generated using WEKA. Attribute importance analysis was carried out to rank the attributes by significance using information gain. Finally, the correlation-based feature subset selection (CFS) and consistency subset selection (COE) filter algorithms were used to rank and select the most useful attributes. The F-measure and the AUC, which are well-known measures for probability tree learning, were used as evaluation metrics for the models generated by the WEKA classifiers.
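Ranking attributes by information gain can be illustrated with the short sketch below, which computes the reduction in class entropy obtained by splitting on each attribute. It is not WEKA's implementation; the function names, the attribute names TYRE-BURST and SEASON, and the four sample rows are hypothetical and serve only to show the calculation.

```python
from collections import Counter
from math import log2
from typing import Dict, List, Sequence


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows: List[Dict[str, str]], attribute: str, target: str) -> float:
    """Entropy of the target minus its weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def rank_attributes(rows: List[Dict[str, str]], attributes: List[str], target: str) -> List[str]:
    """Return the attributes ordered from highest to lowest information gain."""
    return sorted(attributes, key=lambda a: information_gain(rows, a, target), reverse=True)


if __name__ == "__main__":
    # Hypothetical rows, for illustration only.
    rows = [
        {"TYRE-BURST": "TRUE", "SEASON": "DRY", "LOCATION": "LOCATION2"},
        {"TYRE-BURST": "TRUE", "SEASON": "WET", "LOCATION": "LOCATION2"},
        {"TYRE-BURST": "FALSE", "SEASON": "DRY", "LOCATION": "LOCATION3"},
        {"TYRE-BURST": "FALSE", "SEASON": "WET", "LOCATION": "LOCATION3"},
    ]
    print(rank_attributes(rows, ["SEASON", "TYRE-BURST"], "LOCATION"))
```

In this toy data the first attribute separates the two locations perfectly and the second carries no information about the location, so the first is ranked higher.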
Several setups of the decision tree algorithms were tried, and the best result obtained for the data set is reported. Each class was trained with the entropy fit measure; the prior class probabilities parameter was set to equal; the stopping option for pruning was misclassification error; the minimum n per node was set to 5; the fraction of objects was 0.05; the maximum number of nodes was 100; the number of surrogates was 5; and 10-fold cross-validation was used to generate comprehensive results.
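The 10-fold cross-validation mentioned above partitions the instances into ten folds, each of which serves once as the test set while the remaining nine are used for training. The sketch below shows one simple way to build such splits. It is an assumption-laden illustration rather than WEKA's own procedure: the function name and seed are hypothetical, no stratification by class is attempted, and the instance count of 148 is taken from the 115 + 33 classified instances reported for Id3.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int = 10, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index splits over n instances."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatable splits
    # Deal the shuffled indices round-robin into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    for train, test in k_fold_indices(148, k=10):
        print(len(train), len(test))
```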
The best decision tree result was obtained with Id3, with 115 correctly classified instances and 33 incorrectly classified instances, which represent 77.70% and 22.29% respectively.
Mean absolute error was 0.1835 and Root mean squared error was 0.3029.
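The reported percentages follow directly from the instance counts, as the short check below illustrates. The function name is hypothetical; the counts are those reported for Id3. Note that 33 out of 148 is roughly 22.297%, so the reported 22.29% appears to be a truncated rather than a rounded value.

```python
from typing import Tuple


def classification_rates(correct: int, incorrect: int) -> Tuple[float, float]:
    """Return (accuracy, error rate) as fractions of all classified instances."""
    total = correct + incorrect
    return correct / total, incorrect / total


if __name__ == "__main__":
    accuracy, error = classification_rates(115, 33)  # counts reported for Id3
    print(f"instances={115 + 33} accuracy={accuracy:.4%} error={error:.4%}")
```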
The tree and rules generated with the Id3 algorithm are given thus: and BRAKE-FAILURE = FALSE and ROAD-PROBLEM = FALSE and UNKNOWN-CAUSES = FALSE and ROBBERY-ATTACK = FALSE then LOCATION2
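A rule of this form is a conjunction of attribute tests followed by a predicted location, and applying it to a record can be sketched as follows. Only the conditions visible in the fragment above are encoded; the leading conditions of the rule are not shown in the text and are not reconstructed here, so this partial rule is illustrative and is not the full rule produced by Id3. The function names and the sample records are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Condition = Tuple[str, str]  # (attribute name, required value)

# Conditions visible in the reported fragment only; the beginning of the
# rule is missing from the text and is deliberately left out.
VISIBLE_CONDITIONS: List[Condition] = [
    ("BRAKE-FAILURE", "FALSE"),
    ("ROAD-PROBLEM", "FALSE"),
    ("UNKNOWN-CAUSES", "FALSE"),
    ("ROBBERY-ATTACK", "FALSE"),
]
CONCLUSION = "LOCATION2"


def rule_fires(record: Dict[str, str], conditions: List[Condition]) -> bool:
    """True when every condition of the rule holds for the record."""
    return all(record.get(attr) == value for attr, value in conditions)


def predict(record: Dict[str, str]) -> Optional[str]:
    """Return the rule's conclusion if it fires, otherwise None."""
    return CONCLUSION if rule_fires(record, VISIBLE_CONDITIONS) else None


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    record = {"BRAKE-FAILURE": "FALSE", "ROAD-PROBLEM": "FALSE",
              "UNKNOWN-CAUSES": "FALSE", "ROBBERY-ATTACK": "FALSE"}
    print(predict(record))
```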

Discussion
Fifty rules were generated from this tree. Rules 1-18 indicate the occurrence of accidents in Location 3, and rules 19-50 indicate the occurrence of accidents in Location 2. This indicates that Location 2 has the highest number of road accident occurrences, involving heavy vehicles in the afternoon and during the dry season.
Rule 41 is the best one for prediction. It states that tyre burst is the cause of road accidents involving heavy vehicles within Location 2 in the daytime and during the dry season.

Conclusions
Using WEKA software to analyze accident data collected on the Lagos-Ibadan road, it was found that a decision tree can accurately predict the cause(s) of accidents and accident-prone locations along this road, and along other roads if relevant data are gathered and analyzed as in this case.
In the decision tree performance analysis, the dataset was tested with two algorithms: Id3 and FT (function tree). For the Id3 algorithm, there were 115 correctly classified instances and 33 incorrectly classified instances, which represent 77.70% and 22.29% respectively. The mean absolute error was 0.1835 and the root mean squared error was 0.3029.
For the functional tree (FT) algorithm, the tree size was 5, with 105 correctly classified instances representing 70.27% and 44 incorrectly classified instances representing 29.73%.
From the detailed accuracy by class and the confusion matrix, Id3 attained an accuracy rate of 0.777 and FT attained an accuracy rate of 0.703.
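Per-class figures of the kind referred to above, together with the overall accuracy, can be derived from a confusion matrix as sketched below. The matrix used here is hypothetical and is not the one obtained in this study, which is not reproduced in the text; the function names are likewise illustrative. Rows are taken to be actual classes and columns predicted classes.

```python
from typing import Dict, List, Tuple


def per_class_metrics(matrix: List[List[int]], labels: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return {label: (precision, recall, F-measure)}; rows = actual, columns = predicted."""
    metrics = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                  # actual class i, predicted otherwise
        fp = sum(row[i] for row in matrix) - tp   # predicted class i, actually otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f_measure)
    return metrics


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Fraction of instances on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


if __name__ == "__main__":
    # Hypothetical confusion matrix, for illustration only.
    labels = ["LOCATION1", "LOCATION2", "LOCATION3"]
    matrix = [[5, 1, 0],
              [2, 6, 1],
              [0, 1, 4]]
    print(overall_accuracy(matrix))
    for label, values in per_class_metrics(matrix, labels).items():
        print(label, values)
```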