Neuro-PCA-Factor Analysis in Prediction of Time Series Data

Many related parameters are usually considered when predicting a physical problem. Many of them are not significant, or they are highly correlated with other parameters, while a few play a significant role in the prediction. These few give the necessary and sufficient information and are not correlated with the others. The output of the problem can therefore be predicted from a small number of significant parameters instead of all of them. In this paper, an effort has been made to find the significant environmental parameters in the production of the mustard plant using principal component and factor analysis. Environmental parameters such as maximum and minimum temperature, rainfall, maximum and minimum humidity, soil moisture at different depths and sunshine affect the growth of the mustard plant. The parameters do not all have the same effect, and predicting the growth of the mustard plant from all of them is complex. Principal component and factor analysis have been used here to reduce the set of environmental parameters and to find the significant parameters that contribute most to the growth of the mustard plant. Finally, an artificial neural network has been applied to the highly significant parameters to predict the production of the mustard plant at maturity.


Introduction
From the literature, the main applications of principal component and factor analysis are (1) to reduce the number of variables and (2) to detect structure in the relationships between variables, that is, to classify variables. Therefore, factor analysis and principal component analysis are applied as data reduction or structure detection methods. The term factor analysis was first introduced by Thurstone [6] in 1931. A hands-on how-to approach can be found in Stevens (1986); more detailed technical descriptions are provided in Cooley and Lohnes (1971); Harman (1976); Kim and Mueller (1978a, 1978b); Lawley and Maxwell (1971); Lindeman, Merenda and Gold (1980); Morrison (1967); or Mulaik (1972). The interpretation of secondary factors in hierarchical factor analysis, as an alternative to traditional oblique rotational strategies, is explained in detail by Wherry (1984). When mining a dataset comprised of numerous variables, it is likely that subsets of variables are highly correlated with each other. Given high correlation between two or more variables, it can be concluded that these variables are quite redundant and thus share the same driving principle in defining the outcome of interest. The use of principal component analysis techniques [3] is well established in many fields such as pharmacology, climatology, numerous aspects of the life sciences and economics (Jolliffe, 1986; Faloutsos, Korn, Labrinidis, Kaplunovich, & Perkovic, 1997; Preisendorfer, 1988; Shum, Ikeuchi, & Reddy, 1997), and even religious studies. See, for example, Walker (2001), who has provided a very illustrative and imaginative use of this statistical methodology.
S. F. Brown, A. Branford and W. Moran [33] proposed that artificial neural networks were a powerful tool for analyzing data sets where there were complicated nonlinear interactions between the measured inputs and the quantity to be predicted. F. G. Donaldson and M. Kamstra [42] investigated the use of an Artificial Neural Network (ANN) to combine time series forecasts of stock market volatility from the USA, Canada, Japan and the UK. The authors extended existing combining techniques to a particular class of nonlinear combining procedures based on the ANN. H. J. Zimmermann [34] presented the application of fuzzy linear programming approaches to the linear vector maximum problem and showed that the solutions obtained by fuzzy linear programming were always efficient solutions. In a fuzzy environment, a decision could be viewed as the combination of the fuzzy objective function, which was characterized by its membership function, and the constraints. G. A. Tagliarini, J. F. Christ and E. W. Page [36] demonstrated that artificial neural networks could achieve high computation rates by employing a massive number of simple processing elements with a high degree of connectivity between the elements. Neural networks with feedback connections provided a computing model capable of exploiting fine-grained parallelism to solve a rich class of optimization problems. Their paper presented a systematic approach to designing neural networks for optimization applications. M. Laviolette, J. W. Seaman Jr, J. D. Barrett and W. H. Woodall [35] observed that fuzzy set theory had primarily been associated with control theory and with the representation of uncertainty in applications of artificial intelligence. Fuzzy methods had been proposed as alternatives to statistical methods in statistical quality control, linear regression and forecasting. The same authors [35] also stated the difference between fuzzy and probabilistic logic and the advantages of the fuzzy logic controller. The distinction between randomness and fuzziness was based on the different types of uncertainty captured by each concept. R. G. Alamond [37] presented a comparison between fuzzy set theory and probability theory, problems with probability, and certain applications of fuzzy set theory. Uncertainty referred to an event whose outcome was not known in a single experiment but whose behavior could be predicted over many similar experiments.
Melike Sah and Konstantin Y. Degtiarev [28] proposed a novel improvement of the forecasting approach based on time-invariant fuzzy time series, applied to the historical enrollment of the University of Alabama. They compared the proposed method with the existing time-invariant fuzzy time series model in terms of forecasting accuracy.
Tahseen Ahmed Jilani, Syed Muhammad, Agil Burney and Cemal Ardil [38] proposed a method based on frequency density based partitioning of the historical enrollment data. They showed that the proposed method is the best method in terms of forecasting accuracy rate for enrollments compared with the existing methods.
Using the value of shoot length, it has been observed that an artificial neural network gives better results than fuzzy logic and statistical models [15]. An effort has also been made to apply a neural network based on fuzzy data to mango export quantity and the revenue generated from it [16].
Various research works ([19]-[27]) have been carried out using fuzzy logic and artificial neural networks to forecast rainfall, temperature and thunderstorms. S. Kotsiantis, E. Koumanakos, D. Tzelepis and V. Tampakas [29] explored the effectiveness of machine learning techniques in detecting firms that issue fraudulent financial statements (FFS) and dealt with the identification of factors associated with FFS. Tahseen Ahmed Jilani, Syed Muhammad, Agil Burney and Cemal Ardil [30] proposed a method based on frequency density based partitioning of the historical enrollment data. They showed that the proposed method is the best method in terms of forecasting accuracy rate for enrollments compared with the existing methods. A lot of research work has also been conducted on the prediction of several other quantities ([19]-[30]).
In this paper, an effort has been made to find the significant environmental parameters that affect the growth of the mustard plant using principal component and factor analysis. Environmental parameters such as maximum and minimum temperature, rainfall, maximum and minimum humidity, soil moisture at different depths and sunshine have been taken. The parameters have then been reduced, and only a few of them have been used to predict the growth of the mustard plant. The growth of the mustard plant can be measured by observing the growth of its shoot length only.
New leaves may appear and old leaves may fall, and the roots go deeper and deeper into the soil. For this reason, the shoot length has been considered here to predict the productivity of the mustard plant. At the initial stage, using the reduced parameters, the shoot length of the mustard plant has been predicted by an artificial neural network (ANN). The least squares method has been applied to the predicted shoot length to find the shoot length at maturity. Finally, the productivity of the plant has been predicted from the shoot length at maturity (after 95 days). This type of approach has not previously been used to predict the growth of the mustard plant, which is the reason for making the effort in this paper.

Principal Component Analysis
PCA ([1]-[5]) transforms the original set of variables into a smaller set of linear combinations that account for most of the variance of the original set. Principal component analysis captures as much of the total variation of the data as possible using few factors [43]. The first principal component, PC(1), accounts for the maximum of the total variation in the data. PC(1) is a linear combination of the observed variables $X_j$, $j = 1, 2, \ldots, p$, say

$$PC(1) = w_{(1)1} X_1 + w_{(1)2} X_2 + \cdots + w_{(1)p} X_p,$$

where the weights $w_{(1)1}, w_{(1)2}, \ldots, w_{(1)p}$ have been chosen to maximize the ratio of the variance of PC(1) to the total variation, subject to the constraint that $\sum_{j=1}^{p} w_{(1)j}^2 = 1$. The second component, PC(2), is uncorrelated with PC(1) and accounts for the maximum amount of the total variation not already accounted for by PC(1). In general, the $m$th principal component is the weighted linear combination of the $X$'s

$$PC(m) = w_{(m)1} X_1 + w_{(m)2} X_2 + \cdots + w_{(m)p} X_p,$$

which has the largest variance of all linear combinations that are uncorrelated with all of the previously extracted principal components. In this way, as many principal components as possible are extracted.
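The extraction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Statistica procedure used in this paper: the function name `pca_from_correlation` and the synthetic data are assumptions made for the example. The weights are taken as the eigenvectors of the correlation matrix, which have unit length and therefore satisfy the constraint on the squared weights.

```python
import numpy as np

def pca_from_correlation(X):
    """Return eigenvalues (descending), unit-norm weight vectors and component scores."""
    X = np.asarray(X, dtype=float)
    # Standardize each column so that the covariance of Z is the correlation matrix of X.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = np.corrcoef(X, rowvar=False)
    # R is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs  # column m holds PC(m+1) for every observation
    return eigvals, eigvecs, scores

# Synthetic data only (hypothetical, not the survey data of this paper).
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=50),  # nearly redundant with the first column
    base[:, 1],
    rng.normal(size=50),
])
eigvals, W, scores = pca_from_correlation(X)
print(eigvals)  # variances of the successive components
```

Because the components are built from the correlation matrix, the eigenvalues sum to the number of variables and the component scores are mutually uncorrelated.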

Factor Analysis
Factor analysis is used to identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables [6]. Factor analysis has also been used in data reduction, by identifying a small number of factors that explain most of the variance observed in a much larger number of variables.
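For reference, the usual linear factor model behind this description can be written as follows. The notation is an assumption introduced here for illustration: $X_j$ is the $j$th observed variable, $F_1, \ldots, F_m$ are the common factors with $m < p$, $\lambda_{jk}$ is the loading of variable $j$ on factor $k$, and $\varepsilon_j$ is the part of $X_j$ not explained by the common factors.

```latex
X_j = \lambda_{j1} F_1 + \lambda_{j2} F_2 + \cdots + \lambda_{jm} F_m + \varepsilon_j,
\qquad j = 1, 2, \ldots, p, \quad m < p
```

Under this model each observed variable is a linear combination of the factors, and a loading that is large in magnitude indicates a variable that is strongly associated with the corresponding factor.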

Artificial Neural Network (ANN)
An ANN (artificial neural network) is composed of a collection of interconnected neurons that are often grouped in layers. A feed forward back propagation neural network (FFBP NN) does not have feedback connections, but errors are back propagated during training. Errors in the output determine measures of hidden layer output errors, which are used as a basis for the adjustment of connection weights between the input and hidden layers. Adjusting the two sets of weights between the pairs of layers and recalculating the outputs is an iterative process that is carried on until the errors fall below a tolerance level. Learning rate parameters scale the adjustments to the weights. A momentum parameter can be used to scale the adjustments from the previous iteration and add them to the adjustments in the current iteration. The layout of the feed forward back propagation neural network is furnished in figure 3.
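The roles of the learning rate and the momentum parameter can be illustrated with a single weight-update step. The sketch below is a generic gradient-descent-with-momentum rule, not code from the package used in this paper; the function name `momentum_update` and the example gradient are hypothetical, and the default values 0.05 and 0.7 are the learning rate and momentum reported later in this paper.

```python
import numpy as np

def momentum_update(weights, gradient, previous_delta, learning_rate=0.05, momentum=0.7):
    """One weight adjustment: scaled gradient step plus a fraction of the previous adjustment."""
    delta = -learning_rate * gradient + momentum * previous_delta
    return weights + delta, delta

# Hypothetical weights and error gradient, used only to show the arithmetic.
w = np.array([0.5, -0.3])
grad = np.array([0.2, -0.1])

w1, d1 = momentum_update(w, grad, np.zeros_like(w))  # first step: no previous adjustment
w2, d2 = momentum_update(w1, grad, d1)               # second step reuses 0.7 * d1
print(d1, d2)  # d1 = [-0.01, 0.005], d2 = [-0.017, 0.0085]
```

With the same gradient on both steps, the second adjustment is larger than the first because the momentum term carries over part of the previous adjustment.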

Data used in this Paper
A statistical survey has been conducted by a group of agricultural scientists on different mustard plants under the supervision of Prof. Dilip De, Bidhan Chandra Krishi Viswavidyalaya, West Bengal, India. The objective of the survey was to find the productivity of different mustard plants at maturity (after 95 days). The data has been collected in two stages. First, after plantation, readings have been taken of different parameters such as shoot length, number of leaves, number of roots and root length of the plant up to 28 days. The data has been taken at intervals of some days so that the changes in the parameters could be identified. Second, the shoot length and productivity (seed weight) at maturity (after 95 days) have been taken. Environmental data such as maximum and minimum temperature, rainfall, maximum and minimum humidity, soil moisture at different depths and sunshine have been collected during the year. In another paper [39], the authors have shown that the mustard plant must be planted from November to February. Except for the shoot length, the plant parameters cannot be measured as the plant is growing: leaves may appear and fall, and the roots go into the soil. So, shoot length has been used to predict the growth of the mustard plant. In this paper, the environmental data during initial growth (November to February), the initial shoot length at different time instances and the seed weight at maturity from this survey are furnished in table 1(a), 1(b) and 1(c).

Principal Component Analysis
Step 1: After plantation, the environmental parameters collected during the harvest period of the growing stage of the mustard plant are furnished in table 1(a). Using the Statistica 7 software package, the correlation matrix [40] of table 1(a) has been computed and is furnished in table 2.
Step 2: The eigenvalues, total variance, cumulative eigenvalues and percentage of contribution are furnished in table 3.
Step 3: When analyzing correlation matrices (table 2), the sum of the eigenvalues is equal to the number of (active) variables from which the factors were extracted (computed), and the "average expected" eigenvalue is equal to 1.0. Many criteria are used in practice for selecting the appropriate number of factors for interpretation; the simplest is to retain for interpretation as many factors as the number of eigenvalues that are greater than 1. In this example, only the first three eigenvalues are greater than 1, accounting for approximately 92% of the total variation. All eigenvalues are shown in figure 2. The eigenvalues in table 4 are arranged in decreasing order, indicating the importance of the respective factors in explaining the variation of the data. The factor corresponding to the largest eigenvalue (4.742433) accounts for approximately 52.7% of the total variance. The second factor, corresponding to the second eigenvalue (2.576318), accounts for approximately 28.6% of the total variance, and so on.
Step 4: Another method for determining the number of factors to interpret (retain) is to construct the so-called scree plot (Cattell, 1966). Specifically, the successive eigenvalues are shown in a simple line plot in Figure 1. Cattell suggests finding the place where the smooth decrease of eigenvalues appears to level off to the right of the plot. No more than the number of factors to the left of this point should be extracted.
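The eigenvalue-greater-than-1 rule of Step 3 and the percentage contributions can be checked with a short script. This is a sketch under stated assumptions: the function name `kaiser_retain` and the four-value spectrum are hypothetical, while the two eigenvalues 4.742433 and 2.576318 and the count of nine variables are taken from this paper.

```python
import numpy as np

def kaiser_retain(eigenvalues):
    """Count eigenvalues above 1 and return proportion and cumulative proportion of variance."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    proportion = eigenvalues / eigenvalues.sum()
    cumulative = np.cumsum(proportion)
    n_retained = int(np.sum(eigenvalues > 1.0))
    return n_retained, proportion, cumulative

# Hypothetical spectrum of a 4-variable correlation matrix (sums to 4).
n_retained, proportion, cumulative = kaiser_retain([2.2, 1.1, 0.5, 0.2])
print(n_retained, cumulative)  # 2 components retained, cumulative [0.55, 0.825, 0.95, 1.0]

# Reported eigenvalues: for a correlation matrix of p = 9 variables the eigenvalues sum to 9.
p = 9
reported = np.array([4.742433, 2.576318])
print(np.round(100 * reported / p, 1))  # [52.7 28.6]
```

The same cumulative proportions are what a scree plot displays graphically, so the two retention criteria can be compared on one set of eigenvalues.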

Figure 1. Number of Eigen Values
Step 5: The eigenvectors corresponding to table 1(a) are furnished in table 5. The number of components displayed corresponds to the retained eigenvalues, i.e., three components (1, 2 and 3).
Step 6: Since three eigenvalues have been found to be greater than 1.00, three components have been taken from table 5 and furnished in table 6.
Step 7: To find the significant variables from table 6, the following method has been applied. In principal component analysis, each component is a linear combination of all variables, so the variable on which a component mostly depends is the one with the value of largest magnitude in that component. The first component, corresponding to the first eigenvalue 4.742433, is most correlated with min humidity (high negative correlation). So, component 1 is dependent on min humidity. Any other dependency would be a variable whose value lies within 10% of that of min humidity (the highest value in component 1), i.e., beyond (-0.447332 + 0.0447332) or -0.4025988. From component 1 (table 6), it has been observed that the magnitudes of the other variables are less than 0.4025988 (negative correlation). So, no other variable plays a dominant role in component 1. If more than one variable is found to be significant, a correlation matrix is computed and, depending on the correlation, the significant variable is determined.
Similarly, in component 2, corresponding to the eigenvalue 2.576318, the dominating variable is max humidity; its value after a reduction of 10% is 0.4537017, and no other variable reaches this threshold.
Finally, in component 3, sunshine is the most dominant variable.
Step 8: It has been observed that component 1, component 2 and component 3 depend on three variables: min humidity, max humidity and sunshine. So, instead of considering all 9 variables, these three give approximately a 92% solution to this problem.
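The 10% rule of Step 7 can be expressed as a small helper. This is a sketch of the rule as described above, not the authors' code: the function name `dominant_variables` is hypothetical, and every value in the example is a placeholder except -0.447332, the value reported for min humidity in component 1.

```python
import numpy as np

def dominant_variables(values, names, band=0.10):
    """Return the variables whose magnitude lies within `band` of the largest magnitude."""
    magnitudes = np.abs(np.asarray(values, dtype=float))
    threshold = (1.0 - band) * magnitudes.max()
    return [name for name, m in zip(names, magnitudes) if m >= threshold]

# Placeholder component values; only the min humidity entry is taken from the paper.
names = ["max temp", "min temp", "rainfall", "max humidity", "min humidity", "sunshine"]
component_1 = [-0.38, -0.35, 0.21, 0.12, -0.447332, 0.30]

print(0.9 * 0.447332)                          # threshold magnitude, about 0.4025988
print(dominant_variables(component_1, names))  # ['min humidity']
```

When the helper returns more than one name, the tie would be resolved as described in Step 7, by inspecting the correlation between the candidate variables.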

Factor Analysis
Step 1: The same data furnished in table 1(a) has been used in factor analysis. Using the Statistica 7 software package, the factor loadings have been calculated and are furnished in table 7 and table 8. The eigenvalues greater than 1 have been taken, and the number of factors has been taken equal to the number of such eigenvalues.
Step 2: In factor analysis [41], each variable is a linear combination of all factors. For each variable (row), the factor with the greatest loading has been marked. Then, in each factor, the greatest of the marked values has been found and the corresponding variable has been taken. Using this method, the variable min humidity has been selected in factor 1 from table 8. For the other two factors, the highest loadings are -0.805376 and 0.969639, i.e., max humidity and sunshine. So, three variables, max humidity, min humidity and sunshine, have been identified as the significant variables.
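The two-pass selection of Step 2 can be sketched as follows. The function name `select_by_factor` is hypothetical and the loading matrix is a placeholder: only -0.805376 and 0.969639 are values reported in this paper, and their placement in factors 2 and 3 is an assumption made for the example. Loadings are compared by magnitude, which is also an assumption, since the text does not say how negative loadings are ranked.

```python
import numpy as np

def select_by_factor(loadings, names):
    """Mark each variable's strongest factor, then pick the strongest marked variable per factor."""
    magnitudes = np.abs(np.asarray(loadings, dtype=float))
    marked_factor = magnitudes.argmax(axis=1)  # pass 1: strongest factor of each variable
    selected = {}
    for f in range(magnitudes.shape[1]):
        rows = np.where(marked_factor == f)[0]
        if rows.size == 0:
            continue  # no variable is marked in this factor
        best = rows[magnitudes[rows, f].argmax()]  # pass 2: largest marked loading
        selected[f + 1] = names[best]
    return selected

names = ["max temp", "rainfall", "max humidity", "min humidity", "sunshine"]
loadings = [
    [-0.70, 0.40, 0.10],      # placeholder
    [0.60, -0.55, 0.20],      # placeholder
    [0.30, -0.805376, 0.05],  # reported loading, factor placement assumed
    [-0.90, 0.20, -0.10],     # placeholder
    [0.10, 0.05, 0.969639],   # reported loading, factor placement assumed
]
print(select_by_factor(loadings, names))
# {1: 'min humidity', 2: 'max humidity', 3: 'sunshine'}
```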

Artificial Neural Network (ANN)
In the artificial neural network system, a feed forward back propagation neural network with three layers is used: one input layer, one hidden layer containing 3 neurons and one output layer containing one neuron.
The artificial neural network parameters are initialized by the newff function, which creates a new neural network with initial values in the MATLAB 7 package. The momentum parameter is taken as 0.7, the learning rate as 0.05, the initial bias of the hidden layer as [0.2, 0.3, 0.5] and the initial bias of the output layer as [0.2].
After applying PCA and factor analysis, the significant environmental parameters and the related shoot length are furnished in table 9. In the neural network, the input parameters are max humidity, min humidity and sunshine, and the target is shoot length. After training and testing, the predicted shoot length is furnished in table 10.
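A network of this shape can be sketched in NumPy as shown below. This is not the MATLAB newff call used in this paper: the tanh hidden activation, the linear output, the weight initialization range, full-batch updates, the epoch count and the synthetic data are all assumptions made for the illustration. Only the 3-neuron hidden layer, the single output neuron, the initial biases, the learning rate 0.05 and the momentum 0.7 come from the text.

```python
import numpy as np

def train_ffbp(X, y, epochs=2000, learning_rate=0.05, momentum=0.7, seed=0):
    """Train a 3-input, 3-hidden, 1-output network by batch backpropagation with momentum."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, size=(X.shape[1], 3))  # input -> hidden weights (assumed init)
    b1 = np.array([0.2, 0.3, 0.5])                     # initial hidden biases from the text
    W2 = rng.uniform(-0.5, 0.5, size=(3, 1))           # hidden -> output weights (assumed init)
    b2 = np.array([0.2])                               # initial output bias from the text
    params = [W1, b1, W2, b2]
    deltas = [np.zeros_like(p) for p in params]
    mse_history = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)  # hidden layer output
        out = H @ W2 + b2         # linear output neuron
        err = out - y
        mse_history.append(float(np.mean(err ** 2)))
        # Back propagate the output error to obtain the hidden layer error.
        g_out = 2.0 * err / len(X)
        g_hid = (g_out @ W2.T) * (1.0 - H ** 2)
        grads = [X.T @ g_hid, g_hid.sum(axis=0), H.T @ g_out, g_out.sum(axis=0)]
        for i, (p, g) in enumerate(zip(params, grads)):
            deltas[i] = -learning_rate * g + momentum * deltas[i]
            p += deltas[i]  # in-place update keeps W1, b1, W2, b2 current

    def predict(X_new):
        return np.tanh(X_new @ W1 + b1) @ W2 + b2

    return predict, mse_history

# Synthetic stand-in for (max humidity, min humidity, sunshine) -> shoot length, scaled to [0, 1].
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(40, 3))
y = 0.5 * X[:, [0]] - 0.3 * X[:, [1]] + 0.8 * X[:, [2]]
X_train, y_train, X_test, y_test = X[:30], y[:30], X[30:], y[30:]

predict, mse_history = train_ffbp(X_train, y_train)
test_mse = float(np.mean((predict(X_test) - y_test) ** 2))
print(mse_history[0], mse_history[-1], test_mse)
```

Holding out the last ten rows mirrors the training and testing split mentioned above; with real measurements the inputs would first need to be scaled to a comparable range.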

Result
In this methodology, it has been found that, out of nine environmental parameters, three (max humidity, min humidity and sunshine) play a significant role in the growth of the mustard plant. If these three parameters are sufficiently available, the growth of the mustard plant will be healthy and it will produce a high yield. The shoot length can be predicted using the ANN, as furnished in table 7, and linear equations. The final shoot length after 95 days is 135.88 cm and the corresponding pod yield has been predicted as 2.679 gm (from table 1(c)).
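The step from early shoot lengths to the shoot length at maturity can be illustrated with a least squares line. This is a sketch only: a straight-line fit is assumed, the function name `shoot_length_at` is hypothetical, and the days and lengths below are made-up values chosen to lie exactly on a line, not the predictions reported in this paper.

```python
import numpy as np

def shoot_length_at(days, lengths, target_day=95):
    """Fit length = slope * day + intercept by least squares and evaluate at target_day."""
    slope, intercept = np.polyfit(days, lengths, deg=1)
    return slope * target_day + intercept

# Hypothetical early readings (day, shoot length in cm) generated from 1.2 * day + 3.0.
days = np.array([7, 14, 21, 28])
lengths = 1.2 * days + 3.0

estimate = shoot_length_at(days, lengths)
print(round(float(estimate), 2))  # 117.0 for these made-up readings
```

Extrapolating from 28 days to 95 days relies heavily on the assumed form of the growth curve, so a straight line is only the simplest choice.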

Conclusions and Future Work
Using principal component and factor analysis, the same result can be produced with fewer parameters, without considering all related parameters of a physical problem. The ANN has been used for training and testing to predict the productivity after finding the shoot length at maturity. It is a supervised learning method, in which the target is provided. In future, this result can be cross-examined using fuzzy logic and genetic algorithms.