Genotype Division for Shared Memory Parallel Genetic Algorithm Across Platforms and Systems

In this paper we present a concurrent implementation of a coevolutionary genetic algorithm (GA) designed for shared memory architectures such as multi-core processor platforms. Our algorithm divides the chromosome among the processes, and not the population, as is the case for most parallel implementations of the GA. This approach results in a division of the problem to be solved by the GA into sub-problems. We analyze the influence on performance and speedup of several parameters defining the algorithm, such as synchronous or asynchronous information exchange between processes and the frequency of communication between processes. We also examine how the separability of the problem influences the general algorithm performance. Finally, we compare different operating systems and platforms in the evaluation process. Our paper shows that this approach is a good way to take advantage of multi-core processors, improving not only the execution time but, in many cases, the fitness as well.


Introduction
The hardware developments of recent years have made multi-core architectures commonplace in the industry. The issue we are now facing is taking advantage of such platforms.
Parallel and distributed versions of genetic algorithms (GAs) are popular and diverse. The simplest parallel models are function-based, where the evaluation of the fitness function is distributed among the processes [14]. The most popular parallel models are population-based, where the population itself is distributed in niches [15], sometimes also called islands [9]. Such models require a periodic migration of individuals between the sub-populations. The shared-memory GAs are a subset of the parallel and distributed models. The parallelization techniques can be ported and adapted from one type of architecture to the other, with specific features that can be optimized in each case. A survey of these algorithms can be found in [1].
There are positive arguments in favor of models based on population division, such as a high degree of independence for each process. Among the drawbacks we can cite the fragmentation of the population into small pieces. This can generate issues such as the premature convergence of the population to a local optimum or a general loss of diversity. Larger populations have been reported by several studies to perform better, or even to be necessary to the success of parallel implementations [3].
In a different direction, coevolutionary algorithms have interested and fascinated researchers for a good number of years. Even though competitive coevolution is the more popular form, the cooperative form has been proven to give good results [2]. These approaches decompose the problem into parts evolving separately [12]. For the purpose of the fitness evaluation, these parts are assembled into a complete chromosome. [7] argues that it is not the separability of the problem that makes these approaches successful, but their increased exploratory power. Some theoretical studies of the conditions under which these algorithms can achieve the global optimum have been proposed [10].
The model that we propose in this paper bridges the gap between these two approaches. It is a variant of the cooperative coevolutionary approach designed for shared memory parallel architectures. While the usual cooperative approach is implemented for problems that are naturally divisible into subpopulations, our model generalizes this technique by making it applicable to any problem.
Our model is based on a division of the population at the genotype level into several agents or processes. It is not an algorithmically equivalent version of the genetic algorithm, or of a standard cooperative coevolutionary algorithm, but a hybrid model designed for parallel architectures. This model can potentially run faster and achieve better results than the standard GA. In our approach, each process receives a partial chromosome to evolve. All the genetic operations are restricted to this subset of genes. For evaluation purposes, every process keeps a template containing information about the best genes found by all of the other processes up to that point. A periodic exchange procedure keeps this information up to date.
Finally, when aiming to optimize genetic algorithms for massively parallel architectures, it becomes undesirable both to split the population into nests that are too small and to divide the chromosome too much. A hybrid approach can be a good compromise, and for this purpose both the population and chromosome division models need to be studied thoroughly. This paper contributes to the study of the less explored of the two approaches.
The paper is structured as follows. Section 2 presents the details of our parallel model for genetic algorithms. Section 3 introduces the three test problems that we used for our experiments. Section 4 shows the experimental results, and the paper ends with conclusions.

Chromosome Division Model
Our model for parallel genetic algorithms follows an idea similar to the one described in [16]. The difference is that the current model is implemented for shared memory architectures as opposed to a Beowulf cluster, and the experiments use a different set of problems. Preliminary results were also presented in [17,18], although the set of problems used here is almost entirely new.

Problem Division
According to the most popular approach to parallel genetic algorithms, the island model, the population is decomposed into several islands or niches, each of them evolving in parallel. In such a model, the evolution in each population is self-contained, and the only thing that makes it a unified process is an occasional migration of individuals between the islands.
Our motivation comes from the fact that smaller populations can more easily lead to suboptimal solutions and premature convergence. The chromosome division allows us to maintain a larger population for each process using the same amount of memory.
The idea behind this parallel model is that the problem to be solved is divided into several tasks. Each task is then assigned to a different process that will focus on it while exchanging information with the other processes. This is similar to a multi-agent approach that has been proven efficient for many applications before, in a variety of contexts.
Thus, in the approach proposed in this paper, the division happens along the genotype. The genes composing each chromosome are divided among the processes, such that the task of each process consists of evolving a pre-determined part of the chromosome. Each process performs the standard genetic operations on its subset of genes. When the fitness evaluation is needed, the subset of genes is inserted into a global template that allows it to be seen as a complete chromosome, as shown in Figure 1. The top chromosome in this figure represents the template, which is a chromosome with some missing genes. The pieces marked as "Population for P1" represent partial chromosomes that complete the genes missing in the template. They are assembled together into a complete chromosome that can be evaluated with the usual fitness function. This template is periodically exchanged between the processes.
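The assembly step described above can be sketched as follows; the function and container names are illustrative, not taken from the paper's implementation.

```cpp
#include <cassert>
#include <vector>

// Sketch of the evaluation assembly: a process's partial chromosome is
// plugged into a copy of the shared template so that the complete
// individual can be passed to the usual fitness function.
// `first` is the index of the first gene owned by this process.
std::vector<int> assemble(const std::vector<int>& templ,
                          const std::vector<int>& partial, int first) {
    std::vector<int> full = templ;       // template holds the other processes' genes
    for (std::size_t i = 0; i < partial.size(); ++i)
        full[first + i] = partial[i];    // overwrite the genes this process owns
    return full;
}
```

The returned vector can then be handed to any fitness function that expects a complete chromosome.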
Our intention was to develop a model that can be applied regardless of the separability of the problem, but the test problems we chose are also designed to show how this aspect influences the fitness performance and the speedup.
All the genetic operators are performed according to their standard definition, with the exception that they are restricted to the subset of the genes assigned to each process.
Let n be the size of the chromosome, with the indexes for the genes going from 0 to n-1. Let us suppose that we have p processes. Each process will receive a part of the chromosome of size np = n/p. Then the process or agent with identity number id, 0 ≤ id ≤ p-1, will be in charge of the genes in the interval [id*np, (id+1)*np - 1]. Practically, the chromosomes are stored for each process in two local collections representing the previous generation and the new one to be constructed from it. These collections are swapped after the completion of each new generation, in a manner similar to the double buffer model in OpenGL. Thus, the most efficient way to handle the template is to copy it over all the chromosomes in both of these collections, leaving out the genes of indexes assigned to the current process. This way, even though the genetic operators are localized, the fitness evaluation can proceed directly on the resulting complete chromosome. The memory use in this procedure can be improved by actually storing only the pieces of chromosomes assigned to each process, but that is not necessary unless we deal with a very large population and/or very large chromosomes.
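The interval arithmetic above can be stated directly; the sketch below assumes, as in the paper's settings (n = 360, p in {1, 2, 4, 8}), that p divides n evenly.

```cpp
#include <cassert>

// Gene range owned by process `id`: [id*np, (id+1)*np - 1], with np = n/p.
// Assumes p divides n evenly, as with n = 360 and p in {1, 2, 4, 8}.
struct GeneRange { int first, last; };

GeneRange owned_range(int id, int n, int p) {
    int np = n / p;                      // genes per process
    return { id * np, (id + 1) * np - 1 };
}
```

For example, with n = 360 and p = 4, process 0 owns genes 0 to 89 and process 3 owns genes 270 to 359.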
The mutation operator is not affected by the chromosome division, and neither is the fitness-proportionate selection. The crossover is what makes our approach algorithmically different from the sequential one. With a one-point crossover as the base operator, each process chooses a crossover site within the range assigned to it. Thus, with 2 processes our overall operator is partially equivalent to the 2-point crossover. The second and more important difference from the sequential algorithm, though, is being able to evaluate the quality of parts of the chromosome separately, which gives better chances of achieving a good solution overall.
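The restricted crossover described above can be sketched as follows; the names are illustrative, and a full implementation would use a better random generator than rand.

```cpp
#include <cassert>
#include <cstdlib>
#include <utility>
#include <vector>

// One-point crossover restricted to the gene range [first, last] owned by
// one process. Genes outside the range (the template genes) are untouched.
void crossover_in_range(std::vector<int>& a, std::vector<int>& b,
                        int first, int last) {
    int site = first + std::rand() % (last - first + 1); // site inside the owned range
    for (int i = site; i <= last; ++i)
        std::swap(a[i], b[i]);                           // swap tails within the range
}
```

With 2 processes the two per-process sites together act on the full chromosome like a (partial) 2-point crossover, as noted above.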
We have used a 1-point crossover with probability 0.8 and a classic mutation with probability 0.01. The reproduction form is elitist for each process, preserving a single best chromosome from one generation to the next. Thus, even though the new template replaces the old one after each exchange phase, the new genes are either identical to the old ones, or proven to perform better.

Fitness Evaluation
There are a good number of benchmark fitness functions for genetic algorithms that are separable, meaning that they can be divided into sub-problems such that the evaluation of each of them can be accomplished independently. Our model is not restricted to these types of problems specifically, but is rather designed in a general way so that it can be applied to any fitness function. However, the evaluation procedure can be optimized for separable functions, and a greater performance can be achieved in terms of execution time and use of each CPU core. One of our test problems will showcase this situation.
We start with the assumption that to evaluate the fitness function for any combination of genes, we need a full set spanning from 0 to n-1. Thus, to evaluate a partial chromosome, we need to complete it with the template. The evaluation consists of plugging the partial chromosome into the common template, and then passing this complete individual to the fitness function.
An exchange procedure ensures that the template is kept reasonably up to date with respect to the latest best performing genes obtained by each process. During the exchange phase, each process copies the genes of the best chromosome found so far in terms of fitness to a global "best chromosome" shared by all the processes. After all of the processes have finished this update, each of them updates its own template from the global best chromosome. This procedure takes place periodically and can involve a synchronization of the processes in terms of the number of generations produced in between the exchanges. Since only one chromosome is exchanged in the process, our model is coarse-grained. The next section discusses this aspect in more detail.

Synchronous vs Asynchronous Exchange
The only communication between the processes happens during the exchange procedure. From a synchronization point of view, we propose to compare two approaches. In the first one, each exchange phase happens after the exact same number of generations for each process, and this is accomplished through a barrier call. In this model, called synchronous, all the processes evolve approximately at the same pace, and during the exchange they must wait for all of them to reach the entry point in order to proceed. The genes in each chromosome represent a fairly homogeneous evolutionary step during the fitness evaluation.
In the second approach, called asynchronous, each process can update the global best chromosome periodically without having to wait for any of the others. Thus, genes evolved in substantially different generations are combined for the evaluation of the fitness.
The synchronous exchange procedure is shown below in C++-based pseudocode. In this algorithm we assume that the indexes in the partial chromosome are kept consistent with the position of the genes in the complete chromosome. To make the procedure easier to understand, the id of the process is used as an index for the best partial chromosome and for the template. Practically, our implementation is object oriented, the exchange function is a class method, and these variables marked with the id are class attributes. The global variables shared by all the processes are shown with capitalized names.
void Exchange_synch(int id) {
    np = Chromosome_Size / Number_Of_Proc;
    Barrier(Number_Of_Proc);
    for (i = id*np; i < (id+1)*np; i++)
        Best_Chromosome[i] = best[id][i];     // publish the best genes of this process
    Barrier(Number_Of_Proc);                  // wait for all updates to complete
    for (i = 0; i < id*np; i++)
        template[id][i] = Best_Chromosome[i]; // refresh the template before the owned range
    for (i = (id+1)*np; i < Chromosome_Size; i++)
        template[id][i] = Best_Chromosome[i]; // and after it
}

We can see below the asynchronous version of the exchange procedure, where each process updates the best chromosome and its own template periodically without having to wait for the others. All the name conventions are the same as above.
void Exchange_asynch(int id) {
    np = Chromosome_Size / Number_Of_Proc;
    Lock(Best_Mutex);                         // exclusive access to the shared chromosome
    for (i = id*np; i < (id+1)*np; i++)
        Best_Chromosome[i] = best[id][i];
    for (i = 0; i < id*np; i++)
        template[id][i] = Best_Chromosome[i];
    for (i = (id+1)*np; i < Chromosome_Size; i++)
        template[id][i] = Best_Chromosome[i];
    Unlock(Best_Mutex);
}

The population is initialized randomly for each process, as is usually the case. The template is initialized by calling the exchange function before the evolution process starts.
The exchange takes place every few generations, every 10 for most of our experiments. Another question we will attempt to answer here is how much this exchange period influences both the execution time and the performance in terms of best fitness achieved.
Test Problems

We have chosen three problems to test our parallel model with, two of them being of the benchmark type, and one a real-world problem. The specifics of each of them should allow us to showcase different features of our program. The benchmark problems consist of fitness functions of linear complexity over the number of genes that are also uniformly fast to compute.
The real-world problem is computationally more expensive and non-uniform over the set of chromosomes. For this last function, a global optimum is not known.
The two benchmark problems are chosen with a fitness that is linear over the chromosome length to show the speedup potential of our model for the most common category of problems. These functions are very similar to other problems used for benchmarking in various studies. The real-world problem is a difficult one chosen to showcase the potential of our model in terms of quality of solutions.
A second aspect that differentiates these problems is the reciprocal influence of genes at different locations in the chromosome on the computation of the fitness, or separability. For the real-world problem, such influences are present, and a good performance cannot be achieved in the absence of proper process coordination. For the first benchmark problem, there is an even higher degree of reciprocal influence of the genes from one process to another than for the real-world problem. For the second benchmark problem, the fitness influence is localized to the genes assigned to each process, meaning that this function is highly separable, and the algorithm is optimized to take advantage of it. Thus, we hope to show how our model behaves in each of these three situations.
For reasonable comparison grounds for the three problems we have used the same experimental settings as much as possible: population size (50), number of generations (1000), and chromosome size (360). The number of genes is determined by the number of parameters defining the real-world problem, and we were able to configure the two benchmark problems with the same value.

Benchmark Problems
The first problem is known in the literature as the Rosenbrock function [8,13] or the DeJong function [6], and as a difficult optimization problem. It consists of minimizing the following function:

f(x, y) = 100(y - x^2)^2 + (1 - x)^2 (1)

The minimum of 0 is achieved for x = y = 1. The difficulty of this function is that local minima with x = y are relatively easy to achieve, but both variables need to move towards the value 1 at the same time to find the global minimum. For this problem, the first half of the genes represents the value of x and the second half the value of y. Thus, in the parallel model, the processes will be highly dependent on each other to achieve a good performance, which is the reason for choosing this function. For our experiments we have used 360 binary genes to be consistent with the two other problems.
The second function is of a category that has been used to test deceptive aspects of the fitness landscape [5]. This function maps each group of 3 binary genes in the sequence to a value based on Table 1 and then adds them up over the entire chromosome. This problem presents a difficulty to hill-climbing methods because the sequence of highest fitness, 111, is isolated from the suboptimal solution, which is 000. Thus, the sequences that are close to the optimal solution are of low fitness, while those close to the suboptimal one have a higher fitness, to mislead the algorithm. With a chromosome size of 360, the optimal solution has a fitness of 3600, while the suboptimal one has a fitness of 3360. Since the value of each set of 3 genes is computed separately from any other set of 3 genes, this fitness is entirely separable. We use this problem to show the speedup potential for highly separable functions.
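The structure of this fitness can be sketched as follows. Only the two end points are fixed by the text above: 111 maps to 30 and 000 to 28, since the 360/3 = 120 groups must total 3600 and 3360 respectively. The intermediate values in the table below are illustrative assumptions, not the paper's Table 1; they are chosen only to preserve the deceptive shape, with fitness dropping as a group gets closer to the isolated optimum 111.

```cpp
#include <cassert>
#include <vector>

// Deceptive fitness sketch: each group of 3 binary genes maps to a value
// and the values are summed over the chromosome. 111 -> 30 and 000 -> 28
// follow from the totals given in the text; the other six entries are
// assumed for illustration (low fitness near 111, high near 000).
int deceptive_fitness(const std::vector<int>& genes) {
    static const int value[8] = { 28, 26, 24, 4, 22, 6, 2, 30 }; // index = b2 b1 b0
    int sum = 0;
    for (std::size_t g = 0; g + 2 < genes.size(); g += 3)
        sum += value[genes[g] * 4 + genes[g + 1] * 2 + genes[g + 2]];
    return sum;
}
```

Because each group is scored independently, a process owning a multiple of 3 genes can evaluate its own groups without the template, which is what makes this function entirely separable.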

Real-Life Problem
The third problem we are using consists of optimizing the parameters defining a pilot for a simulated motorcycle. For this problem, the evaluation requires significantly more computations, and thus it will allow us to observe the improvement in performance in that respect. Contrary to the linear functions, the complexity of evaluating a chromosome is not uniform, but can vary significantly from one individual to the next. This constitutes an additional challenge for the parallel model. This function is partially separable.
The physical model of the motorcycle has been more extensively described in [19] and is close to [4]. The motorcycle is modeled as a system composed of several elements with various degrees of freedom, consisting of position and orientation on the road, speed, rotation of the handlebars, and leaning.
The driver's input into the system is defined by the tuple u = (τ, β_f, β_r, φ, α), where τ is the acceleration in the direction of movement provided by the throttle/gear control, β_f and β_r are forces applied on the front and rear brakes respectively, φ is the leaning angle, and α is the handlebar turning angle. This driver can be either a human player or an autonomous agent controlling the vehicle.
The movement is defined by Newtonian mechanics, where the acceleration is determined by gravity, friction, drag, and the throttle. The brakes are factored into the friction force.
The autonomous pilot uses perceptual information to make decisions about driving the vehicle. This information consists of the visible front distance, the lateral distance to the border of the road from the current position of the vehicle and from a short distance ahead of the vehicle, and the slope of the road.
The motorcycle is driven by several control units (CUs), each of them controlled by an independent agent. The current CUs are the gas (throttle), the brakes, and the handlebar/leaning. Each of these CUs is independently adjusted by an agent whose behavior is intended to drive the motorcycle safely in the middle of the road at a speed close to a given limit. The agents behave based on a set of equations relating the road conditions to action. The full set of equations is described in [19]. The equations comprise a fair number of coefficients and thresholds, and these are the values that are evolved by the genetic algorithms.
To apply the GA to this problem, we chose a representation where each configurable coefficient is assigned 10 binary genes, and the chromosome results from concatenating all of the coefficients. As we have 36 coefficients, the chromosome has a length of 360.
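The decoding step can be sketched as follows. Coefficient number `which` (0 to 35) occupies genes [which*10, which*10 + 9]; the mapping of the 10-bit integer to a real range [lo, hi] is an assumption for illustration, since the paper does not specify the scaling.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Decode coefficient `which` from its 10 binary genes. The linear scaling
// into [lo, hi] is an assumed mapping, not taken from the paper.
double decode_coeff(const std::vector<int>& genes, int which,
                    double lo, double hi) {
    int v = 0;
    for (int b = 0; b < 10; ++b)
        v = v * 2 + genes[which * 10 + b];  // 10-bit big-endian integer
    return lo + (hi - lo) * v / 1023.0;     // 1023 = 2^10 - 1
}
```

Repeating this for all 36 coefficients configures one candidate pilot from a 360-gene chromosome.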
A chromosome is evaluated by running the motorcycle in a non-graphical environment once with the pilot configured based on values obtained by decoding the chromosome, over a test circuit presenting various turning and slope challenges. Each run can end either by completing the circuit, or by a failure condition. A failed circuit can be caused by one of the following three situations: a crash due to a high leaning angle, an exit from the road with no immediate recovery, or crossing the starting line without having reached all the marks, as when the vehicle takes a turn of 180 degrees and continues backward.
To compute the fitness we marked 50 reference points on the road and counted how many of them were reached by the motorcycle with a given degree of approximation. The fitness is computed as follows:

F(x) = d_m / d_t + 1 / (1 + t_m) (2)

where d_m is the number of points reached by the motorcycle, d_t is the total number of points, and t_m is the total time taken until either the circuit was completed, or until a failure condition was detected.
Thus the fitness reflects both what percentage of the circuit the motorcycle has completed, and how fast it was capable of finishing the track. In general, a fitness value higher than 1 indicates completion of the circuit.
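Equation (2) can be written down directly: a completed circuit (d_m = d_t) pushes the first term to 1, and the second term rewards a shorter driving time.

```cpp
#include <cassert>
#include <cmath>

// Fitness of equation (2): F = d_m / d_t + 1 / (1 + t_m).
double pilot_fitness(int d_m, int d_t, double t_m) {
    return static_cast<double>(d_m) / d_t + 1.0 / (1.0 + t_m);
}
```

For example, a pilot that reaches half the points before failing scores just above 0.5, while any finished circuit scores above 1.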
This problem is partially separable in a subtle way. For the motorcycle pilot to be able to function at all, it needs reasonable values for the parameters defining all of its agents. Thus, the pilot could be using the gas pedal perfectly well, but if the steering is ineffective, it will still fail. Once we have reasonable values for most of the agents of the pilot, each of them can be improved separately. Thus, this is an intermediate problem for the study of this aspect.

Experimental Results
In this section we present some of the experimental results with our model, testing both the execution time/speedup and the fitness performance. Table 2 introduces the platforms that we have used for our computations. In the operating system column, XP stands for Microsoft Windows XP, W7 stands for Microsoft Windows 7, OsX stands for Mac OS X 10.5.6, and Ub stands for Ubuntu 8.04. The program was implemented in standard C++ using the pthread library and its local version for each platform. The code that is being run is the same on all of the platforms for each reported experiment.
For the experiments concerning the speedup, there is no point in comparing approaches that don't perform a comparable number of computations. Since the speedup is our foremost interest here, we have chosen to run all the experiments with the same number of generations. A second factor needs explanation before we proceed: the population size. In our approach we can level the parallel model with the sequential one on only one of two aspects: the number of evaluations performed on the whole, or the number of genes generated on the whole. We have chosen to generate the same number of genes as in the sequential model for all of our experiments, which means running the experiments with the same population size. This is consistent with the fact that one of the goals of the chromosome division is to be able to run the evolution with a larger population for each island.

Synchronous versus Asynchronous Exchange
The first set of experiments compares the synchronous versus asynchronous exchange models on different platforms for a variety of numbers of processes. Table 3 shows the average execution time in seconds for the Rosenbrock function. Table 4 shows the average execution time for the deceptive problem. The chromosome length is 360, the population is of size 50, and we have run 1000 generations in all the cases. The results are averaged over 100 runs. The column labelled "Com" identifies the exchange function as synchronous (S) or asynchronous (A).

To complement these timing results, Tables 5 and 6 show the speedup in all of these cases, computed as the execution time on a single process divided by the execution time of each multi-threaded run. A speedup of more than 100% represents a faster execution time in parallel than sequentially. Note that the speedup for 8 processes is not expected to be improved on any of the platforms, since the maximum number of cores available on any of the machines is 4. This table shows that on most multi-core architectures, the execution time for a number of processes less than or equal to the number of cores presents a speedup.

Table 7 shows the timing in seconds for the motorcycle driving problem. The settings in terms of chromosome length, population size, and number of generations are exactly the same as for the linear functions, except that we only ran the GAs 10 times for this problem in each case, due to the length of time required. Even though 10 runs may not seem like a large enough number, each of these 10 runs represents between 2 and 7 days of uninterrupted and exclusive computation time on each platform, and thus for the measuring of the speedup we consider them sufficient.
Since the evaluation itself can take a variable amount of time depending on how long the pilot lasts on the road before a failure condition occurs, we thought that a normalized measure for the time was necessary. For this purpose, we also recorded the total number of times that the function move was called for the motorcycle simulation during the evaluation. We can consider these calls to be basic operations because they require a uniform amount of time. Since the function move is called repeatedly until either a crash condition occurs, or until the vehicle finishes the track, it is the number of such calls that introduces such variety in the evaluation time. Thus, this measure tells us the number of operations executed in every case. Table 8 shows the number of such calls divided by 10^4 as an average over 10 runs. For measuring the fitness obtained by each model, the platform is not important since the same code is run each time, and we can thus average the results over 50 runs for each parameter setting.

Based on Table 8, we can now compute a normalized speedup by first dividing the execution time by the number of moves. The results of this operation are shown in Table 9. Then Table 10 shows the speedup for this problem, obtained by dividing this new timing measure for the sequential case by its value for the parallel case. From Table 10 we can see that the speedup achievement is lower for this problem than for the benchmark problems. This is due to the non-uniformity of the fitness calculation: the fastest process may need to wait for a long time for the slowest one to finish its task. Using an asynchronous model only improves the speedup for 8 processes on most platforms, which is contrary to the behavior observed on the benchmark problems. This can be explained by the improvement of the fitness achieved, since the better the pilot is, the more time it will spend driving on the track without crashing.
Even for this difficult problem, the parallel model makes good use of the multiple cores. Finally, we need to observe the average fitness achieved after 1000 generations for all the problems to see if the parallel model can perform as well as the sequential one in terms of quality of solutions, or even better. Table 11 shows these results for all three functions, as an average of all the experiments performed on the various platforms presented in the timing tables. For the Rosenbrock function smaller values are better, while for the two others larger values are the goal.
We can see that the parallel model outperforms the sequential one in terms of fitness for two of the problems and achieves the same performance for the Rosenbrock function in asynchronous mode with 2 processes.
For the most separable problem, the deceptive one, a higher problem division leads consistently to higher performance. For the partially separable problem, the motorcycle pilot configuration, a division into 2 processes is better than the sequential model in both cases. A higher division of the genotype doesn't always improve the performance further for this problem, the best performance being achieved with 2 processes and the synchronous model. For the asynchronous model, the best performance is achieved with 4 processes. This is consistent with the fact that the pilot is composed of 4 agents, such that with 4 processes, each of them is in charge of one agent. For the non-separable problem, a division into 2 processes allows us to find the optimal solution as well as the sequential algorithm does, but a higher division is not recommended.
Overall these results suggest that the asynchronous model is preferable to the synchronous one. The speedup is better in most cases, which is due to the minimization of the waiting time.

Table 10. Speedup for the motorcycle project, computed as the sequential normalized time divided by the parallel time in Table 9.

There is an intuitive trade-off between the fast processes being at a disadvantage because the template represents an earlier version than their own evolution, and the late processes benefitting from the faster processes in later generations. Our study indicates that in the asynchronous model this trade-off is overall favorable to the quality of the solution.

Significance Testing. We have performed a set of T-tests on the fitness obtained by the various approaches to see if the parallel models performed significantly better than the sequential ones.
The first sequence of tests consisted in comparing the experiments based on the number of processes, each value of this parameter against all the others, in the synchronous mode and in the asynchronous mode separately. For the two benchmark problems, the difference was almost uniformly significant with a confidence of over 95%, with the following exceptions: for the deceptive problem in synchronous mode, 2 versus 4 processes and 4 versus 8 processes; and for the Rosenbrock function, 4 versus 8 processes in both synchronous and asynchronous modes. For the motorcycle problem most of the differences were not significant, with the exception of the synchronous model for 1 process versus 8, 2 versus 4, and 2 versus 8.
Another set of T-tests was designed to determine if the synchronous results were significantly different from the asynchronous ones for each parameter setting. For the deceptive problem, the difference was significant for 4 and 8 processes. For the Rosenbrock problem the difference was significant for 2 processes only. For the motorcycle problem the difference was not significant.

Influence of the Synchronization Period
The second set of experiments focuses on the number of generations between the synchronization and exchange phases and on how it influences the overall performance. For this purpose we chose the Rosenbrock function because it presents the highest degree of dependence of the genes on each other for the fitness.
We have run these experiments with a synchronization period taking several values from 1 to 1000. For this set of experiments we have used the Mac OS X platform with the Core 2 Duo processor. Figure 2 shows the speedup obtained under various settings of the parameter, where the legend indicates the number of processes and the synchronization type (S for synchronous, A for asynchronous). The speedup improves by a substantial amount when going from a period of 1 to one of 5, and from 5 to 10, but after that the improvement slows down. It is also interesting to note that for a high amount of synchronization, the synchronous model is faster than the asynchronous model for every value of the number of processes, but with less synchronization, the asynchronous model eventually becomes faster.

Since the optimal solution has been found by the GA in many cases for this problem, as a measure of fitness we have used the percentage of runs in each case where the optimal solution was found. Figure 3 shows this measure plotted as a function of the synchronization period, where we separated the plot by number of processes for a better visual understanding. This figure indicates that with 2 processes and a communication period of 10 or 20, the asynchronous model finds the optimal solution in almost all of the 100 runs and can match the performance of the sequential model. For most of these runs, the asynchronous model performs better than the synchronous model. We can also note that for a higher number of processes, splitting each of the variables x and y themselves among the processes is not beneficial. This suggests that in general, the number of processes should not exceed the number of variables in the fitness function.
Overall, a synchronization period of 10 or 20 seems to be the best choice. This is consistent with the results presented in [3,11].
The fitness itself is only half of the story. A remarkable aspect of these experiments is that the number of generations necessary to achieve a given performance changes substantially from one setting to another. Figure 4 shows this measure as a function of the synchronization period. From this figure we can see that even though the algorithm does not find the optimal solution as frequently under the parallel models, the convergence is much faster in general. This suggests that for harder problems for which an optimal solution is not known, such as the motorcycle configuration problem, the parallel model is likely to achieve a reasonable fitness value faster. We have also performed T-tests comparing the fitness achieved under each parameter setting and a given value of the exchange period with the fitness achieved under the same parameter setting and the next value of the exchange period. Table 12 summarizes these results.
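The paper does not detail the test procedure; as an illustration, a Welch (unequal-variance) two-sample t statistic of the kind typically used for such pairwise comparisons can be computed as below. The helper name `welch_t` and the choice of the Welch variant are our assumptions, and the sample values are purely illustrative, not data from the paper.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's two-sample t statistic and its approximate degrees of
    freedom (Welch-Satterthwaite), suitable for samples with possibly
    unequal variances, e.g. final fitness values from two groups of runs."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb              # squared standard error of the difference
    t = (ma - mb) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) +
                     (vb / nb) ** 2 / (nb - 1))
    return t, df

# Purely illustrative fitness samples for two exchange periods:
period_10 = [0.91, 0.88, 0.95, 0.90, 0.93]
period_20 = [0.80, 0.78, 0.85, 0.82, 0.79]
t, df = welch_t(period_10, period_20)
```

The resulting `t` would then be compared against the Student-t distribution with `df` degrees of freedom to decide whether the two exchange periods differ significantly.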

Conclusions
In this paper we presented a shared memory parallel model of genetic algorithms designed to take advantage of multiple CPU cores in common current architectures. We have tested our model with three sets of problems of various difficulties on four different platforms with several types of processors and operating systems. Each problem presents different separability properties.
The experimental results presented in Section 4 explore the performance of the parallel model on several levels. First, concerning the speedup, for the benchmark functions there is a clear improvement on the platforms with multiple CPUs. A more modest but still noticeable speedup can be observed for the more difficult problem of configuring the autonomous pilot. The best speedup is around 185% on 2 CPU cores and around 223% on 4 CPU cores.
In terms of average fitness achieved in 1000 generations, for all test problems we can observe that the parallel model outperforms the sequential model for a number of processes less than or equal to 4, which is also the maximum number of available cores on our test platforms. For the Rosenbrock function, the sequential model was able to find the optimal solution 100% of the time in 1000 generations, and this is also the case for the asynchronous model with 2 processes. For the deceptive problems, we see about a 12% fitness improvement in the best case. For this problem, the parallel model was able to focus the search on smaller parts of the chromosome and thus improve the performance. For the motorcycle problem, we see an improvement of 3.5% in the best case.
A comparison of the synchronous and asynchronous schemes shows an improvement in speedup for the asynchronous model without loss in performance. Another set of experiments has shown that the synchronization and communication period of 10 generations that we have chosen is a near-optimal balance between speedup and fitness performance for both the synchronous and asynchronous models.
On the subject of problem separability, several conclusions can be drawn from our experiments. The speedup improvement is better for separable problems, which is to be expected. The fitness improvement is also more impressive for the more separable problems, and a finer division of the chromosome is more beneficial to problems that are more separable. With a division into 2 processes, though, an improvement can be observed for all the problems, even the non-separable ones. This means that even though our algorithm is more efficient for separable problems, it can still present some advantages for non-separable ones.
In conclusion, our model presents a valid approach to taking advantage of the multi-core computing technologies that are now widely available, even for non-separable problems, and strict synchronization between the processes is not a benefit in terms of performance.