Fractal Properties of Linux Kernel Maps

Many different measures have been proposed to describe software complexity: the number of lines of code, sometimes referred to as source lines of code (SLOC), Halstead's volume V, and McCabe's cyclomatic number V(G), among others. However, none of them takes into account the possible fractal properties of software source code that emerge from the development process. The main aim of this paper is to show that fractal self-organization of the system can be seen in successive Linux OS kernels. This is done through: (i) an analysis of the rate of growth of the number of files and source lines of Linux kernel code, (ii) a presentation of visualizations that indicate a self-similar graphical structure of OS kernels, and (iii) calculations of the fractal dimension $D_b$ based on the box dimension method. The obtained results suggest that: (i) in the simplest approach, the rate of growth of lines and files can be approximated by a polynomial of degree 2, with R = 0.96 and R = 0.94 respectively, (ii) the system becomes more and more complex, with a self-similar structure, and (iii) its fractal dimension is still growing. The presented analysis opens new possibilities for describing computer programs in terms of the complex systems approach.


Introduction
The branch of software metrics that focuses on direct measurement of software attributes is called software complexity [1]. It can be used to provide continuous feedback during a software project, to help control the development process, and to predict critical information about the reliability and maintainability of software systems. There are many different software complexity measures: the number of lines of code, sometimes referred to as source lines of code (SLOC), Halstead's volume V, McCabe's cyclomatic number V(G), etc. [2]. The most popular one is SLOC. This measure gives the size of a software program by counting the number of lines in the text of the program's source code. It helps to predict the amount of effort that would be required to redevelop the program. The line count is usually large and can even reach $10^7$ lines (see the examples in Table 1). However, this measure has many disadvantages, for example the lack of accountability, the lack of counting standards, and problems with multiple languages. Bill Gates said [3]: "Measuring programming progress by lines of code is like measuring aircraft building progress by weight". Despite all that, this quantity is quite often given, because it helps one imagine how difficult the development was and how "big" the software is. Table 1, based on data taken from [4,5,6], presents the size of different operating systems in millions of SLOC. As can be seen, the most popular operating systems consist of roughly 50 million or more lines of code. This information gives a clue as to how complicated (or perhaps complex) the structure of such a system can be, owing to the possible dependencies between particular parts of the analyzed system (see footnote 1). However, it says nothing about the nature of these dependencies or the patterns that can emerge.
It is quite hard to imagine how complicated (complex) the structure of an operating system can be, for many reasons. Company secrets that conceal the details of a given operating system's functionality are one of them: there is no access to the source code or to detailed documentation that would be helpful in such investigations. Moreover, because present-day operating systems consist of millions of lines of source code, it is almost impossible to analyze them without "creative" methods of analysis.
In some cases, however, the source code of an operating system is accessible, and the most interesting example here is Linux. This system was introduced by Linus Torvalds in 1991. At the beginning his work was treated rather as a kind of "software toy", but it is now estimated to be used by about 30 million people. This development phenomenon is very interesting in itself, for many reasons. One of them is presented in this paper, which focuses on the self-organized fractal properties of visualizations of Linux kernel maps. The paper consists of 5 sections. After the introduction, Section 2 presents the wide context of systems self-organization, a basic concept for complex systems, and its relation to computer science. Section 3 presents an analysis of Linux kernel development based on the lines of code and the number of files in successive versions. An analysis of the fractal dimension of the generated self-similar maps is presented in Section 4. Section 5 closes the paper with conclusions.
Footnote 1: The difference between complex and complicated systems is quite subtle. A complicated system has many interdependent elements, but the dependencies between them are governed by well-known deterministic laws (i.e., such systems are rather simple systems), while in complex systems the dependencies between components are governed by laws that are not necessarily well known. See the details in [7].

Systems Self-organization
Self-organization is one of the most remarkable properties of many complex systems. It can be considered a process in which the internal organization of a system (usually an open system) increases in complexity without being managed by an outside source. Systems that self-organize typically display many emergent properties. The term self-organization was used for the first time by I. Kant in his "Critique of Judgment"; however, it was introduced to contemporary science in 1947 by the psychiatrist and engineer W. R. Ashby [8]. It was then taken up by the cyberneticians (H. von Foerster, G. Pask, S. Beer and N. Wiener) in [9], but it did not become common in the scientific literature (except in the field of complex systems theory) before the 1970s, and especially after 1977, when I. Prigogine (a Nobel Prize laureate) presented the thermodynamic concept of self-organization.
In mathematics and computer science this phenomenon is usually connected with cellular automata, graphs (especially in complex networks, such as "small worlds" and scale-free networks), and some instances of evolutionary computation and artificial life. In the field of multi-agent systems, engineering systems that exhibit self-organized behavior is a very active research area. This paper shows that there are also other fields where self-organization can appear. The cooperation of many programmers (sometimes totally independent of one another) in the development of software is a good example. A question arises: how can self-organization be uncovered? And when can we say, with full responsibility, "this system has self-organizing properties"? The answers are on the one hand simple, but on the other hand not, because self-organized systems display many emergent properties. On the one hand we can see their prevalence in the surrounding environment, while on the other we cannot give one pattern or example that fits all cases exactly. Fractals are one very interesting example of self-organized systems, because many systems self-organize into self-similar structures (see, for example, [10]). Among the most commonly known examples of such structures are cauliflowers and, more spectacularly, romanesco.
As is well known, fractals are mathematical sets that cannot be seen directly in Nature in the form in which they are constructed (by a recursive definition that allows an infinite number of magnifications, each of which looks exactly the same as the whole fractal). However, one of the main features of fractals is self-similarity, which is among the most frequent properties of shapes, systems and things, whether natural or, sometimes, human-made. Recalling the famous words of B. Mandelbrot, who wrote in his book [11]: "Clouds are not spheres, mountains are not cones, coastlines are not circles, and bark is not smooth, nor does lightning travel in a straight line", we can imagine that self-similarity is an inherent feature of many complex systems.
But are computer systems complex systems with self-organized patterns? If we consider computer systems only as implementations of Turing machines, we may assume that they are at least complicated systems [12]. However, each such implementation is a system with a physical nature, not only in the sense that it needs energy for normal work or is built from physical components, but also in that it is governed by laws and dependencies of a physical nature. The problem of computer systems complexity was noticed many years ago, for example by P. Wegner in 1976 and by M. Gell-Mann in 1987. P. Wegner wrote in [13]: "When computers were first developed in the 1940's (...) software costs were less than 5% of hardware costs. (...) In the 1950's and 60's hardware costs decreased by a factor of 2 every two or three years and computers were applied to increasingly [number of] systems. (...) to accomplish such tasks [it] may require millions of instructions and millions of data items, (...) [it] has led to a situation where software costs averaged 70% of total system cost in 1973. (...) an important reason for skyrocketing software costs arises from the fact that current large software systems are much more complex (...) than the systems being developed 25 years ago or even ten years ago. It was pointed out by Dijkstra [in 1972] that the structural complexity of a large software system is greater than that of any other system constructed by Man (...)". M. Gell-Mann argues in [14]: "(...) chose topics that could be helped along by these huge, big, rapid computers that people were talking about, not only because we can use the machines for modeling, but also because these machines themselves were examples of complex systems".
Thus, even a single computer system (as an implementation of a Turing machine) that is not connected to a network can in many ways be considered a complex system in which self-organization patterns can appear. This view is presented in detail further in the paper, using probably the most important piece of software: the operating system.

Linux Kernel
Let us start with a short history of Linux. Although it can be found very quickly on the Internet, we would like to quote what can be read at linux.org [15]: "Linux is an operating system that was initially created as a hobby by a young student, Linus Torvalds, at the University of Helsinki in Finland. Linus had an interest in Minix, a small UNIX system, and decided to develop a system that exceeded the Minix standards. He began his work in 1991 when he released version 0.02 and worked steadily until 1994 when version 1.0 of the Linux Kernel was released. The kernel, at the heart of all Linux systems, is developed and released under the GNU General Public License and its source code is freely available to everyone. It is this kernel that forms the base around which a Linux operating system is developed. There are now literally hundreds of companies and organizations and an equal number of individuals that have released their own versions of operating systems based on the Linux kernel." The most important fact about Linux is that its code is freely available to everyone, so anyone can develop it. From the beginning it was assumed that the Linux source code would be freely available, and this access is now based on the GNU General Public License. This is why Linux development can be carried out by many enthusiasts from all over the world. The whole process can be compared to the behavior of a river basin: the work done by many enthusiasts is like rainfall in the basin, while the developed Linux kernel, which emerges as the accumulated work, is like the river, the final product of rainfall over a wide basin. Sometimes the Linux improvements are very significant (high rainfall) and sometimes they are very minor (small rainfall); some of the implemented ideas become a significant part of the system, while others are not developed further.
As can be seen, the rather short history of Linux has been no obstacle to the very quick development of this operating system. The first version of Linux had roughly ten thousand lines of code, whereas the latest versions have millions. Details can be found in Table 2. As Fig. 1 shows, Linux growth is very rapid (the ordinate represents log values).
The quick and unexpected development of Linux and of the whole Linux community is surprising not only for people who are not well aware of Linux history and its present state, but also for the scientific community, which has started to publish many papers about this phenomenon. The papers by Tuomi and Godfrey [20,16] can be given as examples. The latter shows the state of Linux development at the end of 2000. Its author assumes that the quick growth of Linux can be expressed by equation (1) with $R^2 = 0.997$, where y denotes the size in millions of lines of code without comments and x denotes the number of days since Linux kernel version 1.0 was released.
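The goodness of fit quoted for such a growth model is a coefficient of determination. As a minimal sketch, assuming a polynomial of degree 2 (the degree this paper reports for its own fits) and NumPy as the tool, the fit and its $R^2$ could be computed as follows. The function name `fit_quadratic` and the sample data are hypothetical illustrations, not kernel measurements.

```python
import numpy as np

def fit_quadratic(x, y):
    """Least-squares fit y ~ a*x**2 + b*x + c; returns (coeffs, r_squared)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, deg=2)        # coefficients, highest power first
    y_hat = np.polyval(coeffs, x)           # model values at the data points
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return coeffs, 1.0 - ss_res / ss_tot

# Hypothetical illustration data: x in days, y in millions of lines.
# These numbers are made up and do NOT come from any kernel release.
days = np.array([0, 500, 1000, 1500, 2000, 2500], dtype=float)
size = 0.2 + 5e-4 * days + 4e-7 * days ** 2
coeffs, r2 = fit_quadratic(days, size)
print(coeffs, r2)
```

Because the sample data are generated from an exact quadratic, the recovered coefficients match the generating ones and $R^2$ is essentially 1; real release data would scatter around the fitted curve.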
To show the current state of Linux kernel development, a detailed analysis was prepared of the numbers of files and lines of code that were added, changed or deleted in each kernel release, based on information available from LinuxHQ [21]. It should be noted that Linux kernels are numbered in an interesting way, as x.y.z. The first number denotes the major release (currently 2). The second denotes the minor release: an even number means a stable kernel, while an odd number denotes an unstable (development) kernel. The last number is the revision number. The first stable kernel of each series (i.e., revision 0) is based on the latest version of the unstable one, i.e., it is released when all the previous development work has been done, while unstable kernels can be released before the development of the stable kernels has finished. For example, the unstable kernel 2.1.0 was released on 30 September 1996, while the latest version of the stable kernel, 2.0.40, was released on 8 February 2004. This explains the structure of the two graphs (Fig. 2 and Fig. 3), which present the number of files and lines of code for each release of successive Linux kernels (starting from 1.0).
Based on this information, and similarly to equation (1), models that express the quick growth of Linux kernel size in terms of lines of code and number of files were calculated. The calculations were based on all available data points since kernel version 1.0, and all points were used for parameter estimation. In the case of equation (1), the authors did not explain how they obtained their results, i.e., whether they used all data points available in December 2000 or only those that "fit" their model. A model and its variance were obtained separately for the lines of code and for the number of files. Fig. 5 shows that kernel 2.6.x follows a different trend than the other stable kernels from the 2.x.y family: its growth is very rapid, similar to that of the unstable kernels.
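The x.y.z numbering convention described above can be expressed in a few lines. This is only a sketch of the even/odd rule as stated in the text; the function name `classify_kernel` and the returned dictionary layout are hypothetical, and a two-part version such as "1.0" is assumed to mean revision 0.

```python
def classify_kernel(version: str) -> dict:
    """Split an x.y.z kernel version and apply the even/odd minor rule."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))          # assumption: "1.0" is read as 1.0.0
    major, minor, revision = parts[:3]
    return {
        "major": major,
        "minor": minor,
        "revision": revision,
        "stable": minor % 2 == 0,            # even minor: stable, odd: development
    }

for v in ("1.0", "2.0.40", "2.1.0", "2.6.0"):
    print(v, classify_kernel(v))
```

Grouping releases by the `stable` flag is one way to separate the stable and development series when plotting the number of files or lines per release.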
An obvious question arises: why does this family of kernels behave this way? It may be connected with the growing functionality of new kernel versions, or with the large number of novel hardware solutions that require appropriate drivers. Another reason may be the increasing popularity of Linux itself, since it is easily accessible via the Internet and many people now have a PC with Internet access. The increase may also be caused by the growing number of people who do not like operating systems from Microsoft, or simply by so far unknown trends. The explanation is probably not as simple as it seems, but the observation is very interesting. It also seems to contradict the common opinion that maintaining such a big system is extraordinarily difficult and complicated [16]. The whole process obviously needs a lot of time: for the latest kernel versions, 2.6.x, there are 3-4 releases per year, while for the previous kernel series, i.e., 2.4.x and 2.2.x, the situation was similar or even "worse" (2-3 releases per year; see [21]).

Fractal Properties of Linux Kernel Maps
The information given in Section 3 can, on the one hand, help one imagine how complicated (complex) the structure of the Linux kernel can be, but on the other hand it says nothing about the real complexity of that structure. Graphical visualization can be used to address this problem. This was done for the first time by Rusty Russell, who introduced The Free Code Graphing Project [19]. Based on his proposal, six visualizations of stable Linux kernels were made, each for a revision 0 release (this paper shows only two of them: Fig. 6 for v. 1.0 and Fig. 7 for v. 2.6.0). Each visualization represents the inner structure of a Linux kernel. It is built from rings that represent the folders used to organize the source code files. The inner ring contains all files from the ipc, kernel, lib, mm and init directories (all piled together). The second ring comprises two segments: the fs/ segment and the net/ segment. The third ring has one segment per architecture, and the final ring has all drivers piled together. In each ring there are boxes (solid border) that represent the *.c files from the kernel tree. Each box contains smaller boxes (dotted outlines) with colored lines that show three types of functions: static (dark green), indirect (light green) and non-static (blue). The drawing is laid out from inner to outer and from smallest to largest, with an iterative spacing increase if there is too much of a gap in the outer ring.
As might be expected, the structures of the obtained visualizations become more and more complicated from version to version. However, all the visualizations indicate the existence of self-similarity, which can be observed in zoomed parts of different regions of the figures. It is very interesting that work done by many programmers over many years can be visualized in this way, indicating the existence of some kind of "order" in the whole structure.
Because the generated maps indicate the possible existence of self-similarity, the fractal dimension was calculated using the box dimension approach. The box dimension is defined as the exponent $D_b$ in the relation $N(d) \approx 1/d^{D_b}$ (4), where N(d) is the smallest number of boxes of linear size d necessary to cover a data set of points distributed in a two-dimensional plane. A simple fact forms the basis of this method: for Euclidean objects, the number of boxes necessary to cover a set of points lying on a smooth line is proportional to 1/d, the number needed to cover points evenly distributed on a plane is proportional to $1/d^2$, the number needed for points evenly distributed in space is proportional to $1/d^3$, and so on. Equation (4) thus defines their dimension by the value of the exponent $D_b$ (for Euclidean objects this is an integer). The box dimension can be defined using the number of occupied boxes placed at any position and orientation; however, the number of boxes needed to cover the set should be minimized as far as possible. Finding the configuration that minimizes N(d) among all possible ways of covering the set with boxes of size d proves to be quite a difficult computational problem. If the overestimation of N(d) is not a function of scale, which is a plausible conjecture if the set is self-similar, then using boxes in a grid or minimizing N(d) by letting the boxes take any position is bound to give the same result. This follows from the power-law behavior of relations such as (4): the exponent does not change if N(d) or d is multiplied by a constant. However, because this assumption cannot always be fulfilled in practice, to ensure reliable results one can rotate the grid for each box size by some number of degrees and take the minimal value of N(d).
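The grid-based counting with rotation described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the procedure actually used for the kernel maps: the data set is assumed to be an array of 2-D point coordinates, the names `count_boxes` and `min_box_count` are hypothetical, and the default rotation step of 15 degrees is taken from the angular increment reported for the analysis, read here as degrees.

```python
import numpy as np

def count_boxes(points, d, angle_deg=0.0):
    """Number of grid cells of side d occupied by the points for one grid angle."""
    pts = np.asarray(points, dtype=float)
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    rotated = pts @ rot.T                     # rotating the points instead of the grid
    cells = np.floor(rotated / d).astype(np.int64)   # integer cell index per point
    return len(np.unique(cells, axis=0))      # distinct occupied cells

def min_box_count(points, d, step_deg=15.0):
    """Smallest occupied-cell count over grid rotations in [0, 90) degrees."""
    angles = np.arange(0.0, 90.0, step_deg)   # a square grid repeats every 90 degrees
    return min(count_boxes(points, d, a) for a in angles)

# Hypothetical sample: 200 points on a horizontal segment (a smooth line).
sample = np.column_stack([np.linspace(0.05, 0.95, 200), np.zeros(200)])
print(count_boxes(sample, 0.1), min_box_count(sample, 0.1))
```

Rotating the points by an angle is equivalent to rotating the grid by the opposite angle, which keeps the cell indexing a simple `floor` operation. The sketch does not shift the grid origin, so it only approximates the true minimal cover.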
In the presented analysis the angular increment of rotation was set to 15 degrees. Because equation (4) represents a power law, the value of $D_b$ was calculated from plots of log(N(d)) on the vertical axis versus log(d) on the horizontal axis. The successive points usually follow a straight line with a negative slope equal to $-D_b$. There is another problem with this approach: the range of values of d. Trivial results can be expected for very small and very large values of d, so the slope was calculated for two sets of data: all obtained points, and the points that lie between 10% and 90% of the available d values (the extremes were discarded). The obtained results are given in Table 3. As can be seen, the latest kernel versions have a higher dimension $D_b$ than the first ones.
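The slope estimation can be sketched in the same spirit. Taking logarithms of the power law gives log N(d) = -D_b log d + c, so $D_b$ is minus the slope of a straight-line fit. The function name `box_dimension` is hypothetical, and the trimming rule is one possible reading of "between 10% and 90% of the available d values": the box sizes are sorted and the lowest and highest 10% by rank are dropped. The sample data are a synthetic power law, not results for any kernel map.

```python
import numpy as np

def box_dimension(d, n, keep=(0.0, 1.0)):
    """Estimate D_b as minus the slope of log N(d) versus log d.

    keep=(lo, hi) retains only the box sizes whose rank fraction lies in
    [lo, hi); (0.1, 0.9) discards the smallest and largest 10% of sizes.
    """
    d = np.asarray(d, dtype=float)
    n = np.asarray(n, dtype=float)
    order = np.argsort(d)                     # sort by box size
    d, n = d[order], n[order]
    lo = int(np.floor(keep[0] * len(d)))
    hi = int(np.ceil(keep[1] * len(d)))
    slope, _intercept = np.polyfit(np.log(d[lo:hi]), np.log(n[lo:hi]), deg=1)
    return -slope

# Synthetic check data: an exact power law N(d) = d**(-1.5).
sizes = np.logspace(-3, 0, 20)
counts = sizes ** -1.5
print(box_dimension(sizes, counts), box_dimension(sizes, counts, keep=(0.1, 0.9)))
```

For an exact power law both estimates agree; for measured counts, the difference between the full and trimmed estimates indicates how much the extreme box sizes distort the slope.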

Conclusions
This paper has shown some interesting properties of the structure of open software and its development. Among them are: the quick growth of the Linux kernel measured by the number of source lines of code and the number of files (in the simplest approach this growth can be approximated by a polynomial of degree 2), the self-similar visualizations of different stable Linux kernels, and the box dimension calculated for these visualizations. Because Linux is, in many people's opinion, developed independently by many enthusiasts all over the world, one might imagine that its structure would not show any interesting properties. However, as it turns out, this structure shows the existence of system self-organization (Figs. 6 and 7) with self-similar visual patterns. The box counting method yields a box dimension that makes it possible to describe the complex nature of software systems in terms of fractals. With this, problems of software evolution can be considered using new metrics and laws, but the proposed approach needs to be developed in future work.