Graham’s Ph.D. Thesis: A Study of Software Component System Evolution

Since I spent 4.5 years completing my Ph.D. at Massey University, I have decided 1 year on to give an overview of what I actually did.

If you want my full thesis, you can download it here. Be warned, it is really long, really dense and an academic document. I am trying to make this post a bit more reader friendly.

So here it goes…

Introduction to Components and Evolution

In order to agree to talk, we just have to agree we are talking about roughly the same thing. The Feynman Lectures on Physics, Motion, Richard Feynman, 1961.

According to Feynman(and I always refer to Feynman when trying to discuss a complex topic), before I get too far down the rabbit hole of my thesis, I need to make sure we are talking about the same things. So lets break apart the title of my thesis into three parts:

Software Components
Component Systems
Software Evolution

I will go over each of these things, to make sure we are on the same page.

Software Components

A software component is a piece of software that can be composed into a component system. Well… that is a useless definition, it is like saying ‘a piece of cake is defined as a part of a cake’. Of course, what you want to know is what does a software component look like?. Well that turns out to be a very difficult question to answer.

A component is an intuitive concept, where each person has their own internal metaphor in which to describe it. Software components are:

lego blocks, where a developer just clicks them together to build software
No, wait, they were are more like biological systems where a developer puts them near each other and they communicate and self organise
Nope, they are like an electrical system, where they are wired up together by a developer
Hold on, they are mechanical like cogs in a engine where one connection is nothing like another and you have to put the right piece in the right place

These are all existing metaphors that have been used to describe software components in academia. Finding a definition that will satisfy all these ideas is impossible. Even well respected, and well known, engineers and researchers have trouble defining what a software component is. For example Bertrand Meyer (author of Object-Oriented Software Construction) and Clemens Szyperski (author of Component Software: Beyond Object-Oriented Programming) discussed in Dr Dobbs, the definition of a comopnent; Szyperski defines a component as having three characteristic properties:

Being a unit of independent deployment.
Being a unit of third party composition.
Having no externally observable state.

Meyer defines a software component where it:

May be used by other software elements (clients).
May be used by clients without the intervention of the components’ developers.
Includes a specification of all dependencies (hardware and software platform, versions, other components).
Includes a precise specification of the functionality it offers.
Is usable solely on the basis of that specification.
Is composable with other components.
Can be integrated into a system quickly and smoothly.

What hope have I in defining what a software component is when these two cannot come to an agreement on the most basic problem “should a software component have state?”. I think that after many such arguments Szyperski wrote that it is impossible to

enumerate a fixed agreeable set of features that is necessary and sufficient for a natural concept such as component

I still needed a definition for my thesis. Since there didn’t exist a definition that would satisfied everyone, I decided to create my own. I chose to only define the parts of a software component I needed for my research. My definition: a software component:

has explicitly declared constraints for its execution environment, e.g. dependencies on (or conflicts with) other components
include mechanisms to automatically alter a component composition, e.g. change the components in a system without manual intervention

I found these two attributes were sufficient for me to continue to study software components without being paralysed by trying to define exactly what it is I am studying. So now, hopefully you roughly know what I mean by software component.

Component Systems

Some examples of component systems that fit my definition are:

the Ubuntu operating system
the Eclipse IDE

There are an estimated 20 million users of the Ubuntu operating system and millions of users of the Eclipse IDE. Ubuntu and Eclipse systems are constructed from components, called packages and bundles respectively, and can be changed by adding or removing components to and from their systems.

Ubuntu has a Debian based package system where each package declares its constraints in a control file that looks like this: Package: textEditorPackage Version: 0.0.1.alpha Depends: spellChecker Conflicts: otherTextEditorPackage

apt-get will use these constraints to change the component system by adding, remove or updating the components in the system.

Eclipse has a component system based on OSGi bundles. These bundles describe the constraints on their execution environment in a bundle manifest file that looks like this: Bundle-Name: TextEditor Bundle-Vendor: Graham Jenson Bundle-SymbolicName: nz.geek.textEditor Bundle-Version: 0.0.1.alpha Bundle-RequiredExecutionEnvironment: J2SE-1.4 Export-Package: nz.geek.textEditor;version="0.0.1.alpha" Require-Bundle: nz.geek.fonts Import-Package: nz.geek.spellchecker;version>"0.0.1"

P2 is the core of how Eclipse changes its component composition. It uses the constraints from the bundles to allow a user to install, remove and update their system.

Software Evolution

If an object has all its component parts replaced, is it the same object? paraphrased from Plutarch, the Life of Theseus

Brooks in The Mythical Man-Month states that over 90% of the cost of a system occurs after deployment in the maintenance phase, and that any successful piece of software will inevitably need to be maintained. This realisation, that the cost of software is not in its construction, but in its maintenance made many people look at how software changed after its deployment. There are two views you can take on software changing after deployment:

the deliberate activities a developer does to software after deployment, software maintenance
the unplanned changes that happen because of these activities, software evolution

Software maintenance has been formalised into types in ISO/IEC 14764 as:

Adaptive Maintenance: adapting to new system or technical requirements.
Perfective Maintenance: adapting to new user requirements.
Corrective Maintenance: fixing errors and bugs.
Preventive Maintenance: adapting to prevent future problems.

Using these definitions studies from Lientz and Swanson (1980) found that 75% of the maintenance effort was on the first two types, and corrective maintenance took about 21% of the effort.

Some laws of software evolution were identified by Lehman; these are:

Continuing Change: Software systems must be continually adapted, otherwise they become progressively less satisfactory.
Increasing Complexity: As the system evolves its complexity increases unless work is done to reduce it.
Self Regulation: The system evolves with statistically determinable trends and invariances.
Conservation of Organisational Stability: The average effective activity rate to evolve a system is invariant over its lifetime.
Conservation of Familiarity: As the system evolves, its incremental growth must remain invariant to ensure users maintain mastery over the system.
Continuing Growth: The system must continually grow to maintain user satisfaction.
Declining Quality: The quality of the system will decline unless rigorously maintained.
Feedback System: The function a system performs is changed by the effect it has on its environment.

Both the study of software maintenance and evolution show that a software engineer’s objective of creating a satisfactory system is difficult, expensive, and not always achievable. Additionally, the continual evolution of a software system is necessary, and this evolution reduces quality, increases complexity, and is costly. These are the core problems of being a software engineer.

Component and Component System Evolution

We are like sailors who on the open sea must reconstruct their ship but are never able to start afresh from the bottom. Willard Van Orman Quine, Word and Object, 1960

To maintain and evolve a software component requires technical skills. You need to know how the component system works that you are developing for and you will need to be able to write software. Originally, that was the assumption when creating component systems. Adding, removing or updating a component will require technical knowledge and understanding of the system. However, applications like apt-get started to make changing a system trivial, to the point where a complete novice can install, upgrade, and remove any component in their system.

When a user is the one who composes their system, and not a developer, this breaks apart two processes:

the evolution of the component
the evolution of the component system.

Component Evolution is very similar to software evolution, it requires a developer. The units of this evolution are versions. Using these versions, you can tell if a version of software is more or less evolved than another.

Component System Evolution (CSE) is not like software evolution. There are no versions of a system, telling if one is more evolved than another is impossible. However, changing a component system is much more restrictive than an individual component as it must satisfy all contained component constraints. If component a depends on component b, which is noted as a -> b, then if a is installed b must be as well.

Changing a component system would be impossible for a non-technical user if it were not for tool support. The named a tool that helps change a component system a Component Dependency Resolver (CDR). CDR’s help users alter their systems by calculating a system that satisfies all the constraints of a requested install, removal or upgrade of a component. CDR’s like Eclipse P2 and apt-get allow uses to simply state they want to add a component.

Additionally, CDR’s typically try to minimise the change and out-of-dateness of a change to a system. If you, as a user, wanted to install a component and in doing so your CDR installed all components; it technically did what you asked, but it changed too much for you to be satisfied. Also, if it installed the oldest version of every component it could find, that is also not good as most users what the most up-to-date software.

What I Did and How I Did it

So, knowing all that I just described about Software Components, Evolution, and Component System Evolution, what did I actually do?

Like all the science fair projects I did in primary school, I needed to define a hypothesis, a method, do some experiments, then analyse the results.

Hypothesis

The objective of this research has lead to the thesis:

It is possible to reduce the negative effects of component system evolution by altering the mechanisms by which systems are changed.

I broke this down into these necessary steps I had to take:

To develop a reproducible and controllable environment in which to measure the effects of CSE.
To use this environment to study how systems evolve.
To alter the mechanisms by which systems are changed and study their impact on CSE.
To demonstrate a reduction on change and out-of-dateness using such alterations.

Method

Simulation is a great way to study stuff. It lets you control all the variables, cheaply run hundreds of experiments without interruption and all it takes is a little (or a lot) time on a computer.

To make a realistic simulation you need to have enough data of significant size and complexity to make the simulation realistic and non-trivial.

For these reasons I selected Ubuntu as the component system I simulated.

Ubuntu

Ubuntu systems follow the Unix philosophy of components:

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.

First two rules of Unix:

Rule of Modularity: Write simple parts connected by clean interfaces.
Rule of Composition: Design programs to be connected to other programs.

Ubuntu also has a massive sets of data to use:

a full history of every component in its core repository and the time it was uploaded
one of the largest automated surveys in the world just about what packages are popular, Ubuntu popcon
millions of users who are easily accessible to collect data
an open and transparent community of developers and researchers all actively sharing what they are doing

Validation

The biggest hurdle with simulation is making it similar enough to reality. To show a simulation is similar to reality. That is:

Validation is the process of determining whether a simulation is an accurate representation of the system, for the particular object of the study.

This quote comes from “How to build valid and credible simulation models” Law (2005), which is the methodology that I used to guide me in creating and validating my simulation. I chose this method because of its practical advice it had to create models and validate them to the degree I needed. The method is outlined as such:

Formulate the problem
Collect information to construct a conceptual model
Validate the conceptual model
Implement the conceptual model
Validate the implementation
Design conduct and analyse experiments
Document and present results

What I Created

Experiment is the sole judge of scientific truth The Feynman Lectures on Physics, Introduction, Richard Feynman, 1961.

To answer my hypothesis I needed to answer some smaller questions:

How can CSE be modelled?
How can a user who changes their component system be modelled?
How can a CSE simulation be implemented?
How can the negative effects during CSE be reduced?

To answer these questions I needed to:

create a formal model CoSyE (Component System Evolution) that describes CSE.
create the CUDF* language that is used to define documents that describe the evolution of a component system.
create SimUser (Simulated User), which models a user who changes their system.
create GJSolver which is an efficient implementation that calculates the changes made to a system as it evolves (called resolving).
build a simulation of the evolution of Ubuntu operating systems using CoSyE, CUDF*, SimUser and GJSolver.
test two methods to reduce the out-of-dateness and change during CSE.
analyse the results from experiments and draw conclusions.

I will briefly go over each of these things as their creation and validation is the majority of my thesis.

CoSyE (Semantic)

To exactly describe what I am talking about, I decided to describe it as a model. The qualities I wanted from model were:

Abstractness — it is a reduction of reality
Understandability — it is intuitive
Accuracy — it is a representation of the real system
Predictivness — can be used to predict non-obvious properties
Inexpencive — cheaper to study than the actual system

The model of Component System Evolution (CoSyE) that I created is described in mathematical notation. The reason for using maths (instead of say UML) is that the complexities of the model constraints made them difficult to describe in such a rigid framework such as UML.

It took a very long iterative process to create this model. This is because I had never created a model before and the I was learning mathematical notation while building it. This model is very specific, I had to define what a version was, what a component was, even what a name was. It is a complete definition of what I was studying.

The benefit such a precise definition is that it provided a solid foundation on which to discuss the domain I was studying.

The parts of the model that it is important to understand are a list of possible constraints that can describe a components relationship to other component and a component system:

Exclusion: Not in the system
Conflict: When two components cannot be in the system
Inclusive Disjunction: When at least one of a set of components must be in the system
Dependence: When if one component is in a system at least one of a set of other components must be in the system
Exactly One: When exactly one of a set of components must be in a system.

And the parts of the model that are needed to create and instance of it

A series of times
The set of available components at each time
User requests to change the system at each time
Sets of systems constraints at each time (can be extracted from the components that exist)
Optimisation criteria at each time, used to satisfy the user request
The initial components in the system

Why these parts are necessary and how they interact will become clearer soon.

CUDF* (Syntax)

For the CoSyE model to be useful, I had to be able to describe an instance of it. I extended an existing language called Common Upgradeability Description Format (CUDF) from Mancoosi.

Here is an example of CUDF: preamble: property: size: int = [0]``package: syslib version: 1 installed: true``package: syslib version: 2 conflicts: syslib``package: textEditor version: 1 depends: spellChecker | spellCheckerService, syslib > 1``package: spellChecker version: 1 size: 1``package: tpspeller version: 1 provides: spellCheckerService size: 2``request: install:textEditor

Mancoosi was a European research project looking at the problem of upgrading a component system. Since the evolution of a component system occurs over many upgrades, and CUDF was a syntax to describe just one upgrade, I had to extend CUDF to CUDF* to suit my domain.

The difference between CUDF and CUDF* are:

the Mancoosi Optimisation Format(MOF) to describe optimisation criteria
the times different components became available
the initial time of the simulation
multiple user requests

Here is an example of CUDF*: preamble: 100 property: size: int = [0]``package: syslib version: 1 time: 50 installed: true``package: syslib version: 2 time: 150 conflicts: syslib``package: textEditor version: 1 time: 150 depends: spellChecker | spellCheckerService, syslib > 1``package: spellChecker version: 1 time: 150 size: 1``package: tpspeller version: 1 time: 250 provides: spellCheckerService size: 2``request: 200, -change,-size install:textEditor``request: 300, -size,-change install: textEditor, tpspeller

This example describes five components, a initial system with only syslib installed at time 100, and two changes to a component system:

A request to install the textEditor component at time 200 while minimising change then size.
A request to install the textEditor and tpspeller at time 300 while minimising size then change.

It is reasonably straight forward description, and your intuition about what will happen is probably right. However, if you want an exact description of this format please refer to the thesis.

SimUser

So I have a way to describe the evolution of component system, but how will I create these descriptions that look realistic. How do component system actually evolve with users upgrading, installing and removing components.

Modelling a user behaviour is hard. I mean really hard and it is even harder to verify. The first step though is generally talking to users, so I chose to start my modelling by doing an online survey.

Note: Everything involving people outside your research requires ethics approval, including online surveys.

I put a query on an online forum and got about 50 respondents of people who use various linux systems. Each one gave me loads of great information as I let them have lots of free text. I also did something that would help a great deal later, and asked for people to submit their logs. I collected about 30 useful logs which is better than a user survey because it is real data.

From survey I was able to describe two type of user behaviour:

Progressive behaviour prioritizes the potential risk of becoming out-of-date over the risk of introducing new problems.
Conservative behaviour prioritizes the potential risk of changing the system over having less functionality and having old problems persist.

For example, a desktop user who wants the latest and greatest is more progressive than a conservative server admin who only wants things to keep working the way they are.

Next, I created SimUser, which is a formal and simple description of how a user upgrades their system. It includes four variables:

is the probability a user requests to upgrade the system per day.
is the probability a user requests to install any component per day.
is the MOF criteria used to select an optimal system for an upgrade request.
is the MOF criteria used to select an optimal system for an install request.

What values should I assign these variables to get a realistic user?

The criteria to upgrade and and install a component I can assign what apt-get uses, and tweak it a bit to see if I can make it better.

The probability a user requests to upgrade their system can be determined by looking at the survey and the logs the applicants provided.

Here, I was able to graph the participants probabilities to install and upgrade components, and cluster them using k-means:

The probability a user requests to install any component per day can be extracted from an online survey called The Ubuntu Popularity Context or popcon. This survey records every package and its chance of being installed in an Ubuntu system. However, because there are many packages that are only installed because they are dependencies of another package, I had to filter this list by the 2399 packages from the package called app-install-package that contains information about popular packages to install.

Validation

I validated this model in four ways:

I had discussions with project supervisors (Jens and Hans) and other stakeholders (i.e. Giovanni, Marsland, Catherine and anyone else I could).
I compared SimUser to responses from the survey
I compared generated CUDF* documents to the supplied user logs
I created a virtual Ubuntu system and looked at its changing repository over a month.

GJSolver

What I cannot create, I do not understand Richard Feynman, 1988

Programming is what a programmer does, and the component resolution algorithms is what got me interested in this topic to begin with. With that in mind I wanted to implement my own solver, so that I could experiment with it.

DPLL

What I found when I looked deep into component resolvers like Eclipse P2 was one of the hardest to implement and beautiful algorithms Davis-Putnam-Logemann-Loveland (DPLL). This algorithm is used to find if a Boolean equation can have its variables assigned to make the equation true.

Given the equation a OR b, is it satisfiable? Yes, if you assign a = true , b = true will result in it being satisfiable. DPLL does not care if there is more solutions, it just needs one to show that the equation is satisfiable.

Given the equation a AND (!a OR b), is this one satisfiable? Well we know a must be true, and the second part we know that !a will be false so b must be true. This is the kind of reasoning that DPLL can employ to derive values for variables.

Another way to write (!a OR b) is a -> b, i.e a implies b. This is how we map these Boolean equations to components, where if a where a component then a depends on b.

Here is the description of DPLL function from my thesis:

In this function F is just the set of constraints, P is the currently assigned variables (or partial solution) and:

unit-propagate is a particular way in which you can infer variable assignments from already inferred knowledge, e.g. you know a=true and you know a -> b, therefore b=true.
decide makes a guess at which variable should be true or false. If decide picks the right assignments all the time, then this algorithm will be really fast.
It is recursive, e.g. it guesses a=true then it calls itself to see if there is an answer where a=true. If there isn’t then the answer must be a=false. If a cannot be true or false, then there is no answer and DPLL knows there is no solution.

Simple right! Well the devil is in the details. Getting fast unit-propagation, good heuristics for decide, efficiently handling hundreds of thousands of variables and constraints is very difficult.

Optimisation

Finding any solution to a component problem isn’t a good idea. There are many solutions to any given component upgrade problem, the hardest part is finding a good one.

For this I used an algorithm called Lexicographic-iterative-strengthening, which is another way of saying it finds a solution and keeps trying to make it better till it can’t any more.

The last thing I needed to do was to describe how I mapped my CUDF* instance to the implementation, and that is pretty tedious so I will leave it out here.

Verification of GJSolver

Verification: Did I build it right?

So after implementing GJSolver I needed to make sure:

It correctly solves component upgrade problems
It solves the problems at least as well as other implementations

I found out if it did both of these by entering into Mancoosi International Solver Competition (MISC). In this competition GJSolver competeted against other implementations in solving CUDF problems. A competition like this has the benefit of being conducted by an impartial third party, meaning I couldn’t tamper with the results.

The outcomes of the competition were:

GJSolver consistently got good results, reasonably quickly. It won one the most difficult of the three tracks it was entered in.
No CUDF problem was solved incorrectly by GJSolver.
When compared to the other solvers I produced similar solutions in similar time frames.

Validation of GJSolver

Validation: Did I build the right thing?

To validate GJSolver and SimUser I:

simulated installing 200 requests of packages and compared it to the logs of apt-get users.
simulated updating an Ubuntu system once a day for a month and compared it to the results I collected from a virtual Ubuntu system doing the same thing.

The results were close, but not exactly the same. This validation is not done to make sure the simulation is exactly reality, it never will be. The validation is done to see where it differs from reality.

Experiment: Alter Simulation and Measure Effect

I had four questions I wanted to answer with my simulation:

What consequences do a user’s choices have on their system (i.e. their probabilities to upgrade and install) when using the apt-get criteria?
Can the out-of-dateness of a system be reduced?
Can the total change of a system be reduced?
How do the systems of realistic users evolve?

To answer these questions I used SimUser to create CUDF* documents for various users, then used GJSolver to resolve those CUDF* documents.

Results

I will not go over all the results but my initial experiments were with four simulated users:

Always Install users install one component every day
Always Update and Install users install and update every day
Always Upgrade users upgrade everyday
Control users do nothing

Note: The measure UTTDpC stands for Up-To-daTe-Distance per Component, a measure of how many newer versions exist of a component.

The results from these users looked like:

How up to date a comopnent system is (lower is better):

How much change a system goes through:

Reducing up-to-dateness

When creating the optimisation criteria for apt-get I learnt something that I didn’t previously know: apt-get will not install a entirly new component during an update. So if a newer version of an already installed component requires a component that you do not currently have installed it will be unsable to upgrade that component.

By altering the optimisation criteria I was able to get a system about 24% more uptodate over a year of upgrading everyday.

Reducing Change

While experimenting I noticed that many times a component would be upgraded in quick sucession. Wondering if this was a common issue I looked to the data and created this graph:

Using this information I came up with the theory that sometimes a component is released to the package repository with a bug that is quickly found, fixed, then a new version is released. This means that any person who installed the original package would have been better waiting a little bit because:

they will not have to download a package twice
they will not have a bug in your system
they will not have to change your system twice

So I created an optimisation that waited 7 days after a package is added to the repository to install it and simulated the results. I found that if you update frequently then you will save yourself about 30 instances per year where you install the same package twice in a week. If you increase that waiting time to 28 days, you can save yourself over 100 changes a year.

Summary of Experiments

In addition to showing how to reduce changing by waiting before installing a component, and decreasing out-of-dateness by letting updates install new components; I also showed:

The majority of change during evolution is caused by a user upgrading. Installing new components increases the amount of change when upgrading.
Systems become out-of-date at the rate at which components evolve. Components evolve at a higher rate during release cycles.
Reuse decreases the rate of change during CSE. This is due to the two effects; reuse decreases the installation rate of components and this decreases the amount of components necessary to be upgraded.
Increasing the frequency of upgrading has depreciating returns on reducing a systems out-of-dateness. It may also increase change due to components being repeatedly upgraded if they quickly release multiple versions.

Research Conclusion

The research I did gave an understanding of how component systems evolve and it can provide users and developers with insights into the effects of their choices. Additionally, the research has proposed two novel ways to reduce negative effects during CSE and the tools in with which to measure the effectiveness of these techniques.

What I learnt

Only do a Ph.D. if you can live on ramen noodles: you are not paid much (if anything) there is a lot of stress involved and it will take a while.
Doing a Ph.D. will force you to learn how to give presentations, write research papers, organise conferences, in addition to all the academic things you must learn. So get good at learning things.
Look at what others have done, it saved me a lot of problems. If I had not found Mancoosi, or Daniel Le Berre, I would have never finished. Understanding their work allowed me to put my work in their context.
Writing is difficult, feedback is necessary. I had many people look at my writing and try to understand what I was saying.
Formalism is not for formalisms sake, but for your understanding and communication.
Learn to write and read mathematical notation. I believe it has helped me be a better programmer, by forcing me to think about every function I write.
An experiment is more powerful than any amount of words. Feynman says “Experiment is the sole judge of scientific ‘truth’”. If you have a repeatable experiment no one can argue with you.
I am much more critical of science, especially scientific reporting when they use the word prove and proof. Most of the time what they do is demonstrate something, proof is a long way off.

References

This will be short, because if you want to see all my references there are 7 pages of them in my thesis!

Brooks The Mythical Man-Month, this book is a must for a developer who wants to understand what their job is.

Szyperski Component Software: Beyond Object-Oriented Programming, a great (and massive) tomb of knowledge about all aspects of software components.

Everything written by Richard Feynman, Surley you’re joking Mr Feynman, What do you care what other people think?, even QED. He has a way of telling a story while explaining something that makes you able to understand (even a bit).

Bonus: This is a timelapse of me writing my thesis:

Introduction to Components and Evolution#

Software Components#

Component Systems#

Software Evolution#

Component and Component System Evolution#

What I Did and How I Did it#

Hypothesis#

Method#

Ubuntu#

Validation#

What I Created#

CoSyE (Semantic)#

CUDF* (Syntax)#

SimUser#

GJSolver#

DPLL#

Optimisation#

Verification of GJSolver#

Reducing Change#

Summary of Experiments#

Research Conclusion#

References#

Introduction to Components and Evolution

Software Components

Component Systems

Software Evolution

Component and Component System Evolution

What I Did and How I Did it

Hypothesis

Method

Ubuntu

Validation

What I Created

CoSyE (Semantic)

CUDF* (Syntax)

SimUser

GJSolver

DPLL

Optimisation

Verification of GJSolver

Reducing Change

Summary of Experiments

Research Conclusion

References