@PHDTHESIS{IMM2001-0797, author = "M. Keijzer", title = "Scientific discovery using genetic programming", year = "2001", school = "Informatics and Mathematical Modelling, Technical University of Denmark, {DTU}", address = "Richard Petersens Plads, Building 321, {DK-}2800 Kgs. Lyngby", type = "", note = "Supervised by Prof. Lars Kai Hansen, {IMM,} {DTU}", url = "http://www2.compute.dtu.dk/pubdb/pubs/797-full.html", abstract = "Genetic Programming is capable of automatically inducing symbolic computer programs on the basis of a set of examples or their performance in a simulation. Mathematical expressions are a well-defined subset of symbolic computer programs and are also suitable for optimization using the genetic programming paradigm. The induction of mathematical expressions based on data is called symbolic regression. In this work, genetic programming is extended to not just fit the data, i.e., get the numbers right, but also to get the dimensions right. For this, units of measurement are used. The main contribution of this work can be summarized as follows: the symbolic expressions produced by genetic programming can be made suitable for analysis and interpretation by using units of measurement to guide or restrict the search. To achieve this, the following has been accomplished: A standard genetic programming system is modified to be able to induce expressions that more or less abide by type constraints. This system is used to implement a preferential bias towards dimensionally correct solutions. A novel genetic programming system is introduced that is able to induce expressions in languages that need context-sensitive constraints. It is demonstrated that this system can be used to implement a declarative bias towards 1) the exclusion of certain syntactical constructs; 2) the induction of expressions that use units of measurement; 3) the induction of expressions that use matrix algebra; 4) the induction of expressions that are numerically stable and correct. 
A case study using four real-world problems in the induction of dimensionally correct empirical equations from data, using the two different methods, is presented to illustrate the use and limitations of these methods in a framework of scientific discovery." }