User:WillWare/ML Book Intro

Motivation
I hope to make this a book that I will find interesting, and from which I can learn. There may then be a good chance that others would also find it interesting.

There are a few threads here that hold my interest. Each is cohesive by itself, but jamming them together all in one book seems to me like a bad idea. The tentative plan at the moment is to produce multiple volumes, so that my profound ignorance can be made abundantly available for the amusement of all mankind.

The lay of the land
Take this with a grain of salt, in light of my aforementioned ignorance.

There is a cluster of related topics including artificial intelligence, machine learning, and neural nets. My first attempt to create a book on these topics was in early 2010, and looking back at it a decade later, it didn't suck as badly as I would have expected. So I am emboldened to take another swing at it. My thoughts on all this stuff have changed surprisingly little in the intervening years.

Some related fields involve more "analog" things: classification, clustering, various sorts of signal processing, Kalman filters, and the contents of the wonderful MIT course 6.432 (Detection, Estimation, and Stochastic Processes). That course was last taught in 2005, and has since been replaced by two others, 6.437 (Inference and Information) and 6.972 (Algorithms for Estimation and Inference). I guess I need to take those next.

One idea that grabbed my interest about that time was the notion of automating science. Not just automating lab work, but automating the scientific method itself. This is discussed at length below.

As I look through this collection of subjects, I find certain cohesive threads.

The Symbolic AI thread
This book aims to address this thread. Here's what I think a reasonable syllabus might look like. I think of this as "discrete" or "digital" machine learning/reasoning, because it deals with essentially logical statements and discrete entities. It tracks context and does not fall victim to the fits of gibberish produced by something like GPT-3.

The data mining thread
There are two powerful ideas here: classification, which sorts data objects into categories, and regression, which tries to create a mathematical model of an empirically found relationship among pieces of data. This is incidentally where a lot of money is being made these days.

This topic is a little more "analog" in the sense that we are dealing explicitly with probabilities or degrees of certainty. The primary insight here is Bayes' theorem.
 * $$P(A|B) = \frac{P(B|A)\, P(A)}{P(B)}$$

This helps us use observations to improve the accuracy of our estimates of hidden variables we cannot directly observe. If a hidden variable happens to be of considerable commercial value, so much the better.
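Here is a minimal sketch of that update in Python, using the formula above with P(B) expanded by the law of total probability. The function name and the numbers (a rare condition and an imperfect test) are made up for illustration; they are not from any real data set.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(A|B) given P(A), P(B|A) and P(B|not A)."""
    # P(B) = P(B|A) P(A) + P(B|not A) P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: A = "has the condition", B = "test is positive".
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(condition | positive test) = {p:.3f}")
```

Even with a fairly accurate test, the posterior stays modest because the prior is so small, which is the kind of thing the formula makes easy to see.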

Actually, a huge piece of this volume could simply be a giant R tutorial. Likewise Weka.
 * Practice
 * Recommender system
 * Collaborative filtering
 * Data mining
 * The R language


 * Theory
 * Information theory
 * Statistical classification
 * Regression analysis
 * Logistic regression
 * Bayes' theorem
 * Bayesian inference
 * Cluster analysis
 * Hidden Markov model


 * Algorithms
 * Naive Bayes classifier
 * Bayesian linear regression
 * Association rule learning
 * Decision tree learning
 * Support vector machine
 * Principal component analysis

The signal processing thread
One could loosely refer to this as a sort of machine learning, in that the machine is sometimes inferring the value of an unobservable variable, given observations of some other more accessible variable. While this topic is close to my heart, I need to think harder about how to put it all together.


 * Statistical inference
 * Bayes' theorem
 * Bayesian inference
 * Fuzzy logic
 * Artificial neural network
 * Expectation-maximization algorithm
 * Hidden Markov model
 * Kalman filter

A good page on hidden Markov models.
 * http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html

I should try some data mining on the data from the Framingham Heart Study if that data is available.

The natural language processing thread

 * Speech synthesis
 * Speech recognition
 * Natural language understanding
 * Machine translation

The automation of science
Reported in April 2009 by Ross King at Aberystwyth University, the Adam robot used lab automation to perform experiments and data mining to find patterns in the resulting data. It developed novel genomics hypotheses about S. cerevisiae yeast and tested them. Adam's conclusions were manually confirmed by human experimenters, and found to be correct. This was the first instance in human history where a machine contributed new scientific knowledge without human oversight.

Modern science deals with ever-increasing amounts of data, while the cost of computers continues to fall. The need to automate certain aspects of scientific work will grow more urgent in the coming years. We may some day see machine theoreticians and experimentalists collaborating with their human counterparts, participating in a scientific literature that is both human- and machine-readable.

This is an idea that has been in my thoughts quite a bit over several months now, and I've blogged about it.

Can computers do scientific investigation?
I came across a 2001 paper in Science recently that lines up with some thinking I'd been doing myself. The web is full of futuristic literature that envisions man's intellectual legacy being carried forward by computers at a greatly increased pace; this is one of the ideas covered under the umbrella term technological singularity.

In machine learning there are lots of approaches and algorithms that are relevant to the scientific method. The ML folks have long been working on the problem of taking a bunch of data and searching it for organizing structure. This is an important part of how you would formulate a hypothesis when looking at the bunch of data. You would then design experiments to test the hypothesis. If you wanted to automate everything completely, you'd run the experiment in a robotic lab. Conceivably, science could be done by computers and robots without any human participation, and that's what the futurists envision.

The Science paper goes into pretty deep detail about the range and applicability of machine learning methods, as things stood in 2001. I find ML an interesting topic, but I can't claim any real knowledge about it. I'll assume that somebody somewhere can write code to do the things claimed by the paper's authors. It would be fascinating to try that myself some day.

To bring this idea closer to reality, what we need is a widely accepted machine-readable representation for hypotheses, experiments, and experimental results. Since inevitably humans would also participate in this process, we need representations for researchers (human, and possibly machine) and ratings (researcher X thinks hypothesis Y is important, or unimportant, or likely to be true, or likely to be false). So I have been puttering a little bit with some ideas for an XML specification for this sort of ontology.

Specifying experiments isn't that tricky: explain what equipment and conditions and procedure are required, and explain where to look for what outcome, and say which hypotheses are supported or invalidated depending on the outcome. Experimental results are likewise pretty simple. Results should refer to the experiments under test, identifying them in semantic web style with a unique permanently-assigned URI.

The tricky part is an ontology for scientific hypotheses. For that you need a machine-readable language flexible enough to express complex scientific ideas, and that's potentially challenging. Besides, some of these ideas are naturally expressible in ways that humans easily get but machines find difficult, for instance almost anything involving images.

An XML specification for describing hypotheses, experiments and results in a machine-readable way would be very interesting.
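To make this a bit more concrete, here is a rough sketch of what such a record might look like, built with Python's standard xml.etree.ElementTree module. Every element name, attribute name, and URI here is hypothetical (there is no such standard that I know of), and the yeast-flavored statement is a placeholder, not a real finding.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/science/"  # placeholder namespace, not a real registry


def build_record():
    """Build one hypothesis, one experiment, one result and one rating."""
    root = ET.Element("record")
    hyp = ET.SubElement(root, "hypothesis", uri=BASE + "hypothesis/1")
    ET.SubElement(hyp, "statement").text = "Gene G is required for growth on medium M"
    exp = ET.SubElement(root, "experiment", uri=BASE + "experiment/1")
    ET.SubElement(exp, "procedure").text = "Delete gene G, culture the strain on medium M"
    # Each possible outcome names the hypothesis it supports or refutes.
    ET.SubElement(exp, "outcome", observation="no growth", supports=BASE + "hypothesis/1")
    ET.SubElement(exp, "outcome", observation="growth", refutes=BASE + "hypothesis/1")
    # A result points back at its experiment by URI.
    ET.SubElement(root, "result", experiment=BASE + "experiment/1", observation="no growth")
    ET.SubElement(root, "rating", researcher=BASE + "researcher/x",
                  hypothesis=BASE + "hypothesis/1", importance="high")
    return root


def supported_hypotheses(root):
    """List URIs of hypotheses supported by the recorded results."""
    supported = []
    for result in root.findall("result"):
        for exp in root.findall("experiment"):
            if exp.get("uri") != result.get("experiment"):
                continue
            for outcome in exp.findall("outcome"):
                if (outcome.get("observation") == result.get("observation")
                        and outcome.get("supports")):
                    supported.append(outcome.get("supports"))
    return supported


xml_text = ET.tostring(build_record(), encoding="unicode")
print(xml_text)
print(supported_hypotheses(ET.fromstring(xml_text)))
```

The point of the sketch is only that experiments, results, and ratings are easy to encode; the free-text statement inside the hypothesis element is exactly the part that still needs a real ontology.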

Machines doing actual science, not just lab work
Here's the press release: Robot scientist becomes first machine to discover new scientific knowledge

In an earlier posting, I discussed the idea of computers participating in the reasoning process of the scientific method. There are, as far as I can see, two fields that are applicable to this. One is machine learning, where a computer studies a body of data to find patterns in it. When done with statistical methods, this is called data mining. The other is automated reasoning such as is done with semantic web technology.

So I was quite interested to see the news story linked above. Researchers in the UK have connected a computer to some lab robotics and developed a system that was able to generate new scientific hypotheses about yeast metabolism, and then design and perform experiments to confirm the hypotheses.

This is important because there will always be limits to what human science can accomplish. Humans are limited in their ability to do research, requiring breaks, sleep, and vacations. Humans are limited in their ability to collaborate, because of personality conflicts, politics, and conflicting financial interests. Human talent and intelligence are limited; the Earth is not crawling with Einsteins and Feynmans.

That's obviously not to say that computers would have an unlimited capacity to do science. But their limits would be different, and their areas of strength would be different, and science as a combined effort between humans and computers would be richer and more fruitful than either alone.

I still think it's important to establish specifications for distributing this effort geographically. I would imagine it makes sense to build this stuff on top of semantic web protocols.

I like the idea that with computer assistance, scientific and medical progress might greatly accelerate, curing diseases (hopefully including aging) and offering solutions to perennial social problems like boom-and-bust economic cycles. Then we could all live in a sci-fi paradise.

Foresight Institute conference, Jan 16 and 17, 2010
The Foresight conference is just winding down. The talks were live-blogged over at NextBigFuture by Brian Wang who did a good job of concisely capturing the essentials. My own favorite talk was by Hod Lipson, who talked about a number of things, including something I find fascinating, the automation of science, about which I plan to blog more frequently.

The Adam project was reported in April 2009 in Science by Ross King et al. at Aberystwyth University. It used lab automation to perform experiments, and data mining to find patterns in the resulting data. Adam developed novel genomics hypotheses about S. cerevisiae yeast and tested them. Adam's conclusions were manually confirmed by human experimenters, and found to be correct. This was the first instance in human history where a machine discovered new scientific knowledge without human oversight.

Here is what I want to see computers doing in the coming years.
 * Look for patterns in data -- data mining
 * Propose falsifiable hypotheses
 * Design experiments to test those hypotheses
 * Perform the experiments and collect data
 * Confirm or deny hypotheses
 * Mine new data for new patterns, repeat the process

In the longer term, I want to see machine theoreticians and experimentalists collaborate with their human counterparts, both working in a scientific literature that is readable and comprehensible for both. This will require the development of a machine-parseable ontology (ideally a widely recognized standard) for sharing elements of the scientific reasoning process: data sets, hypotheses, predictions, deduction, induction, statistical inference, and the design of experiments.

So why do I want all this stuff? For one thing, it's interesting. For another, I am approaching the end of my life and I want to see scientific progress (and particularly medical progress) accelerate considerably in my remaining years. Finally, this looks to me like something where I can make some modestly valuable contribution to humanity with the time and energy I have left.

How hard is generating scientific hypotheses?
In the late 1500s, a Danish astronomer named Tycho Brahe collected an enormous amount of numerical data describing the motion of the planets. He worked with naked-eye instruments; the telescope was not turned on the sky until after his death. Brahe's assistant Johannes Kepler studied that data and arrived at some interesting conclusions which we now know as Kepler's laws of planetary motion:
 * The orbit of every planet is an ellipse with the Sun at a focus.
 * A line joining a planet and the Sun sweeps out equal areas during equal intervals of time.
 * The square of the orbital period of a planet is directly proportional to the cube of the semi-major axis of its orbit.

Kepler's laws were the starting point from which Isaac Newton formulated his law of gravitation, the inverse-square law that we all know and love.

We have here a three-step process: collect data, find mathematical patterns in the data, and create a theory that explains those patterns. Collecting data is simple in principle, and looking for mathematical patterns is also simple. Kepler's arithmetic was done by hand, but now we have computer programs (like Eureqa) which use genetic programming to find parsimonious mathematical formulas that fit sets of data. You can find Java applets on the web that demonstrate this idea.
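As a small illustration of the pattern-finding step, here is a sketch that recovers Kepler's third law from a handful of numbers. It is not genetic programming and not how Eureqa works: it assumes a power-law form T = c * a^k up front and fits k and c by ordinary least squares on the logarithms. The planetary values are rounded modern figures (semi-major axis in AU, period in years), not Brahe's data.

```python
import math

# (semi-major axis in AU, orbital period in years), rounded modern values
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(points):
    """Fit log(T) = k*log(a) + log(c) by least squares; return (k, c)."""
    xs = [math.log(a) for a, _ in points]
    ys = [math.log(t) for _, t in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    k = sxy / sxx
    c = math.exp(mean_y - k * mean_x)
    return k, c


exponent, coefficient = fit_power_law(list(PLANETS.values()))
print(f"T is roughly {coefficient:.3f} * a^{exponent:.3f}")
```

The fitted exponent comes out very close to 3/2, which is the third law: T squared is proportional to a cubed.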

So the first two steps aren't too hard. We can arrive rather easily at mathematical formulas that describe various experimentally measurable aspects of reality. That's a good thing. The hard job is the next step: finding theories or "likely stories" that explain why those formulas take whatever form they do. Sometimes the form of the math suggests a mechanism, because you've learned to associate elliptical orbits that have the attracting body at one focus with a central force that obeys an inverse-square law. (Hundreds of years after Newton, that is now a no-brainer.) But generally the problem is non-trivial and so far, as far as I'm aware, requires human insight.

Inference engines and automated reasoning
An inference engine is a computer program that reasons, using some form of knowledge representation.

This can be done with propositional logic or first-order logic, assuming each proposition is completely unambiguous and is either 100% true or 100% false. These simplistic engines are fun little exercises in programming but in real-world situations, reasoning usually needs to consider ambiguities and uncertainties. Instead of simply being true or false, propositions may be likely or unlikely, or their likelihood may be something to be tested or determined. Some elements of some propositions may be poorly defined.

In the unambiguous binary case, it's typical to express rules for generating new propositions as if-then rules with variables in them. We call these production rules because they are used to produce new propositions. If X is a man, then X is mortal. Given the statement "Socrates is a man", we
 * match the statement to the rule's IF clause
 * take note of all variable assignments: X=Socrates
 * plug assignments into the THEN clause: "Socrates is mortal"

Obviously this isn't rocket science, but even without handling uncertainty, it will still be useful if scaled to very large numbers of propositions, as in the semantic web.
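The match, bind, and substitute steps for the Socrates rule can be sketched in a few lines of Python. This is a toy forward-chaining loop under simplifying assumptions of my own: propositions are tuples of strings, variables start with "?", and each rule has a single IF clause. Real production systems handle multiple premises and much larger fact bases.

```python
def match(pattern, fact):
    """Return variable bindings if fact matches pattern, else None."""
    if len(pattern) != len(fact):
        return None
    bindings = {}
    for p, f in zip(pattern, fact):
        if p.startswith("?"):
            # A variable must bind to the same value everywhere it appears.
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings


def substitute(pattern, bindings):
    """Plug variable assignments into a pattern."""
    return tuple(bindings.get(term, term) for term in pattern)


def forward_chain(facts, rules):
    """Apply rules repeatedly until no new propositions are produced."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for if_clause, then_clause in rules:
            for fact in list(known):
                bindings = match(if_clause, fact)
                if bindings is not None:
                    new_fact = substitute(then_clause, bindings)
                    if new_fact not in known:
                        known.add(new_fact)
                        changed = True
    return known


rules = [(("man", "?x"), ("mortal", "?x"))]
facts = {("man", "Socrates")}
derived = forward_chain(facts, rules)
print(sorted(derived))
```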

How to handle uncertainty? This can be done by representing knowledge as a Bayesian network, a directed graph where the edges represent the influences and dependencies between random variables. There is a good tutorial about these online. Here's an example where the probability of rain is an independent variable, and the sprinkler system is usually off if it's raining, and the grass can get wet from either rain or the sprinkler.
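Here is a sketch of that rain, sprinkler, and wet grass network in plain Python, with inference done by brute-force enumeration of the joint distribution. The probability tables are illustrative numbers I picked, not measurements, and enumeration only works for tiny networks like this one; the libraries mentioned below use smarter algorithms.

```python
from itertools import product

# Illustrative (assumed) conditional probability tables.
P_RAIN = 0.2
P_SPRINKLER = {True: 0.01, False: 0.4}   # P(sprinkler on | rain)
P_WET = {                                # P(grass wet | sprinkler, rain)
    (False, False): 0.0,
    (False, True): 0.8,
    (True, False): 0.9,
    (True, True): 0.99,
}


def joint(rain, sprinkler, wet):
    """Probability of one complete assignment, following the graph's edges."""
    p = P_RAIN if rain else 1 - P_RAIN
    ps = P_SPRINKLER[rain]
    p *= ps if sprinkler else 1 - ps
    pw = P_WET[(sprinkler, rain)]
    p *= pw if wet else 1 - pw
    return p


def query(target, evidence):
    """P(target is True | evidence), summing the joint over all worlds."""
    names = ("rain", "sprinkler", "wet")
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue
        p = joint(**world)
        denominator += p
        if world[target]:
            numerator += p
    return numerator / denominator


print(f"P(rain | grass wet)    = {query('rain', {'wet': True}):.4f}")
print(f"P(wet | sprinkler on)  = {query('wet', {'sprinkler': True}):.4f}")
print(f"P(wet), no evidence    = {query('wet', {}):.4f}")
```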



There are at least two open-source inference engines that work with Bayesian networks. One is SMILE, another is the OpenBayes library for the Python language. OpenBayes allows you to update the state of your knowledge with a new observation.
 * Suppose now that you know that the sprinkler is on and that it is not cloudy, and you wonder what's the probability of the grass being wet : Pr(w|s=1,c=0). This is called evidence...
 * and then perform inference in the same way... The grass is much more likely to be wet because the sprinkler is on!

Here is a list of many more Bayesian network libraries, and another list. There is also a nice tutorial on Learning Bayesian Networks from Data, the process of taking a bunch of data and automatically discovering the Bayesian network that might have produced it. Another Bayesian reasoning system is BLOG.
 * Bayesian logic (BLOG) is a first-order probabilistic modeling language under development at MIT and UC Berkeley. It is designed for making inferences about real-world objects that underlie some observed data: for instance, tracking multiple people in a video sequence, or identifying repeated mentions of people and organizations in a set of text documents. BLOG makes it (relatively) easy to represent uncertainty about the number of underlying objects and the mapping between objects and observations.

Are production rule systems and Bayesian network systems mutually compatible? I don't yet know. Do Bayesian networks adequately represent all important forms of uncertainty or vagueness that one might encounter in working with real-world data? I don't know that either. Are there other paradigms I should be checking out? Probably.

Bayesian nets in RDF, and how to update them
I've banged my head on this for a couple of days and feel close to a solution. The graph looks like this.



Each random boolean variable gets a node, the causal relationship between them gets a node, and each variable gets a probability. The math for updating probabilities is a little tricky, but in a fun and interesting way, so I enjoyed banging on that. At some point I'll tackle more involved cases where there aren't simply two random boolean variables, but that's the logistically simple case that exposes most of the concepts involved. Kinda like the Drosophila of Bayesian inference.
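A rough sketch of the two-variable case, with the graph held as a plain list of (subject, predicate, object) triples rather than in a real RDF store. The "ex:" vocabulary, the node names, and the numbers are all invented for illustration; this is not the actual graph from my notes, just the same shape: one node per variable, one node for the causal link, and a Bayes update of the cause when the effect is observed.

```python
# Triples are (subject, predicate, object); the "ex:" names are made up.
TRIPLES = [
    ("ex:A", "ex:probability", 0.3),
    ("ex:link", "ex:cause", "ex:A"),
    ("ex:link", "ex:effect", "ex:B"),
    ("ex:link", "ex:probEffectGivenCause", 0.9),
    ("ex:link", "ex:probEffectGivenNotCause", 0.2),
]


def lookup(triples, subject, predicate):
    """Return the object of the first matching triple."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    raise KeyError((subject, predicate))


def cause_given_effect(triples, link, effect_is_true):
    """P(cause | observed effect) for one causal link, by Bayes' theorem."""
    cause = lookup(triples, link, "ex:cause")
    p_cause = lookup(triples, cause, "ex:probability")
    p_e_c = lookup(triples, link, "ex:probEffectGivenCause")
    p_e_nc = lookup(triples, link, "ex:probEffectGivenNotCause")
    if not effect_is_true:
        p_e_c, p_e_nc = 1 - p_e_c, 1 - p_e_nc
    evidence = p_e_c * p_cause + p_e_nc * (1 - p_cause)
    return p_e_c * p_cause / evidence


def with_observation(triples, link, effect_is_true):
    """Return a new triple list with the cause's probability updated."""
    cause = lookup(triples, link, "ex:cause")
    new_p = cause_given_effect(triples, link, effect_is_true)
    return [
        (s, p, new_p) if (s == cause and p == "ex:probability") else (s, p, o)
        for s, p, o in triples
    ]


updated = with_observation(TRIPLES, "ex:link", effect_is_true=True)
print(lookup(updated, "ex:A", "ex:probability"))
```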

What big topics haven't I addressed yet?
How much of this stuff is actually relevant? How much would it require me to redesign everything? What are the criteria for relevance? My goals are (1) interesting work for the remaining 15 years of my working life, and (2) stuff that contributes to the automation of science project.
 * AI in games
 * Game-learning AI (chess, checkers, go): Deep Blue, Blondie24
 * Fuzzy logic
 * Neural nets
 * Heuristics
 * Non-exhaustive search: gradient descent, conjugate gradient, genetic algorithms, simulated annealing