User:WillWare/Elementary production system

Having examined the definition of a production system, now it's time to build one. This system will be "elementary" in the sense that it will do things in the most obvious way, and will not scale to large numbers of facts and rules. That is where the Rete algorithm will become important.

Notations for production rule systems vary slightly from one implementation to another. Most of the notations look pretty similar to what I'll use here. A common convention is that variable names begin with question marks.

A production system includes facts and rules. Facts represent specific bits of knowledge about specific things in the world. Rules are used to work with facts to generate new knowledge, frequently by deduction, or to perform some task based on a particular combination of facts.

Facts can be represented as Lisp lists.

 (Bob age 35)
 (Bob eyecolor blue)
 (Ted marriedto Annie)
 (smallsquares (1 1) (2 4) (3 9) (4 16))

A particularly simple and flexible representation for facts is a triplet, such as (Bob age 35) or (Bob eyecolor blue), and we can label the three elements the subject, the predicate, and the object. Any fact more complex than a triplet can be represented as one or more triplets with no loss of expressive power, and with a simplification of the system software. So the last fact could be rewritten as these triplets.

 (1 square 1)
 (2 square 4)
 (3 square 9)
 (4 square 16)

To build rules, we will need variables to build patterns that will match facts. For example, (?a age 35) will match any fact saying that somebody is 35 years old.
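Matching a single pattern against a single fact can be sketched in a few lines of Python. This is only an illustration and is not taken from prodsystem.py: the names is_variable and match are assumptions, facts are written as Python tuples rather than Lisp lists, and the fact (Ted age 35) is made up for the example.

```python
def is_variable(term):
    """A term is a variable if it is a string starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact, bindings=None):
    """Match one triplet pattern against one triplet fact.

    Returns a dict of variable bindings on success, or None on failure.
    An existing bindings dict may be passed in; it is not modified.
    """
    if len(pattern) != len(fact):
        return None
    result = dict(bindings or {})
    for p, f in zip(pattern, fact):
        if is_variable(p):
            if p in result and result[p] != f:
                return None  # variable already bound to something else
            result[p] = f
        elif p != f:
            return None  # constants must match exactly
    return result


# (Ted age 35) is a hypothetical fact added for illustration.
facts = [("Bob", "age", 35), ("Bob", "eyecolor", "blue"), ("Ted", "age", 35)]
candidates = (match(("?a", "age", 35), fact) for fact in facts)
matches = [b for b in candidates if b is not None]
print(matches)  # [{'?a': 'Bob'}, {'?a': 'Ted'}]
```

A successful match on a pattern with no variables returns an empty dict, which is why the example tests for None rather than truthiness.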

A rule (sometimes called a "production") has the following parts.
 * A name - optional, but helpful for diagnostic purposes.
 * One or more patterns which tell us when the rule should be applied. Patterns can include variables, and when they do, the variables will be bound to elements in the matching facts.
 * One or more actions to be taken when the rule is applied. The most common action is to assert a new fact, but other actions are possible.

To process the rule, we match the patterns against our facts while keeping the variable bindings consistent: we can't bind ?a to Bob in one pattern and to Fred in another. Once all the patterns match with consistent bindings for every variable, we apply those bindings to the actions and perform the actions.
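The matching procedure for a whole rule can be sketched as a recursive search over the fact list. This is a minimal illustration under stated assumptions, not the code in prodsystem.py: the function names (match, all_bindings, substitute, apply_rule) are hypothetical, the only action supported is asserting a new fact, and the "spouseage" rule and the (Annie age 35) fact are invented for the example.

```python
def match(pattern, fact, bindings):
    """Extend bindings so that pattern matches fact, or return None."""
    if len(pattern) != len(fact):
        return None
    result = dict(bindings)
    for p, f in zip(pattern, fact):
        if isinstance(p, str) and p.startswith("?"):
            if result.setdefault(p, f) != f:
                return None  # conflicts with an earlier binding
        elif p != f:
            return None
    return result


def all_bindings(patterns, facts, bindings=None):
    """Yield every consistent set of bindings that satisfies all patterns."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for fact in facts:  # full sweep of the fact list for each pattern
        extended = match(first, fact, bindings)
        if extended is not None:
            yield from all_bindings(rest, facts, extended)


def substitute(template, bindings):
    """Replace variables in an action template with their bound values."""
    return tuple(bindings.get(t, t) if isinstance(t, str) else t
                 for t in template)


def apply_rule(patterns, assertions, facts):
    """Return the new facts that the rule would assert."""
    new_facts = []
    for b in all_bindings(patterns, facts):
        for template in assertions:
            fact = substitute(template, b)
            if fact not in facts and fact not in new_facts:
                new_facts.append(fact)
    return new_facts


facts = [("Ted", "marriedto", "Annie"), ("Annie", "age", 35), ("Bob", "age", 35)]
patterns = [("?a", "marriedto", "?b"), ("?b", "age", "?n")]
assertions = [("?a", "spouseage", "?n")]
print(apply_rule(patterns, assertions, facts))  # [('Ted', 'spouseage', 35)]
```

Note that ?b is bound to Annie by the first pattern, so (Bob age 35) is rejected by the second pattern even though it would match in isolation.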

Our production system will be written in Python, a cross-platform language that's easy to read. There is room for some latitude in defining a production system, and we will add a few nifty features.
 * We will be able to specify that a pattern is negated, that is, the rule should be applied when that pattern cannot be matched.
 * We will allow some arithmetic in patterns.
 * We will allow the user to define new actions in Python.

Talk about the source code a little
''Give some idea what's going on and how it relates to what's been said so far. Show examples of the rules and facts and queries that this system understands, and how it responds to them.''

This approach can be considered a reference implementation. Its purpose is clarity, not performance. The source code is here:
 * http://code.google.com/p/wware-autosci/source/browse/prodsystem/prodsystem.py

The Rete algorithm discussed below gives a big performance improvement.

Failure to scale
The little system above works fine for tiny toy problems. But imagine what might happen in a system with many thousands of rules and many millions of facts. Rules might have hundreds or thousands of patterns. What could possibly go wrong?

If you look at what's happening in Rule.apply, you'll see that it's pretty inefficient. Immediately it calls a recursive function, which fishes around for consistent bindings for each pattern of the rule, sweeping through the fact list repeatedly. Here's a log-log graph of the performance improvement with the Rete algorithm compared to this naive approach. The horizontal axis is the number of cars in an example, where each car has three facts. The vertical axis is the number of seconds to process a single rule with four patterns.



The Rete algorithm
The Rete algorithm is an efficient pattern matching algorithm for implementing production rule systems. It was designed by Dr Charles L. Forgy of Carnegie Mellon University, first published in a working paper in 1974, and later elaborated in his 1979 Ph.D. thesis and a 1982 paper (see References). Rete has become the basis for many popular expert system shells, including CLIPS, Jess, Drools, BizTalk Rules Engine and Soar.

This description of the Rete algorithm is distilled from R. Doorenbos's PhD dissertation, ''Production Matching for Large Learning Systems'', which also describes a variant named Rete/UL, optimized for large systems.

A Rete is a dataflow network in two sections, an Alpha network and a Beta network.

The Alpha network populates working memories for each pattern of each rule, exploiting repetitions of patterns where possible. When a new fact is presented to the Alpha network, the fact is matched to each pattern, and if a match is possible, the resulting variable bindings are recorded. Pattern matching is sped up using a decision tree, with decisions based on constants in the pattern. Because the nodes of the decision tree must represent constants in the pattern, and the triplet's second element (its predicate) is often a constant, the decision tree's first branchpoint is often the predicate. A pattern with no constants, e.g. (?x ?y ?z), will match all facts.

The Beta part of the network primarily contains join nodes and beta memories. Join nodes test for consistency of variable bindings between conditions. Beta memories store "tokens" which match some but not all of the conditions of a production.

The Alpha network
The mechanism for populating the working memory for a particular pattern works like this.

 fact --> binary search on first variable --(no match)---> discard fact
                     |
               (match found)
                     |
                     V
          binary search on second variable --(no match)---> discard fact
                     |
                   (etc)
                     +> save binding in working memory

The operation of the Alpha network is then to present facts sequentially to these mechanisms. They operate entirely independently so this presents an opportunity for parallelism.
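A much simplified Alpha network can be sketched as follows. It is an illustration only and does not reproduce rete.py: the class name AlphaNetwork and its attributes are assumptions, and the decision tree is reduced to a single dictionary lookup on the predicate, with a fallback list for patterns whose predicate is a variable.

```python
from collections import defaultdict


def is_variable(term):
    return isinstance(term, str) and term.startswith("?")


def match(pattern, fact):
    """Return bindings if the triplet pattern matches the fact, else None."""
    bindings = {}
    for p, f in zip(pattern, fact):
        if is_variable(p):
            if bindings.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return bindings


class AlphaNetwork:
    """Keeps one working memory (a list of bindings) per distinct pattern."""

    def __init__(self, patterns):
        self.memories = {}                     # pattern -> list of bindings
        self.by_predicate = defaultdict(list)  # constant predicate -> patterns
        self.wildcards = []                    # patterns with a variable predicate
        for pattern in patterns:
            if pattern in self.memories:
                continue  # a pattern shared by several rules is stored once
            self.memories[pattern] = []
            if is_variable(pattern[1]):
                self.wildcards.append(pattern)
            else:
                self.by_predicate[pattern[1]].append(pattern)

    def add_fact(self, fact):
        """Present one fact; only patterns that could match are tested."""
        candidates = self.by_predicate.get(fact[1], []) + self.wildcards
        for pattern in candidates:
            bindings = match(pattern, fact)
            if bindings is not None:
                self.memories[pattern].append(bindings)


net = AlphaNetwork([
    ("?a", "age", "?n"),
    ("?a", "eyecolor", "blue"),
    ("?a", "age", "?n"),   # repeated pattern, shares one memory
    ("?x", "?y", "?z"),    # no constants, matches every fact
])
for fact in [("Bob", "age", 35), ("Bob", "eyecolor", "blue"),
             ("Ted", "marriedto", "Annie")]:
    net.add_fact(fact)
print(net.memories[("?a", "age", "?n")])  # [{'?a': 'Bob', '?n': 35}]
```

Because each pattern's memory is updated independently of the others, the loop over candidate patterns is the part that could be distributed across workers.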

Bindings, compatibility, and merging
The Beta network works with bindings, which are sets of variables and the values assigned to them. Here are some examples of bindings.

 {'?a': 'Al', '?b': 'eats', '?c': 'burgers'}
 {'?a': 'Bob', '?b': 'golf'}
 {'?b': 'wife', '?c': 'Deb'}
 {'?a': 'Deb', '?c': 'salad'}

Two bindings are compatible if they do not assign the same variable to two different values. The bindings above are incompatible because ?a is assigned to Al, Bob, and Deb in different places. Likewise ?b and ?c are assigned different values. Here is a set of bindings that are all mutually compatible because the values assigned to the variables are consistent throughout.

 {'?a': 'Al', '?b': 'eats', '?c': 'burgers'}
 {'?a': 'Al', '?d': 'wife', '?e': 'Deb'}
 {'?e': 'Deb', '?b': 'eats', '?f': 'salad'}

Once a group of bindings has been determined to be compatible, they can be merged into a single binding. This simply involves grouping all the assignments into one binding without repetition. With the previous example, the result of a merge would look like this.

 {'?a': 'Al', '?b': 'eats', '?c': 'burgers', '?d': 'wife', '?e': 'Deb', '?f': 'salad'}

The Beta network only ever needs to check the compatibility of, and merge, two bindings at a time.
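With bindings represented as Python dictionaries, as in the examples above, the two operations can be sketched as below. The function names compatible and merge are assumptions chosen for illustration; they are not necessarily the names used in rete.py.

```python
def compatible(b1, b2):
    """True if no variable is assigned different values in b1 and b2."""
    return all(b2[var] == value for var, value in b1.items() if var in b2)


def merge(b1, b2):
    """Combine two compatible bindings into a single binding."""
    if not compatible(b1, b2):
        raise ValueError("bindings are incompatible")
    return {**b1, **b2}


x = {'?a': 'Al', '?b': 'eats', '?c': 'burgers'}
y = {'?a': 'Al', '?d': 'wife', '?e': 'Deb'}
z = {'?e': 'Deb', '?b': 'eats', '?f': 'salad'}

print(compatible(x, {'?a': 'Bob', '?b': 'golf'}))  # False
merged = merge(merge(x, y), z)  # two bindings at a time
print(merged)
# {'?a': 'Al', '?b': 'eats', '?c': 'burgers', '?d': 'wife', '?e': 'Deb', '?f': 'salad'}
```

Merging pairwise is enough here because the result of merging two compatible bindings is itself a binding that can be tested against a third.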

Machinery
Nodes marked "Cn" represent the working memories compiled by the Alpha network, and nodes marked "Dn" represent the results of joining pairs of working memories. Each "Join" node has the job of taking the contents of two working memories (each being a collection of bindings), and merging each compatible pair into a single binding.

 Rule1: C1, C2, C3
 Rule2: C3
 Rule3: C2, C4

 C1 --- Join ---> D1 --- Join ---> Rule1
       /                /
 C2 *-*                /
     \                /
 C3 --\--------------*---> Rule2
       \
 C4 --- Join ---> Rule3

Joining is a potential performance problem. Given two working memories with N bindings each, the most obvious approach would require O(N^2) time, which is likely to be unacceptable. In fact it is possible to join two sets of bindings in O(N) time, as follows.

Let the two sets of bindings be {Xi} and {Yi} respectively....
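The text does not spell out the construction. One standard technique, which may or may not be what was intended here, is a hash join: index one set of bindings by the values of the variables the two sets share, then probe that index with each binding from the other set. The sketch below assumes that the shared variables are known in advance, that every binding on both sides assigns all of them, that the two sides have no other variables in common, and that the bound values are hashable. The name hash_join and the sample bindings are hypothetical.

```python
from collections import defaultdict


def hash_join(xs, ys, shared):
    """Join two lists of bindings on the variables named in `shared`."""
    index = defaultdict(list)
    for x in xs:  # one pass over xs to build the index
        index[tuple(x[v] for v in shared)].append(x)
    joined = []
    for y in ys:  # one pass over ys to probe the index
        for x in index.get(tuple(y[v] for v in shared), []):
            joined.append({**x, **y})  # compatible by construction
    return joined


# Hypothetical working memories for (?a marriedto ?b) and (?b age ?n).
c1 = [{'?a': 'Ted', '?b': 'Annie'}, {'?a': 'Al', '?b': 'Deb'}]
c2 = [{'?b': 'Annie', '?n': 35}, {'?b': 'Bob', '?n': 35}]
d1 = hash_join(c1, c2, ['?b'])
print(d1)  # [{'?a': 'Ted', '?b': 'Annie', '?n': 35}]
```

Building and probing the index each take one pass, so the expected running time is proportional to N plus the number of joined bindings produced. If every binding shares the same key, the output itself has N^2 entries, and no join method can avoid that cost.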

Give some kind of overview of how the Beta stuff is working below.

Opportunities for parallel computation
Two sorts of parallel computation are currently in common use. One is a networked cluster, usually of Linux boxes for ease of administration. This approach is appropriate when tasks can be broken into reasonably large subtasks. Some work has been done in modifying the Linux kernel to accelerate relevant network protocols, but much of this work is not publicly available.

An increasingly popular approach to parallel computing, arguably more cost-effective, is the use of GPU boards. Costing usually only a few hundred dollars, these boards offer dozens or hundreds of processor cores. There are usually mildly esoteric constraints and restrictions to be considered when programming in this domain. These typically involve the size and organization of memory available to the GPU cores, and the overhead of moving data between CPU memory and GPU memory.

It is of considerable interest to review the Rete algorithm code with the goal of parallelizing it for execution on a GPU board. Because GPU code is typically written in C or a C-like language rather than Python, it will be necessary to identify pieces of code that can be translated to C, and then to provide a C extension callable from the remaining Python code.

Source code

 * http://code.google.com/p/wware-autosci/source/browse/prodsystem/rete.py