Scott McCloskey

ICSG 755 - Neural Networks and Machine Learning

Winter 1999-2000

Research Project Report

Pruning Algorithms for Neural Networks

Despite the numerous successes of neural networks in solving interesting problems, there are still obstacles that prevent these solutions from being used in real commercial applications. One such inhibitor is the fact that neural networks can be large enough that the speed of the feed-forward process prohibits their use. In many cases, the slow speed is a consequence of the size of the network: larger networks simply require longer to pass data through than their smaller counterparts. Because of this, and other motivations that will be discussed later, generating the smallest network that will solve a given problem becomes an interesting problem in its own right. Several methods have been proposed to perform this minimization, and the purpose of this paper is to analyze a number of them that fall into the category of pruning algorithms.

The largest motivation, the speed of the feed-forward process of the network, has already been mentioned. It should come as no surprise that the execution time of a feed-forward network is largely determined by its topology. Each weight in the topology results in a multiplication in the feed-forward phase, and each neuron results in the evaluation of a function (sigmoid, hyperbolic tangent, etc.) in addition to the cost of the weights contributing to that neuron's input. Thus network topology is the primary concern in developing fast neural networks.
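
As a rough illustration of this cost model, the following sketch performs a single feed-forward pass; the layer sizes, weights, and sigmoid activation are chosen arbitrarily for illustration, and bias connections are omitted for brevity. Every weight costs one multiplication, and every non-input neuron costs one activation-function evaluation.

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    def feed_forward(inputs, layers):
        # layers[l][j][i] is the weight from neuron i of layer l to neuron j of layer l+1
        activations = inputs
        for weights in layers:
            next_activations = []
            for row in weights:
                # one multiplication (and addition) per weight ...
                total = sum(w * a for w, a in zip(row, activations))
                # ... plus one activation-function evaluation per neuron
                next_activations.append(sigmoid(total))
            activations = next_activations
        return activations

    # Example: an arbitrary 2-3-1 network
    hidden_weights = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]
    output_weights = [[0.7, -0.4, 0.9]]
    print(feed_forward([1.0, 0.0], [hidden_weights, output_weights]))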

Another important motivation is the lack of alternative methods for generating minimal neural network topologies. While there are formal methods for finding the minimal network topology in general, such methods are typically too complicated to be used for specific applications [1]. Because of this, more practical (though possibly less accurate) methods such as pruning are important in real applications.

A third important motivation for using a minimal network topology is the notion that smaller networks generalize better to test inputs not in the original training set. The basic idea behind this belief is that larger networks, due to their excessive topologies, will over-constrain the problem and simply memorize the training data, much as training a network to zero error will over-constrain the problem. At a high level, the belief is that smaller topologies capture the essence of a problem, whereas a larger network simply memorizes the input/output pairs in the training data.

The basic idea behind all pruning algorithms is to start with a neural network topology (a structure of layers, neurons, and weights) that is large enough to solve the problem of interest and iteratively remove neurons and/or weights until the network is unable to learn the training data. In order to do so, a number of heuristics have been defined for selecting the neuron or weight to be removed at each step. Most of these heuristics are defined so that they choose the neuron or weight whose removal will cause the least disturbance in the network's ability to learn the data. The output of a pruning algorithm is a network topology that is minimal given the heuristic used and the input topology; that is, it is the smallest topology in the iteration defined above that still converges for the training data.
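
In outline, every pruning algorithm considered here follows the same loop. The sketch below is only schematic: the network object and the train and converges helpers are assumed for illustration and do not correspond to a particular implementation.

    def prune(network, training_data, heuristic):
        # Generic pruning loop: remove items (weights or neurons) until the
        # network can no longer learn the training data, then return the last
        # topology that still converged.
        train(network, training_data)          # assumed helper: train to convergence
        last_good = network.copy()
        while True:
            item = heuristic(network)          # choose the next weight/neuron to remove
            network.remove(item)
            train(network, training_data)
            if not converges(network, training_data):   # assumed helper
                return last_good               # smallest topology that converged
            last_good = network.copy()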

With the basic framework of a pruning algorithm in place, the most important choice in constructing an implementation is the selection of a pruning heuristic. A number of heuristics have been suggested (a list can be found in Thimm and Fiesler [2]), and, as previously mentioned, most are designed to select the neuron or weight to which the network is least sensitive. The sensitivity of a network to a neuron or weight is a measure of the error that is introduced when that neuron or weight is removed.
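
Stated as code, and using the same assumed network interface as the sketch above together with an assumed error function that returns the training error, sensitivity can be measured directly by removal:

    def sensitivity(network, item, training_data, error):
        # Error introduced by removing one weight or neuron: the training error
        # without the item minus the training error with it.
        pruned = network.copy()
        pruned.remove(item)
        return error(pruned, training_data) - error(network, training_data)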

The simplest heuristic is the one that selects the weight with the smallest absolute value. The logic behind this is that the weight with the smallest absolute value contributes the least to the neuron it feeds, and thus the network is least sensitive to it. Along with this selection heuristic comes the suggestion that the weight's mean contribution (the mean of the input arriving at the neuron along the selected weight) be added to the weight between the bias and the neuron that the removed weight contributed to.
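
A sketch of this heuristic, under the assumption that the non-bias weights are kept as (source, target, value) records and that each weight's contributions were recorded during training (the data layout is hypothetical):

    def prune_smallest_weight(weights, bias_weights, contributions):
        # weights: list of (source, target, value) connections, bias weights excluded
        # bias_weights[target]: the weight from the bias neuron to each target neuron
        # contributions[i]: the inputs carried along weights[i] during training
        idx = min(range(len(weights)), key=lambda i: abs(weights[i][2]))
        source, target, value = weights.pop(idx)
        contrib = contributions.pop(idx)
        # fold the removed weight's mean contribution into the target's bias weight
        bias_weights[target] += sum(contrib) / len(contrib)
        return source, target, value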

Another selection heuristic is to choose the weight with the smallest variance in its contribution. The logic behind this heuristic is that, if a weight always contributes the same value to a neuron, then it is essentially acting as a bias. In that case, the weight can be removed and its value added to the weight from the bias to the target neuron. In the general case, the weight with the lowest variance is the one acting most like a bias, and thus it is selected for removal. As before, the weight's mean contribution is added to the weight from the bias to the target neuron.
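
The variance-based selection differs only in the ranking criterion; a sketch, reusing the same hypothetical data layout as above:

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    def prune_lowest_variance_weight(weights, bias_weights, contributions):
        # Remove the weight whose recorded contribution varies the least, i.e. the
        # weight acting most like a bias, and fold its mean contribution into the
        # target neuron's bias weight.
        idx = min(range(len(weights)), key=lambda i: variance(contributions[i]))
        source, target, value = weights.pop(idx)
        contrib = contributions.pop(idx)
        bias_weights[target] += sum(contrib) / len(contrib)
        return source, target, value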

Scores of other selection heuristics have been suggested, though most are significantly more complicated than those mentioned previously. For the sake of brevity, they are omitted here.

Four pruning methods were selected, implemented, and tested for this project. Those four methods are as follows:

  1. Select the weight with the smallest absolute value. Remove this weight without changing the values of any neighboring weights.
  2. Select the hidden neuron with the smallest variance in its outputs. This is measured by observing the outputs of each hidden neuron during training. The output variance for neuron j is defined to be

     OV_j = Σ_k |O_{j,k} - O_{j,k-1}|,

     where O_{j,k} is the output of neuron j at the kth step of training. Similar to method #1, the neuron is removed without changing the values of any neighboring weights.

  3. Weights are selected as in method #1, but the value of the removed weight is distributed evenly amongst all other weights contributing to the same output neuron.
  4. Neurons are selected as in method #2, but the values of the weights leaving the removed neuron are distributed (as in method #3) amongst all other weights contributing to the same output neuron. Weights contributing to the removed neuron are removed without changing the values of any neighboring weights.

In all of the above methods, an additional constraint was imposed: weights or neurons involving the bias neuron were ineligible to be pruned.
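
The selection measure used by methods #2 and #4, and the redistribution step used by methods #3 and #4, can be sketched as follows. The recorded_outputs structure is assumed for illustration, and the bias constraint is respected simply by excluding bias weights from the lists passed to these functions.

    def output_variance(outputs):
        # OV_j for one hidden neuron: the sum of absolute differences between the
        # neuron's outputs at successive training steps, |O_{j,k} - O_{j,k-1}|.
        return sum(abs(curr - prev) for curr, prev in zip(outputs[1:], outputs))

    def select_neuron(recorded_outputs):
        # recorded_outputs[j]: outputs of hidden neuron j observed during training.
        # Methods #2 and #4 select the hidden neuron with the smallest OV_j.
        return min(range(len(recorded_outputs)),
                   key=lambda j: output_variance(recorded_outputs[j]))

    def redistribute(removed_value, remaining_weights):
        # Methods #3 and #4: spread the removed weight's value evenly among the
        # other (non-bias) weights contributing to the same receiving neuron.
        share = removed_value / len(remaining_weights)
        return [w + share for w in remaining_weights]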

In order to test the pruning methods selected above, a network computing the XOR function was used. We have seen previously that the XOR function can be accurately learned by a fully connected network with 2 input neurons, 2 hidden neurons, and 1 output neuron, with both the hidden and output neurons connected to a bias neuron. This topology, then, serves as a rough estimate of the minimum topology (though it has not been proven to be minimal). To test the pruning, a starting topology was chosen with 2 input neurons, 5 hidden neurons, and 1 output neuron. The network is fully connected, with bias neurons contributing to the neurons in the hidden and output layers, bringing the total number of weights to 21.
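
The weight count for this starting topology can be tallied directly:

    n_input, n_hidden, n_output = 2, 5, 1
    n_weights = (n_input * n_hidden      # input -> hidden:  10
                 + n_hidden * n_output   # hidden -> output:  5
                 + n_hidden              # bias -> hidden:    5
                 + n_output)             # bias -> output:    1
    print(n_weights)                     # 21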

Using this topology as a starting point, 50 initial weight sets were found that were known to converge to a mean squared error of less than 0.005. All four pruning methods were run on each of the 50 initial weight sets, and pruning continued until the network was no longer able to converge. For each method, the number of items (neurons or weights) pruned was measured; the results are summarized in the following table:

Method    Nodes or weights removed
          Min          Max          Mean
#1        2 weights    9 weights    4 weights
#2        3 nodes      3 nodes      3 nodes
#3        1 weight     7 weights    3.4 weights
#4        3 nodes      3 nodes      3 nodes

Note: 1 node = 5 weights

Surprisingly, both neuron selection methods (#2 and #4) succeeded in pruning the neural network to the assumed minimum topology in all 50 cases. This is surprising because there is no reason to predict that the results in any of the 50 cases should resemble one another, as the initial weight sets are unrelated (except that they were all generated by the same random number generator). Consequently, this leads to the tentative conclusion that methods #2 and #4 are in some sense robust, in that they consistently arrive at the same topology regardless of the initial weight set. This makes no claim, however, about the ability of these methods to converge to the same topology given different starting topologies.

Also a surprise was the fact that the weight selection methods (#1 and #3) performed quite poorly. Intuitively, the freedom to choose individual weights (as opposed to groups of weights, as with methods #2 and #4) should lead to a better (smaller) result, since weights are removed at a slower rate. In addition to producing larger topologies, the weight selection criteria produced very different results from one initial weight set to the next.

Another observation based on these results is that none of the pruning methods succeeded in producing a topology smaller than the assumed minimum, which contained 2 hidden neurons and a total of 9 weights.

Based on the results of this test, method #4 was selected for use in the more interesting case, pruning a network trained to recognize handwritten digits. Only one method was tested on this more interesting case because of the limited processing power available.

In the case of handwritten digits, two results are of interest: first, the final size of the pruned network, and second, the success of the resulting network in classifying handwritten digits not in the training set. The expectation, as mentioned earlier, is that the pruned network will outperform the unpruned network on the non-training digits.

For this test, a starting topology was chosen with 256 input neurons (corresponding to the 256 pixels in the input bitmap), 15 hidden neurons, and 10 output neurons. The convergence criterion was set to be a mean squared error (over the entire training set) of 0.09, and the training set consisted of twenty input/output pairs (2 for each digit). A small training set was selected both for speed and to ensure that the unpruned network would incorrectly classify some test data, leaving room for observed improvement by the pruned networks. Subsequent (less rigorous) testing was also done with larger training sets. A testing set of 400 input/output pairs (40 for each digit) was selected, with no repeats of the pairs from the training set.

For each test case, a set of weights conforming to the starting topology was chosen to give a starting network. This network was then trained until it satisfied the convergence criterion. The network was next pruned using the selected pruning method, operating on the same training data. Finally, both the pruned and unpruned networks were tested against the testing set, using the winner-takes-all method of interpreting the networks' outputs. The total number of correctly classified testing inputs was counted for both weight sets.
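
The evaluation step can be sketched as follows; feed_forward stands in for the trained network's forward pass (as outlined earlier), and the test_set layout is assumed for illustration.

    def classify(network, pixels):
        # Winner-takes-all: the predicted digit is the index of the largest output.
        outputs = feed_forward(pixels, network)
        return max(range(len(outputs)), key=lambda i: outputs[i])

    def count_correct(network, test_set):
        # test_set: list of (pixel_vector, digit_label) pairs outside the training set
        return sum(1 for pixels, label in test_set
                   if classify(network, pixels) == label)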

The following table summarizes the results of this experiment.

# training pairs    Statistic    # correct, starting topology    # neurons pruned    # correct, pruned topology
20                  mean         203.7250                        9.4500              183.0500
20                  min          177                             7                   135
20                  max          227                             11                  226
100                 mean         289.4                           10                  258.6
100                 min          281                             9                   234
100                 max          300                             11                  271
400*                --           400                             5                   400
*Note: Only one network was trained with all 400 training pairs, and thus those results may or may not be indicative of the process in general.

Contrary to expectations, it is apparent from these results that the smaller network topologies did not perform better on the testing set. Quite the contrary: the smaller network's performance was almost always worse than that of the larger, unpruned network. Furthermore, while the performance of both networks on the testing set improved with larger training sets, the smaller network never overcame the unpruned network's performance on the testing set. A related observation is that networks with fewer neurons pruned seemed to perform better than networks with a larger number of neurons pruned; in some cases, networks with a small number of neurons pruned generalized better than their corresponding unpruned networks.

From the experiments executed in the course of this project, it is reasonable to conclude that pruning methods can be very useful in reducing the size of a neural network. The results for the more interesting example of digit recognition indicate that such algorithms can substantially reduce the size, and thus the execution time, of networks used to solve practical problems. This speed improvement, it should be mentioned, does not come without a cost: for each neuron or weight pruned from the network, a new round of training was required. This further increases the time it takes to train the network, even as it potentially reduces the execution time after training.

Future work is plentiful as a result of this project. Certainly, methods #1-#3 could also be tested on the digit recognition network to determine which of the four provides the best (and, perhaps, most consistent) results for this problem. Furthermore, other selection heuristics should be investigated in order to gain a broader understanding of the capabilities of pruning algorithms. Finally, new criteria for halting the pruning process should be experimented with. For this project, pruning was halted when the network could no longer successfully learn the training data; halting criteria that stop the pruning before this point may have a significant impact on the ability of the pruned network to generalize, as observed in the case of digit recognition.

References

  1. G. Thimm and E. Fiesler, Pruning of Neural Networks, p. 2.
  2. G. Thimm and E. Fiesler, Pruning of Neural Networks, p. 2.

Works Cited

G. Thimm and E. Fiesler. Pruning of Neural Networks. IDIAP Research Report #97-03, Dalle Molle Institute for Perceptive Artificial Intelligence, Valais, Switzerland. Available online at http://www.idiap.ch/~thimm/thimm-97.3.bib.abs.html