XORNoise


1. Training Neural Networks in the Presence of Random Distractions

1.1. Motivation

Developing robots are attracted to parts of their environment that exhibit novel and exciting behavior. Unfortunately, not all of this behavior can be learned. In particular, some parts of the world behave randomly, such as a staticky TV or curtains blowing in the wind. A developing robot needs to be able to recognize parts of its environment as unlearnable and ignore them in favor of those parts of its environment which it can make predictions about. To this end, we have developed two new learning techniques for neural networks which improve the speed and accuracy of learning in the presence of random distractions.

1.2. Method

Both of our techniques use a standard three-layer backpropagation neural network, composed of an input layer, a hidden layer, and an output layer; the output layer is split into several components. The first component of the output layer, labeled Output in the figure below, attempts to solve the problem represented in the input patterns. In our tests we used the double XOR problem (1101 -> 01).
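As a rough illustration of the double XOR problem, the sketch below enumerates all 16 four-bit inputs and pairs each with two XOR outputs. It assumes the first output is the XOR of bits 1 and 2 and the second is the XOR of bits 3 and 4, which is consistent with the example 1101 -> 01 but is not spelled out above. The function name is hypothetical, and any flag bits marking random samples are not reproduced here.

```python
from itertools import product

def double_xor_dataset():
    """Return all 16 (input, target) pairs for the double XOR problem.

    Assumption: output 1 = bit1 XOR bit2, output 2 = bit3 XOR bit4.
    """
    data = []
    for a, b, c, d in product((0, 1), repeat=4):
        data.append(((a, b, c, d), (a ^ b, c ^ d)))
    return data

dataset = double_xor_dataset()
lookup = dict(dataset)
example_target = lookup[(1, 1, 0, 1)]  # the pattern quoted in the text
```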

The next components of the output layer perform Error Anticipation (EA for short). EA involves training the network to anticipate the error of its own Output component. To accomplish this, the input activations are first propagated through the network (without doing any learning) and the activation of the Output component is compared with its training target, resulting in a vector of per-unit errors. Then, in the actual training phase, the target for the EA1 component of the output layer is this unit-by-unit difference between the output activation and the target value. This process can be repeated many times: another EA component can predict the error of the previous EA component (EA2 would predict the error on EA1), and so on. Most of our experiments have two EA components.
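The chaining of EA targets might be sketched as follows. This is only an illustration under stated assumptions: the sign convention (target minus activation) and the absence of any rescaling into the range of the squashing function are guesses, since the text does not specify them, and the function names are hypothetical.

```python
def error_vector(activations, targets):
    """Per-unit difference between targets and activations.

    Assumption: the sign convention is target - activation.
    """
    return [t - a for a, t in zip(activations, targets)]

def ea_targets(output_act, output_target, ea_acts):
    """Build the training targets for EA1, EA2, ...

    output_act / output_target: activations and targets of the Output component.
    ea_acts: activations of each EA component recorded during the same
             learning-free forward pass, in order (EA1 first).
    """
    targets = []
    prev_act, prev_target = output_act, output_target
    for ea_act in ea_acts:
        current_target = error_vector(prev_act, prev_target)
        targets.append(current_target)
        # The next EA component anticipates the error of this one.
        prev_act, prev_target = ea_act, current_target
    return targets

# Two EA components, as in most of the experiments described above.
ea1_target, ea2_target = ea_targets(
    output_act=[0.75, 0.25],
    output_target=[0.0, 1.0],
    ea_acts=[[0.5, 0.5], [0.5, 0.5]],
)
```

Note that signed errors can fall outside the interval (0, 1), so an actual implementation would presumably need some convention for mapping them onto the output range of its units.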

The final component of the output layer is the Hidden Layer Anticipation component (LA for short). This component attempts to mimic the activation of the hidden layer for the current input. Essentially, during the same forward propagation used for EA above, we also record the activation of the hidden layer and set it as the target for the LA component. Due to the presence of squashing functions, the LA component is not simply learning the identity function.

In trials where we have turned these techniques off, we simply set their target values to always be .5; we do not modify the structure of the network.
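One way to picture the full output-layer target is as a concatenation of the component targets, with any disabled component replaced by constant .5 values so that the network structure stays the same. The sketch below assumes the component order Output, EA1..EAn, LA; the function and parameter names are hypothetical.

```python
def build_targets(problem_target, ea_targets, hidden_act,
                  use_ea=True, use_la=True):
    """Concatenate targets for the Output, EA, and LA components.

    problem_target: targets for the Output component.
    ea_targets:     list of per-component EA target vectors (EA1 first).
    hidden_act:     hidden layer activations recorded on the forward pass.
    A disabled technique keeps its units but trains them toward 0.5.
    """
    full = list(problem_target)
    for ea in ea_targets:
        full.extend(ea if use_ea else [0.5] * len(ea))
    full.extend(hidden_act if use_la else [0.5] * len(hidden_act))
    return full

both_on = build_targets([0, 1], [[0.25, -0.25]], [0.9, 0.1, 0.5])
ea_off = build_targets([0, 1], [[0.25, -0.25]], [0.9, 0.1, 0.5], use_ea=False)
both_off = build_targets([0, 1], [[0.25, -0.25]], [0.9, 0.1, 0.5],
                         use_ea=False, use_la=False)
```

The target vector has the same length in every configuration, which mirrors the statement that only the targets change and not the network.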

http://www.cs.hmc.edu/~jlewis/eann/images/structure.png

1.3. Results

As the graphs below show, standard back propagation is able to learn even in the presence of random distractions. However, both of our techniques allow the network to learn faster than it otherwise would. The “dataNO” lines represent the TSS (total sum squared) error with both techniques off. “dataEA” uses just EA, “dataLA” uses just LA, and “dataEALA” uses both techniques. The numbers after the name represent the percentage of deterministic data in the training set.

EA is the more subtle of the two effects. In the case where the data is 100% deterministic, it actually hinders the learning process. However, when the data is 25% deterministic, EA is a substantial aid to the learning process.

LA exerts a positive influence in every mix of deterministic and random data, but its influence becomes more pronounced as more random data is added to the training set.

http://www.cs.hmc.edu/~jlewis/eann/images/graph100.png http://www.cs.hmc.edu/~jlewis/eann/images/graph075.png http://www.cs.hmc.edu/~jlewis/eann/images/graph050.png http://www.cs.hmc.edu/~jlewis/eann/images/graph025.png

1.4. Discussion

Why do these techniques work? Though we don't have complete answers, we do have theories about how these techniques influence learning.

In the case of EA we believe that anticipating the error causes the neural network to better segregate the random inputs from the deterministic inputs at its hidden layer. To test this claim, we performed principal components analysis on networks using EA and on networks not using EA. As you can see from the figures, EA networks do a much better job of segregating deterministic inputs (X) from random inputs (O). This allows the output layer to better ignore random inputs.

http://www.cs.hmc.edu/~jlewis/eann/images/eapca.png http://www.cs.hmc.edu/~jlewis/eann/images/nopca.png

LA is more complicated. It seems to both speed up the learning rate between the input and hidden layers and partially resist backwards progress when presented with random samples. The graph below shows the TSS error of the Output component (red) and the LA component (green). Learning seems to progress in three major stages.

http://www.cs.hmc.edu/~jlewis/eann/images/latss.png

In the first stage, the majority of the error is coming from the Output component. Since the weights from the input layer to the hidden layer start off random, as do the weights from the hidden layer to the LA component, the activations of the LA component and the hidden layer are near .5 and thus there is not much error. During this stage, the majority of the error is coming from non-random input samples, since their targets are either 1 or 0, whereas the random samples have targets ranging from 0 to 1 and averaging .5. With the Output activations also starting near .5, the initial expected per-unit error from non-random samples is .5 and from random samples is .25. The non-random input samples cause the majority of learning from the input to hidden layer at this stage.
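The .5 and .25 figures can be checked numerically under a few assumptions that the text implies but does not state outright: the untrained output sits at exactly .5, error is measured as the absolute per-unit difference, and random targets are uniformly distributed on [0, 1]. The helper name below is hypothetical.

```python
def expected_abs_error(output, targets):
    """Mean absolute difference between a fixed output and a set of targets."""
    return sum(abs(t - output) for t in targets) / len(targets)

# Deterministic samples: targets are exactly 0 or 1.
deterministic_targets = [0.0, 1.0]

# Random samples: approximate a uniform distribution on [0, 1]
# with an evenly spaced midpoint grid.
n = 1000
random_targets = [(i + 0.5) / n for i in range(n)]

det_err = expected_abs_error(0.5, deterministic_targets)
rand_err = expected_abs_error(0.5, random_targets)
```

Under these assumptions `det_err` comes out at .5 and `rand_err` at approximately .25, matching the values quoted above.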

In the second stage, the TSS of the Output component plummets, and the TSS from the LA component spikes. This is the critical stage where the LA strategy does much better than the other strategies. Here we believe that the LA component is minimizing damaging weight changes between the input and hidden layers in the following way.

Let's consider the first neuron of both the hidden layer and the LA component of the output layer. During the first stage of learning, the hidden layer neuron will have become either activated or inactivated when presented with a certain input pattern. The LA neuron will learn to mimic this, but will likely be a little behind, since it will always be learning before the weight changes from the Output component are applied for a particular sample. Thus, for some arbitrary sample, the hidden neuron will be activated at .8, for example, and the LA neuron might be activated at .7. In this case (and the corresponding low activation case) the back propagation step will cause the hidden layer neuron to be even more activated the next time around, since there is a negative error for the LA neuron (assuming the weight between the two neurons is positive). In this way, LA speeds up the learning between the input and hidden layers—reinforcing the initial movements of the hidden layer activations.

If we imagine that the sample is a random sample, and that the Output component target is the opposite of what it should be, then the Output component will likely be propagating back error that would reduce the activation of the hidden neuron. However, at the same time, the LA component will be propagating back error that would increase that same activation. This is the key assistance that LA provides in the second stage. Once the network starts learning its problem partially, the discrepancy in expected error mentioned above begins to disappear, and non-random inputs no longer carry more weight. In this situation LA is helping the network “remember” the direction its learning was proceeding in.

In the third stage, the TSS from both the Output component and the LA component is very small, but the LA TSS is higher, in part due to the greater number of neurons in the LA component. Here, LA is simply a distraction, and it prevents LA networks from achieving the same accuracy as non-LA networks (though LA networks get close much faster, and this effect is mitigated by the presence of EA). One possible improvement on LA would be to turn it off once the Output TSS reaches some predetermined minimum.
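The suggested improvement could look something like the sketch below. It has not been tested in the experiments described here; the threshold value and the function name are placeholders, and "turning LA off" is interpreted the same way as in the control trials, by training the LA units toward .5.

```python
def la_targets(hidden_act, output_tss, tss_threshold=0.1):
    """Return LA targets, disabling LA once the Output TSS is small enough.

    tss_threshold is a hypothetical cutoff, not a value from the experiments.
    """
    if output_tss <= tss_threshold:
        # LA switched off: same convention as the control trials.
        return [0.5] * len(hidden_act)
    return list(hidden_act)

early = la_targets([0.9, 0.2, 0.7], output_tss=3.4)
late = la_targets([0.9, 0.2, 0.7], output_tss=0.05)
```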

1.5. Follow-up Experiments

We wondered whether we would see similar effects with datasets of randomly-generated binary patterns. We created a new dataset consisting of 50 unique input patterns of size 10 paired with 50 target patterns (not necessarily unique) of size 5. All bits were determined by random "coin tosses". A few example patterns are shown below.

[0, 1, 0, 1, 1, 1, 1, 0, 0, 1]  ->  [1, 0, 0, 0, 1]
[1, 1, 0, 1, 1, 1, 1, 0, 0, 0]  ->  [0, 0, 1, 1, 0]
[0, 0, 0, 0, 1, 0, 1, 0, 1, 1]  ->  [1, 1, 1, 1, 1]
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  ->  [0, 1, 1, 0, 1]
[1, 0, 1, 1, 0, 1, 0, 0, 0, 0]  ->  [1, 1, 1, 1, 1]
[0, 1, 0, 0, 0, 1, 1, 0, 0, 1]  ->  [0, 1, 1, 0, 0]
...
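A dataset of this kind could be generated along the following lines. This is a sketch rather than the script actually used: the function name, the seeding, and the sorted ordering of the patterns are assumptions.

```python
import random

def make_dataset(n_patterns=50, n_in=10, n_out=5, seed=0):
    """Generate unique random binary inputs paired with random binary targets."""
    if n_patterns > 2 ** n_in:
        raise ValueError("not enough distinct input patterns")
    rng = random.Random(seed)
    inputs = set()
    # Keep tossing coins until we have enough distinct input patterns.
    while len(inputs) < n_patterns:
        inputs.add(tuple(rng.randint(0, 1) for _ in range(n_in)))
    # Targets are not required to be unique.
    return [(list(p), [rng.randint(0, 1) for _ in range(n_out)])
            for p in sorted(inputs)]

dataset = make_dataset()
```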

For each experiment, a portion of the input patterns were designated as unpredictable "noise"; for these inputs, new binary target patterns were generated on the fly on each training step, rather than using the original randomly-generated target pattern of the dataset. We tried several different levels of noise: 0%, 25%, 50%, and 75% of the dataset. Unlike in the XOR dataset, input patterns had no explicit flag bits to indicate whether or not an input was noise, since all bits were randomly generated. Furthermore, no error prediction was performed in these experiments.
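The noise mechanism described above might be sketched as follows. The names are hypothetical, and truncating the noise count to an integer is an assumption, since 25% and 75% of 50 patterns are not whole numbers and the rounding used is not stated.

```python
import random

def choose_noise(n_patterns, noise_fraction, rng):
    """Pick which pattern indices are treated as unpredictable noise."""
    n_noise = int(n_patterns * noise_fraction)  # assumption: truncate
    return set(rng.sample(range(n_patterns), n_noise))

def target_for(index, fixed_targets, noise_indices, rng):
    """Return the training target for one pattern on one training step."""
    if index in noise_indices:
        # Noise pattern: draw a fresh binary target every time it is presented.
        return [rng.randint(0, 1) for _ in fixed_targets[index]]
    # Predictable pattern: always use the dataset's original target.
    return fixed_targets[index]

rng = random.Random(0)
fixed_targets = [[rng.randint(0, 1) for _ in range(5)] for _ in range(50)]
noise_indices = choose_noise(50, 0.50, rng)
```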

The network architecture consisted of 10 input units, 15 hidden units, 5 output units, and 15 prediction units. The prediction layer attempted to reproduce the activation pattern of the hidden layer on each step. Prediction could be disabled, as a control, in which case training targets of 0.5 were used for the prediction units instead of the hidden layer activations.
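For concreteness, a forward pass through a network of this shape could be sketched as below. The sketch assumes that the prediction units sit alongside the output units and are fed by the hidden layer (as with the LA component described earlier), that all units use a logistic squashing function, and that weights start in a small random range; none of these details, nor the names used, are taken from the actual code.

```python
import math
import random

N_IN, N_HIDDEN, N_OUT = 10, 15, 5  # plus N_HIDDEN prediction units

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """Activations of one fully connected layer of sigmoid units."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def random_weights(n_out, n_in, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
w_hidden = random_weights(N_HIDDEN, N_IN, rng)
b_hidden = [0.0] * N_HIDDEN
# The final layer holds 5 output units followed by 15 prediction units.
w_final = random_weights(N_OUT + N_HIDDEN, N_HIDDEN, rng)
b_final = [0.0] * (N_OUT + N_HIDDEN)

def forward(pattern):
    """Return (hidden, output, prediction) activations for one input pattern."""
    hidden = layer(pattern, w_hidden, b_hidden)
    final = layer(hidden, w_final, b_final)
    return hidden, final[:N_OUT], final[N_OUT:]

def prediction_targets(hidden, enabled=True):
    """Targets for the prediction units; 0.5 everywhere when disabled."""
    return list(hidden) if enabled else [0.5] * len(hidden)

hidden, output, prediction = forward([0, 1, 0, 1, 1, 1, 1, 0, 0, 1])
```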

The results are shown below. These four graphs plot the TSS error of the output layer versus training epoch (for the predictable input patterns only).

http://www.cs.pomona.edu/~marshall/graphs/10x15x5/10x15x5.tss.000.gif http://www.cs.pomona.edu/~marshall/graphs/10x15x5/10x15x5.tss.025.gif
http://www.cs.pomona.edu/~marshall/graphs/10x15x5/10x15x5.tss.050.gif http://www.cs.pomona.edu/~marshall/graphs/10x15x5/10x15x5.tss.075.gif

The next four graphs were generated from the same runs, but plot the fraction of all output units correct to within tolerance (0.05) for the predictable input patterns.

http://www.cs.pomona.edu/~marshall/graphs/10x15x5/10x15x5.correct.000.gif http://www.cs.pomona.edu/~marshall/graphs/10x15x5/10x15x5.correct.025.gif
http://www.cs.pomona.edu/~marshall/graphs/10x15x5/10x15x5.correct.050.gif http://www.cs.pomona.edu/~marshall/graphs/10x15x5/10x15x5.correct.075.gif
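The two measures plotted above can be computed roughly as follows. This sketch assumes the usual definition of TSS as the sum of squared per-unit errors over all patterns, and treats an output as correct when its absolute error is at most the tolerance; whether the original comparison was strict is not stated. Function names are hypothetical.

```python
def tss(outputs, targets):
    """Total sum squared error over all patterns and output units."""
    return sum((t - o) ** 2
               for out, tgt in zip(outputs, targets)
               for o, t in zip(out, tgt))

def fraction_correct(outputs, targets, tolerance=0.05):
    """Fraction of output units whose error is within the tolerance."""
    total = correct = 0
    for out, tgt in zip(outputs, targets):
        for o, t in zip(out, tgt):
            total += 1
            if abs(t - o) <= tolerance:
                correct += 1
    return correct / total

# Two patterns with two output units each (predictable patterns only).
outputs = [[0.98, 0.5], [0.0, 0.25]]
targets = [[1, 1], [0, 0]]
error = tss(outputs, targets)
score = fraction_correct(outputs, targets)
```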

Hidden layer prediction doesn't seem to help much here. In fact, prediction seems to actually hinder the learning process. This is seen most clearly in the plots of fraction of outputs correct (second group above). One possible explanation is that this randomly-generated dataset may actually be too easy for the network to learn, and thus hidden layer prediction affords no clear advantage over regular backpropagation. With 10-bit patterns, the input space consists of 2^10 = 1024 possible patterns, but only 50 are included in the dataset. This effectively gives the network a fairly large amount of freedom to form separating hyperplanes between the deterministic and unpredictable input patterns. In other words, it is relatively easy for the network to learn to distinguish the "signal" patterns from the "noise" patterns of this dataset, without relying on the extra machinery of hidden layer prediction.
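The sparsity argument can be made concrete with a small calculation. The helper name is hypothetical; the numbers are those given above.

```python
def coverage(n_patterns, n_bits):
    """Fraction of the binary input space occupied by the dataset."""
    return n_patterns / 2 ** n_bits

space_size = 2 ** 10             # 1024 possible 10-bit patterns
used_fraction = coverage(50, 10)  # share of the space the dataset occupies
```

With 50 patterns out of 1024, the dataset covers just under 5% of the input space, which leaves most of the space free for placing separating hyperplanes.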

1.5.1. 6-Bit Input Patterns

To test this hypothesis, we created a new dataset consisting of 64 unique 6-bit input patterns, which covers the entire space, paired with 2-bit target patterns. As before, we varied the percentage of the input patterns designated as noise. For these experiments, we used 9 hidden units (1.5 times the number of input units, as before). The resulting graphs are shown below.

http://www.cs.pomona.edu/~marshall/graphs/6x9x2/6x9x2.tss.000.gif http://www.cs.pomona.edu/~marshall/graphs/6x9x2/6x9x2.tss.025.gif
http://www.cs.pomona.edu/~marshall/graphs/6x9x2/6x9x2.tss.050.gif http://www.cs.pomona.edu/~marshall/graphs/6x9x2/6x9x2.tss.075.gif

The same runs, showing fraction of all output units correct to within tolerance (0.05) for the predictable input patterns:

http://www.cs.pomona.edu/~marshall/graphs/6x9x2/6x9x2.correct.000.gif http://www.cs.pomona.edu/~marshall/graphs/6x9x2/6x9x2.correct.025.gif
http://www.cs.pomona.edu/~marshall/graphs/6x9x2/6x9x2.correct.050.gif http://www.cs.pomona.edu/~marshall/graphs/6x9x2/6x9x2.correct.075.gif

Judging by the TSS Error graphs, hidden layer prediction seems to help, although the effect is less pronounced in the outputs-correct graphs.

1.5.2. Impact of Number of Hidden Layer Units

The number of hidden layer units used in the network seems to have a substantial impact on network performance using our strategies. Using the 64-pattern training set above, the graphs below compare networks with 10 and 30 hidden layer units. In general, our strategies seem to prefer a larger number of hidden layer units. These tests were run with only 10% of the patterns behaving deterministically.

http://www.cs.hmc.edu/~jlewis/eann/images/novseala1010.png http://www.cs.hmc.edu/~jlewis/eann/images/novseala1030.png
http://www.cs.hmc.edu/~jlewis/eann/images/novseala1010vs1030.png http://www.cs.hmc.edu/~jlewis/eann/images/novsla1010vs1030.png

When comparing the two lower graphs, it seems that EA is not much help in this scenario. This may be due to the fact that there are no flag bits that designate whether a pattern is deterministic or not, so the whole pattern must be considered.