
Anticipation in a Neural Network


1. Anticipation in a Neural Network

How can anticipation help a neural network learn?

Questions:

  1. Given a task surrounded by noisy, unlearnable patterns, can a neural network learn the task?

  2. Can learning to predict the amount of error help learn this task?

To test these questions, let's construct a hard set of problems for a neural network to learn. First, consider the XOR problem:

0 0 -> 0
0 1 -> 1
1 0 -> 1
1 1 -> 0

That is, given two inputs, output the exclusive-or (XOR) of the two inputs: if the inputs are the same, output a 0. If they are different, output a 1.
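
As a quick check of the rule just described, here is a minimal Python sketch that enumerates the XOR table. The names xor and xor_table are only illustrative; they are not part of Conx or of the linked program.

```python
def xor(a, b):
    """Return 1 if the two binary inputs differ, otherwise 0."""
    return a ^ b

# Enumerate the four input patterns in the same order as the table above.
xor_table = [((a, b), xor(a, b)) for a in (0, 1) for b in (0, 1)]

for (a, b), out in xor_table:
    print(a, b, "->", out)
```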

Now, consider two XOR problems, back-to-back:

0, 0, 0, 0 -> 0, 0
0, 0, 0, 1 -> 0, 1
0, 0, 1, 0 -> 0, 1
0, 0, 1, 1 -> 0, 0
0, 1, 0, 0 -> 1, 0
0, 1, 0, 1 -> 1, 1
0, 1, 1, 0 -> 1, 1
0, 1, 1, 1 -> 1, 0
1, 0, 0, 0 -> 1, 0
1, 0, 0, 1 -> 1, 1
1, 0, 1, 0 -> 1, 1
1, 0, 1, 1 -> 1, 0
1, 1, 0, 0 -> 0, 0
1, 1, 0, 1 -> 0, 1
1, 1, 1, 0 -> 0, 1
1, 1, 1, 1 -> 0, 0

Here, the first and second inputs determine the first output, and the third and fourth inputs determine the second output. Now, let's make it really hard. In this version, we will double the number of patterns, and make those patterns output random values. To signal that these patterns are bogus, we will add a flag on the front of the inputs with 0 meaning do the regular double XOR, and 1 meaning that the outputs will be random:

0, 0, 0, 0, 0 -> 0 0
0, 0, 0, 0, 1 -> 0 1
0, 0, 0, 1, 0 -> 0 1
0, 0, 0, 1, 1 -> 0 0
0, 0, 1, 0, 0 -> 1 0
0, 0, 1, 0, 1 -> 1 1
0, 0, 1, 1, 0 -> 1 1
0, 0, 1, 1, 1 -> 1 0
0, 1, 0, 0, 0 -> 1 0
0, 1, 0, 0, 1 -> 1 1
0, 1, 0, 1, 0 -> 1 1
0, 1, 0, 1, 1 -> 1 0
0, 1, 1, 0, 0 -> 0 0
0, 1, 1, 0, 1 -> 0 1
0, 1, 1, 1, 0 -> 0 1
0, 1, 1, 1, 1 -> 0 0

1, 0, 0, 0, 0 -> ? ?
1, 0, 0, 0, 1 -> ? ?
1, 0, 0, 1, 0 -> ? ?
1, 0, 0, 1, 1 -> ? ?
1, 0, 1, 0, 0 -> ? ?
1, 0, 1, 0, 1 -> ? ?
1, 0, 1, 1, 0 -> ? ?
1, 0, 1, 1, 1 -> ? ?
1, 1, 0, 0, 0 -> ? ?
1, 1, 0, 0, 1 -> ? ?
1, 1, 0, 1, 0 -> ? ?
1, 1, 0, 1, 1 -> ? ?
1, 1, 1, 0, 0 -> ? ?
1, 1, 1, 0, 1 -> ? ?
1, 1, 1, 1, 0 -> ? ?
1, 1, 1, 1, 1 -> ? ?

The question marks represent a random value between 0 and 1. Can a neural network learn the non-random patterns, or will there be too much "noise" (the random outputs)? Let's test if a neural network can learn this. We'll design a network with the following layout:

          [ 2 XOR Outputs ]
                  ^
                  |
           [ Hidden Layer ]
                  ^
                  |
             [ 5 Inputs ]

Here is the data in a Conx format:

net.setInputs( [[0, 0, 0, 0, 0],
                [0, 0, 0, 0, 1],
                [0, 0, 0, 1, 0],
                [0, 0, 0, 1, 1],
                [0, 0, 1, 0, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 1, 1, 0],
                [0, 0, 1, 1, 1],
                [0, 1, 0, 0, 0],
                [0, 1, 0, 0, 1],
                [0, 1, 0, 1, 0],
                [0, 1, 0, 1, 1],
                [0, 1, 1, 0, 0],
                [0, 1, 1, 0, 1],
                [0, 1, 1, 1, 0],
                [0, 1, 1, 1, 1],
                [1, 0, 0, 0, 0],
                [1, 0, 0, 0, 1],
                [1, 0, 0, 1, 0],
                [1, 0, 0, 1, 1],
                [1, 0, 1, 0, 0],
                [1, 0, 1, 0, 1],
                [1, 0, 1, 1, 0],
                [1, 0, 1, 1, 1],
                [1, 1, 0, 0, 0],
                [1, 1, 0, 0, 1],
                [1, 1, 0, 1, 0],
                [1, 1, 0, 1, 1],
                [1, 1, 1, 0, 0],
                [1, 1, 1, 0, 1],
                [1, 1, 1, 1, 0],
                [1, 1, 1, 1, 1]] )
net.setTargets( [[ 0,  0 ],
                 [ 0,  1 ],
                 [ 0,  1 ],
                 [ 0,  0 ],
                 [ 1,  0 ],
                 [ 1,  1 ],
                 [ 1,  1 ],
                 [ 1,  0 ],
                 [ 1,  0 ],
                 [ 1,  1 ],
                 [ 1,  1 ],
                 [ 1,  0 ],
                 [ 0,  0 ],
                 [ 0,  1 ],
                 [ 0,  1 ],
                 [ 0,  0 ],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?'],
                 ['?','?']] )
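
The '?' entries stand for random numbers, so the lists above cannot be typed in literally. One way to build the same 32 patterns programmatically is sketched below. The helper make_patterns and the seed are assumptions made for illustration, and the text does not say whether the random targets are drawn once or re-drawn on every sweep; this sketch draws them once.

```python
import itertools
import random

def make_patterns(rng):
    """Build the 32 input/target pairs: 16 double-XOR patterns, 16 random ones."""
    inputs, targets = [], []
    # product() varies the last position fastest, so the flag changes slowest,
    # which reproduces the ordering of the lists above.
    for flag, a, b, c, d in itertools.product((0, 1), repeat=5):
        inputs.append([flag, a, b, c, d])
        if flag == 0:
            # Regular double XOR: (a XOR b, c XOR d).
            targets.append([a ^ b, c ^ d])
        else:
            # Flag set: both targets are random values in [0, 1).
            targets.append([rng.random(), rng.random()])
    return inputs, targets

rng = random.Random(0)  # arbitrary seed, chosen only for repeatability
inputs, targets = make_patterns(rng)
```

The resulting lists could then be passed to net.setInputs(inputs) and net.setTargets(targets) as in the listing above.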

And here is a program to run that will create such a network:

http://bubo.brynmawr.edu/cgi-bin/viewcvs.cgi/xornoise/xor2.py?rev=HEAD&content-type=text/vnd.viewcvs-markup

(This network has some extra code and an extra output layer. We'll explore that in a moment.) It turns out that a network can learn to do the task:

http://bubo.brynmawr.edu/~dblank/images/without.png

This figure shows 20 runs, plotting total summed squared error of the two output values for 1000 sweeps through the patterns.

Can we do anything to help the network realize that 16 of the patterns are not learnable?

We add another output bank that is trained to output the error of the first bank. That is, if the input is:

0 0 1 1 1

Then we see from the first value (the flag) that the output is not random, and that the outputs should be 1 0. If our actual outputs were 0.5 0.0, then the error expectations would be (0.5 - 1.0) (0.0 - 0.0). Because these networks can't output negative numbers, we'll use the absolute value of these differences. So, the error expectation output should be:

0.5 0.0
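
The same arithmetic, as a small Python sketch. The function name error_targets is hypothetical and is not taken from the linked program.

```python
def error_targets(actual, desired):
    """Targets for the error bank: absolute difference per output unit."""
    return [abs(a - d) for a, d in zip(actual, desired)]

# The worked example from the text: desired 1 0, actual 0.5 0.0.
print(error_targets([0.5, 0.0], [1, 0]))  # prints [0.5, 0.0]
```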

Now, the network looks like:

 [ 2 XOR Outputs ]  [ 2 Expected Error Outputs]
            ^         ^
            |         |
           [ Hidden Layer ]
                  ^
                  |
             [ 5 Inputs ]
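
To make the layout concrete, here is a rough forward-pass sketch in plain Python showing both output banks reading from the same hidden layer. It is not the Conx implementation: the hidden layer size, the weight range, and all function names are assumptions, and no training is shown. Sigmoid units are assumed, which is consistent with the outputs never being negative.

```python
import math
import random

def sigmoid(x):
    """Logistic activation; its output always lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def make_layer(n_in, n_out, rng):
    """Random weights, one row per unit; the last entry of each row is a bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
            for _ in range(n_out)]

def activate(layer, values):
    """Activations of one layer, with a constant 1.0 appended for the bias."""
    return [sigmoid(sum(w * v for w, v in zip(row, values + [1.0])))
            for row in layer]

HIDDEN_SIZE = 10  # assumption: the text does not state the hidden layer size
rng = random.Random(0)  # arbitrary seed
hidden = make_layer(5, HIDDEN_SIZE, rng)
xor_bank = make_layer(HIDDEN_SIZE, 2, rng)    # the 2 XOR outputs
error_bank = make_layer(HIDDEN_SIZE, 2, rng)  # the 2 expected error outputs

def forward(pattern):
    """Both output banks are computed from the same hidden activations."""
    h = activate(hidden, list(pattern))
    return activate(xor_bank, h), activate(error_bank, h)

xor_out, expected_error = forward([0, 0, 1, 1, 1])
```

Because the two banks share the hidden layer, error signals from the expected error outputs also shape the hidden representation used by the XOR outputs.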

What effect does adding these error expectation units have on training?

http://bubo.brynmawr.edu/~dblank/images/with.png

Again, this figure shows 20 runs, plotting total summed squared error of the two output values for 1000 sweeps through the patterns. However, this has been trained with the additional task of producing the error of the XOR output.

Can you give an explanation for why this harder task is learned more quickly?


2. Statistical Difference

Does this additional task make a statistically significant difference?

Yes.

In gnumeric, load the TSS error at the 1000th epoch for all 40 runs (20 without and 20 with the error expectation outputs). Select the one-way ANOVA test. If F is greater than F critical, then the null hypothesis (that there is no difference) can be rejected. F(1,38) = 8.974 > 4.098 at alpha = 0.05, so p < 0.05, with one caution: the requirement of equality of variances was violated. However, this test is very robust to this violation when the group sizes are equal, as they are here.
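
For readers without gnumeric, the F statistic of a one-way ANOVA can be computed by hand. The sketch below is a plain Python version; the function name is hypothetical, and the two small groups are made-up placeholder numbers, not the TSS errors from the runs above.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of each value around its own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Placeholder data for illustration only (NOT the experiment's results).
without_bank = [1.0, 2.0, 3.0]
with_bank = [2.0, 3.0, 4.0]
f, df_b, df_w = one_way_anova_f([without_bank, with_bank])
print(f, df_b, df_w)
```

With two groups of 20 runs each, the degrees of freedom come out as (1, 38), matching the F(1,38) reported above; the computed F would then be compared against the critical value 4.098.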