When computing the candidate slopes that are used in the weight change formula, he divides actPrime (the derivative of the activation function evaluated at the net input to the node) by the sum squared error, with a comment about normalizing it. I don't understand exactly why this is done either, but I temporarily put it in to look for other differences (which led me to the above, since we don't compute the sum squared error the same way).
One very non-obvious variable is his error.sumErr, which in many parts of his code actually holds the average (sumErr / numTrainingPoints). In other parts it really is the sum.
I would like to discuss these differences, potentially at length.
Fahlman uses the function he calls compute_error in both the candidate training phase and the output training phase. In the output training phase it is also responsible for computing the slopes, but in the candidate phase it only populates his error struct.
Something else we might want to think about is how we randomly initialize candidate weights when they are installed. I have not yet looked at how Fahlman does this but I suspect the best way to do it would be to use what we know about the correlation of the candidate to pick the sign of the random weights.
I have looked into how weights are randomly initialized in conx and in Fahlman's code. Both initialize weights in fundamentally the same way; only the default range differs. Conx initializes them by default between -0.1 and 0.1, whereas Fahlman initializes them by default between -1.0 and 1.0.
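To make the comparison concrete, here is a minimal sketch of that kind of initialization, assuming both draw uniformly from a symmetric range (I have not checked the exact random number call either code base uses). The function name init_weights and its parameters are made up for illustration; they are not from conx or from Fahlman's code.

```python
import random

def init_weights(count, weight_range, rng=random):
    """Return `count` weights drawn uniformly from [-weight_range, weight_range].

    Hypothetical helper: only the default range is taken from the notes above.
    """
    return [rng.uniform(-weight_range, weight_range) for _ in range(count)]

# Default ranges as described above (assumed uniform in both cases).
conx_style = init_weights(5, 0.1)
fahlman_style = init_weights(5, 1.0)
```

When comparing traces against Fahlman's code, the range has to be matched (or the weights copied over directly), otherwise the two runs differ from the very first epoch.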
I have made changes to Fahlman's code to make it compute error the way we do, to not normalize actPrime, and to set error equal to dif. His code still learns the twospirals problem very consistently. I have verified that we compute the same slopes in the first candidate epoch. However, we do not seem to train for the appropriate total number of epochs. His code generally takes around 1700 epochs (output training epochs and candidate training epochs combined), but our code takes closer to 500 epochs total. There must be a bug in either candidate recruitment or the stagnation detection code.
UPDATE:
-
Fahlman uses the correlation computed for the candidate with each output unit to set the initial weights from the installed candidate to the outputs. However, we compute the correlation differently than Fahlman does. We compute it the way he describes it in his paper, which does not seem to be what his code does. Since we compute the correlation differently, we could end up selecting a different candidate than Fahlman's code would when recruiting a new unit. In the two-spirals problem, I would expect this to have a larger impact than the initial weights between the new candidate and the outputs, since there is only one output unit and it shouldn't be hard for the output training phase to determine a good weight from the candidate to the output. At this point, I am going to try to compute the correlation the way Fahlman does and initialize the candidate-to-output weights the way he does as well.
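For reference, this is a rough sketch of the correlation score as the paper describes it: for each output, the covariance between the candidate's activation and that output's residual error over all training patterns, with the magnitudes summed over outputs. This is the paper's formula, not what Fahlman's code does, and the function and argument names are my own. It returns the signed per-output values as well as the score, since the sign is the part that could be used when choosing initial candidate-to-output weights.

```python
def candidate_correlation(values, errors):
    """Correlation score for one candidate, following the paper's description.

    values: candidate activation for each training pattern.
    errors: for each training pattern, a list of residual errors (one per output).
    Returns (score, per_output), where per_output holds the signed covariance
    with each output and score is the sum of their absolute values.
    """
    num_patterns = len(values)
    num_outputs = len(errors[0])
    value_mean = sum(values) / num_patterns

    per_output = []
    for o in range(num_outputs):
        error_mean = sum(e[o] for e in errors) / num_patterns
        cov = sum((v - value_mean) * (e[o] - error_mean)
                  for v, e in zip(values, errors))
        per_output.append(cov)

    score = sum(abs(c) for c in per_output)
    return score, per_output
```

With a single output, as in two-spirals, the score is just the absolute value of one covariance, so any difference in how it is computed shows up directly in which candidate wins.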
UPDATE 2:
-
Everything looks correct as far as I have checked it. I am tracing through with a candidate layer size of 1. Everything is identical up through the 2nd candidate epoch (as far as I have gotten) except for a slight difference in the bias weight of the candidate unit. The bias weight for the lone candidate unit in my code seems to be drifting farther and farther from its value in Fahlman's code. I will be investigating this further tomorrow.
I have fixed the slow divergence: I accidentally had a different decay parameter than Fahlman. See the update below for the newest discrepancy/problem.
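For this kind of side-by-side tracing, a small helper along these lines can point at the first value that differs between two dumps (weights, weight changes, slopes, correlations). It is only a sketch: the name first_divergence and the tolerances are my own choices, and it assumes both programs print the values in the same order.

```python
import math

def first_divergence(ours, theirs, rel_tol=1e-9, abs_tol=1e-12):
    """Return (index, our_value, their_value) for the first mismatch, or None.

    If one sequence is a prefix of the other, the index of the first missing
    element is returned with None for both values.
    """
    for i, (a, b) in enumerate(zip(ours, theirs)):
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
            return i, a, b
    if len(ours) != len(theirs):
        return min(len(ours), len(theirs)), None, None
    return None
```

A loose tolerance hides slow drift like the bias weight problem above, so it is worth keeping it tight and only relaxing it for differences that are known to be float formatting.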
UPDATE 3:
-
Before the 4th call of change_weights in trainCandidates, the slopes (wed) are incorrect. Before this point, all weights, weight changes, and correlations are identical.
