CandidateTrainingDifferences


I have been carefully comparing what Fahlman's candidate training phase does with what mine does in conx. There are a few important differences that I don't entirely understand. For one, in candidate training, what Fahlman calls error is not target-goal (which he calls dif) but (target-goal) * actDeriv(netInput), which he calls error_prime as he computes it. This is very strange. His sum squared error is also computed from this error_prime value, which he assigns to his error variable.
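To make the two error definitions concrete, here is a minimal sketch. It assumes a sigmoid activation and reads "target-goal" as the unit's output minus the goal value; both are assumptions for illustration, and the function names are hypothetical rather than taken from conx or Fahlman's code.

```python
import math

def sigmoid(x):
    """Logistic activation (assumed here; the real code may use another)."""
    return 1.0 / (1.0 + math.exp(-x))

def act_deriv(net_input):
    """Derivative of the sigmoid evaluated at the net input."""
    s = sigmoid(net_input)
    return s * (1.0 - s)

def dif_error(net_input, goal):
    """Plain difference between output and goal (what the notes call dif)."""
    return sigmoid(net_input) - goal

def error_prime(net_input, goal):
    """The difference scaled by the activation derivative."""
    return dif_error(net_input, goal) * act_deriv(net_input)

def sum_squared(errors):
    """Sum of squared errors over a list of per-pattern errors."""
    return sum(e * e for e in errors)

# Hypothetical (net input, goal) pairs, purely for illustration.
patterns = [(0.0, 1.0), (2.0, 0.0), (-1.0, 1.0)]
sse_dif = sum_squared([dif_error(n, g) for n, g in patterns])
sse_prime = sum_squared([error_prime(n, g) for n, g in patterns])
```

Because the sigmoid derivative never exceeds 0.25, the sum squared error built from error_prime is always much smaller than the one built from dif, so the two are not interchangeable in any formula that divides by it.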

When computing the slopes for the candidates, which are used in the weight change formula, he divides actPrime (the derivative of the activation function evaluated at the net input to the node) by the sum squared error, with a comment about normalizing it. I don't understand exactly why this is done either, but I temporarily put it in to look for other differences (which led me to the above, since we don't compute the sum squared error the same way).
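The normalization step by itself can be sketched as below. This only shows the division described above, not the full slope formula, and the function name is hypothetical.

```python
def normalize_act_primes(act_primes, sum_sq_error):
    """Divide each activation derivative by the sum squared error.

    act_primes: per-pattern derivatives of the activation function.
    sum_sq_error: a single positive scalar for the whole training set.
    """
    if sum_sq_error <= 0.0:
        raise ValueError("sum squared error must be positive")
    return [p / sum_sq_error for p in act_primes]

# Hypothetical values, for illustration only.
raw = [0.25, 0.1, 0.2]
scaled = normalize_act_primes(raw, 2.0)
```

Since the divisor is one positive scalar, it rescales every slope by the same factor and never changes a sign. That also means the slopes only match between two implementations if both compute the sum squared error the same way.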

One very non-obvious variable is his error.sumErr, which in many parts of his code actually holds the average (sumErr / numTrainingPoints). In other parts it holds the actual sum.
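One way to avoid this ambiguity on our side is to keep the sum and the average under separate names. The sketch below is a hypothetical helper, not something that exists in conx or in Fahlman's code.

```python
from dataclasses import dataclass

@dataclass
class ErrorStats:
    """Accumulates per-pattern error, keeping sum and average distinct."""
    sum_err: float = 0.0
    num_points: int = 0

    def add(self, err):
        """Record the error for one training point."""
        self.sum_err += err
        self.num_points += 1

    @property
    def avg_err(self):
        """Average error, i.e. sum_err / num_points."""
        if self.num_points == 0:
            raise ValueError("no training points recorded")
        return self.sum_err / self.num_points

stats = ErrorStats()
for e in [1.0, 2.0, 3.0]:
    stats.add(e)
```

With this layout, any formula ported from his code has to say explicitly whether it wants stats.sum_err or stats.avg_err.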

I would like to discuss these differences, potentially at length.

Fahlman uses the function he calls compute_error in both the candidate training phase and the output training phase. In the output training phase it is also responsible for computing the slopes, but in the candidate phase it only populates his error struct.

Something else we might want to think about is how we randomly initialize candidate weights when they are installed. I have not yet looked at how Fahlman does this but I suspect the best way to do it would be to use what we know about the correlation of the candidate to pick the sign of the random weights.

I have looked into how weights are randomly initialized in conx and in Fahlman's code. The two initialize weights in fundamentally the same way. The only difference is the default range: conx initializes them between -0.1 and 0.1, whereas Fahlman initializes them between -1.0 and 1.0.
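As a rough sketch of the shared scheme, assuming both draw uniformly from a symmetric range (the uniform distribution and the names below are assumptions, not copied from either code base):

```python
import random

CONX_DEFAULT_RANGE = 0.1     # default range stated above for conx
FAHLMAN_DEFAULT_RANGE = 1.0  # default range stated above for Fahlman's code

def init_weights(count, weight_range, rng=random):
    """Return `count` weights drawn uniformly from [-weight_range, weight_range]."""
    return [rng.uniform(-weight_range, weight_range) for _ in range(count)]

rng = random.Random(0)  # fixed seed so runs can be compared
conx_style = init_weights(5, CONX_DEFAULT_RANGE, rng)
fahlman_style = init_weights(5, FAHLMAN_DEFAULT_RANGE, rng)
```

When comparing the two implementations run for run, passing the same range to both removes this as a source of divergence.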

I have changed Fahlman's code to compute error the way we do, to not normalize actPrime, and to set error equal to dif. His code still learns the twospirals problem very consistently. I have verified that we compute the same slopes in the first candidate epoch. However, our code does not seem to train for the same total number of epochs. His code generally takes around 1700 epochs (output training epochs and candidate training epochs combined), but ours takes closer to 500 epochs total. There must be a bug in either candidate recruitment or the stagnation detection code.

UPDATE:

UPDATE 2:

I have fixed the slow divergence: I accidentally had a different decay parameter than Fahlman. See the update below for the newest discrepancy/problem.

UPDATE 3: