Neural Network Overtraining

AndrewM
Tue Jan 16 2007, 03:27PM
AndrewM Registered Member #49 Joined: Thu Feb 09 2006, 04:05AM
Location: Bigass Pile of Penguins
Posts: 362
Neural network texts and websites often speak of overtraining, and the importance of ensuring that the network stays generally applicable. I find that many sources call this "memorization"... that is, the network simply memorizes the training set rather than approximating the underlying function. I can't get my head around this, and I think the memorization statement is wrong, or at least a simplification.

Consider a network being trained to approximate AND: the training set contains only 4 cases, so it's easy for me to imagine that even a simple network could produce satisfactory performance by simply remembering each case.

However, they also often speak of how simple networks cannot approximate XOR. And yet, this input set is the same size as AND, so if the network is simply "memorizing", there should be no such thing as an impossible function.

The reason I'm asking is: if networks truly memorize the set itself when overtrained, we should be able to estimate the storage capacity of a network based on its architecture (x neurons, y synapses, z layers). The advantage is that you could size your training set so that it is larger than the network is capable of memorizing, reducing the net's ability to overtrain.

My guess is that saying a network "memorizes" a set when it's overtrained is not accurate, but is a simplification used to make texts more accessible to the casual reader. As I understand it, overtraining is simply a name for when a network identifies trends that are valid in your training set but not in the 'general' set, and no 'memorization' takes place. Anyone agree/disagree?
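
A minimal sketch (not from the original thread) of the capacity estimate suggested above: count the free parameters of a fully connected feed-forward net and compare them with the size of the training set. The layer sizes and case count here are hypothetical, chosen only for illustration.

    # Rough free-parameter count for a fully connected feed-forward network.
    # If the training set carries far more numbers than the network has
    # weights, outright memorization becomes implausible.

    def parameter_count(layer_sizes):
        """Weights plus biases in a fully connected net."""
        total = 0
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            total += n_in * n_out + n_out
        return total

    layers = [8, 15, 1]   # hypothetical: 8 inputs, 15 hidden units, 1 output
    n_params = parameter_count(layers)
    n_cases = 600         # hypothetical training-pool size

    print("free parameters:", n_params)   # 151 for the sizes above
    print("training cases: ", n_cases)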
Bjørn
Tue Jan 16 2007, 08:07PM
Bjørn Registered Member #27 Joined: Fri Feb 03 2006, 02:20AM
Location: Hyperborea
Posts: 2058
AndrewM wrote ...

Consider a network being trained to approximate AND: the training set contains only 4 cases, so it's easy for me to imagine that even a simple network could produce satisfactory performance by simply remembering each case.
For cases that are too simple to be useful, overtraining may give results as good or better. The problem arises on real problems, where it is impossible to train the net on anything but a tiny subset of all possible inputs. If it is overtrained, it will fail on inputs it has not been trained on, and it will reach false optima where it can't continue to improve because it does very well on the training set.

AndrewM wrote ...

However, they also often speak of how simple networks cannot approximate XOR. And yet, this input set is the same size as AND, so if the network is simply "memorizing", there should be no such thing as an impossible function.
A neural network with one input layer and one output layer is mathematically incapable of XOR and countless other functions (if you try to make an XOR gate out of transistors you will realise the problem). You would need at least one hidden layer or feedback to do XOR. It can also be shown that a neural network with one hidden layer can do everything a neural network with N hidden layers can do.

You are right that there are no impossible functions for a network with one or more hidden layers, but for a network with no hidden layers even "memorizing" is out of reach, in the same way as the XOR function.
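
A minimal numpy sketch of this point, not taken from the thread: plain gradient descent on the XOR truth table with no hidden layer (which is just logistic regression) versus one small hidden layer. The hidden-layer width, learning rate and epoch counts are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR truth table

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_no_hidden(epochs=5000, lr=1.0):
        # Inputs wired straight to the output unit: this is logistic regression,
        # which cannot draw the decision boundary XOR needs.
        w, b = rng.normal(size=(2, 1)), np.zeros(1)
        for _ in range(epochs):
            p = sigmoid(X @ w + b)
            w -= lr * X.T @ (p - y) / len(X)   # cross-entropy gradient
            b -= lr * np.mean(p - y)
        return sigmoid(X @ w + b).ravel()

    def train_one_hidden(hidden=4, epochs=20000, lr=1.0):
        # One hidden layer of sigmoid units, trained by plain backpropagation.
        W1, b1 = rng.normal(size=(2, hidden)), np.zeros(hidden)
        W2, b2 = rng.normal(size=(hidden, 1)), np.zeros(1)
        for _ in range(epochs):
            h = sigmoid(X @ W1 + b1)
            p = sigmoid(h @ W2 + b2)
            d2 = (p - y) / len(X)              # output delta (sigmoid + cross-entropy)
            d1 = (d2 @ W2.T) * h * (1 - h)     # hidden delta
            W2 -= lr * h.T @ d2;  b2 -= lr * d2.sum(axis=0)
            W1 -= lr * X.T @ d1;  b1 -= lr * d1.sum(axis=0)
        return p.ravel()

    print("no hidden layer: ", train_no_hidden().round(2))   # stuck near 0.5 for all rows
    print("one hidden layer:", train_one_hidden().round(2))  # typically close to 0 1 1 0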

AndrewM wrote ...

The reason I'm asking is: if networks truly memorize the set itself when overtrained, we should be able to estimate the storage capacity of a network based on its architecture (x neurons, y synapses, z layers). The advantage is that you could size your training set so that it is larger than the network is capable of memorizing, reducing the net's ability to overtrain.
That works well and is always a good idea. It would be even better to estimate the size needed by some more advanced method, so that you scale the network to contain the resulting function rather than a fraction of the training set.

AndrewM wrote ...

My guess is that saying a network "memorizes" a set when it's overtrained is not accurate, but is a simplification used to make texts more accessible to the casual reader. As I understand it, overtraining is simply a name for when a network identifies trends that are valid in your training set but not in the 'general' set, and no 'memorization' takes place. Anyone agree/disagree?
The memory effect is real, but whether it is the only, or even the most common, problem meant when overtraining is mentioned, I don't know. If you train a neural network on just a few data points, it is easy to see that the network simply memorizes the data. Even if it gives the correct results, it is not what we wanted: we wanted it to model the simplest function that fits the data, not the data itself.
AndrewM
Tue Jan 16 2007, 09:14PM
AndrewM Registered Member #49 Joined: Thu Feb 09 2006, 04:05AM
Location: Bigass Pile of Penguins
Posts: 362
Bjørn Bæverfjord wrote ...

The reason I'm asking is: if networks truly memorize the set itself when overtrained, we should be able to estimate the storage capacity of a network based on its architecture (x neurons, y synapses, z layers). The advantage is that you could size your training set so that it is larger than the network is capable of memorizing, reducing the net's ability to overtrain.
That works well and is always a good idea. It would be even better to estimate the size needed by some more advanced method, so that you scale the network to contain the resulting function rather than a fraction of the training set.

Do such analyses have a name? I've been frustrated thus far in my search.

I'm imagining cases where the function behind the data is unknown, or possibly even nonexistent (stock prediction, horse race gambling, rainfall forecasting, etc.). In such a case one cannot, even with advanced methods like the ones you mentioned, size the net to a function that one doesn't know.

Thus it seems that one would want to size the net to the available data: large enough to hopefully model the function, but small enough to be incapable of memorizing the entire training set. I haven't the foggiest idea of the form such analyses would take.
Carbon_Rod
Tue Jan 16 2007, 10:42PM
Carbon_Rod Registered Member #65 Joined: Thu Feb 09 2006, 06:43AM
Location:
Posts: 1155
Weighted variables do have limits. I have used a GUI application that tracks the relative weights in an easy-to-read overlapping graph that can be tuned/edited with a mouse click.

IIRC there was a site about OCR that uses the NN technique and also draws comparisons with PID control situations. I will post the URL if I recall its location...

Cheers,
Bjørn
Wed Jan 17 2007, 02:19AM
Bjørn Registered Member #27 Joined: Fri Feb 03 2006, 02:20AM
Location: Hyperborea
Posts: 2058
For all functions that can be evaluated on a digital computer there exists at least one set of weights that will make a digitally simulated neural network with one hidden layer compute that function.

AndrewM wrote ...

Do such analyses have a name? I've been frustrated thus far in my search.
I don't know any name for it. The simple but fairly efficient method I have used is to split the training data into two sets with identical properties, then train on one set and test on the other. After trying a few different sizes it usually becomes clear which size is most promising.
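
A sketch of that split-and-compare method, not from the thread itself: it uses scikit-learn's MLPClassifier purely as a convenient stand-in trainer and random stand-in data, since the real dataset isn't available; the candidate hidden sizes are arbitrary.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Split the data into two halves with similar class proportions, train on
    # one half, test on the other, and repeat for a few candidate network sizes.
    rng = np.random.default_rng(1)
    X = rng.integers(0, 2, size=(600, 8)).astype(float)   # 600 cases, 8 binary inputs (stand-in)
    y = (X.sum(axis=1) > 4).astype(int)                   # arbitrary stand-in target rule

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=0)  # two halves with "identical properties"

    for hidden in (2, 4, 8, 15):
        net = MLPClassifier(hidden_layer_sizes=(hidden,), activation="logistic",
                            max_iter=2000, random_state=0)
        net.fit(X_train, y_train)
        print(f"hidden={hidden:2d}  train acc={net.score(X_train, y_train):.2f}"
              f"  test acc={net.score(X_test, y_test):.2f}")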
Simon
Sat Jan 20 2007, 01:43AM
Simon Registered Member #32 Joined: Sat Feb 04 2006, 08:58AM
Location: Australia
Posts: 549
Bjørn wrote ...

The simple but fairly efficient method I have used is to split the training data into two sets with identical properties, then train on one set and test on the other.
This technique is becoming more popular for this problem, which is really a problem that affects all model fitting. It's nice because it lends itself to automation: generate a batch of models that fit one subset of your data, pick the best n, test these on the other set, and pick the best one.
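
Continuing the previous sketch (with the same caveats about stand-in data and arbitrary candidate sizes), the automated selection described here might look something like this:

    # Fit a batch of candidate nets on one half of the data, keep the best n by
    # training score, then choose the winner by its score on the held-out half.
    candidates = []
    for hidden in (2, 3, 4, 6, 8, 12, 15):
        for seed in (0, 1, 2):
            net = MLPClassifier(hidden_layer_sizes=(hidden,), activation="logistic",
                                max_iter=2000, random_state=seed)
            net.fit(X_train, y_train)
            candidates.append((net.score(X_train, y_train), net))

    best_n = sorted(candidates, key=lambda c: c[0], reverse=True)[:5]   # best n on the training half
    winner = max(best_n, key=lambda c: c[1].score(X_test, y_test))[1]   # best of those on the held-out half
    print("chosen hidden size:", winner.hidden_layer_sizes,
          " held-out accuracy:", round(winner.score(X_test, y_test), 2))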
AndrewM
Sun Jan 21 2007, 01:12AM
AndrewM Registered Member #49 Joined: Thu Feb 09 2006, 04:05AM
Location: Bigass Pile of Penguins
Posts: 362
Well, I'm actually doing just that. I have about 600 cases in my data pool. I take half and train on them, and after each training point I run the network on the selection set. I print out the residual error for the training set and the selection set, so I can monitor how the network is doing.

My problem is that I just can't make any headway. I have 8 input neurons and one output neuron. I started by formulating my data in binary, i.e. all inputs and outputs were 0 or 1. This only gave me 512 unique datapoints, however, and when training on only 300 cases, I'm sure most of them fell on the few most common points. I figured this lent itself to overtraining.

So I changed to sigmoid activation functions and sigmoid inputs (but my output data was still binary). Now the problem is that if I keep the number of hidden neurons high (like 15), the training error will drop very low, but the selection error never goes down. If I drop the hidden neurons lower, like to 4, then the error drops to 30% (i.e. 30% of the cases output an incorrect result, assuming that I'm generous and treat >.5 as a 1 and <.5 as a 0), but the selection error STILL doesn't budge. I'm at my wits' end; I guess my data truly is random.
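
A sketch of that monitoring loop, again with stand-in data and scikit-learn's MLPClassifier rather than whatever the actual setup here was; the 8 inputs, 15 hidden units and 50/50 split mirror the numbers in the post, everything else is arbitrary.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Train incrementally and print the error on the training half and on the
    # held-out "selection" half as training proceeds. Training error falling
    # while selection error stalls is the usual sign of overtraining.
    rng = np.random.default_rng(2)
    X = rng.random(size=(600, 8))                          # stand-in sigmoid-style inputs in [0, 1]
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)              # arbitrary stand-in binary target

    X_tr, X_sel, y_tr, y_sel = train_test_split(X, y, test_size=0.5, random_state=0)

    net = MLPClassifier(hidden_layer_sizes=(15,), activation="logistic",
                        learning_rate_init=0.1, random_state=0)
    for epoch in range(1, 201):
        net.partial_fit(X_tr, y_tr, classes=[0, 1])        # one pass over the training half
        if epoch % 40 == 0:
            print(f"epoch {epoch:3d}"
                  f"  train error {1 - net.score(X_tr, y_tr):.2f}"
                  f"  selection error {1 - net.score(X_sel, y_sel):.2f}")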
Bjørn
Sun Jan 21 2007, 05:18AM
Bjørn Registered Member #27 Joined: Fri Feb 03 2006, 02:20AM
Location: Hyperborea
Posts: 2058
The only other thing I can think of is that the input representation might not expose the information in a way that the training algorithm can exploit. You could try a Fourier, wavelet, or some other transformation of the data before you present it to the network. You could even try feeding it several different transforms at the same time.

Neural networks are very sensitive to the way the information is presented to them, but with only 8 bits of data it is hard to imagine it should make a big difference.
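
As a rough illustration of that idea (not from the thread), one could append a Fourier transform of each 8-value input vector to the raw values before training; whether any such re-representation helps depends entirely on the data.

    import numpy as np

    def with_fourier_features(X):
        # Append the real and imaginary parts of each row's FFT to the raw values.
        spectrum = np.fft.rfft(X, axis=1)          # 5 complex coefficients per 8-value row
        return np.hstack([X, spectrum.real, spectrum.imag])

    X = np.random.default_rng(3).random((600, 8))  # stand-in for the real data
    X_aug = with_fourier_features(X)
    print(X.shape, "->", X_aug.shape)              # (600, 8) -> (600, 18)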
