Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions through curvature-vector products that can be computed in roughly the same time as a gradient. In this paper we exploit this property and study stochastic HF for classification, with small gradient and curvature mini-batches whose sizes are independent of the dataset size. We modify Martens' HF for this setting and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. On classification tasks, stochastic HF trains faster than, and achieves results competitive with, dropout SGD, without the need to tune learning rates.
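For readers unfamiliar with the mechanics, here is a minimal sketch (not the paper's released code) of the two ingredients the abstract refers to: a generalized Gauss-Newton curvature-vector product and a short conjugate gradient run that turns it into an update direction. The toy model, data shapes, damping value, and iteration count are illustrative assumptions; it is written with JAX for automatic differentiation.

import jax
import jax.numpy as jnp

def net(w, x):
    # toy linear model with the parameters packed into one flat vector
    W = w[:-1].reshape(x.shape[1], 1)
    b = w[-1]
    return x @ W + b

def loss(z, y):
    # squared error stands in for the paper's classification losses
    return 0.5 * jnp.mean((z - y) ** 2)

def gauss_newton_vp(w, x, y, v, damping=1.0):
    # Gv = J^T H_L J v: one forward-mode and one reverse-mode pass, so the cost
    # is on the order of a single gradient evaluation
    z, Jv = jax.jvp(lambda w_: net(w_, x), (w,), (v,))
    HJv = jax.jvp(jax.grad(lambda z_: loss(z_, y)), (z,), (Jv,))[1]
    JtHJv = jax.vjp(lambda w_: net(w_, x), w)[1](HJv)[0]
    return JtHJv + damping * v          # Tikhonov damping, lambda * v

def conjugate_gradient(Av, b, x0, iters=10):
    # approximately solve A x = b; each iteration costs one curvature-vector product
    x, r = x0, b - Av(x0)
    p, rs = r, jnp.vdot(r, r)
    for _ in range(iters):
        Ap = Av(p)
        alpha = rs / jnp.vdot(p, Ap)
        x, r = x + alpha * p, r - alpha * Ap
        rs_new = jnp.vdot(r, r)
        p, rs = r + (rs_new / rs) * p, rs_new
    return x

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 5))      # small gradient/curvature mini-batch
y = jax.random.normal(key, (32, 1))
w = jnp.zeros(6)
g = jax.grad(lambda w_: loss(net(w_, x), y))(w)
step = conjugate_gradient(lambda v: gauss_newton_vp(w, x, y, v), -g, jnp.zeros_like(w))
w = w + step                             # one HF-style update direction applied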

Ryan Kiros 10 Feb 2013
Code is now available: http://www.ualberta.ca/~rkiros/ Included are scripts to reproduce the results in the paper.
Anonymous 0a71 01 Mar 2013
Summary and general overview:
----------------------------------------------
The paper explores an online regime for Hessian-free optimization combined with dropout. The new method is called stochastic Hessian-free and is tested on a few datasets (MNIST, USPS and Reuters). The approach is interesting, and it is a direction one might need to consider in order to scale to very large datasets.

Questions:
---------------
(1) An aesthetic point. "Stochastic Hessian-Free" does not seem like a suitable name for the algorithm, as it does not mention the use of dropout. I think scaling to a stochastic regime is an issue orthogonal to using dropout, so perhaps "Dropout Stochastic Hessian-Free" would be more suitable, or something similar that makes the reader aware of the use of dropout.

(2) Page 1, first paragraph. It is not clear to me that SGD scales well to large data. There are indications that SGD can suffer, for example, from under-fitting issues (see [1]) or early over-fitting (see [2]). I am not saying you are wrong, you are probably right, just that the sentence you use seems a bit strong, and we do not yet have evidence that SGD scales well to very large datasets, especially without the help of things like dropout (which might help with early over-fitting or other phenomena).

(3) Page 1, second paragraph. It is not clear to me that HF does not do well for classification. Is there proof of this somewhere? For example, in [3] a Hessian-free-like approach seems to do well on classification (note that the results are presented for natural gradient, but that paper shows HF is natural gradient due to the use of the generalized Gauss-Newton matrix).

(4) Page 3, paragraph after the formula. The R-operator is only needed to compute the product of the generalized Gauss-Newton approximation of the Hessian with some vector v. The product between the Hessian itself and a vector v can easily be computed as d sum((dC/dW)*v)/dW, i.e. without using the R-op (see the short sketch after this review).

(5) Page 4, third paragraph. I do not understand what you mean when you talk about the warm initialization of CG (or delta-momentum, as you call it). What does it mean that \hat{M}_\theta is positive? Why is that bad? I do not understand what the decay you use is supposed to do. Are you trying to strike a middle ground between starting CG from 0 and starting CG from the previously found solution? I feel a more detailed discussion is needed in the paper.

(6) Page 4, last paragraph. Why does using the same batch size for the gradient and for computing the curvature result in \lambda going to 0? This is not obvious to me. Is it some kind of over-fitting effect? If it is just an observation you made through empirical experimentation, say so; the current wording makes it sound like you expect this behaviour from some intuition.

(7) Page 5, section 4.3. I feel that the claim that dropout does not require early stopping is too strong; the evidence is too weak at the moment for such a statement. For one thing, \beta_e goes to 0 exponentially fast. \beta_e scales the learning rate, and it might be the reason you do not easily over-fit (by the time you reach epoch 50 or so you are using an extremely small learning rate). I feel it is better to present this as an observation. Also, could you say something about this decaying learning rate: is my understanding of \beta_e correct?

(8) I feel an important comparison would be between your version of stochastic HF with dropout, stochastic HF without dropout, and plain HF. From the plots you give, I am not sure what the gain is from going stochastic, nor is it clear to me that dropout is important. You seem to have the set-up to run these additional experiments easily.

Small corrections:
--------------------------
Page 1, paragraph 1: 'salable' -> 'scalable'.
Page 2, last paragraph. You wrote: 'B is a curvature matrix suc as the Hessian'. The curvature of a function f at theta is the Hessian (there is no choice); there is only one curvature for a given function and theta. There are different approximations of the Hessian (and hence a choice of B) but not different curvatures. I would write only 'B is an approximation of the curvature matrix' or 'B is the Hessian'.

References:
[1] Yann N. Dauphin and Yoshua Bengio. Big Neural Networks Waste Capacity. arXiv:1301.3583.
[2] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent and Samy Bengio. Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research, 11:625-660, 2010.
[3] Razvan Pascanu and Yoshua Bengio. Natural Gradient Revisited. arXiv:1301.3584.
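Regarding point (4): a hedged illustration of the double-backprop identity the reviewer cites, Hv = d sum((dC/dW)*v)/dW, on a toy objective. The function names, the objective, and the cross-check against the explicit Hessian are assumptions for the example, not anything from the paper.

import jax
import jax.numpy as jnp

def objective(w):
    # any smooth scalar cost C(w) works; this toy one keeps the example tiny
    return jnp.sum(jnp.sin(w) ** 2) + 0.1 * jnp.sum(w ** 4)

def hessian_vector_product(w, v):
    # Hv = d/dw [ <dC/dw, v> ]: differentiate the gradient-vector inner product,
    # no R-operator implementation required
    return jax.grad(lambda w_: jnp.vdot(jax.grad(objective)(w_), v))(w)

w = jnp.array([0.3, -1.2, 0.7])
v = jnp.array([1.0, 0.0, -1.0])
print(hessian_vector_product(w, v))
print(jax.hessian(objective)(w) @ v)   # cross-check against the explicit Hessian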
Anonymous 4709 04 Mar 2013
This paper makes an attempt at extending the Hessian-free learning work to a stochastic setting. In a nutshell, the changes are:
- shorter CG runs;
- cleverer information sharing across CG runs, which has an annealing effect;
- differently-sized mini-batches for gradient and curvature estimation (the former being larger);
- a slightly modified damping schedule for \lambda compared to Martens' LM criterion, which encourages fewer oscillations (the standard LM rule is sketched below for reference).

Another contribution of the paper is the integration of dropout into stochastic HF in a sensible way. The authors also include an exponentially-decaying momentum-style term in the parameter updates.

The authors present but do not discuss results on the Reuters dataset (which seem good). There is also no comparison with the results from [4], which to me would be a natural thing to compare against.

All in all, a series of interesting tricks for making HF work in a stochastic regime, but many questions remain unanswered. I would have liked to see more discussion *and* experiments showing which of the individual changes the author makes are responsible for the good performance. There is also no discussion of the time it takes the stochastic HF method to make one step / go through one epoch / reach a certain error. SGD with dropout is a very competitive method because it is fantastically simple to implement (compared to HF, which is orders of magnitude more complicated), so I am not yet convinced by the insights of this paper that stochastic HF is worth implementing (though it seems easy to do if one already has a running HF system).
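For context on the damping bullet above, here is a sketch of the standard Levenberg-Marquardt style adjustment from Martens' HF (thresholds 1/4 and 3/4, factor 3/2), which the review says the paper modifies; the surrounding variable names and the tiny usage example are assumptions, not the paper's code.

import jax.numpy as jnp

def update_damping(lam, f_new, f_old, step, grad, curvature_vp, boost=1.5):
    # reduction ratio rho: actual decrease in the objective versus the decrease
    # predicted by the local quadratic model q(p) = g^T p + 0.5 p^T B p
    predicted = jnp.vdot(grad, step) + 0.5 * jnp.vdot(step, curvature_vp(step))
    rho = (f_new - f_old) / predicted
    if rho < 0.25:
        lam *= boost       # model too optimistic: increase damping
    elif rho > 0.75:
        lam /= boost       # model accurate: decrease damping, trust longer steps
    return lam

# illustrative call with a 2x2 quadratic standing in for the curvature matrix
A = jnp.array([[2.0, 0.0], [0.0, 1.0]])
g = jnp.array([1.0, -1.0])
p = jnp.array([-0.4, 0.8])
lam = update_damping(1.0, f_new=0.2, f_old=0.5, step=p, grad=g,
                     curvature_vp=lambda v: A @ v)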
Anonymous f834 04 Mar 2013
Ryan Kiros 05 Mar 2013
Anonymous 0a71 17 Mar 2013
Regarding using HF for classification: my point was that the lack of results in the literature about classification error with HF might just be due to the fact that this is a new method, arguably hard to implement, and hence not many have had a chance to play with it. I am not sure that just using HF (the way James introduced it) would not do well on classification. I feel I did not make this clear in my original comment. I would just remove that statement. Looking back at [R2], I could not find a similar statement; it only says that empirically KSD seems to do better on classification. Also, I see you have not updated the arXiv paper. I would urge you to do so, even if you do not have all the new experiments ready. It would be helpful for us reviewers to see how you change the paper.
Ryan Kiros 18 Mar 2013
I have submitted an updated version to arXiv; it should appear shortly. My apologies for the delay. Following the suggestion of reviewer 0a71, I have renamed the paper to "Training Neural Networks with Dropout Stochastic Hessian-Free Optimization".
ICLR 2013 Conference Track 27 Mar 2013
Endorsed for poster presentation: Training Neural Networks with Stochastic Hessian-Free Optimization