Hessian-free (HF) optimization has been successfully used for training deep autoencoders and recurrent networks. HF uses the conjugate gradient algorithm to construct update directions from curvature-vector products that can be computed in roughly the same time as gradients. In this paper we exploit this property and study stochastic HF for classification, using small gradient and curvature mini-batches whose sizes are independent of the dataset size. We modify Martens' HF for this setting and integrate dropout, a method for preventing co-adaptation of feature detectors, to guard against overfitting. On classification tasks, stochastic HF achieves accelerated training and results competitive with dropout SGD, without the need to tune learning rates.
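To make the curvature-vector products concrete, here is a minimal sketch (not taken from the paper's Matlab code) for binary logistic regression, where the generalized Gauss-Newton matrix is G = X^T diag(p(1-p)) X and a damped product (G + lambda*I)v costs only two matrix-vector products, i.e. roughly the same as a gradient evaluation:

```python
import numpy as np

def grad_and_curvature_vec(w, X, y, v, lam=1.0):
    """Mini-batch gradient and damped Gauss-Newton-vector product for
    binary logistic regression with labels y in {0, 1}.

    For this model G = X^T diag(p*(1-p)) X, so (G + lam*I) v needs only
    two matrix-vector products -- the same order of cost as the gradient."""
    p = 1.0 / (1.0 + np.exp(-X @ w))       # predicted probabilities
    g = X.T @ (p - y) / len(y)             # gradient of the average log-loss
    s = p * (1.0 - p)                      # per-example curvature weights
    Gv = X.T @ (s * (X @ v)) / len(y) + lam * v
    return g, Gv

# tiny synthetic mini-batch
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 10))
y = (rng.random(64) < 0.5).astype(float)
g, Gv = grad_and_curvature_vec(np.zeros(10), X, y, rng.standard_normal(10))
print(g.shape, Gv.shape)
```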
State | From | To (Cc) | Subject | Date | Due
Fulfill | Ryan Kiros | ICLR 2013 Conference Track | Fulfilled: ICLR 2013 call for conference papers | 16 Jan 2013 |
Completed | Ryan Kiros | ICLR 2013 Conference Track | Request for Endorsed for oral presentation: Training Neural Networks with... | 16 Jan 2013 |
Reveal: document | Ryan Kiros | | Revealed: document: Training Neural Networks with Stochastic Hessian-Free... | 05 Feb 2013 |
Completed | Aaron Courville | Anonymous f834 | Request for review of Training Neural Networks with Stochastic Hessian-Free... | 05 Feb 2013 | 01 Mar 2013
Completed | Aaron Courville | Anonymous 4709 | Request for review of Training Neural Networks with Stochastic Hessian-Free... | 05 Feb 2013 | 01 Mar 2013
Completed | Aaron Courville | Anonymous 0a71 | Request for review of Training Neural Networks with Stochastic Hessian-Free... | 05 Feb 2013 | 01 Mar 2013
Reveal: document | ICLR 2013 Conference Track | | Revealed: document: Endorsed for poster presentation: Training Neural Networks... | 27 Mar 2013 |
Fulfill | ICLR 2013 Conference Track | Ryan Kiros | Fulfilled: Request for Endorsed for oral presentation: Training Neural Networks... | 27 Mar 2013 |

9 Comments

Ryan Kiros 10 Feb 2013
Code is now available at http://www.ualberta.ca/~rkiros/. Included are scripts to reproduce the results in the paper.
Anonymous 0a71 01 Mar 2013
Summary and general overview:
----------------------------------------------
The paper explores an online regime for Hessian-Free as well as the use of dropout. The new method is called Stochastic Hessian-Free and is tested on a few datasets (MNIST, USPS and Reuters). The approach is interesting, and it is a direction one might need to consider in order to scale to very large datasets.

Questions:
---------------
(1) An aesthetic point. Stochastic Hessian-Free does not seem a suitable name for the algorithm, as it does not mention the use of dropout. I think scaling to a stochastic regime is an issue orthogonal to using dropout, so maybe Dropout Stochastic Hessian-Free would be more suitable, or something similar that makes the reader aware of the use of dropout.

(2) Page 1, first paragraph. It is not clear to me that SGD scales well for large data. There are indications that SGD could suffer, e.g., from underfitting issues (see [1]) or early overfitting (see [2]). I'm not saying you are wrong, you are probably right; just that the sentence you use seems a bit strong, and we do not yet have evidence that SGD scales well to very large datasets, especially without the help of things like dropout (which might help with early overfitting or other phenomena).

(3) Page 1, second paragraph. It is not clear to me that HF does not do well for classification. Is there some proof of this somewhere? For example, in [3] a Hessian-Free-like approach seems to do well on classification (note that the results are presented for Natural Gradient, but the paper shows that Hessian-Free is Natural Gradient due to the use of the generalized Gauss-Newton matrix).

(4) Page 3, paragraph after the formula. The R-operator is only needed to compute the product of the generalized Gauss-Newton approximation of the Hessian with some vector `v`. The product between the Hessian and some vector `v` can easily be computed as d sum((dC/dW)*v)/dW (i.e. without using the R-op).

(5) Page 4, third paragraph. I do not understand what you mean when you talk about the warm initialization of CG (or delta-momentum, as you call it). What does it mean that \hat{M}_\theta is positive? Why is that bad? I don't understand what the decay you use is supposed to do. Are you trying to find some middle ground between starting CG from 0 and starting CG from the previously found solution? I feel a more detailed discussion is needed in the paper.

(6) Page 4, last paragraph. Why does using the same batch size for the gradient and for computing the curvature result in \lambda going to 0? It is not obvious to me. Is it some kind of overfitting effect? If it is just an observation you made through empirical experimentation, just say so; the wording makes it sound like you expect this behaviour due to some intuition you have.

(7) Page 5, section 4.3. I feel that the claim that dropout does not require early stopping is too strong; the evidence is too weak at the moment for this to be a statement. For one thing, \beta_e goes exponentially fast to 0. \beta_e scales the learning rate, and it might be the reason you do not easily overfit (by the time you reach epoch 50 or so you are using an extremely small learning rate). I feel it is better to present this as an observation. Also, could you say something about this decaying learning rate? Is my understanding of \beta_e correct?

(8) I feel an important comparison would be between your version of stochastic HF with dropout vs. stochastic HF (without dropout) vs. just HF. From the plots you give, I'm not sure what the gain is from going stochastic, nor is it clear to me that dropout is important. You seem to have the set-up to run these additional experiments easily.

Small corrections:
--------------------------
Page 1, paragraph 1: 'salable' -> 'scalable'.
Page 2, last paragraph. You wrote: 'B is a curvature matrix suc as the Hessian'. The curvature of a function `f` at theta is the Hessian (there is no choice), and there is only one curvature for a given function and theta. There are different approximations of the Hessian (and hence you have a choice of B), but not different curvatures. I would write only 'B is an approximation of the curvature matrix' or 'B is the Hessian'.

References:
[1] Yann N. Dauphin, Yoshua Bengio. Big Neural Networks Waste Capacity. arXiv:1301.3583.
[2] Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent and Samy Bengio. Why Does Unsupervised Pre-training Help Deep Learning? Journal of Machine Learning Research, 11:625-660, 2010.
[3] Razvan Pascanu, Yoshua Bengio. Natural Gradient Revisited. arXiv:1301.3584.
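To make the identity in question (4) concrete: the Hessian-vector product is the directional derivative of the gradient along v, so it never requires forming the Hessian and can even be sanity-checked with two extra gradient evaluations. A small self-contained check on a toy objective (an illustration only, not code from the paper):

```python
import numpy as np

# Toy objective f(w) = 0.5 w^T A w + sum(w^4), chosen so the gradient and
# Hessian have simple closed forms to compare against.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
A = 0.5 * (A + A.T)                          # symmetrize

grad = lambda w: A @ w + 4.0 * w ** 3
hess = lambda w: A + np.diag(12.0 * w ** 2)

w = rng.standard_normal(8)
v = rng.standard_normal(8)

# Matrix-free Hv: a central difference of the gradient along v.
eps = 1e-5
Hv = (grad(w + eps * v) - grad(w - eps * v)) / (2.0 * eps)

print(np.allclose(Hv, hess(w) @ v, atol=1e-6))   # True
```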
Anonymous 4709 04 Mar 2013
This paper makes an attempt at extending the Hessian-free learning work to a stochastic setting. In a nutshell, the changes are:
- shorter CG runs
- cleverer information sharing across CG runs that has an annealing effect
- using differently-sized mini-batches for gradient and curvature estimation (the former being larger)
- using a damping schedule for \lambda that is slightly modified from Martens' LM criterion, which encourages fewer oscillations.

Another contribution of the paper is the integration of dropout into stochastic HF in a sensible way. The authors also include an exponentially-decaying momentum-style term in the parameter updates.

The authors present but do not discuss results on the Reuters dataset (which seem good). There is also no comparison with the results from [4], which to me would be a natural thing to compare to.

All in all, a series of interesting tricks for making HF work in a stochastic regime, but there are many questions which are unanswered. I would have liked to see more discussion *and* experiments that show which of the individual changes the author makes are responsible for the good performance. There is also no discussion of the time it takes the stochastic HF method to make one step / go through one epoch / reach a certain error. SGD dropout is a very competitive method because it's fantastically simple to implement (compared to HF, which is orders of magnitude more complicated), so I'm not yet convinced by the insights of this paper that stochastic HF is worth implementing (though it seems easy to do if one has an already-running HF system).
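For readers who want a feel for how the pieces listed above fit together, here is a rough sketch of one stochastic HF step under assumptions about the details (a larger gradient mini-batch, a smaller curvature mini-batch, a few CG iterations, and a warm start from the previous CG solution scaled by a decay gamma). The names grad_fn and curvature_vec_fn are placeholders; the paper's released Matlab code is the authoritative reference.

```python
import numpy as np

def truncated_cg(apply_A, b, x0, max_iters=5):
    """Plain conjugate gradient, cut off after a handful of iterations.
    apply_A(v) should return the damped curvature-vector product (B + lam*I) v."""
    x = x0.copy()
    r = b - apply_A(x)
    d = r.copy()
    rr = r @ r
    for _ in range(max_iters):
        Ad = apply_A(d)
        alpha = rr / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        rr_new = r @ r
        d = r + (rr_new / rr) * d
        rr = rr_new
    return x

def stochastic_hf_step(w, grad_batch, curv_batch, x_prev, lam, gamma,
                       grad_fn, curvature_vec_fn, max_cg=5):
    """One hypothetical stochastic HF update.

    grad_fn(w, batch)             -> mini-batch gradient (larger batch)
    curvature_vec_fn(w, batch, v) -> curvature-matrix-vector product B v (smaller batch)
    CG is warm-started from gamma times the previous CG solution."""
    g = grad_fn(w, grad_batch)
    apply_A = lambda v: curvature_vec_fn(w, curv_batch, v) + lam * v
    x = truncated_cg(apply_A, -g, gamma * x_prev, max_iters=max_cg)
    return w + x, x
```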
Anonymous f834 04 Mar 2013
This paper looks at designing an SGD-like version of the "Hessian-free" (HF) optimization approach, applied to training shallow to moderately deep neural nets for classification tasks. The approach consists of the usual HF algorithm, but with smaller minibatches and with CG terminated after only 3-5 iterations. As advocated in [20], more careful attention is paid to the "momentum-constant" \gamma.

It is somewhat interesting to see a very data-intensive method like HF made "lighter" and more SGD-like, since this could perhaps provide benefits unique to both HF and SGD, but it's not clear to me from the experiments whether there really is an advantage over variants of SGD that would perform some kind of automatic adaptation of learning rates (or even a fixed schedule!). The amount of novelty in the paper isn't particularly high, since many of these ideas have been proposed before ([20]), although perhaps in less extreme or less developed forms.

Pros:
- takes the well-known approach HF in a different (if not entirely novel) direction
- seems to achieve performance competitive with the versions of SGD with dropout used in [3]

Cons:
- experiments don't look at particularly deep models and aren't very thorough
- comparisons to other versions of SGD are absent (this is my primary issue with the paper)

----

The introduction and related work section should probably clarify that HF is an instance of the more general family of methods sometimes known as "truncated-Newton methods".

In the introduction, when you state that "HF has not been as successful for classification tasks", is this based on your personal experience, particularly negative results in other papers, or a lack of positive results in other papers?

Missing from your review are papers that look at the performance of pure stochastic gradient descent applied to learning deep networks, as [15] did, and the paper by Glorot and Bengio from AISTATS 2010. Also, [18] only used L-BFGS to perform "fine-tuning" after an initial layer-wise pre-training pass. When discussing the generalized Gauss-Newton matrix you should probably cite [7].

In section 4.1, it seems like a big oversimplification to say that the stopping criterion and overall convergence rate of CG depend mostly on the damping parameter lambda. Surely other things matter too, like the current setting of the parameters (which determine the local geometry of the error surface). A high value of lambda may be a sufficient condition, but surely not a necessary one, for CG to converge quickly. Moreover, missing from the story presented in this section is the fact that lambda *must* decrease if the method is ever to behave like a reasonable approximation of a Newton-type method.

The momentum interpretation discussed in the middle of section 4, and overall the algorithm discussed in this paper, sounds similar to ideas discussed in [20] (which were perhaps not fully explored there). Also, a maximum number of CG iterations was used in the original HF paper (although it only appeared in the implementation, and was later discussed in [20]). This should be mentioned.

Could you provide a more thorough explanation of why lambda seems to shrink, then grow, as optimization proceeds? The explanation in 4.2 seems vague/incomplete.

The networks trained seem pretty shallow (especially for Reuters, which didn't use any hidden layers). Is there a particular reason why you didn't make them deeper? E.g. were deeper networks overfitting more, or perhaps underfitting due to optimization problems, or simply not providing any significant advantage for some other reason?

SGD is already known to be hard to beat for these kinds of not-very-deep classification nets, and while it seems plausible that the much more SGD-like HF which you are proposing would have some advantage in terms of automatic selection of learning rates, it invites comparison to other methods which do this kind of learning rate tuning more directly (some of which you even discuss in the paper). The lack of these kinds of comparisons seems like a serious weakness of the paper.

How important to your results was the use of this "delta-momentum" with the particular schedule of values for gamma that you used? Since this behaves somewhat like a regular momentum term, did you also try using momentum in your SGD implementation to make the comparison more fair?

The experiments use dropout, but comparisons to implementations that don't use dropout, or that use some other kind of regularization instead (like L2), are noticeably absent. In order to understand the effect of dropout versus that of the optimization method in these models, it is important to see this.

I would have been interested to see how well the proposed method works when applied to very deep nets or RNNs, where HF is thought to have an advantage that is perhaps more significant/interesting than what could be achieved with well-tuned learning rates.
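For context on the damping discussion, the Levenberg-Marquardt heuristic from Martens' original HF paper adjusts lambda using the reduction ratio, i.e. the actual decrease in the objective divided by the decrease predicted by the local quadratic model. A rough sketch (constants as in the original paper, to the best of my knowledge):

```python
def update_damping(lam, f_new, f_old, delta, g, B_delta):
    """Levenberg-Marquardt style adjustment of the damping parameter.

    delta   : update direction produced by CG
    g       : mini-batch gradient at the old parameters
    B_delta : curvature-matrix-vector product B @ delta"""
    # decrease predicted by the quadratic model q(delta) = g.delta + 0.5 delta.B.delta
    predicted = float(g @ delta + 0.5 * delta @ B_delta)
    rho = (f_new - f_old) / predicted        # reduction ratio
    if rho < 0.25:                           # model too optimistic: damp more
        lam *= 1.5
    elif rho > 0.75:                         # model accurate: damp less
        lam *= 2.0 / 3.0
    return lam, rho
```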
Ryan Kiros 05 Mar 2013
Thank you for your comments!

To Anonymous 0a71:
---------------------------------
(1, 8): I agree. Indeed, it is straightforward to add an additional experiment without the use of dropout. At the least, the experimental section can be modified to indicate whether the method is using dropout or not, instead of simply referring to "stochastic HF".

(2): Fair point. It would be interesting to try this method out in a similar experimental setting as [R1]. Perhaps it may give some insight on that paper's hypothesis that the optimization is the culprit behind underfitting.

(3): Correct me if I'm wrong, but the only classification results for HF I'm aware of are from [R2], in comparison with Krylov subspace descent, not including methods that refer to themselves as natural gradient. Minibatch overfitting in batch HF is problematic and discussed in detail in [R5], pg. 50. Given the development of [R3], the introduction could be modified to include additional discussion of the relationship with natural gradient and classification settings.

(5): Section 4.5 of [R4] discusses the benefits of non-zero CG initializations. In batch HF, it is completely reasonable to fix \gamma throughout training (James uses 0.95). This is problematic in stochastic HF due to the small number of CG iterations. Given a non-zero CG initialization and a near-one \gamma, \hat{M}_\theta may be more likely to remain positive after CG, and assuming f_k - f_{k-1} < 0, this means that the reduction ratio will be negative and thus \lambda will be increased to compensate. This is not necessarily a bad thing, although if it happens too frequently the algorithm will begin to behave more like SGD (and in some cases the linesearch will reject the step). Setting \gamma to a smaller initial value and incrementing it at each epoch, based on empirical performance, allows for near-one \delta values late in training without negating the reduction ratio. I refer the reader to pg. 28 and pg. 39 of [R5], which give further motivation and discussion on these topics.

(6): Using the same batches for gradients and curvature has some theoretical advantages (see section 12.1, pg. 48 of [R5] for derivations). While \lambda -> 0 is indeed an empirical observation, James and Ilya also report similar behaviour for shorter CG runs (although longer than what I use) when using the same batches for gradients and curvature (pg. 54 of [R5]). Within the proposed stochastic setting, having \lambda -> 0 doesn't make too much sense to me (at least for non-convex f). It could allow for much more aggressive steps, which may or may not be problematic given how small the curvature minibatches are. One solution is to simply increase the batch sizes, although this was something I was intending to avoid.

(7): The motivation behind \beta_e was to help achieve more stable training over the stochastic networks induced by dropout. You are probably right that "not requiring early stopping" is far too strong a statement.

To Anonymous 4709:
---------------------------------
Due to the additional complexity of HF compared to SGD, I attempted to make my available (Matlab) code as easy as possible to read and follow in order to understand and reproduce the key features of the method. While an immediate advantage of stochastic HF is not requiring tuned learning rate schedules, I think it is also a promising approach for further investigating the effects of overfitting and underfitting with optimization in neural nets, as [R1] motivates. The experimental evaluation does not attack this particular problem, as the goal was to make sure stochastic HF was at least competitive with SGD dropout on standard benchmarks. This, to me, was necessary to justify further experimentation.

There is no comparison with the results of [R4] since the goal of the paper was to focus on classification (and [R4] only trains deep autoencoders). Future work includes extending to other architectures, as discussed in the conclusion. I mention on pg. 7 that the per-epoch update times were similar to SGD dropout (I realize this is not particularly rigorous). In regards to evaluating each of the modifications, I had hoped that the discussion was enough to convey the importance of each design choice. I realize now that I may have assumed too much familiarity with the material discussed in [R5]. These details will be made clear in the updated version of the paper, with appropriate references.

To Anonymous f834:
--------------------------------
- Thanks for the reference clarifications. In regards to classification tasks, see (3) in my response to Anonymous 0a71.
- Indeed, much of the motivation of the algorithm, particularly the momentum interpretation, came from studying [R5], which expands on HF concepts in significantly more detail than the first publications allowed for. I will be sure to make this clearer in the relevant sections of the paper.
- I agree that not comparing against other adaptive methods is a weakness, and I discussed this briefly in the conclusion. To compensate, I tried to use an SGD implementation that would at least be competitive (dropout, max-norm weight clipping with large initial rates, momentum and learning rate schedules). Weight clipping was also shown to improve SGD dropout, at least on MNIST [R6].
- Unfortunately, I don't have much more insight into the behaviour of \lambda, though it appears to be quite consistent. The large initial decrease likely comes from the conservative initialization of \lambda, which works well as a default.
- I did not test on deeper nets largely due to time constraints (it made more sense to me to start on shallower networks than to "jump the gun" and go straight for very deep nets). Should I not have done this? As alluded to in the conclusion, I wouldn't expect any significant gain on these datasets (perhaps I'm wrong here). It would be interesting to try on some speech data, where deeper nets have made big improvements, but I haven't worked with speech before. Reuters didn't use hidden layers due to the high dimensionality of the inputs (~19000 log word count features). Applying this to RNNs is a work in progress.

----------------------------------------------
To summarize (modifications for the paper update):
- include additional references
- add results for stochastic HF with no dropout
- some additional discussion of the relationship with natural gradient (and classification results)
- better detail in section 4, including additional references to [R5]

These modifications will be made by the start of next week (March 11).

One additional comment: after looking over [R6], I realized the MNIST dropout SGD results (~110 errors) were due to a combination of dropout and the max-norm weight clipping, and not dropout alone. I have recently been exploring the use of weight clipping with stochastic HF, and it is advantageous to include it. This is because it allows one to start training with smaller \lambda values, likely in the same sense that it allows SGD to start with larger learning rates. I will be updating the code shortly to include this option.

[R1] Yann N. Dauphin, Yoshua Bengio. Big Neural Networks Waste Capacity. arXiv:1301.3583.
[R2] O. Vinyals and D. Povey. Krylov Subspace Descent for Deep Learning. arXiv:1111.4259, 2011.
[R3] Razvan Pascanu, Yoshua Bengio. Natural Gradient Revisited. arXiv:1301.3584.
[R4] J. Martens. Deep Learning via Hessian-Free Optimization. In ICML, 2010.
[R5] J. Martens and I. Sutskever. Training Deep and Recurrent Networks with Hessian-Free Optimization. Neural Networks: Tricks of the Trade, pages 479-535, 2012.
[R6] N. Srivastava. Improving Neural Networks with Dropout. Master's thesis, University of Toronto, 2013.
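For reference, the max-norm weight clipping mentioned here ([R6]) is usually applied after each parameter update by projecting each hidden unit's vector of incoming weights back onto an L2 ball of radius c. A minimal sketch (the radius c = 3.0 is an arbitrary placeholder):

```python
import numpy as np

def clip_max_norm(W, c=3.0):
    """Rescale any column of W whose L2 norm exceeds c.
    Assumes W has shape (n_inputs, n_hidden), so column j holds the
    incoming weights of hidden unit j."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale
```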
Anonymous 0a71 17 Mar 2013
Regarding using HF for classification: my point was that the lack of results in the literature about classification error with HF might just be due to the fact that this is a new method, arguably hard to implement, and hence not many people have had a chance to play with it. I'm not sure that just using HF (the way James introduced it) would not do well on classification. I feel I didn't make this clear in my original comment. I would just remove that statement. Looking back at [R2], I couldn't find a similar statement; it only says that empirically KSD seems to do better on classification.

Also, I see you have not updated the arXiv paper. I would urge you to do so, even if you do not have all the new experiments ready. It would be helpful for us reviewers to see how you have changed the paper.
Ryan Kiros 18 Mar 2013
I have submitted an updated version to arXiv; it should appear shortly. My apologies for the delay. Following the suggestion of reviewer 0a71, I've renamed the paper to "Training Neural Networks with Dropout Stochastic Hessian-Free Optimization".
ICLR 2013 Conference Track 27 Mar 2013
Endorsed for poster presentation: Training Neural Networks with Stochastic Hessian-Free Optimization
Ryan Kiros 31 Mar 2013
I want to say thanks again to the conference organizers, reviewers and openreview.net developers for doing a great job. I have updated the code on my webpage to include two additional features: max-norm weight clipping and training deep autoencoders. Autoencoder training uses symmetric encoding/decoding and supports denoising and L2 penalties.
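As a sketch of what symmetric (tied-weight) encoding/decoding with optional denoising and an L2 penalty typically looks like (an illustration only; the released Matlab code may organize this differently):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def tied_autoencoder_loss(W, b_enc, b_dec, X, noise=0.2, l2=1e-4):
    """Reconstruction loss for a one-hidden-layer autoencoder with tied
    (symmetric) weights: the decoder reuses W transposed. Masking noise on
    the input gives a denoising variant; l2 adds a weight penalty."""
    X_in = X * (rng.random(X.shape) >= noise) if noise > 0 else X
    H = sigmoid(X_in @ W + b_enc)                   # encode
    R = sigmoid(H @ W.T + b_dec)                    # decode with the transposed weights
    recon = np.mean(np.sum((R - X) ** 2, axis=1))   # compare to the clean input
    return recon + l2 * np.sum(W ** 2)

# example: 100 inputs of dimension 20, 8 hidden units
X = rng.random((100, 20))
W = 0.1 * rng.standard_normal((20, 8))
print(tied_autoencoder_loss(W, np.zeros(8), np.zeros(20), X))
```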
Ryan Kiros 26 Apr 2013
Dear reviewers,

To better address the weaknesses mentioned above, I've re-implemented SHF with GPU compatibility and evaluated the algorithm on the CURVES and MNIST deep autoencoder tasks. I'm using the same setup as in Chapter 7 of Ilya Sutskever's PhD thesis, which allows for comparison against SGD, HF, Nesterov's accelerated gradient and momentum methods. I'm going to make one final update to the paper before the conference to include these new results.
