Unit Tests for Stochastic Optimization
Tom Schaul, Ioannis Antonoglou, David Silver
23 Dec 2013 · arXiv · 6 Comments
Conference Track
Optimization by stochastic gradient descent is an important component of many large-scale machine learning algorithms. A wide variety of such optimization algorithms have been devised; however, it is unclear whether these algorithms are robust and widely applicable across many different optimization landscapes. In this paper we develop a collection of unit tests for stochastic optimization. Each unit test rapidly evaluates an optimization algorithm on a small-scale, isolated, and well-understood difficulty, rather than in real-world scenarios where many such issues are entangled. Passing these unit tests is not sufficient, but absolutely necessary for any algorithms with claims to generality or robustness. We give initial quantitative and qualitative results on a dozen established algorithms. The testing framework is open-source, extensible, and easy to apply to new algorithms.
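As a concrete illustration of the idea, a unit test pairs one small, well-understood loss prototype with a noise prototype and checks that an optimizer handles the combination reliably. The Python sketch below is purely illustrative: the names `quadratic_prototype`, `sgd`, and `run_unit_test`, and the pass criterion, are made up for this example and are not the framework's actual API.

```python
import random

def quadratic_prototype(scale=1.0, noise_std=1.0):
    """A 1-D prototype loss f(x) = scale * x^2 with additive Gaussian
    gradient noise -- one small, well-understood difficulty in isolation."""
    def loss(x):
        return scale * x * x
    def noisy_grad(x):
        return 2.0 * scale * x + random.gauss(0.0, noise_std)
    return loss, noisy_grad

def sgd(noisy_grad, x0=1.0, lr=0.1, steps=500):
    """Plain SGD; stands in for any algorithm under test."""
    x = x0
    for _ in range(steps):
        x -= lr * noisy_grad(x)
    return x

def run_unit_test(optimizer, n_runs=100, tol=0.5):
    """Pass if, across repeated runs, the optimizer reliably ends up
    much closer to the optimum (x* = 0) than where it started (x0 = 1)."""
    loss, noisy_grad = quadratic_prototype()
    finals = [abs(optimizer(noisy_grad)) for _ in range(n_runs)]
    return sum(f < tol for f in finals) / n_runs > 0.9

print("SGD passes the noisy-quadratic test:", run_unit_test(sgd))
```

Each such test isolates one difficulty (here, additive noise on a convex bowl); the full suite combines many prototype shapes and noise types in the same spirit.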
6 Comments

Anonymous 5afd 06 Feb 2014
This paper bravely proposes to test the empirical convergence of stochastic optimization algorithms using a vast collection of simple and relatively standardized tests. The authors explain how they construct the tests and perform experiments that lead to a striking visualization (Figure 5). Unfortunately, none of the compared algorithms appears to solve all the problems robustly. This idea could appear naïve because it is not supported by theoretical considerations and represents a purely empirical perspective. However, there are many reasons to consider that this idea has great potential. First, similar ideas have worked in other fields: it is customary to compare general optimization codes on a collection of well-known benchmark problems, not because it provides a guarantee, but because it provides a sanity check. Second, we must recognize that optimization in deep learning systems is still beyond the reach of theoretical analysis; according to current theoretical knowledge, it should not work. Therefore the best way to investigate such algorithms remains a well-designed collection of empirical comparisons. The comparison described in this paper is a good start in that direction.
Tom Schaul 11 Feb 2014
Thank you for your comments!

> The existence of some prototype in real data is easy to assert, but the probability that you will have to deal with said prototype is hard.

Indeed. We try to argue that the scenarios covered by the proposed unit tests could occur, not that they necessarily occur a lot. But a good algorithm should behave robustly on most of them when they do occur.

> While the paper mentions that there are tools for visualizing these results, there are not many details given about these tools. I think an interesting question on its own is what the right visualization of the results is and how to interpret it, a question which I don't think is fully answered by the current work.

In fact, Figure 5 is such a proposed visualization, and it is described in section 3 (fuller details are available in the published code). It is designed to provide a quick qualitative overview and flag potential weaknesses of an algorithm.
Anonymous b820 07 Feb 2014
Summary
-------
This paper introduces a suite of unit tests for optimization algorithms that attempts to analyze algorithm performance in very specific and isolated cases (for example: how does the algorithm deal with a saddle point or a cliff-like structure in the error surface?). This work feels like a needed step in the right direction. As the authors point out, passing these tests is not sufficient to claim that one algorithm is better than another, but it is necessary for understanding possible weaknesses of an algorithm and for a more proper comparison between different algorithms (compared to just looking at the error curve for some randomly selected task and model).

Comments
--------
I think the success of such an approach lies in the details. The engineering effort required to use such a framework to test a new algorithm will have a huge impact on whether these unit tests are successful or not. Unfortunately, some of these things are hard to predict right now (and in some sense are beside the point of the paper).

The other thing one has to keep in mind is the effect of high-dimensional spaces on the problem we try to solve. For example, one can prove that for some family of models (say, autoencoders) on some bounded domain a given error function will have saddle points regardless of the dimensionality of the input. However, the distribution of these saddle points is a different story, and it might be hard to say how important it is for this specific model to deal with saddle points. Basically, what I am trying to say is that the existence of some prototype in real data is easy to assert, but the probability that you will have to deal with said prototype is hard to establish. This is not a criticism of this work, but rather a message that anyone should keep in mind: interpreting the results of these unit tests is far from trivial and might even be counter-intuitive.

Another important detail for any such suite of unit tests is the tooling one can use to investigate the results. In the unit-test metaphor from code development, the result of a large suite of tests is usually interpreted as failure (and how many tests failed) or success. Running such a suite of experiments on an optimization task is not as easy to interpret. First of all, it is hard to know which tests we care most about, and it is hard to know how correlations between failures on different tests affect the algorithm on a real task. While the paper mentions that there are tools for visualizing these results, not many details are given about them. I think an interesting question in its own right is what the right visualization of the results is and how to interpret it, a question which I don't think is fully answered by the current work.
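As an illustration of the saddle-point case this review mentions, here is a minimal sketch in Python of what such a unit test could look like; the names (`saddle_prototype`, `escapes_saddle`) and the pass criterion are illustrative assumptions, not the framework's actual API.

```python
import random

def saddle_prototype(noise_std=0.1):
    """2-D saddle f(x, y) = x^2 - y^2: a minimum along x, a maximum
    along y. A good optimizer should escape along the y direction."""
    def noisy_grad(x, y):
        gx = 2.0 * x + random.gauss(0.0, noise_std)
        gy = -2.0 * y + random.gauss(0.0, noise_std)
        return gx, gy
    return noisy_grad

def escapes_saddle(lr=0.05, steps=200):
    """Start near the saddle at the origin; pass if |y| grows,
    i.e. the optimizer moves off the saddle instead of stalling."""
    noisy_grad = saddle_prototype()
    x, y = 0.01, 0.01
    for _ in range(steps):
        gx, gy = noisy_grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return abs(y) > 1.0

print("SGD escapes the saddle:", escapes_saddle())
```

Because this saddle has a strictly negative-curvature direction, even plain SGD with a little noise drifts off it geometrically fast; the harder cases in practice are plateau-like saddles where the escape direction has near-zero curvature.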
Anonymous abdc 11 Feb 2014
This paper looks at developing "unit tests" for stochastic optimization algorithms, which consist of toy objective functions and corresponding gradient noise that are assembled randomly from a suite of various components. The hope is that these tests would allow one to analyse the specific failure cases of various methods, perhaps in order to inform improvements to them.

This paper is mostly just engineering, but the authors seem to have created a fairly versatile tool for generating toy problems with many different characteristics, which may well be quite useful in the future. The paper itself doesn't contain much in the way of useful conclusions about the optimization algorithms (mostly different versions of SGD) that are tried. I also have several issues with various aspects of the unit tests themselves, and am not fully convinced that they are testing the "right" kind of thing, or that these tests can tell us much of use about the optimization problems we really care about. It would have been nice if the paper had demonstrated how these tests could actually inform algorithm design. Nonetheless, I would recommend that it be accepted. I hope the authors can address some of the issues I've brought up.

Detailed comments:

Are these optimization surfaces unimodal? If not, couldn't it be the case that some optimization methods will simply get "lucky" and bounce their way into a better local basin, whereas others that might be more careful about remaining stable in the face of noise will miss these? This seems to me like it might not be a realistic analogy for the kinds of optimizations we care about, where multimodal landscapes with multiple modes of highly variable quality don't seem to exist. I'm thinking about neural network optimization in particular here, and going on my experience that the lowest achievable error typically doesn't vary all that much from run to run (if at all).

Another major issue I have is that by testing local convergence you are really only testing methods in a single, and likely not too important, phase of optimization: fine convergence to a local minimum. Often in these cases the method that can deal best with the noise wins. I'm not sure this is the best analogy to what happens in deep net optimization, where fine convergence to a local minimum seems to be only a small and not particularly important phase (one often just associated with overfitting). I would wager that the most significant aspect of optimization is the journey towards a local minimum from very far away, along a very curvy path that can possibly lead to other local minima of roughly equivalent quality. It is not clear to me that your unit tests properly capture this aspect of optimization by focusing either on fine convergence to local optima or on the tendency of an optimizer to jump out of one local minimum into a closely situated and much higher-quality one.

Page 1: In what sense do you mean that these weaknesses are separate from "raw performance"? If an algorithm is performing well on some task, why would I care if it has some invisible issues on that task, as long as it appears to be working? I think I'm misinterpreting what you meant here.

Page 1: I don't really understand what you are saying in the 'locally' paragraph. Are you saying that you want to assume local optimization because your unit tests, if they are to be run quickly, can only consider toy examples with a simple local structure (unlike a neural network, which has a "global structure")? And what does this have to do with non-stationary algorithms (or do you mean non-stationary objective functions?) or "initialization bias"? Don't you still need to initialize the algorithms in the unit tests, and won't this choice affect the relative performance of different methods?

Page 3: With these noise prototypes, is it obvious that their mean recovers the gradient (i.e. that they are unbiased)? For the additive case this is clear, since the noise has mean 0, but it is less clear for multiplicative noise. I suppose if the average scale is alpha, the multiplicative noise would on average estimate alpha*gradient. This is an important point, since the stochastic gradients need to be unbiased for many stochastic optimization algorithms to work, at least in theory. Cauchy noise, for example, doesn't even have a mean, so this seems a bit problematic, although I suppose for certain choices of its parameters the distribution is at least centred around 0. Perhaps this might be good enough in practice; is this what you did? The mask-out noise is unbiased as far as I can tell. You really should discuss the issue of unbiasedness of your noise in general, as it is very important to the theory of these methods.

Page 4: How are you generating these random rotations? Are they just random orthonormal matrices? Generated how?

Page 4: It is not clear to me why gradient descent should work at all when it is given vectors from a vector field that is not actually a gradient field. And assuming it does work for certain such fields, the reasons are probably subtle and highly situation-dependent. I don't imagine that this can be easily simulated by applying a fixed rotation to the gradient. In fact, I can't even imagine how gradient descent could ever converge if it were given gradients with a fixed rotation. For example, you could just permute the coordinates: that is a valid rotation, but surely it would cause most reasonable algorithms to fail catastrophically. Also, you should explain the concept of curl etc. for this audience.

Page 5: What do you mean by the sentence "However, non-stationary optimization can even be important in large stationary tasks, when the algorithm itself chooses to track a particular aspect of the problem, rather than solve the problem globally"? What do you mean by "aspect", "track", and "globally" here? This goes back to my previous question.

Page 5: I think a better way of simulating non-stationarity of the objective function would perhaps be to randomly jiggle or move the various shape and scale parameters that define your objective. Mere translation doesn't seem particularly hard to deal with, or realistic.

Page 6: Why didn't you test the method(s) from [9]?

Page 6: Double the progress of vanilla SGD doesn't seem particularly 'excellent' to me; perhaps merely 'good'.

Page 7: What are these "groups"? Where do you describe what they are?

Page 7: So if I understand correctly, the vertical axis gives the different methods with different hyperparameter choices for each method? Many of these methods, such as Nesterov's accelerated gradient, have automatically adapted hyperparameters, which is sometimes done according to a fixed schedule. The momentum parameter in particular usually isn't supposed to be constant, at least in theory. And in general, for stochastic optimization methods there usually are no guarantees with a fixed learning rate. Some methods like ADAGRAD implicitly anneal the learning rate, while with others, like plain SGD or accelerated-gradient SGD, you need to reduce the learning rate adaptively, or at least with a schedule, if you plan to achieve fine convergence to a local minimum. Are you keeping these fixed, or are you instead varying the hyper-hyperparameters of the methods that adjust the hyperparameters? Are methods with "thicker" regions ones with more adjustable hyperparameters?

It seems to me that it doesn't really make sense to plot the values for all hyperparameters when some are clearly crazy. If a method has lots of hyperparameters, it shouldn't be judged 'less robust' because it diverges for certain crazy choices of these. And it isn't always true that we have no way of determining good hyperparameters for methods that have them: binary search, or perhaps Bayesian optimization, is certainly better than an exhaustive sweep. More simply, one can just adjust these things on the fly, with a heuristic or manually if needed, as partial progress with sub-optimal hyperparameter choices doesn't need to be thrown out (unless very bad divergence has taken place, and then you can always backtrack to a previous snapshot of the parameters).
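On the unbiasedness question raised above, a quick empirical check makes the distinction concrete. The sketch below is illustrative, not the paper's implementation: additive zero-mean noise is unbiased by construction; multiplicative noise is unbiased only if its scale has mean one (the lognormal scale chosen here deliberately does not); and mask-out noise is unbiased provided the surviving gradients are rescaled, which this sketch assumes. Cauchy noise is omitted because it has no mean for the empirical average to converge to.

```python
import random

TRUE_GRAD = 3.0   # a fixed "true" gradient value for the check
N = 200_000

def additive(g):        # g + eps, E[eps] = 0  ->  unbiased
    return g + random.gauss(0.0, 1.0)

def multiplicative(g):  # g * s; unbiased only if E[s] = 1
    # E[lognormal(0, 0.5)] = exp(0.125) ~ 1.13, so this is ~13% biased
    return g * random.lognormvariate(0.0, 0.5)

def mask_out(g, p=0.5): # zero out with prob p, rescale by 1/(1-p) -> unbiased
    return 0.0 if random.random() < p else g / (1.0 - p)

for name, noise in [("additive", additive),
                    ("multiplicative", multiplicative),
                    ("mask-out", mask_out)]:
    mean = sum(noise(TRUE_GRAD) for _ in range(N)) / N
    print(f"{name:15s} empirical mean ~ {mean:.3f} (true grad = {TRUE_GRAD})")
```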
Tom Schaul 25 Feb 2014
Thank you for your constructive and detailed comments. We have revised the paper to take them into account; here are some specific answers:

> I am not fully convinced that they are testing the "right" kind of thing or if these tests can tell us much of use about the optimization problems we really care about.

As reviewer 5afd points out, this paper aims to be a start in this direction. Different users of stochastic gradient methods may care more or less about different issues (e.g. high dimensions, local optima of varying quality, etc.), so a lot is being left to future work. Nevertheless, we argue that most of our proposed unit tests are more or less representative of (part of) an optimization surface that could be encountered, and that a good algorithm should behave robustly on most of them.

> Are these optimization surfaces unimodal?

Yes, most of the surfaces are unimodal.

> If not, couldn't it be the case that some optimization methods simply will get "lucky" and bounce their way into a better local basin, whereas others that might be more careful about remaining stable in the face of noise will miss these?

All experiments are repeated many times to minimize the effect of "lucky" runs. Careful algorithms have different properties than more aggressive or stochastic algorithms, and this should be reflected in their having different strengths and weaknesses on different sets of unit tests -- including but not limited to multimodal surfaces.

> Another major issue I have is that by testing local convergence you are really only testing methods in a single and likely not too important phase of optimization: fine convergence to a local min. [...] I would wager that the most significant aspect of optimization is the journey towards a local min from very far away along a very curvy path that can possibly lead to other local mins of roughly equivalent quality.

Indeed, some unit tests test fine convergence to a local minimum, but there are many others that test, for example, non-divergence under high noise, or the local optimization dynamics on shape prototypes whose optimum lies at infinity (linear slope, sigmoid, exponential, saddle points, etc.). These latter unit tests verify that an algorithm keeps making progress; in other words, they check for sane behavior during the long "journey" that characterizes most early phases of optimization.

> Page 1: In what sense do you mean that these weaknesses are separate from "raw performance"? If an algorithm is performing well on some task, why would I care if it has some invisible issues on that task, as long as it appears to be working?

For any given task, the best algorithm is determined by its raw performance (and may be a different one each time). This is a separate question from which algorithm is robust in general and likely to work well on new, unknown problems.

> Page 1: I don't really understand what you are saying in the 'locally' paragraph.

We have revised this paragraph for clarity.

> Don't you still need to initialize the algorithms in the unit tests and won't this choice affect the relative performance of different methods?

Yes, initialization and algorithm state are one currently unresolved issue, but in section 4.1 we discuss a possible way of addressing this in future work.

> You really should discuss the issue of unbiasedness of your noise in general.

We have clarified in section 2.3 that the noise is not necessarily unbiased.

> Page 4: How are you generating these random rotations? Are these just like random orthonormal matrices?

Yes.

> Page 4: It is not clear to me why gradient descent should work at all when it is given vectors from a vector field that is not actually a gradient field. And assuming it does work for certain such fields, the reasons are probably subtle and highly situation dependent.

This is true, and in the cited reinforcement learning literature the question of when it may converge anyway has been studied extensively.

> I don't imagine that this can be easily simulated by applying a fixed rotation to the gradient.

In some simple cases, such as the one in Figure 4, it can: there the vector field is exactly a gradient field combined with a rotation, and a large class of algorithms does converge in this scenario. We have clarified this point in the latest revision.

> Page 5: What do you mean in the sentence: "However, non-stationary optimization can even be important in large stationary tasks, when the algorithm itself chooses to track a particular aspect of the problem, rather than solve the problem globally."?

We have clarified this in section 2.6, with the details deferred to the cited [21].

> Page 5: I think a better way of simulating non-stationarity of the objective function would be to perhaps randomly jiggle or move the various shape and scale parameters that define your objective.

Thank you; we have added these modifiers in the revised version, expanding the set of unit tests to include all three types of non-stationarity.

> Page 6: Why didn't you test the method(s) from [9]?

We are presently working on more robust variants of that method (with the help of the presented unit tests, in fact), to be published shortly.

> Page 7: What are these "groups"? Where do you describe what they are?

Vertically, each group of rows is one algorithm with different hyperparameters; horizontally, each group of columns is a collection of unit tests sharing a certain property. This is now described more clearly in the text.

> Page 7: So if I understand correctly, the vertical axis gives the different methods with different hyperparameter choices for each method? [...] Are methods with "thicker" regions ones with more adjustable hyperparameters?

Yes and yes. Section 3.1 gives an overview of what these hyperparameters are for each algorithm and which values are swept over (full details are in the published code).

> If a method has lots of hyperparameters, it shouldn't be judged to be 'less robust' if for certain crazy choices of these it diverges.

We try not to make a judgement on which hyperparameters are reasonable. Instead we want to show that our unit tests can be a tool for determining whether a single setting of hyperparameters can be robust in most scenarios, or whether some algorithms have to be tuned to the problem.
Tom Schaul 26 Feb 2014
The updated version of the paper is now visible on arXiv, and the updated code at https://github.com/IoannisAntonoglou/optimBench can reproduce the result figures.