We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout, when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation, we ask the question: can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).
Improving Deep Neural Networks with Probabilistic Maxout Units
Jost Tobias Springenberg — Conference Track submission, revealed 23 Dec 2013; reviews requested 14 Jan 2014, due 04 Feb 2014.
12 Comments

Anonymous f3f1 07 Feb 2014
The paper introduces a generalization of the maxout unit, called probout. The output of a maxout unit is defined as the maximum of a set of linear filter responses; the output of a probout unit is sampled from a softmax defined on those linear responses. For vanishing temperature this turns into the maxout response. While the idea is probably not revolutionary, it seems reasonable and appears to work fairly well on the datasets tried in the paper.

It is a bit unfortunate that, unlike for dropout/maxout, there does not seem to be a closed-form, deterministic activation function at test time that works well. At least the authors did not find any. Instead they propose to average multiple outputs, which makes probout networks much slower at test time than a maxout network. It also puts the improved classification results over maxout into perspective: it seems unlikely that the common practice of halving weights at test time is exactly the optimal way of making predictions for a model trained with dropout, and it is conceivable that some kind of model averaging would be able to improve performance for those networks, too.

It is interesting that probout units with group size two tend to yield filter pairs in a quadrature relationship, and much more clearly so than maxout. In that respect they behave similarly to average-pooling units. It would be interesting to investigate this further in future work.
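To make the mechanism under discussion concrete, here is a minimal NumPy sketch of a probout-style activation as the reviews describe it: a softmax with temperature over each group's linear responses, a sample drawn per group, and test-time predictions obtained by averaging several stochastic forward passes. Function names such as `probout`, `predict`, and `net_forward` are illustrative placeholders, not the authors' implementation.

```python
import numpy as np


def probout(z, temperature=1.0, rng=np.random):
    """Sample one response per group from a softmax over the group's linear
    responses z (shape: num_groups x group_size). As temperature -> 0 the
    softmax concentrates on the largest response and this reduces to maxout,
    i.e. np.max(z, axis=1)."""
    logits = z / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)             # per-group softmax
    idx = np.array([rng.choice(z.shape[1], p=p) for p in probs])
    return z[np.arange(z.shape[0]), idx]


def predict(net_forward, x, n_samples=50):
    """Test-time inference as described above: average the class probabilities
    of several stochastic forward passes (net_forward is a placeholder for the
    network's stochastic forward pass)."""
    return np.mean([net_forward(x) for _ in range(n_samples)], axis=0)


# Example: with a near-zero temperature the sample coincides with maxout.
z = np.random.randn(4, 2)                  # 4 groups of 2 linear responses
print(probout(z, temperature=1e-6))
print(np.max(z, axis=1))
```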
Anonymous 4eb4 10 Feb 2014
This manuscript extends the recently proposed "maxout" scheme for neural networks by making the linear subspace pooling stochastic, i.e. by parameterizing the probability of activating each filter as the softmax of the filter responses and then sampling from the resulting discrete distribution. Experimental results are presented on CIFAR-10/CIFAR-100/SVHN and contrasted with the original maxout work.

Novelty: low
Quality: low to medium

Pros:
- The idea is somewhat interesting and worth trying, and the manuscript itself is generally well-written.
- Experiments and hyperparameter choices are generally described in detail, including the software packages used.
- Attempts a fair-handed comparison between the method and the one it builds on (though see below).

Cons:
- The benchmark results are quite lackluster: the improvements are of questionable statistical significance (see below).
- This questionable gain comes at a >=10x increase in computational cost.
- The experimental comparisons have several shortcomings (not all of them strictly advantageous to the proposed method, however; see below).
- The abstract mentions enhancing invariance properties as the goal, but there is no attempt to quantify the degree of learned invariance as has been examined in the literature. See, for example, Goodfellow et al.'s "Measuring Invariances in Deep Networks" from NIPS 2009 for one such attempt.

Detailed comments:

The procedure does not yield a single deterministic model at test time, which may be a significant practical drawback compared with conventional maxout or ReLU networks. Still, a more significant drawback of the proposed method is that test-time computation involves several forward propagations through the entire network in order to gain a noticeable (but statistically negligible) advantage over maxout. Figure 3b suggests that, on a relatively simple dataset, 10 or more fprops per example may be necessary to yield a significant advantage over the maxout baseline. This is in addition to the additional cost incurred by sampling pseudorandom variates as part of the inference process.

A quick computation of a confidence interval (based on the confidence interval of a Bernoulli parameter, i.e. the probability of classifying an example incorrectly) for both the results reported in Goodfellow et al. (2013) and this work shows that the intervals overlap substantially:
- CIFAR-10 Maxout (no augmentation): 11.68% +/- 0.63%, Probout: 11.35% +/- 0.62%
- CIFAR-100 Maxout: 38.57% +/- 0.95%, Probout: 38.14% +/- 0.95%
- SVHN Maxout: 2.47% +/- 0.19%, Probout: 2.39% +/- 0.19%

The same comparison between maxout and the competitors reported in the original work yields non-overlapping confidence intervals for all tasks above. The original maxout work does not achieve a statistically significant improvement over the existing state of the art for CIFAR-10, and these authors report no improvement, suggesting that the task is sufficiently well regularized by the data augmentation that their complementary regularization does not help.

Hyperparameter search: while the authors reused the exact same architectures and other hyperparameters employed by the original maxout manuscript in an attempt to be fair-handed, I believe this is, in truth, a mistake that disadvantages their method in the comparison. Certain hyperparameters, critically the learning rate and momentum, will be very sensitive to changes in the learning dynamics such as those introduced by the stochastic generalization of the activation function. Even the optimal network architecture may not be the same. The way I would suggest approaching this is with a randomized hyperparameter search (even one in the neighbourhood of the original settings) wherein the same hyperparameters are tried for both methods, and each point in the shared hyperparameter space is further optimized via randomized search over the hyperparameters specific to probout. This gives the method in question a fairer shot by not insisting that the optimal hyperparameters for maxout be the same as those for probout (there is no a priori reason that they should be).

The choice of temperature schedules also seems like an area that should be explored further. It seems odd that increasing the temperature during training should help (the paper does not specify the length of the linear (inverse) decrease period; this should be noted). What is the intuition for why this helps? How could one validate these intuitions experimentally?

The claim that the stochastic procedure "prevents [filters] from being unused by the network" is dubious. Section 8.2 of the original maxout paper suggests that dropout SGD training alone is remarkably effective in this respect. This claim should be quantitatively investigated and verified if it is to be made at all.

Finally, the paper motivates probout units by arguing that they learn better invariances, but no attempt is made to validate this claim quantitatively. As it stands, there is a qualitative assessment of the learned first-layer convolutional filters, noting that they appear to resemble easily recognizable transformations and quadrature pairs more so than those learned by vanilla maxout, but I find this somewhat unconvincing. Just because the invariances learned by vanilla maxout are not always obvious to the human practitioner does not mean they are not useful invariances for the model to encode.
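The interval arithmetic in the review above can be reproduced with the standard normal approximation for a binomial proportion; a minimal sketch (the CIFAR test sets contain 10,000 examples and the SVHN test set 26,032, which are assumptions of this calculation rather than numbers stated in the review):

```python
import math


def error_ci(error_rate, n_test, z=1.96):
    """Half-width of a 95% normal-approximation confidence interval for a
    test error rate, treating each test example as a Bernoulli trial."""
    return z * math.sqrt(error_rate * (1.0 - error_rate) / n_test)


print(error_ci(0.1168, 10000))   # ~0.0063 -> 11.68% +/- 0.63% (CIFAR-10 maxout)
print(error_ci(0.1135, 10000))   # ~0.0062 -> 11.35% +/- 0.62% (CIFAR-10 probout)
print(error_ci(0.0247, 26032))   # ~0.0019 ->  2.47% +/- 0.19% (SVHN maxout)
```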
Jost Tobias Springenberg 18 Feb 2014
First of all, we want to thank both reviewers for their detailed comments. Before giving a more elaborate response we want to mention that we have incorporated your suggestions in a new version of the paper, which should appear on arXiv as of Feb 19.

To both reviewers: As already acknowledged in the paper, we agree that our proposed method comes with a high computational cost during inference. While this cost, as you mention, can be seen as an important practical drawback, we believe that exploratory research towards novel stochastic regularization techniques (for which an efficient inference procedure is not immediately available) still constitutes a worthwhile research direction from which new, more efficient regularization techniques could be obtained in the future. In addition to this increase in computational cost, you point out that the improvement over maxout achieved by our approach is minor/non-significant. Although this is acknowledged at the end of the introduction and in the discussion, we have now additionally rephrased the experiments section to reflect this fact more clearly. We want to reiterate that since both methods are tightly coupled, in their motivation and computational properties as well as in our parameter choices, we believe that our results still provide an interesting contribution and can serve as a starting point for future research on the impact of including subspace pooling in deep neural networks.

To reviewer 2 (Anonymous 4eb4): We agree that the best hyperparameters for maxout and probout cannot generally be assumed to coincide. However, as a full hyperparameter search for both methods on all datasets requires significant computational resources, we decided to stick to the original parameters in an attempt to make a comparison that is at worst biased towards the maxout results (as we were concerned not to make an unfair comparison to previous results). We are currently running a parameter search on both CIFAR-10 and SVHN in a manner similar to the one you suggested and will include the results in an additional updated version in the coming days. As you point out, our investigation of the invariance properties of the network in the original paper is only a qualitative one. We performed an additional quantitative analysis in the new version of the paper, comparing the invariance properties of maxout and probout networks in a manner similar to [1,2].

[1] K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. "Learning Invariant Features through Topographic Filter Maps." In Proc. CVPR, 2009.
[2] M. D. Zeiler and R. Fergus. "Visualizing and Understanding Convolutional Networks." arXiv:1311.2901, 2013.
Anonymous 2618 19 Feb 2014
The authors propose replacing the max operation of maxout with a probabilistic version, the same as what Zeiler did for spatial max-pooling in the "stochastic pooling" paper. For inference, they run the network 50 times using the same sampling procedure and average the outputs to obtain probabilities. They add a "temperature" parameter that allows interpolating between maxout and "uniform random" sampling, and find that annealing the temperature helps. They also analyze the per-layer optimal setting of the temperature and find that stochasticity is most important in the first two convolutional layers, whereas in the last two layers using probout did not give any advantage over maxout.

There are a few minor corrections, but overall this is a solid submission with high relevance for ICLR.

Issues:
- Minor style issue: use \text or \mbox for "softmax" and "multinomial" inside formulas.
- Table 3: could add the 2.16% error rate obtained by another ICLR submission, "Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks".
- There is an explanation for why the first layers benefit the most from stochasticity, namely how stochasticity "pulls the units in the subspace closer together". This is unclear to me, and I would recommend expanding/explaining this view.
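As a small illustration of the temperature interpolation mentioned in this review, a softmax with a very low temperature concentrates on the maximum (maxout-like behaviour), while a very high temperature approaches uniform random selection; a toy example with made-up numbers:

```python
import numpy as np


def softmax(z, temperature):
    scaled = z / temperature
    e = np.exp(scaled - scaled.max())      # numerical stability
    return e / e.sum()


z = np.array([0.2, 1.5, -0.4])             # linear responses within one group
print(softmax(z, temperature=0.05))        # ~[0, 1, 0]: effectively maxout
print(softmax(z, temperature=100.0))       # ~[1/3, 1/3, 1/3]: near-uniform sampling
```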
Jost Tobias Springenberg 21 Feb 2014
We want to point out that an updated version of the paper is available on arXiv. The main changes are:
- Most minor comments of reviewers Anonymous f3f1 and Anonymous 4eb4 are addressed in the new version
- The experiments section has been changed to more clearly reflect the statistical ties between maxout and probout in the results
- An additional experiment on invariance properties was added
Conference Track 23 Feb 2014
There seems to be some disagreement among the reviewers regarding the overall assessment of this work. I ask the reviewers to read and consider each other's reviews as well as the revision of the paper posted on arXiv. Do the other reviews and the paper revision change your opinion? Please comment.
Ian Goodfellow 26 Feb 2014
For fig 3b, what sampling distribution did you use for the maxout baseline? Are you sampling from the distribution defined in eqn 3? Why is this a meaningful thing to evaluate for maxout, which hasn't been trained to know you're going to sample in that particular way? It seems like it would be more fair to sample different dropout masks, since each of the maxout subnetworks have actually been trained to do the classification task. It's not very surprising to find that a neural net that has been trained to do task X performs better than a neural net that has not been trained to do task X.
Jost Tobias Springenberg 27 Feb 2014
Hi Ian, the maxout baseline is the maxout network without any sampling, i.e. a simple forward pass through the net. Equation 3 was used for the two other curves (maxout + sampling, probout); for both networks the dropout effect was removed by the 'halving the weights' trick. I agree that it is not surprising that maxout performs worse in combination with the sampling procedure, as it was never trained to compensate for/utilize the stochastic sampling. This experiment was meant as a simple control experiment and not much more; I did not mean to convey the idea that the result is surprising or that it is a disadvantage of the maxout model. Investigating different dropout masks was not really within the scope of this experiment, however it is a good idea and I will try to set up such an additional experiment. Btw, I will later today also report back with a few results from my large hyperparameter search, which are somewhat interesting. I would be very happy if you could comment on them as well.
Ian Goodfellow 26 Feb 2014
For figure 4, I think it's important to note that your evaluation method only returns a low number if *all* of the units in a layer are invariant to the studied transformation. If the layer has a factored representation that represents the studied transformation with one set of units and represents other properties of the input with another, disjoint set of units, there will still be a large cosine difference between the representations of two transformed versions of the input, because the change in the portion of the representation that corresponds to the studied property will result in a change of the normalization constant and thus a change in all of the code elements.

I think it would make more sense to normalize each unit separately based on a statistic such as its standard deviation across the training set, and then plot histograms showing how much each normalized unit changes as you vary the input. This way you still control for the possibility that the models operate on different scales, but you can also tell if the representation is factored or not.
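A sketch of the per-unit analysis suggested in the comment above, using hypothetical helper names (`layer_fn` stands for whatever maps a batch of inputs to a layer's representation): normalize each unit by its standard deviation over the training set, then histogram how much each normalized unit changes between an input and its transformed version.

```python
import numpy as np
import matplotlib.pyplot as plt


def per_unit_change(layer_fn, train_x, x, x_transformed, eps=1e-8):
    """layer_fn maps a batch of inputs to the layer's representation
    (batch x num_units). Returns the normalized per-unit change between the
    original and transformed inputs."""
    unit_std = layer_fn(train_x).std(axis=0) + eps               # per-unit scale
    return np.abs(layer_fn(x) - layer_fn(x_transformed)) / unit_std


def plot_change_histogram(delta, bins=50):
    """Histogram of normalized per-unit changes; a factored representation
    shows a mix of near-zero and large values rather than a single mode."""
    plt.hist(delta.ravel(), bins=bins)
    plt.xlabel("normalized per-unit change")
    plt.ylabel("count")
    plt.show()
```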
Jost Tobias Springenberg 27 Feb 2014
"For figure 4, I think it's important to note that your evaluation method only returns a low number if *all* of the units in a layer are invariant to the studied transformation." I agree and will add a sentence to the paper noting this. I actually thought about whether there is a better way to produce such plots than what was done in prior work as well. Your suggestions seems solid and would be a nice addition to the whole layer analysis that is depicted in Figure 4. I will try to get around to implementing it this way (although this might have to wait a few days).
Jost Tobias Springenberg 27 Feb 2014
As mentioned in a previous comment, I want to report back with preliminary results from the hyperparameter search I am conducting. Essentially, I am optimizing over all notable parameters except the number of units in each layer, in order to fix the model size (more specifically, this includes the learning rate, momentum, pooling shape, pooling stride, and the size of the convolutional kernels).

My preliminary results suggest that on CIFAR-10 the best probout network trained without data augmentation can achieve an error rate of <= 10.7%. During the hyperparameter search, however, I also became aware that the parameter search carried out for the original maxout paper was probably not very exhaustive, as improved performance can also be achieved with a vanilla maxout model. While the difference in performance between probout and maxout appears to be observable for all hyperparameter settings, the best maxout model I have obtained so far achieves 10.92% error on CIFAR-10 without data augmentation.

Interestingly, the best hyperparameter settings found so far are much closer to the parameters that seem to be used in the "Network in Network" paper (also submitted to ICLR) than to the original settings used in the maxout paper (and maxout seems to perform on par with "Network in Network" when a fully connected layer is used after the convolutional layers). I will cross-link these results in the discussion of that paper. I will include the results in the paper as soon as the full parameter search is finished and disclose the parameters found. It would be interesting if Ian could jump in with a short comment on how exactly he fixed the hyperparameters for his original work.
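For concreteness, a sketch of what such a search could look like, following the shared-then-specific structure suggested by reviewer 4eb4: shared hyperparameters are sampled once per trial and evaluated for both maxout and probout, with an extra inner search over a probout-specific setting. The parameter ranges, the temperature candidates, and `train_and_eval` are illustrative assumptions, not the settings actually used.

```python
import random

# Search space over the hyperparameters named above (illustrative ranges only).
SEARCH_SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(-2.5, -0.5),
    "momentum":      lambda: random.uniform(0.5, 0.95),
    "pool_shape":    lambda: random.choice([2, 3, 4]),
    "pool_stride":   lambda: random.choice([1, 2, 3]),
    "kernel_size":   lambda: random.choice([5, 7, 9]),
}


def random_search(train_and_eval, n_trials=50, temperatures=(0.5, 1.0, 2.0)):
    """train_and_eval(params, probout_temperature=None) -> validation error.
    It is a placeholder for the actual training code; omitting the temperature
    means training a plain maxout network with the shared hyperparameters."""
    results = []
    for _ in range(n_trials):
        params = {name: draw() for name, draw in SEARCH_SPACE.items()}
        maxout_err = train_and_eval(params)
        probout_err = min(train_and_eval(params, probout_temperature=t)
                          for t in temperatures)
        results.append((params, maxout_err, probout_err))
    return results
```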
Ian Goodfellow 03 Mar 2014
My co-author David Warde-Farley chose the final hyperparameters for CIFAR-10. Our goal was simply to improve upon the state of the art with the limited computational resources that were available to us, so we did not run an exhaustive, automated search. Instead, both David and I guessed a small number of hyperparameter settings by hand, and we stopped working on that particular task after we started to get diminishing marginal utility from our time spent on it. The hyperparameters for the case with no data augmentation are probably particularly poor. We just used the best hyperparameters from the data augmentation case. I think if you want to do an explicit comparison between two methods, like maxout and probabilistic maxout, it's best to do an automated search, like we did for our comparison between maxout and rectifiers. Unfortunately, when we did this automated search, we had to do it for a small maxout model, since we wanted to compare maxout against a significantly larger rectifier model.
