Improving Deep Neural Networks with Probabilistic Maxout Units
Jost Tobias Springenberg, Martin Riedmiller
23 Dec 2013
Conference Track
We present a probabilistic variant of the recently introduced maxout unit. The success of deep neural networks utilizing maxout can partly be attributed to favorable performance under dropout when compared to rectified linear units. However, it also depends on the fact that each maxout unit performs a pooling operation over a group of linear transformations and is thus partially invariant to changes in its input. Starting from this observation we ask the question: can the desirable properties of maxout units be preserved while improving their invariance properties? We argue that our probabilistic maxout (probout) units successfully achieve this balance. We quantitatively verify this claim and report classification performance matching or exceeding the current state of the art on three challenging image classification benchmarks (CIFAR-10, CIFAR-100 and SVHN).

Anonymous f3f1 07 Feb 2014
The paper introduces a generalization of the maxout unit, called probout. The output of a maxout unit is defined as the maximum of a set of linear filter responses. The output of a probout unit is sampled from a softmax defined on the linear responses. For vanishing temperature this turns into the maxout response. While the idea is probably not revolutionary, it seems reasonable and appears to work fairly well on the datasets tried in the paper. It is a bit unfortunate that, unlike for dropout/maxout, there does not seem to be a closed-form, deterministic activation function at test time that works well. At least the authors did not find one. Instead they propose to average multiple outputs. This makes probout networks much slower at test time than a maxout network. It also puts into perspective the improved classification results over maxout. It seems unlikely that the common practice of halving weights at test time is exactly the optimal way of making predictions for a model trained with dropout, and it is conceivable that some kind of model averaging would be able to improve performance for those networks, too. It is interesting that probout units with group size two tend to yield filter pairs in a quadrature relationship, and much more clearly so than maxout. In that respect they behave similarly to average pooling units. It would be interesting to investigate this further in future work.
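For concreteness, a minimal numpy sketch of the two activations as described above (illustrative only, not the authors' implementation; the temperature parameterization follows the review's wording that a vanishing temperature recovers the maxout response):

```python
import numpy as np

def maxout(z):
    # z: linear responses of shape (num_units, group_size);
    # each maxout unit returns the maximum over its group.
    return z.max(axis=1)

def probout(z, temperature=1.0, rng=None):
    # Probabilistic maxout as described above: sample one linear response
    # per unit from a softmax over the group. As temperature -> 0 this
    # recovers the maxout response; large temperatures approach uniform
    # random selection within the group.
    rng = np.random.default_rng() if rng is None else rng
    logits = z / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    idx = np.array([rng.choice(z.shape[1], p=p) for p in probs])
    return z[np.arange(z.shape[0]), idx]
```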
Anonymous 4eb4 10 Feb 2014
Jost Tobias Springenberg 18 Feb 2014
First of all we want to thank both reviewers for their detailed comments. Before giving a more elaborate response we want to mention that we have incorporated your suggestions in a new version of the paper, which should appear on arXiv as of Feb 19.

To both reviewers: As already acknowledged in the paper, we agree that our proposed method comes with a high computational cost during inference. While this cost, as you mentioned, can be seen as an important practical drawback, we believe that explorative research towards novel stochastic regularization techniques (for which an efficient inference procedure is not immediately available) still constitutes a worthwhile research direction from which new, more efficient regularization techniques could be obtained in the future. In addition to this increase in computational cost, you point out that the improvement over maxout achieved by our approach is minor/non-significant. Although this is acknowledged at the end of the introduction and in the discussion, we have now additionally rephrased the experiments section to reflect this fact more clearly. We want to reiterate that since both methods are tightly coupled - in their motivation and computational properties as well as in our parameter choices - we believe that our results still provide an interesting contribution and can serve as a starting point for future research on the impact of including subspace pooling in deep neural networks.

To reviewer 2 (Anonymous 4eb4): We agree that the best hyperparameters for both maxout and probout cannot generally be assumed to coincide. However, as a full hyperparameter search for both methods on all datasets requires significant computational resources, we decided to stick to the original parameters in an attempt to make a comparison that is at worst biased towards the maxout results (as we were concerned not to make an unfair comparison to previous results). We are currently running a parameter search on both CIFAR-10 and SVHN in a similar manner as you suggested and will include the results in an additional updated version in the coming days. As you point out, our investigation of the invariance properties of the network in the original paper is only a qualitative one. We performed an additional quantitative analysis in the new version of the paper, comparing invariance properties of maxout and probout networks in a manner similar to [1,2].

[1] Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus, and Yann LeCun, "Learning Invariant Features through Topographic Filter Maps", in Proc. International Conference on Computer Vision and Pattern Recognition (CVPR'09), 2009.
[2] Matthew D. Zeiler and Rob Fergus, "Visualizing and Understanding Convolutional Networks", arXiv:1311.2901, 2013.
Anonymous 2618 19 Feb 2014
The authors propose replacing the max operation of maxout with a probabilistic version, the same as what Zeiler did for spatial max-pooling in the "stochastic pooling" paper. For inference, they run the network 50 times using the same sampling procedure and average the outputs to get probabilities. They add a "temperature" parameter that allows interpolating between maxout and "uniform random" sampling, and find that annealing the temperature helps. They also analyze the per-layer optimal setting of the temperature and find that stochasticity is most important in the first 2 convolutional layers, whereas in the last 2 layers using probout did not give any advantage over maxout. There are a few minor corrections, but overall this is a solid submission with high relevance for ICLR.

Issues:
- Minor style issue: use \text or \mbox for "softmax" and "multinomial" inside formulas
- Table 3: could add the 2.16% error rate obtained by another ICLR submission -- "Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks"
- There's an explanation for why the first layers benefit the most from stochasticity -- how stochasticity "pulls the units in the subspace closer together". This is unclear to me, and I would recommend expanding/explaining this view.
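A minimal sketch of the Monte-Carlo inference procedure summarized above (illustrative: stochastic_forward is a hypothetical callable standing in for one probout-sampling forward pass of the network; the sample count of 50 matches the review):

```python
import numpy as np

def averaged_prediction(stochastic_forward, x, num_samples=50):
    # Run the stochastic (probout-sampling) forward pass several times
    # and average the resulting class probabilities, as described above.
    # `stochastic_forward` maps an input batch to softmax outputs using
    # freshly sampled probout indices on every call.
    probs = np.mean([stochastic_forward(x) for _ in range(num_samples)],
                    axis=0)
    return probs.argmax(axis=-1), probs
```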
Jost Tobias Springenberg 21 Feb 2014
We want to point out that an updated version of the paper is available on arXiv. The main changes are:
- Most minor comments of reviewers Anonymous f3f1 and Anonymous 4eb4 are addressed in the new version
- The experiments section has been changed to more clearly reflect the statistical ties between maxout and probout in the results
- An additional experiment on invariance properties was added
Conference Track 23 Feb 2014
There seems to be some disagreement among the reviewers regarding the overall assessment of this work. I ask the reviewers to read and consider each others' reviews as well as the revision of the paper posted on arXiv. Do the other reviews and the paper revision change your opinion? Please comment.
Ian Goodfellow 26 Feb 2014
For fig 3b, what sampling distribution did you use for the maxout baseline? Are you sampling from the distribution defined in eqn 3? Why is this a meaningful thing to evaluate for maxout, which hasn't been trained to know you're going to sample in that particular way? It seems like it would be more fair to sample different dropout masks, since each of the maxout subnetworks has actually been trained to do the classification task. It's not very surprising to find that a neural net that has been trained to do task X performs better than a neural net that has not been trained to do task X.
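A rough sketch of the alternative suggested here (illustrative only; forward_with_masks is a hypothetical callable that applies one supplied binary mask per dropout layer, and layer_shapes, keep_prob and num_masks are assumed parameters):

```python
import numpy as np

def dropout_mask_average(forward_with_masks, x, layer_shapes,
                         keep_prob=0.5, num_masks=50, rng=None):
    # Average predictions over several sampled dropout masks (i.e. over
    # sub-networks the model was actually trained with), rather than
    # over probout index samples.
    rng = np.random.default_rng() if rng is None else rng
    outputs = []
    for _ in range(num_masks):
        masks = [(rng.random(shape) < keep_prob).astype(np.float32)
                 for shape in layer_shapes]
        outputs.append(forward_with_masks(x, masks))
    return np.mean(outputs, axis=0)
```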
Jost Tobias Springenberg 27 Feb 2014
Hi Ian, the maxout baseline is the maxout network without any sampling, i.e. a simple forward pass through the net. Equation 3 was used for the two other curves (maxout + sampling, probout); for both networks the dropout effect was removed by the weight-halving trick. I agree that it is not surprising that maxout performs worse in combination with the sampling procedure, as it was never trained to compensate for/utilize the stochastic sampling procedure. This experiment was meant as a simple control experiment and not much more. I did not mean to convey the idea that the result is surprising or is a disadvantage of the maxout model. Investigating different dropout masks was not really within the scope of this experiment; however, it is a good idea and I will try to set up such an additional experiment. Btw, I will later today also report back with a few results from my large hyperparameter search, which are somewhat interesting. I would be very happy if you could comment on them as well.
Ian Goodfellow 26 Feb 2014
For figure 4, I think it's important to note that your evaluation method only returns a low number if *all* of the units in a layer are invariant to the studied transformation. If the layer has a factored representation that represents the studied transformation with one set of units and represents other properties of the input with another, disjoint set of units, there will still be a large cosine difference between the representation of two transformed versions of the input, because the change in the portion of the representation that corresponds to the studied property will result in a change of the normalization constant and thus a change in all of the code elements. I think it would make more sense to normalize each unit separately based on a statistic such as its standard deviation across the training set, and then plot histograms showing how much each normalized unit changes as you vary the input. This way you still control for the possibility that the models operate on different scales, but you can also tell if the representation is factored or not.
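As a sketch of the per-unit analysis proposed here (illustrative numpy code; the per-unit normalization by the standard deviation over the training set follows the suggestion above, and all array names are mine):

```python
import numpy as np

def per_unit_changes(h_orig, h_transformed, h_train):
    # h_orig, h_transformed: layer representations of shape
    # (num_inputs, num_units) for original and transformed inputs.
    # h_train: activations over the training set, shape
    # (num_train, num_units), used only to estimate a per-unit scale.
    unit_std = h_train.std(axis=0) + 1e-8
    delta = np.abs(h_transformed - h_orig) / unit_std
    # Plot a histogram per unit (or pooled over units) to see whether
    # only a subset of units responds to the studied transformation.
    return delta
```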
Jost Tobias Springenberg 27 Feb 2014
"For figure 4, I think it's important to note that your evaluation method only returns a low number if *all* of the units in a layer are invariant to the studied transformation." I agree and will add a sentence to the paper noting this. I actually thought about whether there is a better way to produce such plots than what was done in prior work as well. Your suggestions seems solid and would be a nice addition to the whole layer analysis that is depicted in Figure 4. I will try to get around to implementing it this way (although this might have to wait a few days).