Network In Network
Min Lin, Qiang Chen, Shuicheng Yan
18 Dec 2013 arXiv 15 Comments
Conference Track
We propose a novel network structure called "Network In Network" (NIN) to enhance the model discriminability for local receptive fields. The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input. Instead, we build micro neural networks with more complex structures to handle the variance of the local receptive fields. We instantiate the micro neural network with a nonlinear multilayer structure, which is a potent function approximator. The feature maps are obtained by sliding the micro networks over the input in a similar manner as in CNN and are then fed into the next layer. The deep NIN is thus implemented as a stack of multiple sliding micro neural networks. With the enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is more interpretable and less prone to overfitting than traditional fully connected layers. We demonstrate state-of-the-art classification performance with NIN on the CIFAR-10, CIFAR-100 and SVHN datasets.
15 Comments

Anonymous 5205 31 Jan 2014
Summary of contributions: Proposes a new activation function for backprop nets. Advocates using global mean pooling instead of densely connected layers at the output of convolutional nets.

Novelty: moderate. Quality: moderate.

Pros:
- Very impressive results on CIFAR-10 and CIFAR-100
- Acceptable results on SVHN and MNIST
- Experiments distinguish between performance improvements due to the NIN structure and performance improvements due to global average pooling

Cons:
- The explanation of why NIN works well doesn't make a lot of sense.

I suspect that NIN's performance has more to do with the way you apply dropout to the model than with the explanations you give in the paper. I elaborate below in the detailed comments. Did you ever try NIN without dropout? Maxout without dropout generally does not work all that well, except in cases where each maxout unit has few filters or the dataset is very large. I suspect your NIN units don't work well without dropout either, unless the micro-net is very small or you have a lot of data. I find it very weird that you don't explore how well NIN works without dropout, and that your explanation of NIN's performance doesn't involve dropout at all.

This paper has strong results, but I think a lot of the presentation is misleading. It should be published after being edited to take out some of the less certain material in the explanations. It could be a really great paper if you had a better story for why NIN works well, including experiments to back up this story. I suspect the story you have now is wrong, though, and that the correct story involves the interaction between NIN and dropout.

I've heard that Geoff Hinton proposed using some kind of unit similar to this during a talk at the CIFAR summer school this year. I'll ask one of the summer school students to comment on this paper. I don't think this subtracts from your originality, but it might be worth acknowledging his talk, depending on what the summer school student says.

Detailed comments:

Abstract: I don't understand what it means "to enhance the model discriminability for local receptive fields."

Introduction, paragraph 1: I don't think we can confidently say that convolutional net features are generally related to binary discrimination of whether a specific feature is present, or that they are related to probabilities. For example, some of them might be related to measurements ("how red is this patch?" rather than "what is the probability that this patch is red?"). In general, our knowledge of what features are doing is fairly primitive, informal, and ad hoc. Note that the other ICLR submission "Intriguing Properties of Neural Networks" has some strong arguments against the idea of looking at the meaning of individual features in isolation, or interpreting them as probabilistic detectors. Basically, I think you could describe conv nets in the intro without committing to these less well-established ideas about how conv nets work.

Paragraph 2: I understand that many interesting features can't be detected by a GLM. But why does the first layer of the global architecture need to be a nonlinear feature detector? Your NIN architecture is still built out of GLM primitives. It seems a bit arbitrary which things you say can be linear versus non-linear, i.e., why does it matter that you group all of the functionality of the micro-networks and say that together they are non-linear? Couldn't we just group the first two layers of a standard deep network and say they form a non-linear layer? Can't we derive a NIN layer just by restricting the connectivity of multiple layers of a regular network in the right way?

Paragraph 3: Why call it an mlpconv layer? Why not call it a NIN layer, for consistency with the title of the paper?

Last paragraph: Why average pooling? Doesn't it get hard for this to have a high-confidence output if the spatial extent of the layer gets large?

Section 2, Convolutional Neural Networks, eqn 1: use \text{max} so that the word "max" doesn't appear in italics; italics are for variable names.

Rest of the section: I don't really buy your argument that people use overcompleteness to avoid the limitations of linear feature detectors. I'd say instead that they use multiple layers of features. When you use two layers of any kind of MLP, the second layer can include/exclude any kind of set, regardless of whether the MLP is using sigmoid or maxout units, so I'm not sure why it matters that the first layer can only include/exclude linear half-spaces for sigmoid units and can only exclude convex sets for maxout units.

Regarding maxout: I think the argument here could use a little more detail and precision. I think what you're saying is that if you divide input space into an included set and an excluded set by comparing the value of a single unit against some threshold t, then traditional GLM feature detectors can only divide the input into two half-spaces with a linear boundary, while maxout can divide the input space into a convex set and its complement. Your presentation is a little weird, though, because it makes it sound like maxout units are active (have value > threshold) within a convex region, when in fact the opposite is true: maxout units are active *outside* a convex region (see the short calculation after this review). It also doesn't make a lot of sense to refer to "separating hyperplanes" anymore when you're talking about this kind of convex-region discrimination.

Section 3.1, paragraph 1: I'd argue that an RBF network is just an MLP with a specific kind of unit.

Equation 2: again, "max" should not be in italics.

Section 4.1: Let me be sure I understand how you're applying dropout. You drop the output of each micro-MLP, but you don't drop the hidden units within the micro-MLP, right? I bet this is what leads to your performance improvement: you've made the unit of dropping have higher capacity. The way you group things to be dropped for the dropout algorithm actually has a functional consequence; the way you group things when looking for linear versus non-linear feature detectors is relatively arbitrary. So I don't really buy your story in sections 1-3 about why NIN performs better, but I bet the way you use dropout could explain why it works so well.

Section 4.2: These results are very impressive! While reading this section I wanted to know how much of the improvement was due to global average pooling versus NIN. I see you've done those experiments later, in section 4.6. I'd suggest bringing Table 5 into this section so all the CIFAR-10 experiments are together and readers won't think of this objection without knowing you've addressed it.

Section 4.3: Convolutional maxout is actually not the previous state of the art for this dataset. The previous state of the art is 36.85% error, in this paper: http://www.cs.toronto.edu/~nitish/treebasedpriors.pdf Speaking of which, you probably want to ask to have your results added to this page, to make sure you get cited: http://rodrigob.github.io/are_we_there_yet/build/

Section 4.4: http://arxiv.org/pdf/1312.6082.pdf gets an error rate of only 2.16% with convolutional maxout + convolutional rectifiers + dropout. Also, when averaging the output of many nets, the DropConnect paper gets down to 1.94% (even when not using dropout / DropConnect). Your results are still impressive, but I think it's worth including these results in the table for the most accurate context.

Section 4.5: I think the table entries should be sorted by accuracy, even if that means your method won't be at the bottom.

Section 4.6: It's good that you've shown that the majority of the performance improvement comes from NIN rather than from global average pooling. It's also interesting that you've shown that moving from a densely connected layer to global average pooling regularizes the net more than adding dropout to the densely connected layer does.

Section 4.7: What is the difference between the left panel and the right panel? Are these just examples of different images, or is there a difference in the experimental setup?
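To spell out the convex-region point referenced above (this is just the standard maxout definition, not anything taken from the paper under review): a maxout unit computes h(x) = \max_i \left( w_i^\top x + b_i \right), so its sub-threshold set

    \{ x : h(x) \le t \} = \bigcap_i \{ x : w_i^\top x + b_i \le t \}

is an intersection of half-spaces and therefore convex. The active set \{ x : h(x) > t \} is the complement of that convex region, which is why maxout units are active outside a convex set rather than inside one.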
Min Lin 17 Feb 2014
Thanks for your detailed comments. The typos will be corrected in the coming version, and we address the other comments below.

Question: Did you ever try NIN without dropout? Yes, but only on CIFAR-10, as CIFAR-10 is the first dataset we used to test the ideas. The reason we do not discuss dropout is that we consider it a generally used regularization method. Any model with sufficient parameters (such as the maxout model you mentioned) may overfit the training data, which results in poorer test performance; the same is true for NIN. On CIFAR-10 without dropout, the error rate is 14.51%, which is almost 4% worse than NIN with dropout. It already surpasses many previous state-of-the-art results that use regularizers (except for maxout). We will add this result in the coming version. For a fair comparison, we should compare this number with maxout without dropout, but unfortunately maxout performance without dropout is not reported in the maxout paper, which is also the main reason we did not report NIN without dropout.

It is suggested in the comment that the performance may involve the interaction between NIN and dropout, and that the grouping used for dropout might be the reason. We found that applying dropout only within the micro-net does not work as well as putting dropout between mlpconv layers (moving the two dropout layers into the micro-net results in an error rate of 14.10%; the two placements are sketched at the end of this reply). Our interpretation is that the regularization effect of dropout is on the weights, as Wager et al. showed in "Dropout Training as Adaptive Regularization". In NIN, which has no fully connected layers, most parameters reside in the convolution layers, which is why dropout is applied to the inputs of those layers; in comparison, the number of parameters within the micro-net is negligible. Therefore, rather than saying that the way we group dropout with NIN is the reason for the good performance, we would say that dropout acts as a general regularizer. How dropout should be applied to each layer differs among models and is not yet well understood; how dropout should be applied to a model is an important story in itself. From the above, we argue that NIN itself is a good model even without dropout; how to apply dropout to a network is a general question, not one specific to NIN.

Abstract: A CNN filter acts as a GLM for local image patches, and it is a discriminative binary classification model. Adding nonlinearity enhances the potential discriminative power of the model for local image patches within the receptive field of the convolution neuron. We will refine the language in the coming version.

Introduction, paragraph 1: We will revise the paper and use less certain language when describing the output value as the probability of a specific feature. What we mean is that, in the ideal case, it can be the probability of a latent concept. I fully agree that the values are measurements rather than probabilities, but again, if ideally the value is highly correlated with the probability, it would be a very good model. I think it is a goal more than a fact, and NIN can achieve this goal better than a GLM.

Paragraph 2: Please see our replies to Section 2.

Paragraph 3: NIN is a more general structure, as mentioned in the paper. Other nonlinear networks can also be employed to incorporate different priors on the data distribution; for example, an RBF assumes a Gaussian mixture on the data. Mlpconv is one instantiation of NIN.

Last paragraph: There is a softmax normalization anyway, so high or low confidence is only relative.

Section 2: Note that unlike maxout units, which can be applied to either convolutional or non-convolutional structures, NIN only has a convolutional version; the non-convolutional version of NIN degrades to an MLP. In the non-convolutional case, it is equivalent to taking any two layers of the MLP and saying they form a non-linear layer. Thus for an MLP your argument is correct: it does not matter whether the first layer of the network is a linear model or not, because multiple layers of features overcome the limitation of the linear detector. However, this is not true for convolutional structures. Stacking convolution layers is different from stacking fully connected layers: higher convolution layers cover larger spatial regions of the input than lower layers, so stacked convolution layers do not form multiple layers of features over the same patch as in an MLP. To avoid the limitation of the GLM, NIN forms multiple layers of features on a local patch. In our opinion, a CNN has two functionalities: (1) partition and (2) abstraction. Partition: in the object recognition case, lower layers cover smaller parts, and higher layers learn the deformation relationships between the parts. Abstraction: in a traditional CNN, the abstraction of a local patch is done using a GLM; better abstraction of a local patch in the current convolution layer can reduce the combinatorial explosion in the next layer.

Regarding maxout: we think it does not matter whether the positive side or the negative side is defined as the active side; they are symmetric. For any maxout network, we can construct a "minout" network by reversing the sign of the weights in every other layer. As minout is equivalent to maxout, you can consider a minout network, and then the positive side is the active side.

Section 3.1: We will refine the statements. The information we want to convey is that we can incorporate different data priors by choosing the micro-net. For example, an RBF models the data in a Gaussian-mixture style, while the GLM in an MLP assumes a linear subspace structure.

Section 4.1: Please see our response to "Did you ever try NIN without dropout?"

Sections 4.2 to 4.5: We will revise these in the coming version.

Section 4.7: They are just examples of different images.
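For concreteness, the two dropout placements compared above can be sketched roughly as follows. This is a hypothetical PyTorch-style sketch with made-up layer widths, not the authors' cuda-convnet configuration:

    import torch.nn as nn

    # (a) Dropout between mlpconv layers, i.e. on the input of the next
    #     convolution, where most of the parameters reside.
    between_mlpconv = nn.Sequential(
        nn.Conv2d(3, 96, 5, padding=2), nn.ReLU(),
        nn.Conv2d(96, 96, 1), nn.ReLU(),   # micro-net (1x1 conv), no dropout inside
        nn.Dropout(0.5),                   # drop activations of the mlpconv output
        nn.Conv2d(96, 96, 5, padding=2), nn.ReLU(),
        nn.Conv2d(96, 96, 1), nn.ReLU(),
    )

    # (b) Dropout moved inside the micro-net (reported above as doing worse).
    inside_micro_net = nn.Sequential(
        nn.Conv2d(3, 96, 5, padding=2), nn.ReLU(),
        nn.Dropout(0.5),                   # drop micro-net hidden activations instead
        nn.Conv2d(96, 96, 1), nn.ReLU(),
        nn.Conv2d(96, 96, 5, padding=2), nn.ReLU(),
        nn.Conv2d(96, 96, 1), nn.ReLU(),
    )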
Anonymous 5205 03 Mar 2014
I think it would be perfectly valid to report your result on CIFAR-10 without dropout. It would definitely be nice to have a fair comparison between maxout and NIN without dropout but the NIN number alone is still interesting.
Anonymous 5dc9 07 Feb 2014
> - A brief summary of the paper's contributions, in the context of prior work.

Convolutional neural networks have been an essential part of the recent breakthroughs deep learning has made on pattern recognition problems such as object detection and speech recognition. Typically, such networks consist of convolutional layers (where copies of the same neuron look at different patches of the same image), pooling layers, normal fully-connected layers, and finally a softmax layer. This paper modifies the architecture in two ways. Firstly, the authors explore an extremely natural generalization of convolutional layers by changing the unit of convolution: instead of running a neuron in lots of locations, they run a "micro network". Secondly, instead of having fully-connected layers, they have the features generated by the final convolutional layer correspond to categories, and perform global average pooling before feeding the features into a softmax layer. Dropout is used between mlpconv layers. The paper reports new state-of-the-art results with this modified architecture on a variety of benchmark datasets: CIFAR-10, CIFAR-100, and SVHN. They also achieve near state-of-the-art performance on MNIST.

> - An assessment of novelty and quality.

The reviewer is not an expert but believes this to be the first use of the "Network In Network" architecture in the literature. The most similar thing the reviewer is aware of is work designing more flexible neurons and using them in convolutional layers (e.g. maxout by Goodfellow et al., cited in this paper). The difference between a very flexible neuron and a small network with only one output may become a matter of interpretation at some point. The paper very clearly outlines the new architecture, the experiments performed, and the results.

> - A list of pros and cons (reasons to accept/reject).

Pros:
* The work is performed in an important and active area.
* The paper explores a very natural generalization of convolutional layers. It's really nice to have this so thoroughly explored.
* The authors perform experiments to understand how global average pooling affects networks independently of mlpconv layers.
* The paper reports new state-of-the-art results on several standard datasets.
* The paper is clearly written.

Cons:
* All the datasets the model is tested on involve classification of rather small images (32x32 and 28x28). One could imagine a few stories where the mlpconv layers would have a comparative advantage on small images (e.g. the small size makes having lots of convolutional layers tricky, so it's really helpful to have each layer be more powerful individually). If this were the case, mlpconv would still be useful and worth publishing, but it would be a bit less exciting. That said, it clearly wouldn't be reasonable to demand the architecture be tested on ImageNet; the reviewer is just very curious.
* It would also be nice to know what happens if you apply dropout to the entire model instead of just between mlpconv layers. (Again, the reviewer is just curious.)
Min Lin 17 Feb 2014
We fully agree that NIN should be tested on larger images such as those in ImageNet. We have reasonable preliminary results on ImageNet, but since the performance of maxout on ImageNet is unknown, we did not include them in the paper. As mentioned in our reply to Anonymous 5205, we think that how dropout should be applied to a model is not yet fully understood; it is a separate story in itself.
Dong-Hyun Lee 10 Feb 2014
I suspect that the mlpconv layers can be easily implemented by successive 1x1 conv layers. In a 1x1 conv layer, the lower feature maps and the upper feature maps at each location are fully connected. For example, 5x5 conv - 1x1 conv - 1x1 conv is equivalent to a 5x5 mlpconv layer with 3 local layers. Of course, this work is still interesting and valuable even if my thinking is correct; but in that case, it can be easily implemented with ordinary CNN packages.
Min Lin 17 Feb 2014
Yes, in the node-sharing case (which is used in the experiments of this paper), it is equivalent to convolution with kernel size 1. By the way, the OverFeat paper submitted to ICLR 2014 uses a 1x1 convolution kernel in the last layer. It is true that you can use the convolution function in ordinary CNN packages, but the most efficient way is to use the matrix multiplication functions in cuBLAS.
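A quick way to see this equivalence numerically (an illustrative numpy check with arbitrary sizes, not the cuda-convnet implementation): a 1x1 convolution mixes channels at each spatial location, which is exactly a matrix multiplication over the flattened spatial dimensions.

    import numpy as np

    c_in, c_out, h, w = 16, 32, 8, 8
    x = np.random.randn(c_in, h, w)          # input feature maps
    weights = np.random.randn(c_out, c_in)   # 1x1 "filters": one row per output map

    # 1x1 convolution: each output map is a weighted sum of the input channels
    # at the same spatial location (cross-channel parametric pooling).
    out_conv = np.einsum('oc,chw->ohw', weights, x)

    # The same computation as a single matrix multiplication over the flattened
    # spatial dimensions, which is why a GEMM (e.g. in cuBLAS) is the efficient route.
    out_gemm = (weights @ x.reshape(c_in, h * w)).reshape(c_out, h, w)

    assert np.allclose(out_conv, out_gemm)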
Çağlar Gülçehre 16 Feb 2014
That is an interesting paper. I have a few suggestions and comments:

(i) We used the same architecture in a paper published at ICLR 2013 [1] for a specific problem, and we called it SMLP. Two differences between their approach and ours are that the NIN authors stack several layers of locally connected MLPs (mlpconv) with tied weights, whereas we used only one mlpconv layer, and that we did not use global average pooling. However, sliding a neural network over an image to do detection/classification is quite old [2]. I think the authors should cite those papers.

(ii) Moreover, I think the authors should provide more details about their experiments and the hyperparameters they used (such as the size of the local receptive fields and the stride sizes).

(iii) A speed comparison between regular convolutional neural networks and NIN would also be interesting.

[1] Gülçehre, Çağlar, and Yoshua Bengio. "Knowledge Matters: Importance of Prior Information for Optimization." arXiv preprint arXiv:1301.4083 (2013).
[2] Rowley, Henry A., Shumeet Baluja, and Takeo Kanade. Human Face Detection in Visual Scenes. Pittsburgh, PA: School of Computer Science, Carnegie Mellon University, 1995.
Min Lin 17 Feb 2014
1. Thanks for the information; we will cite those papers in the coming version.
2. The hyperparameters of the models are in our supplementary material, which will soon be online.
3. NIN has a smaller number of parameters than a CNN. However, it has many nodes and thus requires a lot of computation. But since the nodes in NIN are fully parallel, this is not a problem if we have many computing nodes, just as the human brain does.
Anonymous bae9 19 Feb 2014
Authors propose the following modification to the standard architecture. Replace:

convolution -> relu

with

convolution -> relu -> convolution (1x1 filter) -> relu

Additionally, instead of using fully connected layers, the depth of the last conv layer is the same as the number of classes, which they average over x,y position. This generates a vector of per-class scores.

There are a number of internal inconsistencies and omitted details which make reproducing the results impossible. Those issues must be fixed for the paper to be considered for acceptance. The authors do bring up one intriguing idea which gives a novel approach to localization.

- Authors would be better off using standard terminology, like I did above; that makes reading the paper easier.
- Some space is unnecessarily taken up by issues that are irrelevant/speculative, like discussing that this architecture allows for feature filters that are "universal function approximators." Do we have any evidence this is actually needed for performance?
- In section 3.2 they say that the last averaged layer is fed into a softmax, but this contradicts Figure 4, where it seems that the last-layer features actually correspond to classes and no softmax is needed. I assumed the latter was the intention.
- The following sentence is unclear; consider expanding or removing it: "A vectorized view of the global average pooling is that the output of the last mlpconv layer is forced into orthogonal subspaces for different categories of inputs"
- The most serious shortcoming of this paper is the lack of a detailed explanation of the architecture. All I have to go on is the picture in Figure 2, which looks like 3 spatial pooling layers and 6 convolutional layers. The authors need to provide the following information for each layer: filter size, pooling size, stride, and number of features. Ideally it would be in a succinct format like Table 2 of the "OverFeat" paper (1312.6229). We have implemented the NIN idea on our own network used for SVHN and got worse results. Since the detailed architecture spec is missing, I can't tell whether the problem is with the idea or with the particulars of the network we used.

One intriguing/promising idea they bring up is using averaging instead of fully connected layers. I expected this to allow one to localize the object by looking at the outputs of the last conv layer before averaging, which indeed seems to be the case from Figure 4.
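In the standard terminology used above, one such block plus the global-average-pooling head might look roughly like the following. This is a hypothetical PyTorch-style sketch; the filter counts, kernel sizes, and pooling settings are placeholders, not the paper's actual configuration:

    import torch.nn as nn

    def block(in_ch, out_ch, kernel, pad):
        # convolution -> relu -> convolution (1x1 filter) -> relu
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel, padding=pad), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 1), nn.ReLU(),
        )

    num_classes = 10  # e.g. CIFAR-10
    net = nn.Sequential(
        block(3, 96, kernel=5, pad=2),
        nn.MaxPool2d(3, stride=2), nn.Dropout(0.5),   # dropout between blocks
        block(96, 96, kernel=5, pad=2),
        nn.MaxPool2d(3, stride=2), nn.Dropout(0.5),
        block(96, num_classes, kernel=3, pad=1),      # depth of last conv = number of classes
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # average over x,y -> per-class scores
    )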
Min Lin 28 Feb 2014
- "Authors would be better off using standard terminology, like I did above, that makes reading the paper easier."
It is true that 1x1 convolution is an easier-to-understand description of the architecture of NIN. However, regarding the motivation of the architecture and the mechanism by which it works, it is better to explain the architecture as a micro MLP convolving over the underlying data. Another explanation of the structure is cross-channel parametric pooling, as each of the output feature maps is a weighted summation of the channels in the input feature maps. We will add the 1x1 convolution explanation in the coming version for easier understanding of the architecture.

- "Some space is unnecessarily taken up by issues that are irrelevant/speculative, like discussing that this architecture allows for feature filters that are 'universal function approximators.' Do we have any evidence this is actually needed for performance?"
In the introduction of the coming version, we explain better why a universal function approximator is preferred over a GLM. The discussion is necessary because it is the motivation for proposing this architecture, and it is our explanation of why NIN can achieve good performance. We also refer to maxout as a convex function approximator in our paper, and we think maxout and NIN are both evidence that a potent function approximator is better than a GLM.

- "In section 3.2 they say that last averaged layer is fed into softmax, but this contradicts Figure 4 ..."
1. Each node in the last layer corresponds to one of the classes. 2. The values of these nodes are softmax-normalized so that they sum to one. We think there is no incompatibility between the two.

- "Following sentence is unclear, consider expanding or removing: 'A vectorized view of the global average pooling is that the output of the last mlpconv layer is forced into orthogonal subspaces for different categories of inputs'"
Global average pooling is equal to vectorizing the feature maps and doing a linear multiplication with a predefined matrix whose rows lie within orthogonal linear subspaces (spelled out after this reply). We will remove this sentence in the coming version.

- "Most serious shortcoming of this paper is lack of detailed explanation of architecture. ..."
The details of the NIN models used for the benchmark datasets will be in the supplementary material added in the coming version. The code (derived from cuda-convnet), the definition files, and the parameter settings are published and will be completed on my GitHub (https://github.com/mavenlin/cuda-convnet).
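Spelled out, the vectorized view above amounts to the following (notation chosen here for illustration: K feature maps of m pixels each are assumed). Write the last mlpconv output as a vector f \in \mathbb{R}^{Km}, with the k-th feature map occupying the k-th block of m entries. Global average pooling is then

    s = A f,  where A \in \mathbb{R}^{K \times Km} and A_{k,j} = 1/m if entry j belongs to block k, and 0 otherwise.

The rows of A have disjoint supports, so they lie in mutually orthogonal linear subspaces, matching the description in the reply; the resulting s is the vector of per-class scores fed to the softmax.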
Anonymous 5205 03 Mar 2014
Do you think you could post the revised version of your paper soon? We have until Mar 7 to discuss it. If you could post the revised version before then I'm likely to upgrade my rating of the paper. I don't feel comfortable upgrading the rating just based on discussions on the forum though. Feel free to post the updated version on a separate website so we don't need to wait for it to be approved on ArXiv.
Min Lin 17 Apr 2014
I'm sorry that I missed this comment, but fortunately I updated the paper on March 4th. Tomorrow I'll leave Banff; I really had a good time here enjoying the talks.
Jost Tobias Springenberg 27 Feb 2014
Interesting paper. I think there are a lot of possibilities hidden in the ideas brought up here, as well as in the general idea brought up by the maxout work. I have a short comment to make regarding the performance that you achieve with the "Network In Network" model. Although I do not think that this should influence the decision on this paper (nor do I think it takes anything away from the Network In Network idea), I want to make you aware that I believe a large part of your performance increase over maxout stems from your choice of hyperparameters. I am currently running a hyperparameter search for maxout for a paper submitted to ICLR (the "Improving Deep Neural Networks with Probabilistic Maxout Units" paper). The preliminary best result that I obtained for a maxout network on CIFAR-10 without data augmentation (using the same number of units per layer as in the original maxout paper) is 10.92% error. If I understand it correctly, this is approximately the same as the NIN model with a fully connected layer. The hyperparameter settings for this model are very similar to the settings I assume were used in your paper (based on the parameter file you posted at https://github.com/mavenlin/cuda-convnet/blob/master/NIN/cifar-10_def). The most crucial ingredient seems to be the pooling and filter/kernel sizes. I will post more details on the hyperparameter settings in the discussion of the "Probabilistic Maxout" paper.
Min Lin 02 Mar 2014
Hi Jost, I initialized the hyperparameters according to the parameters released with the maxout paper. For CIFAR-10, there are two things I tuned: one is the weight decay, and the other is the kernel size of the last layer (3x3 instead of 5x5). Tuning the weight decay gives most of the performance. The other settings, such as the 5x5 kernel size instead of 8x8, I just set once and did not tune for performance. I think the tunable range for the kernel size is quite small; it depends on the size of the object within the image. I have no idea whether it would affect the performance that much. I'm very interested in this and look forward to seeing the effect of the hyperparameters.
