Convolutional neural network models have recently been shown to achieve excellent performance on challenging recognition benchmarks. However, as with many deep models, there is little guidance on how the model architecture should be selected. Important hyper-parameters such as the degree of parameter sharing, number of layers, units per layer, and overall number of parameters must be selected manually through trial-and-error. To address this, we introduce a novel type of recursive neural network that is convolutional in nature. Its similarity to standard convolutional models allows us to tease apart the important architectural factors that influence performance. We find that for a given parameter budget, deeper models are preferred over shallow ones, and models with more parameters are preferred to those with fewer. Surprisingly and perhaps counterintuitively, we find that performance is independent of the number of units, so long as the network depth and number of parameters are held constant. This suggests that, computational efficiency considerations aside, parameter sharing within deep networks may not be so beneficial as previously supposed.
Review workflow log:
17 Dec 2013  Blocked      Rob Fergus -> Conference Track: Request for Endorsed for oral presentation: Understanding Deep Architectures...
17 Dec 2013  Revealed     Rob Fergus: Revealed document: Understanding Deep Architectures using a Recursive...
14 Jan 2014  Completed    Conference Track -> Anonymous 4423: Request for review of Understanding Deep Architectures using a Recursive... (due 04 Feb 2014)
14 Jan 2014  New Request  Conference Track -> Anonymous 9bc6: Request for review of Understanding Deep Architectures using a Recursive... (due 04 Feb 2014)
14 Jan 2014  Completed    Conference Track -> Anonymous 975a: Request for review of Understanding Deep Architectures using a Recursive... (due 04 Feb 2014)
20 Jan 2014  Completed    Conference Track -> Anonymous bd74: Request for review of Understanding Deep Architectures using a Recursive... (due 04 Feb 2014)

13 Comments

David Krueger 04 Jan 2014
I like that this paper attempts to disentangle the effects of different parameters. However, I find the argument that more parameters should be spent on more layers somewhat unconvincing. Taking this to the extreme suggests that every layer should have only 1 feature map, whereas you only go as low as 32. I don't think overfitting is the only reason a network with M=32, L=16 would (probably) outperform a model with M=1, L=512.

Some specific comments:

1.1: 1st sentence: you say "recursive networks" but the references you give are for reCURRENT networks. I would explicitly disambiguate the two, as they are easy to confuse. Also, Socher et al. [24] and [23] are not the same group of researchers; saying "more recently [23], they" implies they are.

3.1: "For SVHN, we used between 32 and 256 feature maps and between 1 and 8 layers beyond the first, also incrementing by powers of 2" -- I would match the syntax of the previous sentence for readability ("For SVHN, we used M = 32, 64, 128, 256 and L = 1, 2, 4, 8").
"That we were able to train networks at these large depths is due to the fact that we initialize all W^l_m to the identity" -- I feel this sentence could use some explaining.
"(except once they go beyond 4 layers for CIFAR-10, at which point they overfit, as indicated by the still-decreasing training error)" -- actually, the training error goes up as well from M=32, L=8 to M=32, L=16.

3.2.1: "For both CIFAR-10 and SVHN, performance increases as the number of layers increases, although there is an upward tick at 8 layers in the CIFAR-10 curves due to overfitting of the model." -- It looks like this is not true for 32 features on CIFAR-10.

Why not run linear regressions on the results in figures 6 and 7 to quantify the effects and show how significant/insignificant these factors are?
David Eigen 07 Jan 2014
Thanks for your comments.

Re. M=1: We believe the results we report using more common configuration ranges are practical and useful, although it could be interesting to explore whether the trends we observe hold in such extreme cases -- and if not, why, and at what point they break down. Keep in mind that M=1 does not mean that there is only one unit in each hidden layer, but rather 64 (one for each 8x8 spatial location in the feature map). Also, in the example you give (M=32, L=16), an M=1 model would need L=17,379 layers to match the parameter count, not 512. Still, for M=1, there is only a single kernel available for initial feature extraction, so the bottom-most features won't contain multiple edge orientations. It seems likely M=1 would perform poorly in this case, but for other small M this is not clear.

1.1: Recurrent/recursive are very much related, the difference being that recurrent nets are fed a new part of the input at each timestep; we will add this disambiguation. For references [24] and [23], the two groups are similar but not identical.

3.1:
- "match syntax with the previous sentence": good suggestion.
- "this sentence could use some explaining": With a zero-centered initialization, it's possible to have vanishing gradients; the identity init avoids this by copying the activations up from the first layer initially (and gradients down from the top layer). Of course, the activations stop being copies of one another as training proceeds. We can add this explanation.
- "training error goes up as well": Thanks for the correction; the training error goes down only for the untied case, and indeed goes up a bit for the tied one.

3.2.1:
- "not true for 32 features on CIFAR-10": The M=32, L=4 case here does not fit the overall pattern. The trend is still downwards until L=8, but we can point this out.
- Linear regressions for experiments 2 and 3 are a nice suggestion, which we will include. They are consistent with our interpretations, though 7(b) is somewhat weaker than the rest:

  Figure 6 (a)  CIFAR-10, vary P:  slope 0.658524, intercept 0.053026, r_value 0.968775, p_value 0.000000
  Figure 6 (b)  SVHN, vary P:      slope 0.668062, intercept 0.010306, r_value 0.934713, p_value 0.000025
  Figure 7 (a)  CIFAR-10, vary M:  slope 0.956642, intercept 0.009732, r_value 0.924918, p_value 0.000017
  Figure 7 (b)  SVHN, vary M:      slope 0.806380, intercept 0.008227, r_value 0.854079, p_value 0.001657
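[Editor's note: a minimal sketch of how regression summaries of this form (slope, intercept, r_value, p_value) can be computed, presumably with something like scipy.stats.linregress. The data below are placeholders, not the paper's measurements.]

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical inputs: x could be log2(number of parameters P), y the test error
# for each model along one curve. These numbers are illustrative only.
x = np.array([16, 17, 18, 19, 20, 21, 22], dtype=float)
y = np.array([0.26, 0.24, 0.22, 0.21, 0.19, 0.18, 0.17])

fit = linregress(x, y)
print(f"slope: {fit.slope:.6f} intercept: {fit.intercept:.6f} "
      f"r_value: {fit.rvalue:.6f} p_value: {fit.pvalue:.6f}")
```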
David Krueger 10 Jan 2014
I understand that M=1 still has 64 units, but I don't see why you would need 17,379 layers, although I haven't thought about it much. I think it would actually be nice to include in the paper an equation for the total number of parameters (given the number of layers, kernels, and tied/untied) of a model with the structure you are using.

My point here, though, is not that you should try to test these extreme cases. Rather, I am using it as a thought experiment to make an argument against (my understanding of) your interpretation of the results you present. It appears to me that you are claiming that the ONLY reason performance sometimes decreases in deeper models (compared with shallower models with the same number of parameters) is overfitting. But I think this is incorrect, for two reasons. The M=1 thought experiment is one reason that demonstrates the intuition behind my belief; the other (empirical) reason is that the training error increases from M=32, L=8 to M=32, L=16 (not just the test error). If the increase in test error in this case were due to overfitting, I would not expect the training error to increase as well. Indeed, in the write-up, you state that the decreasing training error demonstrates that overfitting is taking place ("except once they go beyond 4 layers for CIFAR-10, at which point they overfit, as indicated by the still-decreasing training error"), but since the training error is NOT still decreasing (in all cases), this interpretation does not appear correct to me.
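[Editor's note: a minimal sketch of the kind of parameter-count equation requested above, under assumptions not taken from the paper: 3x3 kernels in all layers above the first, M feature maps per layer, biases ignored, and the first (always untied) feature-extraction layer lumped into a constant P_first.]

```python
def num_params(M, L, tied, P_first=0, k=3):
    """Approximate parameter count for M feature maps and L layers above the first.

    tied=True  -> one shared M->M convolution (k x k kernels) reused at every layer.
    tied=False -> a separate M->M convolution per layer.
    """
    per_layer = M * M * k * k
    shared = per_layer if tied else per_layer * L
    return P_first + shared

# Example comparison (ignoring the first layer):
print(num_params(32, 16, tied=False))   # 147456 parameters above the first layer
print(num_params(1, 512, tied=False))   # 4608 parameters above the first layer
```

Under these assumptions an untied M=1, L=512 stack has far fewer parameters above the first layer than an untied M=32, L=16 stack; the exact 17,379 figure quoted above depends on first-layer details that are not reproduced here.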
David Eigen 19 Feb 2014
Thanks for the clarifications. We believe overfitting is the predominant cause of the decreases in test performance in the range of network sizes we study; the fact that the train/test spread increases monotonically as layers increase is a good indication of this. However, there may well be other effects, particularly for much larger numbers of layers such as the L=16 case you point out, which we have not yet studied in depth. We have added this to the discussion as well as the experiments section.
Anonymous 4423 03 Feb 2014
Summary of contributions: Studies the effect of the number of parameters, number of layers, and number of units on convolutional net performance. Uses recurrence to run nets that have e.g. more layers but not more parameters, in order to distinguish the effects of these three properties.

Novelty: moderate
Quality: low

Pros:
- Nice empirical demonstration that more parameters help.

Cons:
- Does not study dropout. I think dropout is really important for this kind of paper, because dropout has a strong effect on the optimal model size. Also, dropout is a crucial part of the state-of-the-art systems on both CIFAR-10 and SVHN, so it seems out of place for a paper on how to set hyperparameters to get good performance out of a neural net to disregard one of the most important techniques for getting good performance.
- Insufficient tuning of hyperparameters.
- Support for the claims in the abstract seems weak, with many experiments going against the claims.
- The stated goal is to provide guidance for how to set hyperparameters so that practitioners don’t have to resort to trial and error. But I don’t really see anything here that prevents that. For example, Fig 4a shows standard U-shaped curves for the number-of-layers hyperparameter. The paper says “adding more layers tends to increase performance” but this is only true on the left side of the U! The whole point of trial and error is to figure out where the bottom of the U is, and this paper completely ignores that.
- The kind of parameter tying considered in this paper is not one that is typically used in practice, at least not for this kind of problem. The conclusions are therefore not all that helpful. i.e., the authors introduce a new form of parameter tying, and then show it isn’t useful. We don’t need to publish that conclusion, because no one is using this useless form of parameter tying anyway.
- The authors don’t investigate the effect of the tiling range of tiled convolution, which is a form of control on the degree of parameter sharing that people actually use. It’d be much more interesting to study this form of parameter sharing. (This paper feels a bit like it started off as a “new methods” paper advocating convolutional recurrence, and then when the new method didn’t perform well, the authors tried to salvage it as an “empirical investigation” paper, but the empirical investigation part isn’t really targeted at the methods that would be most useful to study.)

Detailed comments:

1.1 Related work: You should also mention “Multi-Prediction Deep Boltzmann Machines”, Goodfellow et al. 2013. This paper uses recurrent networks on the image datasets MNIST and NORB. Like DrSAE, it is discriminative. It may be interpreted as a form of autoencoder, like the methods you mention in the second paragraph.

2 Approach: Your approach is definitely not the first to use recurrence and convolution in the same model. It’s probably worth discussing similarities and differences to Honglak Lee’s convolutional DBN. He describes performing mean field inference in this model. The mean field computations are essentially forward prop in a convolutional recurrent architecture, but the connectivity is different than in yours, since each update reads from two layers, and some of the weight matrices are constrained to be the transposes of each other rather than being constrained to be equal to each other. It’s also probably worth discussing how you handle the boundaries of the image, since this has a strong effect on the performance of a convolutional net. Since you say all the layers have the same size, I’m assuming you implicitly pad the hidden layer with zeros when running convolution so that the output of the discrete convolution operation has the same size as the input.

2.1 Instantiation on CIFAR-10 and SVHN: I don’t know what it means to put the word “same” in quotes. I’m assuming this refers to the zero padding that I described above, but it’s worth clarifying.

2.2: I think it’s fairly disappointing that you don’t train with dropout. How did you choose this one fixed learning rate and momentum value? How do you know it doesn’t bias the results? For example, if you find that deeper models are better, are you really finding that deeper models are better in general, or are you just finding that deeper models are more compatible with this specific learning rate and momentum setting? It seems especially important to tune the learning rate in this work because varying the amount of parameter sharing implies varying the number of gradient terms that affect each parameter. The speed at which the parameters move is probably much higher for the nets with many recurrent steps than it is for the nets with no recurrence.

3.1: “That we were able to train networks at these large depths is due to the fact that we initialize all W to the identity” -> it’s not obvious to me that it should be hard to train convolutional rectifier networks at most of these depths. For example, Google’s house number transcription paper submitted to this conference at the same time trains a 12-layer, mostly convolutional network with no mention of network depth posing a challenge or requiring special initialization. The maxout paper reports difficulty training a 7-layer rectifier net on MNIST, but that net was densely connected, not convolutional. Was it only difficult to train the recurrent nets, or also the untied ones? This is important to explain, since if the recurrent nets are significantly harder to optimize, that affects the interpretation of your results. Are the higher layer weights for all of the networks initialized to the identity, or only the ones with tied parameters? Is it literally the identity, or the identity times some scalar? If it’s literally the identity rather than the identity times some scalar, it might be too hard for SGD to shrink the initial weights and learn a different, more interesting function. Have you tried other initializations that don’t impose such a strong hand-designed constraint, such as James Martens’ sparse initialization, where each hidden unit gets exactly k randomly chosen non-zero incoming weights? This initialization scheme is also described as making it easier to train deep or recurrent nets, and it seems to me like it doesn’t trap the recurrent layer into being a fairly useless memory layer that mostly functions to echo its input.

“Likewise, for any given combination of feature maps and layers, the untied model outperforms the tied one, since it has more parameters.” I don’t agree with the claim that the untied model performs better because it has more parameters. This would make sense if the tied model were in the underfitting regime. But you have already said in the same paragraph that many of the tied models are in the overfitting regime. If you look at Fig 2, there are several points where both the tied and untied model have 0 training error and the tied model has higher validation set error. If the correct story here is overfitting due to too many parameters, then the untied model should do worse. I suspect what’s going on here is something like the identity initialization being a bad prior, so that you fit the training set in a way that doesn’t generalize well, or maybe just your choice of a single momentum and learning rate setting for all experiments ended up benefiting the untied model somehow. For example, as I said above, the recurrent nets will generally have larger gradients on each parameter, so maybe the high learning rate makes the recurrent net adapt too much to the first few minibatches it sees.

Fig 2: In the abstract you say “for a given parameter budget, deeper models are preferred over shallow ones.” It would be nice if, on the plot on the left, you evaluated points along the parameter budget contour lines instead of points on a grid, since the grid points don’t always hit the contour lines. As is, it’s hard to evaluate the claim from the abstract. However, I don’t see a lot of support for it. The best test error you get is toward the bottom right: 0.160 at the rightmost point in the second row from the bottom. Of course, this is the only point on that parameter budget contour, so it may just be winning because of its cost. However, if I look for the point with the most depth, I see one with 0.240 near the 2^18 contour line. At the bottom right of this contour line, the shallow but wide model gets 0.205. Overall, here is my summary of all your contour lines:
2^16: only one point on it
2^17: Contradicts claim
2^18: Contradicts claim
2^19: Contradicts claim
2^20: Supports claim (sort of; points aren’t that close to the contour line)
2^21: Supports claim (sort of; points aren’t that close to the contour line)
2^22: Supports claim (sort of; points aren’t that close to the contour line)
So it seems to me that this plot contradicts the claim from the abstract at least as much as it supports it. Right figure: this supports the claim in your abstract.

Table 1: While it does make sense to compare *these* experiments against methods that don’t use dropout or data augmentation, I don’t think it makes sense for these to be your only experiments. I think the case for excluding data augmentation from consideration is getting very weak. There is now a lot of commercial interest in using neural nets on very large datasets. Augmentation of small datasets provides a nice low-cost proxy for exploring this regime. As far as I know, the main reasons for not considering data augmentation are: 1) data augmentation requires knowledge of the data domain, in this case that the input is an image and the output is invariant to shifts in the input. But you are already exploiting exactly that same knowledge by using a convolutional net and spatial pooling. 2) Gaining improvements in performance by improving data augmentation techniques distracts attention from improving machine learning methods and focuses it on these more basic engineering tricks. But I’m not asking you to engineer new data augmentation methods here; you can just use exactly the same augmentation as many previous authors have already used on CIFAR-10 and SVHN. I don’t think there is any valid case at all for excluding stochastic regularization from consideration. It doesn’t require any knowledge of the data domain and it is absolutely a machine learning technique rather than just an engineering trick. Moreover, it is computationally very cheap, and state of the art across the board. By refusing to study stochastic regularization you are essentially insisting on studying obsolete methods. The only regime in which stochastic regularization is not universally superior to deterministic backprop is the extremely large data domain, which as academics you probably don’t have access to and which you are also actively avoiding by not using data augmentation.

Fig 2 and 3 in general: I understand it’s too expensive to extensively cross-validate every point on these plots, but I think it’d be good to pick maybe 4 points for each plot (maybe the upper-left and lower-right of two different contour lines) and run around 10 experiments each for those 4 points. Overall that is 80 training runs, which I think is totally reasonable. The current plots are somewhat interesting but it’s hard to have much confidence that the trends they indicate are real. Obtaining higher-confidence estimates of the real value of a small number of points would help a lot to confirm that the trends are actually caused by the number of feature maps and depth rather than by compatibility with a fixed optimization scheme.

Section 4: I don’t think the “received wisdom” is that more depth is always better, just that the optimal depth is usually greater than 1. You say your experiments show that more depth is better for a fixed parameter budget, but doesn’t Fig 2 (right) contradict this?
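[Editor's note: a minimal sketch of the sparse initialization the reviewer refers to (Martens, 2010), in which each hidden unit receives exactly k randomly chosen non-zero incoming weights and the rest are zero. The shapes, k, and the scale below are illustrative, not taken from the paper or the review.]

```python
import numpy as np

def sparse_init(n_in, n_out, k=15, scale=1.0, rng=None):
    """Weight matrix of shape (n_out, n_in) with exactly k non-zeros per output unit."""
    if rng is None:
        rng = np.random.default_rng(0)
    W = np.zeros((n_out, n_in))
    for j in range(n_out):
        idx = rng.choice(n_in, size=min(k, n_in), replace=False)  # k random inputs
        W[j, idx] = rng.normal(0.0, scale, size=len(idx))         # non-zero weights
    return W

W = sparse_init(n_in=576, n_out=64)   # e.g. a flattened 3x3x64 -> 64 mapping
print((W != 0).sum(axis=1))           # each row has exactly k non-zero entries
```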
David Eigen 19 Feb 2014
A major concern was that the claims in our paper are too strong, given the somewhat narrow focus of our experiments. On reflection, we acknowledge this and have rewritten the abstract, introduction and discussion to better reflect this focus. We have also included some new plots which make a clearer case for the central arguments of our paper.

Re: contour lines in Fig 2: We feel it's possible the confusion about these conclusions may have stemmed from trying to distill them from Fig 2, rather than looking at Fig 5, where they are clearly visible. Fig 5 ("Experiment 1b" in the latest version) shows performance for these same models according to the number of parameters; we have updated it to include training error as well as test error. Contours in Fig 2 correspond to vertical cross-sections in Fig 5. The trend that more layers tend to help performance is easily visible in Fig 5.

Data augmentation and dropout: This was a tough choice we made when designing the experiments. We wanted to include these initially, but also questioned, if we were to choose some of these regularizations, which ones we should use (dropout, maxout, translations, flips, scale jitters, etc.). These introduce many combinations, which we opted to exclude. However, we feel our experiments still present useful evidence without these regularizations. As Reviewer 975a points out, the key trends are more clearly shown in the training errors, since they are not concerned with generalization, which is affected by the numerous possible regularization approaches. We have thus revised the paper to include plots for both training and test errors.

Additional references: Thank you for these references; we have included them in the related work.

"The stated goal is to provide guidance for how to set hyperparameters so that practitioners don’t have to resort to trial and error. But I don’t really see anything here that prevents that." We don't aim to eliminate the need for trial and error, but do feel our experiments provide useful guidelines for informing sizing choices in convolutional layers. The new abstract and introduction explain this better.

"The kind of parameter tying considered in this paper is not one that is typically used in practice, at least not for this kind of problem." Although this is not a commonly used tying scheme, we feel it enables a unique study of convolutional layers' performance characteristics. In particular, we can see the effect of varying the number of layers alone, as Reviewer 975a points out. We furthermore find that varying the number of feature maps appears to affect performance predominantly through changing the number of parameters (as opposed to the representation space), which we think is a useful point in helping inform sizing choices.

"especially important to tune the learning rate in this work because varying the amount of parameter sharing implies varying the number of gradient terms that affect each parameter" We chose the hyperparameters using multiple model sizes, picking values that worked well in all cases. For the tied case, we tried dividing the learning rate for the higher layers by the number of layers (i.e. the number of gradient terms), as well as not dividing, and found that not dividing the learning rate worked better. As pointed out, there are many different models to run here, and it is impractical to run everything. In addition, the trends we find are consistent across many combinations of model size and type. While perhaps each individual model might perform a bit better with different settings, we find it very unlikely that the overall trends would be much affected by different values.

"it’s not obvious to me that it should be hard to train convolutional rectifier networks at most of these depths. ... Was it only difficult to train the recurrent nets, or also the untied ones?" Both untied and tied models ran into trouble with zero-centered Gaussian initializations at some of the larger depths.

"Are the higher layer weights for all of the networks initialized to the identity, or only the ones with tied parameters?" Both untied and tied models use this initialization.

"Have you tried other initializations that don’t impose such a strong hand-designed constraint, such as James Martens’ sparse initialization, where each hidden unit gets exactly k randomly chosen non-zero incoming weights?" No, we did not try this, but it is an interesting idea that we will try. Thank you.

"I don’t agree with the claim that the untied model performs better because it has more parameters. ... If the correct story here is overfitting due to too many parameters, then the untied model should do worse." Adding more parameters helps until it causes more overfitting. The points you mention still appear to have benefited from the extra parameters. Zero training error is not a hard cutoff implying there will be more overfitting when parameters are added: eventually adding parameters will make generalization worse, but it can also still improve performance for some time (as demonstrated by the majority of results in Fig 2).

"pad the hidden layer with zeros"; "the word “same” in quotes" Yes, we mean that the edges are padded with zeros; this has been clarified in the newer version.
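[Editor's note: for concreteness, one plausible reading of the identity initialization discussed above, for a layer of 3x3 convolutions mapping M feature maps to M feature maps: the kernel connecting each map to itself is a centered delta and all cross-map kernels are zero, so at initialization forward prop copies each feature map upward unchanged. This is an illustrative guess at the scheme, not the authors' code.]

```python
import numpy as np

def identity_conv_init(M, k=3, gain=1.0):
    """Initialize an M->M bank of k x k kernels so the layer acts as the identity."""
    W = np.zeros((M, M, k, k))   # (out_map, in_map, kernel_h, kernel_w)
    c = k // 2
    for m in range(M):
        W[m, m, c, c] = gain     # delta at the kernel center for map m -> m
    return W

W = identity_conv_init(M=32)     # gain=1.0 gives literal identity; other gains scale it
```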
David Eigen 19 Feb 2014
Thank you for your comments. We have updated the paper with some major revisions, and it is now online. Responses to your comments are below.
Anonymous 975a 07 Feb 2014
An analysis of deep networks is presented that tries to determine the effect of various convnet parameters on final classification performance. A novel convnet architecture is proposed that ties weights across layers. This enables clever types of analysis not possible with other architectures. In particular, the number of layers can be increased without increasing the number of parameters, thus allowing the authors to determine independently whether the number of layers is important or whether the number of parameters is important. (Normally, these are confounding factors that are hard to judge separately.) Several experiments are proposed that independently vary the number of maps, number of parameters, and number of layers. It is reported that while the number of maps appears to be irrelevant in this setup, the number of layers and number of parameters are very important [more is better!].

This is a pretty unorthodox but interesting form of analysis. I think it is worth highlighting the “number of layers” experiment, since I didn’t see immediately how you could do this with another tying strategy. The results confirm our intuitions about important parameters, but also suggest that perhaps weight-tying spatially is one place for improvement.

I am a bit worried that there are caveats to this analysis that could be better analyzed. For example, section 3.2.2 shows the “untied” system working better than “tied”, but could this have more to do with the finicky nature of recurrent models (e.g., failure to find a good minimum) than with the number of parameters? Section 3.2.3 might implicitly address this: for a fixed number of layers and parameters, the tied model performs about the same. If it had performed worse again, we might have been tricked into thinking that the number of maps mattered, when it could have implied that the tied model itself was a worse performer. A bit more clarity about the potential caveats of this analysis and the implications of each experiment would help.

One experiment I was surprised not to see: holding M and P fixed, compare a tied model with L layers to an untied model with fewer layers [to keep P constant]. This is apparently not what is done in 3.2.1, but might help address my concern above.

Pros: Clever, novel analysis of the interplay of deep network characteristics and their effect on classification performance. Useful rules of thumb that may benefit future work. Appears to confirm widely-held intuitions about the depth and scale of models.

Cons: The analysis method might be introducing effects that are not clear (e.g., the effect of using recurrence on the optimization problem). Hard to know how these results will transfer to more typical convnets that use max pooling, LCN, etc.

Other: In much of the analysis I thought it might be more useful to consider training error as the main metric for comparing networks. At this point, being able to achieve low training error is the main goal of fiddling with the model size, etc., and testing error/generalization is governed by a different bag of tricks [synthetic data, Mechanical Turk, unsupervised learning, extensive cross-validation, bagging, etc.].
David Eigen 19 Feb 2014
Thanks for your review and suggestions. In response to your comments:

Re: training error: We now include training error for all experiments in the updated version of the paper. As you point out, training error is not encumbered by generalization issues, and offers a clear view into the effects of model size.

"holding M and P fixed, compare a tied model with L layers to an untied model with fewer layers [to keep P constant]." An untied model will have more parameters than a tied model with the same M, even with fewer layers, since each layer adds parameters in the untied case but not in the tied one. We compare different L for fixed P using each model in section 3.2.1.

"A bit more clarity about the potential caveats of this analysis and the implications of each experiment would help." Thanks for the suggestion; we have significantly revised our discussion to include more on the caveats and implications.
Anonymous bd74 07 Feb 2014
This paper analyzes the effect of different hyper-parameters for a convolutional neural network: the number of convolutional layers (L), the number of feature maps (M), and the total number of free parameters in the model (P). The main challenge is the tight relation between these three hyper-parameters. To study the effect of each factor independently, a recurrent architecture is proposed where weights are tied between different convolutional layers, so that the number of layers can be varied without changing the total number of parameters. Pooling is only applied to the very first layer and is not applied in any of the tied layers on top.

While it is important to see experimental papers like this one that offer an analysis of the effect of different design parameters for neural networks, weight tying across different convolutional layers is a bit artificial for this task (you scan the whole image at once in each layer). The main take-home message of this paper is that varying the number of feature maps is not important given that the number of free parameters and model depth are held constant. It is important to add experiments/arguments that show if the same conclusion holds when pooling is used, for example.

General comments:
- What kinds of features are learned by the tied network vs the untied (normal) one? Does the network work around tying by dedicating some features to be used most of the time per specific hidden layer but not by the others? (as if it is working in an untied regime but with a smaller number of feature maps per layer).
- I don’t think table 1 is needed, because the paper is not aiming at achieving the best-ever accuracy but rather at exploring different factors affecting performance.
- Regarding writing, the paper gets repetitive at points; for example, the information in table 2 is stated in a paragraph on page 7. The same conclusions are stated in the same way multiple times.
- On page 8, “We then compared their classification performance after training as described in Section 2.2.” Is there something specific you mean by “after training”?
David Eigen 19 Feb 2014
Thanks for your comments and critiques. We have responded to your questions here:

"if the same conclusion holds when pooling is used": Unpooled convolutional layers are powerful tools in many current models, and our experiments apply directly in characterizing their performance behaviors. While we feel the principles we find are likely to extend to more complex cases as well, we agree that it's unclear how the results might change if pooling were used between layers. We have added this point to the discussion. Note that we assume here that by pooling, one means an operation that explicitly throws away spatial resolution (e.g. an 8x8xM hidden layer may be pooled into a 4x4xM one). Aggregating over spatial regions is already done by the convolutions themselves, since they compute weighted averages over 3x3 regions.

"Does the network work around tying by dedicating some features to be used most of the time per specific hidden layer but not by the others?": We looked into this by comparing several activation statistics for corresponding units between layers (e.g. number of nonzeros, mean activation, mean nonzero activation), but did not find much clear evidence of it. If it happens, it plays only a partial role in the network's behavior.

"I don’t think table 1 is needed": We put this in to demonstrate that despite its simplifications, our model still maintains good performance, so studying it is not a large departure in terms of performance.

"Is there something specific you mean by 'after training'?": No; thanks for this edit, we removed this sentence.

"the paper gets repetitive at points": Thanks for the feedback; we have tightened up much of the writing in the new version.
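[Editor's note: a rough sketch of the per-layer activation statistics mentioned above (fraction of nonzeros, mean activation, mean nonzero activation), assuming each layer's activations are available as a numpy array of shape (batch, maps, H, W); the variable names are illustrative.]

```python
import numpy as np

def activation_stats(acts):
    """acts: list of per-layer activation arrays, e.g. after a ReLU nonlinearity."""
    stats = {}
    for l, a in enumerate(acts):
        nz = a > 0                                            # nonzero (active) units
        stats[l] = {
            "frac_nonzero": float(nz.mean()),
            "mean_act": float(a.mean()),
            "mean_nonzero_act": float(a[nz].mean()) if nz.any() else 0.0,
        }
    return stats

# acts = [layer1_activations, layer2_activations, ...]   # hypothetical inputs
# print(activation_stats(acts))
```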
David Eigen 19 Feb 2014
We would like to thank all the reviewers and readers for their comments and concerns. In response, we have made many significant revisions, including a new abstract and discussion, and new plots including training error for each experiment. We have responded to each of your comments above in separate replies. The new version is now available on arXiv.
Anonymous 4423 03 Mar 2014
After seeing the updated version, I'm still not sure what the main useful takeaway message is supposed to be. We already know that most hyperparameters have U-shaped performance curves, where they reduce both train and test error for a while and then start to increase test error due to overfitting. I also feel that the analysis method of introducing the new form of recurrent parameter sharing makes the picture cloudy. I think for further exploration in this direction it would make more sense to alter the number of parameters per layer by varying things like the spatial size of the kernels or the rank of the kernels, or to use tiled convolution and alter the tiling range.