Convolutional networks are one of the most widely employed architectures in computer vision and machine learning. In order to leverage their ability to learn complex functions, large amounts of data are required for training. Training a large convolutional network to produce state-of-the-art results can take weeks, even when using modern GPUs. Producing labels using a trained network can also be costly when dealing with web-scale datasets. In this work, we present a simple algorithm which accelerates training and inference by a significant factor, and can yield improvements of over an order of magnitude compared to existing state-of-the-art implementations. This is done by computing convolutions as pointwise products in the Fourier domain while reusing the same transformed feature map many times. The algorithm is implemented on a GPU architecture and addresses a number of related challenges.
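To make the core idea concrete, here is a minimal NumPy sketch of an FFT-based convolutional layer (an illustration under assumed shapes, not the authors' GPU implementation): every input feature map and every kernel is transformed once, and each transform is then reused across all of the pairwise products it takes part in. The function name and sizes are hypothetical.

```python
import numpy as np

def fft_conv_layer(inputs, kernels):
    """Batched 'valid' convolution via pointwise products in the Fourier domain.

    inputs:  (S, f, n, n)   -- S images with f input feature maps
    kernels: (f_out, f, k, k)
    returns: (S, f_out, n - k + 1, n - k + 1)
    """
    S, f, n, _ = inputs.shape
    f_out, _, k, _ = kernels.shape

    # Each input map and each (zero-padded) kernel is transformed only once...
    X = np.fft.rfft2(inputs, s=(n, n))      # (S, f, n, n//2 + 1)
    K = np.fft.rfft2(kernels, s=(n, n))     # (f_out, f, n, n//2 + 1)

    # ...and reused for every (image, output map) pair: sum over input maps
    # of pointwise products in the frequency domain.
    Y = np.einsum('sfxy,ofxy->soxy', X, K)  # (S, f_out, n, n//2 + 1)

    y = np.fft.irfft2(Y, s=(n, n))          # circular convolution result
    # Wrap-around only affects the first k-1 rows/columns, so cropping them off
    # leaves the 'valid' region where the kernel is fully contained in the image.
    # (Frameworks that define convolution as cross-correlation would flip the
    # kernels, or equivalently conjugate K, before taking the product.)
    return y[:, :, k - 1:, k - 1:]
```

With f input maps, f_out output maps and a minibatch of S images, each of the S·f input transforms is reused f_out times and each of the f·f_out kernel transforms S times, which is where the savings over a direct per-pair convolution come from.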
12 Comments

Soumith Chintala 04 Jan 2014
Training convnets on GPUs has been a challenge in terms of both time and memory; in practice, model size is more often constrained by the memory available per GPU than by processing time. It would be helpful to add a comparison of the memory requirements of your method against purely spatial convolution implementations.

You mention: "Also note that our method performs the same regardless of kernel size, since we pad the kernel to be the same size as the input image before applying the FFT." From my understanding, without looking at your implementation, I would assume that this particular operation makes the memory requirements balloon, making the method impractical even for current state-of-the-art models in image and audio recognition. It would be good to have a section on practical constraints and issues, and on how you think they should be handled.
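As a purely illustrative back-of-the-envelope check of this concern (sizes and helper names are hypothetical, not taken from the paper), padding each k x k filter to the full n x n input size before the FFT does inflate the kernels' storage considerably:

```python
# Rough estimate (not from the paper) of how much memory the frequency-domain
# kernels take once each k x k filter is padded to n x n. A real 2-D FFT stores
# n * (n//2 + 1) complex coefficients per padded kernel.
def kernel_freq_bytes(f_in, f_out, n, bytes_per_complex=8):   # complex64
    return f_in * f_out * n * (n // 2 + 1) * bytes_per_complex

def kernel_spatial_bytes(f_in, f_out, k, bytes_per_float=4):  # float32
    return f_in * f_out * k * k * bytes_per_float

# Hypothetical layer: 96 -> 256 maps, 7x7 kernels on 32x32 inputs.
print(kernel_freq_bytes(96, 256, 32) / 2**20, "MiB in the Fourier domain")
print(kernel_spatial_bytes(96, 256, 7) / 2**20, "MiB as spatial 7x7 filters")
```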
Rodrigo Benenson 23 Jan 2014
Speeding up convolutional networks is quite an interesting topic. The presented method seems effective and principled, and the paper reads easily; however, the experimental results seem somewhat lacking.

The whole premise of the work is to speed up training and testing, yet there are not even anecdotal results on total training time. This matters because, at the end of the day, it is the system-wide performance that counts. A theoretical paper that only shows an improvement in the number of operations might hide the fact that memory access is now more convoluted, so that the overall speed goes down despite the reduced operation count. In the context of this paper, some evidence of the overall speed would be welcome. At the same time, full training experiments would show that, in practice, there is no degradation of the learnt model or of the predicted scores. There are all kinds of implementation issues that can turn theoretically identical results into different outcomes (because on a computer, a + (b + c) != (a + b) + c).

When using methods based on the Fourier transform, border effects come into play. Could you mention how you handle these? Also, Section 3 mentions an "additional heuristic"; could you detail why this is a heuristic, and what the consequences of using it are?

Finally, are there plans to release this training code? Open-source releases tend to increase the impact of a paper.

Overall quite interesting work; looking forward to the results of training filters directly in the Fourier domain.
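On the a + (b + c) != (a + b) + c point, a tiny illustration (unrelated to the paper's code) of why two mathematically identical implementations are usually compared with a tolerance rather than for bit-exact equality; the tolerance values are arbitrary:

```python
import numpy as np

# Floating-point addition is not associative, so two mathematically identical
# convolution implementations need not produce bit-identical outputs.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0  (b + c rounds back to -1e8 in float32)

# Hence implementations are usually compared with a tolerance, e.g.:
# np.allclose(y_fft, y_direct, rtol=1e-4, atol=1e-6)
```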
Anonymous c809 28 Jan 2014
The paper presents a technique for accelerating the processing of CNNs by performing training and inference in the frequency (Fourier) domain. The work argues that at a certain scale, the overhead of applying the FFT and inverse FFT is marginal relative to the overall speed gain.

As noted by the previous reviewers, the speedup is presented simply as that obtained for the three functions that lie at the heart of each convolutional layer. It would be valuable if the speedup could also be presented in the context of the overall training time of a CNN on a standard dataset.

Also noted is the lack of reference to the fact that the Convolution Theorem refers to circular convolution and not linear (i.e. non-circular) convolution. This is assumed to be inconsequential, since CNNs use neither full circular convolution (weight filters do not wrap around images) nor full linear convolution (weight filters are always fully contained within the image and do not "hang off" the edges); thus, the resulting differences between circular and linear convolution would not impact the feature map y_f. This seems to be hinted at by the n' term in Section 2.2, but it is not obvious.

The future work seems logical and would be interesting to pursue. One other direction to consider is approximations to the FFT (of which there are many) that could retain most of the information needed in the context of CNNs at a fraction of the computational cost.

Minor editorial issue: in Figure 3 the axes are noted in the title of the figure rather than as labels on the x and y axes.
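To illustrate the circular-versus-valid point, the following self-contained NumPy check (an illustration with arbitrary sizes, not the authors' code) confirms that cropping the circular convolution to the region where the zero-padded kernel does not wrap around reproduces the direct "valid" convolution up to roundoff, which appears to be what the n' term in Section 2.2 refers to:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 16, 5
x = rng.standard_normal((n, n))
w = rng.standard_normal((k, k))

# Circular convolution via a pointwise product of FFTs (kernel zero-padded to n x n).
y_circ = np.fft.irfft2(np.fft.rfft2(x) * np.fft.rfft2(w, s=(n, n)), s=(n, n))

# Direct 'valid' convolution: the kernel is always fully contained in the image.
y_valid = np.zeros((n - k + 1, n - k + 1))
for i in range(n - k + 1):
    for j in range(n - k + 1):
        y_valid[i, j] = np.sum(x[i:i + k, j:j + k] * w[::-1, ::-1])  # flip = convolution

# Wrap-around only contaminates the first k-1 rows/columns of y_circ,
# so cropping them off recovers the 'valid' result.
assert np.allclose(y_circ[k - 1:, k - 1:], y_valid)
```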
Mikael Henaff 01 Feb 2014
Thank you all for the constructive comments.

Soumith: we added details on how we address the memory issues. It is true that the method requires some extra memory, but a block of memory can be preallocated once and reused to store the frequency representations at each layer. The amount of extra memory needed is therefore equal to the maximum amount required to store the frequency representations of a single layer, which is small compared to the memory needed to store a large network.

Rodrigo and Anonymous c809: we added a number of results reporting the running times for several different configurations of image and kernel sizes, as well as for different numbers of input and output feature maps. We also added the running time of a training iteration of a whole network (not just a single layer), which accounts for memory accesses, padding and other implementation details. We also report the results of our unit tests, which compare the outputs of the FFT-based convolution and the direct method (the differences are very small).

Concerning the "additional heuristic": this case is actually covered in the analysis section, so we removed it to avoid confusion. Concerning the border effects: we simply compute the circular convolution via the product of the frequency representations and crop the output to include only the coefficients for which the weight filter is fully contained within the image. We will clarify this in the paper. We will also edit the graphs in Figure 3 to make the axis labels clearer.
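A minimal sketch of the preallocation scheme described in this reply, with hypothetical layer shapes and helper names (the actual implementation may differ): one scratch buffer, sized for the most demanding layer's frequency representations, is allocated once and reused by every layer.

```python
import numpy as np

# Hypothetical per-layer shapes: (minibatch S, input maps f_in, output maps f_out, image side n).
layers = [(128, 3, 96, 32), (128, 96, 256, 16), (128, 256, 384, 8)]

def n_freq_coeffs(S, f_in, f_out, n):
    # inputs + outputs + kernels, each stored as n x (n//2 + 1) complex coefficients
    return (S * f_in + S * f_out + f_in * f_out) * n * (n // 2 + 1)

# Allocate one scratch buffer sized for the most demanding layer...
scratch = np.empty(max(n_freq_coeffs(*shape) for shape in layers), dtype=np.complex64)

for shape in layers:
    # ...and reuse a prefix of it at every layer, so the extra memory never
    # exceeds the frequency-domain footprint of the largest single layer.
    freq_view = scratch[:n_freq_coeffs(*shape)]
    # (fill freq_view with this layer's FFTs, do the pointwise products, move on)
```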
Anonymous 3b1a 07 Feb 2014
The paper describes the use of FFTs to speed up computation during training of convolutional neural networks working on images. Essentially this is presented as a pure speed-up technique that does not change the learning algorithm or (in an interesting way) the representation.

The idea of applying FFTs to speed up image processing systems, particularly "sliding window" systems, is far from new, and there is a large literature on this. In particular, combining FFTs with neural networks is not new, e.g. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.36.1967 . Some of this prior literature should be cited. I am not aware of any work that applies the back-propagation in the Fourier domain as well.

The resulting speed-ups are significant for the scenario the authors are considering, and it is useful to know that the practical implementation delivers these gains. As they conclude, these results may change the way such problems are formulated by removing the pressure to use small kernels.

Expand the caption for Figure 2: total number of operations for what? Figure 3 needs units for the y axis (the text says seconds?) and for the x axes (i.e. areal or linear pixels?). Also, for each of the three sets of graphs, there needs to be an indication of which parameter values are held constant. Please say in the text that all three systems (Torch, the authors' and Krizhevsky's) were running on the same (which?) GPU. A citation for the Cooley-Tukey FFT?
Mikael Henaff 18 Feb 2014
Thank you for the feedback. We posted an updated version of the paper which incorporates these changes.
Anonymous 9161 10 Feb 2014
"Fast Training of Convolutional Networks through FFTs" compares Fourier-domain vs. spatial-domain convolutions in terms of speed in convnet applications. The question of the relative speed of Fourier vs. Spatial convolutions is common among engineers and researchers, and to my knowledge no one has attempted to characterize (at least in convnet-specific terms) the settings when each approach is preferred. Spatial domain convolutions have been the standard in multiple implementations over 30 years of research by scores of researchers. This paper claims, surprisingly, that FFTs are nearly always better in modern convnets. At the same time, the authors of the paper introduce a strategy for FFT parallelization on GPUs that is somewhat particular to the sorts of bulk FFTs that arise in convnet training, and the conclusions are based on that implementation running on GPU hardware. CONTRIBUTIONS 1. Empirical comparison of spatial and Fourier convolutions for convnets 2. A fast Cooley-Tukey FFT implementation for GPU that's well-suited to convnet application QUALITY The figures and formatting are not very polished. PRO 1. The paper aims at an important issue for convnet researchers 2. The claim that FFT-based convolutions are better will be broadly interesting CON 1. The paper does not explain when spatial-domain calculations would be faster 2. The paper does not discuss how the trade-offs would be different on single-core or multi-core CPUs, or on different GPUs. 3. Details of the Cooley-Tukey implementation are not given 4. No mention is made of downloadable source code, this work might be hard to reproduce COMMENTS - What about non-square images? - Why use big-O notation in 2.2 when the approximate number of FLOP/s is easy to compute? Asymptotic performance isn't really the issue at hand, the relevant values of n and k are not very large. Consider falling back on big-O notation only after making it clear that the main analysis will be done on more precise runtime expressions. - The phrase "Our Approach" is surprising on page 3, because it does not seem like you are inventing a new Fourier-domain approach to convolution. Isn't the spatial domain faster sometimes, Fourier faster sometimes, and you're writing a paper about how to know which is which? - The last paragraph of section 3 is confusing: which of your experiments use and do not use the memory-intensive FFT-reuse trick? The following sentence in particular makes the reader feel he is being duped "All of the analysis in the previous section assumes we are using this memory-efficient approach [which you now tell is is infeasible in important practical applications]; if memory is not a constraint, our algorithm becomes faster." Faster than what? Faster than the thing that doesn't require prohibitive amounts of memory? - Page 4: when you speak of "another means to save memory" what was the first way? (Was the first way to recompute things on demand?) - Page 5: Figure 3: This figure is hard to understand. The axes should be labeled on the axes, and the title should contain the contents of the current caption (not the names of the axes), and the caption should help the reader to understand the significance of what is being shown. - Why is the Torch7 implementation called Torch7(custom), and not just Torch7? - The memory access patterns entailed by an algorithm is at least as important for GPU performance as the number of FLOP/s. How does the Cooley-Tukey FFT algorithm work, and how did you parallelize it? 
These implementation details are really important for anyone trying to reproduce your experiments. - What memory layout do you recommend for feature maps and filters? This ties in with a request for more detail on the algorithm you used.
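For readers asking how a Cooley-Tukey FFT works in principle, here is a textbook radix-2 recursion (power-of-two lengths only); it is purely illustrative and says nothing about the parallel GPU implementation used in the paper.

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT of a sequence whose length is a power of two."""
    N = len(x)
    if N == 1:
        return list(x)
    even = fft(x[0::2])   # DFT of the even-indexed samples
    odd = fft(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + twiddle
        out[k + N // 2] = even[k] - twiddle
    return out

# Example: fft([1, 2, 3, 4, 5, 6, 7, 8]) matches a direct O(N^2) DFT up to roundoff.
```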
Mikael Henaff 18 Feb 2014
Thank you for the feedback. To answer your comments:

- "The paper does not explain when spatial-domain calculations would be faster." Our analysis in Section 2.2 compares the theoretical complexity of spatial-domain calculations with that of the Fourier-based method, and our empirical results in Section 3 compare the performance of two spatial-domain implementations (CudaConv and Torch7 (custom)) with the Fourier-based method. We added a sentence clarifying that these two implementations use the direct method in the spatial domain.

- "The paper does not discuss how the trade-offs would be different on single-core or multi-core CPUs, or on different GPUs." The main point is that for modern ConvNets with large numbers of input and output feature maps, the FFT-based method significantly reduces the number of operations required. This result (explained in Section 2.2) holds regardless of the architecture on which the algorithm is implemented. We performed experiments with a GPU implementation because this is the most widely used setting, but the general result holds whether we use a CPU or a GPU.

- "Details of the Cooley-Tukey implementation are not given" / "No mention is made of downloadable source code." We added a reference to the Cooley-Tukey algorithm. We will eventually make the source code available.

- "What about non-square images?" We added a footnote explaining that the results also apply to non-square images.

- "Why use big-O notation in 2.2 when the approximate number of FLOPs is easy to compute?" Done.

- "The phrase 'Our Approach' is surprising on page 3." We are aware that the idea of performing a convolution through a Fourier transform is not new. The speedup occurs when we perform many pairwise convolutions between two sets of matrices, so the analysis is not at the level of a single convolution but of sets of convolutions. As pointed out by another reviewer, a related idea was explored in the 1990s for accelerating inference in previously trained models, and we added a mention of this work. However, our work differs from theirs in the following ways: (1) they use FFTs for inference (the fprop method) only, whereas we show they can be used for all three training operations (fprop, backprop and gradient accumulation); (2) they only use FFTs for inference with a previously trained network, i.e. they do not use them to compute the fprop during training. One reason might be that the number of feature maps used at the time was much smaller (they use 25), and the method was not effective unless the filters were precomputed offline. We use FFTs for the fprop during training and show that it yields a substantial acceleration even when the FFTs of the filters are not precomputed; this is because modern ConvNets have a much larger number of feature maps, which is exactly when the FFT-based method pays off. (3) They use FFTs for the first layer only (all other layers are fully connected), whereas we show that it provides acceleration at all levels. We agree that the main idea is quite simple, but to our knowledge no one in the machine learning community currently uses this method for training or inference with convnets, which makes it a new approach in our opinion.

- "The last paragraph of section 3 is confusing." This sentence seems clear to us. By "memory-efficient", we mean the approach that does *not* require very much extra memory. By "all the analysis in the previous section", we are referring to the analysis in Section 2.2, which assumes we are using the memory-efficient approach that recomputes the FFTs at each iteration rather than storing them. Nevertheless, we reworded this and hope it is now crystal clear.

- "When you speak of 'another means to save memory', what was the first way? (Was the first way to recompute things on demand?)" Yes, and it is described in the preceding paragraph.

- "Figure 3 is hard to understand." We fixed this.

- "Why is the Torch7 implementation called Torch7 (custom), and not just Torch7?" We mention at the beginning of Section 3 that this is a custom implementation using Torch7, different from the version that ships with Torch.

- "How does the Cooley-Tukey FFT algorithm work, and how did you parallelize it?" / "What memory layout do you recommend for feature maps and filters?" This will be clear when we make the source code available.
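To make the "pays off with many feature maps" argument concrete, here is a rough operation count in the spirit of Section 2.2, using our own simplified cost model (the constants, the 2.5x FFT factor and the configurations below are guesses for illustration, not the paper's expressions or results):

```python
from math import log2

def direct_ops(S, f_in, f_out, n, k):
    # One multiply-add per kernel tap per output position,
    # for every (image, input map, output map) triple.
    return S * f_in * f_out * (n - k + 1) ** 2 * k ** 2

def fft_ops(S, f_in, f_out, n, c_fft=2.5):
    # FFTs of inputs, outputs and kernels (cost ~ c_fft * n^2 * log2(n^2) each),
    # plus ~4 real multiply-adds per complex pointwise product.
    transforms = (S * f_in + S * f_out + f_in * f_out) * c_fft * n ** 2 * log2(n ** 2)
    products = 4 * S * f_in * f_out * n ** 2
    return transforms + products

# The advantage of the FFT-based count grows with the number of feature maps,
# since the transforms are amortized over more pointwise products.
for f in (4, 16, 96, 256):
    print(f, round(direct_ops(128, f, f, 32, 7) / fft_ops(128, f, f, 32), 1))
```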
victor liparsov 24 Feb 2015
Let's try it together.