Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks
Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, Vinay Shet
23 Dec 2013 · arXiv · 5 Comments
Conference Track
Recognizing arbitrary multi-character text in unconstrained natural photographs is a hard problem. In this paper, we address an equally hard sub-problem in this domain, viz. recognizing arbitrary multi-digit numbers from Street View imagery. Traditional approaches to this problem typically separate out the localization, segmentation, and recognition steps. In this paper we propose a unified approach that integrates these three steps via the use of a deep convolutional neural network that operates directly on the image pixels. This model is configured with eleven hidden layers, all with feedforward connections. We employ the DistBelief implementation of deep neural networks to scale our computations over this network. We have evaluated this approach on the publicly available SVHN dataset and achieve over 96% accuracy in recognizing street numbers. We show that on a per-digit recognition task, we improve upon the state of the art and achieve 97.84% accuracy. We also evaluated this approach on an even more challenging dataset generated from Street View imagery containing several tens of millions of street number annotations and achieve over 90% accuracy. Our evaluations further indicate that at specific operating thresholds, the performance of the proposed system is comparable to that of human operators, and it has to date helped us extract close to 100 million street numbers from Street View imagery worldwide.
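The unified formulation described in the abstract can be sketched concretely. This is an illustrative reconstruction, not the authors' DistBelief code: the network's shared features feed a softmax over the sequence length L (classes 0 through 5, plus a "more than 5" class) and one softmax over ten digit classes for each of up to five positions, and the probability of a transcription s factorizes as P(L = |s|) · ∏ᵢ P(sᵢ). All logits below are random toy values.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_log_prob(length_logits, digit_logits, digits):
    """Log-probability of a transcription under the factorized output:
    log P(L = len(digits)) + sum_i log P(s_i = digits[i])."""
    p_len = softmax(length_logits)
    logp = np.log(p_len[len(digits)])
    for i, d in enumerate(digits):
        logp += np.log(softmax(digit_logits[i])[d])
    return logp

# Toy logits: one head for length (0..5 and ">5", i.e. 7 classes) and
# five per-position digit heads with 10 classes each.
rng = np.random.default_rng(0)
length_logits = rng.normal(size=7)
digit_logits = rng.normal(size=(5, 10))
lp = sequence_log_prob(length_logits, digit_logits, [1, 2, 3])
```

Because every factor is a proper softmax, the result is always a valid (negative) log-probability, and the same quantity can be thresholded to trade recall for precision.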
5 Comments

Anonymous 88a9 07 Feb 2014
This submission describes an approach for digit and sequence recognition that results in improved performance on the StreetView house number dataset, using a simple structured output to recognize the entire sequence as an ordered set of detections and a very deep convolutional network (11 layers). The approach is end-to-end, without requiring multiple networks or a composite of techniques. It achieves very high accuracy and is attractive in its simplicity, although the authors make clear that it could not be extended to, for instance, general text recognition in images. The relevance for ICLR is high, although the paper has some significant omissions that keep it from being a very strong candidate for acceptance.

First, the method is not clear, in particular the interface between the various softmax digit classifiers and the output of the convnet. It is unclear whether there is any representation of locality in this interface beyond the assignment of different classifiers to each position. In the introduction, it is stated that the subtasks of localization, segmentation, and recognition are solved in an integrated way, but Section 3 introduces the dataset, which has localized inputs: numbers that fill at least 1/3 of the image. This needs to be clarified.

Second, to strengthen the contribution of the paper, additional experiments or analysis could have been performed to understand the features, the architecture, the loss function, or other aspects of the approach. With these missing, the submission is somewhat thin and the contribution lessened.

Smaller issues: the variables used in the plate model in Fig. 1 need to be defined in the caption and supporting text, as well as later when the method is explained, and DistBelief should not be used in the abstract and introduction without a citation or footnote.
Anonymous ac02 07 Feb 2014
The paper presents an application of deep neural networks to the problem of reading multi-digit house numbers from StreetView images. The basic architecture is essentially standard (maxout units, ReLU units, convolution, and several dense layers) but is unusually deep (11 layers). The output of the detector is a softmax predictor for the length of a sequence, as well as softmax predictors for each digit in the sequence. This simple output encoding is sufficient to achieve high recall at very high precision that is competitive with human labelers. The authors conclude that this particular OCR application may be regarded as solved at this stage.

There is relatively little new in terms of algorithms here, but the results are excellent. The paper is clearly written, though the prose could be tightened up a bit. If room can be made, I think a deeper analysis of the method's success would be useful. For example, is it possible that the "deeper" networks are fitting the training set better simply as a result of having more model parameters? Or is the depth truly the deciding factor?

Most surprising to me is the fact that there is no explicit need to model the label structure beyond the obvious: a detector for sequence length and softmax outputs for digit classes, with a small tweak to choose the most likely length of the sequence at test time. This follows along with recent work on detection systems suggesting that sophisticated regressors are able to do something similar for object classes, so I think this otherwise simple component is a useful datapoint for that conversation.

Pros:
- Simple off-the-shelf application with excellent performance; perhaps high enough to count this task as "solved".
- A useful reference point for work on predicting structured outputs like character sequences.

Cons:
- Essentially a boilerplate neural network.
- A bit more detailed analysis of the results would be useful.
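The "small tweak" at test time that the reviewer mentions can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions (only length classes 0 through 5, random toy logits), not the paper's implementation: because the length term and the per-position digit terms are independent given the features, the jointly most likely transcription is obtained by taking the argmax of each softmax independently, and thresholding the resulting joint probability gives an operating point at which low-confidence cases can be deferred, e.g. to human operators.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def transcribe(length_logits, digit_logits, threshold=0.0):
    """Greedy inference for the factorized sequence output.
    Since P(S|x) = P(L|x) * prod_i P(s_i|x), each argmax can be taken
    independently. Returns (digits, confidence); digits is None when
    the confidence falls below `threshold` (abstain)."""
    p_len = softmax(length_logits)
    L = int(np.argmax(p_len))
    conf = p_len[L]
    digits = []
    for i in range(L):
        p = softmax(digit_logits[i])
        digits.append(int(np.argmax(p)))
        conf *= p[digits[-1]]
    if conf < threshold:
        return None, conf
    return digits, conf

# Toy logits: 6 length classes (0..5) and 5 per-position digit heads.
rng = np.random.default_rng(1)
length_logits = rng.normal(size=6)
digit_logits = rng.normal(size=(5, 10))
digits, conf = transcribe(length_logits, digit_logits)
```

Raising `threshold` trades coverage for precision, which is how a precision target comparable to human transcription could be enforced.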
Anonymous dee9 08 Feb 2014
The authors propose an integrated approach to sequence recognition in the case of a limited number of characters (house numbers with 5 characters at most). Avoiding separate localization, segmentation, and recognition is novel in this context. This is the right approach and the results are good, but the model is not well explained (see below).

Pros:
- Integrated sequence recognition rather than the traditional localization/segmentation/recognition steps.
- New record on single-digit recognition.
- Accuracy high enough for real-world deployment (although the ability to pay human operators to correct the remaining errors in the 98% regime means that real-world deployment is possible even at much lower accuracy, depending on how much money the company is willing to spend).

Cons:
- "and has special code for handling much of the mechanics such as proposing candidate object regions" and "we take a similar approach, but with less post-processing": this is misleading, and it is debatable that there is less post-processing, because the authors' method relies on a pre-detection step that gives relatively tight bounding boxes around the number, as indicated here: "Beyond cropping the image close to the street number". One could argue that there is as much post-processing in the other cited work once the detection is performed.
- What happens at the top of the network is not clear at all to me. Are they using an HMM or not? Does the 64x64 input image yield a grid of probabilities? If so, what is the size of that grid? Or is the network directly predicting N outputs and, based on the value of L, using only the first L values out of N?
- "a softmax classifier that is attached to intermediate features": are the digit softmaxes located on the intermediate hidden layers? Which ones?
- "On this task, due to the larger amount of training data, we did not need to train with dropout": it would have been nice to see the numbers with and without dropout instead of just relying on this claim.
Ian Goodfellow 18 Feb 2014
In order to avoid the arXiv approval delay, we have posted a revised copy of our paper directly here: https://drive.google.com/file/d/0B64011x02sIkd3RwSDRpTXlKSzQ/edit?usp=sharing

"Anonymous 88a9", "Anonymous ac02", and "Anonymous dee9" have raised concerns about the clarity of the description of the architecture and inference process. We have updated the paper to add Appendix A, a worked example showing exactly how the inference process works step by step. We hope this resolves the ambiguities in the main text.

"Anonymous ac02" suggests using larger shallow networks to provide another datapoint on the importance of a deep network. This is a great idea, and we have started experiments with models of varying sizes using 3 hidden layers to see how much they improve based on the number of parameters alone. We were unable to finish these experiments during the recommended rebuttal period, but will post another update when they are complete.

"Anonymous dee9" suggested that we were doing heavy pre-processing of the image because we were doing tight crops around the street number. We have five responses to this criticism:

1) We mistakenly omitted some details about the limitations of our preprocessing of the larger internal dataset. These details make it clear that the preprocessing is not especially useful. Specifically, the centroid is not well known, the scale is not known at all, and the crop size we use for that dataset is 128x128, compared to the 54x54 we used on the public SVHN. Our results on this dataset therefore indicate that the network is able to localize the house number itself as well as the digits within the number. They also demonstrate that the network can handle wide variations in the scale of the house number.

2) It is not computationally practical to run a convolutional network on a completely uncropped Street View panorama, so some degree of preprocessing is inevitable.

3) On the public SVHN, all pre-existing methods make use of more ground-truth localization information than our method does. All previous authors who have published on this dataset use the version that is tightly cropped per digit. We use the much looser crop that identifies only the region in which the multi-digit number occurs, yet we still improve upon the state of the art for single-digit recognition. This demonstrates that the system is able to localize individual digits on this dataset.

4) Other systems for transcription represent the concept of the sequence externally to the neural net, i.e., the sequence parsing is handled by post-processing techniques such as HMM inference, non-maxima suppression, etc. Only our approach trains a neural net with an internal concept of a sequence.

5) Our system uses *no post-processing at all*, only pre-processing, while other systems use both pre-processing and post-processing.

"Anonymous dee9" suggests that we should have given the accuracy numbers with and without dropout on our internal dataset to support the claim that "we did not need to train with dropout". We agree that this claim was overreaching; we have changed the paper to say that we were not seeing large overfitting and decided not to use dropout, primarily in order to speed up training, which is time-consuming for the larger models.
Ian Goodfellow 05 Mar 2014
We ran some more experiments to determine the effect of increasing the number of parameters in shallow models. Specifically, we used a model with 3 hidden layers: two convolutional layers followed by a fully connected layer. If we increase the size of the fully connected layer, the model rapidly starts to overfit. If we increase the size of the convolutional layers, accuracy increases, but with diminishing returns. We have launched a second round of experiments with even larger convolutional layers, hoping to find the point at which they start to overfit so that we can include this result in the final version of the paper. So far we have not been able to match the accuracy of our 11-layer model with this approach.
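The overfitting pattern described above is consistent with how parameter counts scale: a convolutional layer's parameters grow only with kernel size and channel counts, while a fully connected layer attached to a flattened feature map already dwarfs a comparable convolutional layer, so enlarging it adds capacity (and overfitting risk) much faster. A rough back-of-the-envelope, with hypothetical layer sizes that are not those of the paper:

```python
def conv_params(in_ch, out_ch, k):
    """Weights + biases for a k x k convolution."""
    return in_ch * out_ch * k * k + out_ch

def fc_params(n_in, n_out):
    """Weights + biases for a fully connected layer."""
    return n_in * n_out + n_out

# Hypothetical sizes for illustration only.
small_conv = conv_params(64, 128, 5)        # 5x5 conv, 64 -> 128 channels
big_conv   = conv_params(64, 256, 5)        # doubling channels ~doubles params
small_fc   = fc_params(8 * 8 * 128, 1024)   # FC on a flattened 8x8x128 map
big_fc     = fc_params(8 * 8 * 128, 4096)   # enlarging the FC layer
```

Even the "small" fully connected layer here has roughly forty times the parameters of the convolutional layer, which matches the observation that enlarging the fully connected layer is the fastest route to overfitting.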