While the majority of today's object class models provide only 2D bounding boxes, far richer output hypotheses are desirable, including viewpoint, fine-grained category, and 3D geometry estimates. However, models trained to provide richer output require larger amounts of training data, ideally with good coverage of the relevant aspects such as viewpoint and fine-grained categories. In this paper, we address this issue from the perspective of transfer learning, and design an object class model that explicitly leverages correlations between visual features. Specifically, our model represents prior distributions over permissible multi-view detectors in a parametric way -- the priors are learned once from training data of a source object class, and can later be used to facilitate the learning of a detector for a target class. As we show in our experiments, this transfer is not only beneficial for detectors based on basic-level category representations, but also enables the robust learning of detectors that represent classes at finer levels of granularity, where training data is typically even scarcer and more unbalanced. As a result, we report substantially improved performance in simultaneous 2D object localization and viewpoint estimation on a recent dataset of challenging street scenes.
6 Comments

Anonymous 42a4 03 Feb 2014
This work aims to perform simultaneous detection and viewpoint estimation when few training examples are available for many classes or viewpoints, but many examples are available for a smaller or related set of source instances. To accomplish this, two new types of weight regularization are described, for use with deformable part models. Both regularizers form a quadratic penalty on the weights by means of a covariance matrix constructed from models trained on the sources. The first, SVM-MV, constructs the covariance by averaging across all pairs of HoG cells that overlap between adjacent object viewpoints; overlaps are found by projecting to a (provided or guessed) 3D object model. The second, SVM-Sigma, constructs an explicit all-to-all covariance of the weights by averaging across different model instances. SVM-MV extends the ideas of Gao et al. to work across views, while SVM-Sigma breaks from these careful constructions and uses all pairwise interactions that arise. The authors evaluate their effectiveness on two datasets, 3D-Objects and KITTI, concluding that such regularization enables good performance on these tasks, with SVM-Sigma generally outperforming the other methods.

Unfortunately, the exposition is rather dry and can be hard to follow. Illustrative examples and diagrams would help a lot here, particularly sketches of the projections and overlaps for SVM-MV. I think it would also help to instantiate the model early on, using one of the datasets from the experiments (i.e., in Sec. 3, linking \sim_n, w^s, w^t to concrete instances).

I'm also a bit confused about what exactly comprised the source vs. target data. For Sec. 4.1 (3D-Objects), how was the data divided between sources (used for the priors) and the targets? Was there any overlap between these, either by datapoint or by object instance? For Sec. 4.2, p. 8 para. 1 seems to say that for KITTI, the priors were trained from source data drawn from either 3D-Objects or KITTI (i.e., there are two different cases). In the former case, did viewpoints need to be mapped to transfer between the datasets, and which object classes were used? In the latter case, did the prior data overlap the target data?

Pros:
- Presents new regularizers that exploit structural relations, learned from the data in cases where there are dense subsets or aggregates
- Experiments are detailed

Cons:
- Dry and hard to grasp
- Could use more illustrations of the method and problem setup
- Results presentation confusing at times

Minor comments:
- Fig. 1 right, TD2ND: the labels along the rows (y-axis) appear swapped; the text indicates the block with ones should connect with-data to no-data.
- I found the italics on occurrences of "target" and "source" somewhat distracting; they tended to pull my eyes away from the parts of paragraphs I wanted to concentrate on.
- p. 5, last para.: says there are 9 object classes, but the webpage for this dataset says there are 8?
- p. 6 (Experiments sec.): k \in {1, 5, 10, all} -- it would be nice if this said how many "all" is as well.
- Fig. 2: figures could use titles.
- Fig. 3: I'm confused about which views were included/excluded for each plot -- are the included views progressive subsets? It looks like the differences are more than that. Maybe a key with an on/off bitmap listing each view would help.
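For concreteness, my reading of the common form of both regularizers (a sketch in my own notation, not the authors' exact formulation) is a standard linear SVM in which the usual \|w\|^2 penalty is replaced by a quadratic form whose mean \mu and covariance \Sigma are estimated from the source models:

    \min_w \; \tfrac{1}{2}\,(w - \mu)^\top \Sigma^{-1} (w - \mu) \;+\; C \sum_i \max\bigl(0,\; 1 - y_i\, w^\top \phi(x_i)\bigr)

Setting \mu = 0 and \Sigma = I recovers the standard SVM; as far as I can tell, SVM-MV and SVM-Sigma differ only in how \Sigma is constructed.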
Bojan Pepikj 18 Feb 2014
We would like to thank the reviewer for the valuable comments. We have included all suggestions made by the reviewer; they can be found in the newest paper version on arXiv. Specifically, we added model visualizations in the supplemental material (Fig. 7), improved Fig. 3, and introduced bars visually explaining the training setups. Additionally, we incorporated the technical suggestions proposed by the reviewer.

Answers to specific questions:

In Sec. 4.1 (3D Object Classes dataset), each class consists of 10 instances, depicted from 8 different viewpoints, 3 different scales and heights. We use 5 instances for training and 5 for testing, resulting in 360 images per train and test set. During source model training, we sample 15 images per viewpoint from the training set. During target model training, we sample K = {1, 5, 10} images per viewpoint. This is done for each class separately. We did not strictly enforce the training data to be non-overlapping between the source and target models.

In Sec. 4.2 (KITTI), when learning the priors from 3D Object Classes, we used only the car class. As KITTI and 3D Object Classes are different datasets, there was no data overlap between the training sets for the source and target models. The viewpoint annotations had to be mapped between the two datasets, which is rather trivial to do. When using KITTI data only, the source and the target training data are at different levels of the class hierarchy (e.g., target data at the car-type level, source data at the car-class level), so the source and target data may overlap.

Regarding the 3D Object Classes dataset, it actually comes with 10 classes; all previous work excludes the monitor class from the experiments, while the head class is included in more recent work, hence the 9 classes.

In the experiments section, k = all means that all training data for the subordinate category has been used. The amount of training data varies across the subordinate classes; Figures 4, 5 and 6 provide the training data distributions per class.
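For concreteness, the per-viewpoint sampling above amounts to roughly the following (a minimal sketch; the names and the fixed seed are illustrative, not our actual code):

    import random

    def sample_per_viewpoint(images_by_viewpoint, k, seed=0):
        # images_by_viewpoint: dict mapping a viewpoint id to its list of training images
        # k: images to draw per viewpoint (15 for source models; 1, 5 or 10 for targets)
        rng = random.Random(seed)
        return {vp: rng.sample(imgs, min(k, len(imgs)))
                for vp, imgs in images_by_viewpoint.items()}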
Anonymous 8fcf 04 Feb 2014
Summary

This paper proposes to improve multi-view object detection and viewpoint estimation, particularly in the case where some viewpoints are undersampled, by introducing a multi-view prior into the standard SVM training framework used to learn many HOG-based object detectors. The work extends Gao et al. [16], which considers a specific case of the more general form of priors considered in this paper. (A citation is missing for a very relevant paper by Hariharan et al. [a], which also estimates covariances between HOG cells.) The paper presents an extensive empirical study of two newly proposed priors (SVM-Sigma and SVM-MV) compared with the SVM-SV prior of Gao et al. [16] and a standard SVM with no multi-view prior. The experimental results are dense and at times hard to parse, but SVM-Sigma shows a clear advantage on several benchmarks.

Pros
+ The topic is quite interesting and relevant to researchers working on object detection and coarse viewpoint estimation.
+ The outline of the approach is clear.
+ The experimental results look quite good.

Cons / Questions for author feedback
- While the outline of the approach is clear, some of the details are hard to follow. A main confusion throughout the paper is what data is used to estimate the prior. For example, when the target is KITTI and the prior comes from 3D-Objects, are only the car objects used from the 3D-Objects dataset? (I would assume so, but this was unclear to me.)
- For the MV prior: "...we first establish pairs of cells in the target model which satisfy a certain relation type ~n." It's clear how these pairs would be established when CAD data is used. How are these correspondences established in the case of KITTI data?
- Sec. 3.2: I think it would be good to clarify what is meant by "bootstrapping" (i.e., training multiple models on bootstrapped samples) to avoid confusion with the ill-named hard-negative "bootstrapping" used for training SVMs.
- Is setting a single value for C reasonable? The regularization (via the prior) changes quite a bit, and it's not clear that keeping C constant is reasonable. That said, searching over C can only improve the already good results.
- In Table 2, the 'base' SVM-Sigma results are the same for the 3D-Objects and KITTI priors---perhaps a bug in the table?

[a] Bharath Hariharan, Jitendra Malik, and Deva Ramanan. Discriminative decorrelation for clustering and classification. In ECCV 2012.
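To illustrate my understanding of the sparse MV-style construction (my own sketch under assumptions, not the paper's code; `cell_pairs` stands in for the cross-view correspondences, however they are obtained):

    import numpy as np

    def mv_style_covariance(source_weights, cell_pairs, eps=1e-6):
        # source_weights: (N, D) array stacking the weight vectors of N source models
        # cell_pairs: (i, j) index pairs whose weights correspond across views
        W = np.asarray(source_weights, dtype=float)
        sigma = np.zeros((W.shape[1], W.shape[1]))
        np.fill_diagonal(sigma, W.var(axis=0) + eps)  # keep the matrix invertible
        centered = W - W.mean(axis=0)
        for i, j in cell_pairs:
            c = float(np.mean(centered[:, i] * centered[:, j]))
            sigma[i, j] = sigma[j, i] = c
        return sigma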
Bojan Pepikj 18 Feb 2014
We would like to thank the reviewer for the valuable comments. We uploaded a new paper version and included the prior work [a]. Regarding the questions:

- Indeed, we used the car class from the 3D Object Classes dataset to learn the 3D object priors that are later used to train target class detectors on KITTI.
- On both KITTI and 3D Object Classes, we use CAD data to establish the pairs of corresponding cells across views that are used in the case of SVM-MV.
- "Bootstrapping" refers to the method from classical statistics where the data is re-sampled multiple times in order to estimate the underlying distribution. Specifically, we train N source models by sub-sampling K positive training examples from the training set for each source model.
- Regarding the C parameter, for the baseline (SVM) we followed the suggestions of [12] and [13] and used the value C = 0.002. For the proposed method, we ran experiments with varying amounts of data per viewpoint and different values of C, and we observed two things: first, C = 0.002 is the optimal value in most cases, or very close to the optimum; second, the performance is stable in the range C*[0.1, 10]. We therefore chose the same value for all models. We are happy to include those results in the supplemental material. Ideally one would do k-fold cross-validation over all tunable parameters jointly, but due to the scarce training data, large search space, costly training, and extensive experiments, proper k-fold cross-validation is prohibitively time-consuming.
- Table 2 is correct. The models are trained differently, but they result in very similar performance.
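In pseudocode terms, the bootstrapping procedure is roughly the following (a simplified sketch, not our actual pipeline; `train_model` stands in for training one source detector, and we sub-sample without replacement as described above):

    import numpy as np

    def bootstrap_weight_statistics(train_model, positives, k, n_models, seed=0):
        # train_model: callable mapping a list of positive examples to a weight vector
        # Sub-sample k positives, train one model per subsample, and estimate
        # the mean and covariance of the learned weights across the N models.
        rng = np.random.default_rng(seed)
        weights = []
        for _ in range(n_models):
            idx = rng.choice(len(positives), size=k, replace=False)
            weights.append(train_model([positives[i] for i in idx]))
        W = np.vstack(weights)
        return W.mean(axis=0), np.cov(W, rowvar=False)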
Anonymous 138a 07 Feb 2014
The paper presents a method to learn a quadratic regularizer that improves the performance of multi-view object detectors when very little training data is available for each object or view. The regularizer is computed using a sparse correlation matrix to identify similarity amongst feature weights in the detectors for different views of the same object (in some cases relying on a 3D model to help establish correspondence between features). The similarity matrix is then used to define a Laplace regularization term for a standard SVM, thus requiring that a new multi-view detector trained with the regularizer share similar structure across views. Results are presented on several detection tasks using the KITTI (driving/urban) dataset and the 3D Object Classes dataset. It is shown that the main advantage is that the method can learn to detect views of novel target objects even when some views of these objects have few or no training examples.

Overall, the paper is clearly written and the experiments are extensive. The authors broke out all of the different cases (number of examples per view, and which views are missing) to make their point. In the regime where the method is intended to help (where very few examples are available for a particular view of a target object), the regularizer does help for the multi-view DPM model considered in the paper.

The main caveat here is that the method seems to assume the use of separate detectors for every subclass and view (and thus each detector requires its own dataset). That really exacerbates the problem of too little data where other methods might not have an issue. This is a nice trick, but since the one-detector-per-object-per-view approach has multiple scalability issues, the proposed solution appears to suffer from those barriers as well. If this approach to multi-view detection is not workable in the near future, what is the high-level idea that we should take from this work? I do not see it.

Also, the "dense" sparsity pattern appears to be by far the best performer. This is somewhat interesting by itself, but also detracts from the "sparse" proposals in the paper (and, of course, the dense approach is not very scalable). Some candid discussion of the consequences of this result might clarify which parts of the system are important.

Pros: A relatively simple idea to exploit knowledge of regularity across views in multi-view detectors. Its primary value is in cases where there are few or no examples for a particular view, in which case it does appear to help over baseline approaches.

Cons: Relies on a fair amount of prior geometric knowledge. It is unclear how to apply this to cases where the number of features is much larger than in DPM-style models (e.g., deep neural nets) or where deeper, difficult-to-interpret layers prevent geometric information from being leveraged.

Other: The SVM regularization penalty might need to be tuned for complete fairness of comparisons. Since the regularizer is fundamental to performance and the number of training examples is varied, the penalty setting could alter the testing numbers. (It is possible that the experiments/implementation are set up in such a way that this does not matter much; a note to this effect, if true, would be helpful.)
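For readers unfamiliar with the construction, the Laplace regularization term mentioned above can be sketched as follows (my own illustration from the description, not the paper's exact formulation): given a symmetric similarity matrix S over weight dimensions, the graph-Laplacian penalty pulls corresponding weights toward each other.

    import numpy as np

    def laplacian_penalty(S, w):
        # S: (D, D) symmetric similarity matrix between weight dimensions
        # L = diag(S @ 1) - S is the graph Laplacian, which yields the penalty
        #   w^T L w = 0.5 * sum_ij S_ij * (w_i - w_j)^2,
        # i.e. similar structure is encouraged wherever S_ij is large.
        L = np.diag(S.sum(axis=1)) - S
        return float(w @ L @ w)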
Bojan Pepikj 18 Feb 2014
We would like to thank reviewer 138a for the valuable comments.

The reviewer expresses concern about one of the main assumptions in the paper: the use of a separate detector for each subclass and view. Regarding this, we would like to point out two things. First, a fair number of state-of-the-art methods for viewpoint estimation and object localization belong to this category; see, for example, references [25, 28, 34, 40], as well as Tab. 1. They all assume a dedicated detector per viewpoint. Second, our results confirm that having a detector per subclass and per view (more specific detectors) indeed leads to better performance (Tab. 3, results for the baseline SVM (KITTI)): the car-type SVM detector is consistently better than the base SVM detector. Thus the usage of more specific detectors is justified. Naturally, as we go more specific, the data distribution across viewpoints becomes scarcer; to compensate for that, we successfully employ stronger and structured regularization (SVM-Sigma). This is confirmed in Tab. 3: SVM-Sigma is better than SVM in all comparable settings.

Another concern is the usage of prior geometric knowledge. In this work we model the geometric structure for a reason: we assume and believe that there is an underlying cause (object geometry) which drives the appearance variation and changes of the object. Rather than letting the method discover this structure, we explicitly model it (to a certain degree). How one would apply this to deep NNs is indeed an open and interesting question.

Regarding the C parameter, for the baseline (SVM) we followed the suggestions of [12] and [13] and used the value C = 0.002. For the proposed method, we ran experiments with varying amounts of data per viewpoint and different values of C, and we observed two things: first, C = 0.002 is the optimal value in most cases, or very close to the optimum; second, the performance is stable in the range C*[0.1, 10]. We therefore chose the same value for all models. We are happy to include those results in the supplemental material. Ideally one would do k-fold cross-validation over all tunable parameters jointly, but due to the scarce training data, large search space, costly training, and extensive experiments, proper k-fold cross-validation is prohibitively time-consuming.
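Schematically, the C sweep amounts to the following (a simplified sketch, not our experiment code; `train_and_score` is a hypothetical stand-in for the full training/evaluation pipeline):

    def sweep_C(train_and_score, base_C=0.002, factors=(0.1, 0.3, 1.0, 3.0, 10.0)):
        # train_and_score: callable mapping a C value to a validation score
        # Covers the stability range C*[0.1, 10] reported above.
        return {base_C * f: train_and_score(base_C * f) for f in factors}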