Several interesting generative learning algorithms involve a complex probability distribution over many random variables, with intractable normalization constants or latent variable marginalization. Some of them do not even have an analytic expression for the unnormalized probability function, nor a tractable approximation of it. This makes it difficult to estimate the quality of these models once they have been trained, or to monitor their quality (e.g. for early stopping) while training. A previously proposed method is based on constructing a non-parametric density estimator of the model's probability function from samples generated by the model. We revisit this idea, propose a more efficient estimator, and prove that it provides a lower bound on the true test log-likelihood and becomes unbiased as the number of generated samples goes to infinity, although one that incorporates the effect of poor mixing (making the estimated likelihood worse, i.e., more conservative).
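As context for the reviews below, here is a minimal sketch of the conservative sampling-based log-likelihood (CSL) estimator discussed throughout this thread, pieced together from the discussion: average the model's tractable conditional P(x | h) over latent samples h collected from the model's own Markov chain, then take the log; by Jensen's inequality this is, in expectation, a lower bound on the true test log-likelihood. Names and signatures are illustrative, not the authors' code.

```python
import numpy as np
from scipy.special import logsumexp

def csl_log_likelihood(test_x, h_samples, log_p_x_given_h):
    """CSL estimate: mean over test points of log (1/S) * sum_s P(x | h_s).

    test_x          -- iterable of test examples
    h_samples       -- latent samples h_1..h_S from the model's Markov chain
    log_p_x_given_h -- callable(x, h) returning log P(x | h) (assumed tractable)
    """
    S = len(h_samples)
    per_example = []
    for x in test_x:
        log_terms = np.array([log_p_x_given_h(x, h) for h in h_samples])
        # log of the average conditional likelihood; in expectation a lower
        # bound on log P(x) by Jensen's inequality
        per_example.append(logsumexp(log_terms) - np.log(S))
    return float(np.mean(per_example))
```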
Bounding the Test Log-Likelihood of Generative Models
Yoshua Bengio — Conference Track submission (requested endorsement for oral presentation), revealed 18 Dec 2013. Reviews by Anonymous 16f7, 0661, and 60ea requested 14 Jan 2014, due 04 Feb 2014.
Anonymous 60ea 07 Feb 2014
This paper proposes an estimator for the log-likelihood of intractable models with latent variables. The approach is simple in that it has no free parameters, doesn't require an explicit likelihood, and only needs samples from the model. The approach is most useful for model comparison, since the estimate is conservative rather than optimistic.

I enjoyed reading this paper. The proposed method is quite novel and elegant and has the potential to be a useful tool for model comparison. One issue is that the estimator seems to require a large number of samples in order to converge, and this is potentially exacerbated by increasing model size. As stated in the paper, this likely has to do with the convergence of MCMC within the model itself. One empirical test of this would be to compare the efficiency of the estimator with exact samples vs. MCMC samples in, e.g., a small RBM.

The biased CSL is also novel, but seems to be even more optimistic than AIS. The argument of the paper is based on the idea that we would prefer conservative estimates to optimistic estimates for model comparison. When would the authors expect the biased CSL method to be useful in practice? How many steps would be required before biased CSL matches AIS?

Minor thoughts and some found typos below.

1. In Table 1 the AIS and CSL estimates are vastly different. One is optimistic and one is conservative - which one is closer to the truth? Is there a reason that GSN is so much better? Obviously the truth is impossible to determine, but it is clear that more samples are needed before the estimate converges.

2. The RBM used in Table 2 is quite small, using only 5 hidden units. 20 hidden units is slightly larger but still tractable. It would be good to see how the efficiency of the estimator is affected by model size.

3. An RBM trained with PCD is thought to have a much better likelihood than an RBM trained with CD. Is this reflected in CSL estimates?

Typos:
- formulat -> formulate (section 1)
- collecte -> collect (section 2)
- in -> \in (Monte-Carlo estimator in section 4)
- 30 steps -> 300 steps (or the legend in Figure 1 has a typo, section 6)
- mode -> model (section 7)
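The empirical test suggested above (CSL from exact samples vs. CSL from Gibbs samples in a small RBM) could be set up roughly as follows. This is a hypothetical sketch, not from the paper: it assumes a binary RBM with few enough hidden units (e.g. 5) that the latent marginal P(h) can be enumerated exactly.

```python
import itertools
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def softplus(a):
    return np.logaddexp(0.0, a)

def log_p_x_given_h(x, h, W, b):
    # Binary RBM conditional: factorial Bernoulli with logits b + W h.
    a = b + W @ h
    return float(np.sum(x * a - softplus(a)))

def exact_h_samples(W, b, c, n_samples):
    # Enumerate all hidden configurations (feasible for ~5 hidden units) and
    # sample from the exact marginal P(h) ∝ exp(h·c) Π_i (1 + exp(b_i + (Wh)_i)).
    K = len(c)
    H = np.array(list(itertools.product([0, 1], repeat=K)), dtype=float)
    log_w = H @ c + softplus(b[None, :] + H @ W.T).sum(axis=1)
    p = np.exp(log_w - logsumexp(log_w))
    return H[rng.choice(len(H), size=n_samples, p=p)]

def gibbs_h_samples(W, b, c, n_samples, burn_in=1000):
    # Single Gibbs chain alternating h | v and v | h; keep the visited h's.
    D, K = W.shape
    v = (rng.random(D) < 0.5).astype(float)
    hs = []
    for t in range(burn_in + n_samples):
        h = (rng.random(K) < 1.0 / (1.0 + np.exp(-(c + W.T @ v)))).astype(float)
        v = (rng.random(D) < 1.0 / (1.0 + np.exp(-(b + W @ h)))).astype(float)
        if t >= burn_in:
            hs.append(h)
    return np.array(hs)

def csl(test_x, hs, W, b):
    return np.mean([
        logsumexp([log_p_x_given_h(x, h, W, b) for h in hs]) - np.log(len(hs))
        for x in test_x
    ])

# Comparing csl(test_x, exact_h_samples(...), W, b) against
# csl(test_x, gibbs_h_samples(...), W, b) isolates the effect of MCMC mixing
# from the estimator itself.
```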
KyungHyun Cho 09 Feb 2014
Dear Reviewer (60ea),

We thank you for the thorough and insightful comments. Allow us to respond to some of them below.

"When would the authors expect the biased CSL method to be useful in practice?"

As shown in Fig. 1, the biased CSL reflects the ordering of the performances of the different models correctly, although optimistically. This is true even in the case where only a single MCMC step was taken from each test sample. As stated in Sec. 6, we believe the biased CSL will be useful for comparing models when there is no alternative way to compute or approximate the log-likelihood (as with GSN).

"Which one is closer to the truth?"

As you have correctly mentioned, it is not possible to answer this exactly for the models in Table 1. However, the results in Table 2 suggest that, with enough samples, both the CSL and the AIS estimate closely approach the true log-likelihood.

"Is there a reason that GSN is so much better?"

One important factor affecting the CSL estimate is the mixing rate of the MCMC chain. As has been shown earlier, the MCMC sampling by GSN mixes among different modes very quickly, potentially leading to a more accurate CSL estimate with fewer samples.

"An RBM trained with PCD is thought to have a much better likelihood than an RBM trained with CD. Is this reflected in CSL estimates?"

As the samples generated from an RBM trained with CD are generally bad (most of them tend to come from spurious modes), we believe this will be well reflected in CSL estimates. Also, our preliminary experiments with CD revealed the same tendency (not included in the paper).

Thank you for pointing out the typos in the paper. We will correct them in the next version.
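To make the biased variant discussed above concrete: as this response and the reviewer's question suggest, the latent samples for each test point come from a short Markov chain started at that test point (possibly a single step), which speeds up the estimate but makes it optimistic. A hypothetical sketch, reusing the conventions of the earlier snippet:

```python
import numpy as np
from scipy.special import logsumexp

def biased_csl_log_likelihood(test_x, sample_h_given_x, sample_x_given_h,
                              log_p_x_given_h, n_steps=1):
    """Biased CSL sketch: for each test point, run a short chain
    (h ~ P(h|x), x ~ P(x|h)) initialized at that test point and average
    P(x | h) over the visited h's. Because the h's are informed by the test
    point, the estimate tends to be optimistic rather than conservative."""
    per_example = []
    for x in test_x:
        hs, x_t = [], x
        for _ in range(n_steps):
            h_t = sample_h_given_x(x_t)   # one MCMC step on the latent variables
            hs.append(h_t)
            x_t = sample_x_given_h(h_t)   # one MCMC step on the visible variables
        log_terms = np.array([log_p_x_given_h(x, h) for h in hs])
        per_example.append(logsumexp(log_terms) - np.log(len(hs)))
    return float(np.mean(per_example))
```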
Anonymous 0661 08 Feb 2014
The paper proposes a method for estimating the log probability of any probabilistic model that can generate samples. The method builds a local density estimator around the samples using the model's conditional probability, which is used to evaluate the log probability of a test set. An important selling point of the method is that it evaluates the probabilistic model and the sampling procedure jointly, and that it is asymptotically consistent, in the sense that the estimates converge to the true likelihood as the number of samples approaches infinity.

This work is quite novel, and it places the idea used by Breuleux et al. in a rigorous framework. Empirically, the method works well on small models, although it exhibits very substantial divergence from AIS on larger models, as shown in Table 1.

Perhaps the greatest weakness of this method, which is worth discussing, is that the number of samples needed in order to accurately compute a log probability grows exponentially with the entropy of the distribution. For example, consider the dataset consisting of concatenations of 10 randomly chosen MNIST digits. It is fairly clear that any sample set of size << 10^10 will vastly underestimate the log probability of a perfectly good sample. That is unfortunate, because it means that the method will not work well on complicated models of high-entropy distributions, such as images or speech. This weakness notwithstanding, the method is very adequate for model comparison.

To summarize:
Pros: an interesting method for obtaining conservative underestimates of the log probability; works with nearly any model.
Cons: the method's complexity is exponential in the distribution's entropy; the proposed fix is no longer conservative.
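A back-of-the-envelope version of the entropy argument above (an editorial illustration, not a quotation from the paper): if the model spreads its mass roughly uniformly over M ≈ e^H modes and each sampled h effectively covers one mode, then the mode containing a test point is covered by at least one of S samples with probability

```latex
1 - \left(1 - \tfrac{1}{M}\right)^{S} \;\approx\; \frac{S}{M}
\qquad (S \ll M,\; M \approx e^{H}),
```

and whenever the mode is missed, (1/S) Σ_s P(x | h_s) ≈ 0, so the estimate collapses. For concatenations of 10 MNIST digits, M is on the order of 10^10, hence the reviewer's bound on the required sample-set size.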
KyungHyun Cho 09 Feb 2014
Dear Reviewer (0661),

We thank you for the thorough and insightful comments. Allow us to respond to some of them below.

"It places the idea used by Breuleux et al. in a rigorous framework."

We agree that the proposed method is closely related to that of Breuleux et al. However, we claim that it improves on their method in two ways: (1) the CSL is more efficient because each sample of the latent variables h can cover many x's, and (2) the CSL does not have any tuning parameter such as a bandwidth.

"It exhibits very substantial divergence from AIS on larger models, as shown in Table 1."

As you have pointed out earlier in your review, the proposed CSL estimator evaluates not only the model itself but also the sampling procedure. When mixing by MCMC sampling is fast (as in GSN), the CSL estimate tends to converge quickly, while in the opposite case (as in RBM and DBN) the convergence is slow.

"The number of samples needed in order to accurately compute a log probability grows exponentially with the entropy of the distribution."

This is true, but one thing to note is that we are not aware of any alternative approach that does not suffer from this problem when an approach such as AIS is not applicable. Furthermore, because the CSL estimator uses samples of the latent variables h, which live in a more abstract space than the raw input space, we believe sampling-based methods such as the proposed CSL may suffer much less from the curse of dimensionality. Nonetheless, we agree that this is where future work lies, and we are indeed currently exploring different ways of exploiting the presence of a high-level (deep) representation to make the problem of likelihood estimation much easier and its convergence faster. Much more work is needed before these new ideas can be proven right, and this paper should instead be judged in comparison with past published work.
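Side by side, the two estimators contrasted in this response (the kernel form is shorthand for the non-parametric density estimator built from generated samples that the abstract attributes to earlier work, not a quotation of Breuleux et al.'s exact formula):

```latex
\hat{p}_{\text{kernel}}(x) \;=\; \frac{1}{S}\sum_{s=1}^{S} K_{\sigma}\!\left(x - x_{s}\right)
\qquad\text{vs.}\qquad
\hat{p}_{\text{CSL}}(x) \;=\; \frac{1}{S}\sum_{s=1}^{S} P\!\left(x \mid h_{s}\right),
```

so each latent sample h_s contributes a model-defined conditional over many possible x's, and no bandwidth σ has to be tuned.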
Anonymous 16f7 11 Feb 2014
In this paper, the authors propose a new way to estimate the probability of data under a probabilistic model from which sampling is hard but for which an efficient Markov chain procedure exists. They first present an asymptotically unbiased estimator, then a more efficient biased estimator.

The idea is undeniably interesting. Some of the most widely used generative models satisfy these constraints, and being able to calculate the probability of data under these models is crucial to comparing them. However, the results presented in this paper are underwhelming. For models where AIS was usable (the DBN, the DBM and the RBM), the CSL results wildly differ from the AIS ones. Since the results on the small RBM (Table 2) give a clear advantage to AIS, I am inclined to believe those results more.

Another caveat, unfortunately extremely difficult to avoid, is that the effectiveness of these methods can only be empirically proven on tiny models where mixing problems do not occur. I really do not blame the authors for that, but it really limits the potential impact of the method. The experiment in Figure 1 is also too light to draw conclusions about the effectiveness of biased CSL. Binary MNIST is a very particular dataset, and this experiment does not convince me that the method is actually usable to compare models, especially models of different types.

Conclusion: this paper does not prove the effectiveness of the proposed method. The propositions are not worth publication by themselves.

Other comments:
- CSL is only a lower bound on the true log probability of the data in expectation. This should be made clearer in the paper.
- The pseudo-code should either be commented or removed entirely. As it is, it is only useful to people who have already understood the algorithm.
- Could you give more details on the parameters for AIS? How many chains? How many intermediate temperatures? How does the computation time compare to CSL?
KyungHyun Cho 17 Feb 2014
Dear Reviewer (16f7),

We thank you for the thorough and insightful comments. Allow us to respond to some of them below.

"For models where AIS was usable (the DBN, the DBM and the RBM), the CSL results wildly differ from the AIS ones."

The proposed CSL estimator reflects not only the model's generative capability but also the MCMC sampler used to collect samples of the latent variables. We believe the higher variance (or more slowly converging) CSL estimates, compared to AIS, are due to the inefficiency, or poor mixing, of Gibbs sampling in well-trained RBMs. However, keep in mind that the use case (and motivation) for CSL was to estimate the likelihood of GSNs, for which AIS is not available and where mixing tends to be much better. As we have already stated in our responses to the other reviewers' comments, the proposed CSL estimator seems to be the only one that can be used for models that have no explicit formula for computing the probability (either normalized or unnormalized). However, we agree that improving the variance of CSL is an important objective for future work, and we are studying options.

"Another caveat, unfortunately extremely difficult to avoid, is that the effectiveness of these methods can only be empirically proven on tiny models where mixing problems do not occur."

We agree, and this is a problem in general with any estimator. For instance, even with AIS, any empirical evidence that compares it with the true value can only be given for tiny models. However, we believe this does not and should not discourage the research and development of new estimators, especially considering that some generative models, such as GSNs, have no better alternative at the moment.

"Binary MNIST is a very particular dataset, and this experiment does not convince me that the method is actually usable to compare models, especially models of different types."

We agree with you that more experiments with different types of data may support our claim better. In the next version of the paper, the proposed estimators should be tested on other datasets. We have started experiments on the TFD dataset and will be able to add these results to the paper before the conference.

"The propositions are not worth publication by themselves."

We agree that the math in this paper is very simple. However, please consider the following contributions:
(1) We improve over a previously available likelihood estimator (Breuleux et al.) for models such as GSNs
- by sampling over h rather than over x, making CSL more efficient because each h can cover many x's, in a way that should be better than a poorly informed kernel density (e.g. centered on a sampled x), and
- by not requiring a bandwidth hyper-parameter to be tuned (just for the purpose of estimating the likelihood).
(2) We study experimentally the properties of this estimator and compare it to exact and AIS estimates.
(3) We introduce a biased variant, and experiments find that it orders models well.

"CSL is only a lower bound on the true log probability of the data in expectation. This should be made clearer in the paper."

Indeed. We will make the text clearer in the revision.

"The pseudo code ... is only useful to people who already understood the algorithm."

We do not understand what you mean when writing that it is only useful to those who have already understood the algorithm. We would appreciate it if you could further explain the problem with the presented algorithm. We will then make changes accordingly.

"Could you give more details on the parameters for AIS? How many chains? How many intermediate temperatures? How does the computation time compare to CSL?"
We used 100 independent AIS runs, each with 30,000 intermediate inverse temperatures. These were unevenly distributed between inverse temperature 0 (independent variables) and 1 (true model distribution), such that there were 10k between 0 and 0.5, another 10k between 0.5 and 0.9, and the remaining 10k between 0.9 and 1. Hence, the computation required for AIS is roughly equivalent to computing the CSL estimates with 1.5 million samples, considering that CSL needs to compute the conditional probability for each test sample. For instance, the time taken by the AIS estimator is somewhere between the times taken by CSL with 10k and 50k samples in Table 1.
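For reference, the uneven schedule described above can be reconstructed (hypothetically; the exact spacing within each segment is not stated in the response, so linear spacing is assumed) as:

```python
import numpy as np

# 10k inverse temperatures in [0, 0.5), 10k in [0.5, 0.9), 10k in [0.9, 1.0],
# matching the counts given in the response above.
betas = np.concatenate([
    np.linspace(0.0, 0.5, 10000, endpoint=False),
    np.linspace(0.5, 0.9, 10000, endpoint=False),
    np.linspace(0.9, 1.0, 10000),
])
assert betas.shape == (30000,)
```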