Efficient discriminant viewpoint selection for active Bayesian recognition

Catherine Laporte and Tal Arbel

Introduction

Given a database of labeled objects, the object recognition problem requires associating a label with previously unseen images of these objects. The pose estimation problem consists of determining the pose of the objects seen in these images with respect to the reference frames defined by the objects in the database. These problems are difficult because of ambiguities in appearance that are intrinsic to the particular database. For example, two different objects may have a similar appearance when seen from certain points of view, or an object may look the same in several different positions. To resolve these ambiguities, it is helpful to use multiple observations of the object instead of one. Further improvements can be obtained by applying active vision techniques to select observations so as to optimise the process [2, 3, 4]. Such decision making, however, comes at a computational cost. This work focuses on the development of observation selection criteria which are efficient both in terms of the number of views required to solve the problem and in terms of computational tractability.


Sequential Bayesian recognition

Consider a database of objects $o_i$, $i \in \{1, \ldots, N_o\}$, and a mobile camera facing an unknown object from this set whose class and pose are to be determined. The objects may be positioned in $N_{\theta}$ discrete poses defined according to a global reference frame. The observed scene may be illuminated by one of $N_l$ different light sources, $l$. The camera measurement is parameterised by a feature vector $\mathbf{d}$, which depends on the identity $o$ of the object, its pose $\mathbf{\theta}$, the light source $l$ and the viewing position $\mathbf{v}$. Under uncertainty, this relationship is represented through a conditional probability density function $p(\mathbf{d}\vert o, \mathbf{\theta}, l, \mathbf{v})$, which may be obtained from a physical model or estimated off-line from training data.

Given a measurement $\mathbf{d}$, a known viewing position $\mathbf{v}$ and a prior distribution $P(o, \mathbf{\theta}, l)$ over object class, object pose and light source, the probability of each class-pose-light source tuple is computed using Bayes' rule:

\begin{displaymath}
P(o, \mathbf{\theta}, l\vert\mathbf{d}, \mathbf{v}) \propto p(\mathbf{d}\vert o, \mathbf{\theta}, l, \mathbf{v}) P(o, \mathbf{\theta}, l).
\end{displaymath} (1)

Assuming that subsequent measurements are independent of each other given object class, pose, light source and viewing positions, one obtains the recursive Bayesian update rule
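\begin{displaymath}
P(o, \mathbf{\theta}, l\vert\mathbf{d}_t, \mathbf{v}_t, \ldots, \mathbf{d}_1, \mathbf{v}_1) \propto p(\mathbf{d}_t\vert o, \mathbf{\theta}, l, \mathbf{v}_t) P(o, \mathbf{\theta}, l\vert\mathbf{d}_{t-1}, \mathbf{v}_{t-1}, \ldots, \mathbf{d}_1, \mathbf{v}_1).
\end{displaymath} (3)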

A sequential recognition engine based on this evidence fusion scheme exploits the information provided by the appearance of the object and, more importantly, by its spatial structure [7]. Finally, a probability distribution over object class and pose is obtained by marginalising (3) over light sources:
\begin{displaymath}
P(o, \mathbf{\theta}\vert\mathbf{d}_t, \mathbf{v}_t, \ldots, \mathbf{d}_1, \mathbf{v}_1) = \sum_{l} P(o, \mathbf{\theta}, l\vert\mathbf{d}_t, \mathbf{v}_t, \ldots, \mathbf{d}_1, \mathbf{v}_1).
\end{displaymath}
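
The recursive update and the marginalisation over light sources can be illustrated with a minimal Python sketch. It assumes a discrete hypothesis space stored in a dictionary keyed by (object, pose, light source) tuples; the function names and the toy likelihood values are hypothetical and chosen only for illustration, not taken from the original implementation.

```python
from itertools import product


def bayes_update(prior, likelihood):
    """One recursive step: posterior is proportional to likelihood * prior.

    prior      -- dict mapping (o, theta, l) hypotheses to probabilities
    likelihood -- dict mapping the same keys to p(d_t | o, theta, l, v_t)
    """
    unnormalised = {h: likelihood[h] * p for h, p in prior.items()}
    total = sum(unnormalised.values())
    if total == 0.0:
        raise ValueError("measurement has zero likelihood under every hypothesis")
    return {h: u / total for h, u in unnormalised.items()}


def marginalise_light(posterior):
    """Sum out the light source l to obtain P(o, theta | measurements)."""
    marginal = {}
    for (o, theta, _light), p in posterior.items():
        marginal[(o, theta)] = marginal.get((o, theta), 0.0) + p
    return marginal


# Toy problem (assumed sizes): 2 objects, 2 poses, 2 light sources, uniform prior.
hypotheses = list(product(range(2), range(2), range(2)))
belief = {h: 1.0 / len(hypotheses) for h in hypotheses}


def toy_likelihood(favoured):
    """Invented likelihood that favours one (object, pose) pair under any light."""
    return {h: (0.8 if h[:2] == favoured else 0.1) for h in hypotheses}


# Fuse two measurements that both favour object 0 in pose 1.
for favoured in [(0, 1), (0, 1)]:
    belief = bayes_update(belief, toy_likelihood(favoured))

pose_belief = marginalise_light(belief)
best = max(pose_belief, key=pose_belief.get)
```

In this toy run the belief concentrates on the favoured object and pose after two measurements, which is the behaviour the evidence fusion scheme relies on.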


Active viewpoint selection

The object recognition and pose estimation problem is difficult, mainly because for certain choices of viewing positions $\mathbf{v}$, the observed data may be well explained by more than one hypothesis. This difficulty can be alleviated by choosing a shift in viewpoint such that competing hypotheses will appear as different as possible in order to facilitate distinction. Considering the recognition task as a series of pairwise discrimination subtasks, and given a measure of dissimilarity $\Delta(p\vert\vert q)$ between two probability density functions $p$ and $q$, the following general form is proposed as a criterion for the selection of a viewpoint $\mathbf{v}$ at step $t+1$:
\begin{displaymath}
\mathbf{v}^*_{t+1} = \underset{\mathbf{v}_{t+1}}{\mbox{argmax}} \sum_{i = 1}^{N_o}\sum_{j = 1}^{N_\theta}\sum_{k = i}^{N_o}\sum_{m = m_{ijk}}^{N_\theta} P(o_i, \mathbf{\theta}_j\vert\mathfrak{D}_t) P(o_k, \mathbf{\theta}_m\vert\mathfrak{D}_t) \, \Delta(p(\mathbf{d}_{t+1}\vert o_i, \mathbf{\theta}_j, \mathbf{v}_{t+1}, \mathfrak{D}_t) \vert\vert p(\mathbf{d}_{t+1}\vert o_k, \mathbf{\theta}_m, \mathbf{v}_{t+1}, \mathfrak{D}_t)),
\end{displaymath} (4)

where $\mathfrak{D}_t \equiv
\{\mathbf{d}_t, \mathbf{v}_t, \ldots, \mathbf{d}_1, \mathbf{v}_1\}$ and

\begin{displaymath}
m_{ijk} = \begin{cases}
j+1 & k = i, \\
1 & k > i.
\end{cases}\end{displaymath}
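
The role of $m_{ijk}$ can be checked with a short Python sketch. It reads $m_{ijk}$ as the lower limit of the innermost sum over poses, which is an interpretation of the definition above; the function name is hypothetical. Under that reading, the four nested sums visit every unordered pair of distinct (object, pose) hypotheses exactly once.

```python
def hypothesis_pairs(n_objects, n_poses):
    """Yield ((i, j), (k, m)) index pairs following the summation limits of (4).

    Indices are 1-based to match the text: i, k index objects and j, m poses.
    """
    for i in range(1, n_objects + 1):
        for j in range(1, n_poses + 1):
            for k in range(i, n_objects + 1):
                # m_ijk from the text: skip pairs already counted for the same object.
                m_start = j + 1 if k == i else 1
                for m in range(m_start, n_poses + 1):
                    yield (i, j), (k, m)


pairs = list(hypothesis_pairs(n_objects=3, n_poses=4))
n_hypotheses = 3 * 4
expected_count = n_hypotheses * (n_hypotheses - 1) // 2
```

For 3 objects and 4 poses there are 12 hypotheses, so the enumeration should contain 12 * 11 / 2 pairs with no hypothesis compared against itself.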

Each term of the sum achieves a pairwise comparison of two hypotheses. Since the most probable hypotheses account for most of the ambiguity, more effort is made to disambiguate highly probable hypotheses. The general form of the criterion makes no assumption about the dissimilarity measure $\Delta$. Good choices for this measure include Mahalanobis distances, when appearance models are well represented by their mean and variance, or the information theoretic Jeffrey or Kullback-Leibler divergences (and fast approximations thereof) for more general cases [5].
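
As one concrete instance of such a measure, the sketch below computes the Kullback-Leibler divergence and its symmetrised (Jeffrey) form in closed form for two Gaussian appearance models. It assumes diagonal covariance matrices, a simplification made here for brevity; the function names are hypothetical.

```python
import math


def kl_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for two Gaussians with diagonal covariances.

    Means and variances are given as equal-length lists of floats.
    """
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mu_p, var_p, mu_q, var_q)
    )


def jeffreys_diag_gauss(mu_p, var_p, mu_q, var_q):
    """Symmetrised divergence KL(p || q) + KL(q || p)."""
    return (kl_diag_gauss(mu_p, var_p, mu_q, var_q)
            + kl_diag_gauss(mu_q, var_q, mu_p, var_p))


# Two hypotheses that look alike from one viewpoint and different from another
# (means and variances invented for illustration).
ambiguous_view = jeffreys_diag_gauss([0.0, 0.0], [1.0, 1.0], [0.1, 0.0], [1.0, 1.0])
distinct_view = jeffreys_diag_gauss([0.0, 0.0], [1.0, 1.0], [3.0, 0.0], [1.0, 1.0])
```

A viewpoint where the two predicted appearance distributions overlap yields a small value, whereas a viewpoint that separates them yields a large one, which is what the selection criterion rewards.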

Generally speaking, it is possible to choose a $\Delta$ which can be computed quickly, making the viewpoint selection algorithm faster than those based on mutual information [4]. The viewpoint selection criterion can also be simplified by neglecting terms involving object class and pose pairs of extremely low probability, which contribute little to the sum. As a result, the computation of (4) becomes increasingly fast as the Bayesian inference engine converges toward a single winning hypothesis.
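
The pruning idea can be sketched in Python as follows. The sketch assumes a dictionary of posterior probabilities over (object, pose) hypotheses and a table of predicted appearance models per hypothesis and viewpoint; the threshold `min_prob`, the function names and the toy values are all hypothetical, and scalar appearance values with a squared difference stand in for the actual models and dissimilarity measure.

```python
def viewpoint_score(posterior, models, viewpoint, delta, min_prob=0.0):
    """Score one candidate viewpoint with the pairwise criterion.

    posterior -- dict: hypothesis -> P(hypothesis | measurements so far)
    models    -- dict: (hypothesis, viewpoint) -> predicted appearance model
    delta     -- dissimilarity between two appearance models
    min_prob  -- hypotheses with probability <= min_prob are skipped
    """
    active = sorted(h for h, p in posterior.items() if p > min_prob)
    score = 0.0
    for a, h1 in enumerate(active):
        for h2 in active[a + 1:]:  # each unordered pair is visited once
            score += (posterior[h1] * posterior[h2]
                      * delta(models[(h1, viewpoint)], models[(h2, viewpoint)]))
    return score


def select_viewpoint(posterior, models, viewpoints, delta, min_prob=0.0):
    """Return the candidate viewpoint with the highest score."""
    return max(viewpoints,
               key=lambda v: viewpoint_score(posterior, models, v, delta, min_prob))


def squared_gap(a, b):
    """Stand-in dissimilarity between two scalar appearance values."""
    return (a - b) ** 2


# Toy data: two likely hypotheses and one unlikely one, two candidate viewpoints.
posterior = {(1, 1): 0.49, (1, 2): 0.49, (2, 1): 0.02}
models = {
    ((1, 1), 0): 0.0, ((1, 2), 0): 0.0, ((2, 1), 0): 5.0,
    ((1, 1), 1): 0.0, ((1, 2), 1): 3.0, ((2, 1), 1): 0.0,
}

best_full = select_viewpoint(posterior, models, [0, 1], squared_gap)
best_pruned = select_viewpoint(posterior, models, [0, 1], squared_gap, min_prob=0.05)
```

In this toy case, dropping the unlikely hypothesis removes two of the three pairwise terms without changing the selected viewpoint, since the decision is dominated by the two competing likely hypotheses.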


Experiments

The proposed active object recognition and pose estimation framework may be used in conjunction with a broad variety of feature extractors and appearance models. The results presented here are based on a low dimensional appearance-based object representation obtained by principal component analysis (PCA) [6]. A Gaussian appearance model was fitted to the projections of labeled training images onto the eigenspace.

Case study 1: synthetic 3D models

The first case study was conducted with a database of 31 synthetic 3D models of aircraft [1]. Figure 1 shows sample rendered images of these objects.

Figure 1: Sample objects from the first case study.

The problem considered was that of identifying an unknown object and estimating its pose under the illumination of a single possible light source. The observer is a virtual camera with two degrees of freedom about a sphere, within which the object pose can vary according to two degrees of freedom (pan and tilt).

In a first set of experiments, the proposed observation selection strategy was compared to a random navigation strategy where no active selection of observations was employed. The object recognition and pose estimation results are summarised in table 1.


Table 1: Comparison of the accuracy of recognition and pose estimation results for the aircraft database, using the random and proposed observation selection strategies.
Strategy             Recognition rate   Average pose error
Random navigation    81%                1.84 degrees
Proposed strategy    83%                2.19 degrees


While there are no significant differences in accuracy between the two methods, the proposed active view selection method requires fewer views for recognition.

Similar experiments were then performed on 14 of the synthetic objects using an observation selection criterion based on mutual information [4]. The proposed strategy achieved similar results to mutual information, both in terms of accuracy and in terms of the number of views required for recognition. These are summarised in table 2. However, the proposed observation selection strategy is much less computationally expensive than mutual information. This is illustrated in figure 2. Notice that the amount of time required for the first decision is one order of magnitude lower with the proposed strategy than for the strategy based on mutual information. Furthermore, the time needed for decision making using the proposed strategy dramatically decreases as the recognition process progresses.

Table 2: Performance comparison of the recognition and pose estimation results for 14 objects of the aircraft database, using the random, proposed and mutual information observation selection strategies.
Strategy             Recog. rate   Avg. pose error   Avg. views
Proposed strategy    82%           2.63 degrees      3.4
Mutual information   81%           0.43 degrees      3.3


Figure 2: Comparison of the average time required for viewpoint selection as the recognition task progresses, using the mutual information and proposed observation selection strategies.


Case study 2: real imagery

The second case study considers the more general problem of object recognition and pose estimation under varying lighting conditions. The study was conducted with a set of 13 objects which were custom-built with the purpose of rendering the recognition task difficult, and two light sources. Images of 5 sample objects as seen from different points of view are shown in figure 3 and one object is also shown as illuminated from the two possible light sources in figure 4.

Figure 3: Sample objects used for the second case study as seen from eight different points of view.
[Image grid: objects 1 to 5 (rows), each shown at $0^{\circ}$, $45^{\circ}$, $90^{\circ}$, $135^{\circ}$, $180^{\circ}$, $225^{\circ}$, $270^{\circ}$ and $315^{\circ}$ (columns).]

Figure 4: Sample views of object 1 illuminated from two different sources.
[Image grid: views at $0^{\circ}$, $45^{\circ}$ and $225^{\circ}$ (rows) under light sources 1 and 2 (columns).]

The proposed observation selection strategy was compared to a random navigation strategy (without active viewpoint selection). It was found that the proposed strategy performed poorly under realistic conditions due to the appearance model not fitting the data very well. While the appearance model can easily be changed to better reflect experimental conditions, flaws will always remain. Instead, a heuristic was introduced into the navigation strategy whereby a viewpoint cannot be visited more than once [3]. This prevents the system from using consistently biased information to do inference and select viewpoints. Table 3 and figures 5 and 6 summarise the results.
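
The non-repeating heuristic amounts to excluding visited viewpoints from the maximisation, as in the minimal Python sketch below. The function name and the score values are hypothetical. The scores are held fixed here only to isolate the constraint; in an actual recognition loop they would be recomputed from the updated posterior after each measurement.

```python
def select_unvisited(scores, visited):
    """Return the best-scoring viewpoint that has not been visited yet.

    scores  -- dict: viewpoint -> value of the selection criterion
    visited -- set of viewpoints already used for a measurement
    """
    candidates = [v for v in scores if v not in visited]
    if not candidates:
        return None  # every viewpoint has already been used once
    return max(candidates, key=scores.get)


# Invented criterion values for three viewpoints (angles in degrees).
scores = {0: 0.9, 45: 0.7, 90: 0.2}
visited = set()
route = []
while True:
    viewpoint = select_unvisited(scores, visited)
    if viewpoint is None:
        break
    visited.add(viewpoint)
    route.append(viewpoint)
```

Without the constraint, the same top-scoring viewpoint could be selected repeatedly, so any systematic mismatch between the appearance model and the data at that viewpoint would be fused into the posterior again and again.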


Table 3: Comparison of the results obtained with and without the non-repeating navigation constraint for the random and proposed viewpoint selection approaches.
Strategy                          Recog. rate   Avg. pose error   Avg. views
Random                            87%           0.69 degrees      10.06
Proposed strategy                 76%           2.66 degrees      8.54
Random non-repeating              93%           0.49 degrees      9.36
Proposed strategy non-repeating   94%           1.71 degrees      6.85


Figure 5: Comparison of the average number of steps required for recognition and pose estimation of the house-like objects of the second case study using both the random and proposed observation selection strategies.

Figure 6: A comparison between the repeating and non-repeating versions of the random and proposed navigation strategies based on the evolution of recognition accuracy over time, using real data. Panels: (a) evolution of the recognition rates over time for the different navigation strategies; (b) evolution of the pose estimation error over time for the different navigation strategies.

Clearly, the non-repeating navigation constraint improves the accuracy of the recognition results: the recognition rate rises from 87% to 93% for random navigation and from 76% to 94% for the proposed strategy (table 3). The rate of correct classification obtained with the non-repeating version of the proposed observation selection strategy is comparable to that obtained with the non-repeating random navigation strategy. The slight degradation in the accuracy of the pose estimates (1.71 degrees against 0.49 degrees) is largely offset by the lower measurement acquisition cost, with an average of 6.85 views against 9.36. As shown in figure 5, the number of views required for recognition and pose estimation of the different objects is consistently and significantly lower with the proposed navigation strategy than with random navigation.

Conclusion

The proposed active viewpoint selection criterion allows competing hypotheses to be effectively disambiguated and is an efficient alternative to popular techniques that maximise mutual information. The proposed observation selection strategy is much quicker than a strategy based on mutual information, and requires fewer measurements than a random navigation strategy. Conceivably, the new approach could be combined with instance-based learning techniques to further accelerate the viewpoint selection process [7].

Bibliography

1
Radio control - computer aided design gallery.
http://www.rccad.com/Gallery-Classic8.htm.

2
T. Arbel and F. P. Ferrie.
Entropy-based gaze planning.
Image and Vision Computing, 19:779-786, 2001.

3
H. Borotschnig, L. Paletta, M. Prantl, and A. Pinz.
Appearance-based active object recognition.
Image and Vision Computing, 18:715-727, 2000.

4
J. Denzler and C. M. Brown.
Information theoretic sensor data selection for active object recognition and state estimation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(2):145-157, 2002.

5
D. J. C. MacKay.
Information-based objective functions for active data selection.
Neural Computation, 4(4):589-603, 1992.

6
H. Murase and S. K. Nayar.
Visual learning and recognition of 3-D objects from appearance.
International Journal of Computer Vision, 14:5-24, 1995.

7
L. Paletta, M. Prantl, and A. Pinz.
Learning temporal context in active object recognition using Bayesian analysis.
In Proceedings of the 15th International Conference on Pattern Recognition, pages 695-699, Barcelona, Spain, 2000.

For more information on this work, see the following publications:

  • Catherine Laporte and Tal Arbel,
    "Efficient discriminant viewpoint selection for active Bayesian recognition",
    International Journal of Computer Vision, 68(3):267-287, July 2006.
  • Catherine Laporte, Rupert Brooks and Tal Arbel,
    "A fast discriminant approach to active object recognition and pose estimation",
    In Proceedings of the 17th International Conference on Pattern Recognition, vol. 3, pages 91-94, Cambridge, U.K., 2004.
  • Catherine Laporte,
    "A fast discriminant approach to active Bayesian visual recognition", M. Eng. thesis, McGill University, Montreal, Canada, 2004.

Catherine Laporte 2006-08-14