
Degree Name

Doctor of Philosophy (PhD)


Department

Robotics Institute


First Advisor

J. Andrew Bagnell

Second Advisor

Martial Hebert


Abstract

Extracting a rich representation of an environment from visual sensor readings can
benefit many tasks in robotics, e.g., path planning, mapping, and object manipulation.
While important progress has been made, it remains a difficult problem to effectively
parse entire scenes, i.e., to recognize semantic objects, man-made structures, and landforms.
This process requires not only recognizing individual entities but also understanding
the contextual relations among them.

The prevalent approach to encoding such relationships is to use a joint probabilistic
or energy-based model, which enables one to naturally write down these interactions.
Unfortunately, exact inference over these expressive models is often intractable, and
the solutions can only be approximated. While a variety of sophisticated approximate
inference techniques exist to choose from, the combination of learning and approximate
inference for these expressive models is still poorly understood in theory and limited
in practice. Furthermore, using approximate inference on any learned model often leads
to suboptimal predictions due to the inherent approximations.

As we ultimately care about predicting the correct labeling of a scene, and not
necessarily about learning a joint model of the data, this work proposes instead to
view the approximate inference process as a modular procedure that is directly trained
to produce a correct labeling of the scene. Inspired by early hierarchical models for
scene parsing in the computer vision literature, the proposed inference procedure is
structured to incorporate both feature descriptors and contextual cues computed at
multiple resolutions within the scene. We demonstrate that this inference machine
framework for parsing scenes via iterated predictions offers the best of both worlds:
state-of-the-art classification accuracy and computational efficiency when processing
images and/or unorganized 3-D point clouds. Additionally, we address critical problems
that arise in practice when parsing scenes on board real-world systems: integrating
data from multiple sensor modalities and efficiently processing data that streams
continuously from the sensors.
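The iterated-prediction idea can be illustrated with a minimal sketch. This is a
hypothetical toy, not the thesis implementation: each "stage" is a fixed thresholding
rule (in practice each stage would be a trained classifier), the scene is a 1-D
sequence of scalar features, and context is the neighborhood average of the previous
stage's predictions. It shows only the structure: every stage re-predicts each
element's label from its local feature plus contextual cues derived from the
predictions of the stage before it.

```python
def neighborhood_mean(preds, i, radius=1):
    """Contextual cue: average the previous stage's predictions
    in a window around element i."""
    lo, hi = max(0, i - radius), min(len(preds), i + radius + 1)
    window = preds[lo:hi]
    return sum(window) / len(window)

def stage_predict(features, prev_preds, w_feat=1.0, w_ctx=1.0):
    """One inference-machine stage: score each element from its local
    feature plus context, then threshold into a hard label."""
    out = []
    for i, x in enumerate(features):
        ctx = neighborhood_mean(prev_preds, i)
        score = w_feat * x + w_ctx * ctx
        out.append(1.0 if score > 1.0 else 0.0)
    return out

def inference_machine(features, num_stages=3):
    """Iterate the stage predictor, feeding each stage the previous
    stage's predictions as context."""
    preds = [0.5] * len(features)  # uninformative initial beliefs
    for _ in range(num_stages):
        preds = stage_predict(features, preds)
    return preds
```

In the actual framework each stage is a learned module, so the sequence of
predictors is trained end-to-end to produce the correct labeling rather than to
approximate inference in a joint model; the toy fixed weights and threshold here
merely stand in for those learned components.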
