Abstract

Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. "man riding bicycle" and "man pushing bicycle"). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. "man" and "bicycle") and predicates (e.g. "riding" and "pushing") independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content-based image retrieval.
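To make the idea above concrete, here is a minimal sketch of how a relationship could be scored by combining per-object detection scores, a predicate classifier, and a language prior built from word embeddings. The function names, feature shapes, and the specific product-style combination are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def relationship_score(subj_box_score, obj_box_score, predicate_probs,
                       subj_vec, obj_vec, W, b):
    """Sketch: score a <subject, predicate, object> triple.

    subj_box_score / obj_box_score : detector confidences for the two boxes
    predicate_probs : predicate distribution from a classifier on the pair
                      of boxes, shape [num_predicates]
    subj_vec / obj_vec : word embeddings of the two category names
    W, b : assumed learned projection from the concatenated embeddings to a
           per-predicate language prior, shapes [num_predicates, 2*dim] and
           [num_predicates]
    """
    # Visual evidence: how likely each predicate is for this detected pair.
    visual = subj_box_score * obj_box_score * predicate_probs

    # Language prior: plausibility of each predicate for this object pair,
    # derived from the semantic word embeddings of the category names.
    language = W @ np.concatenate([subj_vec, obj_vec]) + b

    # Rank candidate relationships by combining visual evidence and prior.
    return visual * language
```

A detector would enumerate candidate object pairs in an image and keep the highest-scoring triples, which is what allows rare relationships to be predicted from frequent objects and predicates.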

* equal contribution
To appear in ECCV 2016 (Oral)

Code and Dataset

Download the JSON serialized dataset here.

Download the MATLAB dataset here.
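If you work with the JSON-serialized version, annotations can be read with a few lines of Python. The file name and field names below are assumptions for illustration; check the downloaded files for the exact schema.

```python
import json

# Load the JSON-serialized annotations (path is an assumption).
with open("annotations_train.json") as f:
    annotations = json.load(f)

# Each image name is assumed to map to a list of relationship triples.
for image_name, relationships in list(annotations.items())[:3]:
    for rel in relationships:
        # A triple typically holds a subject, a predicate, and an object,
        # each with a category label and a bounding box (field names assumed).
        print(image_name, rel["subject"], rel["predicate"], rel["object"])
```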

Find additional resources on GitHub, including:

  • Training/test code (uses Matlab)
  • Pretrained model
  • Dataset
  • Evaluation code

Paper

Download the paper here.

Bibtex

@inproceedings{lu2016visual,
  title={Visual Relationship Detection with Language Priors},
  author={Lu, Cewu and Krishna, Ranjay and Bernstein, Michael and Fei-Fei, Li},
  booktitle={European Conference on Computer Vision},
  year={2016}
}

Dataset

To benchmark progress in visual relationship detection, we also introduce a new dataset containing 5,000 images with 37,993 relationships. The dataset contains 100 object categories and 70 predicate categories connecting those objects together. Predicates can be broadly categorized into five types.

Example Results

Example predictions from the model.
Browse more results using the demo in our GitHub code.

Content Based Image Retrieval

By understanding the relationships between objects in images, we show that our model can outperform existing image retrieval models that use SIFT, CNN or Visual Phrase features.

Future Extensions

With the release of the Visual Genome dataset, visual relationship detection models can now be trained on millions of relationships instead of just thousands.

This dataset contains 1.1 million relationship instances and thousands of object and predicate categories. Visual relationship prediction can now be studied at a much larger open world scale, allowing us to build models with a dense understanding of image contents.