The Street View Text (SVT) dataset was harvested from
Google Street View. Image text in this data exhibits high variability
and often has low resolution. In dealing with outdoor, street-level
imagery, we note two characteristics: (1) image text often
comes from business signage, and (2) business names
are easily available through geographic business searches.
These factors make the SVT set uniquely suited for word spotting
in the wild: given a street view image, the goal is to identify words
from nearby businesses. More details about the data set
can be found in our paper,
Word Spotting in the Wild.
For our up-to-date benchmarks on this data, see our paper,
End-to-end Scene Text Recognition.
This dataset has only word-level annotations (no character bounding boxes)
and should be used for (A) cropped lexicon-driven word recognition and
(B) full-image lexicon-driven word detection and recognition.
If you need character-level training data, look into the
Chars74K and ICDAR datasets.
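For task (A), a minimal Python sketch of loading the word-level annotations and cropping out the word images is given below. The XML field names (imageName, lex, taggedRectangles, tag) and the svt1/ directory layout are assumptions based on the commonly distributed train.xml/test.xml files; adjust them if your copy of the data differs.

```python
# Minimal sketch: load SVT-style word-level annotations and crop word images
# for cropped lexicon-driven word recognition (task A).
# The XML field names and directory layout are assumptions; check the
# train.xml / test.xml files shipped with your copy of the dataset.
import xml.etree.ElementTree as ET
from PIL import Image

def load_annotations(xml_path):
    """Return a list of (image_name, lexicon, [(word, box), ...]) entries."""
    entries = []
    for image in ET.parse(xml_path).getroot().findall("image"):
        name = image.find("imageName").text
        lexicon = image.find("lex").text.split(",")
        boxes = []
        for rect in image.find("taggedRectangles").findall("taggedRectangle"):
            x, y = int(rect.get("x")), int(rect.get("y"))
            w, h = int(rect.get("width")), int(rect.get("height"))
            boxes.append((rect.find("tag").text, (x, y, x + w, y + h)))
        entries.append((name, lexicon, boxes))
    return entries

def cropped_words(entries, svt_root="svt1"):
    """Yield (ground-truth word, lexicon, cropped word image) for task (A)."""
    for name, lexicon, boxes in entries:
        img = Image.open(f"{svt_root}/{name}")
        for word, box in boxes:
            yield word, lexicon, img.crop(box)
```

A recognizer for task (A) then only has to choose, for each cropped image, the most likely word from that image's lexicon.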
EXAMPLE
Task: locate all the words in an image that appear
in its lexicon. While there is other text in the image, only
the lexicon words are to be detected. This contrasts with the
more general OCR problem. (A small matching sketch follows the lexicon below.)
Lexicon:
HOLIDAY, INN, EXPRESS, HOTEL, NEW, YORK, CITY, FIFTH,
AVENUE, MICHAEL, FINA, CINEMA, CAFE, 45TH, STARBUCKS,
BINDER, DAVID, DDS, MANHATTAN, DENTIST, BARNES, NOBLE,
BOOKSELLERS, AVE, ART, BROWN, INTERNATIONAL, PEN, SHOP,
MORTON, THE, STEAKHOUSE, DISHES, BUILD, BEAR, WORKSHOP,
HARVARD, CLUB, CORNELL, PACE, UNIVERSITY, LENSCRAFTERS,
SETTE, FOSSIL, STORE, 5TH, JEWEL, INDIA, RESTAURANT, KELLARI,
TAVERNA, YACHT
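To make the task concrete, here is a small Python sketch of one way detections could be scored against the ground truth: a detected word counts as correct if its bounding box sufficiently overlaps a ground-truth box and the words agree (case-insensitively). The 0.5 intersection-over-union threshold is an assumed, commonly used value, not necessarily the exact criterion from the papers.

```python
# Sketch of a lexicon-driven word spotting check: a detection (word + box)
# matches a ground-truth entry if the boxes overlap enough and the words agree.
# The 0.5 IoU threshold is an assumed, commonly used value.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def match_detections(detections, ground_truth, iou_thresh=0.5):
    """detections / ground_truth: lists of (word, box). Returns matched pairs."""
    matched, used = [], set()
    for word, box in detections:
        for i, (gt_word, gt_box) in enumerate(ground_truth):
            if i in used:
                continue
            if word.upper() == gt_word.upper() and iou(box, gt_box) >= iou_thresh:
                matched.append((word, gt_word))
                used.add(i)
                break
    return matched

# Example:
# match_detections([("HOLIDAY", (10, 20, 110, 60))],
#                  [("Holiday", (12, 22, 108, 58))])
```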
DOWNLOAD
DATA COLLECTION
We used Amazon's Mechanical Turk to harvest
and label the images from
Google Street View.
To build the data set, we created several Human Intelligence Tasks (HITs)
to be completed on Mechanical Turk.
Harvest images. Workers were assigned a unique city and were
asked to acquire 20 images that contain text from Google
Street View. They were instructed to:
(1) perform a Search Nearby:* on their city,
(2) examine the businesses in the
search results, and
(3) look at the associated Street View for images containing
text from the business name. If words were found, they composed
the scene to minimize skew, saved a screenshot, and recorded the
business name and address.
Image annotation. Workers were presented with an image and a list of candidate
words to label with bounding boxes. This contrasts with the ICDAR Robust
Reading data set in that we only labeled words associated with businesses.
We used Alex Sorokin's Annotation Toolkit
to support bounding-box image annotation.
For each image, we obtained a list of local business names using the Search
Nearby:* feature in Google Maps at the image's address. We stored the top 20 business
results for each image, typically resulting in about 50 unique words. To summarize, the
SVT data set consists of images collected from Google Street View, where each
image is annotated with bounding boxes around words from businesses near
where the image was taken.
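As a rough illustration, a per-image lexicon of this kind could be assembled by pooling the unique words from the returned business names; the tokenization and upper-casing below are assumptions for the sketch, since the released annotation files already include the final lexicons.

```python
# Sketch: build a per-image lexicon as the set of unique words appearing in
# the names of nearby businesses. Tokenization and normalization here are
# assumptions; the released annotation files already contain final lexicons.
import re

def build_lexicon(business_names, max_results=20):
    """business_names: list of business-name strings from a nearby search."""
    words = set()
    for name in business_names[:max_results]:
        for token in re.split(r"[^A-Za-z0-9]+", name.upper()):
            if token:  # drop empty tokens produced by punctuation runs
                words.add(token)
    return sorted(words)

# Example:
# build_lexicon(["Holiday Inn Express", "Starbucks Coffee"])
# -> ['COFFEE', 'EXPRESS', 'HOLIDAY', 'INN', 'STARBUCKS']
```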
RELATED DATASETS
REFERENCES
Kai Wang, Boris Babenko and Serge Belongie.
End-to-end Scene Text Recognition.
ICCV 2011,
Barcelona, Spain.
[PDF]
Kai Wang and Serge Belongie. Word Spotting in the Wild.
ECCV 2010, Heraklion, Crete.
[PDF]
[Note: the dataset has undergone revision since it was
evaluated in this publication. Please consult the ICCV 2011 paper for the most up-to-date results.]
CONTACT
For questions about the dataset please contact Kai Wang at
k...@cs.ucsd.edu.