Wednesday, April 22, 15:00 - 17:30, Location: Show and Tell Area B
Sam S. Tsai, David M. Chen, Jatinder Singh, Bernd Girod
Handheld mobile devices, such as camera-phones or PDAs, are expected to become ubiquitous platforms for visual search and mobile augmented reality applications. For mobile image matching, a visual database is typically stored at a server in the network. Hence, for a visual comparison, information must either be uploaded from the mobile to the server, or downloaded from the server to the mobile. With relatively slow wireless links, the response time of the system critically depends on how much information is transferred in each direction. Depending on the relative processing power available at the mobile client and the server, and on the bandwidth of the link connecting them, interactive applications benefit from performing image processing on the phone, on the server, or on both.
To demonstrate the real-time image-based retrieval system we have designed, we present a mobile CD cover recognition system. Imagine a customer in a music store who snaps a photo of a CD cover. Query information is uploaded to a server, where the matching CD cover is identified and a snippet of music from the CD is retrieved to be played on the phone. The demonstration has been implemented with a Nokia N95 camera-phone as the client and, as the server, a dual-core Linux machine with 4 GB RAM located at Stanford University that hosts a database of 10K CD cover images. A user-friendly query interface guides the user through the matching process. Users can take query pictures of CD albums that we will provide and play music samples of the identified CDs directly on the camera-phone. The system can also recognize DVD covers and play movie trailers.
In the following, we provide some details of the algorithms we have developed and implemented in the system to overcome the key challenges. The first challenge is that photos taken by camera-phones typically suffer from geometric and photometric distortions. These image distortions are overcome by performing image matching using robust feature descriptors. In our implementation, we use SURF, which can be executed either on the phone or on the server [1]. The second challenge is that the system must meet stringent time constraints. Even for a large database, the time to calculate accurate image matches must be small to provide the user with an interactive experience. We have implemented a scalable vocabulary tree (SVT) [2] to search through large databases. Combining multiview SVTs [3], post-tree scoring decisions, and inverted file lookup methods, we are able to search through a database of 10K images within 1 second. Multiview SVTs can also improve the retrieval performance under perspective variations [3]. The third challenge is the limited network bandwidth. An EDGE cellular network has a typical uplink of 50-80 kbit/sec. Moreover, cellular service plans in some regions, and when roaming across countries, charge according to the amount of data transmitted. Thus, the query data for image matching must be concise and compressed. In our system, only the features in the image are used by the server to perform image matching. A feature descriptor compression algorithm [4] can be applied to greatly reduce the query data sent across the wireless network with minimal degradation in matching accuracy.
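To illustrate the retrieval step, the sketch below shows the core idea behind vocabulary-tree scoring with an inverted file: each image is reduced to a bag of visual-word IDs (here assumed to be already quantized by the tree; the IDs and toy database are hypothetical), an inverted file maps each visual word to the images containing it, and only images sharing a word with the query are scored with TF-IDF weights. This is a minimal illustration of the general technique, not the system's actual implementation.

```python
import math
from collections import Counter, defaultdict

# Toy database: each image is represented by the visual-word IDs that its
# local features quantize to in the vocabulary tree (hypothetical data).
db = {
    "cover_A": [1, 2, 2, 5, 7],
    "cover_B": [2, 3, 3, 8],
    "cover_C": [1, 5, 5, 9, 9],
}

# Inverted file: visual word -> {image: term frequency}.
inverted = defaultdict(dict)
for img, words in db.items():
    for w, tf in Counter(words).items():
        inverted[w][img] = tf

n_images = len(db)
# IDF weight per visual word: rare words are more discriminative.
idf = {w: math.log(n_images / len(postings))
       for w, postings in inverted.items()}

def score(query_words):
    """Accumulate TF-IDF scores, touching only the images that share at
    least one visual word with the query (via the inverted file)."""
    scores = defaultdict(float)
    for w, q_tf in Counter(query_words).items():
        for img, d_tf in inverted.get(w, {}).items():
            scores[img] += q_tf * d_tf * idf.get(w, 0.0) ** 2
    return sorted(scores.items(), key=lambda kv: -kv[1])

# A query sharing distinctive words 1, 5, and 9 with cover_C ranks it first.
ranked = score([1, 5, 5, 9])
```

The inverted file is what makes the lookup scale: scoring cost grows with the number of postings touched by the query's visual words, not with the total database size.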