To overcome these problems, we are developing a new research facility known as the Shared Reality Environment. The primary goal of this environment is to support the exchange of low-latency, high-fidelity audio and video streams between multiple users in different locations. Satisfying this goal for the video stream presents a number of difficulties.
A first approach is to use M-JPEG- or MPEG-encoded video. The problems here are cost and latency: MPEG hardware tends to be expensive, and while cost is less of an issue for M-JPEG, with current technology either method introduces a minimum of 50 ms of latency for compression and decompression, on top of the image acquisition time. Avoiding compression leaves the option of transmitting raw data, but for high-resolution, 30 fps video this requires massive bandwidth. Even on 100 Mbps Ethernet, transmission of a single 640x480 frame at 24 bits per pixel takes approximately 100 ms.
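A back-of-the-envelope check makes the bandwidth argument concrete. The sketch below computes the ideal wire time for one frame, ignoring all protocol and software overhead (which accounts for the gap between this figure and the roughly 100 ms observed in practice), along with the sustained rate a 30 fps stream would demand:

```python
# Back-of-the-envelope check of the raw-video bandwidth claim.
# Frame parameters from the text: 640x480 pixels at 24 bits per pixel.
width, height, bits_per_pixel = 640, 480, 24
frame_bits = width * height * bits_per_pixel   # 7,372,800 bits per frame
link_bps = 100e6                               # 100 Mbps Ethernet

# Ideal transmission time for one frame, before any protocol overhead:
tx_ms = frame_bits / link_bps * 1e3
print(f"{tx_ms:.1f} ms per frame")             # -> 73.7 ms per frame

# Sustained bandwidth needed for an uncompressed 30 fps stream:
bandwidth_mbps = frame_bits * 30 / 1e6
print(f"{bandwidth_mbps:.0f} Mbps at 30 fps")  # -> 221 Mbps at 30 fps
```

Even the idealized per-frame figure consumes most of a 33 ms frame interval, and the sustained rate exceeds the link capacity outright, which is why transmitting full raw frames is not viable.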
The fact that much of the data in a video frame is redundant forms the basis of compression techniques. For example, a static background in a sequence of images may constitute the majority of each frame. Since our goal is to allow users to interact, we may simply remove the background in its entirety, and thus reduce encoding and decoding time. The remaining image components, if sufficiently small, may be transmitted as raw data without compression, thereby reducing overall latency. Key to this work is the ability to quickly locate an approximate bounding box of a person in a scene.
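The bounding-box step above can be sketched with simple frame differencing against a stored background image. This is a minimal illustration, not the method developed in this work: the `bounding_box` function, the absolute-difference test, and the threshold value are all illustrative assumptions.

```python
import numpy as np

def bounding_box(frame, background, threshold=30):
    """Return (x0, y0, x1, y1) enclosing pixels that differ from the background.

    frame, background: HxW uint8 grayscale arrays. The per-pixel
    absolute-difference test and the threshold are illustrative choices.
    """
    # Widen to int16 so the subtraction cannot wrap around in uint8.
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    mask = diff > threshold
    if not mask.any():
        return None  # no foreground detected
    rows = np.where(mask.any(axis=1))[0]   # rows containing foreground
    cols = np.where(mask.any(axis=0))[0]   # columns containing foreground
    return int(cols[0]), int(rows[0]), int(cols[-1]) + 1, int(rows[-1]) + 1

# Example: a 480x640 empty scene with a synthetic foreground region.
bg = np.zeros((480, 640), dtype=np.uint8)
fr = bg.copy()
fr[100:300, 200:320] = 200                 # stand-in for a person
print(bounding_box(fr, bg))                # -> (200, 100, 320, 300)
```

Only the cropped region inside the box would then be transmitted raw; a real system would additionally need to cope with camera noise, shadows, and gradual lighting changes, which this naive differencing does not handle.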
A. Xu, J. Cooperstock