4. The Classification of Video Shots

4.1 Shot Segmentation and Key Frame Extraction

The first step in news video analysis is to segment the input news video into shots. We employ the multi-resolution analysis technique developed in our lab [15], which can effectively locate both abrupt and gradual transition boundaries. We adopt a "greedy" strategy that over-segments the video in order to minimize the chance of missing shot, and hence story, boundaries. This strategy is reasonable because most falsely detected shots will be merged into story units during subsequent analysis.

After the video is segmented, the contents of each shot can be modelled in several ways: (a) using a representative key frame; (b) as feature trajectories; or (c) using a combination of both. In this research, we adopt the hybrid approach as a compromise between efficiency and effectiveness: most visual content features are extracted from the key frame, while motion and audio features are extracted from the temporal contents of the shot.

4.2 Selection of Shot Categories

We studied the set of categories employed in related works, and the structure of news video in general and local news in particular. The categories must be meaningful, so that the category tag assigned to each shot reflects its content and facilitates the subsequent stage of segmenting and classifying news stories. The granularity of the categories is also important: we want the selected categories to reflect the major structure and concepts in news presentation. Thus, while it is reasonable to have Sports and Weather categories, it is too fine-grained to consider subcategories of sports such as basketball or football; these subcategories are not important to understanding the overall news structure. Based on these considerations, we arrive at the following set of shot categories: Intro/Highlight, Anchor, 2Anchor, Meeting/Gathering, Speech/Interview, Live-reporting, Still-image, Sports, Text-scene, Special, Finance, Weather, and Commercial. Figure 44.4 shows a typical example of each category.

Figure 44.4: Examples of the predefined categories and example shots

4.3 Choice and Extraction of Features for Shot Classification

For a learning-based classification system to work effectively, it is imperative to identify a suitable set of features based on an in-depth understanding of the domain. In addition, we aim to derive a comprehensive set of features that can be extracted automatically from MPEG video.

4.3.1 Low-level Visual Content Feature

Colour Histogram: The colour histogram models the visual composition of the shot. It is particularly useful for resolving two scenarios in shot classification. First, it can identify shot types with similar visual contents, such as weather and finance reporting. Second, it can model changes in background between successive shots, which provide important clues to a possible change in shot category or story. Here, we represent the content of the key frame using a 256-colour histogram.
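As a concrete illustration, a 256-bin colour histogram can be computed by quantizing RGB values into 8x8x4 levels; the exact quantization scheme below is an assumption, since the chapter only specifies the bin count:

```python
import numpy as np

def colour_histogram(key_frame):
    """Normalized 256-bin colour histogram of an RGB key frame (uint8, HxWx3).

    Quantizes each pixel into 8x8x4 RGB levels (8*8*4 = 256 bins).
    The bin layout is an illustrative assumption.
    """
    r = key_frame[..., 0] >> 5              # 8 red levels
    g = key_frame[..., 1] >> 5              # 8 green levels
    b = key_frame[..., 2] >> 6              # 4 blue levels
    bins = (r.astype(int) * 8 + g) * 4 + b  # bin index in 0..255
    hist = np.bincount(bins.ravel(), minlength=256).astype(float)
    return hist / hist.sum()                # normalize to sum to 1
```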

4.3.2 Temporal Features

Background scene change: Following the discussion of the colour histogram, we include a background scene change feature that measures the difference between the colour histograms of the current and previous shots. We employ a higher threshold than the one used for shot segmentation to detect a background scene change. The feature is represented by 'c' if there is a change and 'u' otherwise.
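A minimal sketch of this flag, assuming an L1 histogram distance and an illustrative threshold of 0.5 (the chapter only states that the threshold is higher than the one used for shot segmentation):

```python
import numpy as np

def background_scene_change(hist_cur, hist_prev, threshold=0.5):
    """Return 'c' if consecutive shots' colour histograms differ by more
    than `threshold`, else 'u'.  The L1 distance and 0.5 cut-off are
    assumptions for illustration.
    """
    # Halved L1 distance of normalized histograms lies in [0, 1].
    distance = 0.5 * np.abs(np.asarray(hist_cur) - np.asarray(hist_prev)).sum()
    return 'c' if distance > threshold else 'u'
```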

Audio type: This feature is very important especially for Sport and Intro/Highlight shots. For Sport shots, its audio track includes both commentary and background noise, and for Intro/Highlight shots, the narrative is accompanied by background music. We adopt an algorithm similar to that discussed in Lu et al. [17] to classify audio into the broad categories of speech, music, noise, speech and noise, speech and music, and silence.

Speaker change: Similar to the background scene change feature, this feature measures whether there is a change of speaker between the current and previous shots. It takes the value 'u' for no change and 'c' if there is a change. The value 'c' also applies when there is a change from a speech shot to a non-speech shot or vice versa. Non-speech shots can be identified from the shot's audio type, as described earlier.

Motion activity: MPEG video directly encodes motion vectors, which can be used to indicate the level of motion activity within a shot. We usually see a high level of motion in sports and in certain live reporting shots, such as rioting scenes. We thus classify the motion as low (as in an Anchorperson shot, where only the head region moves), medium (as in shots with people walking), high (as in sports), or no motion (for still-frame shots and Text-scene shots).

Shot duration: For Anchorperson or Interview shots, the duration tends to range from 20 to 50 seconds. For other types of shots, such as Live-reporting or Sports, the duration tends to be much shorter, ranging from a few seconds to about 10 seconds. The duration is thus an important feature for modelling the rhythm of the shots. We set the shot duration to short (less than 10 seconds), medium (between 10 and 20 seconds), or long (greater than 20 seconds).
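The two quantizations above can be sketched as follows; the duration cut-offs come directly from the text, while the motion cut-offs are assumed values:

```python
def quantize_duration(seconds):
    """Map shot duration to the symbols used in the feature vector."""
    if seconds < 10:
        return 's'   # short
    if seconds <= 20:
        return 'm'   # medium
    return 'l'       # long

def quantize_motion(mean_vector_magnitude, low=1.0, high=5.0):
    """Map the average MPEG motion-vector magnitude of a shot to a level.
    The numeric cut-offs (1.0, 5.0) are assumptions for illustration.
    """
    if mean_vector_magnitude == 0:
        return 'n'   # no motion: still frames, text scenes
    if mean_vector_magnitude < low:
        return 'l'   # low: e.g. anchorperson head movement
    if mean_vector_magnitude < high:
        return 'm'   # medium: e.g. people walking
    return 'h'       # high: e.g. sports
```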

4.3.3 High-level Object-based Features

Face: Human activities are one of the most important aspects of news videos, and many such activities can be deduced from the presence of faces. Many techniques have been proposed to detect faces in an image or video. In our study, we adopt the algorithm developed in [6] to detect mostly frontal faces in the key frame of each shot. For each shot, we extract the number of faces detected as well as their sizes. The size of the face is used to estimate the shot type.

Shot type: We use the camera focal distance to model the shot type, which includes close-up, medium-distance, and long-distance shots. Here, we simply use the size of the detected face to estimate the shot type.
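A sketch of this estimate, using the ratio of the largest detected face to the frame area; the ratio thresholds are assumptions, not values from the chapter:

```python
def shot_type_from_faces(face_areas, frame_area):
    """Estimate the shot type from the largest detected face.

    face_areas: list of face bounding-box areas in pixels (may be empty).
    The 10% / 2% area-ratio thresholds are illustrative assumptions.
    """
    if not face_areas:
        return 'u'                       # unknown: no face detected
    ratio = max(face_areas) / frame_area
    if ratio > 0.10:
        return 'c'                       # close-up
    if ratio > 0.02:
        return 'm'                       # medium-distance
    return 'l'                           # long-distance
```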

Videotext: Videotext is another type of object that appears frequently in news video and can be used to determine video semantics. We employ the algorithm developed in [21] to detect videotexts. For each shot, we simply determine the number of lines of text that appear in the key frame.

Centralized Videotext: We often need to differentiate between two types of shots containing videotext: the normal shot, where videotext appears at the top or bottom of the frame to annotate its contents; and the text-scene shot, where only a sequence of text is displayed to summarize an event, such as the results of a soccer game. A text-scene shot typically contains multiple lines of centralized text, unlike normal shots. This feature takes the value "true" for centralized text and "false" otherwise.
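A possible check for centralized text, assuming the text detector returns (x, y, w, h) bounding boxes and using an assumed 10% centre tolerance:

```python
def centralized_text(text_boxes, frame_width):
    """Return 't' when all detected text lines are horizontally centred,
    'f' otherwise.

    text_boxes: list of (x, y, w, h) tuples from the text detector.
    The 10% centre tolerance is an assumed value.
    """
    if not text_boxes:
        return 'f'
    tolerance = 0.10 * frame_width
    centre = frame_width / 2
    for x, y, w, h in text_boxes:
        # Compare each line's horizontal midpoint to the frame centre.
        if abs((x + w / 2) - centre) > tolerance:
            return 'f'
    return 't'
```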

Figure 44.5 presents a view of a news video in our approach.

Figure 44.5: The model of a news video

4.4 Shot Representation

After all features are extracted, we represent the contents of each shot using a colour histogram vector and a feature vector. The histogram vector is used to match the content of a shot with the representative shot of certain categories, while the feature vector is used by Decision Tree to categorize the shots into one of the remaining categories. The feature vector of a shot is of the form:

(a, m, d, f, s, t, c)     (44.1)

where:

- a is the class of audio, a ∈ {t=speech, m=music, s=silence, n=noise, tn=speech+noise, tm=speech+music, mn=music+noise}
- m is the motion activity level, m ∈ {l=low, m=medium, h=high}
- d is the shot duration, d ∈ {s=short, m=medium, l=long}
- f is the number of faces, f ≥ 0
- s is the shot type, s ∈ {c=close-up, m=medium, l=long, u=unknown}
- t is the number of lines of text in the scene, t ≥ 0
- c is set to "true" if the videotexts present are centralized, c ∈ {t=true, f=false}

For example, the feature vector of an Anchorperson shot may be (t, l, l, 1, c, 2, f). Note that at this stage we do not include the background scene change and speaker change features in the feature set. These two features are not important for shot classification and will be used later in detecting story boundaries using HMM.
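For clarity, the feature vector of equation (44.1) can be sketched as a named tuple; the Anchorperson example here uses 'l' for the low-motion and long-duration symbols:

```python
from collections import namedtuple

# Field order follows equation (44.1):
# audio, motion, duration, faces, shot type, text lines, centralized text
ShotFeatures = namedtuple('ShotFeatures', ['a', 'm', 'd', 'f', 's', 't', 'c'])

# A typical Anchorperson shot: speech audio, low motion, long duration,
# one face, close-up, two lines of text, not centralized.
anchor = ShotFeatures(a='t', m='l', d='l', f=1, s='c', t=2, c='f')
```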

4.5 Classification of Video Shots

In the shot classification process, we first remove the commercials and then classify the remaining shots.

In most countries, it is mandatory to air several black frames immediately before or after a block of commercials. However, this is not the case everywhere; Singapore is one example. Our studies have shown that commercial boundaries can normally be characterized by the presence of black frames, still frames, and/or audio silence [14]. We thus employ a heuristic approach to identify the presence of commercials and to detect the beginning and end of commercial blocks. Our tests on six news videos (180 minutes) obtained from MediaCorp of Singapore show that we achieve a detection accuracy of over 97%.
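The black-frame and silence cues can be sketched as below; the luminance and RMS thresholds are assumed values, not those of [14]:

```python
import numpy as np

def is_black_frame(frame, luminance_threshold=20, fraction=0.98):
    """True if nearly all pixels of an RGB frame are darker than the
    threshold.  Both cut-offs are assumptions for illustration.
    """
    luminance = frame.mean(axis=-1)          # crude luminance from RGB
    return bool((luminance < luminance_threshold).mean() >= fraction)

def is_silence(samples, rms_threshold=0.01):
    """True if the RMS energy of the audio samples falls below threshold
    (samples assumed normalized to [-1, 1]; threshold is assumed)."""
    samples = np.asarray(samples, dtype=float)
    rms = np.sqrt((samples ** 2).mean())
    return bool(rms < rms_threshold)
```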

We break the classification of the remaining shots into two sub-tasks. We first identify the shot types that have very similar visual features, such as the Weather and Finance reports. For these shot types, we simply extract the representative histogram of each category and employ the histogram-matching algorithm developed in [5], which takes perceptually similar colours into consideration, to compute the shot-category similarity. We employ a high threshold of 0.8 to determine whether a given shot belongs to the Weather or Finance category.
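As a simplified stand-in for the perceptual colour matching of [5], plain histogram intersection illustrates the thresholded category match:

```python
import numpy as np

def histogram_similarity(h1, h2):
    """Histogram intersection of two normalized histograms (1.0 = identical).
    A simplified substitute for the perceptual matching of [5]."""
    return float(np.minimum(h1, h2).sum())

def visual_category(shot_hist, representatives, threshold=0.8):
    """Return the category (e.g. 'Weather', 'Finance') whose representative
    histogram best matches the shot, or None if no match reaches the
    threshold of 0.8 stated in the text."""
    best, score = None, 0.0
    for name, rep in representatives.items():
        s = histogram_similarity(shot_hist, rep)
        if s > score:
            best, score = name, s
    return best if score >= threshold else None
```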

For the rest of the shots, we employ a Decision Tree (DT), in particular the C4.5 algorithm, to perform classification in a learning-based approach. The decision tree is one of the most widely used techniques in machine learning. It is robust to noisy data, capable of learning disjunctive expressions, and tolerant of training data that contain missing or unknown values [2, 18].

The Decision Tree approach has been successfully employed in many multi-class classification problems [8, 22]. We thus select the Decision Tree for our shot classification problem.
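The chapter relies on C4.5; a minimal sketch of its gain-ratio split criterion follows (the full tree induction and pruning are omitted):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """C4.5 split criterion: information gain normalized by split info.

    rows: list of dicts mapping attribute name -> symbolic value,
    e.g. the (a, m, d, f, s, t, c) features of each shot.
    """
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)        # partition labels by value
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder         # information gain
    split_info = entropy([row[attr] for row in rows])
    return gain / split_info if split_info > 0 else 0.0
```

C4.5 chooses, at each node, the attribute with the highest gain ratio; the normalization by split info keeps many-valued attributes from being unfairly favoured.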




Handbook of Video Databases: Design and Applications (Internet and Communications), ISBN 084937006X, 2003.