# Vision, Image Processing & Sound Lab

## Spectator Crowd Analysis: Investigating people at stadiums, theaters and events

### Background

The project focuses on a specific kind of crowd, the spectator crowd. This type of social gathering is formed by people "interested in watching something specific that they came to see" [1].
Unlike the generic crowds usually analyzed in computer vision, the spectator crowd has distinctive features that call for specific analysis techniques:

• People are not moving but semi-static, staying most of the time at a single position or seat; this opens up applications not yet addressed in the crowd literature, such as gesture recognition and attention analysis
• The environment where the crowd gathers is usually known beforehand (a stadium, a square), so that specific camera settings can be exploited
• People come to watch a specific event, so there is an event to consider alongside the crowd; analyzing the two together may lead to novel goals and challenges (e.g. discovering which actions attract the spectators' attention the most)

### Research aims

The project aims at exploring all the unique features of the spectator crowd, leading either to brand-new tasks or to classical computer vision applications seen from a new perspective, such as:

• spectator detection: locating each individual person in the crowd
• spectator segmentation: in a competition, understanding which team each spectator supports
• global head orientation: understanding what most of the people are looking at
• automatic highlight generation: identifying the intervals of the event that triggered the attention/excitement of the public the most
• gesture segmentation: due to the static nature of the crowd, gesture analysis may be possible
• social signal processing: again due to the static nature of the crowd, social signal processing can be pursued, for example identifying people who know each other, such as couples, families and friends, each category being identified by analyzing specific cues

The research is based on a project started during an international hockey competition, which produced the first spectator crowd dataset: S-HOCK. The S-HOCK dataset and some applications on the spectator crowd were published at CVPR 2015 [2]. For the dataset and more information, please see here.

### Published results

#### Overview

The data collection campaign focused on 4 hockey matches held in Trento (Italy) during the 26th Winter Universiade. We used 5 cameras: a full HD camera (1920x1080, 30 fps, focal length 4 mm) for the ice rink, another one for a panoramic view of all the bleachers, and 3 high-resolution cameras (1280x1024, 30 fps, focal length 12 mm) focusing on different parts of the spectator crowd.
From each match we selected a pool of sequences covering a wide and representative spectrum of situations, e.g. tens of instances of goals, shots on goal, saves, fouls and timeouts (each sequence contains more than one event).
Each sequence has been annotated frame by frame, spectator by spectator, by a first annotator, using the ViPER format. The annotator had to perform three macro tasks: detection (localizing the body and the head), posture annotation and action annotation, capturing fine-grained actions such as hands on hips, clapping hands, watching the cellphone etc., for a total of more than 100 million annotations.

After each match, we asked a uniformly distributed sample of the spectators (30%) to fill in a simple questionnaire with three questions:

• Which team did you support in this match?
• Did you know at the beginning of the match who was sitting next to you?
• Which has been the most exciting action in this game?

#### Results

In this work we present a set of possible applications on S-HOCK. In particular, we focus on two classical tasks, people detection and head pose estimation, and on one application that is more interesting from the social point of view, spectator categorization.

##### People Detection

People detection is a standard and still open research topic in computer vision, with HOG features and the Deformable Part Model (DPM) as workhorses, and plenty of alternative algorithms. Unfortunately, most of the methods in the literature are not directly usable in our scenario, mostly for two reasons: low resolution (a person has an average size of 70x110 pixels) and occlusions (usually only the upper body is visible, rarely the entire body and sometimes only the face).
We provide 5 different baselines for people detection:

• HOG+SVM: detector based on HOG (cell size of 8x8 pixels) and a linear SVM classifier.
• HASC+SVM: detector based on the Heterogeneous Auto-Similarities of Characteristics (HASC) descriptor and a linear SVM classifier.
• ACF: Aggregate Channel Features detector, which uses the Viola-Jones framework to compute integral images (and Haar wavelets) over the color channels, fusing them together.
• DPM: Deformable Part Model, which combines part templates arranged in a deformable configuration, fed into a latent SVM classifier.
• CUBD: Calvin Upper Body Detector, a combination of the DPM framework trained on near-frontal upper bodies (i.e. head and upper half of the torso of the person) and the Viola-Jones face detector.

On top of all these methods, we propose an extension based on the strong prior available for this kind of crowd: people are "constrained" by the environment to arrange in a grid, i.e. the seats on the bleachers. Assuming a regular grid (considering the camera perpendicular to the plane of the bleachers and ignoring distortion effects) and accounting for the fact that people are more likely to be located on the same rows and columns, we can simply add to the detection confidence map the average of the map over the rows and over the columns.
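The grid prior described above can be illustrated with a minimal sketch. It assumes the confidence map is a plain 2D list aligned with the seat grid and that the row and column averages are added with equal weight; the function name `add_grid_prior`, the weighting and the toy values are illustrative assumptions, not the implementation used in the paper.

```python
def add_grid_prior(conf_map):
    """Return conf_map plus its per-row and per-column averages.

    Assumption: both averages are added with weight 1; the source text
    only says the averages are added to the confidence map.
    """
    n_rows, n_cols = len(conf_map), len(conf_map[0])
    # Average confidence along each row and each column of the map.
    row_mean = [sum(row) / n_cols for row in conf_map]
    col_mean = [sum(row[c] for row in conf_map) / n_rows for c in range(n_cols)]
    # Every cell is boosted by the average of its own row and its own column.
    return [[conf_map[r][c] + row_mean[r] + col_mean[c] for c in range(n_cols)]
            for r in range(n_rows)]


# Toy 3x3 confidence map (made-up values): rows 0 and 2 are occupied seat
# rows, row 1 is empty, and the middle column holds only weak responses.
toy_map = [
    [0.9, 0.1, 0.8],
    [0.0, 0.0, 0.0],
    [0.7, 0.2, 0.9],
]
with_prior = add_grid_prior(toy_map)
```

In this toy map, the weak response at row 0, column 1 receives a larger boost than the empty cell at row 1, column 1, because it shares a row with strong detections. This is consistent with the higher recall reported for the "with prior" variants.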

| Method | Prec. (no prior) | Rec. (no prior) | F1 (no prior) | Prec. (with prior) | Rec. (with prior) | F1 (with prior) |
|---|---|---|---|---|---|---|
| HOG+SVM | 0.743 | 0.561 | **0.639** | **0.662** | **0.709** | **0.684** |
| HASC+SVM | 0.365 | **0.642** | 0.465 | 0.357 | 0.685 | 0.469 |
| ACF | 0.491 | 0.622 | 0.548 | 0.524 | 0.649 | 0.580 |
| DPM | 0.502 | 0.429 | 0.463 | 0.423 | 0.618 | 0.502 |
| CUBD | **0.840** | 0.303 | 0.444 | 0.613 | 0.553 | 0.581 |
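As a reading aid for the detection table above, the sketch below recomputes F1 as the harmonic mean of precision and recall, F1 = 2PR / (P + R), from the published precision and recall values. The dictionary layout and the helper name `f1_score` are illustrative choices. Small differences from the published F1 are expected, since the table values are rounded to three decimals.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, published F1) for "no prior" and "with prior",
# copied from the table above.
detection_results = {
    "HOG+SVM":  ((0.743, 0.561, 0.639), (0.662, 0.709, 0.684)),
    "HASC+SVM": ((0.365, 0.642, 0.465), (0.357, 0.685, 0.469)),
    "ACF":      ((0.491, 0.622, 0.548), (0.524, 0.649, 0.580)),
    "DPM":      ((0.502, 0.429, 0.463), (0.423, 0.618, 0.502)),
    "CUBD":     ((0.840, 0.303, 0.444), (0.613, 0.553, 0.581)),
}

# Largest absolute gap between recomputed and published F1 over all entries.
max_gap = max(
    abs(f1_score(p, r) - f1)
    for settings in detection_results.values()
    for (p, r, f1) in settings
)

# Change in F1 brought by the grid prior for each method.
f1_gain = {name: with_prior[2] - no_prior[2]
           for name, (no_prior, with_prior) in detection_results.items()}
```

The recomputed values agree with the published ones to within about 0.002. `f1_gain` is positive for every method, with the largest gain for CUBD, whose recall rises from 0.303 to 0.553.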

##### Head Pose Estimation

| Method | AVG Accuracy | Training time [sec] | Testing time [sec] |
|---|---|---|---|
| Orozco et al. | 0.368 | 105303 | 6263 |
| WArCo | 0.376 | 186888 | 87557 |
| **CNN** | 0.346 | 16106 | 68 |
| **SAE** | 0.348 | 9384 | 3 |
| **CNN+EACH** | 0.354 | 16106 | 68 |
| **SAE+EACH** | 0.363 | 9384 | 3 |
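The accuracy and timing table above shows a trade-off between accuracy and computation time. As a small worked example, the sketch below compares the most accurate method with SAE+EACH using only the figures from the table; the variable names are illustrative.

```python
# (average accuracy, training time in s, testing time in s), from the table above.
timing_table = {
    "Orozco et al.": (0.368, 105303, 6263),
    "WArCo":         (0.376, 186888, 87557),
    "CNN":           (0.346, 16106, 68),
    "SAE":           (0.348, 9384, 3),
    "CNN+EACH":      (0.354, 16106, 68),
    "SAE+EACH":      (0.363, 9384, 3),
}

# Method with the highest average accuracy.
most_accurate = max(timing_table, key=lambda m: timing_table[m][0])

# Accuracy given up, and testing-time ratio, when switching to SAE+EACH.
acc_best, _, test_best = timing_table[most_accurate]
acc_sae, _, test_sae = timing_table["SAE+EACH"]
accuracy_drop = acc_best - acc_sae
test_speedup = test_best / test_sae
```

With these numbers, SAE+EACH gives up about 0.013 in average accuracy relative to the most accurate method, while its testing time is lower by a factor of more than 29000.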

##### Spectator Categorization

The spectator categorization task consists in finding the supporters of each team, exploiting the fact that fans of the same team behave similarly at some events (a goal etc.), and very differently from the supporters of the other team.

| Ground Truth | AS2007 | MMS2010 | Our |
|---|---|---|---|
| Accuracy | 0.592 | 0.559 | 0.621 |

#### Contacts

• Davide Conigliaro:

### References

1. Understanding and planning for different spectator crowds.
A. Berlonghi; Safety Science, 18:239--247, 1995

@article{berlonghi1995understanding,
title={Understanding and planning for different spectator crowds},
author={Berlonghi, Alexander E},
journal={Safety Science},
volume={18},
number={4},
pages={239--247},
year={1995},
publisher={Elsevier}
}

2. The S-Hock Dataset: Analyzing Crowds at the Stadium
Davide Conigliaro, Paolo Rota, Francesco Setti, Chiara Bassetti, Nicola Conci, Nicu Sebe, Marco Cristani; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2039-2047

@InProceedings{Conigliaro_2015_CVPR,
author = {Conigliaro, Davide and Rota, Paolo and Setti, Francesco and Bassetti, Chiara and Conci, Nicola and Sebe, Nicu and Cristani, Marco},
title = {The S-Hock Dataset: Analyzing Crowds at the Stadium},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2015}
}