This was the final project for Brown's "Data-Driven Vision" graduate seminar, taught in Fall 2010 by James Hays.
Local space-time descriptors have become a successful video representation for human action classification. Recent investigations suggest that for diverse, realistic datasets, computing these descriptors on a dense, regular grid outperforms sparse interest point methods. We suggest this dense sampling succeeds by capturing meaningful global context in addition to local dynamics. In this investigation, we evaluate this claim by comparing state-of-the-art space-time descriptors (HOG and HOF) with static scene descriptors (GIST and dense SIFT) in a bag-of-features classification task on the Hollywood2 actions dataset. Results indicate that for one-third of the tested action categories, static scene descriptors can outperform dynamic ones. We also show that combining static and dynamic descriptors yields further improvements, suggesting novel avenues for future research in video representation.
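The bag-of-features pipeline mentioned above can be sketched minimally: cluster local descriptors into a visual vocabulary, then represent each video as a normalized histogram of word assignments. This is an illustrative sketch only; the vocabulary size, number of iterations, and plain Lloyd's k-means here are assumptions, not the settings used in the project.

```python
import numpy as np

def bag_of_features(train_descs, video_descs, k=100, iters=10, seed=0):
    """Quantize local descriptors into k-word histograms (bag of features).

    Hypothetical sketch: k=100 and plain Lloyd's k-means are assumptions,
    not the project's actual vocabulary settings.
    """
    rng = np.random.default_rng(seed)
    X = np.vstack(train_descs)                       # pool training descriptors
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):                           # Lloyd iterations
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for c in range(k):
            pts = X[labels == c]
            if len(pts):
                centers[c] = pts.mean(0)
    # histogram each video's descriptors over the learned vocabulary
    hists = []
    for D in video_descs:
        dists = ((D[:, None, :] - centers[None]) ** 2).sum(-1)
        h = np.bincount(dists.argmin(1), minlength=k).astype(float)
        hists.append(h / max(h.sum(), 1.0))          # L1-normalize
    return np.array(hists)
```

The resulting histograms would then feed a standard classifier (e.g. an SVM), which is the usual final step in such pipelines.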
For this project, I developed my own implementations of both the HOG and HOF descriptors in order to understand them better. I didn't use the multi-scale representations that are common in state-of-the-art HOG/HOF packages (e.g. Laptev's STIP 2.0). I did, however, use standard implementations of GIST and SIFT, so the comparison may be somewhat lopsided.
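A single-scale HOG-style descriptor like the one described can be sketched as follows: compute image gradients, bin unsigned orientations weighted by gradient magnitude over a grid of cells, and normalize. The cell grid and bin count here are hypothetical defaults, not the project's actual parameters (HOF is analogous, with optical-flow vectors replacing intensity gradients).

```python
import numpy as np

def hog_descriptor(patch, n_bins=8, cells=(2, 2)):
    """Single-scale HOG-like descriptor for a 2D grayscale patch.

    Illustrative sketch; 8 orientation bins and a 2x2 cell grid are
    assumptions, not the parameters used in the project.
    """
    gy, gx = np.gradient(patch.astype(float))        # per-pixel gradients
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # unsigned orientation
    bin_idx = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)

    h, w = patch.shape
    ch, cw = h // cells[0], w // cells[1]
    desc = []
    for i in range(cells[0]):
        for j in range(cells[1]):
            sl = (slice(i * ch, (i + 1) * ch), slice(j * cw, (j + 1) * cw))
            # magnitude-weighted orientation histogram for this cell
            hist = np.bincount(bin_idx[sl].ravel(),
                               weights=mag[sl].ravel(),
                               minlength=n_bins)
            desc.append(hist)
    desc = np.concatenate(desc)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```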
Attached at the bottom of this page.