Access Type

Open Access Dissertation

Date of Award

January 2018

Degree Type


Degree Name



Computer Science

First Advisor

Xue-wen Chen


Convolutional neural networks (ConvNets) have improved the state of the art in many applications. Face recognition, for example, has seen significantly improved performance due to ConvNets. However, less attention has been given to video-based face recognition. Here, we make three contributions to video-based face recognition.

First, we propose a ConvNet-based system for long-term face tracking in videos. By taking advantage of deep learning models pre-trained on big data, we develop a novel system for accurate video face tracking in unconstrained environments, where various people and objects move in and out of the frame. The system uses a Detection-Verification-Tracking (DVT) method, which accomplishes long-term face tracking through the collaboration of face detection, face verification, and (short-term) face tracking. An online-trained detector based on cascaded convolutional neural networks localizes all faces appearing in the frames, and an online-trained face verifier based on deep convolutional neural networks and similarity metric learning decides whether any face, and if so which one, corresponds to the query person. An online-trained tracker follows the face from frame to frame. When validated on a sitcom episode and a TV show, the DVT method outperforms tracking-learning-detection (TLD) and face-TLD in terms of recall and precision. The proposed system was also tested on many other types of videos and shows very promising results.
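The collaboration between the three DVT components can be illustrated with a minimal control-loop sketch. This is only one plausible arrangement, not the dissertation's actual logic: the function `dvt_track`, the callables `detect`, `verify`, and `track`, and the `threshold` parameter are all hypothetical names introduced for illustration. The sketch assumes the short-term tracker returns `None` when it loses the face, at which point the detector and verifier are used to re-acquire the query person.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # hypothetical box format: (x, y, width, height)


def dvt_track(
    frames: Sequence[object],
    detect: Callable[[object], List[Box]],
    verify: Callable[[object, Box], float],
    track: Callable[[object, Box], Optional[Box]],
    threshold: float = 0.5,
) -> List[Optional[Box]]:
    """Return one box (or None) per frame for the query person.

    Assumed behavior of the placeholder callables:
      detect(frame)      -> all candidate face boxes in the frame
      verify(frame, box) -> similarity of that face to the query person
      track(frame, box)  -> box in the new frame, or None if the face is lost
    """
    results: List[Optional[Box]] = []
    last_box: Optional[Box] = None
    for frame in frames:
        box: Optional[Box] = None
        if last_box is not None:
            # Short-term tracking: follow the face from the previous frame.
            box = track(frame, last_box)
        if box is None:
            # Tracker not initialized or lost: detect all faces, then keep
            # the best-verified one if it is similar enough to the query.
            scored = [(verify(frame, c), c) for c in detect(frame)]
            if scored:
                best_score, best_box = max(scored, key=lambda s: s[0])
                if best_score >= threshold:
                    box = best_box
        results.append(box)
        last_box = box
    return results
```

In this arrangement the detector and verifier run only when the tracker has no valid target, which keeps the per-frame cost low; a real system might also re-verify tracked boxes periodically to guard against drift.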

Second, because the availability of large-scale training datasets has a significant effect on the performance of ConvNet-based recognition methods, we present an automatic video collection approach for generating a large-scale video training dataset. We design a procedure for generating a face verification dataset from videos based on the long-term face tracking algorithm, DVT. In this procedure, face streams are collected from videos and labeled automatically, without human annotation. Using this procedure, we assembled a widely scalable dataset, FaceSequence, which includes 1.5M streams capturing ~500K individuals. A key distinction between this dataset and existing video datasets is that FaceSequence is generated from publicly available videos and labeled automatically, and is hence widely scalable at no annotation cost.

Lastly, we introduce a stream-based ConvNet architecture for the video face verification task. The proposed network is designed to optimize a differentiable error function, referred to as stream loss, using unlabeled temporal face sequences. Using the unlabeled video dataset FaceSequence, we trained our network to minimize the stream loss. The network achieves verification accuracy comparable to the state of the art on the LFW and YTF datasets with much smaller model complexity. In comparison to VGG, our method demonstrates a significant improvement in TAR/FAR, which is notable given that the VGG dataset is highly purified and contains little label noise. We also fine-tuned the network using the IJB-A dataset. The validation results show verification accuracy competitive with the best previous video face verification results.
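For readers unfamiliar with the TAR/FAR metric mentioned above, the following minimal sketch shows one common way to compute the true accept rate (TAR) at a target false accept rate (FAR) from similarity scores. It is not taken from the dissertation: the function name `tar_at_far`, its parameters, and the thresholding convention (accept a pair when its score is strictly above the threshold) are assumptions made for illustration.

```python
import numpy as np


def tar_at_far(genuine_scores, impostor_scores, target_far):
    """Compute TAR at the strictest threshold whose FAR is <= target_far.

    genuine_scores:  similarity scores of same-identity pairs
    impostor_scores: similarity scores of different-identity pairs
    A pair is accepted when its score is strictly greater than the threshold.
    Returns (tar, far, threshold).
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    impostor = np.sort(np.asarray(impostor_scores, dtype=float))[::-1]

    # Number of impostor pairs we may accept without exceeding target_far.
    allowed = int(np.floor(target_far * impostor.size))
    if allowed >= impostor.size:
        threshold = -np.inf  # every pair may be accepted
    else:
        # Accepting scores strictly above the (allowed + 1)-th highest
        # impostor score admits at most `allowed` impostor pairs.
        threshold = float(impostor[allowed])

    tar = float(np.mean(genuine > threshold))
    far = float(np.mean(impostor > threshold))
    return tar, far, threshold


# Toy scores, made up purely for illustration.
tar, far, thr = tar_at_far([0.9, 0.8, 0.7, 0.4], [0.6, 0.5, 0.3, 0.2], 0.25)
```

With tied impostor scores, the achieved FAR may fall below the target, which is why the function returns the realized FAR alongside the TAR.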