Evaluating the Generalizability of Deepfake Detection Models: A Comparative Analysis of GAN and Diffusion-Based Generated Content

UIU Institutional Repository


    View/Open
    Thesis - Md. Tarek Hasan 0122410019 - MSCSE (1).pdf (12.63Mb)
    Date
    2025-08-02
    Author
    Hasan, Md. Tarek
    Abstract
    The rapid advancement of deepfake technology has significantly increased the realism and accessibility of synthetic media. Emerging techniques such as diffusion-based models and Neural Radiance Fields (NeRF), along with improvements in traditional Generative Adversarial Networks (GANs), have enabled the sophisticated generation of deepfake videos, posing growing threats to biometric security and trust. In parallel, detection methods have advanced through innovations in Transformer-based architectures, contrastive learning, and other deep learning approaches. Yet this progress continues to play out within a persistent cat-and-mouse dynamic between generation and detection. In this work, we present a comprehensive empirical evaluation of state-of-the-art deepfake detection methods, alongside a human subject study focused on identifying deepfakes, using a curated stimulus set generated by cutting-edge deepfake synthesis techniques. Unlike prior efforts, our study establishes a benchmark that captures the challenges posed by the latest generation methods in realistic settings. Our findings expose a critical vulnerability: both leading detection models and human evaluators struggle when confronted with high-quality, modern deepfakes. To address this gap, we introduce a multimodal detection framework that incorporates both audio and visual modalities, enhancing the robustness of detection systems in cross-modal scenarios. Our methodology includes evaluating performance across diverse conditions, such as different resolutions and clip lengths, and comparing unimodal versus multimodal fusion strategies. Extensive experimentation highlights the urgent need to refine detection models to keep pace with rapidly evolving generative techniques. By establishing a rigorous benchmark and revealing current limitations, our study offers a timely foundation for developing more robust and future-ready deepfake detection systems. Our results demonstrate that incorporating the audio modality alongside video consistently improves detection performance, underscoring the value of multimodal analysis for robust generalization. Notably, the proposed multimodal framework, evaluated on the FakeAVCeleb and AV-Deepfake1M datasets, achieved superior performance across all tested conditions, with early fusion yielding the highest AUC and precision, and cross-modal attention demonstrating particular effectiveness under low-resolution scenarios.
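    The two fusion strategies contrasted in the abstract can be illustrated with a short sketch. The PyTorch snippet below is a minimal, hypothetical illustration, not the thesis's actual architecture: early fusion concatenates per-clip audio and visual embeddings before a shared classifier, while cross-modal attention lets visual tokens attend to audio tokens before temporal pooling. All module names, layer sizes, and feature dimensions are assumptions chosen for clarity.

        import torch
        import torch.nn as nn

        class EarlyFusionDetector(nn.Module):
            """Early fusion: concatenate audio and visual embeddings, then classify."""
            def __init__(self, vis_dim=512, aud_dim=128, hidden=256):
                super().__init__()
                self.classifier = nn.Sequential(
                    nn.Linear(vis_dim + aud_dim, hidden),
                    nn.ReLU(),
                    nn.Linear(hidden, 1),  # single real-vs-fake logit
                )

            def forward(self, vis_emb, aud_emb):
                fused = torch.cat([vis_emb, aud_emb], dim=-1)  # (batch, vis_dim + aud_dim)
                return self.classifier(fused)

        class CrossModalAttentionDetector(nn.Module):
            """Cross-modal attention: visual tokens query audio tokens, then pool."""
            def __init__(self, dim=256, heads=4):
                super().__init__()
                self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.head = nn.Linear(dim, 1)

            def forward(self, vis_tokens, aud_tokens):
                # query = per-frame visual tokens, key/value = per-frame audio tokens
                attended, _ = self.attn(vis_tokens, aud_tokens, aud_tokens)
                pooled = attended.mean(dim=1)  # average over the temporal dimension
                return self.head(pooled)

        # Toy usage with random features standing in for real per-clip embeddings
        vis_emb, aud_emb = torch.randn(8, 512), torch.randn(8, 128)
        print(EarlyFusionDetector()(vis_emb, aud_emb).shape)          # torch.Size([8, 1])

        vis_tok, aud_tok = torch.randn(8, 16, 256), torch.randn(8, 50, 256)
        print(CrossModalAttentionDetector()(vis_tok, aud_tok).shape)  # torch.Size([8, 1])

    In this sketch, early fusion only needs clip-level embeddings of the two modalities, whereas cross-modal attention operates on token sequences, which is one way an attention-based fusion can remain informative when individual visual frames are degraded (for example, at low resolution).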
    URI
    http://dspace.uiu.ac.bd/handle/52243/3310
    Collections
    • M.Sc Thesis/Project [156]

    Copyright 2003-2017 United International University
    Contact Us | Send Feedback
    Developed by UIU CITS
     

     
