Generalizable and Interpretable Deepfake Detection via Multi-Scale Vector Transformer Fusion Network
Keywords:
Deepfake Detection, Crossover Forex Component Analysis (CFCA), Multi-Scale Vector Transformer Fusion Network (MSVTF-Net), FaceForensics++ Dataset, Explainable AI
Abstract
The rapid progress of deepfake generation threatens information integrity, digital security, and public trust. State-of-the-art detectors tend to rely on low-level convolutional features or to train only on the datasets they were built for; the former limits generalization to unseen manipulations, and the latter limits interpretability. To address these issues, we propose a two-stage detection framework that combines Crossover Forex Component Analysis (CFCA) with the Multi-Scale Vector Transformer Fusion Network (MSVTF-Net). In the first stage, CFCA extracts crossover frequency-domain and residual features from manipulated facial regions by factoring video frames into component subspaces, capturing subtle inconsistencies that are invisible to the human eye. The resulting multi-component vectors are then fed into MSVTF-Net, the second stage, which performs hierarchical transformer-based vector fusion at multiple scales, integrating local and global spatiotemporal cues for robust classification and interpretable, attention-based localization of manipulated regions. The pipeline is evaluated on the open-access FaceForensics++ dataset, making it reproducible and supporting fair benchmarking. Experimental results indicate that the CFCA→MSVTF-Net framework substantially outperforms state-of-the-art baselines in cross-manipulation detection accuracy, robustness, and interpretability, a practical step toward trustworthy deepfake forensic applications.
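The abstract describes a pipeline of two stages: frequency-domain/residual feature extraction per frame, followed by multi-scale fusion of the resulting vectors. As the paper's actual CFCA and MSVTF-Net implementations are not given here, the sketch below is purely illustrative: it uses per-block FFT energy residuals as a stand-in for CFCA features and simple multi-scale pooling as a stand-in for transformer fusion; all function names and parameters (`cfca_features`, `msvtf_fuse`, `block`, `scales`) are hypothetical.

```python
import numpy as np

def cfca_features(frame, block=8):
    """Illustrative stand-in for CFCA: per-block high-frequency residual
    energy (total spectral magnitude minus the central low-frequency band)."""
    h, w = frame.shape
    feats = []
    for i in range(0, h - h % block, block):
        for j in range(0, w - w % block, block):
            spec = np.fft.fftshift(np.fft.fft2(frame[i:i + block, j:j + block]))
            mag = np.abs(spec)
            c = block // 2
            low = mag[c - 1:c + 2, c - 1:c + 2].sum()  # low-frequency energy
            feats.append(mag.sum() - low)              # high-frequency residual
    return np.array(feats)

def msvtf_fuse(feats, scales=(1, 2, 4)):
    """Illustrative stand-in for multi-scale fusion: pool the feature vector
    at several scales and concatenate the pooled views."""
    fused = []
    for s in scales:
        n = len(feats) // s * s                        # trim to a multiple of s
        fused.append(feats[:n].reshape(-1, s).mean(axis=1))
    return np.concatenate(fused)

# A 32x32 frame yields 16 block features; scales (1, 2, 4) give 16+8+4 = 28 dims.
frame = np.random.default_rng(0).random((32, 32))
vec = msvtf_fuse(cfca_features(frame))
```

A real system would replace the pooling step with learned attention across scales, which is what yields the interpretable manipulation-localization maps the abstract mentions.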
Copyright (c) 2025 K. Thulasimani, G. Kasthuri

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.