A Lightweight LFCC–CNN Framework for Robust Audio Deepfake Detection Under Noisy and Cross-Dataset Conditions

CH. Bhupati, E. Gopinandha sai, K. S. Paarthipan, G. Chandrika

doi:10.7492/f89msq59

Abstract

Audio deepfakes, including synthesized and voice-converted speech, undermine trust in multimedia content and pose security threats to automatic speaker verification (ASV) systems. To counter this threat, we propose a novel lightweight LFCC-based CNN architecture for reliable spoofed-speech detection on the ASVspoof benchmark datasets. Compared with Mel-frequency cepstral coefficient (MFCC) features, linear-frequency cepstral coefficient (LFCC) features, which are widely used in ASV systems, better preserve the linear spectral characteristics of speech that are distorted during spoofing attacks, and thus carry additional cues for distinguishing spoofed from genuine speech. The proposed model is trained on the ASVspoof 2019 Logical Access dataset and evaluated under three challenging scenarios: (1) clean speech, (2) additive noise at signal-to-noise ratios (SNRs) of 10 dB and 20 dB, and (3) cross-dataset evaluation on ASVspoof 2021 Logical Access. Empirical results show that the proposed LFCC–CNN framework achieves a significantly lower equal error rate (EER) than the mainstream MFCC-based system in all three scenarios. For interpretability, we also apply a gradient-based visualization technique (Grad-CAM) to localize discriminative time-frequency regions. The findings indicate that LFCC features considerably improve robustness under noisy and cross-dataset conditions, and that the proposed framework is a suitable lightweight solution for a variety of applications.
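
The two measurable pieces of the evaluation protocol named in the abstract, additive noise at a fixed SNR and scoring by EER, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mix_at_snr` and `equal_error_rate` are hypothetical, the abstract does not say which noise type was used (Gaussian noise is used here purely as a stand-in), and the score convention (higher score means "more likely bona fide") is assumed.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("signals must have non-zero power")
    # From SNR_dB = 10 * log10(P_speech / P_noise), solve for the target noise power.
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_p_noise / p_noise)
    return [s + gain * n for s, n in zip(speech, noise)]


def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER, assuming higher scores mean 'more likely bona fide'."""
    best_gap, eer = float("inf"), None
    # Every observed score is tried as a decision threshold.
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: bona fide utterance scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed utterance scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Illustrative usage with synthetic data (assumed values, not from the paper).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy_10db = mix_at_snr(speech, noise, snr_db=10)
noisy_20db = mix_at_snr(speech, noise, snr_db=20)
```

The EER here is taken at the observed threshold where the false acceptance and false rejection rates are closest, which is a coarse approximation; interpolating between thresholds would give a smoother estimate on small score sets.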