Spoofing detection project in 2nd YAICON
🤗 박준영, YAI 9th
👤 변지혁, YAI 8th
👤 주다윤, YAI 10th
👤 김강현, YAI 10th
```bash
git clone https://github.com/junia3/Synthetic-Speech-Detection.git
cd Synthetic-Speech-Detection
conda create -n ssd python=3.8
conda activate ssd
pip install ipykernel  # Optional
python -m ipykernel install --user --name ssd --display-name ssd  # Optional
```
Adding the environment as a Jupyter notebook kernel is optional; you can skip this step if you do not plan to use notebooks.
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
pip install spafe matplotlib soundfile tqdm torchsummary
```
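If you want to double-check the environment before moving on, a quick sanity check along these lines (just a sketch, not part of this repository) confirms that PyTorch and torchaudio are installed and whether CUDA is visible:

```python
# Minimal environment sanity check (illustrative, not part of this repository)
import torch
import torchaudio

print("PyTorch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())  # expect True if the pytorch-cuda=11.7 build found a GPU
```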
This project uses the ASVspoof 2019 dataset, which can be downloaded from this page. Alternatively, you can download the LA.zip file directly with the following command (recommended).
```bash
curl -o LA.zip "https://datashare.ed.ac.uk/bitstream/handle/10283/3336/LA.zip?sequence=3&isAllowed=y"
unzip LA.zip -d datasets
```
Then your repository will have the following structure.
```
Synthetic-Speech-Detection
├── datasets
│   ├── dataset.py
│   ├── LA
│   │   ├── ASVspoof2019_LA_asv_protocols
│   │   ├── ASVspoof2019_LA_asv_scores
│   │   ├── ...
```
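To quickly verify that the archive extracted correctly, you can read one of the FLAC files with soundfile. The file name below is only an illustrative placeholder; substitute any .flac file from your extracted LA folder:

```python
# Quick dataset check (illustrative sketch; the path below is a hypothetical example)
import soundfile as sf

path = "datasets/LA/ASVspoof2019_LA_train/flac/LA_T_1138215.flac"  # replace with any extracted .flac file
audio, sample_rate = sf.read(path)
print("samples:", audio.shape, "sample rate:", sample_rate)  # ASVspoof 2019 LA audio is 16 kHz mono
```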
Detailed implementation notes are written on the dataset page.
Check the file 'baseline.ipynb'.
If you are planning to use the baseline model for training, we strongly recommend that you train for more than 50 epochs. This is because in the initial training steps, the validation loss may be unstable and fluctuate quite a bit. However, with more epochs, the validation loss tends to stabilize and converge to a more meaningful value.
So, to ensure that you get the best results from your training, we suggest training the baseline model for at least 50 epochs. Of course, you may need to adjust this number depending on the specifics of your project and dataset.
Thank you for using our code, and we wish you the best of luck with your training!
Audio spoofing is the act of attempting to deceive a system that relies on audio input, such as a speaker recognition system, by presenting a replayed recording or a synthesized/converted version of a legitimate user's voice instead of live speech. Detecting audio spoofing is important to prevent unauthorized access to sensitive information and protect against fraud. One way to evaluate the performance of an audio spoofing detection system is the EER metric. The EER is the point at which the false acceptance rate (FAR) and the false rejection rate (FRR) are equal.
The FAR is the proportion of spoof attacks that are incorrectly accepted as genuine, while the FRR is the proportion of genuine attempts that are incorrectly rejected as spoof attacks. Ideally, a spoofing detection system should have low values for both FAR and FRR.
To calculate the EER, a system is tested on a dataset of both genuine and spoofed audio samples, and the FAR and FRR are calculated at different thresholds. The threshold represents the level of confidence required for a system to classify a sample as either genuine or spoofed. The EER is the point where the FAR and FRR intersect on a Receiver Operating Characteristic (ROC) curve.
In summary, the EER metric is a useful way to evaluate the performance of an audio spoofing detection system. It takes into account both false acceptance and false rejection rates, and provides a single value that represents the level of performance achieved by the system.
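For intuition, a simplified EER computation (not the evaluation script used in this repository) might look like the following, where higher scores mean "more likely genuine":

```python
# Simplified EER illustration (not the official ASVspoof evaluation code)
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    # Sweep a threshold over all observed scores and track FAR / FRR at each point.
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])   # spoof accepted as genuine
    frr = np.array([(genuine_scores < t).mean() for t in thresholds])  # genuine rejected as spoof
    idx = np.argmin(np.abs(far - frr))                                  # point where the two rates cross
    return (far[idx] + frr[idx]) / 2, thresholds[idx]

# Toy example with random scores
rng = np.random.default_rng(0)
eer, thr = compute_eer(rng.normal(2.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000))
print(f"EER = {eer:.4f} at threshold {thr:.3f}")
```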
The t-DCF (tandem detection cost function) is a commonly used metric for evaluating audio spoofing detection systems, especially in the context of speaker verification, where the spoofing countermeasure operates in tandem with the ASV system. The t-DCF takes into account both the detection accuracy of the system and the costs associated with false acceptance and false rejection errors.
The t-DCF is calculated as the weighted sum of two costs: the false alarm cost (Cfa) and the missed detection cost (Cmiss). The false alarm cost represents the cost associated with incorrectly accepting a spoof attack as genuine, while the missed detection cost represents the cost associated with incorrectly rejecting a genuine attempt as a spoof attack.
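As a rough illustration, this weighted sum can be written in a simplified (non-tandem) form as

$$
\mathrm{DCF}(\tau) = C_{\mathrm{miss}} \, \pi_{\mathrm{tar}} \, P_{\mathrm{miss}}(\tau) + C_{\mathrm{fa}} \, (1 - \pi_{\mathrm{tar}}) \, P_{\mathrm{fa}}(\tau)
$$

where $\tau$ is the decision threshold, $\pi_{\mathrm{tar}}$ is the prior probability of a genuine (target) trial, and $P_{\mathrm{miss}}$, $P_{\mathrm{fa}}$ are the countermeasure's miss and false-alarm rates. The actual t-DCF used in the ASVspoof challenge extends this by also folding in the miss and false-alarm behavior of the ASV system operating in tandem with the countermeasure.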
The t-DCF metric is computed using a scoring function that assigns a score to each audio sample based on the likelihood that it is a genuine or a spoofed sample. The t-DCF metric is then calculated using a set of parameters that define the costs of false acceptance and false rejection errors, as well as the prior probabilities of genuine and spoofed samples in the dataset.
The t-DCF metric is especially useful for evaluating the performance of audio spoofing detection systems in real-world scenarios where the cost of false alarms and missed detections can be high. For example, in a speaker verification system used for financial transactions, a false alarm could result in unauthorized access to an account, while a missed detection could result in a legitimate user being denied access to their own account.
In summary, the t-DCF metric is a widely used evaluation metric in the field of audio spoofing detection; it takes into account the application-dependent costs associated with false acceptance and false rejection errors, as well as the priors of genuine and spoofed trials.
You need a few additional packages to run this program (for augmentation):
```bash
pip install soundfile
pip install audiomentations
```
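As a minimal sketch of how these two packages fit together (the exact transforms and parameters used by the demo may differ):

```python
# Illustrative audio augmentation with soundfile + audiomentations
# (the transform choice and parameters here are examples, not the demo's exact settings)
import numpy as np
import soundfile as sf
from audiomentations import Compose, AddGaussianNoise

augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    # audiomentations also offers TimeStretch, PitchShift, and more
])

audio, sr = sf.read("your_audio.flac")                          # any mono audio file
augmented = augment(samples=audio.astype(np.float32), sample_rate=sr)
sf.write("augmented.wav", augmented, sr)
```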
You should match the model specification (pre-trained model) with the service setting.
- Default setting (transform = LFCC, feature length = 750)

  ```bash
  python service.py
  ```

- Custom setting (choose your own transform or feature length)

  ```bash
  python service.py --feature 500 --transform cqcc
  ```
Be sure to download best_model.pt (the pretrained model) and place it in the same directory as service.py.
The front-end web app will run on your local URL.
| Model | EER(%) | t-DCF | Acc(%) | Link |
|---|---|---|---|---|
| RN18/500/LFCC | 6.74370 | 0.18492 | 67.7219 | download |
| RN18/750/LFCC | 6.19861 | 0.13363 | 75.7644 | download |
| RN34/500/LFCC | 7.24569 | 0.18298 | 81.2626 | download |
| RN34/750/LFCC | 8.40194 | 0.17825 | 60.7552 | download |
1. Press the 'Choose an audio file' button and upload your audio data.
2. If the audio file was uploaded properly, click 'Upload and play'.
3. Apply augmentation to your data. If you want to run it with the default setting, just set the two values.
4. Apply the augmentation and wait for data pre-processing to finish. Then click the 'Try Me!' button. You can use our service for free on your own device.
5. After inference is over, the result is presented on the page! It only takes a few seconds for a 10-second sample, even in a CPU-only environment!
6. With the 'show result' button, you can check the detailed model prediction!
As the 'Prediction' probability approaches zero, the likelihood that the voice is synthesized decreases. In other words, on the graph, blue regions (values less than 0.5) mark segments predicted to be real voice, while red regions (values greater than or equal to 0.5) mark segments predicted to be spoofed voice.
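As a tiny illustration of this 0.5 decision rule (the numbers are made up, and this is not the demo's plotting code):

```python
# Illustrative 0.5 decision rule for per-segment spoof probabilities (hypothetical values)
import numpy as np

probs = np.array([0.12, 0.35, 0.48, 0.61, 0.82])              # hypothetical "Prediction" probabilities
labels = np.where(probs < 0.5, "real (blue)", "spoof (red)")  # < 0.5 -> real, >= 0.5 -> spoof
for p, label in zip(probs, labels):
    print(f"prob={p:.2f} -> {label}")
```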
P.S. We have a small gift for anyone who runs the web demo application on their own device! Give it a try and discover what it is.