Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Majority of SNPs have allele frequency = 0.000000 #2321

Closed
ryanskiba opened this issue Nov 19, 2024 · 4 comments
Closed

Majority of SNPs have allele frequency = 0.000000 #2321

ryanskiba opened this issue Nov 19, 2024 · 4 comments

Comments

@ryanskiba
Copy link

ryanskiba commented Nov 19, 2024

Hi,

I'm using bcftools to call and filter markers for a haploid biparental population of 118 individuals. I have Illumina sequencing for all progeny. After indexing my reference genome (one of the parents of the biparental population), converting my .fq files to .sam to sorted .bam files, I called markers using:
bcftools mpileup -Ou -f genomic.fasta *.bam | bcftools call -mv -Ob --ploidy 1 --threads 4 -o calls1.bcf

and convert to .vcf using: bcftools view -Ov calls1.bcf > test.vcf

I then run bcftools stats calls1.bcf > stats.txt.

I don't understand why, out of the 3.2 million SNPs in my .bcf file, roughly 1.9 million have an allele frequency of 0.000000. Additionally when I use grep -c "AC=0" calls1.vcf, it gives me a result of 0. So I'm not sure how to look at these markers that are supposedly present at such low frequency. Nor do I understand why markers would be called if they truly aren't present in the population.

Here's a portion of the data from bcftools stats:

# AF, Stats by non-reference allele frequency:
# AF    [2]id   [3]allele frequency     [4]number of SNPs       [5]number of transitions        [6]number of transversions      [7]number of indels     [8]repeat-consistent    [9]repeat-inconsistent  [10]not applicable
AF      0       0.000000        1925317 1318327 606990  271     0       0       271
AF      0       0.008475        25134   23556   1578    34      0       0       34
AF      0       0.016949        86417   79970   6447    110     0       0       110
AF      0       0.025424        82610   76945   5665    115     0       0       115

I can get rid of these by filtering with a minor allele frequency, but I'd like to know why they were present in the first place.

Thanks!

@pd3
Copy link
Member

pd3 commented Dec 1, 2024

Without seeing the VCF it is hard to say if the problem is with the data or the program. I suggest you take a small region and check if the stats counts correspond to what you are able to verify manually. Then, if you find a discrepancy, can provide a small test case to reproduce the problem?

@ryanskiba
Copy link
Author

ryanskiba commented Dec 2, 2024

seg10stats.txt
vcf.txt

Attached is a small VCF file (in txt format) as well as the result I got running bcftools stats on this vcf file. the VCF file contains 11 snps. 10 are present in 1/118 individuals and 1 is present in 67 individuals. When I do the math myself, I would expect the allele frequency to be 1/118 = 0.008475 for the first SNP and 67/118 = 0.567797 for the second SNP. the stats file however gives this result:

# AF    [2]id   [3]allele frequency     [4]number of SNPs       [5]number of transitions        [6]number of transversions      [7]number of indels     [8]repeat-consistent    [9]repeat-inconsistent  [10]not applicable
AF      0       0.000000        10      9       1       0       0       0       0
AF      0       0.559322        1       1       0       0       0       0       0

Thanks for your help!

@pd3
Copy link
Member

pd3 commented Dec 4, 2024

I see what you mean now. The output should be interpreted differently, there are 10 variants in the first allele frequency bin. There are 100 bins by default, the 0.008475 falls in the first bin (0,0.01). The option --af-bins allows to give explicit binning, for example this creates two bins --af-bins 0,1e-4,1

$ bcftools stats vcf.txt --af-bins 1e-4,1 | grep ^AF
AF	0	0.000050	10	9	1	0	0	0	0
AF	0	0.500050	1	1	0	0	0	0	0

I see the output is a bit confusing, in the first case (--af-bins not given) it prints the first bin as 0, but when the explicit list of bins is given, it prints the midpoint of the interval (0.000050). Traditionally we were interested only in variant sites, ie. AF would be always bigger than 0, so the output never considered such case.

This could be improved, but hopefully it sheds some light on what the output means.

@ryanskiba
Copy link
Author

Thank you for explaining that! I think I will use bcftools +fill-tags to add the AF tag to my marker sets and then sort into bins myself in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants