Haplogroups


Mitochondrial Haplogroups

There are several programs available for determining the mitochondrial haplogroup of a sample (see ...), but the two we use most often are Haplofind and HaploGrep.


Haplofind

Haplofind [2] is a web service: you submit a mitochondrial consensus sequence in FASTA format and it reports the most likely haplogroup. The pipeline always produces a mitochondrial consensus sequence in FASTA format, which you can find in /proj/snic2020-2-10/nobackup/private/Data/Human/Ancient/mt_consensus_strict/.
Haplofind does not handle ambiguous bases and sets them to N. Since our files do contain ambiguous bases, you might want to create a new mitochondrial consensus using only the unambiguous nucleotides. The command below does this and writes a new FASTA file.

mtconsensus_angsd.sh <BAMFILE> <OUTNAME>
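To judge whether regenerating the consensus is worthwhile, it can help to count how many positions Haplofind would turn into N. The following is a minimal sketch, not part of the pipeline; the function name count_ambiguous is made up for illustration, and it assumes that "ambiguous" means the IUPAC codes other than A, C, G, T and N.

```python
# Hypothetical helper, not part of the pipeline.
# Assumption: ambiguous bases are the IUPAC codes other than A, C, G, T and N.
AMBIGUITY_CODES = set("RYSWKMBDHV")


def count_ambiguous(fasta_text: str) -> int:
    """Count ambiguous IUPAC bases in FASTA-formatted text."""
    total = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header lines carry the sequence name, not bases.
            continue
        total += sum(1 for base in line.upper() if base in AMBIGUITY_CODES)
    return total
```

The text of a consensus file can be passed in after reading it, for example with pathlib.Path(...).read_text(). A count of zero means the file can go to Haplofind unchanged.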

HaploGrep

HaploGrep [3] can be used either on its website or from the terminal, together with PhyloTree 17 [1], to identify the haplogroup. It is similar to Haplofind but can handle ambiguous bases, so here you can use the files in /proj/snic2020-2-10/nobackup/private/Data/Human/Ancient/mt_consensus_strict/ directly.

In the Umbrella bin we have downloaded HaploGrep 2.1.16, which is used for the sequence stats in the Atlas database. It estimates the haplogroup using PhyloTree 17 and writes the output to a small text file.

java -jar haplogrep-2.1.16.jar --in /proj/snic2020-2-10/nobackup/private/Data/Human/Ancient/mt_consensus_strict/${file} --format fasta --out ${file}.haplogroups.txt
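When several consensus files need to be classified, the command above can be assembled per file. This is a minimal sketch that only mirrors the flags shown above (--in, --format, --out); the function name haplogrep_command is hypothetical, and the location of the jar file is an assumption that has to be adapted.

```python
from pathlib import Path


def haplogrep_command(fasta, jar="haplogrep-2.1.16.jar"):
    """Build the HaploGrep call shown above for one FASTA file.

    Assumption: `jar` points to the HaploGrep jar file; adjust the path
    to wherever it is stored.
    """
    fasta = Path(fasta)
    return [
        "java", "-jar", str(jar),
        "--in", str(fasta),
        "--format", "fasta",
        # As in the command above, the output goes to the current directory.
        "--out", f"{fasta.name}.haplogroups.txt",
    ]
```

Each returned list can be passed to subprocess.run(..., check=True), for example in a loop over the files in the mt_consensus_strict directory.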




Y Chromosome Haplogroups

For the definition of haplogroups, Torsten has prepared 3 different lists of biallelic SNP sites from PhylotreeY (20160309):

/proj/snic2020-2-10/private/Data/Human/Ancient/bin/Ytyping/Y.snps.Phylo.noIndel.bed - all single-base substitutions from Phylotree

/proj/snic2020-2-10/private/Data/Human/Ancient/bin/Ytyping/Y.snps.Phylo.noPMD.noIndel.bed - the same as above but without transitions, since those could be post-mortem damage

/proj/snic2020-2-10/private/Data/Human/Ancient/bin/Ytyping/Y.snps.Phylo.noPMD.noIndel.noStrand.bed - the same as above but with A/T and G/C sites also removed, to exclude strand misidentification

The first file is the most comprehensive regarding haplogroup information. However, a single damaged site can lead to conflicts in the classification. An alternative to using the “noPMD” version is to rescale the base qualities of Ts and As close to fragment ends (see mapDamage2.0). That is not part of our normal sequence processing, but you might want to do it for your specific sample. Furthermore, I have noticed that some sites in the Phylotree list may be defined for the wrong strand. For example, I found one marker, CTS616, which defines I2a2a1, derived in samples where no other SNP indicated an I; since it is a C->G SNP, I suspect that something went wrong with the strand. Such sites should be treated with caution, hence the “noStrand” file.
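The logic behind the three lists can be sketched as a classification of each SNP by its two alleles. This is an illustration only, not the code that produced the BED files; the function name snp_filter_class and its labels are made up here.

```python
# Illustration only; this is not the code that produced the BED files.
TRANSITIONS = {frozenset("CT"), frozenset("GA")}       # possible post-mortem damage
STRAND_AMBIGUOUS = {frozenset("AT"), frozenset("CG")}  # identical on both strands


def snp_filter_class(allele1: str, allele2: str) -> str:
    """Return which filter, if any, would remove a SNP with these two alleles."""
    pair = frozenset((allele1.upper(), allele2.upper()))
    if pair in TRANSITIONS:
        return "transition"          # dropped in the noPMD lists
    if pair in STRAND_AMBIGUOUS:
        return "strand_ambiguous"    # additionally dropped in the noStrand list
    return "other"                   # kept in all three lists
```

Under this scheme only A/C and G/T sites remain in the “noStrand” list, and a C->G marker such as CTS616 falls into the strand-ambiguous class.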

Here is how to use it:

  1. Log in to Uppmax and start an interactive session.

  2. Load the modules bioinfo-tools and samtools/1.3

  3. Create a pileup file of your sample’s Y chromosome at the sites of interest:

    samtools mpileup -B -q 30 -Q 30 -f /proj/snic2020-2-10/private/Data/Human/Ancient/ref_seqs/hs37d5.fa -l <insert the BED file of your choice> <insert BAM file of your choice> > sample.out

    Note

    1. This only works for BAM files mapped to the hg19 reference genome.

    2. The trailing “> sample.out” is part of the command and writes the output to a file called sample.out (you can use any other file name, but you need the ‘>’). The two “<>” placeholders must be replaced by file names with their absolute paths.

    3. -q and -Q set the minimum mapping quality and base quality, respectively. 30 is our usual cutoff, but it might be a little stringent.

  4. Run the script Ytyper.py to merge the pileup with the Phylotree haplogroup definitions:

    Ytyper.py <sample.out>

    (replace sample.out with the name of your mpileup output file)

    This produces a file sample.out.merge.txt with one line per mutation that was called in your individual. The file is tab-separated and each column has a title. Please note that the reference sequence does not always match the ancestral allele. In the mpileup column, dots and commas indicate reads identical to the reference, and alternative bases are shown as letters (see the samtools mpileup documentation for the full list of codes). Each line is annotated as ancestral, derived or conflict. A conflict is either two different alleles at that site (contamination or a sequencing/mapping error) or an allele that differs from both alleles in Phylotree (a sequencing/mapping error or damage).
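The annotation rules described above can be sketched as follows. This is not the code of Ytyper.py, whose internals are not shown here; the function names, the argument order and the "no_data" label are assumptions for illustration. The parser skips read-start markers (a caret plus one mapping-quality character), read-end markers, indel descriptions and deletion placeholders, and ignores N calls.

```python
from collections import Counter


def count_pileup_bases(bases: str) -> Counter:
    """Count reference matches ('ref') and alternative bases in an mpileup base string."""
    counts = Counter()
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            i += 2  # skip '^' and the mapping-quality character after it
            continue
        if ch in "+-":
            # Indel description: sign, length digits, then that many bases.
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j]) if j > i + 1 else 0
            i = j + length
            continue
        if ch in ".,":
            counts["ref"] += 1
        elif ch.upper() in "ACGT":
            counts[ch.upper()] += 1
        # '$' (read end), '*' (deletion) and N calls are ignored.
        i += 1
    return counts


def classify_site(bases: str, ref: str, ancestral: str, derived: str) -> str:
    """Annotate one site following the rules described in the text above."""
    observed = {ref.upper() if key == "ref" else key
                for key in count_pileup_bases(bases)}
    if not observed:
        return "no_data"   # hypothetical label for sites without usable reads
    if len(observed) > 1:
        return "conflict"  # two different alleles at the site
    allele = observed.pop()
    if allele == derived.upper():
        return "derived"
    if allele == ancestral.upper():
        return "ancestral"
    return "conflict"      # allele matches neither Phylotree allele
```

For example, classify_site(".,", "T", "C", "T") gives "derived": the reads match the reference, and the reference itself carries the derived allele.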


The derived sites are the most interesting for haplogroup classification. You can list them by executing:
grep "derived" sample.out.merge.txt
This gives you a list of all derived mutations in your sample. You can then check those mutations against Phylotree, ISOGG or Wikipedia to learn more about the haplogroups.

You may also want to check ancestral alleles, e.g. to find out whether your sample is actually an X (because other mutations are ancestral) or an X* (because you don’t have data for those other mutations):
grep ancestral sample.out.merge.txt

Finally, you should check conflicting sites, which can also serve as a quality check for your sample:
grep "conflict" sample.out.merge.txt
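For a quick overview, the three grep calls can be combined into one tally. This is a minimal sketch with a hypothetical function name; because the column layout of the merge file is not described here, it matches the annotation words anywhere in a line, as grep does.

```python
from collections import Counter

LABELS = ("ancestral", "derived", "conflict")


def tally_annotations(lines) -> Counter:
    """Count lines containing each annotation, mirroring the grep calls above."""
    tally = Counter()
    for line in lines:
        for label in LABELS:
            if label in line:
                tally[label] += 1
    return tally
```

An open file handle can be passed directly, e.g. tally_annotations(open("sample.out.merge.txt")). A high share of conflict lines relative to derived and ancestral ones may point to contamination or damage.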


Bibliography

[1] Mannis van Oven. PhyloTree Build 17: Growing the human mitochondrial DNA tree. Forensic Science International: Genetics Supplement Series, 5:e392–e394, December 2015. URL: https://linkinghub.elsevier.com/retrieve/pii/S1875176815302432, doi:10.1016/j.fsigss.2015.09.155.
[2] Dario Vianello, Federica Sevini, Gastone Castellani, Laura Lomartire, Miriam Capri, and Claudio Franceschi. HAPLOFIND: A New Method for High-Throughput mtDNA Haplogroup Assignment. Human Mutation, 34(9):1189–1194, September 2013. URL: http://onlinelibrary.wiley.com/doi/10.1002/humu.22356/abstract, doi:10.1002/humu.22356.
[3] Hansi Weissensteiner, Dominic Pacher, Anita Kloss-Brandstätter, Lukas Forer, Günther Specht, Hans-Jürgen Bandelt, Florian Kronenberg, Antonio Salas, and Sebastian Schönherr. HaploGrep 2: mitochondrial haplogroup classification in the era of high-throughput sequencing. Nucleic Acids Research, 44(W1):W58–W63, July 2016. URL: https://academic.oup.com/nar/article/44/W1/W58/2499296, doi:10.1093/nar/gkw233.