aDNA Pipeline


Current Version

The current version of the pipeline uses cutadapt v. 2.3 for adapter trimming and FLASH v. 1.2.11 for merging fastq reads. Cutadapt searches for a predefined adapter sequence and trims the read if at least 3 bp overlap between the end of the read and the adapter sequence. The pipeline is also set up to distinguish between dual and single indexing, as well as between HiSeq and NovaSeq sequencing techniques, and it accepts a 20% error level in the overlapping region. Once the reads are trimmed, FLASH collapses each read pair into a single read if the pair overlaps by at least 11 bp. The FLASH output fastq files (ExtendedFrags, notCombined_1 and notCombined_2) are then merged into a single fastq file whose name contains CutAdapt-eq_set-FLASH_corrected and ends with .all.fastq.gz. You can find these fastq files in /proj/snic2020-2-10/1000AncientGenomes/mergedfastqs/.
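The trimming rule described above (minimum overlap of 3 bp, up to 20% errors in the overlap) can be illustrated with a minimal Python sketch. This is a simplified, mismatch-only illustration and not cutadapt's actual alignment algorithm, which also handles insertions and deletions. The function name and the adapter sequence used in the example are assumptions made for illustration; the adapters actually configured in the pipeline are not stated here.

```python
def trim_3prime_adapter(read, adapter, min_overlap=3, max_error_rate=0.2):
    """Return the read with a 3' adapter removed, or the read unchanged.

    Simplified sketch: only mismatches are counted (no indels), and the
    leftmost acceptable adapter position is used.
    """
    for start in range(len(read)):
        # Number of bases where the read suffix and the adapter prefix overlap.
        overlap = min(len(read) - start, len(adapter))
        if overlap < min_overlap:
            break  # Remaining suffixes are too short to count as adapter.
        mismatches = sum(
            1 for a, b in zip(read[start:start + overlap], adapter) if a != b
        )
        # Allowed errors scale with the overlap length, rounded down.
        if mismatches <= int(max_error_rate * overlap):
            return read[:start]
    return read


# Hypothetical example read: 8 bp of insert followed by 7 bp of adapter.
ADAPTER = "AGATCGGAAGAGC"  # assumed adapter prefix, for illustration only
print(trim_3prime_adapter("ACGTACGTAGATCGG", ADAPTER))
```

With a 3 bp overlap the allowed number of errors rounds down to zero, so very short adapter remnants must match exactly to be trimmed.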

This essentially single-end fastq file is then mapped against a reference genome using bwa aln (-l 16500 -n 0.01 -o 2). The resulting BAM file contains all mapped reads (including PCR duplicates, short reads, etc.) and is located in /proj/snic2020-2-10/1000AncientGenomes/hg19bams/mapped/.
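As a rough sketch of how the mapping call with the parameters above could be assembled, the Python snippet below builds the bwa aln argument list without running it. The helper name and the file names in the example are hypothetical; only the three option values come from the text above. The -f option (write the .sai output to a file instead of stdout) is an addition for illustration.

```python
import shlex


def bwa_aln_command(reference, fastq, sai_out):
    """Build the argument list for a bwa aln call (the command is not executed)."""
    return [
        "bwa", "aln",
        "-l", "16500",  # seed length; longer than any read, so seeding is effectively disabled
        "-n", "0.01",   # as a fraction: missing-alignment probability under a 2% uniform error rate
        "-o", "2",      # maximum number of gap opens
        "-f", sai_out,  # output file for the suffix-array coordinates
        reference,
        fastq,
    ]


# Hypothetical file names, for illustration only.
cmd = bwa_aln_command("hs37d5.fa", "sample.all.fastq.gz", "sample.sai")
print(shlex.join(cmd))
```

The resulting .sai file would then typically be converted to SAM with bwa samse, since the merged fastq file is treated as single-end data.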

By default, all deliveries are mapped against human reference genome build 37 (hg19).


Version Filename
hg18 human_b36_male_nohaps.fa
hg19 hs37d5.fa
hg38 GRCh38_full.fa

Next, a slightly modified version of FilterUniqueSAMCons.py [1] is used to condense reads with identical start and end positions into a consensus read. Reads shorter than 35 base pairs and reads with less than 90% consensus with the reference are then filtered out using percidentity_threshold.py (Skoglund 2012). The final file is located in /proj/snic2020-2-10/1000AncientGenomes/hg19bams/.
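The length and identity thresholds above can be sketched as a small Python predicate. This is an illustration only and not the code of percidentity_threshold.py: the function name is hypothetical, and the sketch assumes that identity is the fraction of read bases matching the reference, computed from a mismatch count such as the NM tag of an alignment.

```python
MIN_LENGTH = 35            # reads shorter than this are discarded
MIN_IDENTITY_PERCENT = 90  # minimum percent identity with the reference


def passes_filters(read_length, mismatches):
    """Return True if a read meets both the length and identity thresholds.

    Assumption: identity = (read_length - mismatches) / read_length.
    Integer arithmetic avoids floating-point rounding at the 90% boundary.
    """
    if read_length < MIN_LENGTH:
        return False
    matches = read_length - mismatches
    return matches * 100 >= MIN_IDENTITY_PERCENT * read_length


# Hypothetical (length, mismatches) pairs.
for length, nm in [(34, 0), (35, 0), (50, 5), (50, 6)]:
    print(length, nm, passes_filters(length, nm))
```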

Several statistics are calculated and uploaded to the Atlas database. Read length and damage pattern plots are produced and placed in /proj/snic2020-2-10/1000AncientGenomes/nobackup/RLplots/ and /proj/snic2020-2-10/1000AncientGenomes/nobackup/damageplots/, respectively.






Previous Versions

The previous version of the pipeline used AdapterRemoval v. 2.1.7 [2] for trimming and merging of fastq reads. The program parameters were set to merge (collapse) forward and reverse reads if an overlap of at least 11 bp was identified. It also trimmed the reads if traces of the default Illumina adapters were present. The collapsed, collapsed.truncated, pair1.truncated, and pair2.truncated files were merged into a fastq file named ARmerged.
This pipeline was retired because it either added adapter bases or removed fragment bases when the PE reads were not symmetrical (i.e. one of the PE reads contained Ns at the 5' end).

If you find a file with merged in its name instead of ARmerged or CutAdapt-eq_set-FLASH_corrected, then your file has been merged using the Kircher 2009 [1] script.


Sequencing Machines

Our data have been sequenced paired-end on either an Illumina HiSeq 2500, HiSeq X10 or NovaSeq 6000, depending on when they were sequenced. In general, if the sequence identifier contains ST-E0XXX, the data were sequenced on an X10 machine, while A00XXX means NovaSeq. All others are HiSeq 2500, with the exception of a couple of (mostly) failed MiSeq runs.

Identifier Technique
A00XXX NovaSeq 6000, both Stockholm and Uppsala
ST-E0XX X10, both Stockholm and Uppsala
D00XXX HiSeq 2500, Uppsala
ACXXX HiSeq 2500, Stockholm
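The identifier rules in the table above can be sketched as a small Python helper. It assumes the usual Illumina fastq header layout, in which the instrument identifier is the first colon-separated field after the leading @. The function name and the example headers are hypothetical, and the rare MiSeq runs mentioned above are not handled.

```python
def platform_from_header(header):
    """Guess the sequencing platform from a fastq read header.

    Assumption: the instrument identifier is the first colon-separated
    field of the header (standard Illumina naming).
    """
    instrument = header.lstrip("@").split(":", 1)[0]
    if instrument.startswith("A00"):
        return "NovaSeq 6000"
    if instrument.startswith("ST-E0"):
        return "HiSeq X10"
    # Everything else is treated as HiSeq 2500 (e.g. D00XXX identifiers).
    return "HiSeq 2500"


# Hypothetical read headers, for illustration only.
print(platform_from_header("@A00181:45:HXXXXXXXX:1:1101:1000:1000 1:N:0:CATACCT+AGTTGGT"))
print(platform_from_header("@ST-E00201:100:HXXXXXXXX:3:1101:2000:2000 1:N:0:1"))
print(platform_from_header("@D00118:257:CXXXXXXXX:2:1101:3000:3000 1:N:0:1"))
```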

Be aware that during the first year of NovaSeq sequencing the 1.0 chemistry version was used, while from early 2021 the 1.5 chemistry version has been used. The difference between the reagents is the direction in which the p5 index is read (see this link for detailed information about the difference). This means that a library sequenced with both chemistry versions will show the same p7 index, but the p5 indices will be the reverse complement of each other. For an example, see below:

NovaSeq 6000 v1.0: CATACCT-AGTTGGT

NovaSeq 6000 v1.5: CATACCT-ACCAACT
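The relationship between the two index pairs above can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes the index pair is written as p7-p5 with a hyphen separator, as in the example above.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(_COMPLEMENT)[::-1]


def same_library(index_v10, index_v15):
    """Check whether a v1.0 and a v1.5 index pair can belong to the same library.

    The p7 index must be identical; the p5 index must be reverse-complemented.
    """
    p7_a, p5_a = index_v10.split("-")
    p7_b, p5_b = index_v15.split("-")
    return p7_a == p7_b and p5_a == reverse_complement(p5_b)


print(reverse_complement("AGTTGGT"))
print(same_library("CATACCT-AGTTGGT", "CATACCT-ACCAACT"))
```

A check like this can help when matching index combinations for libraries that were sequenced both before and after the chemistry switch.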






Comparative Sequences

Comparative sequences can be found in the Google spreadsheet and at /proj/snic2020-2-10/1000AncientGenomes/comparative_seqs/comparative_seqs/.


Comparative Datasets

There are some downloaded datasets ready for you to use located in /proj/snic2020-2-10/1000AncientGenomes/SNPrefs/.

The most commonly used ones are HumanOrigins.autosomal.release.tped, which contains all the Human Origins (HO) samples, and Eurasian.KGP.tped, which contains the Eurasian (plus YRI) populations from the 1000 Genomes Project (KGP). HO has more populations than KGP, but fewer individuals per population. All the files are in plink format.


Bibliography

[1] Martin Kircher. Analysis of High-Throughput Ancient DNA Sequencing Data. In Beth Shapiro and Michael Hofreiter, editors, Ancient DNA, volume 840, pages 197–228. Humana Press, Totowa, NJ, 2012. URL: http://link.springer.com/10.1007/978-1-61779-516-9_23.
[2] Mikkel Schubert, Stinus Lindgreen, and Ludovic Orlando. AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, December 2016. URL: http://www.biomedcentral.com/1756-0500/9/88, doi:10.1186/s13104-016-1900-2.