 |
AllelePipe |
Identifying alleles in population genomic datasets without a reference genome
Allelic variation within
species provides fundamental insights into the evolution and ecology of
organisms, and information about this variation is becoming
increasingly available in sequence datasets of multiple and/or outbred
individuals. Unfortunately, identifying true allelic variants poses a
number of challenges, given the presence of both sequencing errors and
alleles from other closely-related loci. Particularly tricky is the
case where alleles must be identified without mapping them to a fully
resolved reference genome, and where sequence depth information cannot
be used to infer the putative number of loci sharing the same
sequence. This situation is commonly found in transcriptome
and publicly available post-assembly datasets. The AllelePipe takes in
assembled sequence contigs from one or more individuals and passes them
through the following steps to extract allelic variation at putative
individual loci:
- 1) Similarity is assessed among all sequences from all individuals
using SSAHA2, according to user-defined minimum similarity and
alignment length thresholds.
- 2) Alignment throughout the region of overlap is verified.
- 3) Sequences are clustered by either single-linkage clustering or MCL,
as desired, with the option of restarting the clustering with
alternative methods/granularities.
- 4) Multiple alignments are created for sequences within each cluster
and their consensus sequence is generated, using CAP3. A single
consensus genomic reference fasta file is generated for the whole
dataset, which can be reused in other analyses.
- 5) Optionally, putatively chimeric clusters are removed, assuming that
these are clusters where only one sequence bridges an internal region
of the multiple alignment. This step is only appropriate for datasets
with many individuals and good coverage of loci, where many sequences
should align across the length of each locus.
- 6) SNPs (currently excluding indels) are identified using ssahaSNP
against the reference sequence for the same or different sets of
individuals, as desired. The program can be restarted from this step
for additional analyses with different parameters and/or new
individuals.
- 7) Clusters are sorted as single- or multi-locus, based upon user
settings for the maximum number of alleles allowed per individual.
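As a rough illustration of steps 3 and 7, the sketch below groups sequences by single-linkage clustering and then labels each cluster as single- or multi-locus. It is not the AllelePipe code: the function names, the `individual_of` mapping, and the `max_alleles` parameter are hypothetical, and the number of sequences per individual is used here as a simple stand-in for the number of alleles per individual.

```python
from collections import defaultdict


def single_linkage_clusters(pairs):
    """Group sequence IDs so that any two IDs connected by a chain of
    pairwise similarity hits fall in the same cluster (single linkage)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for seq_id in list(parent):
        clusters[find(seq_id)].add(seq_id)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


def classify_cluster(cluster, individual_of, max_alleles=2):
    """Label a cluster by the largest number of sequences that any one
    individual contributes (assumption: one sequence = one allele)."""
    counts = defaultdict(int)
    for seq_id in cluster:
        counts[individual_of[seq_id]] += 1
    return "single-locus" if max(counts.values()) <= max_alleles else "multi-locus"


# Hypothetical hits that already passed the similarity thresholds.
pairs = [("a1", "b1"), ("b1", "a2"), ("c1", "c2")]
individual_of = {"a1": "A", "a2": "A", "b1": "B", "c1": "C", "c2": "C"}
clusters = single_linkage_clusters(pairs)
labels = [classify_cluster(c, individual_of, max_alleles=2) for c in clusters]
```

With a diploid setting (`max_alleles=2`), a cluster in which one individual contributes three or more distinct sequences would be flagged as multi-locus.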
Availability
Citing AllelePipe
- For the AllelePipe itself:
- Dlugosch KM, et al. In prep. -- stay tuned
The pipeline uses the following programs, which should also be cited:
- Huang X, Madan A (1999) CAP3: a DNA
sequence assembly program. Genome Res 9:868-877.
- Ning Z, Cox AJ, Mullikin JC (2001) SSAHA: a
fast search method for large DNA databases. Genome Res 11:1725-1729.
- If you use the MCL option, you should ALSO cite:
- Li L, Stoeckert CJ, Roos DS (2003) OrthoMCL: identification of
ortholog groups for eukaryotic genomes. Genome Res 13:2178-2189.
 |
ChromEvol |
Model the evolution of chromosome numbers across a phylogeny
Chromosome number is a remarkably dynamic feature of eukaryotic
evolution. Chromosome numbers can change by a duplication of the whole
genome (a process termed polyploidy), or by gaining or losing single
chromosomes. Of the various mechanisms of chromosome number change,
polyploidy has received significant attention because of the impact
such an event may have on the organism. Polyploids often differ
markedly from their progenitors in morphological, physiological, or
life history characteristics, and these differences may contribute to
the establishment and success of a polyploid species in novel
ecological settings.
ChromEvol implements a series of likelihood models for the evolution of
chromosome numbers. By comparing the fit of the different models to
biological data, it may be possible to gain insight regarding the
pathways by which the evolution of chromosome number proceeds. For each
model, the program infers the set of ancestral chromosome numbers and
estimates the locations along the tree at which polyploidy events (and
other chromosome number changes) occurred.
Availability
Citing ChromEvol
 |
DupPipe |
A Pipeline to Infer Gene Family Phylogenies and Summarize the Age of Duplication Events
Phylogenies reveal the evolutionary history of organisms and genes, and
phylogenetic analyses of genome scale data can uncover genome-wide
events that occurred in the past. For example, examining the overall
shape of the distribution of all gene family duplications for an
organism reveals the patterns of birth and death of gene copies through
time. Large-scale birth events, such as ancient polyploidy, are
apparent in these distributions as peaks and by comparison with other
species may be placed in phylogenetic context. Such analyses are also
useful to evaluate the resolution of close paralogs to assess assembly
quality - overassembled or short read assemblies without sufficient
power will not resolve recent paralogs. The DupPipe provides an online
server to estimate gene family phylogenies and plot the distribution of
duplications for subsequent evolutionary analyses and assembly quality
assessment.
Gene family members are identified as sequences that demonstrate at
least 40% sequence similarity over at least 300 bp from a discontiguous
MegaBlast (Zhang et al. 2000; Ma et al. 2002). Reading frames for each
sequence pair are identified by comparison to available protein
sequences by searching against a set of proteins provided by the user
or available on GenBank (Wheeler et al. 2007) using BlastX (Altschul et
al. 1997). Best hit proteins are paired with each gene at a minimum
cutoff of 30% sequence similarity over at least 150 sites. Genes that
do not have a best hit protein at this level are removed. To determine
reading frame and generate estimated amino acid sequences, each gene is
aligned against its best hit protein by Genewise 2.2.2 (Birney et al.
2004). Using the highest scoring Genewise DNA-protein alignments,
custom Perl scripts are used to remove stop and 'N' containing codons
and produce estimated amino acid sequences for each gene. Amino acid
sequences for each duplicate pair are then aligned using MUSCLE 3.6
(Edgar 2004). The aligned amino acids are subsequently used to align
their corresponding DNA sequences using RevTrans 1.4 (Wernersson
and Pedersen 2003). Ks values (synonymous substitutions per synonymous
site) for each duplicate pair are calculated using the maximum
likelihood method implemented in codeml of the PAML package (Yang 1997)
under the F3x4 model (Goldman and Yang 1994).
Further cleaning of the dataset is conducted to remove duplication
events that could bias the results. To reduce the possibility that
identical genes are represented in the dataset but were missed by the
TGICL clustering due to alternative splicing, all Ks values from one
member of a duplicate pair with Ks = 0 are removed. Further, to reduce
the multiplicative effects of multicopy gene families on Ks values,
simple hierarchical clustering is used to construct phylogenies for each
gene family, identified as single-linked clusters, and the node Ks
values are calculated.
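The motivation for node Ks values is that a family of n genes yields n(n-1)/2 pairwise Ks values but only n-1 duplication nodes. The sketch below illustrates the two cleaning steps described above under stated assumptions: the function names are hypothetical, the choice of which member of a Ks = 0 pair to drop is arbitrary, and average linkage is assumed for the hierarchical clustering because the linkage rule is not specified here.

```python
from itertools import combinations


def drop_zero_ks_duplicates(pair_ks):
    """Drop one member of each Ks == 0 pair, along with every Ks value
    that involves the dropped member."""
    dropped = set()
    for (a, b), ks in sorted(pair_ks.items()):
        if ks == 0 and a not in dropped and b not in dropped:
            dropped.add(b)  # arbitrary choice: drop the second member
    return {p: ks for p, ks in pair_ks.items() if not (set(p) & dropped)}


def node_ks_values(genes, pair_ks):
    """Return one Ks value per internal node of a gene family tree built
    by average-linkage clustering (assumed) of pairwise Ks values."""
    def ks(a, b):
        return pair_ks[(a, b)] if (a, b) in pair_ks else pair_ks[(b, a)]

    def mean_between(c1, c2):
        vals = [ks(a, b) for a in c1 for b in c2]
        return sum(vals) / len(vals)

    clusters = [(g,) for g in genes]
    nodes = []
    while len(clusters) > 1:
        # Merge the two clusters with the smallest mean pairwise Ks.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: mean_between(clusters[ij[0]], clusters[ij[1]]))
        nodes.append(mean_between(clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return nodes


# Hypothetical three-gene family: three pairwise values, two nodes.
family_ks = {("A", "B"): 0.1, ("A", "C"): 0.5, ("B", "C"): 0.7}
nodes = node_ks_values(["A", "B", "C"], family_ks)
```

In this toy family the first node joins A and B, and the second node is placed at the mean Ks between C and the (A, B) pair, so each duplication event contributes a single value to the distribution.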
Run the DupPipe on the
EvoPipes Server Now!
Output Files
- final_ks_values:
Contains the node Ks values for each gene family cluster to plot the
distribution of gene ages
- pamloutput:
Pairwise Ka, Ks, and Ka/Ks for each duplicate gene pair used to
generate the node Ks values for each gene family cluster
- genewise_dnas.fasta:
Cleaned up DNAs for PAML (e.g., removed stop codons) and placed in
reading frame by best blast hit to known proteins with genewise.
- genewise_prots.fasta:
Estimated protein sequences that correspond to the reading frame of the
same sequence number in genewise_dnas.fasta.
- sequence:
Original DNA fasta file.
- indices:
The index of each fasta header used by the DupPipe with the original
corresponding header.
Citing the DupPipe
- The citation for the DupPipe itself:
- For using the EvoPipes.net server, please
ALSO cite:
- The pipeline uses the following programs,
which should also be cited:
- Altschul SF, Madden TL, Schaffer AA,
Zhang J, Zhang Z, Miller W, Lipman DJ. 1997. Gapped BLAST and
PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 25:3389-3402.
- Birney E, Clamp M, Durbin R. 2004.
GeneWise and Genomewise. Genome Res. 14:988-995.
- Edgar RC. 2004. MUSCLE: multiple sequence
alignment with high accuracy and high throughput. Nucl. Acids Res.
32:1792-1797.
- Goldman N, Yang Z. 1994. A codon-based
model of nucleotide substitution for protein-coding DNA sequences. Mol.
Biol. Evol. 11:725-736.
- Ma B, Tromp J, Li M. 2002. PatternHunter:
faster and more sensitive homology search. Bioinformatics. 18:440-445.
- Wernersson R, Pedersen AG. 2003. RevTrans:
multiple alignment of coding DNA from aligned amino acid sequences.
Nucl. Acids Res. 31:3537-3539.
- Wheeler DL, Barrett T, Benson DA, Bryant
SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S,
et al. (30 co-authors). 2007. Database resources of the National Center
for Biotechnology Information. Nucl. Acids Res. 35:D512.
- Yang Z. 1997. PAML: a program package for
phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci.
13:555-556.
- Zhang Z, Schwartz S, Wagner L, Miller W.
2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol.
7:203-214.
 |
findSSR |
Identify SSRs in a collection of DNA sequences
findSSR is a pipeline that identifies all genes containing SSRs
(microsatellites), including di-, tri-, tetra- and penta-nucleotide
motifs of at least five repeats. Similar in scope to the approach of
Temnykh et al. (2001), findSSR was designed to pull out SSRs that would
be useful for genotyping. For this purpose, unlike other programs,
findSSR only reports repeats found more than 20 nucleotides from the
ends of the sequence, leaving room for primers on either side. Output includes a
list of each sequence name for genes containing an SSR, the location of
the SSR, repeated motif, number of repeats, and the total length of the
sequence examined. The program has been used to identify
microsatellites that have been developed into markers used in several
publications (Kane et al. 2009, Kane and Rieseberg 2007, Kane and
Rieseberg 2008, Kawakami et al. 2010, Yatabe et al. 2007).
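The search criteria above can be sketched with a regular expression. This is an illustrative approximation, not the findSSR source: the function name, the `flank` parameter, and the choice to skip homopolymer runs are assumptions, and only unambiguous A/C/G/T bases are considered.

```python
import re

# A motif of 2-5 bases followed by at least 4 more copies (>= 5 repeats).
# The lazy quantifier prefers the shortest motif, so ATATAT... is reported
# as a dinucleotide repeat rather than a tetranucleotide one.
SSR_PATTERN = re.compile(r"([ACGT]{2,5}?)\1{4,}")


def find_ssrs(seq, flank=20):
    """Return (motif, repeats, 1-based start, sequence length) for each SSR
    with more than `flank` bases between it and either end of the sequence."""
    seq = seq.upper()
    results = []
    for m in SSR_PATTERN.finditer(seq):
        motif = m.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        if m.start() <= flank or len(seq) - m.end() <= flank:
            continue  # too close to an end to leave room for primers
        repeats = (m.end() - m.start()) // len(motif)
        results.append((motif, repeats, m.start() + 1, len(seq)))
    return results


# Hypothetical 25 bp flanking sequence on either side of an (AG)6 repeat.
flank_seq = "ACGTTGCAAGCTTAGCCATGGATCC"
hits = find_ssrs(flank_seq + "AG" * 6 + flank_seq)
no_hits = find_ssrs("AG" * 6 + flank_seq)
```

The second call returns nothing because the repeat starts at the very beginning of the sequence, leaving no room for a primer on that side.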
Run findSSR on the EvoPipes Server Now!
Output Files
- out.ssr*:
Row by row listing of SSRs for each fasta sequence with the repeat
motif, number of repeats, position of repeat, and total length of the
fasta sequence.
Citing findSSR
- The citation for findSSR itself:
- Kane NC and Rieseberg LH. 2007.
Selective sweeps reveal candidate genes for adaptation to drought and
salt tolerance in common sunflower, Helianthus annuus.
Genetics 175: 1823-1834.
- For using findSSR on the EvoPipes.net server, please ALSO cite:
 |
NU-IN |
Simulate gene and genome evolution with your data using NU-IN and EvolSimulator
NU-IN is an adaptation and expansion of the EvolSimulator 2.1.0
genome evolution simulation program by Beiko and Charlebois (2007,
http://bioinformatics.org.au/evolsim/). NU-IN was designed to expand
EvolSimulator in two fundamental ways: 1) Allow synonymous and
non-synonymous nucleotide evolution and 2) Permit input of genomes,
gene family membership, and gene 'usefulness' (the selective retention
of particular loci in particular environments). With these changes, the
user has the ability to use real genomic (coding) sequence data to
initiate a simulation of one or more lineages, generate mutations
through SNPs and copy number variation (as well as horizontal gene
transfer), evolve genomes by drift and selection, and use output of
previous simulations as starting points for further evolution.
Availability
Citing NU-IN
- The citation for NU-IN itself:
- Please ALSO cite along with NU-IN:
- Beiko RG, Charlebois RL. 2007. A simulation test
bed for hypotheses of genome evolution. Bioinformatics 23:825-831.
 |
RBH Orthologs |
A pipeline to identify reciprocal best BLAST hits among a set of fasta
files.
The RBH Orthologs pipeline is
designed to provide a list of reciprocal best blast hits for a given
set of files. This is one of many approaches to identifying putatively
orthologous sequences in a collection of genomic data, an important
step in searching for candidate genes under selection or constructing
phylogenies with only orthologous sequences. Note that lineage specific
duplications and missing data pose problems for the identification of
orthologs using all approaches, but the stringent requirements of the
RBH algorithm should minimize these errors.
For each fasta formatted data set uploaded, reciprocal BLAST searches
are conducted for all pairwise combinations using a discontiguous
megablast (Zhang et al. 2000; Ma et al. 2002) with an initial filter of
at least 50% sequence identity and an e-value of 0.1. Each BLAST search
is further parsed to keep hits with at least 70% sequence identity over
at least 100 base pairs. The top hits among each of these pairwise BLAST
searches are then examined for reciprocal best BLAST hits among all
uploaded fasta files. A file containing the names of these RBH
orthologs, listed by row, is provided as output. Names of each sequence
are coded with the three letter file id provided by the user and the
index number for each sequence, based on its position in the original
fasta file.
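For two files, the reciprocal best hit logic described above can be sketched as follows. The tuple layout, function names, and the use of bit score to rank hits are assumptions made for illustration; the pipeline itself extends the comparison to all uploaded files.

```python
def best_hits(hits, min_identity=70.0, min_length=100):
    """Map each query to its top-scoring subject among hits that pass the
    identity and alignment length filters.

    Each hit is (query, subject, percent_identity, length, bit_score)."""
    best = {}
    for query, subject, identity, length, score in hits:
        if identity < min_identity or length < min_length:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, **filters):
    """Return (a, b) pairs where b is a's best hit and a is b's best hit."""
    ab = best_hits(a_vs_b, **filters)
    ba = best_hits(b_vs_a, **filters)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)


# Hypothetical parsed BLAST results for files A and B.
a_vs_b = [
    ("a1", "b1", 95.0, 300, 500.0),
    ("a1", "b2", 80.0, 200, 250.0),
    ("a2", "b2", 90.0, 150, 200.0),
    ("a3", "b3", 65.0, 400, 300.0),  # fails the 70% identity filter
]
b_vs_a = [
    ("b1", "a1", 95.0, 300, 500.0),
    ("b2", "a1", 80.0, 200, 250.0),
    ("b2", "a2", 90.0, 150, 200.0),
    ("b3", "a3", 65.0, 400, 300.0),
]
rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here a2's best hit is b2, but b2's best hit is a1, so only the a1/b1 pair is reciprocal. This is the stringency that helps limit errors from lineage-specific duplicates.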
Run RBH Orthologs on the EvoPipes Server Now!
Output Files
- orthologs.id1.id2...:
Row by row listing of rbh orthologs with each fasta sequence identified
in the ortholog groups based on user provided IDs and position of the
sequence in the original fasta files.
Citing the RBH Ortholog Pipeline
- For using RBH Orthologs on the EvoPipes.net server, please cite:
- The pipeline includes the following program(s):
- Ma B, Tromp J, Li M. 2002.
PatternHunter: faster and more sensitive homology search.
Bioinformatics. 18:440-445.
- Zhang Z, Schwartz S, Wagner L,
Miller W. 2000. A greedy algorithm for aligning DNA sequences. J.
Comput. Biol. 7:203-214.
 |
SCARF |
Scaffolded and Corrected Assembly of Roche 454
A next-gen sequence assembly tool for evolutionary genomics. Designed
especially for assembling 454 EST sequences against high quality
reference sequences from related species.
SCARF was created in order to knit together low-coverage 454 contigs
that do not assemble during traditional de novo assembly, using a
reference sequence library to orient the 454 sequences. SCARF is
especially well suited for non-contiguous or low depth data sets such
as EST (expressed sequence tag) libraries. SCARF can also be used to
sort and assemble a pool of 454 sequence data according to a set of
reference sequences (e.g. for metagenomics). See the documentation for
a full description of the methodology behind SCARF.
Run SCARF on the EvoPipes Server Now!
Download SCARF
Citing SCARF
- Barker MS, Dlugosch KM, Reddy ACC, Amyotte SN, Rieseberg LH. 2009.
SCARF: Maximizing next-generation EST assemblies for evolutionary and
population genomic analyses. Bioinformatics 25(4): 535-536.
- If you run SCARF using the EvoPipes server, please ALSO cite:
Barker MS, Dlugosch KM, Dinh L, Challa RS, Kane NC, King MG, Rieseberg
LH. 2010. EvoPipes.net: Bioinformatic tools for ecological and
evolutionary genomics. Evolutionary Bioinformatics 6: 143-149.
 |
SnoWhite |
An Aggressive Cleaning
Pipeline for DNA Sequence Reads
SnoWhite is a pipeline designed to flexibly and
aggressively clean sequence reads (gDNA or cDNA) prior to assembly. It
takes in and returns fasta formatted sequence and (optionally) quality
files. It employs several steps:
- 1) Adapter Clipping: SnoWhite can clip a
user-specified number of bases from the beginning of each sequence.
- 2&4) Seqclean: SnoWhite passes files to
TGI's Seqclean,
a relatively old but still excellent tool for trimming polyA/T tails,
primer contaminants, and uninformative sequences (Ns).
- 3) PolyA/T Trimming: SnoWhite provides additional
trimming governed by many tunable parameters. In short, users can set
tolerances for what constitutes a polyA/T, where to look for it in the
sequence, and how much error to allow.
- 5) TagDust: SnoWhite optionally implements TagDust,
which is designed to find sequences that are composed almost entirely
of primer/adapter fragments. These primer 'multimers' or 'concatemers'
are a persistent low-abundance feature of many datasets, and are
extremely difficult to remove using traditional contaminant searches.
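Steps 1 and 3 can be pictured with a very small sketch. It is not SnoWhite's implementation: the function names and the `min_run` parameter are hypothetical, only terminal polyA (3' end) and polyT (5' end) runs are handled, and no mismatches are tolerated, whereas SnoWhite exposes many more tunable parameters.

```python
import re


def clip_start(seq, n_bases):
    """Remove a fixed number of bases from the beginning of a read."""
    return seq[n_bases:]


def trim_poly_at(seq, min_run=6):
    """Strip a terminal polyA run from the 3' end and a terminal polyT run
    from the 5' end when the run is at least `min_run` bases long."""
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    return seq


clipped = clip_start("ACGTACGT", 3)
trimmed = trim_poly_at("TTTTTTTTGCATGCAAAAAAAA")
untouched = trim_poly_at("GCATGCAAA")
```

Short runs such as the three trailing As in the last call are left alone, since they are more likely to be genuine sequence than a polyA tail.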
Data Types
- 454:
SnoWhite was written for Roche 454 data, and is ideal for this.
- Illumina & SOLiD:
May require large amounts of RAM.
- Sanger:
Note that TagDust evaluates only the first 999bp of sequence,
and TagDust does not tolerate vector sequences >2000nt.
Availability
Citing SnoWhite
- For SnoWhite:
Dlugosch KM, Rieseberg LH. SnoWhite: A pipeline for aggressive cleaning
of next-generation sequence reads. In prep.
- If you use the TagDust option, you should
ALSO cite:
Lassmann T, Hayashizaki Y, Daub CO. 2009. TagDust - A program to
eliminate artifacts from next generation sequencing data.
Bioinformatics 25: 2839-2840.
 |
TransPipe |
A pipeline to translate a collection of genomic or cDNA sequences to
protein and place them in the corresponding reading frame
TransPipe provides bulk translation and reading frame identification
for a set of fasta formatted sequences. Identifying reading frames and
translated protein sequences is a crucial step for codon-based
analyses, such as tests for accelerated amino acid evolution, Ka/Ks
ratio estimation, or protein-guided DNA alignment. Using
GeneWise's HMM algorithm, the TransPipe leaves only the regions of
genes or EST reads that align to the protein sequence and starts them
in-frame. An added benefit of this approach is that any missed vectors
and adapters are also removed, as well as all non-coding DNA from the
data set, such as introns and UTRs, simplifying the comparison of
transcriptome, whole genome shotgun, and genespace data.
The reading frame and protein translation for each sequence are
identified by comparison to available protein sequences provided by the
user or available on GenBank (Wheeler et al. 2007). Using BlastX
(Altschul et al. 1997), best hit proteins are paired with each gene at a
minimum cutoff of 30% sequence similarity over at least 150 sites.
Genes that do not have a best hit protein at this level are removed. To
determine reading frame and generate estimated amino acid sequences,
each gene is aligned against its best hit protein by Genewise 2.2.2
(Birney et al. 2004). Using the highest scoring Genewise DNA-protein
alignments, custom Perl scripts are used to remove stop and 'N'
containing codons and produce estimated amino acid sequences for each
gene. Output includes paired DNA and protein sequences with the DNA
sequence's reading frame corresponding to the protein sequence.
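The final cleaning step, removing stop and 'N' containing codons from an in-frame sequence and translating what remains, might look like the sketch below. The pipeline itself uses custom Perl scripts; this Python version, its function name, and its handling of other ambiguity codes (translated as X) are assumptions, and the input is assumed to start in frame already.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def clean_codons(dna):
    """Drop stop codons, codons containing N, and any trailing partial
    codon from an in-frame DNA sequence.

    Returns (cleaned_dna, protein) with matching reading frames."""
    dna = dna.upper()
    kept, protein = [], []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        amino = CODON_TABLE.get(codon, "X")  # X for other ambiguity codes
        if "N" in codon or amino == "*":
            continue
        kept.append(codon)
        protein.append(amino)
    return "".join(kept), "".join(protein)


cleaned_dna, cleaned_protein = clean_codons("ATGNNNGCTTAAGGA")
```

Because whole codons are removed, the cleaned DNA and protein stay in register, which is what downstream codon-based tools such as PAML expect.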
Run the TransPipe on the EvoPipes Server Now!
Output Files
- genewise_dnas.fasta:
Translated DNAs placed in reading frame by alignment to best blast hit
to known proteins with genewise. Sequences are also cleaned of any
non-aligned regions and stop codons, and are ready for use in other
software tools such as PAML.
- genewise_prots.fasta:
Estimated protein sequences that correspond to the reading frame of the
same sequence number in genewise_dnas.fasta.
- sequence:
Original DNA fasta file.
- fasta_index:
The index of each fasta sequence used by the TransPipe with the
original corresponding header in column 2.
- dna_protein_index:
The index of each fasta sequence (column 1) with the corresponding best
hit protein index and header (column 2 and 3).
Citing the TransPipe
- The citation for the TransPipe itself:
- The pipeline uses the following programs, which should also be cited:
- Altschul SF,
Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. 1997.
Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res. 25:3389-3402.
- Birney E, Clamp M, Durbin R. 2004.
GeneWise and Genomewise. Genome Res. 14:988-995.
- Wheeler DL, Barrett T, Benson DA, Bryant
SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S,
et al. (30 co-authors). 2007. Database resources of the National Center
for Biotechnology Information. Nucl. Acids Res. 35:D512.
|