The urgent need to identify functional
variants in the human genome has motivated a number of genetic variant prioritization
approaches. We present a novel supervised algorithm, PAFA, that prioritizes and
assesses the functionality of genetic variants by introducing population
differentiation measures and recalibrating training variants with multiple
filtration strategies. Comprehensive
performance evaluations demonstrate that PAFA exhibits much higher sensitivity
and specificity in prioritizing noncoding risk variants than existing methods
and provides a more objective assessment for both coding and noncoding
variants. Leave-one-out cross-validation of feature selection reveals that
population differentiation measures contribute to PAFA significantly outperforming
current supervised and unsupervised methods in prioritizing common noncoding risk
variants in every situation. In addition, PAFA
achieves improved performance in distinguishing both common and rare recurrent variants
from non-recurrent variants by integrating multiple annotations and metrics. An
integrated online platform for PAFA was developed, which provides comprehensive
functional annotations for noncoding variants by integrating abundant
functional genomic data.
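The text above does not say which population differentiation measures PAFA uses. As one illustrative possibility only, the following minimal sketch computes Wright's fixation index (F_ST) for a single biallelic variant from per-population allele frequencies. The function name `fst_biallelic` and the weighting by sample size are assumptions made for this example, not part of PAFA.

```python
def fst_biallelic(freqs, sizes=None):
    """Wright's F_ST = (H_T - H_S) / H_T for one biallelic variant.

    freqs: alternate-allele frequency in each population.
    sizes: optional sample sizes used as weights (equal weights if omitted).
    Hypothetical helper for illustration; not PAFA's implementation.
    """
    if sizes is None:
        sizes = [1.0] * len(freqs)
    total = float(sum(sizes))
    weights = [s / total for s in sizes]
    # Pooled allele frequency across all populations.
    p_bar = sum(w * p for w, p in zip(weights, freqs))
    # Expected heterozygosity of the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # variant is monomorphic everywhere: no differentiation
    # Weighted mean expected heterozygosity within populations.
    h_within = sum(w * 2.0 * p * (1.0 - p) for w, p in zip(weights, freqs))
    return (h_total - h_within) / h_total
```

A variant with identical frequencies in every population gives 0, and a variant fixed for different alleles in two populations gives 1, so larger values indicate stronger differentiation.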
developers: Peng G
Currently, most gene prediction methods detect coding
sequences (CDSs) from transcriptome assemblies when closely related
reference genomes are lacking. However, these methods are of limited use due to highly
fragmented transcripts and extensive assembly errors, which may lead to redundant
or false CDS predictions. Here we present a novel algorithm, inGAP-CDG, for
effective construction of full-length and non-redundant CDSs from unassembled
transcriptomes. inGAP-CDG achieves this by combining a newly developed
codon-based de Bruijn graph to simplify
the assembly process and a machine learning based approach to filter false
positives. Compared with other methods, inGAP-CDG exhibits significantly
increased predicted CDS length and robustness to sequencing errors and varied read
length. These advantages greatly facilitate downstream genomic analyses, including
phylogenetic tree and gene model construction, which will improve our ability to
explore the functional potential of novel species.
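To make the idea of a codon-based de Bruijn graph concrete, here is a minimal sketch in which each node is a window of k consecutive codons and each edge shifts the window by one codon. It assumes every read is already in the correct reading frame, which real data does not guarantee. The function names are hypothetical and this is not the inGAP-CDG implementation.

```python
from collections import defaultdict


def codons(seq):
    """Split a nucleotide sequence into complete codons, starting at offset 0."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def build_codon_dbg(reads, k=3):
    """Nodes are tuples of k consecutive codons; edges shift by one codon.

    Assumption: all reads are in frame (offset 0).
    """
    graph = defaultdict(set)
    for read in reads:
        cs = codons(read)
        for i in range(len(cs) - k):
            src = tuple(cs[i:i + k])
            dst = tuple(cs[i + 1:i + k + 1])
            graph[src].add(dst)
    return graph


def extend_unambiguous(graph, start):
    """Follow edges from `start` while each node has exactly one successor."""
    path = list(start)
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop on cycles
            break
        path.append(nxt[-1])  # each step contributes one new codon
        seen.add(nxt)
        node = nxt
    return "".join(path)


reads = ["ATGGCCAAATTT", "GCCAAATTTGGG", "AAATTTGGGTGA"]
graph = build_codon_dbg(reads, k=2)
cds = extend_unambiguous(graph, ("ATG", "GCC"))
```

Working in codon units means every path through the graph stays in frame, which is one plausible reason such a graph could simplify CDS assembly compared with a nucleotide-level graph.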
Three data sets of 200 bp paired-end RNA-seq reads were simulated from chr3 of hg19 with error rates of 0.5%, 1% and 2%, respectively.
Four RNA-seq data sets were simulated from the chr3 Consensus Coding Sequence (CCDS) of hg19 with read lengths of 100, 300, 500 and 800 bp, respectively.
developers: Gao Y
Although previous studies demonstrated that circular RNAs (circRNAs)
are not composed exclusively of mRNA exons, no study has extensively explored their
internal structure. By combining a novel algorithm with long-read sequencing data
and experimental validation, we comprehensively investigate, for the first time, the internal
components of circRNAs in 10 human cell lines and 62 fruit fly samples, and reveal
the prevalence of alternative splicing (AS) events within circRNAs.
Significantly, a large proportion of circRNA AS exons can hardly be detected in
mRNAs, and are enriched with binding sites of distinct splicing factors from those
enriched in mRNA exons. We find that AS events in circRNAs show a preference
for nuclear localization, and exhibit tissue- and developmental
stage-specific expression patterns. This study suggests independent regulation
of the biogenesis or decay of AS events in circRNAs, and the identified
circular AS isoforms provide novel targets for future studies on circRNA formation
developers: Ji P
Most current approaches analyze metagenomic data with the help of reference genomes. However, novel microbial communities extend far beyond the coverage of reference databases, and de novo metagenome assembly from complex microbial communities remains a great challenge. Here we present a novel experimental and bioinformatic framework, metaSort, for effective construction of bacterial genomes from metagenomic samples. MetaSort provides a sorted mini-metagenome approach based on flow cytometry and single-cell sequencing methodologies, and employs new computational algorithms to efficiently recover high-quality genomes from the sorted mini-metagenome, using the original metagenome as a complement. Through extensive evaluations on simulated data and on salivary and gut microbiomes, we demonstrated that metaSort has excellent and unbiased performance in genome recovery and assembly.
developers: Zhang Y
16S rRNA amplicon analysis and shotgun metagenome sequencing are the two main culture-independent strategies for exploring the genetic landscape of microbial communities. Recently, numerous studies have employed these two approaches together, but downstream data analyses were performed separately, which always generated incongruent or conflicting signals in both taxonomic and functional classifications. Here we propose a novel approach, RiboFR-Seq (Ribosomal RNA gene Flanking Region Sequencing), for capturing both ribosomal RNA variable regions and their flanking protein-coding genes simultaneously. We demonstrated that RiboFR-Seq could detect the vast majority of bacteria not only in well-studied microbiomes but also in novel communities with limited reference genomes. Combined with classical amplicon sequencing and shotgun metagenome sequencing, RiboFR-Seq can link the annotations of 16S rRNA and metagenomic contigs to make a consensus classification.
developers: Zhao H
Although recently developed algorithms have integrated multiple signals to improve sensitivity for INDEL detection, they are far from perfect and still have great limitations in detecting the full size range of INDELs. Here we present BreakSeek, a novel breakpoint-based algorithm, which can unbiasedly and efficiently detect both homozygous and heterozygous INDELs, ranging from several base pairs to thousands of base pairs, with accurate breakpoint and heterozygosity rate estimations. Comprehensive evaluations on both simulated and real data sets revealed that BreakSeek outperformed existing methods in both sensitivity and specificity when detecting small and large INDELs, and uncovered a significant number of novel INDELs that had previously been missed. In addition, by incorporating sophisticated statistical models, we investigated and demonstrated, for the first time, the importance of handling false and conflicting signals in multi-signal integrated methods.
developers: Gao Y, Wang J
To detect CRISPR direct repeats (DRs) and spacers from NGS reads;
To annotate DRs and spacers using publicly available WGS bacterial and phage genomes;
To build phage-bacteria interaction network using DR-spacer connections
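The first goal in the list above, finding direct repeats and the spacers between them in a read, can be sketched as follows. This is a toy illustration that looks for an exact repeated k-mer at spacer-sized intervals. The function name, the exact-match requirement and the small default lengths are all assumptions chosen to keep the example short, and they do not describe the tool's actual method.

```python
def find_crispr_array(seq, dr_len=8, min_spacer=4, max_spacer=12, min_copies=3):
    """Return (direct_repeat, spacers) for the first repeat array found, else None.

    Toy sketch: exact matches only, fixed repeat length, illustrative defaults.
    """
    for i in range(len(seq) - dr_len + 1):
        dr = seq[i:i + dr_len]
        positions = [i]
        pos = i
        while True:
            # The next copy must start after a spacer of allowed length.
            lo = pos + dr_len + min_spacer
            hi = pos + dr_len + max_spacer
            nxt = seq.find(dr, lo, hi + dr_len)
            if nxt == -1:
                break
            positions.append(nxt)
            pos = nxt
        if len(positions) >= min_copies:
            # Spacers are the sequences between consecutive repeat copies.
            spacers = [seq[a + dr_len:b] for a, b in zip(positions, positions[1:])]
            return dr, spacers
    return None


read = "CC" + "GTTTTAGA" + "ACGTAC" + "GTTTTAGA" + "TTGCAAGG" + "GTTTTAGA" + "AA"
result = find_crispr_array(read)
```

Spacers recovered this way could then be matched against phage genomes, and repeats against bacterial genomes, to produce the DR-spacer connections that the interaction network is built from.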
developers: Gao Y
CIRI (circRNA identifier) is a novel chiastic clipping signal-based algorithm, which can unbiasedly and accurately detect circRNAs from transcriptome data by employing multiple filtration strategies. In particular, CIRI has the following advantages over annotation-dependent algorithms: (i) it is able to detect circRNAs transcribed from intronic or intergenic genomic regions; and (ii) it is applicable to sequencing data from eukaryotes that are poorly annotated or not annotated at all.
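The core signal behind annotation-free circRNA detection is a read whose two aligned segments appear in reversed order on the genome, as expected when a read crosses a back-splice junction. The sketch below checks only that ordering for a pair of segments. The `Segment` type, the `max_span` cutoff and the function name are hypothetical, and CIRI's actual filtration strategies are not reproduced here.

```python
from typing import NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    chrom: str
    strand: str
    read_offset: int  # offset within the read as aligned (reference orientation)
    ref_start: int    # 0-based, inclusive
    ref_end: int      # exclusive


def back_splice_candidate(
    a: Segment, b: Segment, max_span: int = 100_000
) -> Optional[Tuple[str, int, int]]:
    """Return (chrom, start, end) of a putative circRNA, or None.

    A linear splice keeps read order and genome order the same; a
    back-splice puts the earlier read segment downstream of the later one.
    """
    if a.chrom != b.chrom or a.strand != b.strand:
        return None
    first, second = sorted((a, b), key=lambda s: s.read_offset)
    if first.ref_start < second.ref_end:
        return None  # collinear or overlapping: not a back-splice signal
    start, end = second.ref_start, first.ref_end
    if end - start > max_span:
        return None  # hypothetical sanity limit on circle size
    return (a.chrom, start, end)


head = Segment("chr1", "+", 0, 1950, 2000)   # first 50 nt of the read
tail = Segment("chr1", "+", 50, 1000, 1050)  # remaining 50 nt
candidate = back_splice_candidate(head, tail)
```

Because the check uses only alignment coordinates, it works equally for exonic, intronic and intergenic loci, which is consistent with advantage (i) above.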
ENCODE.zip (circular RNAs detected from ENCODE data)
developers: Qi J & Zhao F
Mining genetic variation from personal genomes is a crucial step towards investigating the relationship between genotype and phenotype. However, compared to the detection of SNPs and small indels, characterizing large and particularly complex structural variation is much more difficult and less intuitive. We present a new scheme to detect and visualize structural variation from paired-end mapping data. Under this scheme, abnormally mapped read pairs are clustered based on the location of a gap signature. Several important features, including local depth of coverage, mapping quality and associated tandem repeats, are used to evaluate the quality of predicted structural variation. Compared with other approaches, it can detect many more large insertions and complex variants with a lower false discovery rate.
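As a rough illustration of clustering abnormally mapped read pairs, the sketch below flags pairs whose mapped span is unusually large and groups them by position. It substitutes simple position-based clustering for the gap-signature clustering described above, whose details are not given here. The thresholds and the function name are assumptions.

```python
def cluster_discordant_pairs(pairs, mean_insert, sd_insert, max_gap=500, min_support=2):
    """Group read pairs with abnormally large spans into candidate deletions.

    pairs: (left_pos, right_pos) tuples for read pairs on one chromosome.
    A pair is abnormal if its span exceeds mean_insert + 3 * sd_insert.
    Stand-in clustering rule: consecutive abnormal pairs whose left
    positions are within max_gap of each other share a cluster.
    """
    cutoff = mean_insert + 3 * sd_insert
    abnormal = sorted(p for p in pairs if p[1] - p[0] > cutoff)
    clusters, current = [], []
    for pair in abnormal:
        if current and pair[0] - current[-1][0] > max_gap:
            clusters.append(current)
            current = []
        current.append(pair)
    if current:
        clusters.append(current)
    # Keep only clusters supported by enough read pairs.
    return [c for c in clusters if len(c) >= min_support]


pairs = [(100, 400), (1000, 3000), (1050, 3040), (1100, 3100), (9000, 12000)]
clusters = cluster_discordant_pairs(pairs, mean_insert=300, sd_insert=30)
```

Each surviving cluster could then be scored with features such as local depth of coverage and mapping quality, as the description suggests.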
developers: Qi J & Zhao F
An Integrative Next-generation Genome Analysis Pipeline, guided by a Bayesian principle, to detect single nucleotide polymorphisms (SNPs) and insertions/deletions (indels). inGAP can be applied to the mapping of both Roche/454 and Illumina reads with no restriction on read length. inGAP also provides functions for comparing multiple genomes and assisting bacterial genome assembly.
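The description does not spell out the Bayesian model inGAP uses. As a generic sketch of Bayesian SNP calling, the code below computes posterior probabilities for three diploid genotypes from reference and alternate read counts, using a binomial likelihood. The error rate, the priors and the function name are illustrative assumptions.

```python
from math import comb


def genotype_posteriors(ref_count, alt_count, error=0.01,
                        priors=(0.998, 0.001, 0.001)):
    """Posterior over (hom-ref, het, hom-alt) given read counts at one site.

    Generic textbook-style sketch; not inGAP's actual model.
    """
    n = ref_count + alt_count
    # Probability that a read shows the alternate base under each genotype.
    p_alt = (error, 0.5, 1.0 - error)
    likelihoods = [
        comb(n, alt_count) * p ** alt_count * (1.0 - p) ** ref_count
        for p in p_alt
    ]
    # Bayes' rule: posterior is proportional to likelihood times prior.
    joint = [lik * pr for lik, pr in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]


balanced = genotype_posteriors(10, 10)  # even mix of ref and alt reads
all_ref = genotype_posteriors(20, 0)    # only reference reads
```

With balanced counts the heterozygous genotype dominates despite its low prior, while a site covered only by reference reads stays homozygous reference.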
developers: Zhu E
mirTools is a comprehensive web server that allows researchers to characterize the small RNA transcriptome. With mirTools, users can: (i) filter low-quality reads and 3′/5′ adapters from raw sequencing data; (ii) align large-scale short reads to the reference genome and explore their length distribution; (iii) classify small RNA candidates into known categories, such as known miRNAs, non-coding RNA, genomic repeats and coding sequences; (iv) obtain detailed annotation information for known miRNAs, such as miRNA/miRNA*, absolute/relative read counts and the most abundant tag; (v) predict novel miRNAs that have not been characterized before; and (vi) identify differentially expressed miRNAs between samples.
developers: Zhao F
Gap closing is considered one of the most challenging and time-consuming tasks in bacterial genome sequencing projects, especially with the emergence of new sequencing technologies, such as pyrosequencing, which may produce large amounts of data without the benefit of large insert libraries for contig scaffolding. We propose a novel algorithm to align contigs with more than one reference genome at a time. This approach can overcome the limitations caused by a low degree of conserved gene order between the reference and target genomes. A pheromone trail-based genetic algorithm (PGA) was used to search globally for the optimal placement of each contig.
developers: Hou L
MagicViewer is an integrated software tool for visualizing short read mapping and for identifying and annotating genetic variation based on the reference genome. MagicViewer provides a user-friendly environment in which large-scale short reads can be displayed in a zoomable interface with a user-defined color scheme, in an operating system-independent manner. It also includes a versatile computational pipeline for genetic variation detection, filtration, annotation and visualization, providing details of search options, functional classification, subset selection, sequence association and primer design.
developers: Zhao F & Wu J
A web server to help users automate gap closing based on comparative genomic syntenies. Extensive evaluations showed that it significantly outperforms previous methods and can produce highly accurate layout results, especially when assembling genomes that are only moderately related. The availability of such a platform would greatly benefit the research community working on bacterial genomics.