Genetic Variation and Precision Medicine


Extensive studies have shown that genomic structural variation (SV) is involved in various human genetic disorders. As a key technique in precision medicine, SV detection has proven to be one of the most efficient ways to screen for candidate disease-related genes. However, current SV detection algorithms are far from perfect and perform poorly on low-frequency and heterozygous SVs, especially those adjacent to repetitive regions. We aim to develop new computational algorithms, based on machine learning and statistical approaches, for identifying SVs associated with repetitive sequences and locating their precise breakpoints. We will focus on detecting SVs from paired and family-trio data, and will use a multi-signal strategy to build statistical models that estimate heterozygosity rates and filter false positives. This will help detect de novo SVs and homozygous deletions in the personal genomes of individuals with inherited diseases. In addition, we will set up a distributed system for SV detection and annotation, and use this platform to explore SV patterns in human personal genomes.
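As a rough illustration of the trio-based screening described above, the sketch below flags a child's SV call as a candidate de novo event when no call of the same type in either parent overlaps it reciprocally. The `SV` record, the function names, and the 50% reciprocal-overlap threshold are assumptions made for this example; they are not taken from our pipeline, which additionally models heterozygosity and filters false positives.

```python
from typing import List, NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV record used only for this illustration."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive, assumed to be greater than start
    svtype: str  # e.g. "DEL", "DUP", "INV"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if the calls cannot match."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def candidate_de_novo(child: List[SV], father: List[SV], mother: List[SV],
                      min_ro: float = 0.5) -> List[SV]:
    """Keep child calls that no parental call matches at the assumed threshold."""
    parents = father + mother
    return [c for c in child
            if all(reciprocal_overlap(c, p) < min_ro for p in parents)]


child_calls = [SV("chr1", 100, 200, "DEL"), SV("chr2", 500, 900, "DUP")]
father_calls = [SV("chr1", 110, 210, "DEL")]
mother_calls: List[SV] = []
de_novo = candidate_de_novo(child_calls, father_calls, mother_calls)
```

Here the chr1 deletion is treated as inherited because the father carries a call with 90% reciprocal overlap, while the chr2 duplication remains a candidate. A missed call in a parent would produce a false candidate, which is one reason statistical filtering is needed in practice.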


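Genetic variation underlies not only human phenotypic variation but also susceptibility to disease. According to the number of bases affected, genetic variation can be divided into single nucleotide polymorphisms (SNPs) and structural variation. SNPs and tandem repeats were once regarded as the predominant forms of human genetic variation, but recent studies show that genomic structural variation is widespread in both healthy individuals and patients, where it affects gene expression and phenotype and can even cause disease or raise the risk of complex diseases. In recent years, precision medicine built on genome sequencing has attracted wide attention, and identifying gene mutations is one of its core technologies. However, algorithms for mining genetic variation, especially structural variation, from massive genomic data are still far from mature. We focus on key problems in structural variation detection (such as precise localization of breakpoints and identification of structural variants near repetitive regions), propose new computational methods, and build robust statistical models and quality assessment criteria, so that structural variation can be mined quickly and accurately from massive data. With advances in high-throughput sequencing and the growing number of personal genomes, in-depth mining and analysis of the genetic variation they contain will be important for understanding the molecular mechanisms of complex diseases, identifying susceptibility genes, and clarifying the relationship between genetic variation and disease phenotypes.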


Bioinformatics in Circular RNAs


Recent studies reveal that circular RNAs (circRNAs) are a novel class of abundant, stable, and ubiquitous noncoding RNA molecules in eukaryotes. Comprehensive analysis of circRNAs from high-throughput transcriptome data is an initial and crucial step in studying their biogenesis and function. We proposed a novel chiastic clipping signal-based algorithm that detects circRNAs from transcriptome data accurately and without bias by employing multiple filtration strategies (Genome Biology, 2015; Briefings in Bioinformatics, 2018). We further described a new feature, reverse overlap, for circRNA detection, which outperforms back-splice junction-based methods in identifying low-abundance circRNAs. By combining both features, we developed a novel approach for effective reconstruction of full-length circRNAs and isoform-level quantification from the transcriptome (Genome Medicine, 2019). To understand the diversity of circRNAs comprehensively and prioritize their significance, we presented a large-scale study of circRNA repertoires from multiple tissues of human, macaque, and mouse. We delineated the genome-wide expression patterns and evolutionary conservation of circRNAs and showed that they are highly tissue-specific and exhibit expression patterns distinct from those of linear transcripts (Cell Reports, 2019). We also developed CIRIquant for circRNA quantification and differential expression analysis, which provides more accurate expression values for circRNAs with a significantly reduced FDR (Nature Communications, 2020). We built a Java command-line tool for quantifying and visualizing circRNAs by integrating the alignments and junctions of circular transcripts (Bioinformatics, 2020). We aim to develop bioinformatics tools for exploring the landscape of circRNAs and expanding our knowledge of circRNAs on a genome-wide scale.
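One way to picture a read spanning a back-splice junction is that its two aligned segments map to the genome in the opposite order from their order in the read. The sketch below tests for that pattern only. It is a simplified illustration rather than the published algorithm: the `Segment` record, the function name, and the `max_span` cutoff are hypothetical, both segments are assumed to be described in reference-forward orientation, and the additional filtration steps mentioned above are omitted.

```python
from typing import NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    """Hypothetical record for one aligned piece of a split read."""
    chrom: str
    strand: str
    read_start: int  # offset of the piece within the read (reference orientation)
    ref_start: int   # 0-based, inclusive
    ref_end: int     # exclusive


def back_splice_junction(a: Segment, b: Segment,
                         max_span: int = 100_000) -> Optional[Tuple[str, int, int]]:
    """Return (chrom, start, end) of the putative circle, or None if not chiastic."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return None
    # Order the two pieces by their position in the read.
    first, second = sorted((a, b), key=lambda s: s.read_start)
    # Linear splicing: the first piece maps upstream of the second.
    # Back-splicing: the first piece maps downstream of the second.
    if not (second.ref_start < first.ref_start and second.ref_end < first.ref_end):
        return None
    start, end = second.ref_start, first.ref_end
    if end - start > max_span:  # assumed sanity limit on circle span
        return None
    return (a.chrom, start, end)


head = Segment("chr1", "+", 0, 1950, 2000)   # read start maps near the circle's end
tail = Segment("chr1", "+", 50, 1000, 1050)  # read end maps near the circle's start
junction = back_splice_junction(head, tail)

linear_head = Segment("chr1", "+", 0, 1000, 1050)
linear_tail = Segment("chr1", "+", 50, 1950, 2000)
not_circular = back_splice_junction(linear_head, linear_tail)
```

In this example `junction` describes a putative circle spanning chr1:1000-2000, while the linearly ordered pair yields `None`. The argument order does not matter because the segments are sorted by read offset first.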


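We have established a series of methods and tools for circRNA identification, transcript assembly, alternative splicing detection, expression quantification, and functional annotation. (1) We proposed a circRNA identification technique that does not depend on reference genome annotation and detects circRNAs from sequencing data efficiently and without bias (Genome Biology, 2015); (2) we were the first to propose an algorithm based on a multiple seed matching strategy, with a maximum likelihood estimation model that excludes interference from linear transcripts or splicing by-products and greatly improves the precision of circRNA identification (Briefings in Bioinformatics, 2018); (3) we were the first to use the split-alignment features of read pairs spanning circRNA junction sites to identify exon structures and alternative splicing events precisely, and found that, compared with linear mRNAs, circRNAs have distinct splicing mechanisms and preferences for specific splicing factors (Nature Communications, 2016); (4) we proposed a new method for reconstructing and quantifying circular transcripts, which uses the reverse overlap feature in circular transcript sequencing to obtain full-length sequences; this resolves the difficulty of reconstructing the internal structure of circular transcripts and offers a new approach to quantifying their different splicing products (Genome Medicine, 2019); (5) we established a method for accurate circRNA quantification and differential expression analysis, and found two classes of circRNAs, those with altered linear-to-circular ratios and those with altered back-splice site usage preference (Nature Communications, 2020); (6) through multi-faceted analysis of circRNA diversity, conservation, splicing patterns, and circularization ratios relative to linear RNAs, we established new methods for circRNA functional annotation and conservation assessment (Cell Reports, 2019); (7) through deep transcriptome sequencing and analysis across multiple species and tissues, we identified more than one million high-confidence circRNAs and screened out those that are highly conserved (Genome Biology, 2020). These studies have enriched our understanding of circRNA biogenesis and function and provide important tools for dissecting this new class of noncoding RNA molecules.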


Metagenomics and Human Health


High-throughput sequencing technologies enable us to sequence uncultured microbes sampled directly from their habitats, which is expanding and transforming our view of the microbial world. However, extracting meaningful information from tens of millions of very short sequences poses a serious challenge for computational biologists. Currently available computational methods for metagenomics were developed on either low-throughput data or a few well-studied microbiomes, and they encounter considerable difficulties when applied to novel environmental communities. One of the major challenges is how to assemble and functionally annotate metagenomic sequences without closely related reference genomes. We aim to develop a new strategy for assembling metagenomic sequences by combining shotgun sequencing with single-cell sequencing, and to design new algorithms for annotating metagenomes without closely related reference sequences. In addition, we will use parallel computing technologies to set up an integrated platform for metagenomic studies, and combine the power of genomics, bioinformatics, and systems biology to understand human microbiomes.


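Microbes are found in virtually every ecological environment and are closely tied to industry and agriculture, daily life, and human health. Metagenomics based on high-throughput sequencing has become the principal means of studying the composition, structure, and function of microbial communities. However, owing to the limitations of high-throughput sequencing, the experimental techniques and computational methods used in metagenomic research have run into many difficulties. We have established a series of new techniques and methods for resolving microbiome structure and mining microbiome function. They address assembly, reconstruction and annotation of coding genes, and microbial interactions, and lay a methodological foundation for in-depth mining of microbiome big data. (1) metaSort is the first technique to combine single-cell sequencing with whole-genome shotgun sequencing; by establishing new sequence binning and graph-theoretic algorithms, it reconstructs the whole-genome sequences of different species in a microbial community (Nature Communications, 2017). This method addresses the difficulty of resolving the genomes of unknown microbes from new environments and is valuable for studying the species and functional profiles of complex microbial communities. (2) inGAP-cdg does not depend on assembly; it identifies and reconstructs complete coding genes directly from unassembled sequences, greatly improving the efficiency of gene reconstruction (Genome Biology, 2016). It provides a convenient tool for genomics research, including microbiome studies, and will help us dissect the gene content and evolution of non-model organisms. (3) riboFR-seq is the first technique to establish an effective link between species profiles and functional gene profiles, providing an important tool for dissecting the composition and function of environmental microbes (Nucleic Acids Res, 2016). Because it can determine 16S rRNA copy number, it can correct the bias in abundance estimates caused by copy number differences and reflect the diversity and composition of environmental microbes more faithfully. (4) We established an accurate method for microbial source tracking and used it to reveal how microbial perturbations caused by health conditions during pregnancy affect the neonatal microbiota. We found for the first time that the microbial community compositions of pregnant women and newborns tend to converge under disease conditions, clarifying the exchange of microbiota between mother and infant and its effect on health (Gut, 2018; Gut, 2019).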
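The copy number correction mentioned in item (3) can be illustrated with a little arithmetic: a taxon carrying seven 16S rRNA copies per genome contributes about seven times as many 16S reads per cell as a taxon with one copy, so raw read fractions overstate its abundance. The sketch below divides each taxon's read count by its copy number and renormalizes. The function name, the taxon labels, and the numbers are made up for illustration; this is the general correction, not the riboFR-seq implementation.

```python
from typing import Dict


def copy_number_corrected_abundance(read_counts: Dict[str, float],
                                    copy_numbers: Dict[str, float]) -> Dict[str, float]:
    """Convert 16S read counts to relative abundances adjusted for copy number."""
    adjusted = {}
    for taxon, count in read_counts.items():
        copies = copy_numbers.get(taxon)
        if copies is None or copies <= 0:
            raise ValueError(f"missing or invalid 16S copy number for {taxon}")
        # Reads per 16S copy approximate the number of genomes (cells) sampled.
        adjusted[taxon] = count / copies
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {taxon: value / total for taxon, value in adjusted.items()}


# Hypothetical example: taxon_A has 7 copies per genome, taxon_B has 1.
counts = {"taxon_A": 700, "taxon_B": 100}
copies = {"taxon_A": 7, "taxon_B": 1}
corrected = copy_number_corrected_abundance(counts, copies)
```

With these made-up numbers, the raw read fractions are 87.5% and 12.5%, but after correction the two taxa are estimated to be equally abundant.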