PREFACE


Twenty-three curated genetic and genomic resources are integrated into the PAFA online platform. They fall into five categories: non-pathogenic variants, pathogenic variants, genomic annotations, disease-associated genes, and pathways. All annotations use coordinates from the GRCh37 assembly of the human genome. Built on these resources, the PAFA online platform offers five functions:
  • 1000 GENOMES: it provides population-level information for variants found in 1000 Genomes, including FST index, allele frequency spectrum and associated genomic elements;
  • ANNOTATION: it annotates variants with known risk variants, genomic annotations, disease-associated genes and pathways recorded in curated databases;
  • VSEA: it provides enrichment analysis for a set of variants;
  • PAFA: users can search and download PAFA scores;
  • SEARCH: it provides an integrated web browser for genes and their annotations.


PAFA


The PAFA workflow consists of two parts: construction of the PAFA classifier and gene-centric annotation. The PAFA classifier is a sparse logistic regression model with L1 regularization. The variants used to train it are labeled as either the functional set or the control set. Gene-centric annotation is based on curated genomic databases, including ENCODE, GENCODE and UTRdb. The features used in PAFA fall into three categories: population-level metrics, evolutionary conservation and genomic annotations.
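To illustrate what "sparse logistic regression with L1 regularization" means in practice, the following is a minimal sketch that fits such a model by proximal gradient descent. It is not the PAFA training code: the function names, the hyperparameters (`lam`, `lr`, `n_iter`) and the toy data are assumptions made for illustration. Labels follow the convention 1 = functional set, 0 = control set.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logistic(X, y, lam=0.1, lr=0.5, n_iter=500):
    """Fit logistic regression with an L1 penalty on the weights.

    X: list of feature rows, y: list of 0/1 labels.
    lam, lr and n_iter are illustrative values, not PAFA settings.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        # Residual p_i - y_i for each training variant.
        residuals = []
        for row, label in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, row)) + b
            residuals.append(1.0 / (1.0 + math.exp(-z)) - label)
        grad_w = [sum(r * row[j] for r, row in zip(residuals, X)) / n
                  for j in range(d)]
        grad_b = sum(residuals) / n
        # Gradient step followed by soft-thresholding (the L1 part).
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the intercept is not penalized
    return w, b


# Toy data: feature 0 separates the classes, feature 1 is uninformative.
X = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
y = [1, 1, 0, 0]
weights, intercept = fit_l1_logistic(X, y)
print(weights)
```

The L1 penalty drives the weight of the uninformative feature to exactly zero, which is why this kind of model is called sparse: only features that help separate the two sets keep a nonzero weight.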

For population differentiation measures, we used the allele frequencies of five super populations: African, American, East Asian, European and South Asian. We calculated FST and the dispersion score (DS) from the allele frequencies and sample sizes of the five super populations. We used four evolutionary conservation scores: phastCons and phyloP, each computed from the 46-way and 100-way alignments. For genomic annotations, we used eight feature groups from ENCODE, including histone modifications (ChIP-Seq), RNA contigs (Long RNA-seq), transcription factor binding sites (TFBS PeakSeq and SPP), open chromatin (DNase-Seq and FAIRE), and transcript start sites (TSS). These annotations are available from the download page.
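As an illustration of how a population differentiation index can be derived from allele frequencies and sample sizes, the sketch below computes a sample-size-weighted Wright's FST, (H_T - H_S) / H_T, for one biallelic variant. The exact estimator used by PAFA is not specified here, so this formula is an assumption, and the frequencies and sample sizes in the example are hypothetical. The dispersion score (DS) is not sketched because its definition is not given on this page.

```python
def wright_fst(freqs, sizes):
    """Sample-size-weighted Wright's FST for one biallelic variant.

    freqs: allele frequency in each population
    sizes: sample size of each population (same order as freqs)
    """
    total = sum(sizes)
    # Pooled allele frequency across all populations.
    p_bar = sum(n * p for n, p in zip(sizes, freqs)) / total
    # Expected heterozygosity in the pooled population.
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # variant is monomorphic overall
    # Weighted mean of within-population expected heterozygosity.
    h_within = sum(n * 2.0 * p * (1.0 - p)
                   for n, p in zip(sizes, freqs)) / total
    return (h_total - h_within) / h_total


# Hypothetical frequencies for AFR, AMR, EAS, EUR, SAS.
freqs = [0.90, 0.20, 0.05, 0.05, 0.10]
sizes = [500, 500, 500, 500, 500]  # hypothetical sample sizes
print(round(wright_fst(freqs, sizes), 3))
```

A value near 0 means the populations share similar allele frequencies; a value near 1 means the variant is close to fixed for different alleles in different populations.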

The PAFA online platform facilitates the use of PAFA. Precomputed PAFA scores for 2.68 billion single nucleotide variants across the human genome are integrated into the platform. Users can access these scores through batch download, or submit a list of genomic locations or variants of interest to obtain the corresponding PAFA scores. Gene-centric annotations are added to the output if the option "Include gene-level annotations in output" is checked. These annotations help users interpret their scoring results at the gene level, and include the genomic elements (such as exons, regulators and TSS) that overlap the target variants.


1000 GENOMES


Users can 1) search relevant information in 1000 Genomes by variant id or position; 2) obtain variants with similar allele frequency spectra.

1) By entering a chromosome region (e.g. chr1:159173683-159174783) or the ID of a variant found in 1000 Genomes (e.g. rs2814778), users can obtain the genomic elements that lie in the region or overlap the target variant, as well as the 1000 Genomes variants located in these genomic elements. The browser displays population-level information for these variants, including allele frequency spectra, population differentiation indices and their rankings by FST index.

2) By entering allele frequencies for the five super populations, African (AFR), American (AMR), East Asian (EAS), European (EUR) and South Asian (SAS), users can obtain 1000 Genomes variants with similar allele frequency spectra. The similarity offset must be between 0 and 1.
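One plausible reading of the similarity offset is sketched below: a variant counts as similar when its allele frequency in every super population lies within the offset of the query frequency. This interpretation is an assumption, not a description of the platform's implementation, and the variant IDs and frequencies are hypothetical.

```python
POPULATIONS = ("AFR", "AMR", "EAS", "EUR", "SAS")


def is_similar(query, candidate, offset):
    """True if every population frequency is within `offset` of the query."""
    if not 0.0 <= offset <= 1.0:
        raise ValueError("offset must be between 0 and 1")
    return all(abs(query[pop] - candidate[pop]) <= offset
               for pop in POPULATIONS)


def find_similar(query, variants, offset):
    """Return the IDs of variants whose spectra are similar to the query."""
    return [variant_id for variant_id, freqs in variants.items()
            if is_similar(query, freqs, offset)]


# Hypothetical query spectrum and candidate variants.
query = {"AFR": 0.9, "AMR": 0.2, "EAS": 0.0, "EUR": 0.0, "SAS": 0.0}
variants = {
    "variant_1": {"AFR": 0.85, "AMR": 0.25, "EAS": 0.0, "EUR": 0.05, "SAS": 0.0},
    "variant_2": {"AFR": 0.40, "AMR": 0.20, "EAS": 0.0, "EUR": 0.00, "SAS": 0.0},
}
print(find_similar(query, variants, offset=0.1))
```

Under this reading, an offset of 0 requires identical spectra, and an offset of 1 matches every variant.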


ANNOTATION



Users can type or paste variants, or upload a file of variants by drag and drop, to annotate them with known risk variants, genomic annotations, disease-associated genes and pathways recorded in curated databases (e.g. ClinVar, COSMIC and ENCODE), and to evaluate variants based on how often their associated genes occur in disease-related databases. PAFA accepts input without a header in the format “TYPE CHROM START END CHROM_BP BREAKPOINT” (e.g. Translocation chr5 43320488 43320498 chr11 108153462) or “TYPE CHROM START END” (e.g. SNP chr10 51602168 51602168). PAFA provides a downloadable Excel file containing all analysis results and presents them in a visual, interactive way.


  1. PAFA arranges and colors variants according to their TYPE (SNP, Insertion or Deletion). It shows the length of a variant if its TYPE is Deletion.

  2. PAFA labels variants and marks them with red dots if they exist in curated databases, such as 1000 Genomes, ESP and dbSNP.

  3. If a variant overlaps disease-associated variants in curated databases, like COSMIC and ClinVar, PAFA shows the number of overlapping risk variants.

  4. PAFA lists all involved genes of these variants. If a variant overlaps a gene’s protein coding region, it will be shown in blue; if a variant overlaps a gene’s noncoding region (e.g. TSS, UTR and regulator), it will be shown in pink.

  5. PAFA shows the number of times that a gene occurs in gene-disease databases, such as OMIM and GAD. A darker grid cell means the gene occurs more often. PAFA also provides a score for the gene by summing its occurrence counts across all gene-disease databases.

  6. PAFA provides a quantitative value for a variant according to the occurrence frequency of its associated gene in current gene-disease databases.
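The gene score in item 5 and the variant value in item 6 can be sketched as follows. The gene score sums a gene's occurrence counts over all gene-disease databases, as described above. How PAFA turns gene scores into a variant value is not detailed here, so summing over the variant's associated genes is an assumption, and the gene names and counts are hypothetical.

```python
def gene_scores(occurrences):
    """Sum each gene's occurrence counts over all gene-disease databases.

    occurrences: {database name: {gene name: occurrence count}}
    """
    scores = {}
    for counts in occurrences.values():
        for gene, count in counts.items():
            scores[gene] = scores.get(gene, 0) + count
    return scores


def variant_score(associated_genes, scores):
    """Assumed rule: add up the scores of the variant's associated genes."""
    return sum(scores.get(gene, 0) for gene in associated_genes)


# Hypothetical occurrence counts in two gene-disease databases.
occurrences = {
    "OMIM": {"GENE_A": 3, "GENE_B": 1},
    "GAD": {"GENE_A": 5},
}
scores = gene_scores(occurrences)
print(scores)
print(variant_score(["GENE_A", "GENE_B"], scores))
```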

VSEA



By typing or pasting variants, or by uploading a file of variants by drag and drop, users can carry out enrichment analysis on a set of variants. PAFA accepts input without a header in the format “TYPE CHROM START END CHROM_BP BREAKPOINT” (e.g. Translocation chr5 43320488 43320498 chr11 108153462) or “TYPE CHROM START END” (e.g. SNP chr10 51602168 51602168).
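To check a variant list before submitting it, the two input formats can be parsed as in the sketch below. The `Variant` class and both function names are hypothetical helpers, not part of PAFA, and the sketch assumes that fields are separated by whitespace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    variant_type: str
    chrom: str
    start: int
    end: int
    chrom_bp: Optional[str] = None    # partner chromosome, if given
    breakpoint: Optional[int] = None  # breakpoint position, if given


def parse_variant_line(line):
    """Parse one headerless line in the 4-field or 6-field format."""
    fields = line.split()
    if len(fields) == 4:
        variant_type, chrom, start, end = fields
        return Variant(variant_type, chrom, int(start), int(end))
    if len(fields) == 6:
        variant_type, chrom, start, end, chrom_bp, breakpoint = fields
        return Variant(variant_type, chrom, int(start), int(end),
                       chrom_bp, int(breakpoint))
    raise ValueError(f"expected 4 or 6 fields, got {len(fields)}: {line!r}")


def parse_variants(text):
    """Parse every non-empty line of the input text."""
    return [parse_variant_line(line)
            for line in text.splitlines() if line.strip()]


text = """SNP chr10 51602168 51602168
Translocation chr5 43320488 43320498 chr11 108153462
"""
for variant in parse_variants(text):
    print(variant)
```

Lines with any other number of fields raise a `ValueError`, so malformed entries are caught before upload.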


To provide enrichment analysis for the target variant set, PAFA includes background resources, such as variants from 1000 Genomes, genomic annotations from GENCODE and ENCODE, and canonical pathways in the Molecular Signatures Database (MSigDB). First, PAFA maps the test variants and the background variants (uploaded or selected by the user) to a range of annotated elements. Then, it obtains the genes related to the test and background variants. Next, it extracts the pathways related to these genes from MSigDB. Finally, based on the relationships among variants, genes and pathways, PAFA uses Fisher’s exact test to calculate a p value that estimates the degree of enrichment in each relevant pathway.
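The final step can be sketched as a one-sided Fisher’s exact test on a 2×2 table for a single pathway. How PAFA builds this table is not specified here, so the layout below (test versus background genes, inside versus outside the pathway) and the choice of a one-sided test are assumptions made for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for over-representation.

    Assumed 2x2 layout for one pathway:
        a: test genes in the pathway        b: test genes not in it
        c: background genes in the pathway  d: background genes not in it
    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1 = a + b           # number of test genes
    col1 = a + c           # number of genes in the pathway
    total = a + b + c + d  # all genes considered
    denominator = comb(total, row1)
    # Sum the probabilities of tables at least as extreme as the observed one.
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / denominator


# Hypothetical counts: 3 of 4 test genes and 1 of 4 background genes
# fall in the pathway.
print(fisher_exact_greater(3, 1, 1, 3))
```

A small p value indicates that the pathway contains more test genes than expected from the background.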


PAFA provides a downloadable Excel file containing all analysis results and presents them visually in five sections: 1) variants and the genes they overlap, listed in a table; 2) genes and their corresponding pathways, listed in a table; 3) relationships among variants, genes and pathways, presented in a network graph; 4) enriched pathways, listed in a table; 5) relationships among pathways and genes, presented in a network graph.



CONTACT


If you have technical problems using PAFA, please check the information provided here. If it does not resolve your issues, please contact us at bioinfo@biols.ac.cn.


PAFA is currently developed by Fangqing Zhao and Zhou Lin.


If you are planning on using PAFA in a commercial application, please contact Fangqing Zhao.