Summary

VarCoPP stands for Variant Combination Pathogenicity Predictor. It is a machine-learning method that predicts the potential pathogenicity of any bi-locus variant combination (i.e. a combination of two to four variant alleles between two genes). It has been trained on digenic disease data present in the Digenic Diseases Database (DIDA) and variant data derived from control individuals of the 1000 Genomes Project (1KGP).

According to VarCoPP, a bi-locus variant combination can be either neutral (i.e. probably neutral) or disease-causing (i.e. candidate or probably pathogenic).

VarCoPP takes a VCF file or a provided variant list with SNVs and small insertions/deletions of known zygosity from a single individual, creates all possible variant combinations (from di-allelic to tetra-allelic) between all pairs of genes present in the list, annotates the data and predicts their pathogenicity. It then provides an output file with information on bi-locus combinations and their prediction scores, ranked from those with the highest pathogenicity score to those with the lowest.

To keep the number of False Positive (FP) results reasonable, we recommend using VarCoPP with variants from up to 150 genes. An initial variant filtering procedure is highly recommended.


PLEASE NOTE:
VarCoPP is a machine-learning method that makes predictions based on probabilities.
It is provided for research purposes only and the pathogenicity predictions should be subject to further research and clinical investigation.
It is not in any way intended to be used as a substitute for professional medical advice, diagnosis, treatment or care.



Citing VarCoPP

If you are using VarCoPP for your analysis, you can cite the following manuscript:

Papadimitriou S., Gazzo A., Versbraegen N., Nachtegael C., Aerts J., Moreau Y., Van Dooren S., Nowé A., Smits G., Lenaerts T. Predicting disease-causing variant combinations. Proceedings of the National Academy of Sciences. May 2019. DOI: https://doi.org/10.1073/pnas.1815601116.




What is VarCoPP?

VarCoPP is an ensemble predictor consisting of 500 individual Random Forest (RF) predictors. Each predictor provides a probability that a particular bi-locus combination is disease-causing.

Each RF predictor was trained on the bi-locus variant combinations from DIDA against a different random subset of bi-locus combinations from the 1KGP. Exonic and splicing variants of up to 3% MAF were used as training data from both data sets, while all genes were protein coding genes, according to the gene types present in DIDA.

The final prediction for each bi-locus combination is based on a majority vote among all predictors: if more than half of the predictors agree that a bi-locus combination is pathogenic, the final prediction is that it belongs to the disease-causing class. The probability threshold for each predictor, above which a bi-locus combination is considered disease-causing, was set to 0.489.

Therefore, a bi-locus combination is predicted as disease-causing if >50% of the RFs agree that it is pathogenic and the median disease-causing probability among all predictors is, consequently, >0.489.
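The voting rule described above can be illustrated with a minimal Python sketch. This is not VarCoPP's own code: the function name, the toy list of five probabilities (instead of 500) and the strict ">" comparison against the threshold are assumptions made for illustration only.

```python
from statistics import median

THRESHOLD = 0.489  # per-predictor probability threshold stated in the text


def majority_vote(probabilities, threshold=THRESHOLD):
    """Return True when more than half of the predictors vote disease-causing.

    Assumption: a predictor votes disease-causing when its probability is
    strictly above the threshold.
    """
    votes = sum(1 for p in probabilities if p > threshold)
    return votes > len(probabilities) / 2


# Hypothetical probabilities from a toy ensemble of 5 predictors.
toy = [0.30, 0.52, 0.61, 0.47, 0.70]
print(majority_vote(toy))        # three of five predictors are above 0.489
print(median(toy) > THRESHOLD)   # the median is then also above 0.489
```

When more than half of the probabilities lie above the threshold, the middle value(s) of the sorted list must lie above it too, which is why the median follows the vote.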




What should be used as variant input?

The input should be a variant list of SNVs and indels from a single individual. It is highly recommended to perform an initial variant filtering procedure beforehand (exonic, nonsynonymous variants of MAF <3%) and to restrict the analysis to variants from up to 150 genes, in order to limit the number of non-relevant combinations that will be tested. VarCoPP does not provide filtering options, so if you want to filter a VCF of a complete exome and you do not know how, please contact us and we will provide you with guidelines.
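To give a feeling for why the number of genes matters, the short Python sketch below counts how many gene pairs have to be considered for a given number of genes. It only counts pairs, not variant combinations (each pair can yield several), and the figure of 20,000 genes is a hypothetical round number used for illustration.

```python
from math import comb


def gene_pair_count(n_genes):
    """Number of distinct unordered gene pairs among n_genes genes."""
    return comb(n_genes, 2)


print(gene_pair_count(150))    # 11175
print(gene_pair_count(300))    # 44850
print(gene_pair_count(20000))  # 199990000 (hypothetical unfiltered gene set)
```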

In the right panel of the "Submit" page, you can provide a VCF file. From this VCF file, VarCoPP uses the header line and all information present in the columns: CHR, POS, REF, ALT, FORMAT, NAME_OF_PATIENT (the column of the patient information). Any other meta-information lines or extra fields will be ignored. For information on how a VCF should be constructed, you can consult the specification page.

Alternatively, you can also copy-paste a variant list directly in the white box on the left panel of the "Submit" page, with every variant being on a different line. Each line should contain tab-delimited information about the CHROMOSOME, POSITION, ID ('.' if not available), REFERENCE ALLELE, ALTERNATIVE ALLELE and ZYGOSITY (values: Heterozygous or Homozygous) of the variant. No headers are needed. You can also further manually insert a particular variant in that list using the column boxes: CHR, POS, ID, REFERENCE, ALTERNATIVE, ZYGOSITY (values: Heterozygous or Homozygous) and click on the (+) button.
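If you want to check your pasted list before submitting it, a small Python sketch such as the one below can help. It is not part of VarCoPP; the `Variant` type and the `parse_variant_line` function are hypothetical names, and the check assumes exactly six tab-delimited fields per line as described above.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    var_id: str
    ref: str
    alt: str
    zygosity: str


ALLOWED_ZYGOSITY = {"Heterozygous", "Homozygous"}


def parse_variant_line(line):
    """Parse one tab-delimited line: CHR, POS, ID, REF, ALT, ZYGOSITY."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-delimited fields, got {len(fields)}")
    chrom, pos, var_id, ref, alt, zygosity = (f.strip() for f in fields)
    if zygosity not in ALLOWED_ZYGOSITY:
        raise ValueError(f"unknown zygosity: {zygosity!r}")
    return Variant(chrom, int(pos), var_id, ref, alt, zygosity)


# Example line using a variant shown elsewhere on this page.
example = "2\t177054850\t.\tC\tG\tHeterozygous"
print(parse_variant_line(example))
```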

In either case, you should always specify the gender of the patient, if available. The tool uses the GRCh37/hg19 genome version by default, and the complete annotation is available for this version. VarCoPP does not convert genomic coordinates between genome versions.




How does VarCoPP annotate the data?

VarCoPP annotates the variants, genes and gene pairs present in the user's list with the biological features needed for predictions:

 
VarCoPP first annotates variants and genes with dbNSFP v3.5 and Ensembl (GRCh37/hg19 genome version) using only the canonical transcript of each gene. These transcripts were defined based on the guidelines of Ensembl. If a CADD score is not available for a particular variant, this variant is excluded from the analysis. Amino acid flexibility and hydrophobicity differences between the wild-type and the mutated protein sequence are computed using an in-house script. For small insertions and deletions, we also obtain protein sequences from UniProt, starting from the canonical Ensembl transcript identifiers, and we check that the reference amino acid is indeed present at the expected position in the canonical protein sequence.

All possible gene pairs are created from the variant set and annotated with the biological distance feature. For each bi-locus variant combination, gene A is the gene with the lowest Gene Damage Index (GDI) score and thus the one with a higher probability of being associated with a disease. For more information you can consult the publication: Itan Y, et al. (2015). The human gene damage index as a gene-level approach to prioritizing exome variants. Proc Natl Acad Sci USA 112(44):13615–13620, doi: 10.1073/pnas.1518646112.

Then, all possible bi-locus variant combinations are created from these pairs, including tri-allelic and tetra-allelic combinations and those with two different heterozygous variants in one of the two genes (heterozygous compound). If two different variant alleles are present in the same gene for a combination, these are ordered so that the first variant allele is the one with the highest CADD score. At the moment, tetra-allelic digenic combinations with four different heterozygous variants are not created, as these types of bi-locus combinations were missing from DIDA at the moment VarCoPP was created.
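The enumeration described above can be sketched roughly as follows in Python. This is an illustration under stated assumptions, not VarCoPP's actual implementation: the function names, the tuple layout `(variant_key, zygosity, cadd)` and the example coordinates and CADD scores are all hypothetical, and special cases such as X-linked variants in males are ignored.

```python
from itertools import combinations, product


def gene_allele_sets(variants):
    """List the allele sets one gene can contribute to a combination.

    `variants` is a list of (variant_key, zygosity, cadd) tuples for one gene.
    """
    het = [v for v in variants if v[1] == "Heterozygous"]
    hom = [v for v in variants if v[1] == "Homozygous"]
    options = [(v[0],) for v in het]        # one heterozygous variant allele
    options += [(v[0], v[0]) for v in hom]  # homozygous: two identical alleles
    for a, b in combinations(het, 2):       # heterozygous compound
        # Order the pair so that the allele with the highest CADD comes first.
        first, second = sorted((a, b), key=lambda v: v[2], reverse=True)
        options.append((first[0], second[0]))
    return options


def is_compound(option):
    """True for two different heterozygous variants in the same gene."""
    return len(option) == 2 and option[0] != option[1]


def bilocus_combinations(gene_a_variants, gene_b_variants):
    """Pair the allele sets of two genes, skipping compound x compound."""
    combos = []
    for a, b in product(gene_allele_sets(gene_a_variants),
                        gene_allele_sets(gene_b_variants)):
        if is_compound(a) and is_compound(b):
            continue  # four different heterozygous variants: not created
        combos.append((a, b))
    return combos


# Hypothetical variants for two genes.
gene_a = [("1:100:A:T", "Heterozygous", 20.0), ("1:200:G:C", "Heterozygous", 25.0)]
gene_b = [("2:300:C:G", "Homozygous", 18.0),
          ("2:400:T:A", "Heterozygous", 15.0),
          ("2:500:G:A", "Heterozygous", 30.0)]
print(len(bilocus_combinations(gene_a, gene_b)))  # 3 x 4 sets, minus 1 excluded
```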




Output files

After the predictions, an output file is created and made available for download on the Results page. This file contains the prediction for each bi-locus combination and further evaluation scores (see below). The bi-locus combinations are ranked so that those most likely to be disease-causing are on top.


Allele representation in the output file

Based on the zygosity and gender information, each variant at each gene in the output file is represented with two alleles separated with a "/". You can find below some examples:
2:177054850:C:G/.. : Heterozygous variant, so the second allele is wild-type.
2:177054850:C:G/2:177054850:C:G : Homozygous variant, so both alleles are mutated.
21:45719992:G:A/21:45730895:G:C : Heterozygous compound variant (i.e. two different heterozygous variants in the same gene).
X:107841975:C:A/.. : Heterozygous X-linked variant in a female, so the second allele is wild-type.
X:107841975:C:A/na : X-linked variant in a male, so the second allele is absent.
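The representation rules illustrated by these examples can be summarised in a small Python sketch. The function `allele_pair` and the `"male"`/`"female"` values are hypothetical names chosen for illustration; the sketch only reproduces the five cases listed above and is not the code used by VarCoPP.

```python
def allele_pair(first, second=None, zygosity="Heterozygous", sex=None):
    """Format the two alleles of one gene as 'allele1/allele2'.

    `first` and `second` are 'CHR:POS:REF:ALT' strings; `second` is only
    given for a heterozygous compound (two different variants in one gene).
    """
    chrom = first.split(":")[0]
    if second is not None:
        return f"{first}/{second}"   # heterozygous compound
    if chrom == "X" and sex == "male":
        return f"{first}/na"         # second allele is absent
    if zygosity == "Homozygous":
        return f"{first}/{first}"    # both alleles are mutated
    return f"{first}/.."             # second allele is wild-type


print(allele_pair("2:177054850:C:G"))
print(allele_pair("X:107841975:C:A", sex="male"))
```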


Confidence zone information

If a combination is predicted as disease-causing and also falls into the 95% or the 99% confidence zone (see below on how these are calculated), the corresponding information will be present in this column for that combination. Otherwise, this field will be empty.




Evaluation scores

For each bi-locus combination we provide two prediction scores that are also used for ranking.


Classification Score (CS)

The classification score (CS) of a bi-locus combination is defined as the median probability of that combination being disease-causing among all RFs. It can take values between 0 and 1. For disease-causing combinations, CS is always larger than 0.489.


Support Score (SS)

The Support Score (SS) of a bi-locus combination indicates the percentage of RFs that agree on the disease-causing label. It can therefore take values from 0 (no RF agrees on the disease-causing class) to 100 (all RFs agree on the disease-causing class). For disease-causing combinations, SS is always larger than 50.
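As an illustration of the two definitions, the Python sketch below computes CS and SS from a list of per-predictor probabilities. The function names and the toy values (five probabilities rather than 500) are assumptions for illustration, as is the strict comparison against the 0.489 threshold.

```python
from statistics import median


def classification_score(probabilities):
    """CS: median disease-causing probability among all predictors."""
    return median(probabilities)


def support_score(probabilities, threshold=0.489):
    """SS: percentage of predictors voting for the disease-causing class."""
    votes = sum(1 for p in probabilities if p > threshold)
    return 100 * votes / len(probabilities)


# Hypothetical probabilities from a toy ensemble of 5 predictors.
toy = [0.30, 0.52, 0.61, 0.47, 0.70]
cs = classification_score(toy)
ss = support_score(toy)
print(cs, ss)  # 0.52 60.0
```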




95%- and 99%- confidence zones

We have defined 95%- and 99%-confidence zones, delimited by minimal CS and SS scores, for further evaluation and filtering of the predictions. These were created by testing neutral bi-locus combinations from the 1000 Genomes Project and obtaining the minimal CS and SS scores that gave 5% and 1% False Positives. If a combination predicted as disease-causing also falls into either one of the two zones, an indication will appear in the corresponding column in the output file. Otherwise, this field will be empty.
95%-confidence zone : requires CS≥0.55 and SS≥75. If a digenic combination falls inside this zone, it has 5% probability of being a FP result.
99%-confidence zone : requires CS≥0.74 and SS=100. If a digenic combination falls inside this zone, it has 1% probability of being a FP result.
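The two zone definitions translate into a simple check, sketched below in Python. The function name and the returned labels ("99%", "95%" or an empty string) are hypothetical; the exact text written to the output column may differ.

```python
def confidence_zone(cs, ss):
    """Return the strictest confidence zone a combination falls into.

    cs: Classification Score (0 to 1); ss: Support Score (0 to 100).
    """
    if cs >= 0.74 and ss == 100:
        return "99%"
    if cs >= 0.55 and ss >= 75:
        return "95%"
    return ""  # outside both zones: the field stays empty


print(confidence_zone(0.80, 100))  # 99%
print(confidence_zone(0.60, 80))   # 95%
```

Because the 99% zone lies inside the 95% zone, the stricter test is applied first.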




S-plot

The S-plot represents each bi-locus combination according to its SS and CS. The grey dots represent the 10,000 neutral variants used as negative control during the development of this tool. Bi-locus combinations are classified in four different categories:

  1. 99% confidence disease-causing bi-locus combinations (dark red colour), with a support score of 100 and a classification score of at least 0.74.
  2. 95% confidence disease-causing bi-locus combinations (red colour), with a support score of at least 75 and a classification score of at least 0.55.
  3. The rest of the predicted disease-causing bi-locus combinations (orange colour) that do not belong to either one of the two confidence zones.
  4. The bi-locus combinations predicted as neutral (blue colour).





Frequently Asked Questions

If answers to your questions are not provided here or have not proved helpful, you can contact us at: varcopp@ibsquare.be

How can I cite VarCoPP?
If you are using VarCoPP for your analysis, you can cite the following manuscript:

Papadimitriou S., Gazzo A., Versbraegen N., Nachtegael C., Aerts J., Moreau Y., Van Dooren S., Nowé A., Smits G., Lenaerts T. Predicting disease-causing variant combinations. Proceedings of the National Academy of Sciences. May 2019. DOI: https://doi.org/10.1073/pnas.1815601116.

How can I filter my variants for a frequency and/or gene panel?
This website does not offer a possibility to filter your variants based on their frequency or the genes they belong to. You should either prefilter your data yourself prior to using VarCoPP, or use our ORVAL platform: https://orval.ibsquare.be that incorporates VarCoPP and provides automated prefiltering options for your VCF file or variant list before the predictions.

I keep seeing the same page after I have submitted my data
After you submit your variants with VarCoPP you will stay on the same page until the prediction process is finished. During this time the Submit page of VarCoPP will be in "loading" mode. The loading time depends on the number of variants you have submitted.
Once the results are finished, you will be re-directed to a new results page.

The VarCoPP page after the data submission is loading for a long time, is it normal?
Yes, it is normal. Depending on the number of variants you have submitted, the loading time can range from a couple of seconds to several minutes. If the loading time exceeds several hours or ends with an error message, please contact us to resolve the issue.
Also, it is generally not recommended to upload a complete exome to VarCoPP, but rather to pre-filter your variants based on a selected gene panel of 150 - 300 genes, in order to avoid testing a lot of unrelated combinations, which can lead to a large number of False Positives.
If you have questions regarding how this filtering could be done, you can contact us to provide you with guidelines.

Some variants are missing from the results
Some variants may be missing from the database, a CADD score may not be provided for these variants or the genome version for those variants may not be correct. In case of manual insertion of variants by using the left panel of the "Submit" page, make sure that the zygosity values are not misspelled ("Heterozygous" or "Homozygous" values are accepted).

All variants are missing from the results
Most probably the genome version is incorrect, you have included variants for only one gene or there is some problem with the format of your VCF file/variant list. VarCoPP automatically annotates variants using the GRCh37/hg19 genome version. Also, make sure that your variant list in the left submission box panel contains tab or space delimited columns with CHR, POS, ID ('.' if not available), REF, ALT, ZYGOSITY information. You can consult the "What should be used as variant input?" section of this page to get detailed information.

I see that the "95-99%_confidence" column is empty for all or some of my results
If a combination is predicted as disease-causing and also falls into the 95% or the 99% confidence zone, the corresponding information will be present in this column. Otherwise, this field will be empty.

I see that some types of bi-locus combinations are missing from the results
At the moment, tetra-allelic bi-locus combinations with four different heterozygous variants in two different genes are not created. These types of bi-locus variant combinations were missing from DIDA when VarCoPP was created and are not included in the analysis to avoid bias in the predictions.

Is there a limit for the uploaded variants?
We do not restrict the number of variants but we strongly recommend the use of variants from up to 150 genes, as well as an initial variant filtering procedure, in order to limit the number of non-relevant combinations that will be tested.

Can I include variants from multiple patients in my VCF or variant list?
No, you should strictly use variants from a single individual as VarCoPP creates all possible variant combinations from the variants list assuming they are present in one individual only. If your VCF file contains variants from multiple individuals, you should separate them beforehand into multiple VCF files and use them separately in the tool.

Which versions of VarCoPP and the annotation tools are you using?
You can download the current version of VarCoPP and the databases/tools we are using to annotate and predict your variant combinations.




Data Privacy

Data privacy declaration


We hereby declare that:

  1. Any variant and prediction results are deleted 7 days after the original data submission date. We do not use this information in the meantime.
  2. We track user traffic (e.g. IP addresses, successful sessions) for job monitoring purposes and we also use Google Analytics to track general visitor traffic information. We do not take any responsibility for the data stored by this third party application.
  3. The personally identifiable data of the users collected through this website is only accessible to selected researchers of Vrije Universiteit Brussel (VUB) and Université Libre de Bruxelles (ULB) who manage the VarCoPP site. The data will not be shared with anyone else.

ULB and VUB have data protection officers who are responsible for matters related to privacy and data protection.
To reach the DPOs of the universities you can send an email to:
ULB DPO: rgpd@ulb.ac.be (Université Libre de Bruxelles, Data protection officer, Avenue Franklin Roosevelt 50, CP 130, 1050 Brussels).
VUB DPO: dpo@vub.be (Vrije Universiteit Brussel, Data Protection Officer, Pleinlaan 2, 1050 Brussels).


Data subjects' rights with regard to personal data:


On May 25th 2018, the "General Data Protection Regulation" took effect. The GDPR is a European regulation which grants individuals rights with respect to the way their personal data is handled and protected. Individuals may, for example - depending on the legal basis for the processing of their personal data and dependent on the fulfilment of certain conditions - exercise a right to:

  1. inquire as to what personal data is processed and, when the data is provided to the VUB by a third party, inquire into the source of this information
  2. request the correction of data insofar as it is incorrect
  3. object to the processing of his or her data
  4. know of the existence of possible automated decision making processes, and, when these are used to create profiles, inquire into the logic underlying these processes, the purposes they serve, and their consequences
  5. ‘be forgotten’ by an institution that has processed their personal data

The competent national authority concerning privacy and data protection is the Data Protection Authority (GBA: "Gegevensbeschermingsautoriteit", https://www.dataprotectionauthority.be/). This is the authority that monitors privacy law compliance and where any individual can file a complaint regarding privacy and the processing of personal data.