GHDDI Computational Effort for COVID-19

We will continuously publish our computational drug discovery efforts on this page, including AI-based prediction, physics-based virtual screening, molecular dynamics simulations, and other cheminformatics and bioinformatics inferences. The aim is to facilitate community-wide experimental work. Please note that for any data-driven approach, predictions should be interpreted only within the scientific scope of the original training set. In addition, for every type of predictive model, the noise and bias inherited from the training set or from empirical parameters, as well as the limitations of the mathematical approximations used, have to be considered. These computational results therefore need to be analyzed further by experienced medicinal chemists and biologists, and ultimately backed up by wet-lab experiments, before any rigorous scientific conclusion is drawn. A web-based service built on these predictive models is open so that you can screen your own compound libraries. The models and results are updated as new evidence is collected.

GHDDI Web Virtual Screening Service for COVID-19

Our web service is now available and can be used as a virtual screening tool.

Drug Repurposing Effort

A. Ligand-based AI models

We have tried training sets covering different virus species and their targets to build target-specific or phenotype-based classification AI models with HAG-net, the deep learning system developed at GHDDI. Only models with a 5-fold cross-validation AUC > 0.8 qualify for further predictive use, and the reported results are ensemble predictions. SARS-CoV-2 viral targets that show relatively high cross-species conservation, including RDRP, helicase, and 3C-like protease, are prioritized in this effort. As part of the drug repurposing effort, we use these models to predict different bioactivities of the approved or investigational drug molecules (~12K) in the GHDDI stock. Because we are constantly improving our algorithm and expanding our training data, the results will be updated periodically.
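The qualification and ensemble step described above can be sketched as follows. This is a minimal illustration, not the HAG-net implementation: the model records, the `predict` callables, and the use of a plain mean to combine predictions are all assumptions, since the text does not state how the ensemble is formed.

```python
from statistics import mean

AUC_THRESHOLD = 0.8  # qualification cutoff stated in the text


def qualified(models):
    """Keep only models whose 5-fold cross-validation AUC exceeds the cutoff."""
    return [m for m in models if m["cv_auc"] > AUC_THRESHOLD]


def ensemble_score(models, smiles):
    """Average the predicted activity probabilities of the qualified models.

    Averaging is an assumption; the source only says the results are
    ensemble predictions.
    """
    kept = qualified(models)
    if not kept:
        raise ValueError("no model passed the AUC cutoff")
    return mean(m["predict"](smiles) for m in kept)


# Hypothetical stand-in models: each returns a fixed probability.
models = [
    {"name": "model_a", "cv_auc": 0.94, "predict": lambda smiles: 0.9},
    {"name": "model_b", "cv_auc": 0.85, "predict": lambda smiles: 0.7},
    {"name": "model_c", "cv_auc": 0.75, "predict": lambda smiles: 0.1},  # rejected
]

score = ensemble_score(models, "CCO")
print(f"ensemble score: {score:.2f}")
```

Here `model_c` is dropped before averaging because its cross-validation AUC is below 0.8, so it does not influence the final score.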

1. Heterogeneous antiviral AI model

Training Data: Heterogeneous antiviral bioactivity records, including target-based and phenotype-based records from various species and in vitro assays. A total of 76247 compounds, with 37332 active and 38915 inactive molecules (active: EC50 <= 100 nM for at least one viral species).
Performance (5-fold cross-validation): AUC avg. = 0.94
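The labeling rule stated above (active if EC50 <= 100 nM for at least one viral species) can be written as a short function. This is a sketch under assumptions: the record layout, compound names, and species keys are hypothetical, and values are assumed to be already converted to nM.

```python
ACTIVE_EC50_NM = 100.0  # activity cutoff stated in the text


def label_compound(ec50_by_species_nm):
    """Return 1 (active) if any species has EC50 <= 100 nM, otherwise 0."""
    return int(any(v <= ACTIVE_EC50_NM for v in ec50_by_species_nm.values()))


# Hypothetical records: EC50 in nM, keyed by viral species.
records = {
    "cmpd_1": {"virus_x": 50.0, "virus_y": 2000.0},   # active via virus_x
    "cmpd_2": {"virus_x": 100.0},                     # exactly at the cutoff
    "cmpd_3": {"virus_x": 101.0, "virus_y": 5000.0},  # inactive everywhere
}

labels = {name: label_compound(ec50) for name, ec50 in records.items()}
print(labels)
```

The same pattern applies to the target-specific models below, with the cutoff replaced by IC50 <= 1 μM.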

2. Phenotypic antiviral AI model

Training Data: Heterogeneous phenotype-based antiviral bioactivity records from various species and in vitro assays. A total of 7305 compounds, with 3751 active and 3554 inactive molecules (active: EC50 <= 100 nM for at least one viral species).
Performance (5-fold cross-validation): AUC avg. = 0.908

3. RNA-dependent RNA polymerase AI model

Training Data: Heterogeneous RNA-dependent RNA polymerase bioactivity records from various species and in vitro assays. A total of 583 compounds, with 306 active and 277 inactive molecules (active: IC50 <= 1 μM).
Performance (5-fold cross-validation): AUC avg. = 0.952
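For readers unfamiliar with the metric, the averaged cross-validation AUC reported for each model can be illustrated with a small, dependency-free sketch. It computes AUC as the probability that a randomly chosen active is scored above a randomly chosen inactive, then averages over folds. The toy folds are invented for illustration (two folds stand in for the five), and this is not the evaluation code used for the models on this page.

```python
from statistics import mean


def roc_auc(labels, scores):
    """AUC as P(score of active > score of inactive); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_fold_auc(folds):
    """Average the per-fold AUC over the held-out folds."""
    return mean(roc_auc(labels, scores) for labels, scores in folds)


# Toy held-out folds: (true labels, predicted scores).
folds = [
    ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),  # every active outranks every inactive
    ([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]),  # one active ranked below an inactive
]

avg_auc = mean_fold_auc(folds)
print(f"AUC avg. = {avg_auc:.3f}")
```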

4. Helicase AI model

Training Data: Heterogeneous helicase bioactivity records from various species and in vitro assays. A total of 878 compounds, with 127 active and 751 inactive molecules (active: IC50 <= 1 μM).
Performance (5-fold cross-validation): AUC avg. = 0.926

5. 3C-like protease AI model

Training Data: Heterogeneous 3C-like protease bioactivity records from various species and in vitro assays. A total of 457 compounds, with 132 active and 325 inactive molecules (active: IC50 <= 1 μM).
Performance (5-fold cross-validation): AUC avg. = 0.89
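The training-set counts reported for the five ligand-based models can be cross-checked and compared with a few lines of arithmetic. The sketch below uses only the numbers given above; the active fraction is shown because class balance differs noticeably between models, which is worth keeping in mind when comparing their AUC values.

```python
# (model, total compounds, active, inactive) as reported above
TRAINING_SETS = [
    ("Heterogeneous antiviral", 76247, 37332, 38915),
    ("Phenotypic antiviral", 7305, 3751, 3554),
    ("RNA-dependent RNA polymerase", 583, 306, 277),
    ("Helicase", 878, 127, 751),
    ("3C-like protease", 457, 132, 325),
]


def active_fraction(active, inactive):
    """Share of the training set labeled active."""
    return active / (active + inactive)


for name, total, active, inactive in TRAINING_SETS:
    # The reported totals should equal active + inactive.
    assert active + inactive == total, name
    frac = active_fraction(active, inactive)
    print(f"{name:30s} n={total:6d}  active fraction={frac:.3f}")
```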

B. Structure-based (non-docking) AI model

The structure-based AI model is built on HAG-net, developed at GHDDI. It was trained on the 3D information of all existing drug targets and the related biochemical data for up to 2 million molecules, and it is applicable to any target with a 3D structure. Model performance was evaluated on the DUD-E set (average AUC of 0.98) and on an internal true-negative benchmark set (average AUC of 0.8). The inputs are the target 3D structure, the center coordinate (x, y, z) of the binding pocket, and the screening library as a list of SMILES. The model screens about 10K compounds in 4 minutes, which is far faster than traditional docking-based screening. This is a beta version of the model, and the results will be updated with each model upgrade. Sample prediction results for various SARS-CoV-2 targets and related host targets are listed below. A homology model is used when no crystal structure is available for a target.
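The three inputs named above can be collected in a small container, sketched below. The class name, field names, file name, and coordinates are hypothetical and do not describe the actual web service interface; only the input list and the figure of 10K compounds per 4 minutes come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ScreeningRequest:
    """Hypothetical bundle of the inputs described in the text."""
    structure_file: str                        # target 3D structure (crystal or homology model)
    pocket_center: Tuple[float, float, float]  # (x, y, z) of the binding pocket
    smiles: List[str]                          # screening library

    def validate(self) -> None:
        if len(self.pocket_center) != 3:
            raise ValueError("pocket_center must be (x, y, z)")
        if not self.smiles:
            raise ValueError("screening library is empty")


def estimated_minutes(n_compounds: int, minutes_per_10k: float = 4.0) -> float:
    """Rough screening time, assuming the stated rate scales linearly."""
    return n_compounds / 10_000 * minutes_per_10k


request = ScreeningRequest(
    structure_file="target.pdb",        # placeholder path
    pocket_center=(10.5, -3.2, 27.0),   # placeholder coordinates
    smiles=["CCO", "c1ccccc1"],
)
request.validate()
print(f"~{estimated_minutes(12_000):.1f} min for a 12K-compound library")
```

Linear scaling of the runtime with library size is an assumption made for the estimate.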

1. SARS-CoV-2 RNA-dependent RNA polymerase (RDRP) (NTP binding site)

2. SARS-CoV-2 Helicase (NTP binding site)

3. SARS-CoV-2 3C-like protease (catalytic site)

4. SARS-CoV-2 Papain-like protease (catalytic site)

5. Human TMPRSS2 (catalytic site)

Benchmark

Conventional docking results obtained with AutoDock Vina on the DrugBank library (release 5.15, 8764 compounds) for all of the above targets can be downloaded here. Screening each target takes about 36 hours on 12 CPUs in parallel.
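The docking benchmark can be set against the 10K compounds in 4 minutes reported for the structure-based AI model with a quick throughput calculation. This is a rough wall-clock comparison only: the hardware used for the AI model is not stated on this page, so the ratio is not a like-for-like measurement.

```python
def compounds_per_minute(n_compounds: int, minutes: float) -> float:
    """Screening throughput in compounds per minute of wall-clock time."""
    return n_compounds / minutes


# Figures taken from this page.
ai_rate = compounds_per_minute(10_000, 4)        # structure-based AI model
vina_rate = compounds_per_minute(8764, 36 * 60)  # AutoDock Vina, 12 CPUs in parallel

speedup = ai_rate / vina_rate
print(f"AI model: {ai_rate:.0f} compounds/min")
print(f"Vina:     {vina_rate:.2f} compounds/min")
print(f"Ratio:    ~{speedup:.0f}x")
```

On these numbers the AI model is roughly 600 times faster per target in wall-clock terms.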


Last update: July 24, 2020