About This Project
This web application predicts molecular toxicity on the Tox21 benchmark using trained graph neural networks. It combines a Flask inference backend, RDKit chemistry tooling, PyTorch Geometric models, and an interactive interface for single-molecule and batch analysis.
Technology Stack
Website Capabilities
The website supports practical end-to-end toxicity analysis workflows:
1. Single-SMILES inference with 2D molecule rendering and an interactive 3D structure view.
2. Batch CSV inference for multiple compounds (requires a SMILES column).
3. Model selection between GINE, GCN, and GATv2, including all-model comparison mode.
4. Endpoint-level toxicity output for all 12 Tox21 assays with probability scores and labels.
5. CSV export of the latest prediction results from the backend.
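As a rough illustration of the batch workflow in item 2, the sketch below reads CSV text and pulls out the SMILES column. The helper name `extract_smiles`, the case-insensitive header match, and the skipping of blank cells are assumptions made for this example; the site's own parser may behave differently.

```python
import csv
import io


def extract_smiles(csv_text, column="smiles"):
    """Return the non-empty SMILES strings from CSV text.

    Hypothetical helper: the header is matched case-insensitively and
    blank cells are skipped. Neither behavior is confirmed by the site.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    # Find the header whose lowercase form matches the requested column.
    match = next((name for name in fieldnames
                  if name.strip().lower() == column.lower()), None)
    if match is None:
        raise ValueError(f"CSV must contain a '{column}' column")
    # Short rows yield None for missing cells, so guard before stripping.
    return [row[match].strip() for row in reader
            if row[match] and row[match].strip()]


example = "name,SMILES\nethanol,CCO\nbenzene,c1ccccc1\nblank,\n"
print(extract_smiles(example))
```

Rejecting a file with no SMILES column up front gives the user one clear error instead of a failure partway through the batch.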
Dataset And Split Strategy
Training notebook summary (start-to-end):
1. Dataset: MoleculeNet Tox21 (7,831 molecules, 12 toxicity tasks).
2. Split: scaffold-aware 80/10/10 (train/validation/test) split keyed on a Morgan fingerprint (radius=2, 2048 bits).
3. Features: node features normalized globally; edge features normalized when available.
4. Stability controls: feature clamping to [-10, 10], NaN/inf replacement, and gradient clipping.
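The split in item 2 can be pictured as a group-aware partition: molecules that share a structural key stay in the same subset. The sketch below is a minimal version under stated assumptions. It takes precomputed keys as input (the notebook derives its key from a Morgan fingerprint; that step is not reproduced here), and the largest-group-first ordering is a common convention for scaffold splits rather than something confirmed from the notebook. The function name `grouped_split` is hypothetical.

```python
from collections import defaultdict


def grouped_split(keys, frac_train=0.8, frac_valid=0.1):
    """Split indices into train/valid/test so no group key spans two sets.

    `keys[i]` is a hashable structural key for molecule i (hypothetical
    input; the notebook derives its key from a Morgan fingerprint).
    """
    groups = defaultdict(list)
    for index, key in enumerate(keys):
        groups[key].append(index)

    # Largest groups first, so big structural families land in training.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(keys)
    train_cap = frac_train * n
    valid_cap = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


keys = ["A"] * 5 + ["B"] * 3 + ["C", "D"]
train, valid, test = grouped_split(keys)
print(len(train), len(valid), len(test))
```

Because whole groups move together, the realized proportions only approximate 80/10/10 on real data. The trade-off is that validation and test molecules come from structural families the model did not see in training.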
Where This Project Can Be Used
- Early-stage toxicity screening in drug discovery pipelines before expensive wet-lab assays.
- Academic and industry research exploring structure-toxicity relationships in new compounds.
- Educational demonstrations for graph neural networks in cheminformatics and computational toxicology.
- Pre-filtering of chemical libraries to prioritize safer candidates for synthesis and testing.
- Model comparison studies (GINE vs GCN vs GATv2) under scaffold-aware generalization settings.
System Overview
1. The frontend collects a SMILES string or CSV file and the model choice.
2. The Flask backend converts molecules to graph data.
3. The loaded checkpoints produce toxicity probabilities for the 12 endpoints.
4. Results are rendered with 2D/3D molecular visualization and a downloadable CSV.
Model Architectures And Training
| Model | Graph Layers | Pooling Head | Dropout | Optimizer / LR | Schedulers |
|---|---|---|---|---|---|
| GINE | 4x GINEConv + BatchNorm | Mean + Sum pool -> MLP | 0.25 | Adam, 5e-4 | CosineAnnealingWarmRestarts + ReduceLROnPlateau |
| GCN | 4x GCNConv + BatchNorm | Mean + Sum pool -> MLP | 0.25 | Adam, 5e-4 | CosineAnnealingWarmRestarts + ReduceLROnPlateau |
| GATv2 | 4x GATv2Conv (4 heads) + BatchNorm | Mean + Sum pool -> MLP | 0.30 | Adam, 3e-4 | CosineAnnealingWarmRestarts + ReduceLROnPlateau |
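
To give a feel for the cosine warm-restart schedule named in the table, the sketch below computes the learning rate per epoch using the standard cosine-annealing-with-restarts formula and a fixed cycle length. The base rate 5e-4 comes from the table. The cycle length `t_0 = 10` and `eta_min = 0` are assumed values, and how the notebook combines this schedule with ReduceLROnPlateau is not shown here.

```python
import math


def cosine_warm_restart_lr(epoch, base_lr=5e-4, t_0=10, eta_min=0.0):
    """Learning rate at `epoch` for cosine annealing with warm restarts.

    Uses a fixed cycle length `t_0` (assumed value, not from the notebook).
    """
    t_cur = epoch % t_0  # position inside the current cycle
    cosine = 0.5 * (1.0 + math.cos(math.pi * t_cur / t_0))
    return eta_min + (base_lr - eta_min) * cosine


for epoch in (0, 5, 9, 10):
    print(epoch, f"{cosine_warm_restart_lr(epoch):.2e}")
```

The rate decays from the base value toward `eta_min` within each cycle and jumps back to the base value at every restart (epoch 10 in this example).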
Reported Notebook Results
From the attached training notebook summary:
| Experiment | Macro ROC-AUC | Status |
|---|---|---|
| GCN (scaffold split) | 0.8300 | Best in notebook run |
| GINE (scaffold split) | 0.8076 | Strong baseline |
| GATv2 (scaffold split) | 0.7982 | Competitive |
| Random split reference | 0.7928 | Notebook comparison value |
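
Macro ROC-AUC is the unweighted mean of the per-task ROC-AUC values. The sketch below shows one way to compute it in plain Python, using the pairwise-ranking definition of AUC. Two details are assumptions rather than facts about the notebook: missing labels are represented as `None` and masked out, and tasks with only one class present are skipped.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the chance a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def macro_roc_auc(label_rows, score_rows):
    """Average per-task AUC over tasks; None labels are treated as missing."""
    n_tasks = len(label_rows[0])
    aucs = []
    for task in range(n_tasks):
        pairs = [(row_y[task], row_s[task])
                 for row_y, row_s in zip(label_rows, score_rows)
                 if row_y[task] is not None]
        auc = roc_auc([y for y, _ in pairs], [s for _, s in pairs])
        if auc is not None:
            aucs.append(auc)
    if not aucs:
        raise ValueError("no task has both classes present")
    return sum(aucs) / len(aucs)


# Toy data: 4 molecules, 2 tasks, one missing label.
labels = [[1, 0], [0, None], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.3, 0.5], [0.4, 0.7], [0.6, 0.1]]
print(macro_roc_auc(labels, scores))  # 0.875 (task AUCs 0.75 and 1.0)
```

The pairwise loop is quadratic in the number of molecules, which is fine for illustration. A rank-based implementation would be preferable at dataset scale.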
Production Inference Stack
Behavior of the backend prediction route on this website:
1. Loads pretrained checkpoints for GINE, GCN, and GATv2 at startup.
2. Rebuilds graph features from SMILES with RDKit + PyG Data objects.
3. Applies training-aligned normalization, then converts model outputs to per-endpoint probabilities with a sigmoid.
4. Generates a 2D image and a 3D SDF for interactive molecular visualization.
5. Stores the latest prediction rows for backend CSV export.
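Step 3 can be sketched in plain Python as follows. The clamp range and NaN/inf replacement mirror the stability controls listed in the training summary. Whether the backend applies them in exactly this form, the replacement value 0, the 0.5 decision threshold, and the label strings are all assumptions for illustration. The task names are the 12 Tox21 assay identifiers as distributed with MoleculeNet.

```python
import math

# The 12 Tox21 assay names as distributed with MoleculeNet.
TOX21_TASKS = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]


def normalize_feature(value, mean, std, limit=10.0):
    """Standardize one feature, replace NaN/inf with 0, clamp to [-limit, limit]."""
    z = (value - mean) / std if std > 0 else 0.0
    if math.isnan(z) or math.isinf(z):
        z = 0.0  # assumed replacement value
    return max(-limit, min(limit, z))


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def to_predictions(logits, threshold=0.5):
    """Map 12 raw logits to a per-endpoint probability and label."""
    if len(logits) != len(TOX21_TASKS):
        raise ValueError("expected one logit per Tox21 endpoint")
    rows = []
    for task, logit in zip(TOX21_TASKS, logits):
        prob = sigmoid(logit)
        rows.append({"endpoint": task, "probability": round(prob, 4),
                     "label": "Toxic" if prob >= threshold else "Non-toxic"})
    return rows


for row in to_predictions([2.0, -2.0] + [0.0] * 10)[:2]:
    print(row)
```

Each endpoint gets its own independent sigmoid because the 12 assays are separate binary tasks, so the probabilities are not expected to sum to 1.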
Team Members
Team members who contributed to this project.