antoniaebner committed
Commit 87e7d05 · 1 Parent(s): b0daa87

add Sohvi's code
Files changed (12)
  1. .gitignore +1 -0
  2. README.md +103 -0
  3. app.py +78 -0
  4. assets/tox_smarts.json +0 -0
  5. predict.py +63 -0
  6. requirements.txt +10 -0
  7. src/__init__.py +0 -0
  8. src/model.py +90 -0
  9. src/preprocess.py +263 -0
  10. src/push_assets.py +12 -0
  11. src/train.py +294 -0
  12. src/utils.py +432 -0
.gitignore ADDED
@@ -0,0 +1 @@
+ */__pycache__/*
README.md ADDED
@@ -0,0 +1,103 @@
+ ---
+ title: Tox21 XGBoost Classifier
+ emoji: 🚀
+ colorFrom: green
+ colorTo: purple
+ sdk: docker
+ pinned: false
+ license: apache-2.0
+ short_description: XGBoost baseline classifier for Tox21
+ ---
+
+ # Tox21 XGBoost Classifier
+
+ This repository hosts a Hugging Face Space that provides an example API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/tschouis/tox21_leaderboard).
+
+ In this example, we train an XGBoost classifier on the Tox21 targets and save the trained model in the `assets/` folder.
+
+ **Important:** For leaderboard submission, your Space does not need to include training code. It only needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it takes a list of SMILES strings as input and returns a prediction dictionary with SMILES and targets as keys. Consequently, any preprocessing of SMILES strings must be executed on the fly during inference.
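+
+ A minimal sketch of that skeleton (the signature matches `predict.py` in this repository; the body is illustrative only):
+
+ ```python
+ def predict(smiles_list: list[str]) -> dict[str, dict[str, float]]:
+     # featurize the SMILES on the fly, run your model, and return
+     # {"<smiles>": {"<target>": <prediction>, ...}, ...}
+     ...
+ ```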
+
+ # Repository Structure
+ - `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
+ - `app.py` - FastAPI application wrapper (can be used as-is).
+ - `src/` - Core model & preprocessing logic:
+   - `preprocess.py` - SMILES preprocessing pipeline
+   - `model.py` - XGBoost classifier wrapper
+   - `train.py` - Script to train the classifier
+   - `utils.py` - Constants and helper functions
+
+ # Quickstart with Spaces
+
+ You can easily adapt this project in your own Hugging Face account:
+
+ - Open this Space on Hugging Face.
+ - Click "Duplicate this Space" (top-right corner).
+ - Modify `src/` for your preprocessing pipeline and model class.
+ - Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.
+
+ That's it: your model will be available as an API endpoint for the Tox21 Leaderboard.
+
+ # Installation
+ To run (and train) the XGBoost classifier, clone the repository and install dependencies:
+
+ ```bash
+ git clone https://huggingface.co/spaces/tschouis/tox21_xgboost_classifier
+ cd tox21_xgboost_classifier
+
+ conda create -n tox21_xgb_cls python=3.11
+ conda activate tox21_xgb_cls
+ pip install -r requirements.txt
+ ```
+
+ # Training
+
+ To train the XGBoost model from scratch:
+
+ ```bash
+ python -m src.train
+ ```
+
+ This will:
+
+ 1. Load and preprocess the Tox21 training dataset.
+ 2. Train an XGBoost classifier.
+ 3. Save the trained model to the `assets/` folder.
+ 4. Evaluate the trained XGBoost classifier on the validation split.
+
+ # Inference
+
+ For inference, you only need `predict.py`.
+
+ Example usage inside Python:
+
+ ```python
+ from predict import predict
+
+ smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
+ results = predict(smiles_list)
+
+ print(results)
+ ```
+
+ The output will be a nested dictionary in the format:
+
+ ```python
+ {
+     "CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
+     "c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
+     "CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
+ }
+ ```
+
+ # Notes
+
+ - Only adapting `predict.py` for your model inference is required for leaderboard submission.
+ - Training (`src/train.py`) is provided for reproducibility.
+ - Preprocessing (here inside `src/preprocess.py`) must be applied at inference time, not only during training.
app.py ADDED
@@ -0,0 +1,78 @@
+ """
+ This is the main entry point for the FastAPI application.
+ The app handles requests to predict toxicity for a list of SMILES strings.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies and global variable definition
+ import os
+ from typing import Dict, List
+
+ from fastapi import FastAPI
+ from pydantic import BaseModel, Field
+
+ from predict import predict as predict_func
+
+ API_KEY = os.getenv("API_KEY")  # set via Space Secrets
+
+
+ # ---------------------------------------------------------------------------------------
+ class Request(BaseModel):
+     smiles: List[str] = Field(min_items=1, max_items=1000)
+
+
+ class Response(BaseModel):
+     predictions: dict
+     model_info: Dict[str, str] = {}
+
+
+ app = FastAPI(title="toxicity-api")
+
+
+ @app.get("/")
+ def root():
+     return {
+         "message": "Toxicity Prediction API",
+         "endpoints": {
+             "/metadata": "GET - API metadata and capabilities",
+             "/healthz": "GET - Health check",
+             "/predict": "POST - Predict toxicity for SMILES",
+         },
+         "usage": "Send POST to /predict with {'smiles': ['your_smiles_here']} and Authorization header",
+     }
+
+
+ @app.get("/metadata")
+ def metadata():
+     return {
+         "name": "AwesomeTox",
+         "version": "1.0.0",
+         "max_batch_size": 256,
+         "tox_endpoints": [
+             "NR-AR",
+             "NR-AR-LBD",
+             "NR-AhR",
+             "NR-Aromatase",
+             "NR-ER",
+             "NR-ER-LBD",
+             "NR-PPAR-gamma",
+             "SR-ARE",
+             "SR-ATAD5",
+             "SR-HSE",
+             "SR-MMP",
+             "SR-p53",
+         ],
+     }
+
+
+ @app.get("/healthz")
+ def healthz():
+     return {"ok": True}
+
+
+ @app.post("/predict", response_model=Response)
+ def predict(request: Request):
+     predictions = predict_func(request.smiles)
+     return {
+         "predictions": predictions,
+         "model_info": {"name": "tox21_xgb_classifier", "version": "1.0.0"},
+     }
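For reference, a minimal client for this endpoint (a sketch using the `requests` package, which is not listed in `requirements.txt`; it assumes the app is served locally on port 7860, the default for Docker Spaces, and `<API_KEY>` is a placeholder — this example app reads `API_KEY` but does not enforce it):

```python
import requests

# hypothetical local URL; replace with your duplicated Space's endpoint
url = "http://localhost:7860/predict"
payload = {"smiles": ["CCO", "c1ccccc1"]}
headers = {"Authorization": "Bearer <API_KEY>"}  # placeholder

resp = requests.post(url, json=payload, headers=headers)
resp.raise_for_status()
print(resp.json()["predictions"])
```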
assets/tox_smarts.json ADDED
The diff for this file is too large to render. See raw diff
 
predict.py ADDED
@@ -0,0 +1,63 @@
+ """
+ This file includes a predict function for Tox21.
+ It takes a list of SMILES as input and outputs a nested dictionary with
+ SMILES and target names as keys.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies
+ from collections import defaultdict
+
+ import numpy as np
+
+ from src.model import Tox21XGBClassifier
+ from src.preprocess import create_descriptors
+
+ # ---------------------------------------------------------------------------------------
+
+
+ def predict(smiles_list: list[str]) -> dict[str, dict[str, float]]:
+     """Applies the classifier to a list of SMILES strings. Returns prediction=0.0 for
+     any molecule that could not be cleaned.
+
+     Args:
+         smiles_list (list[str]): list of SMILES strings
+
+     Returns:
+         dict: nested prediction dictionary, following {'<smiles>': {'<target>': <pred>}}
+     """
+     print(f"Received {len(smiles_list)} SMILES strings")
+     # preprocessing pipeline
+     features, mol_mask = create_descriptors(smiles_list)
+     print(f"Created {features.shape[1]} descriptors for the molecules.")
+     print(
+         f"{len(mol_mask) - sum(mol_mask)} molecules removed during cleaning. "
+         "All predictions for these will be set to 0.0."
+     )
+
+     # setup model
+     model = Tox21XGBClassifier(seed=42)
+     model_dir = "assets/"
+     model.load_model(model_dir)
+     print(f"Loaded model and feature processors from {model_dir}")
+
+     # make predictions
+     predictions = defaultdict(dict)
+     # index of each molecule's row in `features` (only cleaned molecules have a row)
+     feat_indices = np.cumsum(mol_mask) - 1
+     for target in model.tasks:
+         feature_processors = model.feature_processors[target]
+         task_features = feature_processors["selector"].transform(features)
+         task_features = feature_processors["scaler"].transform(task_features)
+         target_pred = model.predict(target, task_features)
+         for smiles, is_clean, i in zip(smiles_list, mol_mask, feat_indices):
+             predictions[smiles][target] = float(target_pred[i]) if is_clean else 0.0
+     return predictions
+
+
+ if __name__ == "__main__":
+     # simple test
+     test_smiles = [
+         "CCO",
+         "CCN",
+         "invalid_smiles",
+     ]
+     preds = predict(test_smiles)
+     print(preds)
requirements.txt ADDED
@@ -0,0 +1,10 @@
+ fastapi
+ uvicorn[standard]
+ statsmodels
+ rdkit
+ numpy
+ scikit-learn==1.7.1
+ joblib
+ tabulate
+ datasets
+ xgboost==3.0.5
src/__init__.py ADDED
File without changes
src/model.py ADDED
@@ -0,0 +1,90 @@
+ """
+ This file includes an XGBoost model wrapper for Tox21.
+ It holds one binary XGBoost classifier per Tox21 task, together with the
+ fitted per-task feature processors.
+ """
+
+ # ---------------------------------------------------------------------------------------
+ # Dependencies
+ import os
+
+ import joblib
+ import numpy as np
+ from xgboost import XGBClassifier
+
+ from src.utils import TASKS
+
+
+ # ---------------------------------------------------------------------------------------
+ class Tox21XGBClassifier:
+     """An XGBoost classifier that assigns a toxicity score to a given SMILES string."""
+
+     def __init__(self, seed: int = 42, task_configs: dict | None = None) -> None:
+         """Initialize an XGBoost classifier for each of the 12 Tox21 tasks.
+
+         Args:
+             seed (int, optional): seed for XGBoost to ensure reproducibility. Defaults to 42.
+             task_configs (dict | None, optional): dictionary containing task-specific
+                 hyperparameters. If None, default hyperparameters are used for all tasks.
+                 Defaults to None.
+         """
+         self.tasks = TASKS
+         self.model = {
+             task: XGBClassifier(random_state=seed, n_jobs=8) if task_configs is None
+             else XGBClassifier(
+                 **{k: v for k, v in task_configs[task].items() if k != "var_threshold"},
+                 random_state=seed, n_jobs=8
+             )
+             for task in self.tasks
+         }
+         self.feature_processors = {}
+
+     def load_model(self, dir: str) -> None:
+         """Loads the model from a given directory.
+
+         Args:
+             dir (str): directory to load the model from
+         """
+         self.model = joblib.load(os.path.join(dir, "xgb_alltasks.joblib"))
+         self.feature_processors = joblib.load(os.path.join(dir, "feature_processors.pkl"))
+
+     def save_model(self, dir: str) -> None:
+         """Saves the model to a given directory.
+
+         Args:
+             dir (str): directory to save the model to
+         """
+         model_path = os.path.join(dir, "xgb_alltasks.joblib")
+         feature_processor_path = os.path.join(dir, "feature_processors.pkl")
+         os.makedirs(dir, exist_ok=True)
+
+         joblib.dump(self.model, model_path)
+         joblib.dump(self.feature_processors, feature_processor_path)
+
+     def fit(self, task: str, input_features: np.ndarray, labels: np.ndarray, **kwargs) -> None:
+         """Train XGBoost for a given task.
+
+         Args:
+             task (str): task to train
+             input_features (np.ndarray): training features
+             labels (np.ndarray): training labels
+         """
+         assert task in self.tasks, f"Unknown task: {task}"
+         self.model[task].fit(input_features, labels, **kwargs)
+
+     def predict(self, task: str, features: np.ndarray) -> np.ndarray:
+         """Predicts labels for a given Tox21 target using molecule features.
+
+         Args:
+             task (str): the Tox21 target to predict for
+             features (np.ndarray): molecule features used for prediction
+
+         Returns:
+             np.ndarray: predicted probability for the positive class
+         """
+         assert task in self.tasks, f"Unknown task: {task}"
+         assert (
+             len(features.shape) == 2
+         ), f"Function expects a 2D np.array. Current shape: {features.shape}"
+         preds = self.model[task].predict_proba(features)
+         return preds[:, 1]
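A short usage sketch for this wrapper (illustrative only; the random `X_train`/`y_train` stand in for one task's preprocessed feature matrix and binary labels):

```python
import numpy as np
from src.model import Tox21XGBClassifier

model = Tox21XGBClassifier(seed=42)

# hypothetical data: 100 molecules, 16 features, binary labels for one task
X_train = np.random.rand(100, 16)
y_train = np.random.randint(0, 2, size=100)

model.fit("NR-AR", X_train, y_train)
probs = model.predict("NR-AR", X_train)  # P(toxic) per molecule
model.save_model("assets/")
```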
src/preprocess.py ADDED
@@ -0,0 +1,263 @@
+ """
+ This file includes functions to create molecular descriptors.
+ It takes a list of SMILES as input and outputs a numpy array of descriptors.
+ """
+
+ import os
+ import json
+ import argparse
+
+ import numpy as np
+
+ from datasets import load_dataset
+
+ from rdkit import Chem, DataStructs
+ from rdkit.Chem import Descriptors, rdFingerprintGenerator, MACCSkeys
+ from rdkit.Chem.rdchem import Mol
+
+ from src.utils import (
+     TASKS,
+     HF_TOKEN,
+     USED_200_DESCR,
+     Standardizer,
+ )
+
+ parser = argparse.ArgumentParser(
+     description="Data preprocessing script for the Tox21 dataset"
+ )
+
+ parser.add_argument(
+     "--save_folder",
+     type=str,
+     default="data/",
+ )
+
+ parser.add_argument(
+     "--use_hf",
+     type=int,
+     default=0,
+ )
+
+ parser.add_argument(
+     "--tox_smarts_filepath",
+     type=str,
+     default="assets/tox_smarts.json",
+ )
+
+
+ def create_cleaned_mol_objects(smiles: list[str]) -> tuple[list[Mol], np.ndarray]:
+     """This function creates cleaned RDKit mol objects from a list of SMILES.
+
+     Args:
+         smiles (list[str]): list of SMILES
+
+     Returns:
+         list[Mol]: list of cleaned molecules
+         np.ndarray[bool]: mask that contains False at index `i` if the molecule in
+             `smiles` at index `i` could not be cleaned and was removed.
+     """
+     sm = Standardizer(canon_taut=True)
+
+     clean_mol_mask = list()
+     mols = list()
+     for smile in smiles:
+         mol = Chem.MolFromSmiles(smile)
+         standardized_mol, _ = sm.standardize_mol(mol)
+         is_cleaned = standardized_mol is not None
+         clean_mol_mask.append(is_cleaned)
+         if not is_cleaned:
+             continue
+         can_mol = Chem.MolFromSmiles(Chem.MolToSmiles(standardized_mol))
+         mols.append(can_mol)
+
+     return mols, np.array(clean_mol_mask)
+
+
+ def create_ecfp_fps(mols: list[Mol], radius=None, fpsize=None) -> np.ndarray:
+     """This function creates ECFP fingerprints for a list of molecules.
+
+     Args:
+         mols (list[Mol]): list of molecules
+
+     Returns:
+         np.ndarray: ECFP fingerprints of molecules
+     """
+     kwargs = {}
+     if fpsize is not None:
+         kwargs["fpSize"] = fpsize
+     if radius is not None:
+         kwargs["radius"] = radius
+     # the generator is stateless, so it can be created once for all molecules
+     gen = rdFingerprintGenerator.GetMorganGenerator(countSimulation=True, **kwargs)
+
+     ecfps = list()
+     for mol in mols:
+         fp_sparse_vec = gen.GetCountFingerprint(mol)
+         fp = np.zeros((0,), np.int8)
+         DataStructs.ConvertToNumpyArray(fp_sparse_vec, fp)
+         ecfps.append(fp)
+
+     return np.array(ecfps)
+
+
+ def create_maccs_keys(mols: list[Mol]) -> np.ndarray:
+     """This function creates MACCS keys for a list of molecules."""
+     maccs = [MACCSkeys.GenMACCSKeys(x) for x in mols]
+     return np.array(maccs)
+
+
+ def get_tox_patterns(filepath: str) -> list:
+     """Parses the tox SMARTS patterns defined in tox_smarts.json.
+
+     Args:
+         filepath (str): path to the JSON file with the SMARTS definitions
+
+     Returns:
+         list: one (patterns, negations, merge_any) tuple per tox feature
+     """
+     # load patterns
+     with open(filepath) as f:
+         smarts_list = [s[1] for s in json.load(f)]
+
+     # Code does not work for this case
+     assert len([s for s in smarts_list if ("AND" in s) and ("OR" in s)]) == 0
+
+     # Chem.MolFromSmarts takes a long time, so it pays off to parse all the smarts first
+     # and then use them for all molecules. This gives a huge speedup over existing code.
+     # A list of patterns, whether to negate the match result, and how to join them to
+     # obtain one boolean value.
+     all_patterns = []
+     for smarts in smarts_list:
+         patterns = []  # list of smarts-patterns
+         # value for each of the patterns above; negates the match results later
+         negations = []
+
+         if " AND " in smarts:
+             smarts = smarts.split(" AND ")
+             merge_any = False  # if an ' AND ' is found, all 'subsmarts' have to match
+         else:
+             # If there is an ' OR ' present, it's enough if any of the 'subsmarts' match.
+             # This also accumulates smarts where neither ' OR ' nor ' AND ' occur.
+             smarts = smarts.split(" OR ")
+             merge_any = True
+
+         # for all subsmarts, check if they are preceded by 'NOT '
+         for s in smarts:
+             neg = s.startswith("NOT ")
+             if neg:
+                 s = s[4:]
+             patterns.append(Chem.MolFromSmarts(s))
+             negations.append(neg)
+
+         all_patterns.append((patterns, negations, merge_any))
+     return all_patterns
+
+
+ def create_tox_features(mols: list[Mol], patterns: list) -> np.ndarray:
+     """Matches the tox patterns against a molecule. Returns a boolean array."""
+     tox_data = []
+     for mol in mols:
+         mol_features = []
+         for patts, negations, merge_any in patterns:
+             matches = [mol.HasSubstructMatch(p) for p in patts]
+             matches = [m != n for m, n in zip(matches, negations)]
+             if merge_any:
+                 pres = any(matches)
+             else:
+                 pres = all(matches)
+             mol_features.append(pres)
+
+         tox_data.append(np.array(mol_features))
+
+     return np.array(tox_data)
+
+
+ def create_rdkit_descriptors(mols: list[Mol]) -> np.ndarray:
+     """This function creates RDKit descriptors for a list of molecules.
+
+     Args:
+         mols (list[Mol]): list of molecules
+
+     Returns:
+         np.ndarray: RDKit descriptors of molecules
+     """
+     rdkit_descriptors = list()
+
+     for mol in mols:
+         descrs = []
+         for _, descr_calc_fn in Descriptors._descList:
+             descrs.append(descr_calc_fn(mol))
+
+         descrs = np.array(descrs)
+         descrs = descrs[USED_200_DESCR]
+         rdkit_descriptors.append(descrs)
+
+     return np.array(rdkit_descriptors)
+
+
+ def create_descriptors(smiles: list[str]) -> tuple[np.ndarray, np.ndarray]:
+     """Creates the full feature matrix (ECFP, tox, MACCS, RDKit descriptors) for a
+     list of SMILES, together with the mask of molecules that survived cleaning."""
+     print(f"Preprocess {len(smiles)} molecules")
+
+     # Create cleaned rdkit mol objects
+     mols, clean_mol_mask = create_cleaned_mol_objects(smiles)
+     print("Cleaned molecules")
+
+     tox_patterns = get_tox_patterns("assets/tox_smarts.json")
+
+     # Create fingerprints and descriptors
+     ecfps = create_ecfp_fps(mols, radius=3, fpsize=8192)
+     print("Created ECFP fingerprints")
+
+     tox = create_tox_features(mols, tox_patterns)
+     print("Created Tox features")
+
+     maccs = create_maccs_keys(mols)
+     print("Created MACCS keys")
+
+     rdkit_descrs = create_rdkit_descriptors(mols)
+     print("Created RDKit descriptors")
+
+     features = np.concatenate((ecfps, tox, maccs, rdkit_descrs), axis=1)
+     return features, clean_mol_mask
+
+
+ def fill(features, mask, value=np.nan):
+     """Expands `features` (rows for cleaned molecules only) back to one row per input
+     molecule, filling the rows of removed molecules with `value`."""
+     n_mols = len(mask)
+     n_features = features.shape[1]
+
+     data = np.zeros(shape=(n_mols, n_features))
+     data.fill(value)
+     # rows where the mask is True correspond to the cleaned molecules
+     data[mask] = features
+     return data
+
+
+ def preprocess_tox21():
+     splits = ["train", "validation"]
+     ds = load_dataset("tschouis/tox21", token=HF_TOKEN)
+
+     all_features, all_labels, all_split = [], [], []
+
+     for split in splits:
+         print(f"Preprocess {split} molecules")
+         smiles = list(ds[split]["smiles"])
+
+         features, mol_mask = create_descriptors(smiles)
+         print(f"Created {features.shape[1]} descriptors for {len(smiles)} molecules.")
+         print(f"{len(mol_mask) - sum(mol_mask)} molecules removed during cleaning.")
+
+         labels = []
+         for task in TASKS:
+             datasplit = ds[split].to_pandas() if args.use_hf else ds[split]
+             # np.asarray works for both a pandas Series and a plain list column
+             labels.append(np.asarray(datasplit[task]))
+         labels = np.stack(labels, axis=1)
+
+         all_features.append(features)
+         all_labels.append(labels)
+         all_split.append([split] * len(smiles))
+
+     os.makedirs(args.save_folder, exist_ok=True)
+     save_path = f"{args.save_folder}/tox21_data.npz"
+     with open(save_path, "wb") as f:
+         np.savez_compressed(
+             f,
+             features=all_features,
+             labels=all_labels,
+             splits=all_split,
+         )
+     print(f"Saved preprocessed data to {save_path}")
+
+
+ if __name__ == "__main__":
+     args = parser.parse_args()
+     preprocess_tox21()
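A small sketch of how `create_descriptors` and `fill` fit together (illustrative only; it assumes `assets/tox_smarts.json` is present, since `create_descriptors` reads it):

```python
import numpy as np
from src.preprocess import create_descriptors, fill

smiles = ["CCO", "not_a_smiles", "c1ccccc1"]
features, mask = create_descriptors(smiles)  # one row per *cleaned* molecule

# expand back to one row per input molecule; uncleaned rows become NaN
full = fill(features, mask, value=np.nan)
assert full.shape[0] == len(smiles)
print(np.isnan(full).any(axis=1))  # True only for the molecule that failed cleaning
```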
src/push_assets.py ADDED
@@ -0,0 +1,12 @@
+ from huggingface_hub import HfApi
+
+ from .utils import HF_TOKEN
+
+ api = HfApi()
+
+ api.upload_folder(
+     folder_path="assets/",
+     path_in_repo="assets",
+     repo_id="tschouis/tox21_xgboost_classifier",
+     repo_type="space",
+     token=HF_TOKEN,
+ )
src/train.py ADDED
@@ -0,0 +1,294 @@
+ """
+ Script for fitting and saving any preprocessing assets, as well as the fitted XGBoost model.
+ """
+
+ import os
+ import argparse
+
+ import numpy as np
+
+ from tabulate import tabulate
+
+ from sklearn.feature_selection import VarianceThreshold
+ from sklearn.metrics import roc_auc_score
+ from sklearn.preprocessing import StandardScaler
+
+ from src.model import Tox21XGBClassifier
+
+ SEED = 999
+ DATA_FOLDER = "data/"
+
+ parser = argparse.ArgumentParser(description="XGBoost Training script for Tox21 dataset")
+
+ parser.add_argument(
+     "--model_dir",
+     type=str,
+     default="assets",
+ )
+
+
+ def main(args):
+     print("Preprocess train molecules")
+     data_path = os.path.join(DATA_FOLDER, "tox21_data.npz")
+     full_data = np.load(data_path, allow_pickle=True)
+     # the npz stores one entry per split; flatten them into single arrays
+     features = np.concatenate(full_data["features"], axis=0)
+     labels = np.concatenate(full_data["labels"], axis=0)
+     splits = np.concatenate(full_data["splits"], axis=0)
+
+     # Handle inf/nan features: instead of dropping columns, zero-out entire affected columns
+     # so that VarianceThreshold will remove them later, keeping indices aligned.
+     bad_entries = np.isinf(features) | np.isnan(features)
+     bad_cols = np.any(bad_entries, axis=0)
+     if np.any(bad_cols):
+         features[:, bad_cols] = 0.0
+
+     train_mask = splits == "train"
+     train_X = features[train_mask]
+     train_y = labels[train_mask]
+     val_mask = splits == "validation"
+     val_X = features[val_mask]
+     val_y = labels[val_mask]
+
+     task_config = {
+         "NR-AR": {
+             "max_depth": 4,
+             "min_child_weight": 1.1005779061921914,
+             "gamma": 0.1317988706679324,
+             "learning_rate": 0.039645108160965156,
+             "subsample": 0.7296241662412439,
+             "colsample_bytree": 0.8021365422870282,
+             "reg_alpha": 3.3237336705963336e-06,
+             "reg_lambda": 0.5602005185114373,
+             "colsample_bylevel": 0.6436881915714322,
+             "max_bin": 320,
+             "grow_policy": "depthwise",
+             "var_threshold": 0.007666987709838448
+         },
+         "NR-AR-LBD": {
+             "max_depth": 4,
+             "min_child_weight": 4.1987212703698695,
+             "gamma": 1.2762015931613548,
+             "learning_rate": 0.15154599977311695,
+             "subsample": 0.6695940698634157,
+             "colsample_bytree": 0.7739932636137854,
+             "reg_alpha": 0.07898626960219088,
+             "reg_lambda": 8.571012949754111,
+             "colsample_bylevel": 0.9853057670318977,
+             "max_bin": 512,
+             "grow_policy": "lossguide",
+             "var_threshold": 0.00037667540735397795
+         },
+         "NR-AhR": {
+             "max_depth": 5,
+             "min_child_weight": 6.689827023187083,
+             "gamma": 0.05246277760115231,
+             "learning_rate": 0.04756606141238733,
+             "subsample": 0.8679211962117436,
+             "colsample_bytree": 0.6095873089337578,
+             "reg_alpha": 2.9267916989096844e-05,
+             "reg_lambda": 0.16597411475484836,
+             "colsample_bylevel": 0.6109587378961451,
+             "max_bin": 192,
+             "grow_policy": "lossguide",
+             "var_threshold": 0.006450426707708987
+         },
+         "NR-Aromatase": {
+             "max_depth": 3,
+             "min_child_weight": 3.2876314247596152,
+             "gamma": 0.19699266508924895,
+             "learning_rate": 0.05088088932843542,
+             "subsample": 0.7865649204014827,
+             "colsample_bytree": 0.7251861382401115,
+             "reg_alpha": 1.5663141562519894e-05,
+             "reg_lambda": 0.8079227014059855,
+             "colsample_bylevel": 0.6264563203168154,
+             "max_bin": 320,
+             "grow_policy": "lossguide",
+             "var_threshold": 0.008210794229202779
+         },
+         "NR-ER": {
+             "max_depth": 4,
+             "min_child_weight": 5.780102015649284,
+             "gamma": 1.4129142474001934,
+             "learning_rate": 0.030962338755374925,
+             "subsample": 0.6495287204129598,
+             "colsample_bytree": 0.6052286799267346,
+             "reg_alpha": 2.350761568396455e-08,
+             "reg_lambda": 0.09630529926179951,
+             "colsample_bylevel": 0.7431813327243276,
+             "max_bin": 384,
+             "grow_policy": "lossguide",
+             "var_threshold": 0.0023810780862365695
+         },
+         "NR-ER-LBD": {
+             "max_depth": 5,
+             "min_child_weight": 9.173052917805649,
+             "gamma": 1.0722539699322629,
+             "learning_rate": 0.04237749698413915,
+             "subsample": 0.7066072339657229,
+             "colsample_bytree": 0.6813795582720684,
+             "reg_alpha": 0.00023207537137377197,
+             "reg_lambda": 15.088634424806914,
+             "colsample_bylevel": 0.7799437417755278,
+             "max_bin": 384,
+             "grow_policy": "depthwise",
+             "var_threshold": 0.0019169350680113165
+         },
+         "NR-PPAR-gamma": {
+             "max_depth": 6,
+             "min_child_weight": 5.174007598815524,
+             "gamma": 1.9912192366255241,
+             "learning_rate": 0.05540828755212913,
+             "subsample": 0.6903953157523113,
+             "colsample_bytree": 0.8663027348173384,
+             "reg_alpha": 2.083339410970234e-08,
+             "reg_lambda": 0.015396790332761562,
+             "colsample_bylevel": 0.9751745752733803,
+             "max_bin": 320,
+             "grow_policy": "lossguide",
+             "var_threshold": 0.0029616070252124786
+         },
+         "SR-ARE": {
+             "max_depth": 7,
+             "min_child_weight": 9.1659526731455,
+             "gamma": 0.697265411436678,
+             "learning_rate": 0.06570769871964029,
+             "subsample": 0.9905868520803529,
+             "colsample_bytree": 0.9320468198902392,
+             "reg_alpha": 0.0015832053017691588,
+             "reg_lambda": 0.05920338550334178,
+             "colsample_bylevel": 0.9881491817036743,
+             "max_bin": 128,
+             "grow_policy": "lossguide",
+             "var_threshold": 0.002817440527458996
+         },
+         "SR-ATAD5": {
+             "max_depth": 8,
+             "min_child_weight": 3.840348891355251,
+             "gamma": 1.6154505675458388,
+             "learning_rate": 0.13247082849598005,
+             "subsample": 0.8051455662822469,
+             "colsample_bytree": 0.8812075918541051,
+             "reg_alpha": 1.0831755964182738e-08,
+             "reg_lambda": 27.095693383578947,
+             "colsample_bylevel": 0.636617995280427,
+             "max_bin": 256,
+             "grow_policy": "depthwise",
+             "var_threshold": 0.009669430411280284
+         },
+         "SR-HSE": {
+             "max_depth": 9,
+             "min_child_weight": 6.413184249228777,
+             "gamma": 1.033704331418744,
+             "learning_rate": 0.05274739499143931,
+             "subsample": 0.8865620043291726,
+             "colsample_bytree": 0.6816866072800449,
+             "reg_alpha": 0.058835365152010946,
+             "reg_lambda": 0.020754661410877756,
+             "colsample_bylevel": 0.9110208090854688,
+             "max_bin": 512,
+             "grow_policy": "lossguide",
+             "var_threshold": 0.005674926071804129
+         },
+         "SR-MMP": {
+             "max_depth": 5,
+             "min_child_weight": 9.817728618387365,
+             "gamma": 1.174192311657815,
+             "learning_rate": 0.0469463693712702,
+             "subsample": 0.7551958380501903,
+             "colsample_bytree": 0.7909988895785574,
+             "reg_alpha": 0.00015815798249652454,
+             "reg_lambda": 0.07975430070894152,
+             "colsample_bylevel": 0.6649592956153568,
+             "max_bin": 128,
+             "grow_policy": "depthwise",
+             "var_threshold": 0.006024127982297082
+         },
+         "SR-p53": {
+             "max_depth": 8,
+             "min_child_weight": 5.038486734836349,
+             "gamma": 1.807085258740345,
+             "learning_rate": 0.1096533837056875,
+             "subsample": 0.71588646279992,
+             "colsample_bytree": 0.8086559814485024,
+             "reg_alpha": 3.864250735509029e-08,
+             "reg_lambda": 0.03548737332001143,
+             "colsample_bylevel": 0.7740614694930106,
+             "max_bin": 128,
+             "grow_policy": "depthwise",
+             "var_threshold": 0.008637178477182731
+         },
+     }
+
+     results = {}
+
+     for i, task in enumerate(task_config.keys()):
+         npos = np.nansum(train_y[:, i])
+         nneg = np.sum(~np.isnan(train_y[:, i])) - npos
+         task_config[task].update({
+             "tree_method": "hist",
+             "n_estimators": 10_000,
+             "early_stopping_rounds": 50,
+             "eval_metric": "auc",
+             "scale_pos_weight": nneg / max(npos, 1),
+             "device": "cpu",
+         })
+
+     model = Tox21XGBClassifier(seed=SEED, task_configs=task_config)
+
+     print("Start training.")
+     for i, task in enumerate(model.tasks):
+         # Training -----------------------
+         task_labels = train_y[:, i]
+         label_mask = ~np.isnan(task_labels)
+         task_data = train_X[label_mask]
+         task_labels = task_labels[label_mask].astype(int)
+
+         # Remove low variance features and scale
+         var_thresh = VarianceThreshold(threshold=task_config[task]["var_threshold"])
+         task_data = var_thresh.fit_transform(task_data)
+         scaler = StandardScaler()
+         task_data = scaler.fit_transform(task_data)
+         model.feature_processors[task] = {
+             "selector": var_thresh,
+             "scaler": scaler,
+         }
+
+         # From X_train, split off 10% as an early stopping validation set
+         np.random.seed(SEED)
+         random_numbers = np.random.rand(task_data.shape[0])
+         es_val_mask = random_numbers < 0.1
+         es_train_mask = random_numbers >= 0.1
+         X_es_val, y_es_val = task_data[es_val_mask], task_labels[es_val_mask]
+         X_es_train, y_es_train = task_data[es_train_mask], task_labels[es_train_mask]
+
+         print(f"Fit task {task} using {sum(label_mask)} samples and {task_data.shape[1]} features")
+         model.fit(task, X_es_train, y_es_train, eval_set=[(X_es_val, y_es_val)], verbose=False)
+
+         # Evaluation -----------------------
+         val_task_labels = val_y[:, i]
+         val_label_mask = ~np.isnan(val_task_labels)
+         val_task_labels = val_task_labels[val_label_mask].astype(int)
+         val_task_data = val_X[val_label_mask]
+         val_task_data = model.feature_processors[task]["selector"].transform(val_task_data)
+         val_task_data = model.feature_processors[task]["scaler"].transform(val_task_data)
+
+         # Evaluate model
+         pred = model.predict(task, val_task_data)
+         results[task] = [roc_auc_score(y_true=val_task_labels, y_score=pred)]
+
+     print(f"Save model under {args.model_dir}")
+     model.save_model(args.model_dir)
+
+     print("Results:")
+     print(tabulate(results, headers="keys"))
+     print("Average: ", sum(val[0] for val in results.values()) / len(results))
+
+
+ if __name__ == "__main__":
+     args = parser.parse_args()
+     main(args)
src/utils.py ADDED
@@ -0,0 +1,432 @@
+ ## These MolStandardizer classes are due to Paolo Tosco.
+ ## They were taken from the FS-Mol GitHub repository
+ ## (https://github.com/microsoft/FS-Mol/blob/main/fs_mol/preprocessing/utils/
+ ## standardizer.py)
+ ## They ensure that a sequence of standardization operations is applied:
+ ## https://gist.github.com/ptosco/7e6b9ab9cc3e44ba0919060beaed198e
+
+ import os
+
+ from rdkit import Chem
+ from rdkit.Chem.MolStandardize import rdMolStandardize
+
+ HF_TOKEN = os.environ.get("HF_TOKEN")
+
+ TASKS = [
+     "NR-AR",
+     "NR-AR-LBD",
+     "NR-AhR",
+     "NR-Aromatase",
+     "NR-ER",
+     "NR-ER-LBD",
+     "NR-PPAR-gamma",
+     "SR-ARE",
+     "SR-ATAD5",
+     "SR-HSE",
+     "SR-MMP",
+     "SR-p53",
+ ]
+
+ KNOWN_DESCR = ["ecfps", "rdkit_descr_quantiles", "maccs", "tox"]
+
+ # indices of the 200 RDKit descriptors that are used (0-16 and 25-207);
+ # written as ranges instead of one literal per line for readability
+ USED_200_DESCR = list(range(0, 17)) + list(range(25, 208))
+
+
+ class Standardizer:
+     """
+     Simple wrapper class around the rdkit Standardizer.
+     """
+
+     DEFAULT_CANON_TAUT = False
+     DEFAULT_METAL_DISCONNECT = False
+     MAX_TAUTOMERS = 100
+     MAX_TRANSFORMS = 100
+     MAX_RESTARTS = 200
+     PREFER_ORGANIC = True
+
+     def __init__(
+         self,
+         metal_disconnect=None,
+         canon_taut=None,
+     ):
+         """
+         Constructor.
+         All parameters are optional.
+         :param metal_disconnect: if True, metallorganic complexes are
+                                  disconnected
+         :param canon_taut: if True, molecules are converted to their
+                            canonical tautomer
+         """
+         super().__init__()
+         if metal_disconnect is None:
+             metal_disconnect = self.DEFAULT_METAL_DISCONNECT
+         if canon_taut is None:
+             canon_taut = self.DEFAULT_CANON_TAUT
+         self._canon_taut = canon_taut
+         self._metal_disconnect = metal_disconnect
+         self._taut_enumerator = None
+         self._uncharger = None
+         self._lfrag_chooser = None
+         self._metal_disconnector = None
+         self._normalizer = None
+         self._reionizer = None
+         self._params = None
+
+     @property
+     def params(self):
+         """Return the MolStandardize CleanupParameters."""
+         if self._params is None:
+             self._params = rdMolStandardize.CleanupParameters()
+             self._params.maxTautomers = self.MAX_TAUTOMERS
+             self._params.maxTransforms = self.MAX_TRANSFORMS
+             self._params.maxRestarts = self.MAX_RESTARTS
+             self._params.preferOrganic = self.PREFER_ORGANIC
+             self._params.tautomerRemoveSp3Stereo = False
+         return self._params
+
+     @property
+     def canon_taut(self):
+         """Return whether tautomer canonicalization will be done."""
+         return self._canon_taut
+
+     @property
+     def metal_disconnect(self):
+         """Return whether metallorganic complexes will be disconnected."""
+         return self._metal_disconnect
+
+     @property
+     def taut_enumerator(self):
+         """Return the TautomerEnumerator object."""
+         if self._taut_enumerator is None:
+             self._taut_enumerator = rdMolStandardize.TautomerEnumerator(self.params)
+         return self._taut_enumerator
+
+     @property
+     def uncharger(self):
+         """Return the Uncharger object."""
+         if self._uncharger is None:
+             self._uncharger = rdMolStandardize.Uncharger()
+         return self._uncharger
+
+     @property
+     def lfrag_chooser(self):
+         """Return the LargestFragmentChooser object."""
+         if self._lfrag_chooser is None:
+             self._lfrag_chooser = rdMolStandardize.LargestFragmentChooser(
+                 self.params.preferOrganic
+             )
+         return self._lfrag_chooser
+
+     @property
+     def metal_disconnector(self):
+         """Return the MetalDisconnector object."""
+         if self._metal_disconnector is None:
+             self._metal_disconnector = rdMolStandardize.MetalDisconnector()
+         return self._metal_disconnector
+
+     @property
+     def normalizer(self):
+         """Return the Normalizer object."""
+         if self._normalizer is None:
+             self._normalizer = rdMolStandardize.Normalizer(
+                 self.params.normalizationsFile, self.params.maxRestarts
+             )
+         return self._normalizer
+
+     @property
+     def reionizer(self):
+         """Return the Reionizer object."""
+         if self._reionizer is None:
+             self._reionizer = rdMolStandardize.Reionizer(self.params.acidbaseFile)
+         return self._reionizer
+
+     def charge_parent(self, mol_in):
+         """Sequentially apply a series of MolStandardize operations:
+         * MetalDisconnector
+         * Normalizer
+         * Reionizer
+         * LargestFragmentChooser
+         * Uncharger
+         The net result is that a desalted, normalized, neutral
+         molecule with implicit Hs is returned.
+         """
+         params = Chem.RemoveHsParameters()
+         params.removeAndTrackIsotopes = True
+         mol_in = Chem.RemoveHs(mol_in, params, sanitize=False)
+         if self._metal_disconnect:
+             mol_in = self.metal_disconnector.Disconnect(mol_in)
+         normalized = self.normalizer.normalize(mol_in)
+         Chem.SanitizeMol(normalized)
+         normalized = self.reionizer.reionize(normalized)
+         Chem.AssignStereochemistry(normalized)
+         normalized = self.lfrag_chooser.choose(normalized)
+         normalized = self.uncharger.uncharge(normalized)
+         # need this to reassess aromaticity on things like
+         # cyclopentadienyl, tropylium, azolium, etc.
+         Chem.SanitizeMol(normalized)
+         return Chem.RemoveHs(Chem.AddHs(normalized))
+
+     def standardize_mol(self, mol_in):
+         """
+         Standardize a single molecule.
+         :param mol_in: a Chem.Mol
+         :return: * (standardized Chem.Mol, n_taut) tuple
+                    if success. n_taut will be negative if
+                    tautomer enumeration was aborted due
+                    to reaching a limit
+                  * (None, error_msg) if failure
+         This calls self.charge_parent() and, if self._canon_taut
+         is True, runs tautomer canonicalization.
+         """
+         n_tautomers = 0
+         if isinstance(mol_in, Chem.Mol):
+             name = None
+             try:
+                 name = mol_in.GetProp("_Name")
+             except KeyError:
+                 pass
+             if not name:
+                 name = "NONAME"
+         else:
+             error = f"Expected SMILES or Chem.Mol as input, got {str(type(mol_in))}"
+             return None, error
+         try:
+             mol_out = self.charge_parent(mol_in)
+         except Exception as e:
+             error = f"charge_parent FAILED: {str(e).strip()}"
+             return None, error
+         if self._canon_taut:
+             try:
+                 res = self.taut_enumerator.Enumerate(mol_out, False)
+             except TypeError:
+                 # we are still on the pre-2021 RDKit API
+                 res = self.taut_enumerator.Enumerate(mol_out)
+             except Exception as e:
+                 # something else went wrong
+                 error = f"canon_taut FAILED: {str(e).strip()}"
+                 return None, error
+             n_tautomers = len(res)
+             if hasattr(res, "status"):
+                 completed = (
+                     res.status == rdMolStandardize.TautomerEnumeratorStatus.Completed
+                 )
+             else:
+                 # we are still on the pre-2021 RDKit API
+                 completed = len(res) < 1000
+             if not completed:
+                 n_tautomers = -n_tautomers
+             try:
+                 mol_out = self.taut_enumerator.PickCanonical(res)
+             except AttributeError:
+                 # we are still on the pre-2021 RDKit API
+                 mol_out = max(
+                     [(self.taut_enumerator.ScoreTautomer(m), m) for m in res]
+                 )[1]
+             except Exception as e:
+                 # something else went wrong
+                 error = f"canon_taut FAILED: {str(e).strip()}"
+                 return None, error
+         mol_out.SetProp("_Name", name)
+         return mol_out, n_tautomers
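A short usage sketch for the `Standardizer` (illustrative only; it mirrors how `create_cleaned_mol_objects` in `src/preprocess.py` uses the class):

```python
from rdkit import Chem
from src.utils import Standardizer

sm = Standardizer(canon_taut=True)

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)[O-]")  # aspirin, deprotonated
std_mol, n_taut = sm.standardize_mol(mol)
if std_mol is not None:
    # desalted, neutralized, canonical-tautomer form
    print(Chem.MolToSmiles(std_mol))
else:
    # on failure, the second element is the error message
    print(f"standardization failed: {n_taut}")
```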