antoniaebner committed
Commit b0daa87 · 1 Parent(s): 06a41f1
.gitignore DELETED
@@ -1 +0,0 @@
1
- __pycache__/
 
 
README.md DELETED
@@ -1,103 +0,0 @@
1
- ---
2
- title: Tox21 XGBoost Classifier
3
- emoji: 🚀
4
- colorFrom: green
5
- colorTo: purple
6
- sdk: docker
7
- pinned: false
8
- license: apache-2.0
9
- short_description: XGBoost baseline classifier for Tox21
10
- ---
11
-
12
- # Tox21 XGBoost Classifier
13
-
14
- This repository hosts a Hugging Face Space that provides an example API for submitting models to the [Tox21 Leaderboard](https://huggingface.co/spaces/tschouis/tox21_leaderboard).
15
-
16
- In this example, we train an XGBoost classifier on the Tox21 targets and save the trained model in the `assets/` folder.
17
-
18
- **Important:** For leaderboard submission, your Space does not need to include training code. It only needs to implement inference in the `predict()` function inside `predict.py`. The `predict()` function must keep the provided skeleton: it should take a list of SMILES strings as input and return a prediction dictionary as output, with SMILES and targets as keys. Therefore, any preprocessing of SMILES strings must be executed on-the-fly during inference.
19
-
20
- # Repository Structure
21
- - `predict.py` - Defines the `predict()` function required by the leaderboard (entry point for inference).
22
- - `app.py` - FastAPI application wrapper (can be used as-is).
23
-
24
- - `src/` - Core model & preprocessing logic:
25
- `data.py` - Dataset loading for training (reads the preprocessed descriptor files)
- `preprocess.py` - SMILES preprocessing pipeline (cleaning, descriptors)
26
- - `model.py` - XGBoost classifier wrapper
27
- - `train.py` - Script to train the classifier
28
- `utils.py` - Constants and helper functions
29
-
30
- # Quickstart with Spaces
31
-
32
- You can easily adapt this project to your own Hugging Face account:
33
-
34
- - Open this Space on Hugging Face.
35
-
36
- - Click "Duplicate this Space" (top-right corner).
37
-
38
- Modify `src/` for your preprocessing pipeline and model class.
39
-
40
- - Modify `predict()` inside `predict.py` to perform model inference while keeping the function skeleton unchanged to remain compatible with the leaderboard.
41
-
42
- That’s it, your model will be available as an API endpoint for the Tox21 Leaderboard.
43
-
44
- # Installation
45
- To run (and train) the XGBoost classifier, clone the repository and install the dependencies:
46
-
47
- ```bash
48
- git clone https://huggingface.co/spaces/tschouis/tox21_xgboost_classifier
49
- cd tox21_xgboost_classifier
50
-
51
- conda create -n tox21_xgb_cls python=3.11
52
- conda activate tox21_xgb_cls
53
- pip install -r requirements.txt
54
- ```
55
-
56
- # Training
57
-
58
- To train the XGBoost model from scratch:
59
-
60
- ```bash
61
- # build the preprocessed descriptor files first, then train
- python -m src.preprocess
- python -m src.train
62
- ```
63
-
64
- This will:
65
-
66
- 1. Load and preprocess the Tox21 training dataset.
67
- 2. Train an XGBoost classifier.
68
- 3. Save the trained model to the `assets/` folder.
69
- 4. Evaluate the trained XGBoost classifier on the validation split.
70
-
71
-
72
- # Inference
73
-
74
- For inference, you only need `predict.py` (together with the `src/` modules and the saved files in `assets/`).
75
-
76
- Example usage inside Python:
77
-
78
- ```python
79
- from predict import predict
80
-
81
- smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
82
- results = predict(smiles_list)
83
-
84
- print(results)
85
- ```
86
-
87
- The output is a nested dictionary with the 12 Tox21 target names as inner keys and the predicted probabilities as values, for example:
88
-
89
- ```python
90
- {
91
- "CCO": {"target1": 0, "target2": 1, ..., "target12": 0},
92
- "c1ccccc1": {"target1": 1, "target2": 0, ..., "target12": 1},
93
- "CC(=O)O": {"target1": 0, "target2": 0, ..., "target12": 0}
94
- }
95
- ```
96
-
97
- # Notes
98
-
99
- For leaderboard submission, you only need to adapt `predict.py` so that it runs your model's inference.
100
-
101
- - Training (`src/train.py`) is provided for reproducibility.
102
-
103
- Preprocessing (here inside `src/preprocess.py`) must be applied at inference time, not only during training.
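The README above requires the `predict()` skeleton to stay unchanged. Below is a minimal sketch of that skeleton (illustrative only: the constant 0.0 scores are placeholders, and the target names are the ones listed in `app.py` further down):

```python
# Minimal predict() skeleton expected by the leaderboard (placeholder scores only).
TARGETS = [
    "NR-AR", "NR-AR-LBD", "NR-AhR", "NR-Aromatase", "NR-ER", "NR-ER-LBD",
    "NR-PPAR-gamma", "SR-ARE", "SR-ATAD5", "SR-HSE", "SR-MMP", "SR-p53",
]


def predict(smiles_list: list[str]) -> dict[str, dict[str, float]]:
    """Return a {smiles: {target: score}} dictionary covering all 12 Tox21 targets."""
    # Replace the constant 0.0 with your model's predicted probability per target.
    return {smiles: {target: 0.0 for target in TARGETS} for smiles in smiles_list}
```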
 
 
app.py DELETED
@@ -1,78 +0,0 @@
1
- """
2
- This is the main entry point for the FastAPI application.
3
- The app handles requests to predict toxicity for a list of SMILES strings.
4
- """
5
-
6
- # ---------------------------------------------------------------------------------------
7
- # Dependencies and global variable definition
8
- import os
9
- from typing import List, Dict, Optional
10
- from fastapi import FastAPI, Header, HTTPException
11
- from pydantic import BaseModel, Field
12
-
13
- from predict import predict as predict_func
14
-
15
- API_KEY = os.getenv("API_KEY") # set via Space Secrets
16
-
17
-
18
- # ---------------------------------------------------------------------------------------
19
- class Request(BaseModel):
20
- smiles: List[str] = Field(min_items=1, max_items=1000)
21
-
22
-
23
- class Response(BaseModel):
24
- predictions: dict
25
- model_info: Dict[str, str] = {}
26
-
27
-
28
- app = FastAPI(title="toxicity-api")
29
-
30
-
31
- @app.get("/")
32
- def root():
33
- return {
34
- "message": "Toxicity Prediction API",
35
- "endpoints": {
36
- "/metadata": "GET - API metadata and capabilities",
37
- "/healthz": "GET - Health check",
38
- "/predict": "POST - Predict toxicity for SMILES",
39
- },
40
- "usage": "Send POST to /predict with {'smiles': ['your_smiles_here']} and Authorization header",
41
- }
42
-
43
-
44
- @app.get("/metadata")
45
- def metadata():
46
- return {
47
- "name": "AwesomeTox",
48
- "version": "1.0.0",
49
- "max_batch_size": 256,
50
- "tox_endpoints": [
51
- "NR-AR",
52
- "NR-AR-LBD",
53
- "NR-AhR",
54
- "NR-Aromatase",
55
- "NR-ER",
56
- "NR-ER-LBD",
57
- "NR-PPAR-gamma",
58
- "SR-ARE",
59
- "SR-ATAD5",
60
- "SR-HSE",
61
- "SR-MMP",
62
- "SR-p53",
63
- ],
64
- }
65
-
66
-
67
- @app.get("/healthz")
68
- def healthz():
69
- return {"ok": True}
70
-
71
-
72
- @app.post("/predict", response_model=Response)
73
- def predict(request: Request):
74
- predictions = predict_func(request.smiles)
75
- return {
76
- "predictions": predictions,
77
- "model_info": {"name": "random_clf", "version": "1.0.0"},
78
- }
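A hypothetical client call against the `/predict` endpoint defined above (the Space URL and the bearer-token scheme are assumptions; as written, `app.py` reads `API_KEY` but does not verify the header itself):

```python
import requests

SPACE_URL = "https://<your-space>.hf.space"  # placeholder, not a real endpoint

response = requests.post(
    f"{SPACE_URL}/predict",
    json={"smiles": ["CCO", "c1ccccc1"]},
    # Auth scheme assumed from the usage hint returned by root().
    headers={"Authorization": "Bearer <API_KEY>"},
    timeout=60,
)
response.raise_for_status()
print(response.json()["predictions"])
```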
 
 
assets/ecdfs.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:9d1f0b5753af1e5aa697bd3e0fc4155d6a96bdfd083139f96ce140cb3d47f127
3
- size 37660397
 
 
 
 
assets/scaler.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:4b11f9ff54fd099e05a8423dcb2f3cf059c6bdfca9d068de46bf0ce0f727e136
3
- size 78256
 
 
 
 
assets/tox_smarts.json DELETED
The diff for this file is too large to render. See raw diff
 
assets/xgb_alltasks.joblib DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:c987ccf417df7c3e458512ffb71d3c052efd5f091f426299644544a8971b2bb6
3
- size 34793787
 
 
 
 
assets_old/ecdfs.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:9d1f0b5753af1e5aa697bd3e0fc4155d6a96bdfd083139f96ce140cb3d47f127
3
- size 37660397
 
 
 
 
assets_old/scaler.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:cec4650acf6ecc7dd7b820459acc6b6a1bc1f78852ee2328798d6754465c95d0
3
- size 54415
 
 
 
 
assets_old/xgb_alltasks.joblib DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:22748188c9bffbdd15febc4caf2daf9d00d660670025fc5b4371aaf36a0e8fea
3
- size 19718840
 
 
 
 
data/ecdfs.pkl DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:782174f4353d7342d9d74ad672f00aeeeef57f35a4dad4b3e20b08a35079adf7
3
- size 33743597
 
 
 
 
predict.py DELETED
@@ -1,64 +0,0 @@
1
- """
2
- This file includes the predict function for the Tox21 leaderboard submission.
3
- As an input it takes a list of SMILES and it outputs a nested dictionary with
4
- SMILES and target names as keys.
5
- """
6
-
7
- # ---------------------------------------------------------------------------------------
8
- # Dependencies
9
- from collections import defaultdict
10
-
11
- import numpy as np
12
-
13
- from src.model import Tox21XGBClassifier
14
- from src.preprocess import create_descriptors
15
- from src.utils import load_pickle, KNOWN_DESCR
16
-
17
- # ---------------------------------------------------------------------------------------
18
-
19
-
20
- def predict(smiles_list: list[str]) -> dict[str, dict[str, float]]:
21
- """Applies the classifier to a list of SMILES strings. Returns prediction=0.0 for
22
- any molecule that could not be cleaned.
23
-
24
- Args:
25
- smiles_list (list[str]): list of SMILES strings
26
-
27
- Returns:
28
- dict: nested prediction dictionary, following {'<smiles>': {'<target>': <pred>}}
29
- """
30
- print(f"Received {len(smiles_list)} SMILES strings")
31
- # preprocessing pipeline
32
- ecdfs_path = "assets/ecdfs.pkl"
33
- scaler_path = "assets/scaler.pkl"
34
- ecdfs = load_pickle(ecdfs_path)
35
- scaler = load_pickle(scaler_path)
36
- print(f"Loaded ecdfs from {ecdfs_path}")
37
- print(f"Loaded scaler from {scaler_path}")
38
-
39
- descriptors = KNOWN_DESCR
40
- features, mol_mask = create_descriptors(
41
- smiles_list,
42
- ecdfs=ecdfs,
43
- scaler=scaler,
44
- descriptors=descriptors,
45
- )
46
- print(f"Created descriptors {descriptors} for molecules.")
47
- print(f"{len(mol_mask) - sum(mol_mask)} molecules removed during cleaning")
48
-
49
- # setup model
50
- model = Tox21XGBClassifier(seed=42)
51
- model_path = "assets/xgb_alltasks.joblib"
52
- model.load_model(model_path)
53
- print(f"Loaded model from {model_path}")
54
-
55
- # make predictions
56
- predictions = defaultdict(dict)
57
- # build an index array with the same length as smiles_list to look up each molecule's feature row
58
- feat_indices = np.cumsum(mol_mask) - 1
59
-
60
- for target in model.tasks:
61
- target_pred = model.predict(target, features)
62
- for smiles, is_clean, i in zip(smiles_list, mol_mask, feat_indices):
63
- predictions[smiles][target] = float(target_pred[i]) if is_clean else 0.0
64
- return predictions
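`predict()` above uses `np.cumsum(mol_mask) - 1` to turn the cleaning mask into per-molecule feature-row indices. A standalone toy illustration of that indexing pattern (numbers invented for the example):

```python
import numpy as np

# Toy mask: molecules 0, 2 and 3 survived cleaning, molecule 1 was dropped.
mol_mask = np.array([True, False, True, True])
feat_indices = np.cumsum(mol_mask) - 1  # -> array([0, 0, 1, 2])

# A surviving molecule at original position i is looked up at row feat_indices[i];
# positions where mol_mask is False are skipped (predict() returns 0.0 for them).
for i, (is_clean, row) in enumerate(zip(mol_mask, feat_indices)):
    print(i, int(row) if is_clean else None)
```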
 
 
requirements.txt DELETED
@@ -1,10 +0,0 @@
1
- fastapi
2
- uvicorn[standard]
3
- statsmodels
4
- rdkit
5
- numpy
6
- scikit-learn==1.7.1
7
- joblib
8
- tabulate
9
- datasets
10
- xgboost==3.0.5
 
 
src/__init__.py DELETED
File without changes
src/data.py DELETED
@@ -1,90 +0,0 @@
1
- # pipeline taken from https://huggingface.co/spaces/ml-jku/mhnfs/blob/main/src/data_preprocessing/create_descriptors.py
2
-
3
- """
4
- This file includes the dataset loading for Tox21 training.
5
- It reads preprocessed descriptor files (.npz) and returns feature matrices, labels,
6
- and the fitted scaler.
7
- """
8
-
9
- from typing import Iterable, Literal
10
-
11
- import numpy as np
12
- import torch
13
-
14
- from .preprocess import normalize_features
15
-
16
- KNOWN_DESCR = ["ecfps", "rdkit_descr_quantiles", "maccs", "tox"]
17
-
18
-
19
- def get_descriptor_dataset(
20
- data_path: str,
21
- descriptors: Iterable[str] | Literal["all"],
22
- scaler=None,
23
- save_scaler_path: str = "data/scaler.pkl",
24
- verbose=True,
25
- normalize=True,
26
- ):
27
- if descriptors == "all":
28
- descriptors = KNOWN_DESCR
29
-
30
- assert isinstance(descriptors, Iterable), "Passed descriptors are not iterable!"
31
- assert all(
32
- [descr in KNOWN_DESCR for descr in descriptors]
33
- ), f"Passed descriptors contains unknown descriptor types. Allowed descriptors: {KNOWN_DESCR}"
34
-
35
- datafile = np.load(data_path)
36
-
37
- if not isinstance(datafile, np.ndarray):
38
- # concatenate all descriptors and normalize
39
- data = np.concatenate([datafile[descr] for descr in descriptors], axis=1)
40
- labels = datafile["labels"]
41
-
42
- else:
43
- print("NPY file passed, cannot select specific descriptors")
44
- data, labels = datafile[:, :-12], datafile[:, -12:]
45
-
46
- if normalize:
47
- data, scaler = normalize_features(
48
- data,
49
- scaler=scaler,
50
- save_scaler_path=save_scaler_path,
51
- verbose=verbose,
52
- )
53
-
54
- # filter out unsanitized molecules
55
- mask = ~np.isnan(data).any(axis=1)
56
- data = data[mask]
57
- labels = labels[mask]
58
-
59
- assert data.shape[0] == labels.shape[0], (
60
- f"Mismatch between data and labels: "
61
- f"data has {data.shape[0]} samples, but labels has {labels.shape[0]} samples."
62
- )
63
-
64
- return (data, labels, scaler)
65
-
66
-
67
- def get_torch_descriptor_dataset(
68
- data_path: str,
69
- descriptors: list[str],
70
- scaler=None,
71
- save_scaler_path: str = "data/scaler.pkl",
72
- nan_to_num: int = -100,
73
- verbose=True,
74
- normalize=True,
75
- ) -> torch.utils.data.TensorDataset:
76
- data, labels, scaler = get_descriptor_dataset(
77
- data_path,
78
- descriptors,
79
- scaler,
80
- save_scaler_path,
81
- verbose=verbose,
82
- normalize=normalize,
83
- )
84
-
85
- labels = np.nan_to_num(labels, nan=nan_to_num)
86
-
87
- dataset = torch.utils.data.TensorDataset(
88
- torch.FloatTensor(data), torch.LongTensor(labels)
89
- )
90
- return dataset, scaler
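A usage sketch for `get_descriptor_dataset` (the `.npz` paths are assumptions; such files are produced by `src/preprocess.py`):

```python
from src.data import get_descriptor_dataset

# Fit the scaler on the training split, then reuse it for validation.
train_X, train_y, scaler = get_descriptor_dataset(
    "data/tox21_train.npz",              # assumed output of src/preprocess.py
    descriptors="all",                   # ecfps + rdkit quantiles + maccs + tox
    save_scaler_path="data/scaler.pkl",
)
val_X, val_y, _ = get_descriptor_dataset(
    "data/tox21_validation.npz",
    descriptors="all",
    scaler=scaler,                       # reuse the fitted scaler
)
print(train_X.shape, train_y.shape, val_X.shape)
```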
 
 
src/model.py DELETED
@@ -1,79 +0,0 @@
1
- """
2
- This file includes an XGBoost model for Tox21.
3
- It wraps one XGBClassifier per Tox21 task and predicts toxicity probabilities
4
- from precomputed molecule features.
5
- """
6
-
7
- # ---------------------------------------------------------------------------------------
8
- # Dependencies
9
- import os
10
- import joblib
11
-
12
- import numpy as np
13
- from xgboost import XGBClassifier
14
-
15
- from .utils import TASKS
16
-
17
-
18
- # ---------------------------------------------------------------------------------------
19
- class Tox21XGBClassifier:
20
- """A XGBoost classifier that assigns a toxicity score to a given SMILES string."""
21
-
22
- def __init__(self, seed: int = 42, task_config: dict | None = None):
23
- """Initialize an XGBoost classifier for each of the 12 Tox21 tasks.
24
-
25
- Args:
26
- seed (int, optional): seed for XGBoost to ensure reproducibility. Defaults to 42.
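- task_config (dict, optional): per-task XGBoost hyperparameters keyed by task name. Defaults to None.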
27
- """
28
- self.tasks = TASKS
29
- self.model = {
30
- task: XGBClassifier(**(task_config or {}).get(task, {"n_estimators": 1000}), random_state=seed, n_jobs=8)
31
- for task in self.tasks
32
- }
33
-
34
- def load_model(self, path: str) -> None:
35
- """Loads the model from a given path
36
-
37
- Args:
38
- path (str): path to model checkpoint
39
- """
40
- self.model = joblib.load(path)
41
-
42
- def save_model(self, path: str) -> None:
43
- """Saves the model to a given path
44
-
45
- Args:
46
- path (str): path to save model to
47
- """
48
- if not os.path.exists(os.path.dirname(path)):
49
- os.makedirs(os.path.dirname(path))
50
-
51
- joblib.dump(self.model, path)
52
-
53
- def fit(self, task: str, input_features: np.ndarray, labels: np.ndarray) -> None:
54
- """Train XGBoost for a given task
55
-
56
- Args:
57
- task (str): task to train
58
- input_features (np.ndarray): training features
59
- labels (np.ndarray): training labels
60
- """
61
- assert task in self.tasks, f"Unknown task: {task}"
62
- self.model[task].fit(input_features, labels)
63
-
64
- def predict(self, task: str, features: np.ndarray) -> np.ndarray:
65
- """Predicts labels for a given Tox21 target using molecule features
66
-
67
- Args:
68
- task (str): the Tox21 target to predict for
69
- features (np.ndarray): molecule features used for prediction
70
-
71
- Returns:
72
- np.ndarray: predicted probability for positive class
73
- """
74
- assert task in self.tasks, f"Unknown task: {task}"
75
- assert (
76
- len(features.shape) == 2
77
- ), f"Function expects 2D np.array. Current shape: {features.shape}"
78
- preds = self.model[task].predict_proba(features)
79
- return preds[:, 1]
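A short usage sketch for `Tox21XGBClassifier` (toy random features and labels, and a hypothetical save path, for illustration only):

```python
import numpy as np
from src.model import Tox21XGBClassifier

rng = np.random.default_rng(0)
X = rng.random((64, 16))                 # toy feature matrix
y = rng.integers(0, 2, size=64)          # toy binary labels

model = Tox21XGBClassifier(seed=42)
model.fit("NR-AR", X, y)                 # one XGBClassifier per Tox21 task
probs = model.predict("NR-AR", X)        # probability of the positive (toxic) class

model.save_model("assets_demo/xgb_demo.joblib")   # hypothetical path
restored = Tox21XGBClassifier(seed=42)
restored.load_model("assets_demo/xgb_demo.joblib")
```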
 
 
src/preprocess.py DELETED
@@ -1,405 +0,0 @@
1
- # pipeline taken from https://huggingface.co/spaces/ml-jku/mhnfs/blob/main/src/data_preprocessing/create_descriptors.py
2
-
3
- """
4
- This file includes the data preprocessing for Tox21.
5
- As an input it takes a list of SMILES and it outputs the concatenated descriptor
6
- features together with a mask marking molecules that could not be cleaned.
7
- """
8
-
9
- import os
10
- import argparse
11
- import json
12
- from typing import Iterable
13
-
14
- import numpy as np
15
- import pandas as pd
16
-
17
- from sklearn.preprocessing import StandardScaler
18
- from statsmodels.distributions.empirical_distribution import ECDF
19
- from datasets import load_dataset
20
-
21
- from rdkit import Chem, DataStructs
22
- from rdkit.Chem import Descriptors, rdFingerprintGenerator, MACCSkeys
23
- from rdkit.Chem.rdchem import Mol
24
-
25
- from src.utils import (
26
- TASKS,
27
- KNOWN_DESCR,
28
- HF_TOKEN,
29
- USED_200_DESCR,
30
- Standardizer,
31
- load_pickle,
32
- write_pickle,
33
- )
34
-
35
- parser = argparse.ArgumentParser(
36
- description="Data preprocessing script for the Tox21 dataset"
37
- )
38
-
39
- parser.add_argument(
40
- "--save_folder",
41
- type=str,
42
- default="data/",
43
- )
44
-
45
- parser.add_argument(
46
- "--use_hf",
47
- type=int,
48
- default=0,
49
- )
50
-
51
- parser.add_argument(
52
- "--path_ecdfs",
53
- type=str,
54
- default="data/ecdfs.pkl",
55
- )
56
-
57
- parser.add_argument(
58
- "--tox_smarts_filepath",
59
- type=str,
60
- default="data/tox_smarts.json",
61
- )
62
-
63
-
64
- def create_cleaned_mol_objects(smiles: list[str]) -> tuple[list[Mol], np.ndarray]:
65
- """This function creates cleaned RDKit mol objects from a list of SMILES.
66
-
67
- Args:
68
- smiles (list[str]): list of SMILES
69
-
70
- Returns:
71
- list[Mol]: list of cleaned molecules
72
- np.ndarray[bool]: mask that contains False at index `i`, if molecule in `smiles` at
73
- index `i` could not be cleaned and was removed.
74
- """
75
- sm = Standardizer(canon_taut=True)
76
-
77
- clean_mol_mask = list()
78
- mols = list()
79
- for i, smile in enumerate(smiles):
80
- mol = Chem.MolFromSmiles(smile)
81
- standardized_mol, _ = sm.standardize_mol(mol)
82
- is_cleaned = standardized_mol is not None
83
- clean_mol_mask.append(is_cleaned)
84
- if not is_cleaned:
85
- continue
86
- can_mol = Chem.MolFromSmiles(Chem.MolToSmiles(standardized_mol))
87
- mols.append(can_mol)
88
-
89
- return mols, np.array(clean_mol_mask)
90
-
91
-
92
- def create_ecfp_fps(mols: list[Mol]) -> np.ndarray:
93
- """This function ECFP fingerprints for a list of molecules.
94
-
95
- Args:
96
- mols (list[Mol]): list of molecules
97
-
98
- Returns:
99
- np.ndarray: ECFP fingerprints of molecules
100
- """
101
- ecfps = list()
102
-
103
- for mol in mols:
104
- fp_sparse_vec = rdFingerprintGenerator.GetCountFPs(
105
- [mol], fpType=rdFingerprintGenerator.MorganFP
106
- )[0]
107
- fp = np.zeros((0,), np.int8)
108
- DataStructs.ConvertToNumpyArray(fp_sparse_vec, fp)
109
-
110
- ecfps.append(fp)
111
-
112
- return np.array(ecfps)
113
-
114
-
115
- def create_maccs_keys(mols: list[Mol]) -> np.ndarray:
116
- maccs = [MACCSkeys.GenMACCSKeys(x) for x in mols]
117
- return np.array(maccs)
118
-
119
-
120
- def get_tox_patterns(filepath: str):
121
- """This calculates tox features defined in tox_smarts.json.
122
- Args:
123
- filepath: path to the tox_smarts.json file containing the SMARTS patterns
124
- Returns: a list of (patterns, negations, merge_any) tuples ready for matching
125
- """
126
- # load patterns
127
- with open(filepath) as f:
128
- smarts_list = [s[1] for s in json.load(f)]
129
-
130
- # Code does not work for this case
131
- assert len([s for s in smarts_list if ("AND" in s) and ("OR" in s)]) == 0
132
-
133
- # Chem.MolFromSmarts takes a long time so it pays off to parse all the smarts first
134
- # and then use them for all molecules. This gives a huge speedup over existing code.
135
- # a list of patterns, whether to negate the match result and how to join them to obtain one boolean value
136
- all_patterns = []
137
- for smarts in smarts_list:
138
- patterns = [] # list of smarts-patterns
139
- # value for each of the patterns above. Negates the values of the above later.
140
- negations = []
141
-
142
- if " AND " in smarts:
143
- smarts = smarts.split(" AND ")
144
- merge_any = False # If an ' AND ' is found all 'subsmarts' have to match
145
- else:
146
- # If there is an ' OR ' present it's enough is any of the 'subsmarts' match.
147
- # This also accumulates smarts where neither ' OR ' nor ' AND ' occur
148
- smarts = smarts.split(" OR ")
149
- merge_any = True
150
-
151
- # for all subsmarts check if they are preceded by 'NOT '
152
- for s in smarts:
153
- neg = s.startswith("NOT ")
154
- if neg:
155
- s = s[4:]
156
- patterns.append(Chem.MolFromSmarts(s))
157
- negations.append(neg)
158
-
159
- all_patterns.append((patterns, negations, merge_any))
160
- return all_patterns
161
-
162
-
163
- def create_tox_features(mols: list[Mol], patterns: list) -> np.ndarray:
164
- """Matches the tox patterns against a molecule. Returns a boolean array"""
165
- tox_data = []
166
- for mol in mols:
167
- mol_features = []
168
- for patts, negations, merge_any in patterns:
169
- matches = [mol.HasSubstructMatch(p) for p in patts]
170
- matches = [m != n for m, n in zip(matches, negations)]
171
- if merge_any:
172
- pres = any(matches)
173
- else:
174
- pres = all(matches)
175
- mol_features.append(pres)
176
-
177
- tox_data.append(np.array(mol_features))
178
-
179
- return np.array(tox_data)
180
-
181
-
182
- def create_rdkit_descriptors(mols: list[Mol]) -> np.ndarray:
183
- """This function creates RDKit descriptors for a list of molecules.
184
-
185
- Args:
186
- mols (list[Mol]): list of molecules
187
-
188
- Returns:
189
- np.ndarray: RDKit descriptors of molecules
190
- """
191
- rdkit_descriptors = list()
192
-
193
- for mol in mols:
194
- descrs = []
195
- for _, descr_calc_fn in Descriptors._descList:
196
- descrs.append(descr_calc_fn(mol))
197
-
198
- descrs = np.array(descrs)
199
- descrs = descrs[USED_200_DESCR]
200
- rdkit_descriptors.append(descrs)
201
-
202
- return np.array(rdkit_descriptors)
203
-
204
-
205
- def create_quantiles(raw_features: np.ndarray, ecdfs: list) -> np.ndarray:
206
- """Create quantile values for given features using the columns
207
-
208
- Args:
209
- raw_features (np.ndarray): values to put into quantiles
210
- ecdfs (list): ECDFs to use
211
-
212
- Returns:
213
- np.ndarray: computed quantiles
214
- """
215
- quantiles = np.zeros_like(raw_features)
216
-
217
- for column in range(raw_features.shape[1]):
218
- raw_values = raw_features[:, column].reshape(-1)
219
- ecdf = ecdfs[column]
220
- q = ecdf(raw_values)
221
- quantiles[:, column] = q
222
-
223
- return quantiles
224
-
225
-
226
- def fill(features, mask, value=np.nan):
227
- n_mols = len(mask)
228
- n_features = features.shape[1]
229
-
230
- data = np.zeros(shape=(n_mols, n_features))
231
- data.fill(value)
232
- data[~mask] = features
233
- return data
234
-
235
-
236
- def normalize_features(
237
- raw_features,
238
- scaler=None,
239
- save_scaler_path: str = "",
240
- verbose=True,
241
- ):
242
- if scaler is None:
243
- scaler = StandardScaler()
244
- scaler.fit(raw_features)
245
- if verbose:
246
- print("Fitted the StandardScaler")
247
- if save_scaler_path:
248
- write_pickle(save_scaler_path, scaler)
249
- if verbose:
250
- print(f"Saved the StandardScaler under {save_scaler_path}")
251
-
252
- # Normalize feature vectors
253
- normalized_features = scaler.transform(raw_features)
254
- if verbose:
255
- print("Normalized molecule features")
256
- return normalized_features, scaler
257
-
258
-
259
- def create_descriptors(
260
- smiles,
261
- ecdfs=None,
262
- scaler=None,
263
- descriptors: Iterable = KNOWN_DESCR,
264
- ):
265
- # Create cleaned rdkit mol objects
266
- mols, clean_mol_mask = create_cleaned_mol_objects(smiles)
267
- print("Cleaned molecules")
268
-
269
- features = []
270
- if "ecfps" in descriptors:
271
- # Create fingerprints and descriptors
272
- ecfps = create_ecfp_fps(mols)
273
- # expand using mol_mask
274
- ecfps = fill(ecfps, ~clean_mol_mask)
275
- features.append(ecfps)
276
- print("Created ECFP fingerprints")
277
-
278
- if "rdkit_descr_quantiles" in descriptors:
279
- rdkit_descrs = create_rdkit_descriptors(mols)
280
- print("Created RDKit descriptors")
281
-
282
- # Create and save ecdfs
283
- if ecdfs is None:
284
- print("Create ECDFs")
285
- ecdfs = []
286
- for column in range(rdkit_descrs.shape[1]):
287
- raw_values = rdkit_descrs[:, column].reshape(-1)
288
- ecdfs.append(ECDF(raw_values))
289
-
290
- # Create quantiles
291
- rdkit_descr_quantiles = create_quantiles(rdkit_descrs, ecdfs)
292
- # expand using mol_mask
293
- rdkit_descr_quantiles = fill(rdkit_descr_quantiles, ~clean_mol_mask)
294
- features.append(rdkit_descr_quantiles)
295
- print("Created quantiles of RDKit descriptors")
296
-
297
- if "maccs" in descriptors:
298
- maccs = create_maccs_keys(mols)
299
- maccs = fill(maccs, ~clean_mol_mask)
300
- features.append(maccs)
301
- print("Created MACCS keys")
302
-
303
- if "tox" in descriptors:
304
- tox_patterns = get_tox_patterns("assets/tox_smarts.json")
305
- tox = create_tox_features(mols, tox_patterns)
306
- tox = fill(tox, ~clean_mol_mask)
307
- features.append(tox)
308
- print("Created Tox features")
309
-
310
- # concatenate features
311
- raw_features = np.concatenate(features, axis=1)
312
-
313
- # normalize with scaler if scaler is passed, else create scaler
314
- features, _ = normalize_features(
315
- raw_features,
316
- scaler=scaler,
317
- verbose=True,
318
- )
319
-
320
- return features, clean_mol_mask
321
-
322
-
323
- def main(args):
324
- splits = ["train", "validation"]
325
- ds = load_dataset("tschouis/tox21", token=HF_TOKEN)
326
-
327
- for split in splits:
328
-
329
- print(f"Preprocess {split} molecules")
330
- smiles = list(ds[split]["smiles"])
331
-
332
- # Create cleaned rdkit mol objects
333
- mols, clean_mol_mask = create_cleaned_mol_objects(smiles)
334
- print("Cleaned molecules")
335
-
336
- tox_patterns = get_tox_patterns(args.tox_smarts_filepath)
337
-
338
- # Create fingerprints and descriptors
339
- ecfps = create_ecfp_fps(mols)
340
- # expand using mol_mask
341
- ecfps = fill(ecfps, ~clean_mol_mask)
342
- print("Created ECFP fingerprints")
343
-
344
- rdkit_descrs = create_rdkit_descriptors(mols)
345
- print("Created RDKit descriptors")
346
-
347
- # Create and save ecdfs
348
- if split == "train":
349
- print("Create ECDFs")
350
- ecdfs = []
351
- for column in range(rdkit_descrs.shape[1]):
352
- raw_values = rdkit_descrs[:, column].reshape(-1)
353
- ecdfs.append(ECDF(raw_values))
354
-
355
- write_pickle(args.path_ecdfs, ecdfs)
356
- print(f"Saved ECDFs under {args.path_ecdfs}")
357
- else:
358
- print(f"Load ECDFs from {args.path_ecdfs}")
359
- ecdfs = load_pickle(args.path_ecdfs)
360
-
361
- # Create quantiles
362
- rdkit_descr_quantiles = create_quantiles(rdkit_descrs, ecdfs)
363
- # expand using mol_mask
364
- rdkit_descr_quantiles = fill(rdkit_descr_quantiles, ~clean_mol_mask)
365
- print("Created quantiles of RDKit descriptors")
366
-
367
- maccs = create_maccs_keys(mols)
368
- maccs = fill(maccs, ~clean_mol_mask)
369
- print("Created MACCS keys")
370
-
371
- tox = create_tox_features(mols, tox_patterns)
372
- tox = fill(tox, ~clean_mol_mask)
373
- print("Created Tox features")
374
-
375
- labels = []
376
- for task in TASKS:
377
- datasplit = ds[split].to_pandas() if args.use_hf else ds[split]
378
- labels.append(datasplit[task].to_numpy())
379
- labels = np.stack(labels, axis=1)
380
-
381
- save_path = os.path.join(args.save_folder, f"tox21_{split}.npz")
382
- with open(save_path, "wb") as f:
383
- np.savez(
384
- f,
385
- labels=labels,
386
- ecfps=ecfps,
387
- rdkit_descr_quantiles=rdkit_descr_quantiles,
388
- maccs=maccs,
389
- tox=tox,
390
- )
391
- print(f"Saved preprocessed {split} split under {save_path}")
392
-
393
- print("Preprocessing finished successfully")
394
-
395
-
396
- if __name__ == "__main__":
397
- args = parser.parse_args()
398
-
399
- if not os.path.exists(args.save_folder):
400
- os.makedirs(args.save_folder)
401
-
402
- if not os.path.exists(os.path.dirname(args.path_ecdfs)):
403
- os.makedirs(os.path.dirname(args.path_ecdfs))
404
-
405
- main(args)
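A standalone illustration of the ECDF quantile transform used above for the RDKit descriptors (toy numbers, not taken from the dataset):

```python
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

# One descriptor column of a (toy) training split.
train_values = np.array([1.0, 2.0, 2.0, 5.0, 10.0])
ecdf = ECDF(train_values)          # fitted on the training split only

# At inference time the same ECDF maps raw descriptor values to quantiles in [0, 1].
new_values = np.array([0.5, 2.0, 7.0])
print(ecdf(new_values))            # -> [0.  0.6 0.8]
```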
 
 
src/push_assets.py DELETED
@@ -1,12 +0,0 @@
1
- from huggingface_hub import HfApi
2
- from .utils import HF_TOKEN
3
-
4
- api = HfApi()
5
-
6
- api.upload_folder(
7
- folder_path="assets/",
8
- path_in_repo="assets",
9
- repo_id="tschouis/tox21_xgboost_classifier",
10
- repo_type="space",
11
- token=HF_TOKEN,
12
- )
 
 
src/train.py DELETED
@@ -1,199 +0,0 @@
1
- """
2
- Script for fitting and saving any preprocessing assets, as well as the fitted XGBoost model
3
- """
4
-
5
- import os
6
- import argparse
7
-
8
- import numpy as np
9
-
10
- from tabulate import tabulate
11
- from sklearn.metrics import roc_auc_score
12
-
13
- from .data import get_descriptor_dataset
14
- from .model import Tox21XGBClassifier
15
-
16
- SEED = 42
17
- DATA_FOLDER = "data/"
18
-
19
- parser = argparse.ArgumentParser(description="XGBoost Trainig script for Tox21 dataset")
20
-
21
- parser.add_argument(
22
- "--save_path_model",
23
- type=str,
24
- default="assets/xgb_alltasks.joblib",
25
- )
26
-
27
- parser.add_argument(
28
- "--path_ecdfs",
29
- type=str,
30
- default="assets/ecdfs.pkl",
31
- )
32
-
33
- parser.add_argument(
34
- "--path_scaler",
35
- type=str,
36
- default="assets/scaler.pkl",
37
- )
38
-
39
-
40
- def main(args):
41
- print("Preprocess train molecules")
42
- # load datasets
43
- train_X, train_y, scaler = get_descriptor_dataset(
44
- os.path.join(DATA_FOLDER, "tox21_train.npz"),
45
- descriptors="all",
46
- save_scaler_path="data/scaler.pkl",
47
- )
48
- val_X, val_y, _ = get_descriptor_dataset(
49
- os.path.join(DATA_FOLDER, "tox21_validation.npz"),
50
- descriptors="all",
51
- scaler=scaler,
52
- )
53
-
54
- task_config = {
55
- "NR-AR": {
56
- "colsample_bytree": 0.5,
57
- "learning_rate": 0.05,
58
- "max_depth": 12,
59
- "min_child_weight": 2,
60
- "n_estimators": 1000,
61
- "scale_pos_weight": 80,
62
- "subsample": 0.4,
63
- },
64
- "NR-AR-LBD": {
65
- "colsample_bytree": 0.8,
66
- "learning_rate": 0.04,
67
- "max_depth": 10,
68
- "min_child_weight": 8,
69
- "n_estimators": 1000,
70
- "scale_pos_weight": 10,
71
- "subsample": 0.4,
72
- },
73
- "NR-AhR": {
74
- "colsample_bytree": 0.8,
75
- "learning_rate": 0.05,
76
- "max_depth": 16,
77
- "min_child_weight": 2,
78
- "n_estimators": 1000,
79
- "scale_pos_weight": 80,
80
- "subsample": 1,
81
- },
82
- "NR-Aromatase": {
83
- "colsample_bytree": 0.7,
84
- "learning_rate": 0.05,
85
- "max_depth": 16,
86
- "min_child_weight": 1,
87
- "n_estimators": 1000,
88
- "scale_pos_weight": 50,
89
- "subsample": 0.7,
90
- },
91
- "NR-ER": {
92
- "colsample_bytree": 0.7,
93
- "learning_rate": 0.05,
94
- "max_depth": 10,
95
- "min_child_weight": 4,
96
- "n_estimators": 1000,
97
- "scale_pos_weight": 25,
98
- "subsample": 0.4,
99
- },
100
- "NR-ER-LBD": {
101
- "colsample_bytree": 0.7,
102
- "learning_rate": 0.05,
103
- "max_depth": 16,
104
- "min_child_weight": 4,
105
- "n_estimators": 1000,
106
- "scale_pos_weight": 10,
107
- "subsample": 0.4,
108
- },
109
- "NR-PPAR-gamma": {
110
- "colsample_bytree": 0.8,
111
- "learning_rate": 0.01,
112
- "max_depth": 12,
113
- "min_child_weight": 2,
114
- "n_estimators": 1000,
115
- "scale_pos_weight": 80,
116
- "subsample": 0.4,
117
- },
118
- "SR-ARE": {
119
- "colsample_bytree": 0.7,
120
- "learning_rate": 0.05,
121
- "max_depth": 16,
122
- "min_child_weight": 8,
123
- "n_estimators": 1000,
124
- "scale_pos_weight": 10,
125
- "subsample": 0.7,
126
- },
127
- "SR-ATAD5": {
128
- "colsample_bytree": 0.5,
129
- "learning_rate": 0.02,
130
- "max_depth": 12,
131
- "min_child_weight": 8,
132
- "n_estimators": 1000,
133
- "scale_pos_weight": 10,
134
- "subsample": 0.4,
135
- },
136
- "SR-HSE": {
137
- "colsample_bytree": 0.8,
138
- "learning_rate": 0.02,
139
- "max_depth": 6,
140
- "min_child_weight": 1,
141
- "n_estimators": 1000,
142
- "scale_pos_weight": 25,
143
- "subsample": 1,
144
- },
145
- "SR-MMP": {
146
- "colsample_bytree": 0.5,
147
- "learning_rate": 0.02,
148
- "max_depth": 16,
149
- "min_child_weight": 2,
150
- "n_estimators": 1000,
151
- "scale_pos_weight": 10,
152
- "subsample": 0.7,
153
- },
154
- "SR-p53": {
155
- "colsample_bytree": 0.5,
156
- "learning_rate": 0.02,
157
- "max_depth": 12,
158
- "min_child_weight": 8,
159
- "n_estimators": 1000,
160
- "scale_pos_weight": 10,
161
- "subsample": 0.4,
162
- },
163
- }
164
-
165
- model = Tox21XGBClassifier(seed=42, task_config=task_config)
166
- print("Start training.")
167
- for i, task in enumerate(model.tasks):
168
- task_labels = train_y[:, i]
169
- label_mask = ~np.isnan(task_labels)
170
-
171
- task_data = train_X[label_mask]
172
- task_labels = task_labels[label_mask].astype(int)
173
-
174
- print(f"Fit task {task} using {sum(label_mask)} samples")
175
- model.fit(task, task_data, task_labels)
176
-
177
- print(f"Save model under {args.save_path_model}")
178
- model.save_model(args.save_path_model)
179
-
180
- print("Evaluate model")
181
- results = {}
182
- for i, task in enumerate(model.tasks):
183
- task_labels = val_y[:, i]
184
- label_mask = ~np.isnan(task_labels)
185
-
186
- task_data = val_X[label_mask]
187
- task_labels = task_labels[label_mask].astype(int)
188
-
189
- pred = model.predict(task, task_data)
190
- results[task] = [roc_auc_score(y_true=task_labels, y_score=pred)]
191
-
192
- print("Results:")
193
- print(tabulate(results, headers="keys"))
194
- print("Average: ", sum([val[0] for val in results.values()]) / len(results))
195
-
196
-
197
- if __name__ == "__main__":
198
- args = parser.parse_args()
199
- main(args)
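The training loop above drops molecules without a measurement for the current task before fitting. A toy illustration of that masking (label values invented for the example):

```python
import numpy as np

# rows = molecules, columns = tasks; NaN marks a missing measurement
train_y = np.array([
    [1.0, np.nan],
    [0.0, 1.0],
    [np.nan, 0.0],
])

task_labels = train_y[:, 0]                    # labels for the first task
label_mask = ~np.isnan(task_labels)            # keep only measured molecules
print(label_mask.sum(), task_labels[label_mask].astype(int))  # -> 2 [1 0]
```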
 
 
src/utils.py DELETED
@@ -1,443 +0,0 @@
1
- ## These MolStandardizer classes are due to Paolo Tosco
2
- ## It was taken from the FS-Mol github
3
- ## (https://github.com/microsoft/FS-Mol/blob/main/fs_mol/preprocessing/utils/
4
- ## standardizer.py)
5
- ## They ensure that a sequence of standardization operations are applied
6
- ## https://gist.github.com/ptosco/7e6b9ab9cc3e44ba0919060beaed198e
7
-
8
- import os
9
- import pickle
10
-
11
- from rdkit import Chem
12
- from rdkit.Chem.MolStandardize import rdMolStandardize
13
-
14
- HF_TOKEN = os.environ.get("HF_TOKEN")
15
-
16
- TASKS = [
17
- "NR-AR",
18
- "NR-AR-LBD",
19
- "NR-AhR",
20
- "NR-Aromatase",
21
- "NR-ER",
22
- "NR-ER-LBD",
23
- "NR-PPAR-gamma",
24
- "SR-ARE",
25
- "SR-ATAD5",
26
- "SR-HSE",
27
- "SR-MMP",
28
- "SR-p53",
29
- ]
30
-
31
- KNOWN_DESCR = ["ecfps", "rdkit_descr_quantiles", "maccs", "tox"]
32
-
33
- USED_200_DESCR = [
34
- 0,
35
- 1,
36
- 2,
37
- 3,
38
- 4,
39
- 5,
40
- 6,
41
- 7,
42
- 8,
43
- 9,
44
- 10,
45
- 11,
46
- 12,
47
- 13,
48
- 14,
49
- 15,
50
- 16,
51
- 25,
52
- 26,
53
- 27,
54
- 28,
55
- 29,
56
- 30,
57
- 31,
58
- 32,
59
- 33,
60
- 34,
61
- 35,
62
- 36,
63
- 37,
64
- 38,
65
- 39,
66
- 40,
67
- 41,
68
- 42,
69
- 43,
70
- 44,
71
- 45,
72
- 46,
73
- 47,
74
- 48,
75
- 49,
76
- 50,
77
- 51,
78
- 52,
79
- 53,
80
- 54,
81
- 55,
82
- 56,
83
- 57,
84
- 58,
85
- 59,
86
- 60,
87
- 61,
88
- 62,
89
- 63,
90
- 64,
91
- 65,
92
- 66,
93
- 67,
94
- 68,
95
- 69,
96
- 70,
97
- 71,
98
- 72,
99
- 73,
100
- 74,
101
- 75,
102
- 76,
103
- 77,
104
- 78,
105
- 79,
106
- 80,
107
- 81,
108
- 82,
109
- 83,
110
- 84,
111
- 85,
112
- 86,
113
- 87,
114
- 88,
115
- 89,
116
- 90,
117
- 91,
118
- 92,
119
- 93,
120
- 94,
121
- 95,
122
- 96,
123
- 97,
124
- 98,
125
- 99,
126
- 100,
127
- 101,
128
- 102,
129
- 103,
130
- 104,
131
- 105,
132
- 106,
133
- 107,
134
- 108,
135
- 109,
136
- 110,
137
- 111,
138
- 112,
139
- 113,
140
- 114,
141
- 115,
142
- 116,
143
- 117,
144
- 118,
145
- 119,
146
- 120,
147
- 121,
148
- 122,
149
- 123,
150
- 124,
151
- 125,
152
- 126,
153
- 127,
154
- 128,
155
- 129,
156
- 130,
157
- 131,
158
- 132,
159
- 133,
160
- 134,
161
- 135,
162
- 136,
163
- 137,
164
- 138,
165
- 139,
166
- 140,
167
- 141,
168
- 142,
169
- 143,
170
- 144,
171
- 145,
172
- 146,
173
- 147,
174
- 148,
175
- 149,
176
- 150,
177
- 151,
178
- 152,
179
- 153,
180
- 154,
181
- 155,
182
- 156,
183
- 157,
184
- 158,
185
- 159,
186
- 160,
187
- 161,
188
- 162,
189
- 163,
190
- 164,
191
- 165,
192
- 166,
193
- 167,
194
- 168,
195
- 169,
196
- 170,
197
- 171,
198
- 172,
199
- 173,
200
- 174,
201
- 175,
202
- 176,
203
- 177,
204
- 178,
205
- 179,
206
- 180,
207
- 181,
208
- 182,
209
- 183,
210
- 184,
211
- 185,
212
- 186,
213
- 187,
214
- 188,
215
- 189,
216
- 190,
217
- 191,
218
- 192,
219
- 193,
220
- 194,
221
- 195,
222
- 196,
223
- 197,
224
- 198,
225
- 199,
226
- 200,
227
- 201,
228
- 202,
229
- 203,
230
- 204,
231
- 205,
232
- 206,
233
- 207,
234
- ]
235
-
236
-
237
- class Standardizer:
238
- """
239
- Simple wrapper class around rdkit Standardizer.
240
- """
241
-
242
- DEFAULT_CANON_TAUT = False
243
- DEFAULT_METAL_DISCONNECT = False
244
- MAX_TAUTOMERS = 100
245
- MAX_TRANSFORMS = 100
246
- MAX_RESTARTS = 200
247
- PREFER_ORGANIC = True
248
-
249
- def __init__(
250
- self,
251
- metal_disconnect=None,
252
- canon_taut=None,
253
- ):
254
- """
255
- Constructor.
256
- All parameters are optional.
257
- :param metal_disconnect: if True, metallorganic complexes are
258
- disconnected
259
- :param canon_taut: if True, molecules are converted to their
260
- canonical tautomer
261
- """
262
- super().__init__()
263
- if metal_disconnect is None:
264
- metal_disconnect = self.DEFAULT_METAL_DISCONNECT
265
- if canon_taut is None:
266
- canon_taut = self.DEFAULT_CANON_TAUT
267
- self._canon_taut = canon_taut
268
- self._metal_disconnect = metal_disconnect
269
- self._taut_enumerator = None
270
- self._uncharger = None
271
- self._lfrag_chooser = None
272
- self._metal_disconnector = None
273
- self._normalizer = None
274
- self._reionizer = None
275
- self._params = None
276
-
277
- @property
278
- def params(self):
279
- """Return the MolStandardize CleanupParameters."""
280
- if self._params is None:
281
- self._params = rdMolStandardize.CleanupParameters()
282
- self._params.maxTautomers = self.MAX_TAUTOMERS
283
- self._params.maxTransforms = self.MAX_TRANSFORMS
284
- self._params.maxRestarts = self.MAX_RESTARTS
285
- self._params.preferOrganic = self.PREFER_ORGANIC
286
- self._params.tautomerRemoveSp3Stereo = False
287
- return self._params
288
-
289
- @property
290
- def canon_taut(self):
291
- """Return whether tautomer canonicalization will be done."""
292
- return self._canon_taut
293
-
294
- @property
295
- def metal_disconnect(self):
296
- """Return whether metallorganic complexes will be disconnected."""
297
- return self._metal_disconnect
298
-
299
- @property
300
- def taut_enumerator(self):
301
- """Return the TautomerEnumerator object."""
302
- if self._taut_enumerator is None:
303
- self._taut_enumerator = rdMolStandardize.TautomerEnumerator(self.params)
304
- return self._taut_enumerator
305
-
306
- @property
307
- def uncharger(self):
308
- """Return the Uncharger object."""
309
- if self._uncharger is None:
310
- self._uncharger = rdMolStandardize.Uncharger()
311
- return self._uncharger
312
-
313
- @property
314
- def lfrag_chooser(self):
315
- """Return the LargestFragmentChooser object."""
316
- if self._lfrag_chooser is None:
317
- self._lfrag_chooser = rdMolStandardize.LargestFragmentChooser(
318
- self.params.preferOrganic
319
- )
320
- return self._lfrag_chooser
321
-
322
- @property
323
- def metal_disconnector(self):
324
- """Return the MetalDisconnector object."""
325
- if self._metal_disconnector is None:
326
- self._metal_disconnector = rdMolStandardize.MetalDisconnector()
327
- return self._metal_disconnector
328
-
329
- @property
330
- def normalizer(self):
331
- """Return the Normalizer object."""
332
- if self._normalizer is None:
333
- self._normalizer = rdMolStandardize.Normalizer(
334
- self.params.normalizationsFile, self.params.maxRestarts
335
- )
336
- return self._normalizer
337
-
338
- @property
339
- def reionizer(self):
340
- """Return the Reionizer object."""
341
- if self._reionizer is None:
342
- self._reionizer = rdMolStandardize.Reionizer(self.params.acidbaseFile)
343
- return self._reionizer
344
-
345
- def charge_parent(self, mol_in):
346
- """Sequentially apply a series of MolStandardize operations:
347
- * MetalDisconnector
348
- * Normalizer
349
- * Reionizer
350
- * LargestFragmentChooser
351
- * Uncharger
352
- The net result is that a desalted, normalized, neutral
353
- molecule with implicit Hs is returned.
354
- """
355
- params = Chem.RemoveHsParameters()
356
- params.removeAndTrackIsotopes = True
357
- mol_in = Chem.RemoveHs(mol_in, params, sanitize=False)
358
- if self._metal_disconnect:
359
- mol_in = self.metal_disconnector.Disconnect(mol_in)
360
- normalized = self.normalizer.normalize(mol_in)
361
- Chem.SanitizeMol(normalized)
362
- normalized = self.reionizer.reionize(normalized)
363
- Chem.AssignStereochemistry(normalized)
364
- normalized = self.lfrag_chooser.choose(normalized)
365
- normalized = self.uncharger.uncharge(normalized)
366
- # need this to reassess aromaticity on things like
367
- # cyclopentadienyl, tropylium, azolium, etc.
368
- Chem.SanitizeMol(normalized)
369
- return Chem.RemoveHs(Chem.AddHs(normalized))
370
-
371
- def standardize_mol(self, mol_in):
372
- """
373
- Standardize a single molecule.
374
- :param mol_in: a Chem.Mol
375
- :return: * (standardized Chem.Mol, n_taut) tuple
376
- if success. n_taut will be negative if
377
- tautomer enumeration was aborted due
378
- to reaching a limit
379
- * (None, error_msg) if failure
380
- This calls self.charge_parent() and, if self._canon_taut
381
- is True, runs tautomer canonicalization.
382
- """
383
- n_tautomers = 0
384
- if isinstance(mol_in, Chem.Mol):
385
- name = None
386
- try:
387
- name = mol_in.GetProp("_Name")
388
- except KeyError:
389
- pass
390
- if not name:
391
- name = "NONAME"
392
- else:
393
- error = f"Expected SMILES or Chem.Mol as input, got {str(type(mol_in))}"
394
- return None, error
395
- try:
396
- mol_out = self.charge_parent(mol_in)
397
- except Exception as e:
398
- error = f"charge_parent FAILED: {str(e).strip()}"
399
- return None, error
400
- if self._canon_taut:
401
- try:
402
- res = self.taut_enumerator.Enumerate(mol_out, False)
403
- except TypeError:
404
- # we are still on the pre-2021 RDKit API
405
- res = self.taut_enumerator.Enumerate(mol_out)
406
- except Exception as e:
407
- # something else went wrong
408
- error = f"canon_taut FAILED: {str(e).strip()}"
409
- return None, error
410
- n_tautomers = len(res)
411
- if hasattr(res, "status"):
412
- completed = (
413
- res.status == rdMolStandardize.TautomerEnumeratorStatus.Completed
414
- )
415
- else:
416
- # we are still on the pre-2021 RDKit API
417
- completed = len(res) < 1000
418
- if not completed:
419
- n_tautomers = -n_tautomers
420
- try:
421
- mol_out = self.taut_enumerator.PickCanonical(res)
422
- except AttributeError:
423
- # we are still on the pre-2021 RDKit API
424
- mol_out = max(
425
- [(self.taut_enumerator.ScoreTautomer(m), m) for m in res]
426
- )[1]
427
- except Exception as e:
428
- # something else went wrong
429
- error = f"canon_taut FAILED: {str(e).strip()}"
430
- return None, error
431
- mol_out.SetProp("_Name", name)
432
- return mol_out, n_tautomers
433
-
434
-
435
- def load_pickle(path: str):
436
- with open(path, "rb") as file:
437
- content = pickle.load(file)
438
- return content
439
-
440
-
441
- def write_pickle(path: str, obj: object):
442
- with open(path, "wb") as file:
443
- pickle.dump(obj, file)
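A small sketch of how the `Standardizer` above is used on a single molecule (this mirrors `create_cleaned_mol_objects` in `src/preprocess.py`; the SMILES is an arbitrary example):

```python
from rdkit import Chem
from src.utils import Standardizer

sm = Standardizer(canon_taut=True)

mol = Chem.MolFromSmiles("CC(=O)[O-].[Na+]")   # sodium acetate, arbitrary example
std_mol, info = sm.standardize_mol(mol)        # (mol, n_tautomers) on success, (None, error message) on failure

if std_mol is not None:
    print(Chem.MolToSmiles(std_mol))           # desalted, neutralized, canonical tautomer
else:
    print("standardization failed:", info)
```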