# README from review project folder

**Project Title**: Exo-Tox predictor review items
**Creation Date**: June 30th, 2025
**Authors**: Tanja Krueger, Luisa F. Jimenez-Soto


# General information for the repository: Exo-Tox predictor - Review process data

## Project

This project contains all files related to the creation of the Exo-Tox predictor, including a guide on how to use it and the code used for its creation and for the data analysis.

### Core Team

Tanja Krüger, ORCID 0009-0007-4575-4534
Damla Ayse Durmaz, ORCID 0009-0006-8693-1616
Luisa Fernanda Jimenez Soto, ORCID 0000-0001-8551-5019

**Institution**: Walther-Straub Institute of Pharmacology and Toxicology
**University**: Ludwig-Maximilians University (LMU) - Munich, Germany

#### Responsibilities

- The creation of the predictor, curation of bacterial control data, and testing on all sets were performed by Tanja Krüger.
- The bacteriophage dataset was curated by Damla Durmaz.
- The data analysis for the hits was done by Luisa F. Jimenez-Soto. 
- The "testing_the_first_predictor_on_data" demo was created by Tanja Krüger and Luisa F. Jimenez-Soto.


### Contact

Tanja Krüger: tanja.krueger@lrz.uni-muenchen.de
Luisa F. Jimenez-Soto: l.jimenez@lmu.de


# Overview of the full project folder
This repository contains four main folders, `Data`, `Code`, `Predictor` and `Figures`, and one file, `overview.md`, which describes the pipelines used for the analyses included in the manuscript as part of the review process.

## Folder Structure

### 1. Data
Contains all data used in the project during the revision process. For the original data, see [Exo-Tox repository](https://doi.org/10.5282/ubm/data.576).

#### 1.1 `Data/raw`
**Description**: Contains the raw data used for the revision process.

##### 1.1.1 `all_folds`
**Description**: Folds of the exotoxins and control proteins, used for the Foldseek similarity baseline.

##### 1.1.2 `raw_from_multitox`
**Description**: Output returned by MultiToxPred 1 for our test set, used as a second benchmark. Part of the review process.

##### 1.1.3 `exotoxins`
**Description**: Folds of the exotoxin proteins in the `rank_1` folder, plus metadata (see `Data/raw/README.md` for more details).

##### 1.1.4 `secreted`
**Description**: Folds of the control proteins in the `rank_1` folder, plus metadata (see `Data/raw/README.md` for more details).

#### 1.2 `Data/derived`
**Description**: Contains the derived data used for the revision process.

##### 1.2.1 `foldseek_baseline`
**Description**: Data involved in generating the Foldseek results.

- `foldseek_results_reasonable.tsv`: Contains all hits found by Foldseek between query (test data) and target (training data).
- `foldseek_top1_per_query_foldseek_results_reasonable.tsv`: Contains the best hit for each query, using `prob` as the evaluation metric.
- All files beginning with `train_fodls_BB...`: Created by running the Foldseek database creation command on the training folds.
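
The step from the full hit table to the top-1 table can be illustrated with a minimal sketch. It assumes a tab-separated table with a header row and the columns `query`, `target` and `prob`; these column names and the example rows are assumptions for illustration, not the confirmed layout of the files in this folder or the logic of the project's own script.

```python
import csv
import io


def top_hit_per_query(rows, score_column="prob"):
    """Return the highest-scoring row for each query.

    `rows` is an iterable of dicts with at least the keys `query`
    and `score_column`. The column names are assumptions for this
    sketch, not the confirmed layout of the repository files.
    """
    best = {}
    for row in rows:
        score = float(row[score_column])
        current = best.get(row["query"])
        # Keep the first row seen when scores tie.
        if current is None or score > float(current[score_column]):
            best[row["query"]] = row
    return best


# Tiny in-memory table standing in for a Foldseek result file.
example_tsv = (
    "query\ttarget\tprob\n"
    "test_A\ttrain_1\t0.40\n"
    "test_A\ttrain_2\t0.95\n"
    "test_B\ttrain_3\t0.10\n"
)
reader = csv.DictReader(io.StringIO(example_tsv), delimiter="\t")
top_hits = top_hit_per_query(reader)
for query, row in top_hits.items():
    print(query, row["target"], row["prob"])
```

To apply the same idea to a real file, the `io.StringIO` object would be replaced by an opened file handle, and the column names adjusted to the actual header.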

##### 1.2.2 `test_dir_folds`
**Description**: Folds of the test data.

##### 1.2.3 `train_dir_folds`
**Description**: Folds of the training data.

##### 1.2.4 `train_val_test`
**Description**: Training and test datasets for the predictors, taken from the main project prior to revision.
 All sets are redundancy-reduced to 30% sequence similarity (SST30).

##### 1.2.5 `multi_tox_baseline`
**Description**: Data involved in running the test data through the prediction tool MultiToxPred 1.

###### 1.2.5.1 `derived_multitox`
Results derived from the MultiToxPred 1 output:

- `classification_log.txt` : Metrics calculated from the MultiToxPred results
- `classified_hits.tsv`: MultiToxPred 1 hits separated into True Positives, True Negatives, False Positives and False Negatives
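
The separation into the four categories can be sketched as follows. This is an illustrative example only: the boolean label encoding (`True` meaning "toxin") and the example pairs are assumptions, and the function `classify_hit` is a hypothetical helper, not taken from the project's notebook.

```python
from collections import Counter


def classify_hit(true_label, predicted_label):
    """Assign one prediction to TP, TN, FP or FN.

    Labels are booleans: True means "toxin". This encoding is an
    assumption made for the sketch.
    """
    if true_label and predicted_label:
        return "TP"
    if not true_label and not predicted_label:
        return "TN"
    if predicted_label:
        return "FP"
    return "FN"


# Hypothetical (true label, predicted label) pairs.
pairs = [(True, True), (True, False), (False, False), (False, True), (True, True)]
counts = Counter(classify_hit(t, p) for t, p in pairs)
print(dict(counts))
```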

###### 1.2.5.2 `passed_to_multitox`
The test data was split into four parts to allow for batch processing by MultiToxPred 1, which accepts up to 100 sequences at once.

- `X_test_SST30_part1.fasta` : part one of the test sequences
- `X_test_SST30_part2.fasta` : part two of the test sequences
- `X_test_SST30_part3.fasta` : part three of the test sequences
- `X_test_SST30_part4.fasta` : part four of the test sequences
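
A batch split of this kind could look like the sketch below. It is not the script used to produce the files above; the helper names and the made-up sequences are assumptions for illustration, and only the limit of 100 sequences per batch comes from the description.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_batches(records, batch_size=100):
    """Split records into consecutive batches of at most batch_size."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def to_fasta(records):
    """Format (header, sequence) tuples back into FASTA text."""
    return "".join(f">{header}\n{sequence}\n" for header, sequence in records)


# 350 made-up sequences stand in for the real test set.
fake_fasta = "".join(f">seq{i}\nMKT\n" for i in range(350))
batches = split_into_batches(read_fasta(fake_fasta), batch_size=100)
print([len(batch) for batch in batches])  # → [100, 100, 100, 50]
```

Each batch can then be written to its own file with `to_fasta(batch)`.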

###### 1.2.5.3 `raw_from_multitox`
Raw results received from MultiToxPred 1.


##### 1.2.6 `pseudoAAC`
**Description**: Data involved in training and testing a predictor based on the APAAC protein representation.

- `new_parameters`: Hyperparameters needed to generate the APAAC representation.
- `sklearn_30_balanced_pred_log.log`: Log file with the predictor results.
- `X_test_SST30_APAAC.csv`: APAAC protein representation of the test set, generated with https://github.com/Superzchen/iFeatureOmega-CLI/.
- `X_train_SST30_APAAC.csv`: APAAC protein representation of the training set, generated with https://github.com/Superzchen/iFeatureOmega-CLI/.

### 2. Code

**Description**: Contains all code that was involved in the paper revision. For the original code for Exo-Tox, see [Exo-Tox repository](https://doi.org/10.5282/ubm/data.576).

#### 2.1 `Code/Python/`

##### Foldseek pipeline:

- `check_pdb_consistency.py`: Checks whether the folds were split correctly.
- `evaluate_foldseek_hits.py`: Finds Foldseek hits between each query and target.
- `extract_foldseek_top_hits.py`: Extracts the best hit for each query using the Foldseek parameter `prob`.
- `folds_into_test_and_train.py`: Splits the folds into training and testing folds.
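
As an illustration of what a split consistency check can involve, the sketch below reports identifiers that ended up in both sets or in neither. It is a hypothetical example and does not reproduce `check_pdb_consistency.py`; the function name and the protein identifiers are assumptions.

```python
def check_split(train_ids, test_ids, all_ids):
    """Report problems in a train/test split of structure files.

    Returns a dict with IDs that appear in both sets and IDs that
    appear in neither. Both lists are empty for a clean split.
    """
    train, test, everything = set(train_ids), set(test_ids), set(all_ids)
    return {
        "in_both": sorted(train & test),
        "unassigned": sorted(everything - train - test),
    }


# Hypothetical protein identifiers.
report = check_split(
    train_ids=["P1", "P2", "P3"],
    test_ids=["P3", "P4"],
    all_ids=["P1", "P2", "P3", "P4", "P5"],
)
print(report)  # → {'in_both': ['P3'], 'unassigned': ['P5']}
```

In practice the identifier lists could be collected from the file names in each fold directory, for example with `pathlib.Path.glob`.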

##### MultiToxPred 1 pipeline:

- `mulittoxpred1.ipynb`: Evaluates the results of the MultiToxPred predictor

##### APAAC pipeline:

- `APACC_predictor_first_part.py`: Trains a model using the APAAC protein representation as input.
- `support_functions_splitting_predictor.py`: Supporting functions.

### 3. Predictor
**Description**: Contains all predictors trained on the APAAC protein representation. For the full Exo-Tox predictor and a guide on how to use it, please go to the [Exo-Tox repository](https://doi.org/10.5282/ubm/data.576).

- `SST`: Sequence Similarity Threshold.
- `Model Architecture`: Random Forest, XGBoost, SVC, Logistic Regression, k-Nearest Neighbors.
- `Performance Metrics`: MCC, Precision-Recall, ROC-AUC, etc. 
- `CV`: Cross-validation split parameter.
- `APAAC`: Protein representation chosen as input (Amphiphilic Pseudo-Amino Acid Composition).
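
For readers unfamiliar with MCC (Matthews correlation coefficient), the sketch below computes it from confusion-matrix counts using the standard formula. The function name `mcc_from_counts` and the counts are made up for illustration; they are not results or code from this project.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts, not results from this project.
print(mcc_from_counts(tp=40, tn=45, fp=5, fn=10))
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which makes it informative when classes are imbalanced.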

### 4. Figures
**Description**: Contains all figures from testing the APAAC-trained predictor on the test set.