# Exo-Tox predictor

## Project

This project contains all files related to the creation of the Exo-Tox predictor, including a guide on how to use it and the code used for its creation and for the analysis of the data.

### Core Team

- Tanja Krüger, ORCID 0009-0007-4575-4534
- Damla Ayse Durmaz, ORCID 0009-0006-8693-1616
- Luisa Fernanda Jimenez Soto, ORCID 0000-0001-8551-5019

**Institution**: Walther-Straub Institute of Pharmacology and Toxicology
**University**: Ludwig-Maximilians University (LMU) - Munich, Germany

#### Responsibilities

- The creation of the predictor, curation of bacterial control data, and testing on all sets were performed by Tanja Krüger.
- The bacteriophage dataset was curated by Damla Durmaz.
- The data analysis for the hits was done by Luisa F. Jimenez-Soto. 
- The "testing_the_first_predictor_on_data" demo was created by Tanja Krüger and Luisa F. Jimenez-Soto.


### Contact

- Tanja Krüger: tanja.krueger@lrz.uni-muenchen.de
- Luisa F. Jimenez-Soto: l.jimenez@lmu.de

## Overview of the full project folder

This repository contains five main folders: `Data`, `Code`, `Predictor`, `Figures`, and `Scope_Exploration`. Additionally, it includes the following files:

- **Makefile**: Describes the complete pipeline for training the predictors.
- **Environment file**: For reproducing the training and testing of the predictors.
- **License file**: Specifies the project's license. All published data and code are available under the Creative Commons Attribution-ShareAlike 4.0 license (CC BY-SA 4.0), with the exception of the toxin predictor and the code used to create it, which are available upon request from the corresponding author. Access is restricted because of the ethical risk of misuse associated with the discovery of novel toxins.
- **Testing_first_predictor_on_data**: Folder containing all information and demos on how to use the predictor and describing the analysis of the proteins classified as toxins by Exo-Tox.

---

## Folder Structure

### 1. Data
Contains all data used in the project.
#### 1.1 `Data/raw`
**Description**: Contains raw data files, including embeddings and FASTA sequences. None of the datasets in this folder are redundancy reduced.

- `allscrProtT5_MergedExotoxinsDB_Jun2023.h5`: Embeddings for scrambled sequences generated by ProtT5.
- `noSignalP_ProtT5_SST30_allProteins_Aug2023.h5`: Embeddings for sequences with signal peptides removed, generated by ProtT5.
- `ProtT5_MergedExotoxinsDB_Jun2023.h5`: Raw embeddings generated by ProtT5.
- `Benchmarking_CSM`: Data for the CSM-toxin benchmark predictor.
- `fasta_clean_controlProteins`: FASTA sequences of control proteins.
- `fasta_clean_mergedExotoxins_updated`: FASTA sequences of toxins.
- `MergedExotoxinsDB_Jun2023`: Combined FASTA sequences of toxins and controls.

#### 1.2 `Data/derived`
**Description**: Contains data derived from the raw data.

##### 1.2.1 `benchmark_predictions`
**Description**: Results from benchmark predictors (e.g., CSM-toxin).

- `non_toxins_results.csv`: Results of only the non-toxins from the test set.
- `toxins_results.csv`: Results of only the toxins from the test set.
- `results_CSM_overlapped_test_set.csv`: Combined results for the full test set.

##### 1.2.2 `blastp_results`
**Description**: Contains BLASTP results.

##### 1.2.3 `clean_fasta_files`
**Description**: Contains FASTA files reduced to 30% sequence similarity.

- `mergedExotoxins_SST30.fasta`: Reduced exotoxin sequences.
- `controlProteins_SST30.fasta`: Reduced non-toxin sequences.
- `allBacterialProteins_SST30_concatenated.fasta`: Combined reduced sequences.

##### 1.2.4 `mmseqs_results`
**Description**: Contains the output from MMseqs2.

- `all_seqs.fasta`: Input sequences in FASTA format.
- `cluster.tsv`: Cluster representatives and members.
- `rep_seq.fasta`: Cluster representative sequences in FASTA format.
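
As an illustration of how `cluster.tsv` could be inspected, the following minimal sketch groups members by their cluster representative. It assumes the two-column, tab-separated layout (representative, member) that MMseqs2 typically writes for cluster TSV files. The function name `read_clusters` and the protein IDs are made up for this example and are not part of the repository.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Map each cluster representative to the list of its member IDs.

    Assumes one tab-separated pair per line: representative, member.
    """
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


# In-memory example with made-up IDs; for a real file, pass
# open("cluster.tsv", newline="") instead of the StringIO object.
example = io.StringIO("protA\tprotA\nprotA\tprotB\nprotC\tprotC\n")
clusters = read_clusters(example)
for representative, members in clusters.items():
    print(representative, len(members), members)
```

The number of keys in the returned dictionary should correspond to the number of sequences in `rep_seq.fasta`, which can serve as a quick consistency check.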

##### 1.2.5 `log`
**Description**: Logs from various predictors.

- `baseline_30_no_weights_pred_log`: Log for the amino acid composition-based predictor.
- `baseline_blastp_pred_log`: Performance log for the BLAST baseline model.
- `baseline_length_only_30_no_weights_pred_log`: Performance log for the baseline predictor using only sequence length.
- `CSM_other_predictor_logfile`: Performance log for the CSM-toxin predictor.
- `sklearn_30_balanced_pred_log`: Performance log for the embeddings-based predictor.

##### 1.2.6 `signalP`
**Description**: Results from the SignalP website, which predicts the location of secretion signals.

##### 1.2.7 `test_set_for_other_predictors`
**Description**: Intermediate datasets created during redundancy reduction of the test set against the training set of the CSM-toxin predictor.

##### 1.2.8 `train_val_test`
**Description**: Training and test datasets for the predictors. All sets are redundancy reduced to 30% (SST30).

###### Labels:
- `Y_train_SST30.csv`: Labels for the training set.
- `y_test_SST30.csv`: Labels for the test set.

###### Sequences:
- `X_train_SST30.fasta`: FASTA sequences for the training set.
- `X_test_SST30.fasta`: FASTA sequences for the test set.
- `mergedExotoxins_X_test_SST30.fasta`: FASTA sequences of toxins only.
- `controlProteins_X_test_SST30.fasta`: FASTA sequences of non-toxins only.

###### Amino Acid Composition:
- `aac_X_train_SST30.tsv`: Amino acid composition of the training set.
- `aac_X_test_SST30.tsv`: Amino acid composition of the test set.

###### Embeddings:
- `embeddings_ProtT5_X_train_SST30.tsv`: Embeddings from the unaltered training sequences.
- `embeddings_ProtT5_X_test_SST30.tsv`: Embeddings from the unaltered test sequences.
- `embeddings_noSignalP_ProtT5_SST30_X_train_SST30.tsv`: Embeddings from training sequences with signal peptides removed.
- `embeddings_noSignalP_ProtT5_X_test_SST30.tsv`: Embeddings from test sequences with signal peptides removed.
- `embeddings_allscrProtT5_X_train_SST30.tsv`: Embeddings from scrambled training sequences.
- `embeddings_allscrProtT5_X_test_SST30.tsv`: Embeddings from scrambled test sequences.

##### 1.2.9 `control_proteins_putative_toxins_.fasta`
**Description**: Putative toxins found by a regex search within the bacterial control proteins, both secreted and non-secreted.

---

### 2. Code
**Description**: Contains all code that can be run from the command line for the analysis. 
#### 2.1 `Code/Python/`

##### Predictor Training:
- `Predictor_w_embeddings_noweights.py`: Trains a model with embeddings as input.
- `baseline_predictor_random_split_no_weights.py`: Similar to the above, using amino acid composition as input.

##### Predictor Testing:
- `reusing_predictor.py`: Reuses the trained predictor on the test set to generate performance metrics.

##### Predictor Reuse for End-Users:
- `classifying_unknown_proteins.py`: Predicts the toxicity of unknown proteins.

##### Baseline Performance of BLAST and CSM-toxin:
- `baseline_predictor_length_only.py`: Builds a baseline model using sequence length as the feature.
- `BLAST_part2.py`: Uses BLASTP to estimate toxicity of the test set.
- `investigate_CSM_training_set.py`: Explores the origins of proteins in the CSM-toxin predictor training set.
- `parsing_CSM_results.py`: Parses results from the CSM-toxin predictor on the test set.
- `parsing_data_overlap_foreign_training_data.py`: Examines overlaps between local datasets and external training data to prevent data leakage.
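
To illustrate the idea of checking for data leakage between two sequence sets, here is a minimal sketch that reports test records whose sequence also occurs, character for character, in a training set. This is a simplification and an assumption: the scripts in this repository may instead rely on similarity-based comparison. The helper names `read_fasta` and `shared_sequences` and the example records are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shared_sequences(train_handle, test_handle):
    """Return headers of test records whose exact sequence is in the training set."""
    train_seqs = {seq.upper() for _, seq in read_fasta(train_handle)}
    return [
        header
        for header, seq in read_fasta(test_handle)
        if seq.upper() in train_seqs
    ]


# Made-up records; for real data, pass open file handles instead.
train = io.StringIO(">train1\nMKVLA\n>train2\nMSTQ\nRR\n")
test = io.StringIO(">test1\nGGHH\n>test2\nmstqrr\n")
overlap = shared_sequences(train, test)
print(overlap)
```

Exact matching only catches identical sequences. Near-identical sequences require an alignment- or clustering-based check.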

##### Corrupted Embeddings:
- `scramble_sequences.py`: Randomizes sequences to test the predictor.
- `signal_removal.py`: Removes signal peptides to test the predictor.
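
The two corruption strategies can be sketched as follows. This is only an illustration under stated assumptions: scrambling is taken to mean shuffling the residues of a sequence (which preserves length and amino acid composition), and signal peptide removal is taken to mean cutting the sequence at a cleavage position supplied by an external tool such as SignalP. The function names, the `cleavage_position` parameter, and the example sequence are hypothetical and not taken from the repository scripts.

```python
import random


def scramble_sequence(sequence, seed=None):
    """Return the residues of `sequence` in random order.

    Length and amino acid composition are preserved; only the order changes.
    """
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def remove_signal_peptide(sequence, cleavage_position):
    """Drop the first `cleavage_position` residues of `sequence`.

    `cleavage_position` is the number of residues in the predicted signal
    peptide (assumed to come from an external prediction).
    """
    if not 0 <= cleavage_position <= len(sequence):
        raise ValueError("cleavage_position must lie within the sequence")
    return sequence[cleavage_position:]


original = "MKKLLAAAQV"  # made-up sequence for illustration
scrambled = scramble_sequence(original, seed=0)
mature = remove_signal_peptide(original, cleavage_position=3)
print(scrambled, mature)
```

Passing a fixed `seed` makes the scrambling reproducible, which helps when the scrambled sequences are embedded and compared across runs.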

##### Visualization:
- `Visualization_aac_PCA_very_simple.py`: Generates a PCA plot based on amino acid composition data.
- `Visualizing_embeddings_PCA_very_simple.py`: Generates a PCA plot based on embeddings as input.
- `shared_visualization.py`: Shared visualizations used in the manuscript.

##### Data Preprocessing:
- `test_file_creation.py`: Tests file generation.
- `aac_calc_part2.py`: Computes amino acid compositions.
- `splitting_fastas.py`: Splits FASTA files into training and test sets.
- `embeddings_data_split_part1.py`: Splits embeddings files.
- `data_overlap_part5.py`: Checks for data leakage between toxins and controls.
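
For readers unfamiliar with amino acid composition as a feature, the following minimal sketch computes the relative frequency of each of the 20 standard amino acids in a sequence. It is an assumption about the general technique, not a copy of `aac_calc_part2.py`. The name `amino_acid_composition`, the alphabet order, and the handling of non-standard residues (ignored here) are choices made for this example.

```python
from collections import Counter

# The 20 standard amino acids in alphabetical one-letter order
# (the order used here is an assumption for illustration).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list of relative amino acid frequencies.

    Non-standard characters (for example X or B) are ignored, so the
    frequencies sum to 1 whenever at least one standard residue is present.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


composition = amino_acid_composition("AACD")  # made-up sequence
for aa, fraction in zip(AMINO_ACIDS, composition):
    if fraction > 0:
        print(aa, fraction)
```

Each sequence is reduced to a fixed-length vector regardless of its length, which is what allows a simple baseline classifier to be trained on it.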
 
---

### 3. Predictor
**Description**: Contains trained models with filenames indicating:

- **SST**: Sequence Similarity Threshold.
- **Model Architecture**: Random Forest, XGBoost, SVC, Logistic Regression, k-Nearest Neighbors.
- **Performance Metrics**: MCC, Precision-Recall, ROC-AUC, etc.
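
As a reminder of what the MCC value in a filename expresses, the sketch below computes the Matthews correlation coefficient from the four cells of a binary confusion matrix. It uses the standard formula. The function name and the example counts are made up and do not reflect results from this project.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (a common convention when
    one row or column of the confusion matrix is empty).
    """
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Made-up counts for illustration only.
perfect = matthews_corrcoef(tp=5, tn=5, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=5, fn=5)
print(perfect, inverted)
```

The MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct), and it remains informative when the classes are imbalanced.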

---

### 4. Figures
**Description**: Contains all visualizations. The following subfolders are included:

- `/baseline_length_only_predictor_hyperparameter`: Figures showing performance for length-only predictors.
- `/baseline_predictor_hyperparameter`: Performance figures for amino acid composition-based predictors.
- `/exploratory_Figures`: Exploratory PCA plots, distributions, and analyses.
- `/sklearn_predictor_hyperparameter`: Figures for embeddings-based predictors, including confusion matrices, PCA plots, and feature importance analyses.

### 5. Scope_Exploration
**Description**: An independent sub-project that explores the scope of the predictors. It contains its own README files.

-----------

**Dates**

- The bacteriophage data set was accessed, collected, and curated between February and April 2023.
- The bacterial proteins data set was accessed, created, and curated between December 2023 and February 2024.
- The bacterial exotoxins and the secreted proteins (controls) were collected and curated between January and June 2021. Their analysis and comparisons with animal and fungal toxins are available under the DOI https://doi.org/10.5282/ubm/data.423 and at https://data.ub.uni-muenchen.de/555/