# README from review project folder

**Project Title**: Exo-Tox predictor review items

**Creation Date**: June 30th, 2025

**Authors**: Tanja Krueger, Luisa F. Jimenez-Soto

# General information for the repository: Exo-Tox predictor - Review process data

## Project

This project contains all files related to the creation of the Exo-Tox predictor, including a guide on how to use it and the code used for its creation and for the data analysis.

### Core Team

Tanja Krüger, ORCID 0009-0007-4575-4534

Damla Ayse Durmaz, ORCID 0009-0006-8693-1616

Luisa Fernanda Jimenez Soto, ORCID 0000-0001-8551-5019

**Institution**: Walther-Straub Institute of Pharmacology and Toxicology

**University**: Ludwig-Maximilians University (LMU) - Munich, Germany

#### Responsibilities

- The creation of the predictor, curation of bacterial control data, and testing on all sets were performed by Tanja Krüger.
- The bacteriophage dataset was curated by Damla Durmaz.
- The data analysis for the hits was done by Luisa F. Jimenez-Soto.
- The "testing_the_first_predictor_on_data" demo was created by Tanja Krüger and Luisa F. Jimenez-Soto.

### Contact

Tanja Krüger: tanja.krueger@lrz.uni-muenchen.de

Luisa F. Jimenez-Soto: l.jimenez@lmu.de

# Overview of the full project folder

This repository contains four main folders (`Data`, `Code`, `Predictor`, and `Figures`) and one file, `overview.md`, which describes the pipelines used for the analyses included in the manuscript as part of the review process.

## Folder Structure

### 1. Data

Contains all data used in the project during the revision process. For the original data, see [Exo-Tox repository](https://doi.org/10.5282/ubm/data.576).

#### 1.1 `Data/raw`

**Description**: Contains the raw data used for the revision process.

##### 1.1.1 `all_folds`

**Description**: Folds of the exotoxins and control proteins, used for the Foldseek similarity baseline.

##### 1.1.2 `raw_from_multitox`

**Description**: Output returned by MultiToxPred 1 for our test set, used as a second benchmark. Part of the review process.

##### 1.1.3 `exotoxins`

**Description**: Folds of the exotoxin proteins under the 'rank_1' folder, plus metadata (see Data/raw/README.md for more details).

##### 1.1.4 `secreted`

**Description**: Folds of the control proteins under the 'rank_1' folder, plus metadata (see Data/raw/README.md for more details).

#### 1.2 `Data/derived`

**Description**: Contains derived data used for the revision process.

##### 1.2.1 `foldseek_baseline`

**Description**: Data involved in generating the Foldseek results.

- `foldseek_results_reasonable.tsv`: Contains all hits found by Foldseek between query (test data) and target (training data).
- `foldseek_top1_per_query_foldseek_results_reasonable.tsv`: Contains the best hit for each query, using `prob` as the evaluation metric.
- `all files beginning with train_fodls_BB...`: Created by running the Foldseek database creation command on the training folds.

##### 1.2.2 `test_dir_folds`

**Description**: Folds of the test data.

##### 1.2.3 `train_dir_folds`

**Description**: Folds of the training data.

##### 1.2.4 `train_val_test`

**Description**: Training and test datasets for the predictors, taken from the main project prior to revision. All sets are redundancy-reduced to 30% (SST30).

##### 1.2.5 `multi_tox_baseline`

**Description**: Data involved in running the test data through the prediction tool MultiToxPred 1.

###### 1.2.5.1 `derived_multitox`

Results received from MultiToxPred 1.

- `classification_log.txt`: Metrics calculated from the MultiToxPred 1 results.
- `classified_hits.tsv`: MultiToxPred 1 hits separated into True Positives, True Negatives, False Positives, and False Negatives.

###### 1.2.5.2 `passed_to_multitox`

The test data was split into four parts to allow batch processing by MultiToxPred 1, which accepts up to 100 sequences at once.
- `X_test_SST30_part1.fasta`: part one of the test sequences
- `X_test_SST30_part2.fasta`: part two of the test sequences
- `X_test_SST30_part3.fasta`: part three of the test sequences
- `X_test_SST30_part4.fasta`: part four of the test sequences

###### 1.2.5.3 `raw_from_multitox`

Results received from MultiToxPred 1.

##### 1.2.6 `pseudoAAC`

**Description**: Data involved in training and testing a predictor based on the APAAC protein representation.

- `new_parameters`: Hyperparameters needed to generate the APAAC representation.
- `sklearn_30_balanced_pred_log.log`: Log file containing the predictor results.
- `X_test_SST30_APAAC.csv`: APAAC protein representation for the test set, generated by https://github.com/Superzchen/iFeatureOmega-CLI/.
- `X_train_SST30_APAAC.csv`: APAAC protein representation for the training set, generated by https://github.com/Superzchen/iFeatureOmega-CLI/.

### 2. Code

**Description**: Contains all code that was involved in the paper revision. For the original Exo-Tox code, see [Exo-Tox repository](https://doi.org/10.5282/ubm/data.576).

#### 2.1 `Code/Python/`

##### Foldseek pipeline:

- `check_pdb_consistency.py`: Checks whether the folds were split correctly.
- `evaluate_foldseek_hits.py`: Finds Foldseek hits between each query and target.
- `extract_foldseek_top_hits.py`: Extracts the best hit for each query using the Foldseek parameter `prob`.
- `folds_into_test_and_train.py`: Splits the folds into training and test folds.

##### MultiToxPred 1 pipeline:

- `mulittoxpred1.ipynb`: Evaluates the results of the MultiToxPred 1 predictor.

##### APAAC pipeline:

- `APACC_predictor_first_part.py`: Trains a model using the APAAC protein representation as input.
- `support_functions_splitting_predictor.py`: Supporting functions.

### 3. Predictor

**Description**: Contains all predictors trained on the APAAC protein representation.
For the full Exo-Tox predictor and a guide on how to use it, please go to the [Exo-Tox repository](https://doi.org/10.5282/ubm/data.576).

- `SST`: Sequence Similarity Threshold.
- `Model Architecture`: Random Forest, XGBoost, SVC, Logistic Regression, k-Nearest Neighbors.
- `Performance Metrics`: MCC, Precision-Recall, ROC-AUC, etc.
- `CV`: Cross-validation split parameter.
- `APAAC`: Protein representation chosen as input (Amphiphilic Pseudo-Amino Acid Composition).

### 4. Figures

**Description**: Contains all figures from testing the APAAC-trained predictor on the test set.
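
The batching step described under `passed_to_multitox` (splitting the test set into files of at most 100 sequences each) could look roughly like the sketch below. This is an illustration only, not the script used in this project: the function names `read_fasta`, `split_into_batches`, and `write_batches` are hypothetical, and the sketch assumes plain FASTA input and the `X_test_SST30_partN.fasta` naming pattern listed above.

```python
from pathlib import Path
from typing import List, Tuple

# Batch limit taken from the README text (up to 100 sequences per submission).
MAX_BATCH = 100

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(text: str) -> List[Record]:
    """Parse FASTA-formatted text into (header, sequence) pairs."""
    records: List[Record] = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_into_batches(records: List[Record], max_batch: int = MAX_BATCH) -> List[List[Record]]:
    """Cut the record list into consecutive batches of at most max_batch entries."""
    return [records[i:i + max_batch] for i in range(0, len(records), max_batch)]


def write_batches(records: List[Record], out_dir: str,
                  stem: str = "X_test_SST30", max_batch: int = MAX_BATCH) -> List[Path]:
    """Write each batch to <stem>_part<N>.fasta (hypothetical helper)."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    paths = []
    for n, batch in enumerate(split_into_batches(records, max_batch), start=1):
        path = target / f"{stem}_part{n}.fasta"
        path.write_text("".join(f">{h}\n{s}\n" for h, s in batch))
        paths.append(path)
    return paths
```

Calling `write_batches(read_fasta(Path("X_test_SST30.fasta").read_text()), "passed_to_multitox")` would then produce one file per batch; the input file name in this call is an assumption made for the example.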