# README for Data/raw directory, Exo-Tox review process

**Created by**: Tanja Krüger
**Revised by**: Luisa F. Jimenez-Soto
**Date**: July 4th, 2025

This directory holds all files required during the review process for the submission to the journal BioData Mining.


## Directories

This folder contains three directories: `raw_from_multitox`, `secreted`, and `exotoxins`.


## raw_from_multitox
It contains the CSV file obtained by classifying our test set with the MultiTox predictor.

Columns: Sequence ID, Predicted Class, Confidence
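
As an illustration of how this file might be read, here is a minimal sketch using only the Python standard library. It assumes the CSV has a header row whose names match the three columns listed above exactly, and that Confidence is numeric; neither is confirmed here. The sample rows, IDs, and class labels are invented placeholders, not real MultiTox output.

```python
import csv
import io
from collections import Counter

# Synthetic stand-in for the MultiTox output file.
# Assumption: the header names match the column list in this README.
# All IDs, class labels and confidence values below are invented.
SAMPLE = """Sequence ID,Predicted Class,Confidence
seq_001,class_a,0.91
seq_002,class_b,0.77
seq_003,class_a,0.58
"""


def read_predictions(handle):
    """Return one dict per row, with Confidence converted to float."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append(
            {
                "Sequence ID": row["Sequence ID"],
                "Predicted Class": row["Predicted Class"],
                "Confidence": float(row["Confidence"]),
            }
        )
    return rows


predictions = read_predictions(io.StringIO(SAMPLE))

# Count how many sequences were assigned to each predicted class.
class_counts = Counter(p["Predicted Class"] for p in predictions)
print(class_counts)
```

To read the actual file, pass an open file handle (opened with `newline=""`) to `read_predictions` instead of the in-memory sample.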

## secreted

It contains the directory 'rank_1', with the folds created using Foldseek.

The file *CF_avg_pLDDT_nMSA.csv* was created during the extraction of the folds for Foldseek for all proteins in our "secreted_proteins" dataset (the metadata originates from AlphaFold2).
The column names follow the original AlphaFold2 paper (DOI: 10.1038/s41586-021-03819-2, supplementary data):

- *ID*: Protein ID. String
- *pAE*: Predicted Aligned Error. Average pairwise pAE across the protein structure. Float
- *pLDDT*: Predicted Local Distance Difference Test. Mean pLDDT score over all residues (pLDDT is a per-residue confidence metric). Float
- *pTM*: Predicted Template Modeling score (predicted TM-score). Indicates how well folded the entire model is. Float
- *nMSA*: Number of MSA sequences. Number of sequences in the Multiple Sequence Alignment (MSA) used to predict the model. Integer

## exotoxins
It contains the directory 'rank_1', with the folds created using Foldseek.

The file *CF_avg_pLDDT_nMSA.csv* was created during the extraction of the folds for Foldseek for all proteins in our "exotoxins" dataset (the metadata originates from AlphaFold2).
The column names follow the original AlphaFold2 paper (DOI: 10.1038/s41586-021-03819-2, supplementary data):

- *ID*: Protein ID. String
- *pAE*: Predicted Aligned Error. Average pairwise pAE across the protein structure. Float
- *pLDDT*: Predicted Local Distance Difference Test. Mean pLDDT score over all residues (pLDDT is a per-residue confidence metric). Float
- *pTM*: Predicted Template Modeling score (predicted TM-score). Indicates how well folded the entire model is. Float
- *nMSA*: Number of MSA sequences. Number of sequences in the Multiple Sequence Alignment (MSA) used to predict the model. Integer
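
The *CF_avg_pLDDT_nMSA.csv* files in `secreted` and `exotoxins` share the same column layout, so one loader can serve both. The sketch below shows one possible way to read such a file with the declared types, using only the Python standard library. It assumes a comma-separated file with a header row using exactly the column names listed above, and that *nMSA* is written as a plain integer; these details are assumptions, and the sample rows are invented for illustration.

```python
import csv
import io

# Column names and types as described in this README.
COLUMN_TYPES = {"ID": str, "pAE": float, "pLDDT": float, "pTM": float, "nMSA": int}

# Synthetic stand-in for CF_avg_pLDDT_nMSA.csv; all values are invented.
SAMPLE = """ID,pAE,pLDDT,pTM,nMSA
prot_A,5.2,91.4,0.88,1200
prot_B,14.7,62.3,0.41,35
"""


def read_metadata(handle):
    """Parse the metadata CSV into a list of dicts with typed values."""
    reader = csv.DictReader(handle)
    # Fail early if the header does not contain the expected columns.
    missing = set(COLUMN_TYPES) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [
        {name: cast(row[name]) for name, cast in COLUMN_TYPES.items()}
        for row in reader
    ]


records = read_metadata(io.StringIO(SAMPLE))

# Example summary: mean of the per-protein average pLDDT values.
mean_plddt = sum(r["pLDDT"] for r in records) / len(records)
print(f"{len(records)} proteins, mean pLDDT {mean_plddt:.2f}")
```

For the real data, replace the in-memory sample with an open file handle, for example one opened on `secreted/CF_avg_pLDDT_nMSA.csv` with `newline=""`.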
