# README for Data/derived directory, Exo-Tox review process

**Created by**: Tanja Krüger

**Revised by**: Luisa F. Jimenez-Soto

**Date**: July 4th, 2025

This directory contains all files required during the review process for the submission to the journal BioData Mining.

# Directories

This folder contains six directories and one file, `log.log`. The directories are:

- foldseek_baseline
- multi_tox_baseline
- pseudoAAC
- train_val_test
- test_dir_folds
- train_dir_folds


# Files

## `foldseek_baseline`
**Description**: Data involved in generating the Foldseek results.

- `foldseek_results_reasonable.tsv` : Contains all hits found by Foldseek between query (test data) and target (training data)
- `foldseek_top1_per_query_foldseek_results_reasonable.tsv`: Contains the best hit for each query, selected using `prob` as the evaluation metric.
- All files beginning with `train_folds_BB`: Created by running the Foldseek database creation command on the training folds.
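
Selecting the best hit per query can be sketched as below. This is a minimal illustration, not the script used to produce the files: it assumes a tab-separated table with a header row and columns named `query`, `target` and `prob`, which may not match the actual layout of `foldseek_results_reasonable.tsv`. The function name `top_hit_per_query` and the example rows are hypothetical.

```python
import csv
import io


def top_hit_per_query(rows, query_col="query", score_col="prob"):
    """Keep, for every query, the single row with the highest score."""
    best = {}
    for row in rows:
        query = row[query_col]
        score = float(row[score_col])
        # Replace the stored row only if this hit scores strictly higher.
        if query not in best or score > float(best[query][score_col]):
            best[query] = row
    return list(best.values())


# Made-up rows standing in for the real results table (assumed column names).
example_tsv = (
    "query\ttarget\tprob\n"
    "q1\tt1\t0.40\n"
    "q1\tt2\t0.95\n"
    "q2\tt3\t0.10\n"
)
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
top_hits = top_hit_per_query(rows)

for hit in top_hits:
    print(hit["query"], hit["target"], hit["prob"])
```

To apply the same idea to a file on disk, the `io.StringIO` object could be replaced by an open file handle, after adjusting the column names to the real header.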

## `test_dir_folds`
**Description**: Folds of the test data.

## `train_dir_folds`
**Description**: Folds of the training data.

## `train_val_test`
**Description**: Training and test datasets for the predictors, taken from the main project prior to revision.
All sets are redundancy-reduced to 30% (SST30).

- `X_test_SST30.fasta`: Test data protein sequences in FASTA format
- `X_train_SST30.fasta`: Training data protein sequences in FASTA format
- `y_test_SST30.csv`: Labels of the test data
- `y_train_SST30.csv`: Labels of the training data

## `multi_tox_baseline`
**Description**: Data involved in running the test data through the prediction tool MultiToxPred 1. This folder is subdivided further.

### `derived_multitox`
**Description**: Results derived from the MultiToxPred 1 output.

- `classification_log.txt`: Metrics calculated from the MultiToxPred results
- `classified_hits.tsv`: MultiToxPred hits separated into true positives, true negatives, false positives and false negatives
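
The separation into the four categories and the calculation of summary metrics can be sketched as follows. This is only an illustration under stated assumptions: labels are taken to be binary with `1` as the positive class, and accuracy, precision and recall are shown as examples, which are not necessarily the metrics reported in `classification_log.txt`. The function names and the example labels are hypothetical.

```python
from collections import Counter


def confusion_category(y_true, y_pred):
    """Name the confusion-matrix cell for one (true, predicted) label pair."""
    if y_true == 1:
        return "TP" if y_pred == 1 else "FN"
    return "FP" if y_pred == 1 else "TN"


def summarize(y_true, y_pred):
    """Count the four categories and derive a few common metrics."""
    counts = Counter(confusion_category(t, p) for t, p in zip(y_true, y_pred))
    tp, tn, fp, fn = (counts[k] for k in ("TP", "TN", "FP", "FN"))
    total = tp + tn + fp + fn
    return {
        "counts": {"TP": tp, "TN": tn, "FP": fp, "FN": fn},
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 1 = positive class (assumption), 0 = negative class.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
summary = summarize(y_true, y_pred)
print(summary["counts"])
```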

### `passed_to_multitox`
**Description**: The test data was split into four parts to allow batch processing by MultiToxPred 1, which accepts up to 100 sequences at once.

- `X_test_SST30_part1.fasta` : part one of the test sequences
- `X_test_SST30_part2.fasta` : part two of the test sequences
- `X_test_SST30_part3.fasta` : part three of the test sequences
- `X_test_SST30_part4.fasta` : part four of the test sequences
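
Splitting a FASTA file into batches of at most 100 sequences can be sketched as below. This is a minimal example and not the procedure actually used to create the four parts, which is not documented here; the helper names (`read_fasta`, `chunk_records`, `to_fasta`) and the generated example sequences are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if header is not None:
                records.append((header, "".join(seq_lines)))
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        records.append((header, "".join(seq_lines)))
    return records


def chunk_records(records, max_per_chunk=100):
    """Split records into consecutive chunks of at most max_per_chunk."""
    return [
        records[i:i + max_per_chunk]
        for i in range(0, len(records), max_per_chunk)
    ]


def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# 250 made-up sequences standing in for the real test set.
example = "".join(f">seq{i}\nMKT\n" for i in range(250))
records = read_fasta(example)
chunks = chunk_records(records, max_per_chunk=100)
print([len(chunk) for chunk in chunks])  # [100, 100, 50]
```

Each chunk could then be written to its own file with `to_fasta`, for example one file per part.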

### `raw_from_multitox`
**Description**: Raw results received from MultiToxPred 1.

## `pseudoAAC`
**Description**: Data involved in training a predictor that uses pseudo amino acid composition as its input representation. The pseudo amino acid composition was generated following the instructions at https://github.com/Superzchen/iFeatureOmega-CLI/

- `new_parameters`: Parameter settings used to generate the pseudo amino acid composition. These are passed to the iFeatureOmega pipeline as described in its manual: https://github.com/Superzchen/iFeatureOmega-CLI/
- `sklearn_30_balanced_pred_log.log`: Logged training and testing results of the predictors trained on the pseudo amino acid composition.
- `X_test_SST30_APAAC.csv`: Test data in pseudo amino acid composition format.
- `X_train_SST30_APAAC.csv`: Training data in pseudo amino acid composition format.

## `log.log`
**Description**: Log file from the APAAC20 classifiers.

