# Overview 

This file describes the pipeline used to establish the foldseek baseline (similarity based on folds) and the pipeline for the pseudo amino acid predictor (APAAC20).

# How to run the foldseek pipeline, train the pseudo amino acid based predictor, and run MultiToxPred 1.0

# PART 1: FOLDSEEK PIPELINE 
## 1: Split all folds into the training and the test sets
python folds_into_test_and_train.py --pdb_dir ../../Data/raw/all_folds/ --fasta_a ../../Data/derived/train_val_test/X_train_SST30.fasta --fasta_b ../../Data/derived/train_val_test/X_test_SST30.fasta --out_dir_a ../../Data/derived/train_dir_folds --out_dir_b ../../Data/derived/test_dir_folds

## 2: Double check the splitting
python check_pdb_consistency.py --fasta_a ../../Data/derived/train_val_test/X_train_SST30.fasta --folder_a ../../Data/derived/train_dir_folds --fasta_b ../../Data/derived/train_val_test/X_test_SST30.fasta --folder_b ../../Data/derived/test_dir_folds
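
The kind of check performed in step 2 can be sketched as follows. This is only an illustration, not the contents of `check_pdb_consistency.py`: it assumes that each structure file is named after the first token of its FASTA header and ends in `.pdb`, and the function names `fasta_ids`, `folder_ids`, and `compare` are hypothetical.

```python
from pathlib import Path


def fasta_ids(fasta_path):
    """Return the set of sequence IDs (first token after '>') in a FASTA file."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def folder_ids(folder, suffix=".pdb"):
    """Return the file stems of all files with the given suffix in a folder.

    Assumption: structure files are named <sequence ID>.pdb.
    """
    return {p.stem for p in Path(folder).iterdir() if p.suffix == suffix}


def compare(fasta_path, folder):
    """Report IDs without a structure file and structure files without an ID."""
    expected = fasta_ids(fasta_path)
    found = folder_ids(folder)
    return {
        "missing_structures": sorted(expected - found),
        "unexpected_structures": sorted(found - expected),
    }
```

Calling `compare` once with the training FASTA and training folder and once with the test FASTA and test folder should return two empty lists each time if the split is consistent.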

## 3: Create a database from the training folds
Data/derived/train_folds_DB: Data/derived/train_dir_folds
		foldseek createdb Data/derived/train_dir_folds Data/derived/train_folds_DB
		
## 4: Run the test folds against the training database from step 3
Data/derived/foldseek_results_reasonable.tsv: Data/derived/test_dir_folds Data/derived/train_folds_DB
		foldseek easy-search Data/derived/test_dir_folds Data/derived/train_folds_DB Data/derived/foldseek_results_reasonable.tsv Data/derived/temporary_files --format-output query,target,lddt,lddtfull,alntmscore,qtmscore,ttmscore,alnlen,qcov,tcov,prob,evalue
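
If a Python driver is preferred over the Makefile-style rules, steps 3 and 4 could be wrapped roughly as below. This is a sketch only: the helper names `build_commands` and `run_pipeline` are hypothetical, and the foldseek subcommands and the `--format-output` columns are copied unchanged from the rules above.

```python
import subprocess

# Output columns copied from the easy-search command in step 4.
FORMAT_OUTPUT = (
    "query,target,lddt,lddtfull,alntmscore,qtmscore,ttmscore,"
    "alnlen,qcov,tcov,prob,evalue"
)


def build_commands(train_dir, test_dir, db_path, results_tsv, tmp_dir):
    """Return the createdb and easy-search commands as argument lists."""
    createdb = ["foldseek", "createdb", train_dir, db_path]
    search = [
        "foldseek", "easy-search", test_dir, db_path, results_tsv, tmp_dir,
        "--format-output", FORMAT_OUTPUT,
    ]
    return [createdb, search]


def run_pipeline(commands):
    """Run each command in order; check=True stops at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Passing argument lists instead of a single shell string avoids quoting problems when paths contain spaces.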

## 5: Parse the results to only retain the top hit for each test (query) protein
../../Data/derived/foldseek_baseline/foldseek_top1_per_query_foldseek_results_reasonable.tsv: ../../Data/derived/foldseek_baseline/foldseek_results_reasonable.tsv
		python extract_foldseek_top_hits.py --input ../../Data/derived/foldseek_baseline/foldseek_results_reasonable.tsv --output ../../Data/derived/foldseek_baseline/foldseek_top1_per_query_foldseek_results_reasonable.tsv
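
Keeping one hit per query can be sketched as below. This is not the code of `extract_foldseek_top_hits.py`. It assumes the results file is tab-separated without a header row and uses the column order from the `--format-output` option in step 4. Ranking by `alntmscore` is also an assumption; the ranking column is a parameter so another one (for example `evalue`, where lower is better) can be used instead.

```python
import csv

# Column order taken from the --format-output option in step 4.
COLUMNS = [
    "query", "target", "lddt", "lddtfull", "alntmscore", "qtmscore",
    "ttmscore", "alnlen", "qcov", "tcov", "prob", "evalue",
]


def read_hits(tsv_path):
    """Read a headerless, tab-separated results file into a list of dicts."""
    with open(tsv_path, newline="") as handle:
        reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
        return list(reader)


def top_hit_per_query(rows, score_column="alntmscore", higher_is_better=True):
    """Return {query: best row}, ranking rows by the chosen score column."""
    best = {}
    for row in rows:
        score = float(row[score_column])
        if not higher_is_better:
            score = -score  # flip the sign so that larger is always better
        current = best.get(row["query"])
        if current is None or score > current[0]:
            best[row["query"]] = (score, row)
    return {query: row for query, (_, row) in best.items()}
```

On ties the first row encountered for a query is kept, so the original file order decides between equally scored hits.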

## 6: Calculate the metrics
python evaluate_foldseek_hits.py 
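
One plausible way to score a similarity baseline like this is nearest-neighbour label transfer: each test protein receives the label of its top training hit, and the predictions are compared with the true test labels. Whether `evaluate_foldseek_hits.py` does exactly this, and which metrics it reports, is not stated here, so the sketch below is an assumption; `transfer_labels` and `accuracy` are hypothetical helper names.

```python
def transfer_labels(top_hits, train_labels):
    """Predict each query's label as the label of its top training hit.

    top_hits maps query ID -> target (training) ID,
    train_labels maps training ID -> label.
    Queries whose target has no known label are left out.
    """
    return {
        query: train_labels[target]
        for query, target in top_hits.items()
        if target in train_labels
    }


def accuracy(predicted, true_labels):
    """Fraction of test proteins predicted correctly.

    Proteins without any prediction (no hit) count as wrong.
    """
    if not true_labels:
        return 0.0
    correct = sum(
        1 for query, label in true_labels.items()
        if predicted.get(query) == label
    )
    return correct / len(true_labels)
```

Counting proteins without a hit as errors keeps the score comparable with a trained predictor that always outputs a label.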


# PART 2: PSEUDO AMINO ACID TRAINED PREDICTOR 

## 1: Generating the APAAC representation of the proteins
Follow the instructions at https://github.com/Superzchen/iFeatureOmega-CLI/, then run the steps below.

# first start Python from the command line
python

# import the right package
import iFeatureOmegaCLI

# open the FASTA data (use absolute paths if these relative paths do not resolve)
proteintrain = iFeatureOmegaCLI.iProtein("../../Data/derived/train_val_test/X_train_SST30.fasta")
proteintest = iFeatureOmegaCLI.iProtein("../../Data/derived/train_val_test/X_test_SST30.fasta")

 
# import the new parameters from a JSON file
proteintrain.import_parameters("../../Data/derived/pseudoAAC/new_parameters")
proteintest.import_parameters("../../Data/derived/pseudoAAC/new_parameters")


# translate the proteins into the APAAC
proteintrain.get_descriptor("APAAC")
proteintest.get_descriptor("APAAC")

 
# save the results
proteintrain.to_csv("../../Data/derived/pseudoAAC/X_train_SST30_APAAC.csv")
proteintest.to_csv("../../Data/derived/pseudoAAC/X_test_SST30_APAAC.csv")

## 2: Training predictor on the APAAC representation of the proteins 
python APACC_predictor_first_part.py ../../Data/derived/pseudoAAC/X_train_SST30_APAAC.csv ../../Data/derived/pseudoAAC/X_test_SST30_APAAC.csv ../../Data/derived/train_val_test/y_train_SST30.csv ../../Data/derived/train_val_test/y_test_SST30.csv


