
Latest update: November 2023

Created by: Tanja Krueger (Analysis), Luisa F. Jimenez (Bacteria), Ivan Koludarov (Animal)

Preliminary title of the manuscript: "Not all toxins are the same"


# Overview:
This folder contains everything needed for the data analysis comparing 
animal toxins and bacterial toxins to their respective control proteins.
All published data and code are available under the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0); see `License.txt`.

For details of the license, see <http://creativecommons.org/licenses/by-sa/4.0/>.

# Folder structure: 
## Directories
* Code
* Data
* Figures
* Exploratory

## Files
* Makefile_analysis
* Makefile_control_data
* README.md
* License.txt
* environment.yml: Use this file to set up the conda environment for Python.
* used_packages_*: Lists the R packages needed to run the R code, because R requirements are not included in the conda environment.

## Code
The code folder is subdivided by programming language.
### R
#### Code/R/matrix_for_logomaker2.R
	This file creates a matrix with surprise values on its diagonal.
#### Code/R/visualization_length2.R
	This file visualizes the protein length of the original raw data.
#### Code/R/visualization_length_SignalP.R
	This file visualizes the protein length of the SignalP-6.0 data for comparison.
	
### Python
#### Code/python/data_analysis_2.py
	This file creates SINGLE plots of isoelectric point (pI), length, aromaticity, and amino acid usage.
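
The properties named above can be illustrated with a minimal, standard-library sketch. It is not the code in `data_analysis_2.py`; the function names are hypothetical, and aromaticity is assumed here to mean the relative frequency of Phe, Trp, and Tyr. The isoelectric point is left out because it needs a pKa table; Biopython's `ProteinAnalysis` class offers `isoelectric_point()` for that purpose.

```python
from collections import Counter

# Assumption: aromaticity = fraction of Phe (F), Trp (W) and Tyr (Y) residues.
AROMATIC = frozenset("FWY")


def aromaticity(seq: str) -> float:
    """Return the fraction of aromatic residues in a protein sequence."""
    if not seq:
        return 0.0
    return sum(1 for aa in seq if aa in AROMATIC) / len(seq)


def amino_acid_usage(seq: str) -> dict:
    """Return the relative frequency of each amino acid in the sequence."""
    counts = Counter(seq)
    total = len(seq)
    return {aa: n / total for aa, n in sorted(counts.items())}


if __name__ == "__main__":
    example = "MKTFWYLLVA"  # made-up sequence, for illustration only
    print("length:", len(example))
    print("aromaticity:", aromaticity(example))
    print("usage:", amino_acid_usage(example))
```

Values like these, computed per sequence, are what a plot of length, aromaticity, or amino acid usage would take as input.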
#### Code/python/data_analysis_3.py
	This file creates COMBINED plots of pI, length, aromaticity, and amino acid usage.
#### Code/python/data_analysis_4.py
	This file visualizes the results of an MMseqs2 fold cluster analysis in a COMBINED plot.
#### Code/python/data_analysis_6.py
	This code is an update to Code/python/data_analysis_3.py and creates alternative visualizations.
#### Code/python/data_collection_part1_3.py
	This file downloads sequences from UniProt to create the non-toxin dataset, using information from a toxin set.
#### Code/python/deletion_duplicates_fragments.py
	This file removes duplicate IDs and fragments from a multi-FASTA file.
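
As a rough illustration of this kind of filtering (not the actual contents of `deletion_duplicates_fragments.py`), the sketch below keeps the first record for each ID and skips records marked as fragments. It assumes that the ID is the first whitespace-separated token of the header and that fragments carry `(Fragment)` or `(Fragments)` in the header, as in UniProt FASTA headers. Both function names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a multi-FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_duplicates_and_fragments(
    records: Iterable[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Keep the first record per ID and skip records flagged as fragments."""
    seen = set()
    kept = []
    for header, seq in records:
        parts = header.split()
        seq_id = parts[0] if parts else ""
        # Assumption: fragments are flagged in the header text.
        if "(Fragment" in header or seq_id in seen:
            continue
        seen.add(seq_id)
        kept.append((header, seq))
    return kept
```

With a file on disk, the two functions could be chained as `drop_duplicates_and_fragments(read_fasta(open(path)))`, where `path` is a placeholder for the input file.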
#### Code/python/mmseqs2_cluster_analyis.py
	This file visualizes the results of a MMseqs2 fold cluster analysis in SINGLE plots.
#### Code/python/visualization_logomaker.py
	This file produces a single surprise plot.
#### Code/python/fasta_cleaning
	This folder contains code that converts raw FASTA files into a single format, which is used throughout the project.

### bash
#### Code/bash/concat.sh
#### Code/bash/file_count.sh

## Data
### derived
	The derived data folder contains all data that the code has produced from the raw files.
	
	All files in this folder follow the naming convention:
	
	- all_toxic_proteins 		animal and bacterial toxins combined
	- all_animal_proteins		all animal proteins combined
	- all_bacterial_proteins 	all bacterial proteins combined
	- animal_toxins				toxins only from animals
	- bacterial_toxins			toxins only from bacteria
	- animal_control			non-toxins from animals
	- bacterial_control			non-toxins from bacteria
	- SST						the sequence similarity threshold used in the MMseqs2 algorithm
	- clean						the FASTA file in the common format
	- IDgenus					a list of protein IDs and species origin
	- ID_Type					a list of protein IDs
	- source					the data source (UniProt or NCBI)
	- all_seqs					all sequences in MMseqs2 clusters in FASTA format
	- cluster					all sequences in MMseqs2 clusters in TSV format
	- rep_seqs					only the representative sequences
	- mode0						the mode MMseqs2 was run in from the command line
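
To show how a `cluster` file from the list above can be read, here is a minimal sketch. It assumes the usual two-column MMseqs2 cluster TSV layout: representative ID, then member ID, one pair per line, with the representative also listed as a member of its own cluster. The function name and the example IDs are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, List


def read_clusters(handle: Iterable[str]) -> Dict[str, List[str]]:
    """Map each representative ID to the list of its cluster members."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


if __name__ == "__main__":
    # Made-up IDs standing in for the contents of a cluster TSV file.
    example = io.StringIO("P1\tP1\nP1\tP2\nP3\tP3\n")
    for rep, members in read_clusters(example).items():
        print(rep, len(members))
```

Comparing the number of clusters with the number of input sequences gives a simple measure of how much a given SST reduced the dataset.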

	
#### forLogoMaker
	This folder contains matrices with the surprise values on the diagonal.
#### temporary_files
	This folder contains temporary files.

	
	
### raw
#### animal 
	This folder contains the animal toxins that were picked with expert knowledge.
#### bacteria
	This folder contains the bacterial toxins and non-toxins that were picked with expert knowledge.
##### Data/raw/bacterial/bacterial_control_proteins.fasta
	This file has the bacterial control sequences that are secreted but not toxins.
##### Data/raw/bacterial/bacterial_toxins_combined.fasta
	This file has all bacterial exotoxins.
#### animal
##### Data/raw/animal/3FTX_only_sp_and_tr.fasta
	This file contains three-finger toxins in the standard format.
##### Data/raw/animal/animal_toxins_combined.fasta
	This file contains all animal toxins used in this study.
##### Data/raw/animal/TfTox_without_standard_format.fasta
	This file contains three-finger toxins that are not in the standard format.
#### Data/raw/SignalP6_training_data.fasta
	This file contains the SignalP-6.0 training data, which is used for comparison with our findings
	([accessed October 2023](https://services.healthtech.dtu.dk/services/SignalP-6.0/)).


## Exploratory
### python 
	Contains unused Python code that was replaced during the project.
### R
	Contains unused R code that was replaced during the project.


## Figures
	The figures folder holds all visualizations.
	The following naming convention is used:
	all_toxins 					animal and bacterial toxins combined
	animal_toxins				toxins only from animals
	bacterial_toxins			toxins only from bacteria
	logomaker					surprise logo for each amino acid
	animal_control				non-toxins from animals
	bacterial_control			non-toxins from bacteria
	mmseqs_reduction_level 		level of sequence similarity set in the MMseqs2 algorithm
	SST							level of sequence similarity set in the MMseqs2 algorithm
	shared_aromaticity			multiple visualizations of aromaticity
	shared_lengths				multiple visualizations of sequence length
	shared_logos				multiple visualizations of logos
	shared_pIs					multiple visualizations of isoelectric points
	shared_sequence_diversity 	multiple visualizations of sequence similarity determined by MMseqs2
	amino_acid_dist 			differences in amino acid ratios, shown as histograms
	density						the value displayed on the y-axis of the plot
	1by2						how plots are arranged in a shared plot
	0.1 and 0.05				smoothness of the kernel density estimate
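
The effect of the smoothness value can be sketched with a plain Gaussian kernel density estimate. This is an illustration only: it assumes that 0.1 and 0.05 act as absolute kernel bandwidths, which may differ from how the plotting code interprets them (some libraries treat such values as a scaling factor on an automatic bandwidth). The function name and data are hypothetical.

```python
import math
from typing import Callable, Sequence


def gaussian_kde(data: Sequence[float], bandwidth: float) -> Callable[[float], float]:
    """Return a density function built from Gaussian kernels of fixed bandwidth."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x: float) -> float:
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density


if __name__ == "__main__":
    values = [4.9, 5.0, 5.1, 9.0]  # made-up values, e.g. isoelectric points
    smooth = gaussian_kde(values, bandwidth=0.1)
    narrow = gaussian_kde(values, bandwidth=0.05)
    # The smaller bandwidth gives a sharper, taller peak at an isolated point.
    print(smooth(9.0), narrow(9.0))
```

A smaller value follows the individual data points more closely, while a larger value produces a smoother curve.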



