
Updated Jan, 2025
Created by: Tanja Krueger

# Overview:
This folder contains everything needed for the additional fungi data analysis.
All published data and code are accessible under the Creative Commons
Attribution-ShareAlike 4.0 license (CC-BY-SA 4.0).

# Folder structure: 
## Directories
* Code
* Data
* Figures

## Files
* makefile
* README_fungi.md

## Code
The code folder is subdivided by programming language.
### R
#### Code/R/matrix_for_logomaker_fungi.R
	This file creates a matrix with surprise values on its diagonal. This script assumes fungi data.
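
As a rough illustration of what such a matrix can look like, the sketch below builds a 20x20 matrix with one value per amino acid on the diagonal. It assumes that "surprise value" means the Shannon surprisal `-log2(p)` of each amino acid's relative frequency; the R script may define it differently. The function name and the matrix layout are hypothetical, and the sketch is written in Python rather than R.

```python
import math
from collections import Counter

# The 20 standard amino acids; this ordering is an assumption for illustration.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def surprisal_diagonal(sequences):
    """Return a 20x20 matrix (list of lists) with -log2(p) on the diagonal.

    p is the relative frequency of each amino acid across all sequences.
    Amino acids that never occur keep a 0.0 entry (assumption: the real
    script may handle missing residues differently).
    """
    counts = Counter(
        aa for seq in sequences for aa in seq.upper() if aa in AMINO_ACIDS
    )
    total = sum(counts.values())
    size = len(AMINO_ACIDS)
    matrix = [[0.0] * size for _ in range(size)]
    for i, aa in enumerate(AMINO_ACIDS):
        if counts[aa] > 0:
            matrix[i][i] = -math.log2(counts[aa] / total)
    return matrix


example = surprisal_diagonal(["AAAC"])
```
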
#### Code/R/visualization_length2.R
	This file visualises the protein length.
### Python
#### Code/python/len_analysis_part1.py
	This file analyses the length data to be passed to the script Code/R/visualization_length2.R
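
A minimal sketch of how a length table for plotting could be produced is shown below. The column names (`id`, `origin`, `length`) and both function names are hypothetical; the actual layout of the file consumed by Code/R/visualization_length2.R is not documented here.

```python
import csv
import io


def length_rows(sequences_by_origin):
    """Turn {origin: {sequence_id: sequence}} into one row per sequence."""
    rows = []
    for origin, sequences in sequences_by_origin.items():
        for seq_id, seq in sequences.items():
            rows.append({"id": seq_id, "origin": origin, "length": len(seq)})
    return rows


def write_length_csv(rows, handle):
    """Write the rows as CSV with a header line (hypothetical column names)."""
    writer = csv.DictWriter(handle, fieldnames=["id", "origin", "length"])
    writer.writeheader()
    writer.writerows(rows)


# Usage with toy data; real input would come from the FASTA files.
rows = length_rows({"fungal": {"f1": "MKV"}, "animal": {"a1": "MKVLL"}})
buffer = io.StringIO()
write_length_csv(rows, buffer)
```
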
#### Code/python/process_fasta.py
	This file processes and combines fasta files from UniProt and NCBI into one format.
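
The sketch below illustrates one way to bring UniProt and NCBI FASTA records into a common format. It is not the code in process_fasta.py. It assumes UniProt headers of the form `>sp|ACCESSION|NAME ...` (or `>tr|...`) and NCBI headers that start with the accession; all function names and the output header layout are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def record_id(header):
    """Extract the accession from a UniProt-style or NCBI-style header."""
    fields = header.split()
    first = fields[0] if fields else ""
    parts = first.split("|")
    if len(parts) >= 3 and parts[0] in ("sp", "tr"):
        return parts[1]  # UniProt: db|ACCESSION|ENTRY_NAME
    return first  # NCBI: accession is the first token


def combine(uniprot_text, ncbi_text):
    """Merge both sources into {accession: (source, sequence)}.

    The first record seen for an accession wins, so IDs stay unique.
    """
    records = {}
    for source, text in (("UniProt", uniprot_text), ("NCBI", ncbi_text)):
        for header, seq in parse_fasta(text):
            records.setdefault(record_id(header), (source, seq))
    return records


def to_fasta(records):
    """Serialise the merged records with a uniform '>ID source' header."""
    lines = []
    for rec_id, (source, seq) in records.items():
        lines.append(f">{rec_id} {source}")
        lines.append(seq)
    return "\n".join(lines) + "\n"
```
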
#### Code/python/shared_pI_aromaticity.py
	This file creates visualisations of protein aromaticity, pI, and amino acid usage (the latter in the form of sequence logos).
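
For orientation, aromaticity is commonly defined as the relative frequency of the aromatic residues Phe, Trp and Tyr in a sequence. The sketch below implements that definition in plain Python; whether shared_pI_aromaticity.py computes it this way or through a library such as Biopython is not stated here, and the function name is hypothetical.

```python
AROMATIC_RESIDUES = "FWY"  # phenylalanine, tryptophan, tyrosine


def aromaticity(sequence):
    """Return the fraction of residues that are F, W or Y (0.0 if empty)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    aromatic = sum(seq.count(aa) for aa in AROMATIC_RESIDUES)
    return aromatic / len(seq)


# Three of the four residues in this toy sequence are aromatic.
value = aromaticity("FWYA")
```
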


## Data
### derived
#### combined_fungi_fasta.fasta
	Fasta file of the combined fungi data-sets - contains unique IDs but may include duplicate sequences. 
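
Because IDs are unique but sequences may repeat, it can be useful to list which IDs share an identical sequence before further analysis. The following is a minimal sketch with a hypothetical function name; it expects `(id, sequence)` pairs that have already been read from the FASTA file.

```python
from collections import defaultdict


def duplicate_groups(records):
    """Group IDs by identical sequence and return groups with more than one ID.

    records: iterable of (id, sequence) pairs. Comparison is case-insensitive.
    """
    ids_by_sequence = defaultdict(list)
    for rec_id, seq in records:
        ids_by_sequence[seq.upper()].append(rec_id)
    return [ids for ids in ids_by_sequence.values() if len(ids) > 1]


# Toy data: "a" and "b" carry the same sequence.
groups = duplicate_groups([("a", "MKV"), ("b", "mkv"), ("c", "MKT")])
```
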
#### combined_fungi_fasta_data.csv
	CSV file of the combined fungi toxins data-sets - contains unique IDs but may include duplicate sequences.
#### length_output_fungal.csv
	Contains length information and toxin origin to be plotted by Code/R/visualization_length2.R.
#### mmseqs2_results
	This folder contains sequence sets after different levels of redundancy reduction using MMseqs2.
	The following abbreviations are used:
	- all_fungiandbacterial_toxins: toxins from fungi and bacteria combined
	- all_fungiandanimal_toxins: toxins from fungi and animals combined
	- animal_toxins: toxins only from animals
	- bacterial toxins: toxins only from bacteria
	- bacterial_control: secreted non-toxins from bacteria
	- SST: the sequence similarity threshold used in the MMseqs2 algorithm
	- all_seqs: all sequences in the MMseqs2 clusters, in fasta format
	- cluster: all sequences in the MMseqs2 clusters, in tsv format
	- rep_seqs: only the representative sequences
	- mode0: the mode of MMseqs2 when run from the command line
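
The cluster files can be inspected with a few lines of Python. The sketch below assumes the usual MMseqs2 cluster TSV layout of two tab-separated columns (representative ID, member ID), with each representative also listed as a member of its own cluster. The function name and the toy IDs are hypothetical.

```python
import csv
import io
from collections import defaultdict


def read_clusters(handle):
    """Read a two-column cluster TSV into {representative: [members]}."""
    clusters = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        representative, member = row[0], row[1]
        clusters[representative].append(member)
    return dict(clusters)


# Toy input standing in for a real cluster file.
toy_tsv = io.StringIO("r1\tr1\nr1\tm2\nr2\tr2\n")
clusters = read_clusters(toy_tsv)
cluster_sizes = {rep: len(members) for rep, members in clusters.items()}
```

The number of keys corresponds to the number of representative sequences, which should match the entries in the matching rep_seqs file.
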
#### forLogoMaker
	 This folder contains matrices with the surprise values on the diagonal.
#### fungi
	This folder contains the UniProt and NCBI results that were cleaned by hand. 
	Manual cleaning criteria are described in the appendix of the main manuscript. 
### raw
Contains the downloaded protein sequences from UniProt and NCBI in raw fasta format.

## Figures
	The figures folder holds all visualisations of the fungal toxins data analysis.
### shared_lengths_fungi.png
	Depicts the length of animal toxins, bacterial toxins and fungal toxins.
### shared_aromaticity_0.1.png
	Depicts the aromaticity of animal toxins, bacterial toxins and fungal toxins.
### shared_logos_fungi.png
	Depicts the amino acid usage of animal toxins, bacterial toxins and fungal toxins as sequence logos.
### shared_pIs_1by2_fungi.png
	Depicts the isoelectric points of the proteins. Panel A shows all toxin sets. Panel B shows the fungal toxins in direct comparison with the non-toxic bacterial controls.
