User talk:Acumb002

Concatenating .faa/.ffn Files
cat *.faa > #name for concatenated file

or

cat #names of .faa files > #name for concatenated file
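A minimal sketch of the concatenation step, using two made-up .faa files so it runs anywhere. The isolate names, sequences, and the combined/ output folder are assumptions for illustration, not files from this project. Writing the output into a subfolder keeps a later re-run of cat *.faa from picking up the previous concatenated file as an input.

```shell
# Work in a scratch directory so the sketch does not touch real data
workdir=$(mktemp -d)
cd "$workdir"

# Two tiny stand-in protein files (hypothetical isolate names and sequences)
printf '>isoA_00001\nMKT\n>isoA_00002\nMLV\n' > isoA.faa
printf '>isoB_00001\nMKS\n' > isoB.faa

# Concatenate into a subfolder that the *.faa glob does not reach
mkdir -p combined
cat *.faa > combined/cat_faa_all.faa

# Every FASTA record starts with '>', so the header counts should agree
in_count=$(cat isoA.faa isoB.faa | grep -c '^>')
out_count=$(grep -c '^>' combined/cat_faa_all.faa)
echo "input records: $in_count, concatenated records: $out_count"
```

Comparing the two header counts is a quick check that no file was skipped or included twice.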

Cdhit for annotating .faa files
Download cd-hit master

Put cd-hit in your path: run pwd in the cd-hit folder and copy the path, open ~/.bashrc with nano ~/.bashrc, and add the line PATH=$PATH:/#paste your path here

Then reload the file:

source ~/.bashrc
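The PATH step can be sketched as below. The folder location is a placeholder (an assumption), and the sketch only changes the current shell session rather than editing ~/.bashrc. Note that the cd-hit source downloaded from GitHub has to be compiled with make inside its folder before the cd-hit binary exists, which may be why the command is not found even after the path is set.

```shell
# Hypothetical location of the unpacked cd-hit folder; use the output of pwd instead
CDHIT_DIR="$HOME/Desktop/cdhit-master"

# Line to place in ~/.bashrc (shown as a comment so this sketch does not edit the file):
#   export PATH="$PATH:/path/to/cdhit-master"

# Same change applied to the current shell session only
PATH="$PATH:$CDHIT_DIR"
export PATH

# Report whether the shell can now find cd-hit
if command -v cd-hit >/dev/null 2>&1; then
    echo "cd-hit found at: $(command -v cd-hit)"
else
    echo "cd-hit not found; check that $CDHIT_DIR contains the compiled binary"
fi
```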

cd-hit -i #concatenated file here -o #name for output .out -c 0.7


Example:

cd-hit -i cat_faa_all.faa -o cdhit_v1.out -c 0.7

Spring 2017 update:

We will be using .ffn (nucleotide) files instead of .faa (amino acid) files.

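cd-hit-est -i #concatenated .ffn files from all clades -o #name for output .out -c 0.8 -aS 0.3 -G 0 -M 0

Options: -c 0.8 is the sequence identity threshold; -aS 0.3 requires the alignment to cover at least 30% of the shorter sequence; -G 0 uses local rather than global sequence identity; -M 0 removes the memory limit.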

Cluster_search
perl cluster_search.pl #cdhit output .clstr #namelist #cdhit output .out #name for cluster search output

To make namelist: nano namelist

Example:

perl cluster_search.pl cdhit_v1.out.clstr namelist cdhit_v1.out cdhit_v1_all_cluster_output

Eggnog
Upload the cluster search output to eggnog. Select the HMMER mapping mode and Bacteria as the taxonomic scope. Use defaults for all other settings.

Once you get eggnog output, run this script:

perl eggnog_parse2.pl -i #cluster search output -a #eggnog output -o #name for new output file

Example:

perl ~/scripts/eggnog_parse2.pl -i cdhit_v2_noref_cluster_output -a cdhit_v2_noref_cluster_output.emapper.annotations_HMMR_080817 -o perl_output_file

TextWrangler on Eggnog Perl Output
1. Open new TextWrangler box.

2. Command F.

3. Make sure that grep and

Counting COGs in Excel
=COUNTIF(range, "character of interest")
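The same count can be done on the command line. This is a sketch that assumes a tab-separated table with the cluster name in column 1 and a single-letter COG category in column 2; that layout and the sample rows are made up for illustration.

```shell
# Build a small stand-in table (hypothetical cluster names and categories)
workdir=$(mktemp -d)
table="$workdir/cog_table.tsv"
printf 'cluster_1\tK\ncluster_2\tS\ncluster_3\tK\ncluster_4\tL\n' > "$table"

# Tally every category at once: one line per category with its count
cut -f2 "$table" | sort | uniq -c

# Count a single category, like =COUNTIF(range, "K"); -x matches whole lines only
k_count=$(cut -f2 "$table" | grep -c -x 'K')
echo "K: $k_count"
```

The sort | uniq -c line gives the full set of counts needed for a histogram in one pass, instead of one COUNTIF per category.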

Bioinformatics
MSIM Bioinformatics Project

March 14, 2017
Started working on assemblies for new sequences (13 mostly nonchromogenic mycos) with Alex C.

1. Copied all files into the MSIM_HTS_working folder in /cm/shared/courses/gauthier

2. Renamed and gunzip'd every file. Naming:
* 324_14.1.fastq
* 324_14.2.fastq

3. Ran fastqSample to get 25X and 50X coverage for each of the full coverage files. -l is 300 because this sequencing run was 300 bp paired-end reads

4. Ran fastqToCA on 324_14 only. This generates the .frg file that is used by runCA in assembly

5. Ran runCA on 324_14 only

Need to figure out how to use a qsub script to queue up a bunch of assemblies at once:

more celerasubmit.sh

#!/bin/tcsh

# Name of the job
#$ -N wgs

# Use the wgs parallel environment requesting 20 slots
#$ -pe wgs 20

# Load the latest version of Celera
module load /cm/shared/modulefiles/wgs/8.1

# Merge output/error into single file. Name of output file
#$ -j y
#$ -cwd
#$ -o output

# Modify runCA command below
runCA -d /scratch/dgauthie/MMAR3_75X/assemblies/1218R_S85_75X_v1s4 -p 1218R_S85_75X -s /scratch/dgauthie/MMAR3_50X/assemblies/MMAR_v4.spec /scratch/dgauthie/MMAR3_75X/1218R_S85_75X.frg

March 15 - March 22
* Continued (and finished) generating all of the .frg files and the subsequent assemblies. All assemblies are in the /cm/shared/courses/gauthier/ASSEMBLIES folder
* The 025x and 050x assemblies all have _025x_assembly or _050x_assembly after the bug name; full coverage assemblies are just labeled _assembly.

The fastqToCA code stayed the same. The path for runCA changed slightly: all assemblies after the first one were directed to create a new assembly folder inside the /ASSEMBLIES folder.

April 17, 2017

-Working on kSNP for new myco isolates

Created infile for kSNP (in /cm/shared/courses/gauthier/MSIM/kSPNPksnp_042017)

MakeKSNP3infile /cm/shared/courses/gauthier/MSIM/CTG/ ksnp_in_2017 A
 * then nano'd the infile and kept only the desired new assemblies from this semester

MakeFasta ksnp_in_2017 fasta_input

kSNP3 -k 21 -in /cm/shared/courses/gauthier/ksnp_in -outdir ksnp_out -annotate annotate_in -ML -core

April 19, 2017
Ran kSNP on all samples, including the ones from last semester.

April 22, 2017
Ran kSNP (files saved _v3) with same files as previously except for removing 324_648, 324_016, and 324_815.

Then, from local desktop:

Prokka
Ran Prokka after kSNP on all bugs. Ran it locally on my desktop after installing Prokka with Homebrew (code to install below). The installation instructions on Homebrew's website were very clear and helpful.

Example code for running prokka below.

CDHIT
Downloaded cdhit from github, moved the folder to my Desktop and added to my path. Still couldn't get the command to run, so did some googling and ran the command below on terminal from inside the cdhit folder. This worked!

First concatenated all of the .faa files, including the three references, into one .faa file. Then ran cd-hit using the command below.

Output is saved in a cdhit folder in the Bioinformatics folder on my desktop.

Cluster Search
Had some trouble running Dave's cluster_search.pl script and realized there were a couple of problems.

1. When we ran prokka, we did not designate a tag, so the output .faa files had strange prefixes for each protein name that were not associated with the bug id. Used the sed command below to change the prefix names for each of the .faa files. The text replaced (in the example below >IPLFPDIN) was different for each .faa file.

2. We were using the incorrect version of cluster_search.pl, the correct version is below:

Finally, the command to run the perl script cluster_search.pl is below: This code determines the core genome for all of the bugs.

EggNog
EggNog is a program used for functional annotation of genomes. The input file for EggNog is the output of the cluster_search.pl script (above). http://eggnogdb.embl.de/#/app/emapper

2017-08-08
Submitted eggnog job with the following specifications:

* Input file: cdhit_v2_noref_cluster_output
* Mapping mode: DIAMOND
* Taxonomic scope: Bacteria
* Left Gene Orthologs and Gene Ontology evidence categories as default

Also submitted another eggnog job using a different mapping mode:

* Input file: cdhit_v2_noref_cluster_output
* Mapping mode: HMMER
* Taxonomic scope: Bacteria
* Left Gene Orthologs and Gene Ontology evidence categories as default

Visualizing Eggnog Output
Dave sent a perl script that takes the output from eggnog and the original protein file that was sent to eggnog as inputs. The script takes the COG category from the eggnog output and appends it to the original protein file as a new column.

The script I ran on 8/9/17 is:

perl ~/scripts/eggnog_parse2.pl -i cdhit_v2_noref_cluster_output -a cdhit_v2_noref_cluster_output.emapper.annotations_HMMR_080817 -o perl_output_file

Opened the perl_output_file in TextWrangler and used regular expressions to clean it up. Removed all amino acid sequences so I am left only with the name of the cluster and the COG category for that cluster id. The regular expressions 'find' terms I used are below, each was replaced with nothing.

* ^(\d+.*)$
* ^[A-Z]+$
* ^\s+$
* ^\r
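The same cleanup can be approximated on the command line. This is only a sketch: the sample file layout below is an assumption (the real perl_output_file may differ), and grep drops the matching lines outright instead of replacing them with nothing, so no empty lines are left behind.

```shell
# Stand-in for perl_output_file (hypothetical layout and contents)
workdir=$(mktemp -d)
infile="$workdir/perl_output_file"
cleaned="$workdir/perl_output_clean"
printf '>cluster_1\tK\n12 some count line\nMKTLLVAG\n   \n>cluster_2\tS\nMLVRRA\n' > "$infile"

# tr strips carriage returns first (covers the ^\r pattern) so the anchors below work.
# grep -Ev then removes lines that:
#   start with a digit            (like ^(\d+.*)$)
#   are only capital letters      (like ^[A-Z]+$, the amino acid sequences)
#   are empty or only whitespace  (like ^\s+$)
tr -d '\r' < "$infile" | grep -Ev '^[0-9]|^[A-Z]+$|^[[:space:]]*$' > "$cleaned"

cat "$cleaned"
```

Only the cluster name and COG category lines should remain in the cleaned file.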

Re-Do Analysis with 3 clades
After meeting with Dave today (8/9/17), decided to start again with cdhit to run analysis separately on 3 clades based on the ksnp_v4 ML tree
 * still need to re-do analysis with all samples included.

cat 453_47.temp.faa wp_419.temp.faa wp_417.temp.faa wp_414.temp.faa > clade3_cat.faa

cat 164_251.temp.faa 324_935.temp.faa 324_692.temp.faa wp_407.temp.faa > simb_cat.faa

cat 324_844.temp.faa 453_83.temp.faa 324_137.temp.faa 324_635.temp.faa 324_669.temp.faa 324_569.temp.faa > sima_cat.faa

Then re-did cd-hit with each of the concatenated clade .faa files:

cd-hit -i clade3_cat.faa -o clade3_cat.out -c 0.7

Notes from meeting with Dr. G August 9 2017
Use the V4 maximum likelihood tree from ksnp to define 3 clades

Do cdhit on the 3 clades separately

Then eggnog on the 3 clades separately

83 and 47 are non-chromogens, separating out from clade SimA and SimB was natural division in Dave's original paper, seems like our bugs are also separating out that way

Also interesting to figure out what the wp guys are (SimA or SimB) - Dr. G is going to do this

Also need to meet with Jessica to figure out the pigments of the wp guys

Meeting with Dr. G, September 12, 2017
Alex and I were having trouble with running the cluster_search.pl script for clade 3. We previously got around this by removing wp_419 from the analysis, but obviously not a long term solution. Fixed this problem today by adding a '>' to the end of the clade3_cat.out file before running the cluster_search.pl command:
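One way to append the '>' is sketched below on a made-up stand-in file. Putting the '>' on its own final line is an assumption, since the exact form used is not recorded here.

```shell
# Stand-in for clade3_cat.out (hypothetical record)
workdir=$(mktemp -d)
f="$workdir/clade3_cat_example.out"
printf '>wp_419_00001\nMKT\n' > "$f"

# Append a lone '>' as the final line; >> adds to the file instead of overwriting it
printf '>\n' >> "$f"

# Confirm what the file now ends with
last_line=$(tail -n 1 "$f")
echo "last line: $last_line"
```

If the file does not already end with a newline, the '>' would be glued onto the last sequence line, so it is worth checking the tail of the file afterwards.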

Submitted the clade3_all_cluster_output file to Eggnog using the HMMER mapping mode to get COG category outputs. Will create histograms/pie charts of COG categories for each clade when the run is finished.

SED command to change random isolate assignments to clade
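sed options: -E uses extended regular expressions; s substitutes (replaces) the match; g applies the replacement globally, to every match on a line; | means "or"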

Command: sed -E 's/#name of isolate in clade|#name of isolate in clade/#clade name/g' #cdhit.out file from all clades > #new file name.out

Example: sed -E 's/wp_417|wp_414|wp_419|453_47/clade3/g' cdhit_ffn_renameAB.out > cdhit_ffn_renameAB3.out

To check: grep ">" #new file name.out
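A self-contained sketch of the rename and the check, run on a made-up three-record file. The header names and sequences are assumptions for illustration; only the sed pattern is taken from the example above.

```shell
# Stand-in for the cd-hit .out file (hypothetical headers and sequences)
workdir=$(mktemp -d)
in="$workdir/cdhit_example.out"
out="$workdir/cdhit_example_renamed.out"
printf '>wp_417_00012\nATGAAA\n>324_844_00007\nATGCCC\n>453_47_00101\nATGGGG\n' > "$in"

# Replace any of the listed isolate names with the clade name
sed -E 's/wp_417|wp_414|wp_419|453_47/clade3/g' "$in" > "$out"

# Check: list the headers and count how many now carry the clade name
grep '>' "$out"
clade3_headers=$(grep -c '^>clade3' "$out")
echo "headers renamed to clade3: $clade3_headers"
```

The pattern is not anchored, so it also matches inside longer names: an isolate called 453_475, for example, would be partly rewritten. Checking the grep output for unexpected names guards against that.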