User:Azertyproxy/sandbox

Sequence information
Sequence name: Teg100 (RsaOL)

Position: 857273..857201

Length: 73 bp

Comments
RNA-seq coverage: 318

Flanking genes in genome:

- SA0712 (position 812436..812672)

- uvrB (SA0713) (position 812935..814926)

Type of sequence: sRNA

Species / strains where RNA is found: Staphylococcus aureus, Staphylococcus epidermidis and Staphylococcus warneri

Protocol for alignment construction
Sequence name: Teg100 (RsaOL)

Initial sequence:

>chromosome (857273..857201)

TGTTTAACATTACTTCTTGTTATCGCCATATACAGACTTTAATTCTCGTCATTATTAAAGTTCTGTATTGGCG
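As a quick consistency check, the coordinates in the FASTA header can be compared with the sequence itself. The sketch below assumes the interval is 1-based and inclusive, and that decreasing coordinates denote the reverse strand; the helper names span_length and strand are illustrative and not taken from any library.

```python
# Sketch: check that the annotated coordinates and the sequence agree.
# Assumptions: 1-based inclusive coordinates; start > end means reverse strand.

TEG100_SEQ = (
    "TGTTTAACATTACTTCTTGTTATCGCCATATACAGACTTTAATTCTCGTCATTATTAAAGTTCTGTATTGGCG"
)
START, END = 857273, 857201  # coordinates from the FASTA header


def span_length(start: int, end: int) -> int:
    """Number of positions covered by an inclusive interval."""
    return abs(start - end) + 1


def strand(start: int, end: int) -> str:
    """'+' if coordinates increase, '-' if they decrease (assumed convention)."""
    return "+" if start <= end else "-"


# Print interval length, inferred strand and actual sequence length side by side.
print(span_length(START, END), strand(START, END), len(TEG100_SEQ))
```

If the interval length and the sequence length differ, either the coordinates or the extracted sequence should be rechecked before building the training set.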

First Training Set with BLASTN: A local BLAST run requires a properly formatted database. Our database comprises 4547 Staphylococcus genomes.

makeblastdb -in allbacs.fa -dbtype 'nucl'

First, we used BLASTN, which searches nucleotide databases with a nucleotide query. XML output can be obtained with the -outfmt 5 parameter. An E-value of 0.01 was used as the threshold. Finally, because shorter words should increase sensitivity, the word size parameter was reduced to 10:

blastn -db allbacs.fa -query seq.fa -outfmt 5 -evalue 0.01 -word_size 10

After discarding the redundant hits, our raw training set was composed of 5 distinct sequences. We found Teg100 in 3 species: Staphylococcus aureus, Staphylococcus epidermidis and Staphylococcus warneri.
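The filtering step can be sketched in Python using only the standard library. This is a minimal illustration, not the procedure actually used: it assumes that "redundant" means an identical aligned subject sequence, and the function names unique_hits and to_fasta are hypothetical. The element names (Hit, Hit_def, Hsp, Hsp_evalue, Hsp_hseq) are those of the BLAST XML output produced by -outfmt 5.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def unique_hits(blast_xml: str, max_evalue: float = 0.01) -> Dict[str, str]:
    """Map each distinct subject sequence to the definition of the first hit carrying it."""
    root = ET.fromstring(blast_xml)
    seen: Dict[str, str] = {}
    for hit in root.iter("Hit"):
        hit_def = hit.findtext("Hit_def", default="unknown")
        for hsp in hit.iter("Hsp"):
            evalue = float(hsp.findtext("Hsp_evalue", default="inf"))
            if evalue > max_evalue:
                continue  # below the significance threshold
            # Hsp_hseq is the aligned subject sequence; remove alignment gaps.
            seq = hsp.findtext("Hsp_hseq", default="").replace("-", "").upper()
            if seq and seq not in seen:
                seen[seq] = hit_def
    return seen


def to_fasta(hits: Dict[str, str]) -> str:
    """Render the de-duplicated hits as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for seq, name in hits.items())
```

For a result file on disk, the XML text can be read with open(path).read() before calling unique_hits; the FASTA text returned by to_fasta can then be written to a file for the alignment step.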

To establish the RNA secondary structure of our sequence, we used the LocARNA program, which aligns and folds a set of RNA sequences simultaneously. First, we produced a FASTA file from the BLASTN output. Then, we used the LocARNA server to predict a secondary structure. The output was in Clustal format.

HMMER is used to identify homologous nucleotide sequences by comparing a profile HMM to a database of sequences. First, we used Biopython to convert the predicted alignment from Clustal format to Stockholm format.

from Bio import SeqIO

# Convert the LocARNA alignment (Clustal format) to Stockholm format.
SeqIO.convert("seq.aln", "clustal", "seq.stk", "stockholm")

Then, we added the secondary structure to the Stockholm file.
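In a Stockholm file, a consensus secondary structure is carried by a "#=GC SS_cons" annotation line placed before the closing "//". The following sketch shows one way to add it programmatically. It assumes a single-block (non-interleaved) alignment and a structure string with one character per alignment column; add_ss_cons is a hypothetical helper, not a Biopython or Infernal function.

```python
def add_ss_cons(stockholm_text: str, structure: str) -> str:
    """Insert a '#=GC SS_cons' line before the '//' terminator of a Stockholm alignment."""
    lines = stockholm_text.rstrip("\n").split("\n")
    # Sequence lines are neither blank, markup ('#...'), nor the terminator.
    seq_lines = [
        ln for ln in lines
        if ln.strip() and not ln.startswith("#") and ln.strip() != "//"
    ]
    if not seq_lines:
        raise ValueError("no sequence lines found")
    aln_len = len(seq_lines[0].split()[1])
    if len(structure) != aln_len:
        raise ValueError(
            f"structure has {len(structure)} columns, alignment has {aln_len}"
        )
    # Pad the label so the structure lines up with the sequences (cosmetic only).
    name_width = max(len(ln.split()[0]) for ln in seq_lines)
    label = "#=GC SS_cons".ljust(name_width)
    out, inserted = [], False
    for ln in lines:
        if ln.strip() == "//" and not inserted:
            out.append(f"{label} {structure}")
            inserted = True
        out.append(ln)
    if not inserted:
        raise ValueError("missing '//' terminator")
    return "\n".join(out) + "\n"
```

The length check matters because the covariance model built later reads the structure column by column; a structure that is shorter or longer than the alignment would pair the wrong columns.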

Next, the HMMER program hmmbuild converts the Stockholm alignment into a profile hidden Markov model:

hmmbuild seq.hmm seq.stk

This creates a file named seq.hmm, which can be used for scanning genome sequences.

nhmmer seq.hmm allbacs.fa

We obtained 49 hits.

Infernal uses covariance models to describe a structural RNA sequence alignment and then locates instances of this model in a genome database. First, we built the covariance model from the Stockholm structural alignment seq.stk:

cmbuild seq.cm seq.stk

The output covariance model is written to the file seq.cm. To benefit from E-value calculation and performance-enhancing filters, the model must be calibrated with the cmcalibrate command:

cmcalibrate seq.cm

Then, we ran the search over our genome database:

cmsearch seq.cm allbacs.fa > result_infernal.out

This run took about 18 minutes and found 50 hits.
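Counting and extracting hits is easier from a tabular file than from the human-readable report. The sketch below assumes the search was re-run with the cmsearch option --tblout (not part of the command shown above) and that the table uses the default column layout, in which fields 0, 7, 8, 9, 14 and 15 hold the target name, sequence start, sequence end, strand, bit score and E-value. CmHit and parse_cmsearch_tblout are illustrative names.

```python
from typing import List, NamedTuple


class CmHit(NamedTuple):
    target: str
    seq_from: int
    seq_to: int
    strand: str
    score: float
    evalue: float


def parse_cmsearch_tblout(text: str) -> List[CmHit]:
    """Parse the whitespace-separated hit lines of a cmsearch --tblout table."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip header and footer comment lines
        f = line.split()
        hits.append(
            CmHit(
                target=f[0],
                seq_from=int(f[7]),
                seq_to=int(f[8]),
                strand=f[9],
                score=float(f[14]),
                evalue=float(f[15]),
            )
        )
    return hits
```

The length of the returned list gives the number of hits, and the coordinates and strand of each hit can be used to extract the matching subsequences before adding them to the training set.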

We added the HMMER and Infernal results to our training set and used it to re-run the programs described in the protocol above 3 times. Furthermore, we ran a BLASTN search against the NCBI nucleotide collection (nr/nt) and added one more sequence of interest. Our final training set was composed of 13 sequences from 3 species. Finally, the structural alignment shown below was obtained with the LocARNA server.

Structural Alignment (Stockholm format)
Sequence name: Teg100 (RsaOL)

....(((((........))))).(((((..(((((((((((((..........))))))))..))))).)))))
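The structure line above uses dot-bracket notation: matching parentheses mark paired columns and dots mark unpaired ones. A minimal sketch for recovering the base pairs with a stack follows; base_pairs is a hypothetical helper that handles only round brackets and treats every other character as unpaired.

```python
from typing import List, Tuple


def base_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return 0-based (open, close) column pairs of a dot-bracket string."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)  # remember where this stem position opened
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at column {i}")
            pairs.append((stack.pop(), i))  # pair with the most recent '('
        # any other character is treated as unpaired
    if stack:
        raise ValueError(f"unmatched '(' at column {stack[-1]}")
    return sorted(pairs)
```

Passing the structure line to base_pairs lists the paired columns of each stem, and a ValueError signals an unbalanced structure, which is worth checking before the line is added to a Stockholm file.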