Difference between revisions of "BuKyung In a multi-sequence FASTA file, produce statistics such as sequence number, average seq length, GC content, AT content, etc"

From Biolecture.org
imported>Baik BuKyung
(Created page with "<p>Back to Baik BuKyung</p> <hr /> <p> </p>")
 
imported>Baik BuKyung
Line 1: Line 1:
 
<p>Back to [[Baik BuKyung]]</p>
 
<p>Back to [[Baik BuKyung]]</p>
  
 +
<hr />
 +
<p><span style="font-size:24px">Source code:</span></p>
 +
 +
<hr />
 +
<div>
 +
<div>#!/usr/bin/perl<br />
 +
&nbsp;use strict;<br />
 +
&nbsp;use warnings;<br />
 +
&nbsp;open FH, &quot;&gt;&quot;, &quot;outer.fasta&quot; or die &quot;$!\n&quot;;<br />
 +
&nbsp;my $numberofseq=0;<br />
 +
&nbsp;my @matrix;<br />
 +
&nbsp;while(&lt;&gt;){<br />
 +
&nbsp;if($_=~ /&gt;/){<br />
 +
&nbsp;&nbsp;$matrix[$numberofseq]{seqname}=$_;<br />
 +
&nbsp;&nbsp;$matrix[$numberofseq]{seqname}=~ s/\n//;<br />
 +
&nbsp;&nbsp;<br />
 +
}<br />
 +
&nbsp;else{<br />
 +
&nbsp;&nbsp;$matrix[$numberofseq]{seq}=$_;<br />
 +
&nbsp;&nbsp; &nbsp;$matrix[$numberofseq]{seq}=~ s/\n//;<br />
 +
&nbsp;&nbsp;$numberofseq++;<br />
 +
}<br />
 +
}</div>
 +
 +
<div>&nbsp;</div>
 +
 +
<div>for(my $i=0;$i&lt;$numberofseq;$i++){<br />
 +
&nbsp;my $count=0;<br />
 +
&nbsp;$matrix[$i]{seqlen}=length($matrix[$i]{seq});<br />
 +
&nbsp;for(my $j=0;$j&lt;$matrix[$i]{seqlen};$j++){<br />
 +
&nbsp;&nbsp;my $seq_char=substr($matrix[$i]{seq},$j,1);<br />
 +
&nbsp;&nbsp;if($seq_char=~/[GC]/){<br />
 +
&nbsp;&nbsp;&nbsp;$count++;<br />
 +
&nbsp;&nbsp;}<br />
 +
&nbsp;&nbsp;$matrix[$i]{GC}=$count;<br />
 +
&nbsp;}<br />
 +
}</div>
 +
 +
<div>&nbsp;</div>
 +
 +
<div>my $total_seqlen=0;<br />
 +
my $total_GC=0;<br />
 +
for(my $i=0;$i&lt;$numberofseq;$i++){<br />
 +
&nbsp;print FH ($matrix[$i]{seqname},&quot;\n&quot;,$matrix[$i]{seq},&quot;\n GC content is:&quot;,$matrix[$i]{GC},&quot;\n&quot;);<br />
 +
&nbsp;$total_seqlen=$total_seqlen+$matrix[$i]{seqlen};<br />
 +
&nbsp;$total_GC=$total_GC+$matrix[$i]{GC};<br />
 +
}<br />
 +
print FH (&quot;Average sequence length is :&quot;,$total_seqlen/$numberofseq,&quot;\n GC contents:&quot;,$total_GC/$total_seqlen,&quot;\n AT contents:&quot;,1-($total_GC/$total_seqlen),&quot;\n&quot;)</div>
 +
</div>
 +
 +
<div>
 
<hr />
 
<hr />
 
<p>&nbsp;</p>
 
<p>&nbsp;</p>
 +
 +
<p><img alt="" src="/ckfinder/userfiles/images/%EC%BA%A1%EC%B2%9818.PNG" style="height:631px; width:1162px" /></p>
 +
 +
<p>&nbsp;</p>
 +
</div>
 +
 +
<div>
 +
<hr />
 +
<p>&nbsp;</p>
 +
 +
<p><span style="font-size:24px">Result</span></p>
 +
 +
<p><img alt="" src="/ckfinder/userfiles/images/%EC%BA%A1%EC%B2%9819.PNG" style="height:20px; width:406px" /></p>
 +
 +
<p>&nbsp;</p>
 +
 +
<p><span style="font-size:16px">After the 6.pl is executed with the 5_100-length_Seq.fasta file, the outer.fasta file is generated.</span></p>
 +
 +
<p><span style="font-size:16px">The original content in the tert_Human.fasta file contains 5 fasta sequences with each&nbsp;length 100. The made this file.</span></p>
 +
 +
<p>&nbsp;</p>
 +
 +
<p><em>&gt;0<br />
 +
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT<br />
 +
&gt;1<br />
 +
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT<br />
 +
&gt;2<br />
 +
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT<br />
 +
&gt;3<br />
 +
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG<br />
 +
&gt;4<br />
 +
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG</em><br />
 +
&nbsp;</p>
 +
 +
<p>&nbsp;</p>
 +
 +
<p>&nbsp;</p>
 +
 +
<p><span style="font-size:16px">The content of outer.fasta file&nbsp;is</span></p>
 +
 +
<p>&nbsp;</p>
 +
 +
<div>
 +
<div><em>&gt;0<br />
 +
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT<br />
 +
&nbsp;GC content is:0.49<br />
 +
&gt;1<br />
 +
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT<br />
 +
&nbsp;GC content is:0.5<br />
 +
&gt;2<br />
 +
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT<br />
 +
&nbsp;GC content is:0.48<br />
 +
&gt;3<br />
 +
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG<br />
 +
&nbsp;GC content is:0.55<br />
 +
&gt;4<br />
 +
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG<br />
 +
&nbsp;GC content is:0.5</em></div>
 +
 +
<div>&nbsp;</div>
 +
 +
<div><br />
 +
<em>Average sequence length is :100<br />
 +
&nbsp;GC contents:0.504<br />
 +
&nbsp;AT contents:0.496</em></div>
 +
</div>
 +
 +
<div>&nbsp;</div>
 +
 +
<div>&nbsp;</div>
 +
 +
<div><span style="font-size:16px">I just add GC content of each sequence to the end of each sequence. At the end of the file, Average sequence length, GC contents and AT contents are printed.</span></div>
 +
</div>

Revision as of 21:01, 16 June 2016

Back to Baik BuKyung


Source code:


#!/usr/bin/perl

 use strict;
 use warnings;
 open FH, ">", "outer.fasta" or die "$!\n";
 my $numberofseq=0;
 my @matrix;
 while(<>){
 if($_=~ />/){
  $matrix[$numberofseq]{seqname}=$_;
  $matrix[$numberofseq]{seqname}=~ s/\n//;
  
}
 else{
  $matrix[$numberofseq]{seq}=$_;
    $matrix[$numberofseq]{seq}=~ s/\n//;
  $numberofseq++;
}

}
 
for(my $i=0;$i<$numberofseq;$i++){

 my $count=0;
 $matrix[$i]{seqlen}=length($matrix[$i]{seq});
 for(my $j=0;$j<$matrix[$i]{seqlen};$j++){
  my $seq_char=substr($matrix[$i]{seq},$j,1);
  if($seq_char=~/[GC]/){
   $count++;
  }
  $matrix[$i]{GC}=$count;
 }

}
 
my $total_seqlen=0;

my $total_GC=0;
for(my $i=0;$i<$numberofseq;$i++){
 print FH ($matrix[$i]{seqname},"\n",$matrix[$i]{seq},"\n GC content is:",$matrix[$i]{GC},"\n");
 $total_seqlen=$total_seqlen+$matrix[$i]{seqlen};
 $total_GC=$total_GC+$matrix[$i]{GC};
}

print FH ("Average sequence length is :",$total_seqlen/$numberofseq,"\n GC contents:",$total_GC/$total_seqlen,"\n AT contents:",1-($total_GC/$total_seqlen),"\n")

 

 


 

Result

 

After the 6.pl is executed with the 5_100-length_Seq.fasta file, the outer.fasta file is generated.

The original content in the tert_Human.fasta file contains 5 fasta sequences with each length 100. The made this file.

 

>0
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT
>1
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT
>2
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT
>3
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG
>4
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG

 

 

 

The content of outer.fasta file is

 

>0

ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT
 GC content is:0.49
>1
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT
 GC content is:0.5
>2
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT
 GC content is:0.48
>3
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG
 GC content is:0.55
>4
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG

 GC content is:0.5
 

Average sequence length is :100
 GC contents:0.504

 AT contents:0.496
 
 
I just add GC content of each sequence to the end of each sequence. At the end of the file, Average sequence length, GC contents and AT contents are printed.