Difference between pages "Bioinformatics BIO431SB" and "BuKyung In a multi-sequence FASTA file, produce statistics such as sequence number, average seq length, GC content, AT content, etc"

From Biolecture.org
(Difference between pages)
imported>SaetByeol-Kim
 
imported>Baik BuKyung
 
Line 1: Line 1:
<p>&lt;p&gt;This is Sungwon's Bioinformatics Lecture Note&lt;/p&gt;<br />
+
<p>Back to [[Baik BuKyung]]</p>
&lt;p&gt;[[Biopedia by SungwonJeon]]&lt;/p&gt;<br />
+
 
&lt;h1&gt;Principles of Bioinformatics&lt;/h1&gt;<br />
+
<hr />
&lt;p&gt;&amp;nbsp;&lt;/p&gt;<br />
+
<p><span style="font-size:24px">Source code:</span></p>
&lt;h1&gt;Bioprogramming&lt;/h1&gt;<br />
+
 
&lt;p&gt;Human DNA has about 3 billion base pairs in a single cell. It is 2.79 GB text (1 letter is 1 byte). However, when Human DNA is sequenced, it may be sequenced 30X or more DNA (In case of NGS(Next Generation Sequencing)). So Raw sequenced DNA text file is more than 84 GB. Because of large amount of Raw data, it is difficult to analyze Raw data with our hands. That's why we need computer / computer program to analyze NGS data. We are not enough smart to analyze NGS data. To deal with computer program, we should know what program is , what programming is, what computer is.&lt;/p&gt;<br />
+
<hr />
&lt;h2&gt;What is programming?&lt;/h2&gt;<br />
+
<div>
 +
<div>#!/usr/bin/perl<br />
 +
&nbsp;use strict;<br />
 +
&nbsp;use warnings;<br />
 +
&nbsp;open FH, &quot;&gt;&quot;, &quot;outer.fasta&quot; or die &quot;$!\n&quot;;<br />
 +
&nbsp;my $numberofseq=0;<br />
 +
&nbsp;my @matrix;<br />
 +
&nbsp;while(&lt;&gt;){<br />
 +
&nbsp;if($_=~ /&gt;/){<br />
 +
&nbsp;&nbsp;$matrix[$numberofseq]{seqname}=$_;<br />
 +
&nbsp;&nbsp;$matrix[$numberofseq]{seqname}=~ s/\n//;<br />
 +
&nbsp;&nbsp;<br />
 +
}<br />
 +
&nbsp;else{<br />
 +
&nbsp;&nbsp;$matrix[$numberofseq]{seq}=$_;<br />
 +
&nbsp;&nbsp; &nbsp;$matrix[$numberofseq]{seq}=~ s/\n//;<br />
 +
&nbsp;&nbsp;$numberofseq++;<br />
 +
}<br />
 +
}</div>
 +
 
 +
<div>&nbsp;</div>
 +
 
 +
<div>for(my $i=0;$i&lt;$numberofseq;$i++){<br />
 +
&nbsp;my $count=0;<br />
 +
&nbsp;$matrix[$i]{seqlen}=length($matrix[$i]{seq});<br />
 +
&nbsp;for(my $j=0;$j&lt;$matrix[$i]{seqlen};$j++){<br />
 +
&nbsp;&nbsp;my $seq_char=substr($matrix[$i]{seq},$j,1);<br />
 +
&nbsp;&nbsp;if($seq_char=~/[GC]/){<br />
 +
&nbsp;&nbsp;&nbsp;$count++;<br />
 +
&nbsp;&nbsp;}<br />
 +
&nbsp;&nbsp;$matrix[$i]{GC}=$count;<br />
 +
&nbsp;}<br />
 +
}</div>
 +
 
 +
<div>&nbsp;</div>
 +
 
 +
<div>my $total_seqlen=0;<br />
 +
my $total_GC=0;<br />
 +
for(my $i=0;$i&lt;$numberofseq;$i++){<br />
 +
&nbsp;print FH ($matrix[$i]{seqname},&quot;\n&quot;,$matrix[$i]{seq},&quot;\n GC content is:&quot;,$matrix[$i]{GC},&quot;\n&quot;);<br />
 +
&nbsp;$total_seqlen=$total_seqlen+$matrix[$i]{seqlen};<br />
 +
&nbsp;$total_GC=$total_GC+$matrix[$i]{GC};<br />
 +
}<br />
 +
print FH (&quot;Average sequence length is :&quot;,$total_seqlen/$numberofseq,&quot;\n GC contents:&quot;,$total_GC/$total_seqlen,&quot;\n AT contents:&quot;,1-($total_GC/$total_seqlen),&quot;\n&quot;)</div>
 +
</div>
 +
 
 +
<div>
 +
<hr />
 +
<p>&nbsp;</p>
 +
 
 +
<p><img alt="" src="/ckfinder/userfiles/images/%EC%BA%A1%EC%B2%9818.PNG" style="height:631px; width:1162px" /></p>
 +
 
 +
<p>&nbsp;</p>
 +
</div>
 +
 
 +
<div>
 +
<hr />
 +
<p>&nbsp;</p>
 +
 
 +
<p><span style="font-size:24px">Result</span></p>
 +
 
 +
<p><img alt="" src="/ckfinder/userfiles/images/%EC%BA%A1%EC%B2%9819.PNG" style="height:20px; width:406px" /></p>
 +
 
 +
<p>&nbsp;</p>
 +
 
 +
<p><span style="font-size:16px">After the 6.pl is executed with the 5_100-length_Seq.fasta file, the outer.fasta file is generated.</span></p>
 +
 
 +
<p><span style="font-size:16px">The original content in the tert_Human.fasta file contains 5 fasta sequences with each&nbsp;length 100. The made this file.</span></p>
 +
 
 +
<p>&nbsp;</p>
 +
 
 +
<p><em>&gt;0<br />
 +
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT<br />
 +
&gt;1<br />
 +
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT<br />
 +
&gt;2<br />
 +
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT<br />
 +
&gt;3<br />
 +
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG<br />
 +
&gt;4<br />
 +
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG</em><br />
 
&nbsp;</p>
 
&nbsp;</p>
 +
 +
<p>&nbsp;</p>
 +
 +
<p>&nbsp;</p>
 +
 +
<p><span style="font-size:16px">The content of outer.fasta file&nbsp;is</span></p>
 +
 +
<p>&nbsp;</p>
 +
 +
<div>
 +
<div><em>&gt;0<br />
 +
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT<br />
 +
&nbsp;GC content is:0.49<br />
 +
&gt;1<br />
 +
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT<br />
 +
&nbsp;GC content is:0.5<br />
 +
&gt;2<br />
 +
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT<br />
 +
&nbsp;GC content is:0.48<br />
 +
&gt;3<br />
 +
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG<br />
 +
&nbsp;GC content is:0.55<br />
 +
&gt;4<br />
 +
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG<br />
 +
&nbsp;GC content is:0.5</em></div>
 +
 +
<div>&nbsp;</div>
 +
 +
<div><br />
 +
<em>Average sequence length is :100<br />
 +
&nbsp;GC contents:0.504<br />
 +
&nbsp;AT contents:0.496</em></div>
 +
</div>
 +
 +
<div>&nbsp;</div>
 +
 +
<div>&nbsp;</div>
 +
 +
<div><span style="font-size:16px">I just add GC content of each sequence to the end of each sequence. At the end of the file, Average sequence length, GC contents and AT contents are printed.</span></div>
 +
</div>

Revision as of 21:01, 16 June 2016

Back to Baik BuKyung


Source code:


#!/usr/bin/perl

 use strict;
 use warnings;
 open FH, ">", "outer.fasta" or die "$!\n";
 my $numberofseq=0;
 my @matrix;
 while(<>){
 if($_=~ />/){
  $matrix[$numberofseq]{seqname}=$_;
  $matrix[$numberofseq]{seqname}=~ s/\n//;
  
}
 else{
  $matrix[$numberofseq]{seq}=$_;
    $matrix[$numberofseq]{seq}=~ s/\n//;
  $numberofseq++;
}

}
 
for(my $i=0;$i<$numberofseq;$i++){

 my $count=0;
 $matrix[$i]{seqlen}=length($matrix[$i]{seq});
 for(my $j=0;$j<$matrix[$i]{seqlen};$j++){
  my $seq_char=substr($matrix[$i]{seq},$j,1);
  if($seq_char=~/[GC]/){
   $count++;
  }
  $matrix[$i]{GC}=$count;
 }

}
 
my $total_seqlen=0;

my $total_GC=0;
for(my $i=0;$i<$numberofseq;$i++){
 print FH ($matrix[$i]{seqname},"\n",$matrix[$i]{seq},"\n GC content is:",$matrix[$i]{GC},"\n");
 $total_seqlen=$total_seqlen+$matrix[$i]{seqlen};
 $total_GC=$total_GC+$matrix[$i]{GC};
}

print FH ("Average sequence length is :",$total_seqlen/$numberofseq,"\n GC contents:",$total_GC/$total_seqlen,"\n AT contents:",1-($total_GC/$total_seqlen),"\n")

 

 


 

Result

 

After the 6.pl is executed with the 5_100-length_Seq.fasta file, the outer.fasta file is generated.

The original content in the tert_Human.fasta file contains 5 fasta sequences with each length 100. The made this file.

 

>0
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT
>1
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT
>2
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT
>3
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG
>4
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG

 

 

 

The content of outer.fasta file is

 

>0

ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT
 GC content is:0.49
>1
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT
 GC content is:0.5
>2
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT
 GC content is:0.48
>3
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG
 GC content is:0.55
>4
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG

 GC content is:0.5
 

Average sequence length is :100
 GC contents:0.504

 AT contents:0.496
 
 
I just add GC content of each sequence to the end of each sequence. At the end of the file, Average sequence length, GC contents and AT contents are printed.