Difference between revisions of "BuKyung In a multi-sequence FASTA file, produce statistics such as sequence number, average seq length, GC content, AT content, etc"
imported>Baik BuKyung (Created page with "<p>Back to Baik BuKyung</p> <hr /> <p> </p>") |
imported>Baik BuKyung |
<p>Back to [[Baik BuKyung]]</p> | <p>Back to [[Baik BuKyung]]</p> | ||
+ | <hr /> | ||
+ | <p><span style="font-size:24px">Source code:</span></p> | ||
+ | |||
+ | <hr /> | ||
+ | <div> | ||
+ | <div>#!/usr/bin/perl<br /> | ||
+ | use strict;<br /> | ||
+ | use warnings;<br /> | ||
+ | open FH, ">", "outer.fasta" or die "$!\n";<br /> | ||
+ | my $numberofseq=0;<br /> | ||
+ | my @matrix;<br /> | ||
+ | while(<>){<br /> | ||
+ | if($_=~ />/){<br /> | ||
+ | $matrix[$numberofseq]{seqname}=$_;<br /> | ||
+ | $matrix[$numberofseq]{seqname}=~ s/\n//;<br /> | ||
+ | <br /> | ||
+ | }<br /> | ||
+ | else{<br /> | ||
+ | $matrix[$numberofseq]{seq}=$_;<br /> | ||
+ | $matrix[$numberofseq]{seq}=~ s/\n//;<br /> | ||
+ | $numberofseq++;<br /> | ||
+ | }<br /> | ||
+ | }</div> | ||
+ | |||
+ | <div> </div> | ||
+ | |||
+ | <div>for(my $i=0;$i<$numberofseq;$i++){<br /> | ||
+ | my $count=0;<br /> | ||
+ | $matrix[$i]{seqlen}=length($matrix[$i]{seq});<br /> | ||
+ | for(my $j=0;$j<$matrix[$i]{seqlen};$j++){<br /> | ||
+ | my $seq_char=substr($matrix[$i]{seq},$j,1);<br /> | ||
+ | if($seq_char=~/[GC]/){<br /> | ||
+ | $count++;<br /> | ||
+ | }<br /> | ||
+ | $matrix[$i]{GC}=$count;<br /> | ||
+ | }<br /> | ||
+ | }</div> | ||
+ | |||
+ | <div> </div> | ||
+ | |||
+ | <div>my $total_seqlen=0;<br /> | ||
+ | my $total_GC=0;<br /> | ||
+ | for(my $i=0;$i<$numberofseq;$i++){<br /> | ||
+ | print FH ($matrix[$i]{seqname},"\n",$matrix[$i]{seq},"\n GC content is:",$matrix[$i]{GC}/$matrix[$i]{seqlen},"\n");<br /> | ||
+ | $total_seqlen=$total_seqlen+$matrix[$i]{seqlen};<br /> | ||
+ | $total_GC=$total_GC+$matrix[$i]{GC};<br /> | ||
+ | }<br /> | ||
+ | print FH ("Average sequence length is :",$total_seqlen/$numberofseq,"\n GC contents:",$total_GC/$total_seqlen,"\n AT contents:",1-($total_GC/$total_seqlen),"\n")</div> | ||
+ | </div> | ||
+ | |||
+ | <div> | ||
<hr /> | <hr /> | ||
<p> </p> | <p> </p> | ||
+ | |||
+ | <p><img alt="" src="/ckfinder/userfiles/images/%EC%BA%A1%EC%B2%9818.PNG" style="height:631px; width:1162px" /></p> | ||
+ | |||
+ | <p> </p> | ||
+ | </div> | ||
+ | |||
+ | <div> | ||
+ | <hr /> | ||
+ | <p> </p> | ||
+ | |||
+ | <p><span style="font-size:24px">Result</span></p> | ||
+ | |||
+ | <p><img alt="" src="/ckfinder/userfiles/images/%EC%BA%A1%EC%B2%9819.PNG" style="height:20px; width:406px" /></p> | ||
+ | |||
+ | <p> </p> | ||
+ | |||
+ | <p><span style="font-size:16px">After 6.pl is run on the 5_100-length_Seq.fasta file, the outer.fasta file is generated.</span></p> | ||
+ | |||
+ | <p><span style="font-size:16px">The original tert_Human.fasta file contains five FASTA sequences, each of length 100. It was produced by an edited version of [[BuKyung Randomly generate five 100 AA long protein sequences and store them in a FASTA file]].</span></p> | ||
+ | |||
+ | <p> </p> | ||
+ | |||
+ | <p><em>>0<br /> | ||
+ | ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT<br /> | ||
+ | >1<br /> | ||
+ | CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT<br /> | ||
+ | >2<br /> | ||
+ | ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT<br /> | ||
+ | >3<br /> | ||
+ | CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG<br /> | ||
+ | >4<br /> | ||
+ | CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG</em><br /> | ||
+ | </p> | ||
+ | |||
+ | <p> </p> | ||
+ | |||
+ | <p> </p> | ||
+ | |||
+ | <p><span style="font-size:16px">The content of the outer.fasta file is:</span></p> | ||
+ | |||
+ | <p> </p> | ||
+ | |||
+ | <div> | ||
+ | <div><em>>0<br /> | ||
+ | ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT<br /> | ||
+ | GC content is:0.49<br /> | ||
+ | >1<br /> | ||
+ | CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT<br /> | ||
+ | GC content is:0.5<br /> | ||
+ | >2<br /> | ||
+ | ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT<br /> | ||
+ | GC content is:0.48<br /> | ||
+ | >3<br /> | ||
+ | CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG<br /> | ||
+ | GC content is:0.55<br /> | ||
+ | >4<br /> | ||
+ | CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG<br /> | ||
+ | GC content is:0.5</em></div> | ||
+ | |||
+ | <div> </div> | ||
+ | |||
+ | <div><br /> | ||
+ | <em>Average sequence length is :100<br /> | ||
+ | GC contents:0.504<br /> | ||
+ | AT contents:0.496</em></div> | ||
+ | </div> | ||
+ | |||
+ | <div> </div> | ||
+ | |||
+ | <div> </div> | ||
+ | |||
+ | <div><span style="font-size:16px">The GC content of each sequence is appended after that sequence. At the end of the file, the average sequence length, overall GC content, and overall AT content are printed.</span></div> | ||
+ | </div> |
Latest revision as of 21:13, 16 June 2016
Back to Baik BuKyung
Source code:
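#!/usr/bin/perl
use strict;
use warnings;

# Results are written to outer.fasta; input is read from the file(s) named on the command line.
open FH, ">", "outer.fasta" or die "$!\n";
my $numberofseq=0;
my @matrix;
while(<>){
    if($_=~ /^>/){
        # Header line: store the sequence name without its newline.
        $matrix[$numberofseq]{seqname}=$_;
        $matrix[$numberofseq]{seqname}=~ s/\n//;
    }
    else{
        # Sequence line: this assumes each sequence occupies exactly one line.
        $matrix[$numberofseq]{seq}=$_;
        $matrix[$numberofseq]{seq}=~ s/\n//;
        $numberofseq++;
    }
}
die "No sequences found\n" if $numberofseq==0;

# Count the G and C characters in each sequence.
for(my $i=0;$i<$numberofseq;$i++){
    my $count=0;
    $matrix[$i]{seqlen}=length($matrix[$i]{seq});
    for(my $j=0;$j<$matrix[$i]{seqlen};$j++){
        my $seq_char=substr($matrix[$i]{seq},$j,1);
        if($seq_char=~/[GC]/){
            $count++;
        }
    }
    $matrix[$i]{GC}=$count;
}

my $total_seqlen=0;
my $total_GC=0;
for(my $i=0;$i<$numberofseq;$i++){
    # Print the GC count divided by the length, so the value is a fraction as in the result below.
    print FH ($matrix[$i]{seqname},"\n",$matrix[$i]{seq},"\n GC content is:",$matrix[$i]{GC}/$matrix[$i]{seqlen},"\n");
    $total_seqlen=$total_seqlen+$matrix[$i]{seqlen};
    $total_GC=$total_GC+$matrix[$i]{GC};
}
print FH ("Average sequence length is :",$total_seqlen/$numberofseq,"\n GC contents:",$total_GC/$total_seqlen,"\n AT contents:",1-($total_GC/$total_seqlen),"\n");
close FH;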
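The script above stores one input line per sequence, so it assumes every FASTA record fits on a single line. The following is only a minimal sketch of one way to relax that assumption. The helper names `read_fasta` and `gc_fraction` are hypothetical and not part of the original script, and the sketch swaps the character-by-character loop for `tr///` in counting mode. Unlike the original, it also counts lowercase g and c.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a filehandle,
# joining sequence lines that wrap across several lines.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;              # strip Unix or Windows line endings
        next if $line eq '';
        if ($line =~ /^>/) {
            push @records, { seqname => $line, seq => '' };
        }
        elsif (@records) {
            $records[-1]{seq} .= $line;    # append a wrapped sequence line
        }
    }
    return @records;
}

# Hypothetical helper: fraction of G and C (either case) in a sequence.
sub gc_fraction {
    my ($seq) = @_;
    return 0 if length($seq) == 0;         # avoid division by zero
    my $gc = ($seq =~ tr/GCgc//);          # counting mode: $seq is not modified
    return $gc / length($seq);
}

# Demonstration on an in-memory FASTA whose first record wraps over two lines.
my $fasta = ">a\nGGCC\nAATT\n>b\nACGT\n";
open my $in, '<', \$fasta or die "$!\n";
my @records = read_fasta($in);
close $in;

for my $r (@records) {
    printf "%s length=%d GC=%.2f\n",
        $r->{seqname}, length($r->{seq}), gc_fraction($r->{seq});
}
```

With the demonstration input, record `>a` is assembled as `GGCCAATT` (length 8) and both records have a GC fraction of 0.50.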
Result
After 6.pl is run on the 5_100-length_Seq.fasta file, the outer.fasta file is generated.
The original tert_Human.fasta file contains five FASTA sequences, each of length 100. It was produced by an edited version of BuKyung Randomly generate five 100 AA long protein sequences and store them in a FASTA file.
>0
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT
>1
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT
>2
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT
>3
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG
>4
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG
The content of the outer.fasta file is:
>0
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT
GC content is:0.49
>1
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT
GC content is:0.5
>2
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT
GC content is:0.48
>3
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG
GC content is:0.55
>4
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG
GC content is:0.5
Average sequence length is :100
GC contents:0.504
AT contents:0.496
The GC content of each sequence is appended after that sequence. At the end of the file, the average sequence length, overall GC content, and overall AT content are printed.
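The summary figures can be checked by hand from the per-sequence values. Each sequence has length 100, so the reported GC fractions 0.49, 0.5, 0.48, 0.55 and 0.5 correspond to 49, 50, 48, 55 and 50 G/C bases. The sketch below redoes that arithmetic with the counts hard-coded from the result above. The variable names are illustrative and do not come from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Per-sequence G/C counts and lengths, taken from the output shown above.
my @gc_counts = (49, 50, 48, 55, 50);
my @lengths   = (100) x 5;

my ($total_gc, $total_len) = (0, 0);
$total_gc  += $_ for @gc_counts;
$total_len += $_ for @lengths;

my $avg_len = $total_len / @lengths;   # 500 / 5
my $gc      = $total_gc / $total_len;  # 252 / 500
my $at      = 1 - $gc;                 # valid only if every base is A, C, G or T

printf "Average length: %d\nGC: %.3f\nAT: %.3f\n", $avg_len, $gc, $at;
```

This prints an average length of 100, a GC content of 0.504 and an AT content of 0.496, in agreement with the summary lines. Computing AT as 1 minus GC would overstate the AT content if the input contained ambiguity codes such as N.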