BuKyung In a multi-sequence FASTA file, produce statistics such as sequence number, average seq length, GC content, AT content, etc
Source code:
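use strict;
use warnings;
# Write the per-sequence report and the summary statistics to outer.fasta.
open FH, ">", "outer.fasta" or die "$!\n";
my $numberofseq=0;
my @matrix;
# Read the FASTA input; each record is one header line and one sequence line.
while(<>){
    next if $_=~ /^\s*$/;    # skip blank lines
    if($_=~ /^>/){
        $matrix[$numberofseq]{seqname}=$_;
        $matrix[$numberofseq]{seqname}=~ s/\n//;
    }
    else{
        $matrix[$numberofseq]{seq}=$_;
        $matrix[$numberofseq]{seq}=~ s/\n//;
        $numberofseq++;
    }
}
die "No sequences found\n" if $numberofseq==0;
# Count the G and C characters in each sequence.
for(my $i=0;$i<$numberofseq;$i++){
    my $count=0;
    $matrix[$i]{seqlen}=length($matrix[$i]{seq});
    for(my $j=0;$j<$matrix[$i]{seqlen};$j++){
        my $seq_char=substr($matrix[$i]{seq},$j,1);
        if($seq_char=~/[GC]/){
            $count++;
        }
    }
    $matrix[$i]{GC}=$count;
}
my $total_seqlen=0;
my $total_GC=0;
for(my $i=0;$i<$numberofseq;$i++){
    # Report GC content as a fraction of the sequence length.
    my $gc_fraction=$matrix[$i]{seqlen} ? $matrix[$i]{GC}/$matrix[$i]{seqlen} : 0;
    print FH ($matrix[$i]{seqname},"\n",$matrix[$i]{seq},"\nGC content is:",$gc_fraction,"\n");
    $total_seqlen=$total_seqlen+$matrix[$i]{seqlen};
    $total_GC=$total_GC+$matrix[$i]{GC};
}
# Summary: mean length, and G+C over all bases in the file.
print FH ("Average sequence length is :",$total_seqlen/$numberofseq,"\n");
print FH ("GC contents:",($total_seqlen ? $total_GC/$total_seqlen : 0),"\n");
close FH;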
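The script above expects exactly one sequence line per record and counts G and C one character at a time. The page title also mentions AT content, which the listing does not report. The following is a minimal sketch, not the author's script, of one way to handle wrapped sequences and to get both GC and AT fractions with Perl's tr/// operator. The sub names read_fasta and gc_at_content and the demo records are assumptions made for illustration.

```perl
use strict;
use warnings;

# Parse FASTA text from a filehandle into a list of { seqname, seq } records.
# Wrapped sequences (several lines per record) are joined together.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\s+$//;           # drop the newline and trailing whitespace
        next if $line eq '';
        if ($line =~ /^>/) {
            push @records, { seqname => $line, seq => '' };
        }
        elsif (@records) {
            $records[-1]{seq} .= uc $line;
        }
    }
    return @records;
}

# Return (GC fraction, AT fraction) for one sequence.
# tr/// with an empty replacement list counts characters without changing them.
sub gc_at_content {
    my ($seq) = @_;
    my $len = length $seq;
    return (0, 0) if $len == 0;
    my $gc = ($seq =~ tr/GC//);
    my $at = ($seq =~ tr/AT//);
    return ($gc / $len, $at / $len);
}

# Hypothetical demo input held in memory; a real run would open a file instead.
my $demo = ">demo1\nACGT\nGGCC\n>demo2\nATATATAT\n";
open(my $in, '<', \$demo) or die "$!\n";
for my $rec (read_fasta($in)) {
    my ($gc, $at) = gc_at_content($rec->{seq});
    print "$rec->{seqname} length=", length($rec->{seq}), " GC=$gc AT=$at\n";
}
close($in);
```

For sequences that contain only A, C, G and T, the two fractions sum to 1. Ambiguity codes such as N count toward the length but toward neither fraction.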
Result
Running 6.pl on the 5_100-length_Seq.fasta file generates the outer.fasta file.
The original tert_Human.fasta file contains 5 FASTA sequences, each of length 100. This file was made with an edited version of "BuKyung Randomly generate five 100 AA long protein sequences and store them in a FASTA file".
>0
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT
>1
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT
>2
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT
>3
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG
>4
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG
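A test file like the five records listed above can be produced by a small generator. The edited script that actually made the file is not shown on this page, so the following is only a sketch under assumptions: the sub names random_dna and random_fasta are hypothetical, bases are drawn uniformly from A, C, G and T, and the result is printed to standard output rather than written to a named file.

```perl
use strict;
use warnings;

# Build one random DNA sequence of the requested length.
sub random_dna {
    my ($length) = @_;
    my @bases = qw(A C G T);
    return join '', map { $bases[int rand @bases] } 1 .. $length;
}

# Return FASTA text with $count records named >0, >1, ...
sub random_fasta {
    my ($count, $length) = @_;
    my $text = '';
    for my $i (0 .. $count - 1) {
        $text .= ">$i\n" . random_dna($length) . "\n";
    }
    return $text;
}

# Five sequences of length 100, the same shape as the input listed above.
print random_fasta(5, 100);
```

Redirecting the output to a file gives an input that can be passed to the statistics script on the command line.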
The content of the outer.fasta file is:
>0
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT
GC content is:0.49
>1
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT
GC content is:0.5
>2
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT
GC content is:0.48
>3
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG
GC content is:0.55
>4
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG
GC content is:0.5
Average sequence length is :100
GC contents:0.504
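The two summary lines can be checked by hand from the per-sequence values. The sketch below assumes the G+C counts 49, 50, 48 and 55 implied by the reported fractions for sequences 0 to 3 (each sequence has length 100). The fifth count, 50, comes from counting G and C in sequence 4 directly. The variable names are illustrative only.

```perl
use strict;
use warnings;

# G+C counts for sequences 0..4 (reported fraction times length 100).
my @gc_counts  = (49, 50, 48, 55, 50);
my $seq_length = 100;

my $total_gc = 0;
$total_gc += $_ for @gc_counts;

# An array in numeric context gives its element count (5 here).
my $total_len  = $seq_length * @gc_counts;
my $avg_len    = $total_len / @gc_counts;
my $overall_gc = $total_gc / $total_len;     # 252 / 500

print "Average sequence length is :$avg_len\n";
print "GC contents:$overall_gc\n";
```

Because all five sequences have the same length, dividing the total G+C count by the total length gives the same value as averaging the five per-sequence fractions. With sequences of different lengths the two would generally differ.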