6) Producing statistics of a multi-sequence FASTA file: sequence number, average seq length, GC content, AT content

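Here I created a subroutine that opens the file, skips the header lines, and joins all sequence lines into one string. The string is then split into an array of single nucleotides, and each base (A, C, G, T, N) is counted. Because every header line is skipped, a multi-sequence file is treated as one concatenated sequence: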

#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl script.pl input.fasta
my $file = $ARGV[0];
die "Usage: $0 <fasta_file>\n" unless defined $file;
open(my $fh, '<', $file) or die "Can't open the file '$file': $!\n";

# Read the FASTA file, skip header lines (starting with '>'),
# and join all sequence lines into one string.
sub readingfa {
    my ($in) = @_;
    my $string = "";
    while (my $line = <$in>) {
        chomp $line;
        next if $line =~ /^>/;
        $string .= $line;
    }
    return $string;
}

my $seq = readingfa($fh);
close($fh);

my $length = length $seq;
die "No sequence data found in '$file'\n" if $length == 0;

# Count each nucleotide. Only uppercase A, C, G, T and N are counted;
# any other character contributes to the length only.
my ($A, $C, $G, $T, $N) = (0, 0, 0, 0, 0);
foreach my $nuc (split //, $seq) {
    if    ($nuc eq "C") { $C++; }
    elsif ($nuc eq "G") { $G++; }
    elsif ($nuc eq "A") { $A++; }
    elsif ($nuc eq "T") { $T++; }
    elsif ($nuc eq "N") { $N++; }
}

# Percentages relative to the total sequence length
my $GC_content = (($C + $G) / $length) * 100;
my $AT_content = (($A + $T) / $length) * 100;

print "Fasta file statistics: \n";
print "Sequence length = $length\n";
print "Number of A: $A\n";
print "Number of C: $C\n";
print "Number of T: $T\n";
print "Number of G: $G\n";
print "Number of N: $N\n";
print "GC content is = $GC_content\n";
print "AT content is = $AT_content\n";

----------------------------------------------------------------------------------------------

Example output of the Perl script:

Fasta file statistics:
Sequence length = 156040895
Number of A: 46754807
Number of C: 30523780
Number of T: 46916701
Number of G: 30697741
Number of N: 1147861
GC content is = 39.2342795777991
AT content is = 60.0301017242948
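
The script above joins every record into one string, so it does not report the sequence number or the average sequence length named in the title. The following is a minimal sketch of one way to get them, not a definitive implementation. The helper name `fasta_stats` and the in-memory demo data are made up for illustration. It also assumes that lowercase (soft-masked) bases should be counted, which differs from the script above, and it counts with `tr///` per line instead of splitting the whole sequence into an array, which keeps memory use low on large files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): reads FASTA records from
# an open filehandle and returns a hash reference with summary statistics.
sub fasta_stats {
    my ($fh) = @_;
    my %s = (sequences => 0, total_length => 0, gc => 0, at => 0, n => 0);

    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;        # strip Unix or Windows line endings
        if ($line =~ /^>/) {         # each header line starts a new record
            $s{sequences}++;
            next;
        }
        $s{total_length} += length $line;
        # tr/// with an empty replacement list counts characters
        # without changing the string.
        # Assumption: lowercase (soft-masked) bases are counted as well.
        $s{gc} += ($line =~ tr/GCgc//);
        $s{at} += ($line =~ tr/ATat//);
        $s{n}  += ($line =~ tr/Nn//);
    }

    # Guard against empty input to avoid division by zero.
    $s{average_length} = $s{sequences}    ? $s{total_length} / $s{sequences} : 0;
    $s{gc_content}     = $s{total_length} ? 100 * $s{gc} / $s{total_length}  : 0;
    $s{at_content}     = $s{total_length} ? 100 * $s{at} / $s{total_length}  : 0;
    return \%s;
}

# Demo on a made-up in-memory FASTA, so the sketch runs without an input file.
my $demo = ">seq1\nACGT\nACGT\n>seq2\nGGCCNN\n";
open(my $fh, '<', \$demo) or die "Can't open in-memory FASTA: $!\n";
my $stats = fasta_stats($fh);
close($fh);

print  "Number of sequences = $stats->{sequences}\n";
print  "Total length = $stats->{total_length}\n";
printf "Average sequence length = %.2f\n", $stats->{average_length};
printf "GC content = %.2f\n", $stats->{gc_content};
printf "AT content = %.2f\n", $stats->{at_content};
```

To run it on a real file, replace the in-memory handle with `open(my $fh, '<', $ARGV[0])` and pass the FASTA path on the command line.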
