Changes

BuKyung In a multi-sequence FASTA file, produce statistics such as sequence number, average seq length, GC content, AT content, etc

3,418 bytes added, 21:01, 16 June 2016


<p>Back to [[Baik BuKyung]]</p>
<hr />
<p><span style="font-size:24px">Source code:</span></p>
<hr />
<div><div>#!/usr/bin/perl<br />
use strict;<br />
use warnings;<br />
# Results go to outer.fasta; input is read from the file named on the command line (or STDIN).<br />
open FH, "&gt;", "outer.fasta" or die "$!\n";<br />
my $numberofseq=0;<br />
my @matrix;<br />
# Assumes each record is one header line followed by one sequence line.<br />
while(&lt;&gt;){<br />
&nbsp;if($_=~ /&gt;/){<br />
&nbsp;&nbsp;$matrix[$numberofseq]{seqname}=$_;<br />
&nbsp;&nbsp;$matrix[$numberofseq]{seqname}=~ s/\n//;<br />
&nbsp;}<br />
&nbsp;else{<br />
&nbsp;&nbsp;$matrix[$numberofseq]{seq}=$_;<br />
&nbsp;&nbsp;$matrix[$numberofseq]{seq}=~ s/\n//;<br />
&nbsp;&nbsp;$numberofseq++;<br />
&nbsp;}<br />
}</div>
<div> </div>
<div># Count G and C characters in each sequence.<br />
for(my $i=0;$i&lt;$numberofseq;$i++){<br />
&nbsp;my $count=0;<br />
&nbsp;$matrix[$i]{seqlen}=length($matrix[$i]{seq});<br />
&nbsp;for(my $j=0;$j&lt;$matrix[$i]{seqlen};$j++){<br />
&nbsp;&nbsp;my $seq_char=substr($matrix[$i]{seq},$j,1);<br />
&nbsp;&nbsp;if($seq_char=~/[GC]/){<br />
&nbsp;&nbsp;&nbsp;$count++;<br />
&nbsp;&nbsp;}<br />
&nbsp;}<br />
&nbsp;$matrix[$i]{GC}=$count;<br />
}</div>
<div> </div>
<div>my $total_seqlen=0;<br />
my $total_GC=0;<br />
for(my $i=0;$i&lt;$numberofseq;$i++){<br />
&nbsp;# Print the per-sequence GC content as a fraction of the sequence length.<br />
&nbsp;print FH ($matrix[$i]{seqname},"\n",$matrix[$i]{seq},"\n GC content is:",$matrix[$i]{GC}/$matrix[$i]{seqlen},"\n");<br />
&nbsp;$total_seqlen=$total_seqlen+$matrix[$i]{seqlen};<br />
&nbsp;$total_GC=$total_GC+$matrix[$i]{GC};<br />
}<br />
print FH ("Average sequence length is :",$total_seqlen/$numberofseq,"\n GC contents:",$total_GC/$total_seqlen,"\n AT contents:",1-($total_GC/$total_seqlen),"\n");<br />
close FH;</div></div>
<div><hr /><p> </p> <p><img alt="" src="/ckfinder/userfiles/images/%EC%BA%A1%EC%B2%9818.PNG" style="height:631px; width:1162px" /></p> <p> </p></div>
<div><hr /><p> </p> <p><span style="font-size:24px">Result</span></p>
<p><img alt="" src="/ckfinder/userfiles/images/%EC%BA%A1%EC%B2%9819.PNG" style="height:20px; width:406px" /></p> <p> </p>
<p><span style="font-size:16px">After 6.pl is executed with the 5_100-length_Seq.fasta file, the outer.fasta file is generated.</span></p>
<p><span style="font-size:16px">The original tert_Human.fasta file contains 5 FASTA sequences, each of length 100. This file was generated beforehand.</span></p>
<p>&nbsp;</p>
<p><em>&gt;0<br />
ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT<br />
&gt;1<br />
CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT<br />
&gt;2<br />
ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT<br />
&gt;3<br />
CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG<br />
&gt;4<br />
CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG</em><br />

 </p>

<p><span style="font-size:16px">The content of outer.fasta file is</span></p>

<div><em>&gt;0<br />

ACCACTACTAAGCGCATGAACGACTGTTAGGTTTCCGATGGCTGCTTGCGTTCCGTGTTCCAGCTGACTGGGCTGAACTATTTGTAATGTTGGTTGCACT<br />

 GC content is:0.49<br />

>1<br />

CAGGTACACGGACTGTTTGGTTTGCCCAATTAATTGGCGGGTCGTAAACCGGTTTTTCGTTGGGCGCGGAGTTGTCGTAAACGGTCGGTATTAACTACCT<br />

 GC content is:0.5<br />

>2<br />

ATATTCTGTTCGAAGGCGAGGCCTTAATAAACGGGCTCACACTATACGTTTCTAGCGTGCCAGTACGCGTATGCCCTGAGCAGCATCTTGAATAGTCCTT<br />

 GC content is:0.48<br />

>3<br />

CACGTCTTGAGGCATGCTCACATAACTTGGGATTGATACAATCGGGGGACGGTAGCGGGGCTAGTGGGCATCGTCGGCGGTCTACGAGCAAAAGTATCAG<br />

 GC content is:0.55<br />

>4<br />

CAGGACGTGAACCGAAAGCTGCACACCTATACTATCGTAGTATACCACCGTTCCGTAAATCCATCGCTGATCCTGCCATGAAGGGCTAAGTACGCATGAG<br />

 GC content is:0.5</em></div>

<em>Average sequence length is :100<br />

 GC contents:0.504<br />

 AT contents:0.496</em></div>

</div>

<div><span style="font-size:16px">The script appends the GC content of each sequence directly after that sequence. At the end of the file, the average sequence length, the overall GC content, and the overall AT content are printed.</span></div>

</div>
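<div><span style="font-size:16px">The script above expects every sequence to sit on a single line and counts G and C one character at a time. As a minimal sketch only, assuming sequences may wrap over several lines and may contain lowercase letters, the same statistics could be computed as follows. The sub names parse_fasta, gc_count and fasta_stats and the tiny example input are hypothetical and are not part of 6.pl.</span></div>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse FASTA text into a list of { name, seq } records.
# Sequence lines are concatenated, so wrapped (multi-line) records work too.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\n/, $text) {
        $line =~ s/\r$//;                 # tolerate Windows line endings
        next if $line =~ /^\s*$/;         # skip blank lines
        if ($line =~ /^>(.*)/) {
            push @records, { name => $1, seq => '' };
        }
        elsif (@records) {
            $records[-1]{seq} .= $line;
        }
    }
    return @records;
}

# Count G and C (either case). tr/// with an empty replacement list
# returns the number of matching characters without changing the string.
sub gc_count {
    my ($seq) = @_;
    return ($seq =~ tr/GCgc//);
}

# Summarise: number of sequences, mean length, overall GC and AT fractions.
# AT is taken as 1 - GC, as in the script above, so any other symbol
# (for example N) is counted toward AT.
sub fasta_stats {
    my (@records) = @_;
    my ($total_len, $total_gc) = (0, 0);
    for my $r (@records) {
        $total_len += length $r->{seq};
        $total_gc  += gc_count($r->{seq});
    }
    my $n = scalar @records;
    return {
        count    => $n,
        mean_len => $n         ? $total_len / $n            : 0,
        gc       => $total_len ? $total_gc / $total_len     : 0,
        at       => $total_len ? 1 - $total_gc / $total_len : 0,
    };
}

# Hypothetical two-record input; the first record wraps over two lines.
my $example = ">a\nGGCC\nAATT\n>b\nATAT\n";
my @recs    = parse_fasta($example);
my $stats   = fasta_stats(@recs);
printf "%d sequences, mean length %g, GC %.3f, AT %.3f\n",
    @{$stats}{qw(count mean_len gc at)};
```

<div><span style="font-size:16px">Dividing the summed GC count by the summed length, as 6.pl also does, weights every base equally; averaging the per-sequence fractions would give the same value only when all sequences have the same length, as they do in the 5-sequence input shown here.</span></div>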

Anonymous user

imported>Baik BuKyung
