Perl Read File Into Memory for Parsing

Perl script running slow with big files


Hi all,

I've written a perl script which compares two vcf files and writes the output to a third file. It runs fine with small files, but when I input big files (in GBs), my system becomes slow and it hangs.

I'm using the dbSNP vcf file as input, which is around 9GB.

Please suggest something that will make my perl script run faster. This is my first perl script and I'm new to perl.

Any help would be appreciated !!

Thank you !!!!

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Data::Dumper;
    use List::MoreUtils;

    open (FILE, "<", "dbSNP_in.vcf") or die "failed to open: $!\n";
    my @array = (<FILE>);
    my @CHR;
    my @location;
    my @rs;
    my @ref_n;
    my @alt_n;

    foreach (@array) {
        chomp;
        my ($chrom, $pos, $id, $ref, $alt, $qual, $filter, $info) = split(/\t/, $_);
        push @CHR, $chrom;
        push @location, $pos;
        push @rs, $id;
        push @ref_n, $ref;
        push @alt_n, $alt;
    }

    open (FILE1, "<", "trial_rs.vcf") or die "failed to open: $!\n";
    my @array1 = (<FILE1>);

    open (OUT, ">trial_output.vcf");
    my @columns;

    foreach (@array1) {
        chomp;
        @columns = split(/\t/, $_);
        my $i;
        for ($i = 0; $i < @array; $i++) {
            if (($columns[0] eq $CHR[$i]) and ($columns[1] eq $location[$i])
                and ($columns[3] eq $ref_n[$i]) and ($columns[4] eq $alt_n[$i])) {
                $columns[2] = $rs[$i];
            }
        }
        print OUT join("\t", @columns), "\n";
    }



The problem is this:

    my @array = (<FILE>); ## slurp a whole Terabyte into RAM

Do not read a file like this unless it is very small or your memory is huge; instead use:

    while (<FILE>) { # read each line, then forget about it
        chomp;
        split /\t/;
        ....
    }

In addition, your code is extremely inefficient:

You are iterating over all entries of a huge array for each line of a huge file. Hashes in Perl provide constant-time access to elements given the hash key. Use a hash of hashes to store your filter table.

You seem to filter a large file by a small file, therefore:

  • process the small file first (using the while construct), parsing the small file into a hash that uses the location as the key of a nested hash
  • process the large file as above and look up each of its entries in the hash created before

As a result you will only need as much memory as is required to store the small file. A minimal sketch of this approach follows.
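The sketch below is an illustration of that strategy, not the answerer's exact code. It reuses the file names from the script above and assumes plain tab-separated VCF records, with one record per CHROM/POS in the small file:

    use strict;
    use warnings;
    use autodie;

    # Pass 1: the small file fits in memory; keep each record and index it in
    # a hash of hashes keyed by CHROM, then POS.
    my ( @records, %wanted );
    open my $small, "<", "trial_rs.vcf";
    while (<$small>) {
        chomp;
        my @columns = split /\t/;
        push @records, \@columns;
        next if /^#/;    # keep header lines for output, but do not index them
        $wanted{ $columns[0] }{ $columns[1] } = \@columns;
    }
    close $small;

    # Pass 2: stream the big dbSNP file line by line; a constant-time hash
    # lookup replaces the inner loop over every dbSNP entry.
    open my $big, "<", "dbSNP_in.vcf";
    while (<$big>) {
        next if /^#/;
        my ( $chrom, $pos, $id, $ref, $alt ) = split /\t/, $_, 6;
        my $rec = $wanted{$chrom}{$pos} or next;
        $rec->[2] = $id if $rec->[3] eq $ref && $rec->[4] eq $alt;
    }
    close $big;

    # Write the (possibly updated) records back out.
    open my $out, ">", "trial_output.vcf";
    print {$out} join( "\t", @$_ ), "\n" for @records;
    close $out;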

Consider the following refactoring:

    use strict;
    use warnings;
    use autodie;

    my ( @array, %hash );

    open my $FILE, "<", "dbSNP_in.vcf";

    while (<$FILE>) {
        $hash{"@array[0, 1, 3, 4]"} = $array[2] if @array = split /\t/, $_, 6;
    }

    close $FILE;

    open my $OUT,   ">", "trial_output.vcf";
    open my $FILE1, "<", "trial_rs.vcf";

    while (<$FILE1>) {
        my @columns = split /\t/;

        $columns[2] = $hash{"@columns[0, 1, 3, 4]"}
          if exists $hash{"@columns[0, 1, 3, 4]"};

        print $OUT join "\t", @columns;
    }

    close $OUT;
    close $FILE1;
  • autodie is used to trap all I/O errors.
  • chomp is not needed within either loop.
  • Splitting only the amount needed is faster than splitting it all.
  • An interpolated array slice (@array[0, 1, 3, 4], which produces a string comprised of elements 0, 1, 3, 4) is used as a hash key with a corresponding value of array element 2--for use later.
  • An interpolated array slice, as a hash key, is tested for existence, and $columns[2] is set to that key's corresponding value if that key exists. Doing this, instead of using four eq statements, should speed up the file processing (a tiny illustration of such a key follows).
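For illustration only (the field values below are made up): with Perl's default list separator $" of a single space, the interpolated slice flattens the four matched fields into one string that serves as the hash key.

    my @columns = ( "chr1", "10177", ".", "A", "AC" );   # hypothetical VCF fields
    my $key = "@columns[0, 1, 3, 4]";                    # interpolated array slice
    print "$key\n";                                      # prints: chr1 10177 A AC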

Hope this helps!

I'd suggest using Tabix to index the larger file and use the tabix perl module to query it for each locus in the smaller VCF file. That's likely the optimal solution in terms of time and memory; a rough sketch follows below the link.

http://samtools.sourceforge.net/tabix.shtml
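The sketch below assumes the Tabix.pm bindings that ship with the tabix source tree (method names new/query/read as in its synopsis; check the module actually installed on your system), and that the big file has first been compressed with bgzip and indexed with tabix -p vcf:

    use strict;
    use warnings;
    use Tabix;    # assumption: the Perl bindings from the tabix distribution are installed

    # One-time preparation on the command line:
    #   bgzip dbSNP_in.vcf
    #   tabix -p vcf dbSNP_in.vcf.gz

    my $tabix = Tabix->new( -data => "dbSNP_in.vcf.gz" );

    open my $in,  "<", "trial_rs.vcf"     or die "failed to open: $!\n";
    open my $out, ">", "trial_output.vcf" or die "failed to open: $!\n";

    while (<$in>) {
        chomp;
        if (/^#/) {                      # pass header lines straight through
            print {$out} $_, "\n";
            next;
        }
        my @columns = split /\t/;

        # Query only the single position of this record (0-based start assumed).
        my $iter = $tabix->query( $columns[0], $columns[1] - 1, $columns[1] );
        while ( my $hit = $tabix->read($iter) ) {
            my @db = split /\t/, $hit;
            $columns[2] = $db[2]
                if $db[1] == $columns[1]
                && $db[3] eq $columns[3]
                && $db[4] eq $columns[4];
        }
        print {$out} join( "\t", @columns ), "\n";
    }

    close $in;
    close $out;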

There are better answers that address your specific issue, but they do not answer the general problem: what do you do if your Perl script is slow?

You profile your code. NYTProf + kcachegrind are good tools for figuring out which parts of your code need attention:

                    perl -d:NYTProf testme.pl;  nytprofcg;  kcachegrind nytprof.callgrind                                      

When you cannot make that faster, see if you can somehow parallelize the code. Can you split your input into chunks, compute each chunk individually and merge the output when you are done? This technique is called map-reduce.

For smaller jobs (requiring fewer than 1000 CPUs) GNU Parallel can often help with chunking the input and running the jobs. You just need to merge the final output.

If you still need more speed, you will need to look at a compiled language such as C++. Again you will use the profiler to figure out which parts of your code need the speedup, and instead of rewriting everything in C++ you can address just the small part of the code that is the bottleneck. The good part about this is that much of the error handling and corner cases can often be left to Perl.
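As a toy illustration of that last point, here is a sketch using Inline::C (an assumption on my part; the answer above mentions C++, and the hot spot below is made up rather than taken from the OP's script):

    use strict;
    use warnings;

    # The tight inner loop lives in C; Perl keeps the I/O and error handling.
    use Inline C => <<'END_C';
    long sum_to(long n) {
        /* stand-in for a real numeric hot spot */
        long i, s = 0;
        for (i = 1; i <= n; i++) s += i;
        return s;
    }
    END_C

    print sum_to(10000), "\n";    # prints 50005000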



Source: https://www.biostars.org/p/106737/
