Perl Read File Into Memory for Parsing
Perl script running slow with big files

Hi all, I've written a perl script which compares two VCF files and writes the output to a third file. It runs fine with small files, but when I feed it big files (several GB) my system slows down and the script hangs. I'm using the dbSNP VCF file as input, which is around 9 GB. Please suggest something that will make my perl script run faster. This is my first perl script and I'm new to Perl. Any help would be appreciated!! Thank you!! The script is below:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use List::MoreUtils;

# Read the whole dbSNP VCF into memory and split its columns into parallel arrays.
open (FILE, "<", "dbSNP_in.vcf") or die "failed to open: $!\n";
my @array = (<FILE>);
my (@CHR, @location, @rs, @ref_n, @alt_n);
foreach (@array) {
    chomp;
    my ($chrom, $pos, $id, $ref, $alt, $qual, $filter, $info) = split(/\t/, $_);
    push @CHR,      $chrom;
    push @location, $pos;
    push @rs,       $id;
    push @ref_n,    $ref;
    push @alt_n,    $alt;
}

# For every line of the query VCF, scan all dbSNP entries for a match
# and copy the rs ID into column 3.
open (FILE1, "<", "trial_rs.vcf") or die "failed to open: $!\n";
my @array1 = (<FILE1>);
open (OUT, ">", "trial_output.vcf") or die "failed to open: $!\n";
my @columns;
foreach (@array1) {
    chomp;
    @columns = split(/\t/, $_);
    for (my $i = 0; $i < @array; $i++) {
        if (    ($columns[0] eq $CHR[$i])
            and ($columns[1] eq $location[$i])
            and ($columns[3] eq $ref_n[$i])
            and ($columns[4] eq $alt_n[$i])) {
            $columns[2] = $rs[$i];
        }
    }
    print OUT join("\t", @columns), "\n";
}
The problem is this:

my @array = (<FILE>);    ## slurp a whole Terabyte into RAM
Do not read a file like this unless it is very small or your memory is huge; instead use:

while (<FILE>) {    # read each line, then forget about it
    chomp;
    split /\t/;
    ....
}

In addition, your code is extremely inefficient: you are iterating over all entries of a huge array for each line of a huge file. Hashes in Perl provide constant-time access to elements given the hash key, so use a hash of hashes (or, as below, a single hash keyed on the relevant columns) to store your filter table. You seem to filter a large file by a small file, so you will only need as much memory as is required to store the small file. Consider the following refactoring:
use strict;
use warnings;
use autodie;

my ( @array, %hash );

# Build a lookup table from the dbSNP file:
# key = "CHROM POS REF ALT", value = the rs ID.
open my $FILE, "<", "dbSNP_in.vcf";
while (<$FILE>) {
    $hash{"@array[0, 1, 3, 4]"} = $array[2]
        if @array = split /\t/, $_, 6;
}
close $FILE;

# Stream the query file and substitute the rs ID where the key matches.
open my $OUT,   ">", "trial_output.vcf";
open my $FILE1, "<", "trial_rs.vcf";
while (<$FILE1>) {
    my @columns = split /\t/;
    $columns[2] = $hash{"@columns[0, 1, 3, 4]"}
        if exists $hash{"@columns[0, 1, 3, 4]"};
    print $OUT join "\t", @columns;
}
close $OUT;
close $FILE1;
A few notes on the refactoring:

- autodie is used to trap all I/O errors.
- chomp is not needed inside either loop.
- splitting only the number of fields needed (the third argument to split) is faster than splitting them all.
- "@array[0, 1, 3, 4]" (which produces a string made up of elements 0, 1, 3 and 4) is used as a hash key, with array element 2 as the corresponding value, for use later.
- $columns[2] is set to that key's corresponding value if the key exists. Doing this, instead of using four eq comparisons per dbSNP entry, should speed up the file processing.

Hope this helps!

I'd suggest using Tabix to index the larger file and using the tabix perl module to query it for each locus in the smaller VCF file. That's likely the best solution in terms of both time and memory: http://samtools.sourceforge.net/tabix.shtml
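For illustration only (this is not from the original thread), here is a minimal sketch of that idea that shells out to the tabix command-line tool rather than using the Perl module. It assumes the dbSNP file has already been compressed and indexed (bgzip dbSNP_in.vcf && tabix -p vcf dbSNP_in.vcf.gz); the file names follow the question, everything else is illustrative. Because a tabix process is started for every record, this only makes sense when the query VCF is small.

#!/usr/bin/env perl
# Sketch only: annotate a small VCF by querying a tabix-indexed dbSNP VCF.
use strict;
use warnings;

my $dbsnp = 'dbSNP_in.vcf.gz';   # assumed to be bgzip-compressed and tabix-indexed

open my $in,  '<', 'trial_rs.vcf'     or die "open: $!";
open my $out, '>', 'trial_output.vcf' or die "open: $!";

while (my $line = <$in>) {
    chomp $line;
    if ($line =~ /^#/) {                  # pass VCF header lines through untouched
        print {$out} "$line\n";
        next;
    }
    my @cols = split /\t/, $line;
    my ($chrom, $pos, $ref, $alt) = @cols[0, 1, 3, 4];

    # Ask tabix for dbSNP records covering this single position.
    open my $tbx, '-|', 'tabix', $dbsnp, "$chrom:$pos-$pos" or die "tabix: $!";
    while (my $hit = <$tbx>) {
        chomp $hit;
        my @db = split /\t/, $hit;
        if ($db[1] eq $pos and $db[3] eq $ref and $db[4] eq $alt) {
            $cols[2] = $db[2];            # copy the rs ID
            last;
        }
    }
    close $tbx;

    print {$out} join("\t", @cols), "\n";
}
close $in;
close $out;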
There are better answers above that address your specific issue, but they do not answer the general problem: what do you do when your perl script is slow? You profile your code. NYTProf plus kcachegrind are good tools to figure out which parts of your code need attention:

perl -d:NYTProf testme.pl; nytprofcg; kcachegrind nytprof.callgrind

When you cannot make that faster, see whether you can parallelize the code: can you split your input into chunks, compute each chunk individually, and merge the output when you are done? This technique is called map-reduce. For smaller jobs (requiring fewer than 1000 CPUs) GNU Parallel can often help with chunking the input and running the jobs; you then just need to merge the final output.
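To make the chunk/compute/merge idea concrete, here is a rough fork-based sketch (again, not from the original thread). It assumes the input has already been split into files named chunk.00, chunk.01, ... (for example with the split command), and process_chunk() is a placeholder for the real per-chunk work.

#!/usr/bin/env perl
# Sketch only: process pre-split chunk files in parallel with fork(),
# then merge the per-chunk outputs in order ("map-reduce" by hand).
use strict;
use warnings;

my @chunks = sort glob 'chunk.[0-9][0-9]';   # assumed to exist already
die "no chunk files found\n" unless @chunks;

# "map" step: one child process per chunk
my @pids;
for my $chunk (@chunks) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                         # child: do the work and exit
        process_chunk($chunk, "$chunk.out");
        exit 0;
    }
    push @pids, $pid;                        # parent: remember the child
}
waitpid($_, 0) for @pids;                    # wait for all workers to finish

# "reduce" step: concatenate the per-chunk outputs in chunk order
open my $merged, '>', 'merged_output.vcf' or die "open: $!";
for my $chunk (@chunks) {
    open my $in, '<', "$chunk.out" or die "open: $!";
    print {$merged} $_ while <$in>;
    close $in;
}
close $merged;

sub process_chunk {
    my ($in_file, $out_file) = @_;
    open my $in,  '<', $in_file  or die "open: $!";
    open my $out, '>', $out_file or die "open: $!";
    while (<$in>) {
        # ... the real per-line work (e.g. the rs ID lookup) goes here ...
        print {$out} $_;
    }
    close $in;
    close $out;
}

A real script would cap the number of simultaneous children (for example with Parallel::ForkManager), which is essentially what GNU Parallel handles for you.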
If you still need more speed, you will have to look at a compiled language such as C++. Again, use the profiler to figure out which parts of your code need the speedup; instead of rewriting everything in C++ you can address just the small part of the code that is the bottleneck. The nice part about this is that much of the error handling and the corner cases can often be left to Perl.

Source: https://www.biostars.org/p/106737/