Split 10 Billion Line File Into 5,000 Files By Column Value In Perl Or Python
Solution 1:
awk to the rescue!

awk 'f!=$1{close(f)} {f=$1; print >> f}' file

It processes the input row by row and keeps only one output file open at a time. Because it appends with >>, a column value that reappears later in the input is still handled correctly: its file is simply reopened without being truncated.
If you split the original file into chunks, this can be done more efficiently in parallel, with the generated per-chunk files merged back afterwards (they need to be tagged with their chunk if the original order needs to be preserved); see the Python sketch below.
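For illustration, here is a minimal Python sketch of that idea (hypothetical: the chunk size, the split_chunk and chunks helpers, and the <key>.partNNNNNN naming convention are all invented for this example, not part of the original answer). As in the other solutions, the column value is used directly as the output file name. Each chunk is split by a worker process, and the per-chunk pieces are then concatenated back in chunk order, which preserves the original line order within each key:

import glob
import os
import shutil
import sys
from multiprocessing import Pool

CHUNK_LINES = 1_000_000  # lines per chunk; tune to your machine

def split_chunk(job):
    """Write one chunk's lines into per-key part files."""
    chunk_id, lines = job
    handles = {}
    for line in lines:
        key = line.split(None, 1)[0]
        fh = handles.get(key)
        if fh is None:
            # Zero-padded chunk ids keep the later merge in input order
            fh = handles[key] = open(f"{key}.part{chunk_id:06d}", "w")
        fh.write(line)
    for fh in handles.values():
        fh.close()

def chunks(path):
    """Yield (chunk_id, list_of_lines) pieces of the input file."""
    with open(path) as f:
        buf, chunk_id = [], 0
        for line in f:
            buf.append(line)
            if len(buf) >= CHUNK_LINES:
                yield chunk_id, buf
                buf, chunk_id = [], chunk_id + 1
        if buf:
            yield chunk_id, buf

if __name__ == "__main__":
    with Pool() as pool:
        # imap_unordered streams the chunks instead of materialising them all
        for _ in pool.imap_unordered(split_chunk, chunks(sys.argv[1])):
            pass
    # Merge the per-chunk pieces back; lexicographic order of the
    # zero-padded suffixes preserves the original line order per key.
    for part in sorted(glob.glob("*.part??????")):
        key = part.rsplit(".part", 1)[0]
        with open(key, "a") as out, open(part) as src:
            shutil.copyfileobj(src, out)
        os.remove(part)

Note that each worker can still run into the per-process limit on open files if a chunk contains many distinct keys; the next solution discusses how to handle that.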
Solution 2:
You can keep a hash (an associative array) mapping column values to open output file handles, and open an output file only if none is open for that column value yet.
This will be good enough unless you hit your limit on the maximum number of open files. (Use ulimit -Hn in bash to see it.) If you do, either you need to close file handles (e.g. a random one, or the one that hasn't been used for the longest, which is easy to keep track of in another hash), or you need to make multiple passes over the input, processing only as many column values as you can open output files for in one pass and skipping them in later passes.
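As a minimal sketch of the first variant (assuming whitespace-separated input with the key in the first column, and taking the input path as the script's only argument), this might look as follows in Python. When open() fails with EMFILE ("Too many open files"), an arbitrary handle is evicted and the open is retried; files seen before are reopened in append mode so no output is lost:

import errno
import sys

handles = {}    # column value -> open file handle
opened = set()  # column values whose output files already exist

with open(sys.argv[1]) as infile:
    for line in infile:
        key = line.split(None, 1)[0]
        fh = handles.get(key)
        while fh is None:
            mode = "a" if key in opened else "w"
            try:
                fh = handles[key] = open(key, mode)
                opened.add(key)
            except OSError as e:
                if e.errno != errno.EMFILE or not handles:
                    raise
                # Out of descriptors: close an arbitrary handle and retry
                victim_key, victim_fh = handles.popitem()
                victim_fh.close()
        fh.write(line)

for fh in handles.values():
    fh.close()

Evicting an arbitrary handle is in the spirit of the reasoning in Solution 3 below: without knowing how the input is ordered, no eviction policy is clearly better than a random one.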
Solution 3:
This program will do as you ask. It expects the input file as a parameter on the command line, and it writes output files whose names are taken from the first column of each input record.

It keeps a hash %fh of file handles and a parallel hash %opened of flags that indicate whether a given file has ever been opened before. A file is opened for append if it appears in the %opened hash, or for write if it has never been opened before. If the limit on open files is hit then a (random) selection of 1,000 file handles is closed. There is no point in keeping track of when each handle was last used and closing the most out-of-date handles: if the data in the input file is randomly ordered then every handle in the hash has the same chance of being the next one needed; alternatively, if the data is already sorted then none of the closed file handles will ever be needed again.
use strict;
use warnings 'all';

my %fh;      # open file handles, keyed by the first column value
my %opened;  # column values whose output files have been created already

while ( <> ) {

    my ($tag) = split;

    if ( not exists $fh{$tag} ) {

        # Append if the file has been opened before, otherwise create it
        my $mode = $opened{$tag} ? '>>' : '>';

        # Retry the open until it succeeds
        while () {

            eval {
                open $fh{$tag}, $mode, $tag
                    or die qq{Unable to open "$tag" for output: $!};
            };

            if ( not $@ ) {
                $opened{$tag} = 1;
                last;
            }

            die $@ unless $@ =~ /Too many open files/;

            # Out of file descriptors: close up to 1,000 of the open handles
            my $n;
            for my $tag ( keys %fh ) {
                my $fh = delete $fh{$tag};
                next unless defined $fh;    # skip the entry autovivified by the failed open
                close $fh or die $!;
                last if ++$n >= 1_000 or keys %fh == 0;
            }
        }
    }

    print { $fh{$tag} } $_;
}

close $_ or die $! for values %fh;