
Split 10 Billion Line File Into 5,000 Files By Column Value In Perl Or Python

I have a 10 billion line tab-delimited file that I want to split into 5,000 sub-files based on the value of the first column. How can I do this efficiently in Perl or Python?

Solution 1:

awk to the rescue!

awk 'f!=$1{close(f)} {f=$1; print >> f}' file

It processes the input row by row and keeps only one output file open at a time. Appending with >> matters: a column value that reappears after its file has been closed must not truncate what was already written.
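For the Python half of the question, here is a minimal sketch of the same one-file-open-at-a-time approach. Assumptions: the input path is passed as the first argument, every line has at least one tab, and the first column is safe to use directly as a file name.

import sys

out = None       # the single currently open output file
current = None   # the column value it belongs to

with open(sys.argv[1]) as src:
    for line in src:
        key = line.split('\t', 1)[0]
        if key != current:
            if out is not None:
                out.close()
            # open for append, so a value that reappears after its file
            # was closed does not truncate what was already written
            out = open(key, 'a')
            current = key
        out.write(line)

if out is not None:
    out.close()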

If you split the original file into chunks, the work can be done in parallel on the chunks and the generated files merged back afterwards (they need to be tagged with their chunk number if the original order must be preserved), as sketched below.
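A rough Python sketch of that parallel variant. Assumptions: the input has already been split into line-aligned chunk files (for example with GNU split -n l/8), the chunk file names are passed as arguments in their original order, and the out.* and merged/ names are purely illustrative.

import glob, os, shutil, sys
from concurrent.futures import ProcessPoolExecutor

def split_chunk(chunk):
    # Each worker writes into its own directory so workers never share files
    outdir = 'out.' + os.path.basename(chunk)
    os.makedirs(outdir, exist_ok=True)
    out, current = None, None
    with open(chunk) as src:
        for line in src:
            key = line.split('\t', 1)[0]
            if key != current:
                if out is not None:
                    out.close()
                out = open(os.path.join(outdir, key), 'a')
                current = key
            out.write(line)
    if out is not None:
        out.close()

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        list(pool.map(split_chunk, sys.argv[1:]))
    # Merge: visiting the chunk directories in sorted (i.e. chunk) order
    # preserves the original line order within each output file
    os.makedirs('merged', exist_ok=True)
    for path in sorted(glob.glob('out.*/*')):
        dst = os.path.join('merged', os.path.basename(path))
        with open(path, 'rb') as part, open(dst, 'ab') as whole:
            shutil.copyfileobj(part, whole)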

Solution 2:

You can keep a hash (an associative array) mapping column values to open output file handles, and open an output file only if none is open for that column value yet.

This will be good enough unless you hit the limit on the maximum number of open files. (Use ulimit -Hn to see it in bash.) If you do, either you need to close file handles (e.g. a random one, or the one that has gone unused the longest, which is easy to track in another hash), or you need to make multiple passes over the input, processing only as many column values as you can open output files for in one pass and skipping them in later passes.
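Here is a sketch of that in Python, using an OrderedDict as the least-recently-used bookkeeping this answer describes. Assumptions: MAX_OPEN is an invented budget comfortably below ulimit -n, and the output files do not exist before the run, since everything is opened in append mode.

import sys
from collections import OrderedDict

MAX_OPEN = 1000                  # assumed budget, below the open-file limit
handles = OrderedDict()          # column value -> handle, oldest use first

with open(sys.argv[1]) as src:
    for line in src:
        key = line.split('\t', 1)[0]
        fh = handles.pop(key, None)   # pop and re-insert to record the use
        if fh is None:
            if len(handles) >= MAX_OPEN:
                # Evict the handle that has gone unused the longest
                _, oldest = handles.popitem(last=False)
                oldest.close()
            # Append, because this value's file may have been evicted before
            fh = open(key, 'a')
        handles[key] = fh
        fh.write(line)

for fh in handles.values():
    fh.close()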

Solution 3:

This program will do as you ask. It expects the input file as a parameter on the command line and writes output files whose names are taken from the first column of the input records.

It keeps a hash %fh of file handles and a parallel hash %opened of flags indicating whether a given file has ever been opened before. A file is opened for append if it appears in the %opened hash, or for write if it has never been opened before. If the limit on open files is hit, then a (random) selection of 1,000 file handles is closed. There is no point in tracking when each handle was last used and closing the stalest ones: if the data in the input file is randomly ordered, then every handle in the hash has the same chance of being the next one needed; alternatively, if the data is already sorted, then none of the closed file handles will ever be used again.

use strict;
use warnings 'all';

my %fh;        # column value => open output file handle
my %opened;    # column value => file has been opened at least once

while ( <> ) {

    # The first whitespace-separated field names the output file
    my ($tag) = split;

    if ( not exists $fh{$tag} ) {

        # Append if this file was opened (and later closed) before,
        # otherwise open for write so a fresh run starts it empty
        my $mode = $opened{$tag} ? '>>' : '>';

        while ( 1 ) {

            eval {
                open $fh{$tag}, $mode, $tag or die qq{Unable to open "$tag" for output: $!};
            };

            if ( not $@ ) {
                $opened{$tag} = 1;
                last;
            }

            die $@ unless $@ =~ /Too many open files/;

            # Discard the slot left behind by the failed open
            delete $fh{$tag};

            # Close an arbitrary 1,000 open handles, then retry the open
            my $n;
            for my $tag ( keys %fh ) {
                my $fh = delete $fh{$tag};
                close $fh or die $!;
                last if ++$n >= 1_000 or keys %fh == 0;
            }
        }
    }

    print { $fh{$tag} } $_;
}

close $_ or die $! for values %fh;
