File::Sip: a Perl module to read huge text files with limited memory

When slurping makes you sick, try to sip!

Even though we live in a world where buying a server with 500 GB of RAM is possible, there can always be a situation where we don’t have enough memory available. What about log files from a web server that handles a billion requests every day? What if you need to parse these files as efficiently as possible with a limited amount of main memory?

When the file is small enough, or the memory is big enough, it’s not an issue: you just bring in File::Slurp::Tiny, slurp the whole file into memory, and access whichever line you like. But when the file outgrows the available memory, the rules of the game change: you need to be able to access any line of the file without loading the whole content into memory.

What do you do? Well, that could be an interesting job-interview question!

At work we were in a similar situation a couple of months ago: we wanted a way to iterate over a file without loading it into memory. After a quick search on MetaCPAN, we realized nothing was doing what we wanted, so, as good citizens, we implemented it and are now releasing it to CPAN. So, let me introduce File::Sip.

I could have entitled this blog post “When slurping makes you sick, try to sip”, and that’s the whole idea: instead of loading the whole content of the file into memory, File::Sip builds an index of the position of each line’s first character, accessible with the corresponding line number. Make no mistake, File::Sip is slower than File::Slurp::Tiny, because it needs an init phase that scans the whole file to build its index. But if you want to parse a 10 GB file on a system where the current process only has 4 GB of RAM, it won’t be a problem for File::Sip: only the current line and the index are in memory, nothing more.
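To make the indexing idea concrete, here is a minimal sketch in core Perl. It is not File::Sip's actual implementation or API: the helper names `build_line_index` and `read_line_at` are hypothetical, and the sketch assumes newline-terminated lines read in raw (byte) mode. It only illustrates the general technique of recording where each line starts and seeking there later.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of File::Sip): scan the file once and
# record the byte offset at which each line starts. Only the offsets
# are kept in memory, never the line contents.
sub build_line_index {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my @offsets;
    my $pos = 0;
    while (defined(my $line = <$fh>)) {
        push @offsets, $pos;    # where the line we just read began
        $pos = tell $fh;        # where the next line will begin
    }
    close $fh;
    return \@offsets;
}

# Hypothetical helper: jump straight to line $n (0-based) using the
# index and read only that line. Returns undef for an unknown line.
sub read_line_at {
    my ($fh, $offsets, $n) = @_;
    return if $n < 0 || $n > $#$offsets;
    seek $fh, $offsets->[$n], 0 or die "Cannot seek: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}
```

With these helpers, one would call `build_line_index($path)` once, open the file in raw mode, and then call `read_line_at($fh, $index, $n)` as often as needed, in any order. The cost of the initial scan is paid once, and memory use grows with the number of lines rather than with the size of the file.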

The project has been released on CPAN and is hosted, as usual, in Weborama’s GitHub repo. Comments, issues, patches and forks are welcome!

6 thoughts on “File::Sip a perl module to read huge text files with limited memory”

  1. Nice idea, another module used to read a part of a file in the memory is Tie::File, it would be interesting to compare these 2 modules.

    mestia

  2. What’s the benefit of File::Sip over regular use of the diamond operator, particularly if you just want to iterate over it line by line? Ie

    open(my $fh, '<', '02packages.details.txt');
    while (my $line = <$fh>) {

    }

    Using File::Sip is significantly slower, and doesn’t appear to use less memory than the above. Or am I missing something (entirely likely :-)

    Neil Bowers

  3. @Neil: the advantage over the diamond operator is that you can directly access the line of the file you want, without parsing them all every time. If you need to access many lines of a zillion-line file multiple times, in random order, then I can assure you this approach is better than while (<$fh>) { }

    File::Sip is slower if you access the lines only once (because it needs one complete run to build its index), but if you need to access many lines of the file, many times, at specific points, then, iterating over the file handle over and over to get where you want will be slower.

    sukria

  4. @sukria: ah, right. So I was missing something :-) Thanks.

    Neil Bowers

  5. At first I was wondering why you were remembering the first character of each line, that seemed like a really narrow usecase (“Give me the first character of line 238189”?!) – then I realized that I misunderstood what “builds an index of each line’s first character, accessible with the corresponding line number” meant; you index the position of the first character of each line, of course, so it is easy to jump to an arbitrary line.

    Chalk that one up to me not being a native English speaker/reader.

    I am wondering if there would be any benefit in building up the index lazily, so if it turns out you only need to jump around in the first couple of terabytes of a hundred terabyte file, you save reading the rest.

    Probably also a narrow usecase, though :-)

    Adam
