File::Sip, a Perl module to read huge text files with limited memory
Even though we live in a world where buying a server with 500 GB of RAM is possible, there can always be a situation where we don't have enough memory available. What about log files from a web server that handles a billion requests every day? What if you need to parse these files as efficiently as possible with a limited amount of main memory available?
When the file is small enough, or when the memory is big enough, it's not an issue: you just bring in File::Slurp::Tiny, slurp the whole file into memory, and access any line you like. But when one of these conditions is not satisfied, the rules of the game change: you need to be able to access any line of the file without loading the whole content into memory.
What do you do? Well, that could be an interesting job-interview question!
At work, we found ourselves in exactly that situation a couple of months ago: we wanted a way to iterate over a file without loading it into memory. After a quick search on MetaCPAN, we realized nothing did what we wanted, so, as good citizens, we implemented it ourselves, and now we're releasing it to CPAN. So, let me introduce File::Sip.
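Here is a minimal usage sketch. The constructor argument and method name below are taken from my reading of the module's interface; double-check the documentation on MetaCPAN, and treat the log path as a placeholder.

```perl
use strict;
use warnings;
use File::Sip;

# Build the index once (this scans the file, but keeps only offsets in memory).
my $sip = File::Sip->new( path => '/var/log/httpd/access.log' );

# Random access by line number, without slurping the whole file:
my $line = $sip->read_line(42);
print $line;
```

The key point is that `new` pays the one-time cost of scanning the file, after which any `read_line` call is a cheap seek-and-read.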
I could have entitled this blog post "When slurping makes you sick, try to sip", because that's the whole idea: instead of loading the whole content of the file into memory, File::Sip builds an index of the position of each line's first character, keyed by the corresponding line number. Don't get me wrong, File::Sip is slower than File::Slurp::Tiny, because it needs an init phase that scans the whole file to build its index. But if you want to parse a 10 GB file on a system where you only have 4 GB of RAM for the current process, that won't be a problem for File::Sip: only the current line and the index are in memory, nothing more.
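To make the idea concrete, here is a self-contained sketch of that technique in core Perl, using only `tell` and `seek`. This is not File::Sip's actual implementation (the real module also deals with encodings and other details); it just shows the byte-offset index that makes random line access cheap.

```perl
use strict;
use warnings;

# Build an index of the byte offset where each line starts.
# Only the offsets stay in memory, never the file's content.
sub build_index {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @offsets = (0);                # line 0 starts at byte 0
    while (<$fh>) {
        push @offsets, tell($fh);     # offset of the next line's first char
    }
    pop @offsets;                     # last entry points past EOF
    return ($fh, \@offsets);
}

# Fetch line $n by seeking straight to its recorded offset.
sub read_line_at {
    my ($fh, $offsets, $n) = @_;
    return undef if $n > $#$offsets;
    seek $fh, $offsets->[$n], 0;      # jump directly to line $n
    return scalar <$fh>;              # only this one line enters memory
}
```

The trade-off is exactly the one described above: `build_index` reads the file once up front, and from then on any line is one `seek` plus one read away.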
The project has been released on CPAN and is hosted, as usual, in Weborama's GitHub repo; comments, issues, patches, and forks are welcome!