john-dev - ring, etc


Message-Id: <CDC48B48-7FF5-42E2-A9E0-666FFFA9E939@gmail.com>
Date: Sat, 19 Sep 2020 18:44:17 -0700
From: Fred Wang <waffle.contest@...il.com>
To: john-dev@...ts.openwall.com
Subject: ring, etc

Thanks for looking at rling. I spent a couple of weeks studying the rli program and made a number of improvements.

I am concerned about a couple of things you said on Twitter, though.

By default, rling uses a hash and always keeps input line order. In fact, I take great pains to ensure that. The number of threads in use does not affect line ordering in any way. If you have found a case where you think it does, I would very much like to see it so it can be fixed.
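
To illustrate the order-preserving idea, here is a minimal Python sketch. It is not rling's actual code: it assumes the task is dropping duplicate lines, it is single-threaded, and the function name dedupe_keep_order is made up for this example.

```python
def dedupe_keep_order(lines):
    """Drop repeated lines while keeping the first occurrence of each
    in its original input position (hypothetical helper, not rling code)."""
    seen = set()          # hash-based membership test
    kept = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return kept


if __name__ == "__main__":
    print(dedupe_keep_order(["b", "a", "b", "c", "a"]))
```

Because the output is built by walking the input once, front to back, the hash is only consulted for membership and never dictates the order of what is written.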

rling -b changes the operation to a binary search rather than a hash. This still does not change line order on output unless the -s switch is also given, in which case the output is in lexically sorted order. It is quite fast at sorting, usually beating GNU sort by a large margin (several times faster, depending on file size). In addition, it checks sort order on input, which means that rling -b -s on an already sorted file is blindingly fast (7.5 seconds to read and write a 1 billion line file on my development system, including the “sort”).
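
The "check sort order on input" shortcut can be sketched as follows. This is only an illustration in Python under my own assumptions, not how rling implements it; is_sorted and sort_if_needed are hypothetical names.

```python
def is_sorted(lines):
    """Return True if every line is <= the one after it (lexical order)."""
    return all(a <= b for a, b in zip(lines, lines[1:]))


def sort_if_needed(lines):
    """Skip the sort entirely when one linear pass shows the input is
    already in order; otherwise fall back to a full sort."""
    if is_sorted(lines):
        return list(lines)
    return sorted(lines)


if __name__ == "__main__":
    print(sort_if_needed(["apple", "banana", "cherry"]))  # already sorted
    print(sort_if_needed(["cherry", "apple", "banana"]))  # needs sorting
```

The check costs one linear pass over the input, which is cheap compared with a full sort when the file turns out to be ordered already.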

rling -2 requires that all files are sorted already (and produces a proper error message if they aren’t).

rling -f uses a file-based “virtual memory” system, and should be used only as a last resort (on systems with limited memory and large files).

Would you be able to put your test file up somewhere, so I can snag it?  For my 1 billion line file, I generated it with a perl script: 'for ($x=0; $x<1000000000; $x++) {print "$x\n";}'

Thanks!
