[ This was originally in PowerPoint. I saved it to Word, then saved
it as MS-DOS text with line breaks, copied it to Unix got rid of the Control-M
characters and generally cleaned it up. I know PPT will save as HTML, but
the result is ugly. I hope this is useful. ]
Sorting in Perl
John Klassa / Raleigh Perl Mongers / June, 2000
Introduction
- Perl has a built-in "sort" function.
- It uses a quicksort algorithm, which has good (in fact, O(n log
n)) performance. (Note, however, that a simple bubble sort can be faster
than most other sorts for very short lists.)
- It's easy to use. You say:
@s = sort @a;
and you've got a sorted array...
So Why Do We Need a Talk?
- In order to do useful sorts, you need to add a bit of code to your
"sort" calls.
- If you haven't had the need to do a complex sort, you will.
- The built-in "sort" compares the sortkeys O(n log n) times. How
(and how often) you generate the sortkeys, and what you sort them with,
makes a huge difference.
The Basics
More Basics
Variations
Combination Sorts
Sort Subroutines
Advanced Sorting: Motivation
- Everything that happens in your sort subroutine/clause happens O(n
log n) times.
- If you do something expensive (like an extraction via a regexp,
or perhaps a "stat" on a file), this is a Bad Thing.
- The Goal: Extract the sortkeys just once.
Solution #1: "Orcish" Maneuver
Problems with the OM
- Performs an extra test after each sortkey lookup.
- False values are recomputed each time.
A Better Way: The Schwartzian Transform
- The Schwartzian Transform creates a sorted list by transforming
the original list into an intermediate form, where the sortkeys are cached,
and then pulling the original list back out.
But First, A Digression...
- Just as @a = (1, 2, 3) creates an array, $aref = [1, 2, 3] creates
a reference to an anonymous array.
- Just as $a[0] is the first element of @a, $aref->[0] is the first
element of the anonymous array to which $aref refers.
- Understanding this is central to understanding the ST.
The Schwartzian Transform
ST: Mechanics
- Map the list into a new one that contains the extracted sortkeys
and the original values.
- Sort on the sortkeys.
- Map the resulting list into a new one that contains the original
values in the sorted order.
ST: Verbose Approach
- Verbosely, in code:
# @a exists, and contains filenames
@x = map { [ $_, -M ] } @a; # transform: value, sortkey
@sx = sort { $a->[1] <=> $b->[1] } @x; # sort
@s = map { $_->[0] } @sx; # restore original values
ST: Final
- Put it all together? The key is to read it backwards.
# @a exists, and contains filenames
@s = map { $_->[0] } # restore original values
sort { $a->[1] <=> $b->[1] } # sort
map { [$_, -M] } @a; # transform: value, sortkey
Can We Do Better?
- I didn't think so until I read the Guttman-Rosler paper.
- Turns out, yes. Use packed sortkeys and the default sort.
The "Packed Default" Sort
- So-named because it uses packed sortkeys, then sorts them with the
default "sort" (i.e. no sort subroutine or sort clause. just the native,
all-in-C comparison routine).
- The benefits: fast, one-time sortkey generation; fast comparison;
fast extraction.
PD: The Mechanics
- Pack the sortkeys into a single string (tack on subkeys, if any).
- Tack on the original values (or an index, if the original values
are complex data structures).
- Sort.
- Retrieve original values via "substr", "split" or whatever.
PD: Example
Conclusion
- Using "sort" is always O(n log n).
- For complicated sorts, how you pull out the sortkeys and how you compare
them is what matters.
- The ST is my personal favorite. It's easy to remember, and it's fast.
- The PD sort is faster, but it's also a bit more cryptic (unless you're
a natural with "pack", and have a desire to really understand your data).
By the way, Uri Guttman started on a Sort::Records module that does a PD
sort under the covers, but did not finish it or publish it to CPAN. He has,
however, offered to give us the current source, design ideas, help, etc.
if anyone in raleigh.pm would like to pick it up.
References
Revisions
- November 20, 2002: Rob West: Update based on feedback from Uri Guttman.
- August 22, 2003: Rob West: Updated References links.