Natural string sorting, an in depth analysis

by markg85

Hi KDE folks,

Update: content updated! The blog post has been updated all over the place with proper timings.

Yes, a blog post about natural string sorting was bound to happen. Given that I’ve been playing with that for quite a long time. This post is all about the performance of sorting strings in a way consider natural, but machines don’t.

What does a natural order look like? Here is a small example.
0.txt
1.txt
3.txt

8.txt
9.txt
10.txt
11.txt

You get the point. That is a natural sort order for people. Computers on the other hand don’t think so. When they sort a string they look at each individual character. Looking at numbers in that way causes computers to sort the same example as:

0.txt
1.txt
10.txt
11.txt
3.txt

8.txt
9.txt

See the issue at “10.txt” and “11.txt”. Well, that is caused by looking at individual characters for sorting. How exactly that issue is solved on a technical area is beyond this blog post. If you want, you can investigate the KDE function: KStringHandler::naturalCompare.

Currently in KDE SC 4.xx, more specifically in Dolphin, natural sorting is done using this very nifty function. It works just fine and has been doing so for many years. But it has a downside: it’s not really fast. Usually you don’t bother too much about that since you don’t often sort, right? Well, partly. In KDE’s Dolphin you get sorted results by default. This in turn means that you will notice any slowdowns if the sorting takes more then 100ms (0.1 second). That in turn also depends on the person. I had reported this issue in the Dolphin camp quite a while ago and they “solved” it by implementing a sorting algorithm that runs on multiple cores thus is usually done faster. The KStringHandler::naturalCompare function is still used, just on more cores :)

All this isn’t new. Dolphin developers have done really awesome work in speeding this up and it’s been implemented in dolphin since a couple releases (KDE Applications 4.11?).

This still leaves us with a – in my opinion – slow KStringHandler::naturalCompare. Isn’t there anything we can do to speed it up? Till Qt version 5.1 there was nothing we could do about it. Starting with Qt 5.2 we have two awesome new classes:
- QCollator
- QCollatorSortKey

QCollator can be seen as a class that you create that sets the rules for comparing strings. So we want for instance numbers to be sorted in a way we like them (10 after 9, not 1). That class allows us to set those rules. Then we can call QCollator::compare with the strings to compare and they would be checked to those rules.

QCollatorSortKey is a bit different. You can ask QCollator (by calling “sortKey(your string)”) to give a pre-computed key that obeys the rules set in QCollator. When you want to compare two string you would then compare the QCollatorSortKey objects for those strings. It adds a bit of extra bookkeeping so you need to wonder if the extra bookkeeping is worth using it. QCollator::compare is very easy after all.

To know for sure i started benchmarking those 3 options:
- Sorting using KStringHandler::naturalCompare
- Sorting using QCollator::compare
- Sorting using QCollatorSortKey

This chart shows the benchmarking results i gathered with those 4 different methods in Qt build with -developer-build, aka debug + something else. Shorter is better.
sorting_benchmark

 

The following chart is gathered with Qt build in release mode and debug symbols added. That’s about the mode you would have (minus the debug symbols) when you install Qt from your distribution.

sorting_revised

A special note for QCollatorSortKey. It is benchmarked with the overhead time of calling QCollator::sortKey and adding additional bookmarking data. In other terms, the timings for QCollatorSortKey are including all overhead! If i where to remove the overhead and just measure the actual sorting time then all it’s timings are cut in half.

Before i started benchmarking i had some ideas about which one would be faster. I was expecting QCollatorSortKey to be the fastest so no surprise for me there. It’s the other two where i’m really surprised.

I was expecting QCollator::compare to beat KStringHandler::naturalCompare since the former is going to replace the later in a KDE Frameworks world. I wasn’t expecting QCollator::compare to be this much slower, rather my expectation was it to be somewhat faster then what we had. Apparently KStringHandler::naturalCompare still has a big reason to be used when looking at the performance numbers.

Another thing i wasn’t expecting was QCollatorSortKey to _always_ be faster then the other two alternatives. I was under the impression that for a few compares QCollator::compare was faster. The documentation even says so, but these benchmarking results clearly show that QCollatorSortKey is just faster (again, with all overhead included!).

Based on the chart we can draw some conclusions.
1. KStringHandler::naturalCompare is far superior to QCollator::compare when it comes to performance.
2. QCollatorSortKey beats everyone.
3. If you never have more then 100 items to sort then it really doesn’t matter which one you take. All 3 options complete the sorting very fast. If this is your case then i would go for ease of coding and go for QCollator::compare. I completely revise this stance based on the new performance numbers. You should consider using KStringHandler::naturalCompare when you need a simple natural string compare method. If you have more time then you should go for QCollatorSortKey. I cannot make up any reason anymore to use QCollator::compare. It’s just not as performant as the alternatives.
4. Do you want your sorting to still be fast with insanely high numbers of items to sort? Go for QCollatorSortKey. For file browsers (hello Dolphin and Accretion) i would always go for QCollatorSortKey.

If you want to repeat these results, here is the code i used. It needs C++11 since i use some C++11 features.

Next up is investigating if this is the most performance we can drag out of QCollatorSortKey. More performance there would mean patches against Qt. So where is the time being spend when we sort in QCollatorSortKey? Surely in std::sort comparing the internal bytes, right? Aka, not much more we can optimize.

To profile this we add a header in the above source file:
#include <valgrind/callgrind.h>

Then we set points from where we want to have profiling. You can replace your doSort function with this one:

Then compile it (obviously in release mode with debug symbols) and run the executable through valgrind with a command like this:
valgrind –tool=callgrind –instr-atstart=no <executable>

This gives you a file named callgrind.out.<somenumber> You should open kcachegrind with that to read it. I don’t know how others do this, but in this case i’m looking at the actual sort function and the sortKey function cost. For both in the deepest level till the line you find the function that Qt is calling that isn’t defined in Qt.
For the actual sorting that is the line:
__memcmp_sse4_1 (at a cost of 5,79% of the total execution time)

For the sortKey that is the function call:
ucol_getSortKey_53 (at a cost of 27,10% of the total execution time)

+ 0,5% of the total running time create our new vector with the sorted strings.

Note: when i say “total running time” i mean the time that is instrumented by valgrind. This is not the total execution time when you start the app.

Now we immediately have a good idea of the performance we get and the actual costs.
Actual costs: 5,79% + 27,10% + 0,5% = 33,39%
Which means that 66,61% is spend in overhead, Checks and whatever else is going on.

First, lets look at the 5,7% of the memcmp.
QCollatorSortKey_profiling

Look at the image. Within the qstrcmp we have 4 things using cpu:
20.5%, QByteArray::constData()
5,79%, __memcmp_sse4_1
3,49%, QByteArray::length()
2,84%, qMin

Remember, every single function that is being called in __memcmp_sse4_1 is called in O(n log n) time. That is a LOT more then O(n). If we just reduce the function calls there we can easily save 20% of the total running cost thus that alone making our sorting 20% faster. The issue is that this function (qstrcmp which calls __memcmp_sse4_1) isn’t wrong at all. It’s just fine for one QByteArray. However, when we have a big list of QByteArray objects then it becomes faster to maintain bookkeeping of the length of all items and the data pointer to them. Which would completely throw away the calls to constData and length at a cost of accessing an array where that same data is stored. It will be faster, but for Qt it’s not worth it. It’s too specific. For us it would be worth it.

Update the above was done with a Qt build that was compiled with “-developer-build”. I was expecting that to be equal to “release with debug symbols”, but it’s equal to “debug with some more” thus giving me a wrong picture when benchmarking and profiling. I let the above in this post because it was the reason for me to try a different approach. However, in -release mode you don’t see any calls to ::constData anymore. It’s a free function call (0 cost) along with other calls that became free in release mode. That’s why in the charts above (the one for the release build) the timings for QCollatorSortKey are down quite significantly between debug and release. However, the above does show potential data locality improvements that can be made. A general rule of thumb when optimising for high speed is to have data as close together as possible.

An even more ideal approach would be for Qt to have a sortKey function that fills a buffer that we provide it. We then maintain everything ourselves which allows for far easier optimisations then to hack in the Qt code. Another big advantage of that approach is that we control the data locality.

As a proof of concept i implemented that ideal approach in Qt. I fear that it won’t be accepted upstream, but i can always try. The diff:

I used the new function in my benchmark and the results are simply stunning as the following graph shows (shorter bars is better). Taken with Qt build in release mode. (+typo of optimized..)

sorting_revised_colkey_optimized

As you can see in the chart, the speedup in the optimised version is always faster, but you won’t notice that if you sort less then 5000 items. Interesting though is the increased performance improvement in the higher sorting numbers. The optimised sorting is about 80% faster then the non optimised version. I’m guessing this is due to data locality which in turn causes for more efficient usage of the CPU cache. And let me stress it again, this is a timing including all bookkeeping overhead! If i would just measure the sorting speed then the 500.000 sort takes just 180 milliseconds aka 0,18 seconds. The vast majority of the time spend right now is in creating the sortKey with roughly 65%. I can probably find faster ways there as well, but i don’t really want to fiddle with ICU internals :)

There you have it, an in depth analysis of how to do natural string sorting. You now know your options and which one is the fastest.

I hope you liked this insanely long blog post. It took me quite some hours to write it.

Cheers,
Mark