Inconsistant sort order in categories

Inconsistant sort order in categories

See for example , look at end of index letter "A".

It looks like the SQL database uses an inconsistant sort order, or the SQL client used does not properly implement the sorting (very likely an incorrect implementation of the "Quicksort" algorithm, with a missing "=" sign when comparing keys).

This causes the pages navigation (that goes from page to page of 500 members each) to skip a (variable) number of entries, as the starting position (from= value in the GET request) is not honored.

In the category above, just click on letter "A": this shows only the members starting by "A" incorrectly sorted at end of the list (as viewed from the start of the list). Click on the Greek Alpha key: all Greek letters are skipped, you need to go backward and you'll see the missing Greek names, and some Cyrillic names starting by the Cyrillic letter "A". Click on the Hebrew Alef key, you don't see any Hebrew name and you don't even see the first Arabic names normally sorted after it (such as those starting by an Arabic Alef, which will be seen only when going backward to the previous page).

In some cases, going backward in long list will also skip some intermediate values, which will then never be listed.

Note that when generating the HTML, the retrieved 500 first keys (internal SQL: "LIMIT 500" in the query, which should be using a comparator "WHERE key >= 'index') are sorted again in the client (but still inconsistantly) before generating the actual list. Collation also seem to not use the same rules (or locale) in the SQL server and in the webapp (in PHP).

It is also very likely that sorting indexes for categories are incorrectly computed (using incorrectly generated collation keys) and that they were not reindexed when there was an upgrade of the SQL server software.

So categories that have more than 500 members (notably categories of users) are not navigatable, and cannot be seen entirely from the web interface (this does not seem to affect the same way queries using the Mediawiki API, for example returning list of category members in JSON).

The non-regeneration of collation keys for page names (computed normally when performing "INSERT" or "UPDATE" on invididual rows) is causing issues, but I doubt this is caused by the internal implementation in the SQL server, which just internally uses a binary UTF-8 sort, or is supposed to use it; collation keys are stored in a supplementary column, separately of the UTF-8 page name, but taking into account the sort key given in [[Category:categoryname|custom-sort-key]]) as a prefix followed by some separator, which should be a collation element lower than the collation element for an ASCII space, so that separator should be a control character like 0x01; if it is not possible to use that separator due to data restriction on the stored collation-key column, then that separator should be " !", i.e. a SPACE followed by an exclamation mark.

Now this wiki seems to apply it own custom sort after retreiving each subset of 500 rows, to use an ICU based collation on this subset. This custom sort is visibly affected by the incorrect implementation of the QuickSort algorithm, or it is not stable. The result of that is that list of members displayed are incoherent, they are missing items which are not listed (incorrectly dropped during this custom sort).

Can you investigate this? Where is the source code of the client used on this wiki? can you investigate if stored collation keys for list of "sortkey+pagenames" in categories are correctly generated (you may need to use an asdmin script to regenerate these data columns constainong the generated collection key, (i.e. custom sort key + separator + pagename + separator + namespace), then filtered and possibly truncated to avoid exausting a length limit. It is even possible that different rows may have the same collation key stored in the database even if they are for distinct page names.

Verdy p (talk)06:02, 5 May 2022