[publican-list] sortable lists, esp. glossaries

Fri Feb 3 09:08:55 UTC 2012

On Tue, Jan 31, 2012 at 06:45:12PM +1100, I wrote:

> Apparently, the mapping from a string of Kanji to its pronunciation
> (ordering) isn't even a deterministic operation, at least for proper
> names.

(Of course I meant "proper nouns".  Actual non-determinism might even be
limited to proper nouns, though I'm not sure that that changes anything
from a coding point of view.)

> Thus, the solution would have to involve supplying pronunciations somehow
> for at least some glossary entries.

More precisely, it follows that sorting Kanji entries by pronunciation
would in general require supplying pronunciations for some entries.
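
As a rough illustration of what "supplying pronunciations" could look
like, here is a minimal sketch in Python (not Perl, and not Publican
code). The entry list, the readings and the function name
sort_by_reading are all made up for the example; the only point is that
the sort key is the supplied reading rather than the Kanji itself.

```python
# Hypothetical glossary entries: (term as printed, reading supplied by the
# author). A name written 東 may be read "ひがし" (Higashi) or "あずま"
# (Azuma); the characters alone do not say which.
entries = [
    ("東", "ひがし"),
    ("東", "あずま"),
    ("索引", "さくいん"),
    ("用語集", "ようごしゅう"),
]


def sort_by_reading(items):
    """Order (term, reading) pairs by the supplied reading, then by term."""
    # Comparing kana by code point only approximates dictionary order;
    # a real implementation would hand the reading to a Japanese collator.
    return sorted(items, key=lambda item: (item[1], item[0]))


for term, reading in sort_by_reading(entries):
    print(term, reading)
```

Note that the two 東 entries end up far apart, which no ordering based
on the Kanji alone could produce.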

However, I don't want my unclear wording to contribute to wrong
conclusions about what Publican actually requires: I'm not in a position
to say whether Publican requires index or glossary entries involving
Kanji to be sorted by contextually-correct pronunciation.  All I've
learnt over the past couple of days is that *outside of* a book index or
glossary, Kanji are sorted sometimes by contextually-correct
pronunciation and sometimes by some other order (and I think there's more
than one alternative, even).

If anyone wants a concrete sample for an "is this output acceptable"
question (and isn't using software written specifically for Japanese
sorting, such as Lingua::JA::Sort::JIS), then I suggest making sure that
the collation function is tailored for a Japanese locale (e.g. using
Unicode::Collate::Locale->new(locale => 'ja-JP')): without that,
collation software is unlikely to try to use a specifically Japanese
ordering of Kanji characters or to intersperse Katakana with Hiragana.

In particular, the documentation for plain Unicode::Collate is explicit
that it doesn't intersperse Katakana with Hiragana, and that its Kanji
ordering is simply by Unicode block and code point rather than by a JIS
ordering.
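
To show what "interspersing" means in practice, here is a toy Python
sketch contrasting a pure code-point sort with one that folds Katakana
onto Hiragana before comparing. It is not a description of how either
Perl module works internally; the helper fold_katakana and the word
list are assumptions made up for the example.

```python
def fold_katakana(text):
    """Map Katakana letters U+30A1..U+30F6 to the Hiragana 0x60 below them."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


words = ["アイス", "いぬ", "イカ", "あい"]

# Pure code-point order: every Hiragana word precedes every Katakana word.
by_code_point = sorted(words)

# Folded order: the two scripts are interleaved by sound; the original
# spelling is kept as a tie-breaker so the result is deterministic.
interspersed = sorted(words, key=lambda w: (fold_katakana(w), w))

print(by_code_point)
print(interspersed)
```

The first list groups the words by script, while the second interleaves
them, which is closer to what a reader of a Japanese index would expect.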

So I think the easiest thing to do that has a good chance of getting a
"yes, this is acceptable" answer would be to switch from Unicode::Collate
to Unicode::Collate::Locale and pass locale => $LANG to the constructor
(where $LANG is the Publican language like en-US or ja-JP).

Effect on other languages:

Switching to a locale-sensitive collator might also make for a better
collation of Indic languages (handling of virama, and some related
reordering rules).

Applied to Spanish indexes, on the other hand, it might move entries like
chkconfig from near the beginning of the C entries to just before D. It's
not clear to me whether that's a good or bad thing for a word like
chkconfig, which isn't even Spanish and thus arguably isn't using the
Spanish ch digraph.

(In both cases, I haven't actually tested the behaviour, nor have I asked
a native speaker for their preferences for index/glossary sorting in
technical documentation.)
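
For anyone who wants to see the Spanish effect without a collation
library, here is a toy Python sketch. It only imitates the traditional
rule that ch sorts as a letter of its own between c and d; it ignores
ll, ñ and accents, assumes lower-case ASCII command names, and says
nothing about what any particular Perl locale actually does. The
function name and the entry list are made up for the example.

```python
def traditional_spanish_key(word):
    """Sort key that places "ch" after every other "c..." and before "d"."""
    # "~" sorts after "z" in ASCII, so "c~" lands after "cz" and before "d".
    return word.lower().replace("ch", "c~")


entries = ["cron", "chkconfig", "cat", "date", "cups"]

plain = sorted(entries, key=str.lower)
traditional = sorted(entries, key=traditional_spanish_key)

print(plain)        # chkconfig near the start of the C entries
print(traditional)  # chkconfig just before the D entries
```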

pjrm.