[libvirt PATCH 00/51] Use permutable format strings in translations

Mon Mar 27 14:37:34 UTC 2023

On Mon, Mar 27, 2023 at 01:08:09PM +0200, Jiri Denemark wrote:
> On Fri, Mar 10, 2023 at 17:14:32 +0000, Daniel P. Berrangé wrote:
> > Even if fixed, it might be worth switching the .pot file anyway, but
> > this can't be done without us bulk updating the translations, and
> > bulk re-importing them, which will be challenging. We'll almost
> > certainly want to try this on a throw-away repo in weblate first,
> > not our main repo.
> 
> I was able to come up with steps leading to the desired state:
> 
>  0. lock weblate repository
>  1. update libvirt.pot from the most recent potfile job
>  2. push to libvirt.git
>  3. wait for translations update from Fedora Weblate and merge it
>  4. pull from libvirt.git
>  5. apply the first 50 patches from this series (with required changes
>     to make sure all translation strings are updated)
>  6. update all po files with the attached script
>  7. update libvirt.pot by running meson compile libvirt-pot
>  8. apply patch 51 of this series
>  9. push to libvirt.git
> 10. wait for translations update from Fedora Weblate and merge it
> 11. unlock weblate repository
> 
> The process takes about an hour if we're lucky, as weblate is quite slow
> when processing such a large number of changes.
> 
> The result can be seen at
> 
>     https://gitlab.com/jirkade/libvirt/-/commits/format-strings
> 
> and the corresponding weblate repository at
> 
>     https://translate.fedoraproject.org/projects/libvirt/test/
> 
> I used d05ad0f15e737fa2327dd68870a485821505b58f commit as a base.

Looking at this, I picked a random language (Bengali) and compared
stats:

  https://translate.fedoraproject.org/projects/libvirt/test/bn_IN/

vs

  https://translate.fedoraproject.org/projects/libvirt/libvirt/bn_IN/

The translated string counts match to within 2 words, which is
probably accounted for by the two being based on a different HEAD.

The number of strings with failing checks is massively different,
and that is the fault of 'failing check: C format', which reports
1300 more failing checks afterwards.

Comparing

https://translate.fedoraproject.org/browse/libvirt/test/bn_IN/?q=check%3Ac_format&sort_by=source&offset=3

with

https://translate.fedoraproject.org/browse/libvirt/libvirt/bn_IN/?offset=1&q=check%3Ac_format&sort_by=source&checksum=

we can see some obvious examples that were missed, such as

https://translate.fedoraproject.org/translate/libvirt/test/bn_IN/?checksum=260fc1387343083b&q=check%3Ac_format&sort_by=source

Which is:

 msgid  "active commit requested but '%1$s' is not active"
 msgstr "সংরক্ষণের পুল '%s' সক্রিয় নয়"

Looking at po/bn_IN.po, I see that this string was already marked as
'fuzzy' before your changes, and thus your script did not try to
convert its format string.

Skipping fuzzy strings makes sense when the number of format
strings is mis-matched. If there's a matching count and matching
ordering, I think we ought to update the msgstr even when fuzzy,
but *keep* it marked fuzzy, so translators can review.
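
To make that rule concrete, here is a rough sketch of the check I have
in mind. It is not taken from the attached script: FORMAT_RE is a much
simplified printf matcher, and the names plain_formats and
safe_to_convert, as well as the sample strings, are made up purely for
illustration.

```python
import re

# Simplified printf matcher (an assumption for this sketch, far less
# complete than the regex in the attached script). It matches either a
# literal "%%" or a non-positional format such as "%s" or "%05.2f".
# Positional formats such as "%1$s" are deliberately not matched.
FORMAT_RE = re.compile(
    r"%(?:%|[-#0+']*\d*(?:\.\d+)?(?:hh|h|ll|l|j|z|t|L)?[diouxXeEfFgGaAcspm])"
)


def plain_formats(text):
    """Return the non-positional formats in text, in order of appearance."""
    return [f for f in FORMAT_RE.findall(text) if f != "%%"]


def safe_to_convert(msgid, msgstr):
    """True if msgstr uses the same formats as msgid, in the same order.

    msgid is expected to be the original (not yet converted) string.
    """
    return plain_formats(msgid) == plain_formats(msgstr)


msgid = "disk '%s' has %d snapshots"
print(safe_to_convert(msgid, "'%s': %d"))     # same count, same order
print(safe_to_convert(msgid, "%d for '%s'"))  # reordered
print(safe_to_convert(msgid, "only '%s'"))    # count mismatch
```

A fuzzy entry for which such a check succeeds would get its msgstr
converted, while the "#, fuzzy" flag line stays untouched.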

Anyway, broadly speaking, this script seems to have done the right
thing such that we don't lose translation coverage in the
compiled .mo files. My query is merely about fuzzy strings,
which already get excluded from .mo files.

> If we agree this is a reasonable approach, I think we should apply it
> just after a release to give translators the whole release cycle to
> check or update the translations if they wish so.

Yep, doing it at the start makes sense.

> The attached script analyzes a single po file and updates all msgid
> strings to use permutable format strings. It also tries to update all
> translations, but only if the format strings in them exactly match
> (including their order) the corresponding msgid format string. That is,
> a msgstr will not be updated if format strings in it were incorrect or
> reordered or they already used the permutable form. That is, the
> processing should be a NO-OP except for strings that already used
> permutable format in msgstr, such translations were failing c-format
> check in weblate before but would be marked as correct now.

NB, even though your script would fix those cases of pre-existing use
of format positions, they'd still be left marked 'fuzzy', so they will
need manual review in weblate. At least that is possible now that the
c-format check no longer fails for them.
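
As an illustration of why those entries line up after the conversion,
here is a minimal sketch of the rewrite from plain to permutable
formats. It is a simplification, not the attached script: SPEC_RE and
to_permutable are hypothetical names, and '*' width/precision
arguments are not handled at all.

```python
import re

# Simplified printf matcher (an assumption for this sketch). It matches
# a literal "%%", or a format with an optional "N$" position followed by
# flags, width, precision, length modifier and conversion.
SPEC_RE = re.compile(
    r"%(?:%|(?P<index>[1-9][0-9]*\$)?"
    r"(?P<rest>[-#0+']*[0-9]*(?:\.[0-9]+)?(?:hh|h|ll|l|j|z|t|L)?"
    r"[diouxXeEfFgGaAcspm]))"
)


def to_permutable(text):
    """Number each plain format; leave "%%" and positional formats alone."""
    counter = 0

    def repl(m):
        nonlocal counter
        if m.group(0) == "%%" or m.group("index"):
            # Nothing to do: literal percent or already positional.
            return m.group(0)
        counter += 1
        return f"%{counter}${m.group('rest')}"

    return SPEC_RE.sub(repl, text)


# A msgid that used plain formats gains positions ...
print(to_permutable("pool '%s' has %d volumes"))
# ... while a msgstr that already used positions is left as it was.
print(to_permutable("%2$d volumes in '%1$s'"))
```

Once the msgid carries positions too, a msgstr that was already
written with positions uses the same form as its msgid, which is what
the c-format check compares.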

> 
> Jirka

> #!/usr/bin/env python3
> 
> import sys
> import re
> 
> 
> # see man 3 printf
> reIndex = r"([1-9][0-9]*\$)?"
> reFlags = r"([-#0+I']|' ')*"
> reWidth = rf"([1-9][0-9]*|\*{reIndex})?"
> rePrecision = rf"(\.{reWidth})?"
> reLenghtMod = r"(hh|h|l|ll|q|L|j|z|Z|t)?"
> reConversion = r"[diouxXeEfFgGaAcspnm%]"
> reCFormat = "".join([
>     r"%",
>     rf"(?P<index>{reIndex})",
>     rf"(?P<flags>{reFlags})",
>     rf"(?P<width>{reWidth})",
>     rf"(?P<precision>{rePrecision})",
>     rf"(?P<length>{reLenghtMod})",
>     rf"(?P<conversion>{reConversion})"])
> 
> 
> def translateFormat(fmt, idx, m):
>     groups = m.groupdict()
> 
>     if groups["index"] or groups["conversion"] == "%":
>         print(f"Ignoring c-format '{fmt}'")
>         return idx, fmt
> 
>     for field in "width", "precision":
>         if "*" in groups[field]:
>             groups[field] = f"{groups[field]}{idx}$"
>             idx += 1
> 
>     newFmt = f"%{idx}${''.join(groups.values())}"
>     idx += 1
> 
>     return idx, newFmt
> 
> 
> def process(ids, strs, fuzzy):
>     regex = rf"(.*?)({reCFormat})(.*)"
>     fmts = []
>     idx = 1
> 
>     newIds = []
>     for s in ids:
>         new = []
>         m = re.search(regex, s)
>         while m is not None:
>             new.append(m.group(1))
> 
>             oldFmt = m.group(2)
>             idx, newFmt = translateFormat(oldFmt, idx, m)
>             fmts.append((oldFmt, newFmt))
>             new.append(newFmt)
> 
>             s = m.group(m.lastindex)
>             m = re.search(regex, s)
> 
>         new.append(s)
>         newIds.append("".join(new))
> 
>     if fuzzy:
>         return newIds, strs
> 
>     n = 0
>     newStrs = []
>     for s in strs:
>         new = []
>         m = re.search(regex, s)
>         while m is not None:
>             new.append(m.group(1))
> 
>             if n < len(fmts) and fmts[n][0] == m.group(2):
>                 new.append(fmts[n][1])
>                 n += 1
>             else:
>                 print("Ignoring translation", strs)
>                 print("              for id", newIds)
>                 return newIds, strs
> 
>             s = m.group(m.lastindex)
>             m = re.search(regex, s)
> 
>         new.append(s)
>         newStrs.append("".join(new))
> 
>     return newIds, newStrs
> 
> 
> def writeMsg(po, header, strs):
>     if len(strs) == 0:
>         return
> 
>     po.write(header)
>     po.write(" ")
>     for s in strs:
>         po.write('"')
>         po.write(s)
>         po.write('"\n')
> 
> 
> if len(sys.argv) != 2:
>     print(f"usage: {sys.argv[0]} PO-FILE", file=sys.stderr)
>     sys.exit(1)
> 
> pofile = sys.argv[1]
> 
> with open(pofile, "r") as po:
>     polines = po.readlines()
> 
> with open(pofile, "w") as po:
>     current = None
>     cfmt = False
>     fuzzy = False
>     ids = []
>     strs = []
> 
>     for line in polines:
>         m = re.search(r'^(([a-z]+) )?"(.*)"', line)
>         if m is None:
>             if cfmt:
>                 ids, strs = process(ids, strs, fuzzy)
> 
>             writeMsg(po, "msgid", ids)
>             writeMsg(po, "msgstr", strs)
>             po.write(line)
> 
>             cfmt = line.startswith("#,") and " c-format" in line
>             fuzzy = line.startswith("#,") and " fuzzy" in line
> 
>             current = None
>             ids = []
>             strs = []
>             continue
> 
>         if m.group(2):
>             current = m.group(2)
> 
>         if current == "msgid":
>             ids.append(m.group(3))
>         elif current == "msgstr":
>             strs.append(m.group(3))
> 
>     if cfmt:
>         ids, strs = process(ids, strs, fuzzy)
> 
>     writeMsg(po, "msgid", ids)
>     writeMsg(po, "msgstr", strs)

My attempt at converting fuzzy strings involved this diff:

--- /home/berrange/format-strings.py~	2023-03-27 13:29:05.777343030 +0100
+++ /home/berrange/format-strings.py	2023-03-27 13:43:33.950701633 +0100
@@ -62,9 +62,6 @@
         new.append(s)
         newIds.append("".join(new))
 
-    if fuzzy:
-        return newIds, strs
-
     n = 0
     newStrs = []
     for s in strs:
@@ -77,8 +74,9 @@
                 new.append(fmts[n][1])
                 n += 1
             else:
-                print("Ignoring translation", strs)
-                print("              for id", newIds)
+                if not fuzzy:
+                    print("Ignoring translation", strs)
+                    print("              for id", newIds)
                 return newIds, strs
 
             s = m.group(m.lastindex)
@@ -87,6 +85,12 @@
         new.append(s)
         newStrs.append("".join(new))
 
+    if n != len(fmts):
+        if not fuzzy and "".join(strs) != "":
+            print("Ignoring mismatched format count", strs)
+            print("                          for id", newIds)
+        return newIds, strs
+            
     return newIds, newStrs
 
 


With that I believe "Failing check: C format" should match before/after
your changes.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|