[libvirt] [PATCH 5/5] po: minimize & canonicalize translations stored in git

Daniel P. Berrangé berrange at redhat.com
Thu Apr 19 08:03:54 UTC 2018


On Thu, Apr 19, 2018 at 09:53:01AM +0200, Ján Tomko wrote:
> On Thu, Apr 12, 2018 at 02:28:22PM +0100, Daniel P. Berrangé wrote:
> > Similar to the libvirt.pot, .po files contain line numbers and file
> > names identifying where in the source a translatable string comes from.
> > The source locations in the .po files are thrown away and replaced with
> > content from the libvirt.pot whenever msgmerge is run, so this is not
> > precious information that needs to be stored in git.
> > 
> > When msgmerge processes a .po file, it will add in any msgids from the
> > libvirt.pot that were not already present. Thus, if a particular msgid
> > currently has no translation, it can be considered redundant and again
> > does not need storing in git.
> > 
> > When msgmerge processes a .po file and can't find an exact existing
> > translation match, it will try to do fuzzy matching instead, marking such
> > entries with a "#, fuzzy" comment to alert the translator to take a
> > look and either discard, edit or accept the match. Looking at the
> > existing fuzzy matches in .po files shows that the quality is awful,
> > with many having a completely different set of printf format specifiers
> > between the msgid and fuzzy msgstr entry. Fortunately when msgfmt
> > generates the .gmo, the fuzzy entries are all ignored anyway. The fuzzy
> > entries could be useful to translators if they were working on the .po
> > files directly from git, but Libvirt outsourced translation to the
> > Fedora Zanata system, so keeping fuzzy matches in git is not much help.
> > 
> > Finally, by default msgids are sorted based on source location. Thus, if
> > a bit of code with translatable text is moved from one file to another,
> > it may shift around in the .po file, despite the msgid not itself changing.
> > If the msgids were sorted alphabetically, the .po files would have
> > stable ordering when code is refactored.
> > 
> > This patch takes advantage of the above observations to canonicalize
> > and minimize the content stored for .po files in git. Instead of storing
> > the real .po files, we now store .mini.po files.
> > 
> > The .mini.po files are the same file format as .po files, but have no
> > source location comments, are sorted alphabetically, and all fuzzy
> > msgstrs and msgids with no translation are discarded. This cuts the size
> > of content in the po directory from 109MB to 19MB.
> > 
> > Users working from a libvirt git checkout who need the full .po files
> > can run "make update-po", which merges the libvirt.pot and .mini.po
> > file to create a .po file containing all the content previously stored
> > in git.
> > 
> > Conversely, if a full .po file has been modified, for example, by
> > downloading new content from Zanata, the .mini.po files can be updated
> > by running "make update-mini-po". The resulting diffs of the .mini.po
> > file will clearly show the changed translations without any of the noise
> > that previously obscured content. Being able to see content changes
> > clearly actually identified a bug in the zanata python client where it
> > was adding bogus "fuzzy" annotations to many messages:
> > 
> >  https://bugzilla.redhat.com/show_bug.cgi?id=1564497
> > 
> > Users working from libvirt releases should not see any difference in
> > behaviour, since the tarballs only contain the full .po files, not the
> > .mini.po files.
> > 
> > As an added benefit, generating tarballs with "make dist" will no
> > longer cause creation of dirty files in git, since it won't touch the
> > .mini.po files, only the .po files which are no longer kept in git.
> > 
> > To avoid creating a single commit 100+MB in size, each language is
> > minimized separately in a following commit.
> 
> From a brief look at those, the few Slovak "translations" are all in
> English and many of the translation team pages still point to transifex,
> but I assume that data comes from Zanata.

Yeah, there are a few other languages too where, for unknown reasons, the
English has been duplicated into the translation. I could go clicky-clicky
and kill that in the Zanata UI, but there's a lot, so I want to figure out a
way to automatically extract that list of bad translations & cull them all
in one go via the API.
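
As a starting point for the extraction half, something like the Python
sketch below could list the msgids whose msgstr is just a copy of the
English. It is only an illustration under a few assumptions: the function
names are invented, plural and msgctxt entries are skipped, and it reads
.po text locally rather than talking to the Zanata API, which I have not
looked at yet.

```python
def _unquote(fragment):
    """Strip the surrounding double quotes from a PO string fragment."""
    fragment = fragment.strip()
    return fragment[1:-1]


def parse_po(text):
    """Return a list of (msgid, msgstr) pairs; plural entries are skipped."""
    entries = []
    msgid = msgstr = field = None

    def flush():
        if msgid is not None and msgstr is not None:
            entries.append((msgid, msgstr))

    # The trailing "" guarantees the last entry gets flushed.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line.startswith("msgid "):
            flush()
            msgid, msgstr, field = _unquote(line[6:]), None, "msgid"
        elif line.startswith("msgstr "):
            msgstr, field = _unquote(line[7:]), "msgstr"
        elif line.startswith('"') and field == "msgid":
            msgid += _unquote(line)
        elif line.startswith('"') and field == "msgstr":
            msgstr += _unquote(line)
        else:
            # Blank line, comment, msgid_plural, msgstr[N], ...
            flush()
            msgid = msgstr = field = None
    return entries


def english_copies(text):
    """Return msgids whose translation is identical to the source string."""
    return [
        msgid for msgid, msgstr in parse_po(text) if msgid and msgid == msgstr
    ]
```

Running english_copies() over each language's .po content would give a
per-language list of candidates to cull. Strings are compared in their raw
escaped form, so differences in line wrapping between msgid and msgstr do
not matter. Some hits may be legitimate (a short string can be the same in
both languages), so the list would still want a quick review.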

Good point about the translation URLs pointing to transifex. I'll submit
another patch for that too.

> > Signed-off-by: Daniel P. Berrangé <berrange at redhat.com>
> > ---
> > .gitignore               |  3 +++
> > build-aux/minimize-po.pl | 37 +++++++++++++++++++++++++++++++++
> > po/Makefile.am           | 30 ++++++++++++++-------------
> > po/README.md             | 53 +++++++++++++++++++++++++++++++++++++++++-------
> > 4 files changed, 102 insertions(+), 21 deletions(-)
> > create mode 100755 build-aux/minimize-po.pl
> > 
> 
> Reviewed-by: Ján Tomko <jtomko at redhat.com>
> 
> Jano



Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



