Executive Summary:
------------------

We were computing the number of translations in a po file incorrectly.
This led to the erroneous conclusion that we were losing translations
when importing a po file from Transifex.

Suggested Fix:
--------------

1) Use msgfmt instead of msgcmp to compute po file statistics.

2) Prevent msgmerge from incorporating fuzzy translations for new
   msgid's.

Detailed Analysis:
------------------

A msgid is the raw i18n string extracted from our source code and
provided to a translator to translate into a specific language.

A msgstr is the translation of a msgid into a specific language.

A po file is a set of (msgid, msgstr) pairs which provides the mapping
from a string in the source code to a language specific translated
string. There is one po file per national language.

A pot file is the collection of all msgid's in a translation catalog.
It serves as a "template" for po files (hence 'po' + 't' for
template). pot files are created and updated by extracting i18n marked
strings in the source code (typically with the tool xgettext).

As the source code base evolves the set of i18n strings changes. The
pot file is regenerated periodically so that its msgid's match the
current set of i18n strings.

When the pot file is updated each language po file must also be
updated so the msgid's in the po file match those in the pot file. New
msgid's in the pot file are added to the po file and msgid's in the po
file which are no longer in the pot file are removed. A translator
must subsequently add translations for the msgid's the new pot file
introduced. The process of updating a po file to match a pot file is
called "message merge" and is typically done via the msgmerge tool.

The messages in a po file may have optional flags attached to them
which help the translator or the automated i18n tool chain. One such
flag is called "fuzzy" and is meant to indicate the msgstr translation
of the msgid is not 100% accurate and needs review by a translator.
Fuzzy flags may be inserted by a translator or by the i18n tool chain.

msgmerge tries to make the translator's life easier by recognizing new
msgid's which are close to previous msgid's and copying the msgstr
associated with the closest matching msgid. These msgstr's are marked
with the fuzzy flag because they are not 100% correct, but they
hopefully provide the translator with a good starting point that she
may only need to tweak. It would be more work for a translator if any
edit whatsoever to a msgid caused the existing translation (msgstr) to
be thrown away, forcing the translator to re-translate the msgid from
scratch.

Any msgstr marked as fuzzy is NOT considered a valid translation. Only
valid msgstr's are presented to the end user. Fuzzy msgstr's are for a
translator's benefit only.

We have a make target called 'msg-stats' which computes statistics
concerning our i18n translations. It is used to see what percentage of
a language is translated and to generally validate the state of our
translation files.

The precise cause of our problem was that we were counting the number
of msgstr's in a po file and comparing that to the count of msgid's in
the pot file to compute how much of a language had been translated. In
other words, if the pot file had 10 msgid's and a po file had 8
msgstr's, that would indicate 80% of the strings in that language had
been translated. But if a msgstr has the fuzzy flag associated with it,
it is not a valid translation and does not count.

The fix was to replace our use of msgcmp with msgfmt --statistics,
which provides a count of translations (msgstr's without the fuzzy
flag), a count of fuzzy translations, and a count of msgid's without a
msgstr (i.e. untranslated strings). The original logic we were using
was copied from another project and was presumed to be correct;
however, it wasn't.

When we pull a po file from Transifex, fuzzy translations are never
included because they are not valid translations.
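
The difference between the old and the corrected counting can be
sketched in a few lines of Python. This is only an illustration, not
our msg-stats target and not a reimplementation of msgfmt: the
function name po_stats, the helper _unquote and the sample entries
(including the made-up translations) are all hypothetical, entries are
assumed to be separated by blank lines, plural forms are ignored, and
the header entry (empty msgid) is skipped.

```python
import re


def _unquote(token):
    """Strip the surrounding double quotes from a po string token."""
    token = token.strip()
    if len(token) >= 2 and token[0] == token[-1] == '"':
        return token[1:-1]
    return ""


def po_stats(po_text):
    """Count (translated, fuzzy, untranslated) entries in simple po text."""
    translated = fuzzy = untranslated = 0
    # Simplification: entries are assumed to be separated by blank lines.
    for block in re.split(r"\n\s*\n", po_text.strip()):
        is_fuzzy = False
        fields = {"msgid": "", "msgstr": ""}
        current = None
        has_entry = False
        for raw in block.splitlines():
            line = raw.strip()
            if line.startswith("#,"):
                # Flag comment, e.g. "#, fuzzy, python-format"
                is_fuzzy = "fuzzy" in [f.strip() for f in line[2:].split(",")]
            elif line.startswith("#"):
                continue  # other comments, including obsolete "#~" entries
            elif line.startswith("msgid "):
                current, has_entry = "msgid", True
                fields["msgid"] += _unquote(line[len("msgid "):])
            elif line.startswith("msgstr "):
                current = "msgstr"
                fields["msgstr"] += _unquote(line[len("msgstr "):])
            elif line.startswith('"') and current:
                fields[current] += _unquote(line)  # continuation line
            else:
                current = None  # plural forms etc. are ignored here
        if not has_entry or fields["msgid"] == "":
            continue  # skip comment-only blocks and the header entry
        if fields["msgstr"] == "":
            untranslated += 1
        elif is_fuzzy:
            fuzzy += 1  # has a msgstr, but it is not a valid translation
        else:
            translated += 1
    return translated, fuzzy, untranslated


# Made-up sample data: one translated, one fuzzy, one untranslated entry.
SAMPLE = '''\
msgid ""
msgstr "Content-Type: text/plain; charset=UTF-8\\n"

msgid "Permissions"
msgstr "Permisos"

#, fuzzy
msgid "Add Permission"
msgstr "Permisos"

msgid "Provisioning"
msgstr ""
'''

translated, fuzzy, untranslated = po_stats(SAMPLE)
print(translated, fuzzy, untranslated)  # 1 1 1
```

Counting every non-empty msgstr in this sample would report 2 of 3
strings as translated, whereas only 1 of 3 is a valid translation.
That gap is the one which made a pull from Transifex look like data
loss.
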
Transifex does keep the fuzzy msgstr's as "suggestions" to the
translator. The confusion arose when we pulled a po file from Transifex
and the number of translations appeared to decrease from what we had in
our copy of the po file (presuming our copy contained fuzzy
translations) because we were incorrectly counting fuzzy translations.
This implied there was some type of data loss, when in fact there
wasn't.

How good are fuzzy suggestions?
-------------------------------

While trying to figure out how different versions of a po file
differed, and why, I wrote a tool to analyze pot and po files. One of
the things the tool allowed me to do was see how msgmerge picked
"close" strings to use as fuzzy suggestions. In some cases it worked
very well, however in a large number of cases the suggestion was
significantly incorrect, and a translator who was not careful might
accept an inaccurate suggestion. Below are some examples from our own
code base.

The string:
    "Permissions"
Was suggested for the strings:
    "Add Permission"
    "Provisioning"
    "Permission Type"
    "Permission name"
    "Self Service Permissions"

The string:
    "Default users group"
Was suggested for the strings:
    "Failed users/groups"
    "Default user objectclasses"
    "Don't create user private group"

The string:
    "Member service groups"
Was suggested for the strings:
    "Member HBAC service groups"
    "Indirect Member HBAC service group"
    "Member user group"

The string:
    "Password used in bulk enrollment"
Was suggested for the string:
    "Generate a random password to be used in bulk enrollment"

The string:
    "Certificate"
Was suggested for the strings:
    "Certificate Hold"
    "New Certificate"
    "Certificate Revoked"
    "No Valid Certificate"
    "Host Certificate"
    "Service Certificate"

The string:
    "External host"
Was suggested for the strings:
    "External"
    "External User"
    "RunAs External User"
    "RunAs External Group"

The string:
    "Added sudo rule "%(value)s""
Was suggested for the strings:
    "Changed password for "%(value)s""
    "Added HBAC rule "%(value)s""

The string:
    "Modified service "%(value)s""
Was suggested for the strings:
    "Modified privilege "%(value)s""
    "Modified HBAC service "%(value)s""
    "Modified privilege "%(value)s""
    "Modified selfservice "%(value)s""

The string:
    "type, filter, subtree and targetgroup are mutually exclusive"
Was suggested for the string:
    "filter and memberof are mutually exclusive"

As you can see from the above examples, the suggested string can be
significantly different in content, meaning and intent from the actual
string. A translator would have to be alert when presented with
suggestions to recognize the sometimes subtle but critical distinction
between the actual string and the suggestion.

My concern is with human nature. When a translator is presented with
hundreds of strings to translate and a large proportion of those have
inaccurate suggestions which can just be "clicked through" and
accepted, it seems to me there is a high probability of introducing
inaccurate translations. I think the quality of our translations would
be better if we didn't provide suggestions which can be clicked through
and accepted; instead the translator would have to type the new
translation in from scratch. Yes, this would mean more work for the
translator, but it doesn't seem terribly onerous and could result in a
much higher quality translation.

My suggestion would be to turn off the automatic generation of
suggestions (i.e. fuzzy msgstr's) during the message merge phase.

Comments?