[libvirt PATCH v3 1/3] scripts: Check spelling

Peter Krempa pkrempa at redhat.com
Fri Jan 28 15:48:42 UTC 2022


On Fri, Jan 21, 2022 at 10:41:48 +0100, Tim Wiederhake wrote:
> This is a wrapper for codespell [1], a spell checker for source code.
> Codespell does not compare words to a dictionary, but rather works by
> checking words against a list of common typos, making it produce fewer
> false positives than other solutions.
> 
> The script in this patch works around the lack of per-directory ignore
> lists and some oddities regarding capitalization in ignore lists.
> The ".codespellrc" file is used to coarsly filter out translation and
> git files, as scanning those makes up for roughly 50% of the run time
> otherwise.
> 
> [1] (https://github.com/codespell-project/codespell/)
> 
> Signed-off-by: Tim Wiederhake <twiederh at redhat.com>
> ---
>  .codespellrc              |   2 +
>  scripts/check-spelling.py | 135 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 137 insertions(+)
>  create mode 100644 .codespellrc
>  create mode 100755 scripts/check-spelling.py
> 
> diff --git a/.codespellrc b/.codespellrc
> new file mode 100644
> index 0000000000..0c45be445b
> --- /dev/null
> +++ b/.codespellrc
> @@ -0,0 +1,2 @@
> +[codespell]
> +skip = .git/,*.po

This doesn't seem to work for me, after patch 2/3 I'm getting the
following:

>>> MALLOC_PERTURB_=230 /bin/python3 /home/pipo/libvirt/scripts/check-spelling.py --ignore-untracked
――――――――――――――――――――――――――――――――――――― ✀  ―――――――――――――――――――――――――――――――――――――
stdout:
(".git/logs/HEAD", "lenght"),	# line 187, "length"?
(".git/logs/HEAD", "programatic"),	# line 424, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 570, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 583, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 714, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 727, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 802, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 810, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 816, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 935, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 944, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 1005, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 1007, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 1008, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 1009, "programmatic"?
(".git/logs/HEAD", "programatic"),	# line 1014, "programmatic"?

...

(I'll be elaborating a bit more on the exclude pattern later after
playing with codespell for a bit).


> diff --git a/scripts/check-spelling.py b/scripts/check-spelling.py
> new file mode 100755
> index 0000000000..ce3e7d89f0
> --- /dev/null
> +++ b/scripts/check-spelling.py
> @@ -0,0 +1,135 @@
> +#!/usr/bin/env python3
> +
> +import argparse
> +import re
> +import subprocess
> +import os
> +
> +
> +IGNORE_LIST = [
> +    # ignore this script
> +    ("scripts/check-spelling.py", []),
> +
> +    # 3rd-party: keycodemapdb
> +    ("src/keycodemapdb/", []),
> +
> +    # 3rd-party: VirtualBox SDK
> +    ("src/vbox/vbox_CAPI", []),
> +
> +    # 3rd-party: qemu
> +    ("tests/qemucapabilitiesdata/caps_", []),
> +
> +    # other
> +    ("", ["msdos", "MSDOS", "wan", "WAN", "hda", "HDA", "inout"]),

These can be replaced by --ignore-words-list msdos,wan,hda,inout


> +    ("NEWS.rst", "crashers"),

IMO this can be reworded.

> +    ("docs/gitdm/companies/others", "Archiv"),
> +    ("docs/glib-adoption.rst", "preferrable"),

This too.

> +    ("docs/js/main.js", "whats"),

Skip all of docs/js?

> +    ("examples/polkit/libvirt-acl.rules", ["userA", "userB", "userC"]),
> +    ("src/libvirt-domain.c", "PTD"),
> +    ("src/libxl/libxl_logger.c", "purposedly"),
> +    ("src/nwfilter/nwfilter_dhcpsnoop.c", "ether"),
> +    ("src/nwfilter/nwfilter_ebiptables_driver.c", "parm"),
> +    ("src/nwfilter/nwfilter_learnipaddr.c", "ether"),
> +    ("src/qemu/qemu_agent.c", "crypted"),
> +    ("src/qemu/qemu_agent.h", "crypted"),
> +    ("src/qemu/qemu_process.c", "wee"),

This too.

> +    ("src/security/apparmor/libvirt-lxc", "devic"),
> +    ("src/security/apparmor/libvirt-qemu", "readby"),
> +    ("src/storage_file/storage_file_probe.c", "conectix"),
> +    ("src/util/virnetdevmacvlan.c", "calld"),
> +    ("src/util/virtpm.c", "parm"),
> +    ("tests/qemuagenttest.c", "IST"),
> +    ("tests/storagepoolxml2xml", "cant"),
> +    ("tests/sysinfodata/", "sie"),

This is technically 3rd party.

> +    ("tests/testutils.c", "nIn"),
> +    ("tests/vircgroupdata/ovirt-node-6.6.mounts", "hald"),

This is also 3rd party
`
> +    ("tests/virhostcpudata/", "sie"),

This too.

> +    ("tools/virt-host-validate-common.c", "sie"),
> +]
> +
> +
> +def ignore(filename, linenumber, word, suggestion):
> +    if len(word) <= 2:
> +        return True
> +
> +    for f, w in IGNORE_LIST:
> +        if not filename.startswith(f):
> +            continue
> +        if word in w or not w:
> +            return True
> +    return False
> +
> +
> +def main():
> +    line_pattern = re.compile("^(.*):(.*): (.*) ==> (.*)$")
> +    output_template = "(\"{0}\", \"{2}\"),\t# line {1}, \"{3}\"?"
> +
> +    parser = argparse.ArgumentParser(description="Check spelling")
> +    parser.add_argument(
> +        "dir",
> +        help="Path to source directory. "
> +        "Defaults to parent directory of this script",
> +        type=os.path.realpath,
> +        nargs='?')
> +    parser.add_argument(
> +        "-i",
> +        "--ignore",
> +        help="File to ignore. Can be specified more than once",
> +        metavar="FILE",
> +        default=list(),
> +        action="append")

This is IMO pointless to have as an option, as we have an internal list
of ignore files and user it's not used with

> +    parser.add_argument(
> +        "--ignore-untracked",
> +        help="Ignore all files not tracked by git",
> +        action="store_true")

IMO this option should not exist at all, there's no good reason to check
files which are not tracked by git.

> +    args = parser.parse_args()
> +
> +    if not args.dir:
> +        args.dir = os.path.dirname(os.path.dirname(os.path.realpath(__file__)))
> +
> +    if args.ignore_untracked:
> +        args.ignore.extend(subprocess.check_output(
> +            ["git", "-C", args.dir, "ls-files", "--others"],
> +            universal_newlines=True).split("\n"))
> +
> +    try:
> +        process = subprocess.run(
> +            [
> +                "codespell",
> +                args.dir,
> +                "--config",
> +                os.path.join(args.dir, ".codespellrc")],
> +            stdout=subprocess.PIPE,
> +            stderr=subprocess.PIPE,
> +            universal_newlines=True)
> +    except FileNotFoundError:
> +        exit("error: codespell not found")
> +    if process.returncode not in (0, 65):
> +        exit("error: unexpected returncode %s" % process.returncode)
> +
> +    if process.stderr:
> +        exit("error: unexpected output to stderr: \"%s\"" % process.stderr)
> +
> +    findings = 0
> +    for line in process.stdout.split("\n"):
> +        line = line.strip().replace(args.dir, "").lstrip("/")
> +        if not line:
> +            continue
> +
> +        match = line_pattern.match(line)
> +        if not match:
> +            exit("error: unexpected line: \"%s\"" % line)
> +
> +        if match.group(1) in args.ignore or ignore(*match.groups()):
> +            continue
> +
> +        print(output_template.format(*match.groups()))
> +        findings += 1
> +
> +    if findings:
> +        exit("error: %s spelling errors" % findings)
> +

I have a rather substantial problem with runtime of this contraption

on my box a pure 'ninja test' takes a bit below 4 seconds ...

real	0m3.604s
user	0m34.187s
sys	0m10.662s

Now a pure codespell invocation with the arguments used here takes more
than 4,2 seconds. In case it's run as part of the test suite it doubles
the execution time. That is IMO not acceptable.

We should be using the 'skip' pattern a bit more, either through
generating a config file or commandline arguments.

This will also allow to ignore the untracked files directly, by putting
them into the ignore pattern, rather than filtering them out.

I also propose to ignore all of 'tests/' there's simply too much noise
and this would waste too many cpu cycles.

With the following ignore pattern:

 codespell -L msdos,inout,hda,wan -S .git,po,tests,vbox_CAPI*,*.js,polkit,*.Dockerfile,keycodemapdb,cpu_map

The analysis is now down to 1.4 seconds from 4.2 and also the invocation
via ninja test is now bearable again.

IMO it's not worth to fiddle with tests/ to try to include the source
files and the time saving is immense.

The above exclude pattern also fixes what I've complained above.
codespell is matching either full absolute paths or a subdirectory name
but it can't contain a '/' if it's relative.




More information about the libvir-list mailing list