plague: Job waited too long for repo to unlock. Killing it...
Michael Schwendt
bugs.michael at gmx.net
Mon Dec 31 17:04:06 UTC 2007
On Mon, 31 Dec 2007 11:00:12 -0500, Dan Williams wrote:
> On Sun, 2007-12-30 at 17:54 +0100, Michael Schwendt wrote:
> > If in a failed job.log you see the message
> >
> > Job waited too long for repo to unlock. Killing it...
> >
> > please notify me.
> >
> > It's a problem in the plague server code that results in a denial of
> > service for subsequent build jobs. I have a traceback from Dec 28th, but
> > in the context of the source code it doesn't make sense yet (because a few
> > lines earlier the code ensures that the files to be copied exist and are
> > readable). Buildsys runs a slightly modified version that adds a bit more
> > debug output in this area.
>
> Maybe just trap the exception, print it out, and continue? That way at
> least the server doesn't fall over, it just fails to copy one item.
The buildsys already runs a Repo.py patched that way. It catches OSError
and IOError, releases the locks, and prints/logs the results of the file
access check that runs before the files are copied.
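
For readers following along, here is a minimal sketch of that kind of guard.
It is not the actual Repo.py patch: the function name copy_into_repo, the
logger name and the use of a plain threading.Lock are assumptions made for
illustration only.

```python
import logging
import os
import shutil
import threading

log = logging.getLogger("repo-sketch")  # hypothetical logger name


def copy_into_repo(paths, dest_dir, lock):
    """Copy files while holding `lock`; return [(path, error)] for failures.

    Hypothetical helper, not plague's real API.
    """
    failed = []
    with lock:  # the lock is released even if an unexpected error escapes
        for src in paths:
            # Log the access check results right before the copy attempt.
            log.debug("%s: exists=%s readable=%s", src,
                      os.path.exists(src), os.access(src, os.R_OK))
            try:
                shutil.copy(src, dest_dir)
            except (OSError, IOError) as exc:
                # One failed item no longer takes the whole copy run down.
                log.error("copy of %s failed: %s", src, exc)
                failed.append((src, exc))
    return failed
```

Called with a list of source paths, a destination directory and a
threading.Lock(), it returns the items that could not be copied, and the lock
is free again afterwards, so later jobs are not left waiting for an unlock.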
I also added a debug line in the package job code to see when it starts
deleting the copied files. Normally it waits until a callback tells it
that all files are copied.
> It might also help debugging to see if only specific files can't be
> copied...
The offending file was copied, but shutil.copy() failed in its second part
when trying to copy the file mode. It didn't find the source file it had
just copied. :-}
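
For context, shutil.copy() does its work in two steps, shutil.copyfile() for
the data and then shutil.copymode() for the permission bits, and the second
step looks at the source file again. The sketch below reproduces the symptom
by removing the source between the two steps. That interleaving is only an
assumption for the sake of illustration, not a confirmed diagnosis of what
happened on the buildsys, and the names two_step_copy and demo are
hypothetical.

```python
import os
import shutil
import tempfile


def two_step_copy(src, dst, between=None):
    """Mimic shutil.copy() for a file destination, with a hook in the middle."""
    shutil.copyfile(src, dst)   # step 1: copy the file contents
    if between is not None:
        between()               # simulate another actor touching src
    shutil.copymode(src, dst)   # step 2: stat src again to copy its mode


def demo():
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "src.txt")
    dst = os.path.join(workdir, "dst.txt")
    with open(src, "w") as fh:
        fh.write("payload")
    try:
        # Delete the source after the data copy but before the mode copy.
        two_step_copy(src, dst, between=lambda: os.remove(src))
    except OSError as exc:
        return os.path.exists(dst), exc
    return os.path.exists(dst), None
```

Here demo() returns a pair: whether the destination file exists, and the
exception raised by the mode-copy step, if any. In this sketch the data
arrives at the destination even though the copy as a whole reports a missing
source file.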