[Pulp-list] repo sync runs hanging
Pier, Bryce
Bryce.Pier at Capella.edu
Fri Jan 30 23:06:23 UTC 2015
I’ve been having a lot of trouble with my Pulp server lately related to rpm repo syncs hanging/stalling. I thought the issue might have been related to the 2.6 beta build I was running (it fixed my bug, 1176698), but it doesn’t appear to be just that version.
I built a new Pulp server this week on version 2.5.3-0.2.rc. The RHEL 6.6 VM has 8 vCPUs, 8 GB of RAM, and 400 GB of SAN LUNs attached. Both /var/lib/pulp and /var/lib/mongodb are symlinked to the SAN LUN for performance.
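For reference, the symlink layout is roughly the sketch below. It is demonstrated under a temp directory so it can run anywhere; "san" stands in for the real LUN mount point on my box.

```shell
#!/bin/sh
# Sketch of the SAN-backed layout. ROOT stands in for / and "san" for the
# actual LUN mount point, so nothing on the real system is touched.
ROOT=$(mktemp -d)
mkdir -p "$ROOT/san/pulp" "$ROOT/san/mongodb" "$ROOT/var/lib"

# Point /var/lib/pulp and /var/lib/mongodb at the SAN-backed directories.
ln -s "$ROOT/san/pulp"    "$ROOT/var/lib/pulp"
ln -s "$ROOT/san/mongodb" "$ROOT/var/lib/mongodb"

# Both paths now resolve onto the LUN.
readlink "$ROOT/var/lib/pulp"
readlink "$ROOT/var/lib/mongodb"
```

(On the real box the directories also need the right ownership for the apache and mongodb users, of course.)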
Initially this new server was working great: I created and synced several rpm repos without any issues, but today the sync hangs/stalls started again. I’m beginning to wonder if something about the 2.5+ architecture isn’t handling the nearly 100,000 rpms that have been pulled into it.
When the stall happens it is always during the download of RPMs from the feed, but nothing is logged and no errors are thrown. I’ve let the process sit and run overnight and it never resumes. After canceling the sync task, I have to stop all of the Pulp processes, and one of the workers never stops:
# for s in {goferd,pulp_celerybeat,pulp_resource_manager,pulp_workers,httpd}; do service $s stop; done
goferd: unrecognized service
celery init v10.0.
Using configuration: /etc/default/pulp_workers, /etc/default/pulp_celerybeat
Stopping pulp_celerybeat... OK
celery init v10.0.
Using config script: /etc/default/pulp_resource_manager
celery multi v3.1.11 (Cipater)
> Stopping nodes...
> resource_manager at dvpuap01.capella.edu: QUIT -> 2387
> Waiting for 1 node -> 2387.....
> resource_manager at dvpuap01.capella.edu: OK
celery init v10.0.
Using config script: /etc/default/pulp_workers
celery multi v3.1.11 (Cipater)
> Stopping nodes...
> reserved_resource_worker-5 at dvpuap01.capella.edu: QUIT -> 2664
> reserved_resource_worker-2 at dvpuap01.capella.edu: QUIT -> 2570
> reserved_resource_worker-4 at dvpuap01.capella.edu: QUIT -> 2633
> reserved_resource_worker-1 at dvpuap01.capella.edu: QUIT -> 2540
> reserved_resource_worker-7 at dvpuap01.capella.edu: QUIT -> 2723
> reserved_resource_worker-3 at dvpuap01.capella.edu: QUIT -> 2602
> reserved_resource_worker-6 at dvpuap01.capella.edu: QUIT -> 2692
> reserved_resource_worker-0 at dvpuap01.capella.edu: QUIT -> 2513
> Waiting for 8 nodes -> 2664, 2570, 2633, 2540, 2723, 2602, 2692, 2513............
> reserved_resource_worker-5 at dvpuap01.capella.edu: OK
> Waiting for 7 nodes -> 2570, 2633, 2540, 2723, 2602, 2692, 2513....
> reserved_resource_worker-2 at dvpuap01.capella.edu: OK
> Waiting for 6 nodes -> 2633, 2540, 2723, 2602, 2692, 2513....
> reserved_resource_worker-4 at dvpuap01.capella.edu: OK
> Waiting for 5 nodes -> 2540, 2723, 2602, 2692, 2513....
> reserved_resource_worker-1 at dvpuap01.capella.edu: OK
> Waiting for 4 nodes -> 2723, 2602, 2692, 2513....
> reserved_resource_worker-7 at dvpuap01.capella.edu: OK
> Waiting for 3 nodes -> 2602, 2692, 2513....
> reserved_resource_worker-3 at dvpuap01.capella.edu: OK
> Waiting for 2 nodes -> 2692, 2513.....
> reserved_resource_worker-0 at dvpuap01.capella.edu: OK
> Waiting for 1 node -> 2692.................................................................................................................................................................................................................................................................................................................................................................................................................................................................^C
Session terminated, killing shell... ...killed.
If I run the for loop again, everything appears to clean up, but there is always a single process that I have to kill manually:
apache 2763 1 2 15:34 ? 00:02:09 /usr/bin/python -m celery.__main__ worker -c 1 -n reserved_resource_worker-6 at dvpuap01.capella.edu --events --app=pulp.server.async.app --loglevel=INFO --logfile=/var/log/pulp/reserved_resource_worker-6.log --pidfile=/var/run/pulp/reserved_resource_worker-6.pid
After killing this final process, I usually stop mongodb, start everything back up, and try the sync again. I’ve also tried rebooting the VM, but it doesn’t seem to be any more effective than just stopping and starting the services.
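For what it’s worth, my manual cleanup boils down to the loop below. This is just a sketch: on the real box the pidfiles live in /var/run/pulp; here a temp directory and a background sleep stand in for the pidfile directory and the stuck worker so the loop is runnable anywhere.

```shell
#!/bin/sh
# Sketch of the lingering-worker cleanup. PIDDIR would be /var/run/pulp on
# a real Pulp box; a background sleep stands in for the stuck celery worker.
PIDDIR=$(mktemp -d)
sleep 300 &
echo $! > "$PIDDIR/reserved_resource_worker-6.pid"

for pidfile in "$PIDDIR"/reserved_resource_worker-*.pid; do
    pid=$(cat "$pidfile")
    if kill -0 "$pid" 2>/dev/null; then
        echo "killing lingering worker pid $pid"
        kill -TERM "$pid"
        sleep 1
        # Escalate to SIGKILL if the worker ignored SIGTERM
        # (harmless on an already-dead pid).
        kill -0 "$pid" 2>/dev/null && kill -KILL "$pid" 2>/dev/null || true
    fi
    rm -f "$pidfile"
done
```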
Below are the repos I’ve successfully synced so far on the new server. (Notice the rhel-6-optional one has 7368 rpm units but hasn’t successfully finished downloading yet, even though I’ve killed and restarted it 3 times this afternoon.)
# pulp-admin rpm repo list
+----------------------------------------------------------------------+
RPM Repositories
+----------------------------------------------------------------------+
Id:                   ol5_x86_64_latest
Display Name:         ol5_x86_64_latest
Description:          None
Content Unit Counts:
  Erratum:            1116
  Package Category:   9
  Package Group:      103
  Rpm:                6761
  Srpm:               2292

Id:                   ol6_x86_64_latest
Display Name:         ol6_x86_64_latest
Description:          None
Content Unit Counts:
  Erratum:            1659
  Package Category:   14
  Package Group:      207
  Rpm:                13215
  Srpm:               3812

Id:                   epel6
Display Name:         epel6
Description:          None
Content Unit Counts:
  Erratum:            3668
  Package Category:   3
  Package Group:      208
  Rpm:                11135
  Yum Repo Metadata File: 1

Id:                   epel5
Display Name:         epel5
Description:          None
Content Unit Counts:
  Erratum:            1953
  Package Category:   5
  Package Group:      36
  Rpm:                6678
  Yum Repo Metadata File: 1

Id:                   epel7
Display Name:         epel7
Description:          None
Content Unit Counts:
  Erratum:            1252
  Package Category:   4
  Package Environment: 1
  Package Group:      209
  Rpm:                7161
  Yum Repo Metadata File: 1

Id:                   rhel-6-os
Display Name:         rhel-6-os
Description:          None
Content Unit Counts:
  Erratum:            2842
  Package Category:   10
  Package Group:      202
  Rpm:                14574
  Yum Repo Metadata File: 1

Id:                   rhel-5-os
Display Name:         rhel-5-os
Description:          None
Content Unit Counts:
  Erratum:            3040
  Package Category:   6
  Package Group:      99
  Rpm:                16668
  Yum Repo Metadata File: 1

Id:                   rhel-6-optional
Display Name:         rhel-6-optional
Description:          None
Content Unit Counts:
  Rpm:                7368
Thanks,
- Bryce