[Spacewalk-list] Ongoing jabberd/osad issues.

Daryl Rose darylrose at outlook.com
Mon Aug 22 14:57:46 UTC 2016


Matt,


Thank you very much for this information.  There was a lot here to process, so I had to read it multiple times to make sure that I understood what you were doing.


One of the errors that I saw in the osa-dispatcher log was an invalid password.  I ran the SQL command as suggested and saw that there were in fact two entries for rhn-dispatcher: one from the original install a year ago, and a second one with the FQDN.  I removed both entries, cleaned out the databases, and restarted jabberd/osa-dispatcher.  I'll keep an eye on things for a while and see if this helps with my database issues.


Oh, almost forgot... I do not see an entry in /etc/rhn/rhn.conf for osa_dispatcher.debug.  I added it and stopped/started osa-dispatcher, but I did not see any additional logging.


Thanks


Daryl

________________________________
From: spacewalk-list-bounces at redhat.com <spacewalk-list-bounces at redhat.com> on behalf of Matt Moldvan <matt at moldvan.com>
Sent: Friday, August 19, 2016 9:57 AM
To: spacewalk-list at redhat.com
Subject: Re: [Spacewalk-list] Ongoing jabberd/osad issues.

To answer your question about how you can tell if clients are registering (to OSA dispatcher, not jabber), set osa_dispatcher.debug = 4 (goes up to 9 but that is -too much- info in my experience) in /etc/rhn/rhn.conf and restart osa-dispatcher.  This will tell you if systems are subscribing to the dispatcher's presence and are ready to receive actions.
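For reference, that's all of one line in /etc/rhn/rhn.conf (a minimal sketch; the key and value are the ones just mentioned, and the restart assumes sysvinit-style scripts):

# /etc/rhn/rhn.conf
osa_dispatcher.debug = 4

# pick up the change
service osa-dispatcher restart

With that in place, you'll see callbacks like these in the osa-dispatcher log: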

2016/07/28 08:16:28 -05:00 3187 0.0.0.0: osad/jabber_lib._roster_callback('Updating the roster', <iq to='rhn-dispatcher-sat@somefqdn/superclient' type='set' id='d0i9bb6qg2mxcvod2204d697ar81obq2b9cmljd1'><query xmlns = 'jabber:iq:roster' ><item jid='osad-5987ca54b5@somefqdn' subscription='both' /></query></iq>)
2016/07/28 08:16:28 -05:00 3187 0.0.0.0: osad/jabber_lib._presence_callback('rhn-dispatcher-sat@somefqdn/superclient', osad-5987ca54b5@somefqdn, u'subscribed')

For jabber connections, check /var/log/messages for jabberd entries via syslog.  I had to change rsyslog settings to see all of them, as there were so many coming in when I'd restart the jabber services that rsyslog would start to rate-limit them.
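In case you hit the same thing, the rate limiting I mean is the imuxsock throttling in rsyslog; loosening it is something like this (a hedged sketch for RHEL 6-era rsyslog; newer systems also have journald rate limits to consider):

# /etc/rsyslog.conf -- don't drop bursts of jabberd messages on the local socket
$SystemLogRateLimitInterval 0
$SystemLogRateLimitBurst 0

service rsyslog restart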

Also, osad should -not- be installed on the masters; it conflicts with the OSA dispatcher packages...

If you have to do it manually, do this:

In the Spacewalk back end database (run "sudo spacewalk-sql -i", or run spacewalk-sql -i as root), run "select * from rhnpushdispatcher" to see what OSA dispatcher thinks the password is.  Stop jabberd, delete the entries in that table, delete the Berkeley DB files, then start jabberd and osa-dispatcher.  I have more about the issue below, but it's a very long and boring story.
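Spelled out as commands, that manual sequence is roughly the following (a sketch assuming the default Berkeley DB files under /var/lib/jabberd/db and sysvinit-style service scripts; adjust for your environment):

echo "select * from rhnpushdispatcher;" | spacewalk-sql -i   # see what the dispatcher password entries look like
service osa-dispatcher stop
service jabberd stop
echo "delete from rhnpushdispatcher;" | spacewalk-sql -i     # remove the stale dispatcher entries
rm -f /var/lib/jabberd/db/*                                   # delete the Berkeley DB files
service jabberd start
service osa-dispatcher start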

--- long story

This is a long story... excuse the rant, but it's caused me a lot of late nights over the course of implementation, and being able to send it is a sort of catharsis.

So, I've spent the last year and a half trying various methods to keep OSA dispatcher on our Spacewalk masters stable, and trying to hammer the square peg that is jabberd2 into our round hole of using F5s for GTMs and LTMs.  Generally, for example for LDAP services, we create an SSL cert signed by our internal CA with the GTM FQDN as the subject, and any LTM FQDNs and pool members as Subject Alternative Names.  But the question wasn't really about that; I just wanted to share my frustration that jabberd2 (s2s specifically) doesn't play nicely when a system has the GTM FQDN in the jabberd config (routes get marked as invalid and so on).  Also, the default implementation in Spacewalk uses the Berkeley DB, which is notorious for crashing and corrupting itself and also doesn't support locking, so good luck trying to put it on an NFS mount and having multiple proxies write to that default database.

So instead I tried Postgres in the jabber config, which worked well for a time, but pointing all of our clients to the GTM FQDN and attempting to schedule actions in the GUI wasn't working out.  When you're in an environment that follows ITIL processes and is pushing changes to production systems in always-varying scheduled and approved change windows, having rhnsd pick up actions whenever it feels like it (I've seen 24+ hours) just doesn't work for us.

I had to disable snapshots, because when spacewalk-clone-by-date runs (daily) and when systems check in with Puppet (ensuring registration daily), Spacewalk sees that the base channel for thousands of systems has been updated, attempts to update them sequentially, locks the table, and causes memory exhaustion on the master and the database server.
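If I remember correctly, turning snapshots off is a one-line change in /etc/rhn/rhn.conf (double-check the option name against the rhn.conf defaults shipped with your version before relying on it):

# /etc/rhn/rhn.conf -- stop creating a snapshot for every system/channel change
enable_snapshots = 0

followed by a restart of the Spacewalk services.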

This week, I gave up on using the proxies for Jabber and pointed all my clients to the masters in their respective datacenters.  This brought up an interesting issue: in the OSA dispatcher Python code, an attempt is made to retrieve the password from the rhnPushDispatcher table in the Spacewalk database.  Unfortunately, it only matches on part of the Jabber ID, which is hard coded to "rhn-dispatcher-sat" in the code.  So, if you have two entries, it will most likely not pull the correct one, forever grab the wrong password, attempt to use that against Jabber, and then crash OSA dispatcher.
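You can check whether you're in that situation with a quick query (run it inside "spacewalk-sql -i"; the jabber_id column is the one the script below keys on, and I'm assuming the password column name here):

select jabber_id, password from rhnpushdispatcher where jabber_id like 'rhn-dispatcher-sat%';

If that returns more than one row, the dispatcher can end up grabbing the wrong password.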

I personally have two masters (and two OSA dispatchers) because I want datacenter redundancy in case of an issue in one.  I also want to use the GUI via the GTM in global availability mode again, for redundancy.  So, having OSA dispatcher running on both is very preferable in my situation.  Unfortunately, with the issue I mentioned above about the dispatcher code not pulling the password properly from rhnPushDispatcher when there are multiple entries, I had to implement an ugly hack.

Another issue to throw in here was that jabber and OSA dispatcher would randomly crash.  SM and C2S would segfault at random times, and OSA dispatcher would error out for various reasons:
  - timeouts when it was loading the rows from "active" and "roster-items"
  - invalidclient errors (I didn't look into these much, but they may have come from old versions of the OSAD components on the clients; it's ridiculous that one bad client could cause the whole dispatcher process to bomb out, though)
  - SELinux errors that weren't covered by running osa-dispatcher-selinux-enable (I had to run audit2allow and semodule more times than I'm proud of, a few times in a while loop, because fixing one SELinux error would lead to another being denied in the next line of the code; there's a sketch of that loop below)
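For what it's worth, the SELinux loop looked roughly like this (plain audit2allow/semodule usage; the module name is arbitrary, and you should review the generated policy instead of installing it blindly):

while true; do
    grep denied /var/log/audit/audit.log | grep osa_dispatcher | audit2allow -M osa_dispatcher_local
    semodule -i osa_dispatcher_local.pp
    sleep 60
done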

Below is the script that I mentioned, which works around the OSA dispatcher laziness of using a "like" statement to pull the password from rhnPushDispatcher.  It's ugly, but the dispatcher daemon has stayed up for almost 24 hours now (I consider this a win after all my other issues and my year-and-a-half-long struggle with the software) and we have ~5.5k systems online with it.  It runs from cron every 2 minutes to check on both the jabberd and osa-dispatcher services.

Note that the SQL file below assumes the jabber config on the masters has been changed to use SQLite (another attempt in the long list of things I tried to keep things stable).  A similar method would most likely work with other back ends, but the fixjabber.sql file below would need updating.

I don't like the idea of deleting the entire Jabber database every time; I think it's a cop-out, and it requires all systems to recreate entries in multiple tables of that database, whether it's Berkeley DB or Postgres or whatever.

---

[me at ourmaster1 ~]$ cat /var/log/fixjabber.sh.out
osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 11:27:59 CDT 2016
[me at ourmaster2 ~]$ cat /var/log/fixjabber.sh.out
osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 11:27:01 CDT 2016

[me at ourmaster1 ~]$ sudo cat /usr/local/bin/fixjabber.sh
#!/bin/bash
PATH=/sbin:/usr/bin:/bin
LOGFILE=/var/log/$(basename "$0").out
# Seed the counter file on first run.
if [ ! -f "${LOGFILE}" ]; then
        echo "osa-dispatcher and jabberd have been restarted 0 times since $(date)" >> "${LOGFILE}"
fi
# If either service is down, clean up the dispatcher credentials and jabber state, then restart both.
service jabberd status && service osa-dispatcher status
if [ $? -ne 0 ]; then
        service osa-dispatcher stop
        service jabberd stop
        # Remove this master's entry from rhnPushDispatcher so the dispatcher re-registers cleanly.
        echo "delete from rhnpushdispatcher where jabber_id='rhn-dispatcher-sat@$(uname -n)/superclient';" | spacewalk-sql -i
        # Clear the dispatcher's rows from the SQLite jabber database (see fixjabber.sql below).
        sqlite3 /var/lib/jabberd/db/sqlite.db < /usr/local/etc/fixjabber.sql
        service jabberd start
        service osa-dispatcher start
        # Bump the restart counter in the log file (field 7 is the count).
        oldnum=$(cut -d ' ' -f7 "${LOGFILE}")
        newnum=$(expr $oldnum + 1)
        sed -i "s/$oldnum/$newnum/g" "${LOGFILE}"
fi

[me at ourmaster1 ~]$ cat /usr/local/etc/fixjabber.sql
delete from authreg where username = 'rhn-dispatcher-sat';
delete from "roster-items" where "collection-owner" = 'rhn-dispatcher-sat at ourmaster1.fqdn';
delete from status where "collection-owner" = 'rhn-dispatcher-sat at ourmaster1.fqdn';
delete from active where "collection-owner" = 'rhn-dispatcher-sat at ourmaster1.fqdn';



On Fri, Aug 19, 2016 at 8:34 AM Daryl Rose <darylrose at outlook.com> wrote:

Basically you're doing everything that I've been doing with the exception of the db_recover command.  I was not familiar with that command.


How can I tell if the clients are self registering or not?


Thank you.


Daryl


________________________________
From: Robert Paschedag <robert.paschedag at web.de>
Sent: Friday, August 19, 2016 3:42 AM
To: Daryl Rose
Cc: spacewalk-list at redhat.com
Subject: Re: [Spacewalk-list] Ongoing jabberd/osad issues.

Now I also had a problem with the database. I just wanted to check which logfiles of the jabber db are not needed anymore, as stated in
http://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/transapp/logfile.html

Just running "db_archive" killed my jabber db. And....fixing it with

db_recover -v or
db_recover -c

within /var/lib/jabber/db

did not work.

So I was also in the situation to "clean" the database.

1. Stop jabberd (/etc/init.d/jabberd stop)
2. Stop osa-dispatcher (/etc/init.d/osa-dispatcher stop)
3. Remove contents of /var/lib/jabber/db (rm -f /var/lib/jabber/db/*)
4. Start jabber (/etc/init.d/jabberd start)
5. Start osa-dispatcher (/etc/init.d/osa-dispatcher start)

I thought I should restart the osad client everywhere... but no, the
clients are just re-registering themselves automatically. Of course, I
have to check this on every client, but what I have checked so far is
looking good.

Regards
Robert


On 18.08.2016 at 09:30, Robert Paschedag wrote:
> Hi Daryl,
>
> as long as there are no error messages in the logs indicating a problem with the jabber db, I wouldn't do anything with the db.
>
> As I wrote earlier, I only had to repair the db once in about 3 1/2 years.
>
> So, what I would do now is really delete the jabber db (back it up... just in case) to start with a "clean" install. If the clients (that already have authentication information) do not re-register automatically, you should go to the client, stop osad, remove /etc/sysconfig/rhn/osad-auth.conf and start osad again. The client should then register and you should see its status in the web GUI as "online". If not, check /var/log/rhn/osad.log on the client (if I remember correctly) and the osa-dispatcher logs on the server.
>
> I also wrote that my spacewalk servers are NOT clients of themselves. I don't think that should be a problem, but just for "testing" you should deactivate the osad "client" on the spacewalk server.
>
> Start with one test server.
>
> Good luck.
>
> Regards
> Robert
> On 17.08.2016 20:43, Daryl Rose <darylrose at outlook.com> wrote:
>>
>> I've posted here issues that I've had with jabberd and osad, as have others.  But I haven't gotten things resolved, so I am posting additional information.
>>
>>
>> I put SW into production about a year ago.  After a period of time, I noticed issues with the WUI, servers not reporting correctly, and other issues.  Google searches showed that I needed to shut down Spacewalk and remove all the contents of /var/lib/jabberd/db.  This seemed to work, but after a few months, I realized that osad was no longer communicating with osa-dispatcher.
>>
>>
>> I started doing some additional research and learned that was not a good way to resolve this issue.  According to the official Spacewalk documentation, I should create a checkpoint and then clean up log files while keeping the database and auth database files.
>>
>>
>> https://fedorahosted.org/spacewalk/wiki/JabberDatabase
>>
>> These are the steps that I followed:
>>
>>
>> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/ ## mark logs for deletion
>> /usr/bin/db_archive -d -h /var/lib/jabberd/db/  ## delete logs
>> service jabberd restart
>>
>> However, this also causes problems with jabberd and osad.  If I use the commands as the documentation instructs, then osa-dispatcher will start, but die, and I get errors in the log that there is an invalid password.
>>
>>
>> So to help explain my issue, I ran a test and tried to capture everything that I could and I'll post it here.
>>
>>
>> 1. Listing of /var/lib/jabberd/db
>>
>> [root@<spwalk-server> db]# ls
>> __db.001  __db.006        log.0000000004  log.0000000009  log.0000000014  log.0000000019  log.0000000024  sm.db
>> __db.002  authreg.db      log.0000000005  log.0000000010  log.0000000015  log.0000000020  log.0000000025
>> __db.003  log.0000000001  log.0000000006  log.0000000011  log.0000000016  log.0000000021  log.0000000026
>> __db.004  log.0000000002  log.0000000007  log.0000000012  log.0000000017  log.0000000022  log.0000000027
>> __db.005  log.0000000003  log.0000000008  log.0000000013  log.0000000018  log.0000000023  log.0000000028
>>
>> 2. Spacewalk Server Status
>>
>> [root@<spwalk-server> db]# spacewalk-service status
>> postmaster (pid  1175) is running...
>> router (pid 21431) is running...
>> sm (pid 21441) is running...
>> c2s (pid 21451) is running...
>> s2s (pid 21461) is running...
>> tomcat6 (pid 1304) is running...                           [  OK  ]
>> httpd (pid  1385) is running...
>> osa-dispatcher (pid  21479) is running...
>> rhn-search is running (1441).
>> cobblerd (pid 1491) is running...
>> RHN Taskomatic is running (1515).
>>
>> 3.  Most recent log file entry:
>>
>> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__
>> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', '<spwalk-server>.com')
>> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: osad/osa_dispatcher.fix_connection('Upstream notification server started on port', 1290)
>> 2016/08/17 07:44:14 -05:00 21476 0.0.0.0: osad/jabber_lib.process_forever
>>
>> 4.  Ran the commands as instructed in the jabberd documentation.
>>
>> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/ ## mark logs for deletion
>> /usr/bin/db_archive -d -h /var/lib/jabberd/db/  ## delete logs
>> service jabberd restart
>>
>> 5.  Log file entry:
>>
>> 2016/08/17 13:28:19 -05:00 21476 0.0.0.0: osad/jabber_lib.main('ERROR', 'Traceback (most recent call last):\n  File "/usr/share/rhn/osad/jabber_lib.py", line 121, in main\n    self.process_forever(c)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 179, in process_forever\n    self.process_once(client)\n  File "/usr/share/rhn/osad/osa_dispatcher.py", line 187, in process_once\n    client.retrieve_roster()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 729, in retrieve_roster\n    stanza = self.get_one_stanza()\n  File "/usr/share/rhn/osad/jabber_lib.py", line 801, in get_one_stanza\n    self.process(timeout=tm)\n  File "/usr/share/rhn/osad/jabber_lib.py", line 1055, in process\n    data = self._read(self.BLOCK_SIZE)\nSSLError: (\'OpenSSL error; will retry\', "(-1, \'Unexpected EOF\')")\n')
>> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__
>> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: osad/jabber_lib.setup_connection('Connected to jabber server', '<spwalk-server>.com')
>> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: osad/jabber_lib.register('ERROR', 'Invalid password')
>>
>> 6.  Spacewalk server status
>>
>> [root@<spwalk-server> db]# spacewalk-service status
>> postmaster (pid  1175) is running...
>> router (pid 27119) is running...
>> sm (pid 27129) is running...
>> c2s (pid 27139) is running...
>> s2s (pid 27149) is running...
>> tomcat6 (pid 1304) is running...                           [  OK  ]
>> httpd (pid  1385) is running...
>> osa-dispatcher dead but pid file exists
>> rhn-search is running (1441).
>> cobblerd (pid 1491) is running...
>> RHN Taskomatic is running (1515).
>>
>> 7. Long listing of /var/lib/jabberd/db
>>
>> [root@<spwalk-server> db]# ls -l
>> total 7536
>> -rw-r-----. 1 jabber jabber    24576 Aug 17 13:28 __db.001
>> -rw-r-----. 1 jabber jabber   204800 Aug 17 13:29 __db.002
>> -rw-r-----. 1 jabber jabber   270336 Aug 17 13:29 __db.003
>> -rw-r-----. 1 jabber jabber    98304 Aug 17 13:29 __db.004
>> -rw-r-----. 1 jabber jabber   753664 Aug 17 13:29 __db.005
>> -rw-r-----. 1 jabber jabber    57344 Aug 17 13:29 __db.006
>> -rw-r-----. 1 jabber jabber   368640 Aug 17 07:46 authreg.db
>> -rw-r-----. 1 jabber jabber 10485760 Aug 17 13:29 log.0000000031
>> -rw-r-----. 1 jabber jabber   487424 Aug 17 13:29 sm.db
>>
>> So, neither completely cleaning out the jabberd database/log files nor creating a checkpoint and removing the log files that need to be cleaned out works, so what can I do to get jabberd and osad to work, and to push out updates when I need to push them out?
>>
>>
>> Thank you.
>>
>>
>> Daryl
>>
>
> _______________________________________________
> Spacewalk-list mailing list
> Spacewalk-list at redhat.com
> https://www.redhat.com/mailman/listinfo/spacewalk-list
>
_______________________________________________
Spacewalk-list mailing list
Spacewalk-list at redhat.com
https://www.redhat.com/mailman/listinfo/spacewalk-list