[Spacewalk-list] Ongoing jabberd/osad issues.

Matt Moldvan matt at moldvan.com
Fri Aug 19 14:57:56 UTC 2016


To answer your question about how you can tell if clients are registering
(to OSA dispatcher, not jabber), set osa_dispatcher.debug = 4 (goes up to 9
but that is -too much- info in my experience) in /etc/rhn/rhn.conf and
restart osa-dispatcher.  This will tell you if systems are subscribing to
the dispatcher's presence and are ready to receive actions.

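For example (config path and service name per a stock Spacewalk install):

# /etc/rhn/rhn.conf
osa_dispatcher.debug = 4

[root@ourmaster1 ~]# service osa-dispatcher restart

After the restart, /var/log/rhn/osa-dispatcher.log should show entries like
the following when a client subscribes:
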
2016/07/28 08:16:28 -05:00 3187 0.0.0.0:
osad/jabber_lib._roster_callback('Updating the roster', <iq
to='rhn-dispatcher-sat@somefqdn/superclient' type='set'
id='d0i9bb6qg2mxcvod2204d697ar81obq2b9cmljd1'><query
xmlns='jabber:iq:roster'><item jid='osad-5987ca54b5@somefqdn'
subscription='both' /></query></iq>)
2016/07/28 08:16:28 -05:00 3187 0.0.0.0:
osad/jabber_lib._presence_callback('rhn-dispatcher-sat@somefqdn/superclient',
osad-5987ca54b5@somefqdn, u'subscribed')

For jabber connections, check /var/log/messages for jabberd entries via
syslog.  I had to change rsyslog settings to see all of them, as there were
so many coming in when I'd restart the jabber services that rsyslog would
start to rate-limit them.
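If rsyslog is the one dropping them, relaxing its imuxsock rate limiting is
the usual fix; a minimal sketch for /etc/rsyslog.conf (these directives
exist in rsyslog 5.7.1 and later, and an interval of 0 disables the limit
entirely):

# /etc/rsyslog.conf -- stop rate-limiting the flood of jabberd messages
$SystemLogRateLimitInterval 0
$SystemLogRateLimitBurst 0

...then restart rsyslog ("service rsyslog restart") before bouncing the
jabber services.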

Also, osad should -not- be installed on the masters; it conflicts with the
OSA dispatcher packages...

If you have to reset the dispatcher credentials manually, do this.

In the Spacewalk back-end database (run "spacewalk-sql -i" via sudo or as
root), run "select * from rhnPushDispatcher;" to see what OSA dispatcher
thinks the password is.  Stop jabberd, delete the entries in that table,
delete the Berkeley DB files, then start jabberd and osa-dispatcher.  I have
more about the issue below, but it's a very long and boring story.
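Roughly, that manual sequence looks like this (a sketch of the steps above,
mirroring the script further down; adjust paths to your install):

service osa-dispatcher stop
service jabberd stop
echo "delete from rhnPushDispatcher;" | spacewalk-sql -i
rm -f /var/lib/jabberd/db/*        # the Berkeley DB files
service jabberd start
service osa-dispatcher start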

--- long story

This is a long story... excuse the rant, but it's caused me a lot of late
nights over the course of implementation, and being able to write it down
is a sort of catharsis.

So, I've spent the last year and a half trying various methods to keep OSA
dispatcher on our Spacewalk masters stable, and trying to hammer the square
peg that is jabberd2 into our round hole of using F5s for GTMs and LTMs.
Generally (for LDAP services, for example) we create an SSL cert signed by
our internal CA with the GTM FQDN as the subject, and any LTM FQDNs and
pool members as Subject Alternative Names.  But the question wasn't really
about that; I just wanted to share my frustration that jabberd2 (s2s
specifically) doesn't play nicely when a system has the GTM FQDN in the
jabberd config (routes get marked as invalid and so on).  Also, the default
implementation in Spacewalk uses Berkeley DB, which is notorious for
crashing and corrupting itself, and which can't lock safely over NFS, so
good luck putting it on an NFS mount and having multiple proxies write to
that default database.

So instead I tried Postgres in the jabber config, which worked well for a
while, but pointing all of our clients to the GTM FQDN and trying to
schedule actions in the GUI wasn't working out.  When you're in an
environment that follows ITIL processes, patching production systems in
ever-varying scheduled and approved change windows, having rhnsd pick up
actions whenever it feels like it (I've seen 24+ hours) just doesn't work
for us.
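You can at least tighten the polling on the client side, e.g. in
/etc/sysconfig/rhn/rhnsd:

# minutes between rhnsd check-ins (default is 240)
INTERVAL=60

...but that's still polling, which is exactly what osad is supposed to
replace.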

I had to disable snapshots, because when spacewalk-clone-by-date runs
(daily), and also when systems check in with Puppet (which ensures
registration daily), Spacewalk sees that the base channel for thousands of
systems has been updated, attempts to snapshot them all sequentially, locks
the table, and exhausts memory on the master and the database server.
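Turning snapshots off is a one-liner in the server config; a minimal
sketch, assuming your Spacewalk version supports the enable_snapshots
option:

# /etc/rhn/rhn.conf
enable_snapshots = 0

...followed by a spacewalk-service restart.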

This week, I gave up on using the proxies for Jabber and pointed all my
clients to the masters in their respective datacenters.  This brought up an
interesting issue: in the OSA dispatcher Python code, an attempt is made to
retrieve the password from the rhnPushDispatcher table in the Spacewalk
database.  Unfortunately, it matches on only part of the Jabber ID, which
is hard-coded to "rhn-dispatcher-sat" in the code.  So, if you have two
dispatchers, it will most likely pull the wrong row, forever grab the wrong
password and attempt to use it against Jabber, and then crash OSA
dispatcher.
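You can see the ambiguity for yourself; with two masters registered, the
query returns something like this (hostnames are illustrative):

echo "select jabber_id from rhnPushDispatcher;" | spacewalk-sql -i

 rhn-dispatcher-sat@ourmaster1.fqdn/superclient
 rhn-dispatcher-sat@ourmaster2.fqdn/superclient

A match on just "rhn-dispatcher-sat%" can land on either row.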

I personally have two masters (and two OSA dispatchers) because I want
datacenter redundancy in case of an issue in one.  I also want to use the
GUI via GTM in global-availability mode again, for redundancy.  So having
OSA dispatcher running on both is very much preferable in my situation.
Unfortunately, with the issue I mentioned above about the dispatcher code
not pulling the password properly from rhnPushDispatcher when there are
multiple entries, I had to implement an ugly hack.

Another issue to throw in here was that jabber and OSA dispatcher would
randomly crash.  SM and C2S would segfault at random times, and OSA
dispatcher would error out for various reasons:
  - timeouts when loading the rows from "active" and "roster-items"
  - invalid-client errors (I didn't look into these much, but they were
maybe old versions of the osad components on the clients; it's ridiculous
that one bad client could take down the whole dispatcher process, though)
  - SELinux denials that weren't covered by running
osa-dispatcher-selinux-enable (I had to run audit2allow and semodule more
times than I'm proud of, a few times in a while loop, because fixing one
SELinux denial would just expose the next one on the following line of
code)

Below is the script I mentioned, the hack to get around the OSA
dispatcher's laziness of using a "like" statement to pull the password from
rhnPushDispatcher.  It's ugly, but the dispatcher daemon has stayed up for
almost 24 hours now (I consider this a win after all my other issues and my
year-and-a-half-long struggle with the software) and we have ~5.5k systems
online with it.  It runs from cron every 2 minutes to check on both the
jabberd and osa-dispatcher services.

Note that the SQL file assumes I've already changed the jabber config on
the masters to use SQLite (another attempt in the long list of things I
tried to keep things stable).  A similar method would most likely work with
the other storage backends, but the fixjabber.sql file below would need
updating.
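For reference, the SQLite switch itself is just pointing jabberd2's storage
and auth drivers at sqlite; a rough sketch from memory (check the element
names and paths against the sm.xml and c2s.xml shipped with your jabberd2
version):

<!-- sm.xml: keep sessions/rosters in SQLite instead of Berkeley DB -->
<storage>
  <driver>sqlite</driver>
  <sqlite>
    <dbname>/var/lib/jabberd/db/sqlite.db</dbname>
  </sqlite>
</storage>

<!-- c2s.xml: authenticate against the same SQLite database -->
<authreg>
  <module>sqlite</module>
</authreg>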

I don't like the idea of deleting the entire Jabber database every time; I
think it's a cop-out, and it forces all systems to recreate their entries
in multiple tables of that database, whether it's Berkeley DB or Postgres
or whatever.

---

[me@ourmaster1 ~]$ cat /var/log/fixjabber.sh.out
osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 11:27:59 CDT 2016
[me@ourmaster2 ~]$ cat /var/log/fixjabber.sh.out
osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 11:27:01 CDT 2016

[me@ourmaster1 ~]$ sudo cat /usr/local/bin/fixjabber.sh
#!/bin/bash
# Bounce jabberd/osa-dispatcher when either is down, clearing the stale
# dispatcher credentials that would otherwise keep osa-dispatcher crashing.
PATH=/sbin:/usr/bin:/bin
LOGFILE=/var/log/$(basename "$0").out
# Seed the restart counter on first run.
if [ ! -f "${LOGFILE}" ]; then
        echo "osa-dispatcher and jabberd have been restarted 0 times since $(date)" >> "${LOGFILE}"
fi
# Only intervene if either service reports a problem.
service jabberd status && service osa-dispatcher status
if [ $? -ne 0 ]; then
        service osa-dispatcher stop
        service jabberd stop
        # Drop this master's row so the dispatcher re-registers cleanly.
        echo "delete from rhnpushdispatcher where jabber_id='rhn-dispatcher-sat@$(uname -n)/superclient';" | spacewalk-sql -i
        # Clear the matching jabber-side entries (fixjabber.sql, below).
        sqlite3 /var/lib/jabberd/db/sqlite.db < /usr/local/etc/fixjabber.sql
        service jabberd start
        service osa-dispatcher start
        # Bump the restart counter (field 7 of the log line).
        oldnum=$(cut -d ' ' -f7 "${LOGFILE}")
        newnum=$(expr $oldnum + 1)
        sed -i "s/$oldnum/$newnum/g" "${LOGFILE}"
fi

[me@ourmaster1 ~]$ cat /usr/local/etc/fixjabber.sql
delete from authreg where username = 'rhn-dispatcher-sat';
delete from "roster-items" where "collection-owner" = 'rhn-dispatcher-sat@ourmaster1.fqdn';
delete from status where "collection-owner" = 'rhn-dispatcher-sat@ourmaster1.fqdn';
delete from active where "collection-owner" = 'rhn-dispatcher-sat@ourmaster1.fqdn';



On Fri, Aug 19, 2016 at 8:34 AM Daryl Rose <darylrose at outlook.com> wrote:

> Basically you're doing everything that I've been doing with the exception
> of the db_recover command.  I was not familiar with that command.
>
>
> How can I tell if the clients are self registering or not?
>
>
> Thank you.
>
>
> Daryl
>
>
> ------------------------------
> *From:* Robert Paschedag <robert.paschedag at web.de>
> *Sent:* Friday, August 19, 2016 3:42 AM
> *To:* Daryl Rose
> *Cc:* spacewalk-list at redhat.com
> *Subject:* Re: [Spacewalk-list] Ongoing jabberd/osad issues.
>
> Now I've also had a problem with the database. I just wanted to check
> which log files of the jabber db were no longer needed, as described in
>
> http://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/transapp/logfile.html
>
> Just running "db_archive" killed my jabber db. And... fixing it with
>
> db_recover -v or
> db_recover -c
>
> within /var/lib/jabber/db
>
> did not work.
>
> So I was also in the situation to "clean" the database.
>
> 1. Stop jabberd (/etc/init.d/jabberd stop)
> 2. Stop osa-dispatcher (/etc/init.d/osa-dispatcher stop)
> 3. Remove contents of /var/lib/jabber/db (rm -f /var/lib/jabber/db/*)
> 4. Start jabber (/etc/init.d/jabberd start)
> 5. Start osa-dispatcher (/etc/init.d/osa-dispatcher start)
>
> I thought I should restart the osad client everywhere... but no... the
> clients just re-register themselves automatically. Of course, I have to
> check this on every client, but what I have checked so far is looking
> good.
>
> Regards
> Robert
>
>
> Am 18.08.2016 um 09:30 schrieb Robert Paschedag:
> > Hi Daryl,
> >
> > as long as there are no error messages in the logs suggesting a problem
> with the jabber db, I wouldn't do anything with the db.
> >
> > As I wrote earlier, I have only had to repair the db once in about 3 1/2
> years.
> >
> > So, what I would do now is to really delete the jabber db (back it up...
> just in case) to start with a "clean" install. If the clients (that
> already have authentication information) do not re-register automatically,
> you should go to the client, stop osad, remove
> /etc/sysconfig/rhn/osad-auth.conf, and start osad again. The client should
> then register, and you should see its status in the web GUI as "online". If
> not, check /var/log/rhn/osad.log on the client (if I remember correctly)
> and the osa-dispatcher logs on the server.
> >
> > I also wrote that my spacewalk servers are NOT clients of themselves. I
> don't think that should be a problem, but just for "testing" you should
> deactivate the osad "client" on the spacewalk server.
> >
> > Start with one test server.
> >
> > Good luck.
> >
> > Regards
> > Robert
> > Am 17.08.2016 20:43 schrieb Daryl Rose <darylrose at outlook.com>:
> >>
> >> I've posted here issues that I've had with jabberd and osad, as have
> others.  But I haven't gotten things resolved, so I am posting additional
> information.
> >>
> >>
> >> I put SW into production about a year ago.  After a period of time, I
> noticed issues with the WUI, servers not reporting correctly, and other
> issues.  Google searches showed that I needed to shut down Spacewalk and
> remove all the contents of /var/lib/jabberd/db.  This seemed to work, but
> after a few months I realized that osad was no longer communicating with
> osa-dispatcher.
> >>
> >>
> >> I started doing some additional research and learned that this was not
> a good way to resolve the issue.  According to the official Spacewalk
> documentation, I should create a checkpoint and then clean up the log
> files while keeping the database and auth database files.
> >>
> >>
> >> https://fedorahosted.org/spacewalk/wiki/JabberDatabase
> >>
> >> These are the steps that I followed:
> >>
> >>
> >> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/ ## mark logs for
> deletion
> >> /usr/bin/db_archive -d -h /var/lib/jabberd/db/  ## delete logs
> >> service jabberd restart
> >>
> >> However, this also causes problems with jabberd and osad.  If I use the
> commands as the documentation instructs, osa-dispatcher will start but then
> die, and I get errors in the log about an invalid password.
> >>
> >>
> >> So to help explain my issue, I ran a test and tried to capture
> everything that I could and I'll post it here.
> >>
> >>
> >> 1. Listing of /var/lib/jabberd/db
> >>
> >> [root@<spwalk-server> db]# ls
> >> __db.001  __db.006        log.0000000004  log.0000000009  log.0000000014  log.0000000019  log.0000000024  sm.db
> >> __db.002  authreg.db      log.0000000005  log.0000000010  log.0000000015  log.0000000020  log.0000000025
> >> __db.003  log.0000000001  log.0000000006  log.0000000011  log.0000000016  log.0000000021  log.0000000026
> >> __db.004  log.0000000002  log.0000000007  log.0000000012  log.0000000017  log.0000000022  log.0000000027
> >> __db.005  log.0000000003  log.0000000008  log.0000000013  log.0000000018  log.0000000023  log.0000000028
> >>
> >> 2. Spacewalk Server Status
> >>
> >> [root@<spwalk-server> db]# spacewalk-service status
> >> postmaster (pid  1175) is running...
> >> router (pid 21431) is running...
> >> sm (pid 21441) is running...
> >> c2s (pid 21451) is running...
> >> s2s (pid 21461) is running...
> >> tomcat6 (pid 1304) is running...                           [  OK  ]
> >> httpd (pid  1385) is running...
> >> osa-dispatcher (pid  21479) is running...
> >> rhn-search is running (1441).
> >> cobblerd (pid 1491) is running...
> >> RHN Taskomatic is running (1515).
> >>
> >> 3.  Most recent log file entry:
> >>
> >> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__
> >> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0:
> osad/jabber_lib.setup_connection('Connected to jabber server',
> '<spwalk-server>.com')
> >> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0:
> osad/osa_dispatcher.fix_connection('Upstream notification server started on
> port', 1290)
> >> 2016/08/17 07:44:14 -05:00 21476 0.0.0.0:
> osad/jabber_lib.process_forever
> >>
> >> 4.  Ran the commands as instructed in the jabberd documentation.
> >>
> >> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/ ## mark logs for
> deletion
> >> /usr/bin/db_archive -d -h /var/lib/jabberd/db/  ## delete logs
> >> service jabberd restart
> >>
> >> 5.  Log file entry:
> >>
> >> 2016/08/17 13:28:19 -05:00 21476 0.0.0.0:
> osad/jabber_lib.main('ERROR', 'Traceback (most recent call last):\n  File
> "/usr/share/rhn/osad/jabber_lib.py", line 121, in main\n
> self.process_forever(c)\n  File "/usr/share/rhn/osad/jabber_lib.py", line
> 179, in process_forever\n    self.process_once(client)\n  File
> "/usr/share/rhn/osad/osa_dispatcher.py", line 187, in process_once\n
> client.retrieve_roster()\n  File "/usr/share/rhn/osad/jabber_lib.py", line
> 729, in retrieve_roster\n    stanza = self.get_one_stanza()\n  File
> "/usr/share/rhn/osad/jabber_lib.py", line 801, in get_one_stanza\n
> self.process(timeout=tm)\n  File "/usr/share/rhn/osad/jabber_lib.py", line
> 1055, in process\n    data = self._read(self.BLOCK_SIZE)\nSSLError:
> (\'OpenSSL error; will retry\', "(-1, \'Unexpected EOF\')")\n')
> >> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__
> >> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0:
> osad/jabber_lib.setup_connection('Connected to jabber server',
> '<spwalk-server>.com')
> >> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0:
> osad/jabber_lib.register('ERROR', 'Invalid password')
> >>
> >> 6.  Spacewalk server status
> >>
> >> [root@<spwalk-server> db]# spacewalk-service status
> >> postmaster (pid  1175) is running...
> >> router (pid 27119) is running...
> >> sm (pid 27129) is running...
> >> c2s (pid 27139) is running...
> >> s2s (pid 27149) is running...
> >> tomcat6 (pid 1304) is running...                           [  OK  ]
> >> httpd (pid  1385) is running...
> >> osa-dispatcher dead but pid file exists
> >> rhn-search is running (1441).
> >> cobblerd (pid 1491) is running...
> >> RHN Taskomatic is running (1515).
> >>
> >> 7. Long listing of /var/lib/jabberd/db
> >>
> >> [root@<spwalk-server> db]# ls -l
> >> total 7536
> >> -rw-r-----. 1 jabber jabber    24576 Aug 17 13:28 __db.001
> >> -rw-r-----. 1 jabber jabber   204800 Aug 17 13:29 __db.002
> >> -rw-r-----. 1 jabber jabber   270336 Aug 17 13:29 __db.003
> >> -rw-r-----. 1 jabber jabber    98304 Aug 17 13:29 __db.004
> >> -rw-r-----. 1 jabber jabber   753664 Aug 17 13:29 __db.005
> >> -rw-r-----. 1 jabber jabber    57344 Aug 17 13:29 __db.006
> >> -rw-r-----. 1 jabber jabber   368640 Aug 17 07:46 authreg.db
> >> -rw-r-----. 1 jabber jabber 10485760 Aug 17 13:29 log.0000000031
> >> -rw-r-----. 1 jabber jabber   487424 Aug 17 13:29 sm.db
> >>
> >> So, neither completely cleaning out the jabberd database/log files nor
> creating a checkpoint and removing only the stale log files works.  What
> can I do to get jabberd and osad to work, and to push out updates when I
> need to push them?
> >>
> >>
> >> Thank you.
> >>
> >>
> >> Daryl
> >>
> >
> > _______________________________________________
> > Spacewalk-list mailing list
> > Spacewalk-list at redhat.com
> > https://www.redhat.com/mailman/listinfo/spacewalk-list
> >
> _______________________________________________
> Spacewalk-list mailing list
> Spacewalk-list at redhat.com
> https://www.redhat.com/mailman/listinfo/spacewalk-list