[Spacewalk-list] Ongoing jabberd/osad issues.

Matt Moldvan matt at moldvan.com
Mon Aug 22 15:25:52 UTC 2016


Hi Daryl,

Yeah, that long bit of info had been brewing for a while, and not all of it
is related to osa-dispatcher.  :)  There are a few somewhat obscure settings
that aren't in the config file directly (like the ability to tune taskomatic
via taskomatic.errata_cache_max_work_items and
taskomatic.channel_repodata_max_work_items, for example) that I only found
after hours of scouring various posts and articles.
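
In case it helps, those just go straight into /etc/rhn/rhn.conf like any
other option, and taskomatic needs a restart to pick them up.  The values
below are only placeholders (what you want depends on your load), so don't
take them as recommendations:

taskomatic.errata_cache_max_work_items = 100
taskomatic.channel_repodata_max_work_items = 5

service taskomatic restart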

The line for osa-dispatcher.debug should look like the following, minus the
# and everything after it, which is just a comment (not sure how much you've
dabbled in coding/scripting):

osa-dispatcher.debug = 2  # on a scale from 0-9, I guess, with verbosity
increasing as the number increases
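
If it's easier, something along these lines should do it on the master
(assuming a stock layout with the config in /etc/rhn/rhn.conf):

echo "osa-dispatcher.debug = 2" >> /etc/rhn/rhn.conf   # or edit the existing line
service osa-dispatcher restart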

Once that line is updated, and osa-dispatcher restarted, you should see
additional entries in /var/log/rhn/osa-dispatcher.log like so:

2016/08/18 11:06:03 -05:00 25096 0.0.0.0:
osad/jabber_lib._get_jabber_client('Connecting to', 'yourmasterfqdn')
2016/08/18 11:06:03 -05:00 25096 0.0.0.0: osad/jabber_lib.__init__
2016/08/18 11:06:03 -05:00 25096 0.0.0.0: osad/jabber_lib.connect
2016/08/18 11:06:03 -05:00 25096 0.0.0.0:
osad/jabber_lib.setup_connection('Connected to jabber server',
'yourmasterfqdn')
2016/08/18 11:06:03 -05:00 25096 0.0.0.0: osad/dispatcher_client.start
2016/08/18 11:06:03 -05:00 25096 0.0.0.0:
osad/jabber_lib.auth('rhn-dispatcher-sat',
'VinkOxDq35oU8tJqK8L43T5ZRldygDEs', 'superclient', 1)
2016/08/18 11:06:05 -05:00 25096 0.0.0.0:
osad/osa_dispatcher.fix_connection('Upstream notification server started on
port', 1290)
2016/08/18 11:06:06 -05:00 25096 0.0.0.0: osad/jabber_lib.process_forever
2016/08/18 22:54:51 -05:00 25096 0.0.0.0:
osad/dispatcher_client._message_callback('Ping response',)
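
While testing, I just keep a tail running so the interesting bits stand out:

tail -f /var/log/rhn/osa-dispatcher.log | grep -i -e subscribed -e 'Ping response'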

In the OSA dispatcher code, there are debug lines like the following, with
the first argument to the log_debug function corresponding to the setting
above (a message is only logged when the setting is at or above that number):

        log_debug(1, "Upstream notification server started on port", port)

        log_debug(4, "Subscribed to",   c._roster.get_subscribed_to())
        log_debug(4, "Subscribed from", c._roster.get_subscribed_from())
        log_debug(4, "Subscribed both", c._roster.get_subscribed_both())

        if client in rfds:
            log_debug(5, "before process")
            client.process(timeout=None)
            log_debug(5, "after process")

Good luck!

On Mon, Aug 22, 2016 at 10:57 AM Daryl Rose <darylrose at outlook.com> wrote:

> Matt,
>
>
> Thank you very much for this information.  There was a lot here to
> process, so I had to read it multiple times to make sure that I understood
> what you were doing.
>
>
> One of the errors that I saw in the osa-dispatcher log was an invalid
> password.  I ran the SQL command as suggested and saw that there were
> in fact two entries for the rhn-dispatcher.  One from the original
> install a year ago, and then a second one with the FQDN.  I removed both
> entries, cleaned out the databases and restarted jabberd/osa-dispatcher.
>  I'll keep an eye on things for a while and see if this helps with my
> database issues.
>
>
> Oh, almost forgot......I do not see an entry in /etc/rhn/rhn.conf for
> osa_dispatcher.debug.  I added it in and stop/started osa-dispatcher, but I
> did not see any additional logging.
>
>
> Thanks
>
>
> Daryl
>
> ------------------------------
> *From:* spacewalk-list-bounces at redhat.com <
> spacewalk-list-bounces at redhat.com> on behalf of Matt Moldvan <
> matt at moldvan.com>
> *Sent:* Friday, August 19, 2016 9:57 AM
> *To:* spacewalk-list at redhat.com
>
> *Subject:* Re: [Spacewalk-list] Ongoing jabberd/osad issues.
> To answer your question about how you can tell if clients are registering
> (to OSA dispatcher, not jabber), set osa_dispatcher.debug = 4 (goes up to 9
> but that is -too much- info in my experience) in /etc/rhn/rhn.conf and
> restart osa-dispatcher.  This will tell you if systems are subscribing to
> the dispatcher's presence and are ready to receive actions.
>
> 2016/07/28 08:16:28 -05:00 3187 0.0.0.0:
> osad/jabber_lib._roster_callback('Updating the roster', <iq
> to='rhn-dispatcher-sat@somefqdn/superclient' type='set'
> id='d0i9bb6qg2mxcvod2204d697ar81obq2b9cmljd1'><query xmlns =
> 'jabber:iq:roster' ><item jid='osad-5987ca54b5@somefqdn'
> subscription='both' /></query></iq>)
> 2016/07/28 08:16:28 -05:00 3187 0.0.0.0:
> osad/jabber_lib._presence_callback('rhn-dispatcher-sat@somefqdn/superclient',
> osad-5987ca54b5@somefqdn, u'subscribed')
>
> For jabber connections, check /var/log/messages for jabberd entries via
> syslog.  I had to change rsyslog settings to see all of them, as there were
> so many coming in when I'd restart the jabber services that rsyslog would
> start to rate-limit them.
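>
> If you hit the same thing, the usual knob is the imuxsock rate limiting;
> something like this in /etc/rsyslog.conf (these are the legacy rsyslog
> directives, adjust for your rsyslog version), followed by a restart:
>
> $SystemLogRateLimitInterval 0
> $SystemLogRateLimitBurst 0
>
> service rsyslog restart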
>
> Also, osad should -not- be installed on the masters; it conflicts with the
> OSA dispatcher packages...
>
> If you have to clean things up manually, do this.
>
> In the Spacewalk back-end database (run "spacewalk-sql -i" with sudo or as
> root), run "select * from rhnpushdispatcher;" to see what OSA dispatcher
> thinks the password is.  Stop jabberd, delete the entries in that table,
> delete the Berkeley DB files, then start jabberd and osa-dispatcher.  I have
> more about the issue below, but it's a very long and boring story.
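>
> Roughly, the manual version looks like this (a sketch, assuming the stock
> Berkeley DB setup under /var/lib/jabberd/db):
>
> service osa-dispatcher stop
> service jabberd stop
> # see what OSA dispatcher thinks the credentials are, then clear them out
> echo "select * from rhnpushdispatcher;" | spacewalk-sql -i
> echo "delete from rhnpushdispatcher;" | spacewalk-sql -i
> # wipe the Berkeley DB files so jabberd starts clean
> rm -f /var/lib/jabberd/db/*
> service jabberd start
> service osa-dispatcher start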
>
> --- long story
>
> This is a long story... excuse the rant but it's caused me a lot of late
> nights over the course of implementation.  This is a sort of catharsis to
> be able to send.
>
> So, I've spent the last year and a half trying various methods to keep OSA
> dispatcher on our Spacewalk masters stable, and trying to hammer the square
> peg that is jabberd2 into our round hole of using F5s for GTMs and LTMs.
> Generally, for example for LDAP services, we create an SSL cert signed by
> our internal CA and include the GTM FQDN as the subject, and any LTM FQDN
> and pool members as the Subject Alternative Names.  But the question wasn't
> really about that, just wanted to share my frustration that jabberd2 (s2s
> specifically) doesn't play nicely when a system has the GTM FQDN in the
> jabberd config (routes get marked as invalid and so on).  Also, the default
> implementation in Spacewalk uses the Berkeley DB, which is notorious for
> crashing and corrupting itself and also doesn't support locking, so good
> luck trying to put it on an NFS mount and having multiple proxies writing
> to that default database.
>
> So instead I tried Postgres in the jabber config, which worked well for a
> time, but pointing all of our clients to the GTM FQDN and attempting to
> schedule actions in the GUI wasn't working out.  When you're in an
> environment that follows ITIL processes and patches production systems in
> ever-varying scheduled and approved change windows, having rhnsd pick up
> actions whenever it feels like it (I've seen 24+ hours) just doesn't work
> for us.
>
> I had to disable snapshots, because when spacewalk-clone-by-date runs
> (daily) and when systems check in with Puppet (which ensures registration
> daily), Spacewalk sees that the base channel for thousands of systems has
> been updated, attempts to update them sequentially, locks the table, and
> causes memory exhaustion on the master and the database server.
>
> This week, I gave up on using the proxies for Jabber and pointed all my
> clients to the masters in their respective datacenters.  This brought up an
> interesting issue: in the OSA dispatcher Python code, an attempt is made to
> retrieve the password from the rhnPushDispatcher table in the Spacewalk
> database.  Unfortunately, it only matches on part of the Jabber ID, which is
> hard-coded to "rhn-dispatcher-sat" in the code.  So, if you have two entries,
> it will most likely not pull the correct one, forever grabbing the wrong
> password and attempting to use it against Jabber, and then crashing OSA
> dispatcher.
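>
> A quick way to check whether you're in that situation (the LIKE below just
> mimics the partial match the dispatcher code does; more than one row back
> means trouble):
>
> echo "select jabber_id from rhnpushdispatcher where jabber_id like 'rhn-dispatcher-sat%';" | spacewalk-sql -i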
>
> I personally have two masters (and two OSA dispatchers) because I want
> datacenter redundancy in case of an issue in one.  Also, I want to use the
> GUI via GTM in global availability mode again, for redundancy.  So, having
> OSA dispatcher running on both is very preferable in my situation.
> Unfortunately with the issue I mentioned above about the dispatcher code
> not pulling the password properly from rhnPushDispatcher when there are
> multiple entries, I had to implement an ugly hack.
>
> Another issue to throw in here was that jabber and OSA dispatcher would
> randomly crash.  SM and C2S would segfault at random times, and OSA
> dispatcher would error out for various reasons:
>   - timeouts when it was loading the rows from "active" and "roster-items"
>   - invalidclient errors (I didn't look into these much, but maybe old
> versions of the OSAD components on the clients); ridiculous that one bad
> client could cause the whole dispatcher process to bomb out, though
>   - SELinux errors that weren't covered by running
> osa-dispatcher-selinux-enable (I had to run audit2allow and semodule more
> times than I'm proud of, a few times in a while loop, because fixing one
> SELinux denial would just lead to another on the next line of the code)
>
> Below is the script that I mentioned, the hack to get around the OSA
> dispatcher's laziness of using a "like" statement to pull the password from
> rhnPushDispatcher.  It's ugly, but the dispatcher daemon has stayed up for
> almost 24 hours now (I consider this a win after all my other issues and my
> year-and-a-half-long struggle with the software) and we have ~5.5k systems
> online with it.  It runs from cron every 2 minutes to check on both the
> jabberd and osa-dispatcher services.
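>
> (The cron side is nothing fancy, just an entry along these lines, e.g. in
> /etc/cron.d; the path and user are whatever fits your setup:)
>
> */2 * * * * root /usr/local/bin/fixjabber.sh >/dev/null 2>&1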
>
> Note that the SQL file assumes the jabber config on the masters has been
> changed to use SQLite (another attempt in the long list of things I tried
> to keep things stable).  A similar method would most likely work with other
> backends, but the fixjabber.sql file below would need updating.
>
> I don't like the idea of deleting the entire Jabber database every time; I
> think it's a cop-out, and it forces all systems to recreate entries in
> multiple tables of that database, whether it's Berkeley DB or Postgres or
> whatever.
>
> ---
>
> [me@ourmaster1 ~]$ cat /var/log/fixjabber.sh.out
> osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 11:27:59 CDT 2016
> [me@ourmaster2 ~]$ cat /var/log/fixjabber.sh.out
> osa-dispatcher and jabberd have been restarted 0 times since Thu Aug 18 11:27:01 CDT 2016
>
> [me@ourmaster1 ~]$ sudo cat /usr/local/bin/fixjabber.sh
> #!/bin/bash
> PATH=/sbin:/usr/bin:/bin
> LOGFILE=/var/log/$(basename $0).out
> if [ ! -f "${LOGFILE}" ]; then
>         echo "osa-dispatcher and jabberd have been restarted 0 times since $(date)" >> "${LOGFILE}"
> fi
> service jabberd status && service osa-dispatcher status
> if [ $? -ne 0 ]; then
>         service osa-dispatcher stop
>         service jabberd stop
>         echo "delete from rhnpushdispatcher where jabber_id='rhn-dispatcher-sat@$(uname -n)/superclient';" | spacewalk-sql -i
>         sqlite3 /var/lib/jabberd/db/sqlite.db < /usr/local/etc/fixjabber.sql
>         service jabberd start
>         service osa-dispatcher start
>         oldnum=$(cut -d ' ' -f7 ${LOGFILE})
>         newnum=$(expr $oldnum + 1)
>         sed -i "s/$oldnum/$newnum/g" ${LOGFILE}
> fi
>
> [me@ourmaster1 ~]$ cat /usr/local/etc/fixjabber.sql
> delete from authreg where username = 'rhn-dispatcher-sat';
> delete from "roster-items" where "collection-owner" = 'rhn-dispatcher-sat@ourmaster1.fqdn';
> delete from status where "collection-owner" = 'rhn-dispatcher-sat@ourmaster1.fqdn';
> delete from active where "collection-owner" = 'rhn-dispatcher-sat@ourmaster1.fqdn';
>
>
>
> On Fri, Aug 19, 2016 at 8:34 AM Daryl Rose <darylrose at outlook.com> wrote:
>
>> Basically you're doing everything that I've been doing with the exception
>> of the db_recover command.  I was not familiar with that command.
>>
>>
>> How can I tell if the clients are self registering or not?
>>
>>
>> Thank you.
>>
>>
>> Daryl
>>
>>
>> ------------------------------
>> *From:* Robert Paschedag <robert.paschedag at web.de>
>> *Sent:* Friday, August 19, 2016 3:42 AM
>> *To:* Daryl Rose
>> *Cc:* spacewalk-list at redhat.com
>> *Subject:* Re: [Spacewalk-list] Ongoing jabberd/osad issues.
>>
>> Now I also had a problem with the database.  I just wanted to check which
>> log files of the jabber db are not needed anymore, as described in
>>
>> http://web.stanford.edu/class/cs276a/projects/docs/berkeleydb/ref/transapp/logfile.html
>>
>> Just running "db_archive" killed my jabber db. And....fixing it with
>>
>> db_recover -v or
>> db_recover -c
>>
>> within /var/lib/jabber/db
>>
>> did not work.
>>
>> So I was also in the situation to "clean" the database.
>>
>> 1. Stop jabberd (/etc/init.d/jabberd stop)
>> 2. Stop osa-dispatcher (/etc/init.d/osa-dispatcher stop)
>> 3. Remove contents of /var/lib/jabber/db (rm -f /var/lib/jabber/db/*)
>> 4. Start jabber (/etc/init.d/jabberd start)
>> 5. Start osa-dispatcher (/etc/init.d/osa-dispatcher start)
>>
>> I thought I should restart the osad client everywhere... but no, the
>> clients just re-register themselves automatically.  Of course, I have to
>> check this on every client, but what I have checked so far is looking
>> good.
>>
>> Regards
>> Robert
>>
>>
>> On 18.08.2016 at 09:30, Robert Paschedag wrote:
>> > Hi Daryl,
>> >
>> > as long as there are no error messages in the logs indicating a problem
>> with the jabber db, I wouldn't do anything with the db.
>> >
>> > As I wrote earlier, I only had to repair the db once in about 3 1/2
>> years.
>> >
>> > So, what I would do now is to really delete the jabber db (back it
>> up... just in case) to start with a "clean" install.  If the clients
>> (that already have authentication information) do not re-register
>> automatically, you should go to the client, stop osad, remove
>> /etc/sysconfig/rhn/osad-auth.conf and start osad again.  The client should
>> then register, and you should see its status in the web GUI as "online".
>> If not, check /var/log/rhn/osad.log on the client (if I remember correctly)
>> and the osa-dispatcher logs on the server.
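>> >
>> > (On the client that would be something like:)
>> >
>> > service osad stop
>> > rm -f /etc/sysconfig/rhn/osad-auth.conf
>> > service osad start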
>> >
>> > I also wrote that my spacewalk servers are NOT clients of themselves.
>> I don't think that should be a problem, but just for "testing" you should
>> deactivate the osad "client" on the spacewalk server.
>> >
>> > Start with one test server.
>> >
>> > Good luck.
>> >
>> > Regards
>> > Robert
>> > On 17.08.2016 at 20:43, Daryl Rose <darylrose at outlook.com> wrote:
>> >>
>> >> I've posted here issues that I've had with jabberd and osad, as have
>> others.  But I haven't gotten things resolved, so I am posting additional
>> information.
>> >>
>> >>
>> >> I put SW into production about a year ago.  After a period of time, I
>> noticed issues with the WUI and servers not reporting correctly, and other
>> issues.  Google searches suggested that I needed to shut down Spacewalk and
>> remove all the contents of /var/lib/jabberd/db.  This seemed to work, but
>> after a few months, I realized that osad was no longer communicating with
>> osa-dispatcher.
>> >>
>> >>
>> >> I started doing some additional research and learned that this was not
>> a good way to resolve the issue.  According to the official Spacewalk
>> documentation, I should create a checkpoint and then clean up the log
>> files, keeping the database and auth database files.
>> >>
>> >>
>> >> https://fedorahosted.org/spacewalk/wiki/JabberDatabase
>> >>
>> >> These are the steps that I followed:
>> >>
>> >>
>> >> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/  ## mark logs for deletion
>> >> /usr/bin/db_archive -d -h /var/lib/jabberd/db/  ## delete logs
>> >> service jabberd restart
>> >>
>> >> However, this also causes problems with jabberd and osad.  If I use
>> the commands as the documentation instructs, then osa-dispatcher will
>> start but then die, and I get errors in the log about an invalid
>> password.
>> >>
>> >>
>> >> So to help explain my issue, I ran a test and tried to capture
>> everything that I could and I'll post it here.
>> >>
>> >>
>> >> 1. Listing of /var/lib/jabberd/db
>> >>
>> >> [root@<spwalk-server> db]# ls
>> >> __db.001  __db.006        log.0000000004  log.0000000009  log.0000000014  log.0000000019  log.0000000024  sm.db
>> >> __db.002  authreg.db      log.0000000005  log.0000000010  log.0000000015  log.0000000020  log.0000000025
>> >> __db.003  log.0000000001  log.0000000006  log.0000000011  log.0000000016  log.0000000021  log.0000000026
>> >> __db.004  log.0000000002  log.0000000007  log.0000000012  log.0000000017  log.0000000022  log.0000000027
>> >> __db.005  log.0000000003  log.0000000008  log.0000000013  log.0000000018  log.0000000023  log.0000000028
>> >>
>> >> 2. Spacewalk Server Status
>> >>
>> >> [root@<spwalk-server> db]# spacewalk-service status
>> >> postmaster (pid  1175) is running...
>> >> router (pid 21431) is running...
>> >> sm (pid 21441) is running...
>> >> c2s (pid 21451) is running...
>> >> s2s (pid 21461) is running...
>> >> tomcat6 (pid 1304) is running...                           [  OK  ]
>> >> httpd (pid  1385) is running...
>> >> osa-dispatcher (pid  21479) is running...
>> >> rhn-search is running (1441).
>> >> cobblerd (pid 1491) is running...
>> >> RHN Taskomatic is running (1515).
>> >>
>> >> 3.  Most recent log file entry:
>> >>
>> >> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__
>> >> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0:
>> osad/jabber_lib.setup_connection('Connected to jabber server',
>> '<spwalk-server>.com')
>> >> 2016/08/17 07:44:13 -05:00 21476 0.0.0.0:
>> osad/osa_dispatcher.fix_connection('Upstream notification server started on
>> port', 1290)
>> >> 2016/08/17 07:44:14 -05:00 21476 0.0.0.0:
>> osad/jabber_lib.process_forever
>> >>
>> >> 4.  Ran the commands as instructed in the jabberd documentation.
>> >>
>> >> /usr/bin/db_checkpoint -1 -h /var/lib/jabberd/db/  ## mark logs for deletion
>> >> /usr/bin/db_archive -d -h /var/lib/jabberd/db/  ## delete logs
>> >> service jabberd restart
>> >>
>> >> 5.  Log file entry:
>> >>
>> >> 2016/08/17 13:28:19 -05:00 21476 0.0.0.0:
>> osad/jabber_lib.main('ERROR', 'Traceback (most recent call last):\n  File
>> "/usr/share/rhn/osad/jabber_lib.py", line 121, in main\n
>> self.process_forever(c)\n  File "/usr/share/rhn/osad/jabber_lib.py", line
>> 179, in process_forever\n    self.process_once(client)\n  File
>> "/usr/share/rhn/osad/osa_dispatcher.py", line 187, in process_once\n
>> client.retrieve_roster()\n  File "/usr/share/rhn/osad/jabber_lib.py", line
>> 729, in retrieve_roster\n    stanza = self.get_one_stanza()\n  File
>> "/usr/share/rhn/osad/jabber_lib.py", line 801, in get_one_stanza\n
>> self.process(timeout=tm)\n  File "/usr/share/rhn/osad/jabber_lib.py", line
>> 1055, in process\n    data = self._read(self.BLOCK_SIZE)\nSSLError:
>> (\'OpenSSL error; will retry\', "(-1, \'Unexpected EOF\')")\n')
>> >> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0: osad/jabber_lib.__init__
>> >> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0:
>> osad/jabber_lib.setup_connection('Connected to jabber server',
>> '<spwalk-server>.com')
>> >> 2016/08/17 13:28:29 -05:00 21476 0.0.0.0:
>> osad/jabber_lib.register('ERROR', 'Invalid password')
>> >>
>> >> 6.  Spacewalk server status
>> >>
>> >> [root@<spwalk-server> db]# spacewalk-service status
>> >> postmaster (pid  1175) is running...
>> >> router (pid 27119) is running...
>> >> sm (pid 27129) is running...
>> >> c2s (pid 27139) is running...
>> >> s2s (pid 27149) is running...
>> >> tomcat6 (pid 1304) is running...                           [  OK  ]
>> >> httpd (pid  1385) is running...
>> >> osa-dispatcher dead but pid file exists
>> >> rhn-search is running (1441).
>> >> cobblerd (pid 1491) is running...
>> >> RHN Taskomatic is running (1515).
>> >>
>> >> 7. Long listing of /var/lib/jabberd/db
>> >>
>> >> [root@<spwalk-server> db]# ls -l
>> >> total 7536
>> >> -rw-r-----. 1 jabber jabber    24576 Aug 17 13:28 __db.001
>> >> -rw-r-----. 1 jabber jabber   204800 Aug 17 13:29 __db.002
>> >> -rw-r-----. 1 jabber jabber   270336 Aug 17 13:29 __db.003
>> >> -rw-r-----. 1 jabber jabber    98304 Aug 17 13:29 __db.004
>> >> -rw-r-----. 1 jabber jabber   753664 Aug 17 13:29 __db.005
>> >> -rw-r-----. 1 jabber jabber    57344 Aug 17 13:29 __db.006
>> >> -rw-r-----. 1 jabber jabber   368640 Aug 17 07:46 authreg.db
>> >> -rw-r-----. 1 jabber jabber 10485760 Aug 17 13:29 log.0000000031
>> >> -rw-r-----. 1 jabber jabber   487424 Aug 17 13:29 sm.db
>> >>
>> >> So, neither completely cleaning out the jabberd database/log files nor
>> creating a checkpoint and removing the stale log files works.  What can I
>> do to get jabberd and osad working, and to push out updates when I need to
>> push them out?
>> >>
>> >>
>> >> Thank you.
>> >>
>> >>
>> >> Daryl
>> >>
>> >
>
> _______________________________________________
> Spacewalk-list mailing list
> Spacewalk-list at redhat.com
> https://www.redhat.com/mailman/listinfo/spacewalk-list