[Freeipa-users] Replication woes

Tue Aug 27 14:45:03 UTC 2013

Thanks, Rob. I restarted named and I can see that it's only loading these
(timestamps omitted for clarity):

zone 0.in-addr.arpa/IN: loaded serial 0
zone 1.0.0.127.in-addr.arpa/IN: loaded serial 0
zone 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.wholebunchofzeros.0.ip6.arpa/IN: loaded serial 0
zone localhost/IN: loaded serial 0
zone localhost.localdomain/IN: loaded serial 0
all zones loaded
running
zone foo.net/IN: sending notifies (serial 2012102269)
zone 3.2.1.in-addr.arpa/IN: sending notifies (serial 2012101940)

Looking at some earlier /var/log/messages files, it used to emit "sending
notifies" messages for all the other zones as well -- what might have
changed to restrict it to just these two?
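
To make the comparison less error-prone than eyeballing the log, here is a rough Python sketch that pulls the zone names out of named's startup lines and reports which expected zones never appeared. The `expected` list is hypothetical (5.2.1.in-addr.arpa is a made-up zone standing in for whatever the web UI shows), and the regular expression only assumes the two message shapes quoted above.

```python
import re

# Matches the two message shapes seen in the log excerpt:
#   zone NAME/IN: loaded serial N
#   zone NAME/IN: sending notifies (serial N)
ZONE_LINE = re.compile(r"zone (\S+)/IN: ?(?:loaded serial|sending notifies)")


def normalize(zone):
    """Lower-case a zone name and drop any trailing dot."""
    return zone.rstrip(".").lower()


def zones_in_log(lines):
    """Return the set of zone names that named mentioned in these lines."""
    found = set()
    for line in lines:
        match = ZONE_LINE.search(line)
        if match:
            found.add(normalize(match.group(1)))
    return found


def missing_zones(expected, log_lines):
    """Return expected zones that never show up in the log, sorted."""
    seen = zones_in_log(log_lines)
    return sorted(normalize(z) for z in expected if normalize(z) not in seen)


log = [
    "zone localhost/IN: loaded serial 0",
    "zone foo.net/IN: sending notifies (serial 2012102269)",
    "zone 3.2.1.in-addr.arpa/IN: sending notifies (serial 2012101940)",
]
# Hypothetical list of zones copied from the web UI.
expected = ["foo.net.", "3.2.1.in-addr.arpa.", "5.2.1.in-addr.arpa."]
print(missing_zones(expected, log))
```

Feeding it the relevant lines from an older /var/log/messages as well would show which zones dropped out between the two restarts.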

To your other question, we didn't try deleting replicas until after
upgrading the master to IPA-3.1.5-1.fc18. The replicas _are_ still at a
lower level. They began having problems when one crashed and needed to be
rebuilt on the new baseline. That wasn't possible, because the original
couldn't be deleted due to a communication problem with a third replica.
And the dominoes began to fall....

Hey, it's still a better system than what we had before! I think we just
hit a real oddball situation which is making recovery difficult.

*Bret Wortman*

http://damascusgrp.com/
http://about.me/wortmanbret

On Tue, Aug 27, 2013 at 10:23 AM, Rob Crittenden <rcritten at redhat.com> wrote:

> Bret Wortman wrote:
>
>> Here's a bit more about what I'm seeing today.
>>
>> My master _is_ serving some DNS, but it appears that it's only serving
>> those zones that it knew about before all this trouble started 7-10 days
>> ago. In particular, it can only do reverse DNS on one zone (its own),
>> but can't serve reverse DNS for any other zones, even those that are in
>> its database and visible (and enabled) from the web UI.
>>
>> # nslookup 4.3.2.1 ipamaster
>> ;; connection timed out; no servers could be reached
>> # nslookup 6.5.2.1 ipamaster
>> Server:            ipamaster
>> Address:          10.9.2.1
>>
>> 1.2.5.6.in-addr.arpa     name = host1.foo.com.
>>
>> #
>>
>> Is this something that's easily rectified? The logs aren't showing me
>> anything obviously wrong -- nothing in
>> /var/log/dirsrv/slapd-FOO-COM/errors seems significant; just the same
>> CLEANALLRUV errors I've been seeing for the past week.
>>
>
> You might try restarting named. At a minimum it is going to log all the
> zones it manages so you can compare what it thinks it has vs what IPA has.
> A pattern might emerge.
>
> You can delete an existing replica and re-create it, but with the 389-ds
> errors I'm not sure what the repercussions would be, if any. You could end
> up with more dead replicas. It could be that all the RUV you have are
> because the deletions were done prior to 389-ds adding support for
> CLEANALLRUV (and it getting into IPA).
>
> rob
>
>>
>>
>>
>> *Bret Wortman*
>>
>> http://damascusgrp.com/
>> http://about.me/wortmanbret
>>
>>
>> On Tue, Aug 27, 2013 at 7:24 AM, Bret Wortman
>> <bret.wortman at damascusgrp.com> wrote:
>>
>>     I managed to gather some data for Rich and others to review and
>>     updated a bug for them about a week ago. Now I am getting a lot of
>>     internal pressure to resolve our problems and get our infrastructure
>>     stable again. As of yesterday, our master IPA server would accept
>>     changes to DNS but isn't actually serving DNS, nor is it pushing
>>     data to any replicas. The replicas are acting as DNS servers but
>>     aren't getting any updates, nor can updates be made locally on them.
>>     Fortunately, we aren't adding users very often, but if anyone's
>>     password expires soon, I'm worried that I'll have an account lockout
>>     situation.
>>
>>     So I'll ask again -- can anyone see a way to preserve just the
>>     actual DNS and authentication data within IPA while dumping its
>>     other data (replication and so on), restart it cleanly, verify it's
>>     working in all respects, and set up the replicas from scratch again?
>>     I'm hearing rumblings about going back to passwd files and host
>>     tables (which is what we were doing until about 12 months ago when I
>>     brought IPA in) and I'd really rather not go back to the stone
>>     ages....
>>
>>     Thanks!
>>
>>
>>
>>
>>     *Bret Wortman*
>>
>>
>>     http://damascusgrp.com/
>>     http://about.me/wortmanbret
>>
>>
>>     On Tue, Aug 20, 2013 at 11:15 AM, Bret Wortman
>>     <bret.wortman at damascusgrp.com> wrote:
>>
>>         If I were going to attempt to restore to an old backup, what
>>         directories/files should I make sure to restore? I've got a
>>         backup script that tars up:
>>
>>         /usr/share/ipa
>>         /usr/lib64/ipa
>>         /var/lib/ipa
>>         /var/lib/ipa-client
>>         /var/lib/dirsrv
>>         /etc
>>
>>         Is that enough to "roll back" to a few days ago before I started
>>         down this path? I'm now seeing messages about having the max
>>         number of CleanAllRUV tasks (4) and not being able to enqueue
>>         any more. So I'm really stuck now and don't know how soon I can
>>         get the files requested over to Rich for analysis.
>>
>>
>>         *Bret Wortman*
>>
>>
>>         http://damascusgrp.com/
>>         http://about.me/wortmanbret
>>
>>
>>         On Tue, Aug 20, 2013 at 9:46 AM, Rich Megginson
>>         <rmeggins at redhat.com> wrote:
>>
>>             On 08/20/2013 05:55 AM, Bret Wortman wrote:
>>
>>>             Okay, now I'm thinking I need to dump all my replicas and
>>>             start them fresh. My /var/log/dirsrv/slapd-FOO-COM/errors
>>>             is filled with messages like this:
>>>
>>>             NSMMReplicationPlugin - changelog program -
>>>             agmt="cn=meTogood1.foo.com" (good1:389): CSN
>>>             520a49640000001d0000 not found, we aren't as up to date,
>>>             or we purged
>>>             agmt="cn=meTogood1.foo.com" (good1:389) - Can't locate CSN
>>>             520a49640000001d0000 in the changelog (DB rc=-30988). The
>>>             consumer may need to be reinitialized.
>>>
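
For anyone reading along, the CSN in that message can be picked apart to see which replica and what time it refers to. A minimal Python sketch, assuming the usual 389-ds layout of 20 hex digits (8 for a Unix timestamp, then 4 each for sequence number, replica ID, and sub-sequence number); `parse_csn` and the `CSN` tuple are names made up for this illustration, not part of any 389-ds tool.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class CSN(NamedTuple):
    timestamp: datetime
    seqnum: int
    replica_id: int
    subseqnum: int


def parse_csn(text):
    """Split a 20-hex-digit CSN into its four fields (assumed layout)."""
    if len(text) != 20:
        raise ValueError("expected 20 hex digits, got %d" % len(text))
    seconds = int(text[0:8], 16)  # raises ValueError on non-hex input
    return CSN(
        timestamp=datetime.fromtimestamp(seconds, tz=timezone.utc),
        seqnum=int(text[8:12], 16),
        replica_id=int(text[12:16], 16),
        subseqnum=int(text[16:20], 16),
    )


csn = parse_csn("520a49640000001d0000")
print(csn.timestamp.isoformat(), "replica id", csn.replica_id)
```

If that layout assumption holds, the missing change originated on replica ID 29 (0x1d) on 13 August 2013, which can be matched against the replica IDs in the RUV.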
>>>             I assume the "consumer" is the replica, right? At present,
>>>             I have two replicas known to my master that are simply
>>>             gone. Another is there but they can't talk. Three more
>>>             have good communication but I'm getting errors like these.
>>>             Is there a good, clean way to just clobber all the
>>>             replicas and start over without trashing the DNS and other
>>>             identity data that is inside my master and which /is/
>>>             working? Deleting them from the master hasn't been
>>>             working; it tends to hang the master's DNS and other
>>>             services until I Ctrl-C out and "ipactl restart" it.
>>>
>>>             I'm afraid to venture out without a net here and make
>>>             things worse....
>>>
>>
>>             This looks like https://fedorahosted.org/389/ticket/47386
>>
>>             We've never been able to reproduce this in a "controlled"
>>             environment.
>>
>>             The original reporter has been able to get this to work in
>>             some cases by restarting ipa (ipactl restart).
>>
>>             Before you do that, would you be able to provide some
>>             information for me?
>>
>>             On the supplier and consumer:
>>             ldapsearch -xLLL -D "cn=directory manager" -W -b "dc=FOO,dc=COM" \
>>               '(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))' \
>>               > ruv.ldif
>>
>>             ldapsearch -xLLL -D "cn=directory manager" -W -b "cn=config" \
>>               '(objectclass=nsds5replicationagreement)' > agmt.ldif
>>
>>             dbscan -f /var/lib/dirsrv/slapd-FOO-COM/cldb/*.db4 | head -200 > cldb.txt
>>
>>             Be sure to obscure any sensitive data in ruv.ldif,
>>             agmt.ldif, and cldb.txt - you can either attach to
>>             https://fedorahosted.org/389/ticket/47386 or email to me
>>             directly.
>>
>>
>>
>>>
>>>
>>>             *Bret Wortman*
>>>
>>>
>>>             http://damascusgrp.com/
>>>             http://about.me/wortmanbret
>>>
>>>
>>>             On Mon, Aug 19, 2013 at 2:21 PM, Bret Wortman
>>>             <bret.wortman at damascusgrp.com> wrote:
>>>
>>>                 On my master (where this error is occurring), I've
>>>                 got, in /etc/hosts:
>>>
>>>                 127.0.0.1 localhost localhost.localdomain
>>>                 ::1      localhost localhost.localdomain
>>>                 1.2.3.4 ipamaster.foo.net ipamaster
>>>
>>>                 So that should be okay, right?
>>>
>>>                 # host ipamaster.foo.net
>>>                 ipamaster.foo.net has address 1.2.3.4
>>>                 # host ipamaster
>>>                 ipamaster.foo.net has address 1.2.3.4
>>>                 # host localhost
>>>                 localhost has address 127.0.0.1
>>>                 localhost has IPv6 address ::1
>>>                 #
>>>
>>>                 I checked the other system (the one I can't connect
>>>                 to) to be safe, and its /etc/hosts is similarly
>>>                 configured. It even has the master listed with its
>>>                 correct IP address.
>>>
>>>
>>>
>>>                 *Bret Wortman*
>>>
>>>
>>>                 http://damascusgrp.com/
>>>                 http://about.me/wortmanbret
>>>
>>>
>>>                 On Mon, Aug 19, 2013 at 2:02 PM, Simo Sorce
>>>                 <simo at redhat.com> wrote:
>>>
>>>                     On Mon, 2013-08-19 at 13:51 -0400, Bret Wortman wrote:
>>>                     > So, any idea how to fix the Kerberos problem?
>>>                     >
>>>
>>>                     If your server is trying to get a TGT for
>>>                     ldap/localhost it probably
>>>                     means your /etc/hosts file is broken and has a
>>>                     line like this:
>>>
>>>                     1.2.3.4 localhost my.real.name
>>>
>>>                     When GSSAPI tries to resolve my.real.name it gets
>>>                     back that 'localhost' is the canonical name, so it
>>>                     tries to get a TGT with that name and it fails.
>>>
>>>                     If /etc/hosts is fine then the DNS server may be
>>>                     returning an IP address
>>>                     that later resolves to localhost again.
>>>
>>>                     To unbreak it, make sure that if you have your fully
>>>                     qualified name in /etc/hosts, it is on its own line
>>>                     pointing at the right IP address, and the FQDN is
>>>                     the first name on that line, e.g.:
>>>
>>>                     this is ok:
>>>                     1.2.3.4 server.full.name server
>>>
>>>                     this is not:
>>>                     1.2.3.4 server server.full.name
>>>
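
The two rules in that advice (FQDN first on its line, and never on a line with localhost) are easy to check mechanically across several servers. A minimal Python sketch, assuming plain hosts-file syntax with `#` comments; `check_hosts` and `LOOPBACK_NAMES` are hypothetical names used only for this illustration.

```python
LOOPBACK_NAMES = {"localhost", "localhost.localdomain"}


def check_hosts(text, fqdn):
    """Return a list of problems with how `fqdn` appears in hosts-file text."""
    problems = []
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()  # strip comments, split on spaces
        if len(fields) < 2 or fqdn not in fields[1:]:
            continue  # blank line, or a line that does not mention the FQDN
        address, names = fields[0], fields[1:]
        if names[0] != fqdn:
            problems.append(
                "%s: %s is listed before the FQDN" % (address, names[0])
            )
        if LOOPBACK_NAMES & set(names):
            problems.append(
                "%s: FQDN shares a line with a loopback name" % address
            )
    return problems


good = "127.0.0.1 localhost localhost.localdomain\n1.2.3.4 server.full.name server\n"
bad = "1.2.3.4 server server.full.name\n"
print(check_hosts(good, "server.full.name"))
print(check_hosts(bad, "server.full.name"))
```

An empty list means the file passes both rules. It says nothing about the second failure mode mentioned above, where DNS itself hands back an address that resolves to localhost.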
>>>
>>>                     Simo.
>>>                     >
>>>                     > Bret Wortman
>>>                     >
>>>                     >
>>>                     > http://damascusgrp.com/
>>>                     >
>>>                     > http://about.me/wortmanbret
>>>                     >
>>>                     >
>>>                     >
>>>                     > On Mon, Aug 19, 2013 at 12:19 PM, Bret Wortman
>>>                     > <bret.wortman at damascusgrp.com> wrote:
>>>                     >         ...and I got the web UI, authentication and sudo back via:
>>>                     >
>>>                     >
>>>                     >         # ipactl stop
>>>                     >         # ipactl start
>>>                     >
>>>                     >
>>>                     >         Not sure why that worked, but it did. I was grasping at
>>>                     >         straws, honestly.
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >         Bret Wortman
>>>                     >
>>>                     >
>>>                     > http://damascusgrp.com/
>>>                     >
>>>                     > http://about.me/wortmanbret
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >         On Mon, Aug 19, 2013 at 12:18 PM, Bret Wortman
>>>                     >         <bret.wortman at damascusgrp.com> wrote:
>>>                     >                 Digging further, I think this log entry might be the
>>>                     >                 problem between the two servers that aren't talking:
>>>                     >
>>>                     >
>>>                     >                 slapd_ldap_sasl_interactive_bind - Error: could not
>>>                     >                 perform interactive bind for id[] mech [GSSAPI]: LDAP
>>>                     >                 error -2 (Local error) (SASL(-1): generic failure:
>>>                     >                 GSSAPI Error: Unspecified GSS failure. Minor code may
>>>                     >                 provide more information (Server
>>>                     >                 ldap/localhost@SPX.NET not found in Kerberos
>>>                     >                 database)) errno 2 (No such file or directory)
>>>                     >
>>>                     >
>>>                     >                 Did I build something incorrectly when that server was
>>>                     >                 set up originally?
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >                 Bret Wortman
>>>                     >
>>>                     >
>>>                     > http://damascusgrp.com/
>>>                     >
>>>                     > http://about.me/wortmanbret
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >                 On Mon, Aug 19, 2013 at 12:02 PM, Bret Wortman
>>>                     >                 <bret.wortman at damascusgrp.com> wrote:
>>>                     >                         I ran it on a good master, against a bad one.
>>>                     >                         As in, I ran this command on my master IPA
>>>                     >                         node:
>>>                     >
>>>                     >
>>>                     >                         # ipa-replica-manage del --force bad1.foo.net --cleanup
>>>                     >
>>>                     >
>>>                     >                         Was that wrong? I was trying to delete the bad
>>>                     >                         replica from the master, so I figured the
>>>                     >                         command needed to be run on the master. But
>>>                     >                         again, my master is now in a state where it's
>>>                     >                         not resolving DNS, user logins, or sudo at the
>>>                     >                         very least.
>>>                     >
>>>                     >
>>>                     >                         Oh, and I checked the node that it was
>>>                     >                         complaining about earlier. The network
>>>                     >                         connection to it is the pits, but it's there.
>>>                     >                         And it resolves.
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >                         Bret Wortman
>>>                     >
>>>                     >
>>>                     > http://damascusgrp.com/
>>>                     >
>>>                     > http://about.me/wortmanbret
>>>                     >
>>>                     >
>>>                     >
>>>                     >                         On Mon, Aug 19, 2013 at 11:58 AM, Rob Crittenden
>>>                     >                         <rcritten at redhat.com> wrote:
>>>                     > Rob Crittenden wrote:
>>>                     >       Bret Wortman wrote:
>>>                     >               Well, my master ground
>>>                     >               to a halt and wasn't
>>>                     >               responding. I rebooted
>>>                     >               the
>>>                     >               system and now I can't
>>>                     >               access the web UI or
>>>                     >               ssh to the master
>>>                     >               either. I
>>>                     >               have console access
>>>                     >               but that's it.
>>>                     >
>>>                     >               The services all say
>>>                     >               they're running, but
>>>                     >               the web UI gives an
>>>                     >               "Unknown
>>>                     >               Error" dialog and ssh
>>>                     >               fails with
>>>                     > "ssh_exchange_identification:
>>>                     >               Connection closed by
>>>                     >               remote host" whenever
>>>                     >               I try to ssh to
>>>                     >               ipamaster. I
>>>                     >               think something has
>>>                     >               gone really wrong
>>>                     >               inside my master. Any
>>>                     >               ideas? Even
>>>                     >               after the reboot,
>>>                     >               --cleanup isn't
>>>                     >               helping and just
>>>                     >               hangs.
>>>                     >
>>>                     >               The logfiles end (as
>>>                     >               of the time I ^C'd the
>>>                     >               process) with:
>>>                     >
>>>                     >               NSMMReplicationPlugin - agmt="cn=meTogood3.spx.net" (good3:389):
>>>                     >               Replication bind with GSSAPI auth failed: LDAP error -2 (Local error)
>>>                     >               (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.
>>>                     >               Minor code may provide more information (Cannot determine realm for
>>>                     >               numeric host address))
>>>                     >               NSMMReplicationPlugin - CleanAllRUV Task: Replica not online
>>>                     >               (agmt="cn=meTogood3.foo.net" (good3:389))
>>>                     >               NSMMReplicationPlugin - CleanAllRUV Task: Not all replicas online,
>>>                     >               retrying in 160 seconds...,
>>>                     >
>>>                     >               So it looks like it's
>>>                     >               having trouble talking
>>>                     >               with one of my
>>>                     >               replicas and
>>>                     >               is doggedly trying to
>>>                     >               get the job done. Any
>>>                     >               idea how to get the
>>>                     >               master
>>>                     >               back working again
>>>                     >               while I troubleshoot
>>>                     >               this connectivity
>>>                     >               issue?
>>>                     >
>>>                     >       That suggests a DNS problem,
>>>                     >       and it might explain ssh as
>>>                     >       well depending
>>>                     >       on your configuration.
>>>                     >
>>>                     >
>>>                     > To be clear, you ran --cleanup against
>>>                     > one of the bad masters, not a good
>>>                     > one, right?
>>>                     >
>>>                     > rob
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     >
>>>                     > _______________________________________________
>>>                     > Freeipa-users mailing list
>>>                     > Freeipa-users at redhat.com
>>>                     > https://www.redhat.com/mailman/listinfo/freeipa-users
>>>
>>>
>>>                     --
>>>                     Simo Sorce * Red Hat, Inc * New York
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>>
>>
>>
>