[Freeipa-users] Replication woes
Bret Wortman
bret.wortman at damascusgrp.com
Tue Aug 27 14:45:03 UTC 2013
Thanks, Rob. I restarted named and I can see that it's only loading these
(timestamps omitted for clarity):
zone 0.in-addr.arpa/IN: loaded serial 0
zone 1.0.0.127.in-addr.arpa/IN: loaded serial 0
zone 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.wholebunchofzeros.0.ip6.arpa/IN: loaded serial 0
zone localhost/IN: loaded serial 0
zone localhost.localdomain/IN: loaded serial 0
all zones loaded
running
zone foo.net/IN: sending notifies (serial 2012102269)
zone 3.2.1.in-addr.arpa/IN: sending notifies (serial 2012101940)
Looking at some earlier /var/log/messages files, it used to emit "sending
notifies" messages for all the other zones as well -- what might have
changed to restrict it to just these two?
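
For comparing the two lists, something like the following might help. It is
only a rough sketch: it assumes the named messages keep the
"zone NAME/IN: sending notifies" shape shown above, and that the zones IPA
knows about have been saved one per line in a file. The function names and
file names are made up for illustration.

```shell
# Sketch only: notifying_zones and missing_zones are hypothetical helper names.

# Print every zone for which named logged "sending notifies", one per line.
notifying_zones() {
    sed -n 's|.*zone \([^ ]*\)/IN: sending notifies.*|\1|p' "$1" | LC_ALL=C sort -u
}

# Print the zones listed in $2 (one per line) that never show up in the log $1.
missing_zones() {
    seen=$(mktemp)
    expected=$(mktemp)
    notifying_zones "$1" > "$seen"
    LC_ALL=C sort -u "$2" > "$expected"
    LC_ALL=C comm -13 "$seen" "$expected"   # keep lines unique to the expected list
    rm -f "$seen" "$expected"
}

# Example call (both paths are assumptions):
#   missing_zones /var/log/messages ipa-zones.txt
```

Any zone it prints is one that IPA lists but named never announced.
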
To your other question, we didn't try deleting replicas until after
upgrading the master to IPA-3.1.5-1.fc18. The replicas _are_ still at a
lower level; they began having problems when one crashed and needed to be
rebuilt on the new baseline, which wasn't possible because the original
couldn't be deleted because of a communication problem with a third
replica. And the dominoes began to fall....
Hey, it's still a better system than what we had before! I think we just
hit a real oddball situation which is making recovery difficult.
*Bret Wortman*
http://damascusgrp.com/
http://about.me/wortmanbret
On Tue, Aug 27, 2013 at 10:23 AM, Rob Crittenden <rcritten at redhat.com> wrote:
> Bret Wortman wrote:
>
>> Here's a bit more about what I'm seeing today.
>>
>> My master _is_ serving some DNS, but it appears that it's only serving
>> those zones that it knew about before all this trouble started 7-10 days
>> ago. In particular, it can only do reverse DNS on one zone (its own),
>> but can't serve reverse DNS for any other zones, even those that are in
>> its database and visible (and enabled) from the web UI.
>>
>> # nslookup 4.3.2.1 ipamaster
>> ;; connection timed out; no servers could be reached
>> # nslookup 6.5.2.1 ipamaster
>> Server: ipamaster
>> Address: 10.9.2.1
>>
>> 1.2.5.6.in-addr.arpa name = host1.foo.com.
>>
>> #
>>
>> Is this something that's easily rectified? The logs aren't giving me
>> anything obviously wrong -- nothing in /var/log/dirsrv/slapd-FOO-COM/errors
>> seems significant; just the same CLEANALLRUV errors I've been seeing for
>> the past week.
>>
>
> You might try restarting named. At a minimum it is going to log all the
> zones it manages so you can compare what it thinks it has vs what IPA has.
> A pattern might emerge.
>
> You can delete an existing replica and re-create it, but with the 389-ds
> errors I'm not sure what the repercussions would be, if any. You could end
> up with more dead replicas. It could be that all the RUV you have are
> because the deletions were done prior to 389-ds adding support for
> CLEANALLRUV (and it getting into IPA).
>
> rob
>
>>
>> On Tue, Aug 27, 2013 at 7:24 AM, Bret Wortman
>> <bret.wortman at damascusgrp.com> wrote:
>>
>> I managed to gather some data for Rich and others to review and
>> updated a bug for them about a week ago. Now I am getting a lot of
>> internal pressure to resolve our problems and get our infrastructure
>> stable again. As of yesterday, our master IPA server would accept
>> changes to DNS but isn't actually serving DNS, nor is it pushing
>> data to any replicas. The replicas are acting as DNS servers but
>> aren't getting any updates, nor can updates be made locally on them.
>> Fortunately, we aren't adding users very often, but if anyone's
>> password expires soon, I'm worried that I'll have an account lockout
>> situation.
>>
>> So I'll ask again -- can anyone see a way to preserve just the
>> actual DNS and authentication data within IPA while dumping its
>> other data (replication and so on), restart it cleanly, verify it's
>> working in all respects, and set up the replicas from scratch again?
>> I'm hearing rumblings about going back to passwd files and host
>> tables (which is what we were doing until about 12 months ago when I
>> brought IPA in) and I'd really rather not go back to the stone
>> ages....
>>
>> Thanks!
>>
>> On Tue, Aug 20, 2013 at 11:15 AM, Bret Wortman
>> <bret.wortman at damascusgrp.com> wrote:
>>
>> If I were going to attempt to restore to an old backup, what
>> directories/files should I make sure to restore? I've got a
>> backup script that tars up:
>>
>> /usr/share/ipa
>> /usr/lib64/ipa
>> /var/lib/ipa
>> /var/lib/ipa-client
>> /var/lib/dirsrv
>> /etc
>>
>> Is that enough to "roll back" to a few days ago before I started
>> down this path? I'm now seeing messages about having the max
>> number of CleanAllRUV tasks (4) and not being able to enqueue
>> any more. So I'm really stuck now and don't know how soon I can
>> get the files requested over to Rich for analysis.
>>
>> On Tue, Aug 20, 2013 at 9:46 AM, Rich Megginson
>> <rmeggins at redhat.com> wrote:
>>
>> On 08/20/2013 05:55 AM, Bret Wortman wrote:
>>
>>> Okay, now I'm thinking I need to dump all my replicas and
>>> start them fresh. My /var/log/dirsrv/slapd-FOO-COM/errors is
>>> filled with messages like this:
>>>
>>> NSMMReplicationPlugin - changelog program -
>>> agmt="cn=meTogood1.foo.com" (good1:389): CSN
>>> 520a49640000001d0000 not found, we aren't as up to date, or
>>> we purged
>>> agmt="cn=meTogood1.foo.com" (good1:389) - Can't locate CSN
>>> 520a49640000001d0000 in the changelog (DB rc=-30988). The
>>> consumer may need to be reinitialized.
>>>
>>> I assume the "consumer" is the replica, right? At present,
>>> I have two replicas known to my master that are simply
>>> gone. Another is there but they can't talk. Three more
>>> have good communication but I'm getting errors like these.
>>> Is there a good, clean way to just clobber all the
>>> replicas and start over without trashing the DNS and other
>>> identity data that is inside my master and which /is/
>>> working? Deleting them from the master hasn't been
>>> working; it tends to hang the master's DNS and other
>>> services until I Ctrl-C out and "ipactl restart" it.
>>>
>>> I'm afraid to venture out without a net here and make
>>> things worse....
>>>
>>
>> This looks like https://fedorahosted.org/389/ticket/47386
>>
>> We've never been able to reproduce this in a "controlled"
>> environment.
>>
>> The original reporter has been able to get this to work in
>> some cases by restarting ipa (ipactl restart).
>>
>> Before you do that, would you be able to provide some
>> information for me?
>>
>> On the supplier and consumer:
>> ldapsearch -xLLL -D "cn=directory manager" -W -b "dc=FOO,dc=COM" \
>>   '(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))' \
>>   > ruv.ldif
>>
>> ldapsearch -xLLL -D "cn=directory manager" -W -b "cn=config" \
>>   '(objectclass=nsds5replicationagreement)' > agmt.ldif
>>
>> dbscan -f /var/lib/dirsrv/slapd-FOO-COM/cldb/*.db4 | head -200 > cldb.txt
>>
>> Be sure to obscure any sensitive data in ruv.ldif,
>> agmt.ldif, and cldb.txt - you can either attach to
>> https://fedorahosted.org/389/ticket/47386 or email to me
>> directly.
>>
>>> On Mon, Aug 19, 2013 at 2:21 PM, Bret Wortman
>>> <bret.wortman at damascusgrp.com> wrote:
>>>
>>> On my master (where this error is occurring), I've
>>> got, in /etc/hosts:
>>>
>>> 127.0.0.1 localhost localhost.localdomain
>>> ::1 localhost localhost.localdomain
>>> 1.2.3.4 ipamaster.foo.net ipamaster
>>>
>>> So that should be okay, right?
>>>
>>> # host ipamaster.foo.net
>>> ipamaster.foo.net has address 1.2.3.4
>>> # host ipamaster
>>> ipamaster.foo.net has address 1.2.3.4
>>> # host localhost
>>> localhost has address 127.0.0.1
>>> localhost has IPv6 address ::1
>>> #
>>>
>>> I checked the other system (the one I can't connect
>>> to) to be safe, and its /etc/hosts is similarly
>>> configured. It even has the master listed with its
>>> correct IP address.
>>>
>>> On Mon, Aug 19, 2013 at 2:02 PM, Simo Sorce
>>> <simo at redhat.com> wrote:
>>>
>>> On Mon, 2013-08-19 at 13:51 -0400, Bret Wortman wrote:
>>> > So, any idea how to fix the Kerberos problem?
>>> >
>>>
>>> If your server is trying to get a tgt for ldap/localhost it
>>> probably means your /etc/hosts file is broken and has a line
>>> like this:
>>>
>>> 1.2.3.4 localhost my.real.name
>>>
>>> When GSSAPI tries to resolve my.real.name it gets back that
>>> 'localhost' is the canonical name, so it tries to get a TGT
>>> with that name and fails.
>>>
>>> If /etc/hosts is fine then the DNS server may be returning an
>>> IP address that later resolves to localhost again.
>>>
>>> To unbreak this, make sure that if you have your fully
>>> qualified name in /etc/hosts, it is on its own line pointing
>>> at the right IP address, with the FQDN first on the line, e.g.:
>>>
>>> this is ok:
>>> 1.2.3.4 server.full.name server
>>>
>>> this is not:
>>> 1.2.3.4 server server.full.name
>>>
>>> Simo.
>>> >
>>> > On Mon, Aug 19, 2013 at 12:19 PM, Bret Wortman
>>> > <bret.wortman at damascusgrp.com> wrote:
>>> > ...and I got the web UI, authentication and sudo back via:
>>> >
>>> > # ipactl stop
>>> > # ipactl start
>>> >
>>> > Not sure why that worked, but it did. I was grasping at
>>> > straws, honestly.
>>> >
>>> > On Mon, Aug 19, 2013 at 12:18 PM, Bret Wortman
>>> > <bret.wortman at damascusgrp.com> wrote:
>>> > Digging further, I think this log entry might be the
>>> > problem between the two servers that aren't talking:
>>> >
>>> > slapd_ldap_sasl_interactive_bind - Error: could not
>>> > perform interactive bind for id[] mech [GSSAPI]: LDAP
>>> > error -2 (Local error) (SASL(-1): generic failure:
>>> > GSSAPI Error: Unspecified GSS failure. Minor code may
>>> > provide more information (Server
>>> > ldap/localhost at SPX.NET not found in Kerberos
>>> > database)) errno 2 (No such file or directory)
>>> >
>>> > Did I build something incorrectly when that server was
>>> > set up originally?
>>> >
>>> > On Mon, Aug 19, 2013 at 12:02 PM, Bret Wortman
>>> > <bret.wortman at damascusgrp.com> wrote:
>>> > I ran it on a good master, against a bad one.
>>> > As in, I ran this command on my master IPA node:
>>> >
>>> > # ipa-replica-manage del --force bad1.foo.net --cleanup
>>> >
>>> > Was that wrong? I was trying to delete the bad
>>> > replica from the master, so I figured the
>>> > command needed to be run on the master. But
>>> > again, my master is now in a state where it's
>>> > not resolving DNS, user logins, or sudo at the
>>> > very least.
>>> >
>>> > Oh, and I checked the node that it was
>>> > complaining about earlier. The network
>>> > connection to it is the pits, but it's there.
>>> > And it resolves.
>>> >
>>> > On Mon, Aug 19, 2013 at 11:58 AM, Rob Crittenden
>>> > <rcritten at redhat.com> wrote:
>>> > Rob Crittenden wrote:
>>> > Bret Wortman wrote:
>>> > Well, my master ground to a halt and wasn't responding. I
>>> > rebooted the system and now I can't access the web UI or ssh
>>> > to the master either. I have console access but that's it.
>>> >
>>> > The services all say they're running, but the web UI gives an
>>> > "Unknown Error" dialog and ssh fails with
>>> > "ssh_exchange_identification: Connection closed by remote
>>> > host" whenever I try to ssh to ipamaster. I think something
>>> > has gone really wrong inside my master. Any ideas? Even after
>>> > the reboot, --cleanup isn't helping and just hangs.
>>> >
>>> > The logfiles end (as of the time I ^C'd the process) with:
>>> >
>>> > NSMMReplicationPlugin - agmt="cn=meTogood3.spx.net"
>>> > (good3:389): Replication bind with GSSAPI auth failed: LDAP
>>> > error -2 (Local error) (SASL(-1): generic failure: GSSAPI
>>> > Error: Unspecified GSS failure. Minor code may provide more
>>> > information (Cannot determine realm for numeric host
>>> > address))
>>> > NSMMReplicationPlugin - CleanAllRUV Task: Replica not online
>>> > (agmt="cn=meTogood3.foo.net" (good3:389))
>>> > NSMMReplicationPlugin - CleanAllRUV Task: Not all replicas
>>> > online, retrying in 160 seconds...,
>>> >
>>> > So it looks like it's having trouble talking with one of my
>>> > replicas and is doggedly trying to get the job done. Any idea
>>> > how to get the master back working again while I troubleshoot
>>> > this connectivity issue?
>>> >
>>> > That suggests a DNS problem, and it might explain ssh as well
>>> > depending on your configuration.
>>> >
>>> > To be clear, you ran --cleanup against one of the bad masters,
>>> > not a good one, right?
>>> >
>>> > rob
>>> >
>>> > _______________________________________________
>>> > Freeipa-users mailing list
>>> > Freeipa-users at redhat.com
>>> > https://www.redhat.com/mailman/listinfo/freeipa-users
>>>
>>> --
>>> Simo Sorce * Red Hat, Inc * New York
>