[Freeipa-users] Replication woes

Bret Wortman bret.wortman at damascusgrp.com
Tue Aug 27 11:24:57 UTC 2013


I managed to gather some data for Rich and others to review and updated a
bug for them about a week ago. Now I am getting a lot of internal pressure
to resolve our problems and get our infrastructure stable again. As of
yesterday, our master IPA server accepts changes to DNS but isn't actually
serving DNS, nor is it pushing data to any replicas. The replicas
are acting as DNS servers but aren't getting any updates, nor can updates
be made locally on them. Fortunately, we aren't adding users very often,
but if anyone's password expires soon, I'm worried that I'll have an
account lockout situation.
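
For anyone wanting to reproduce what I'm describing, checks along these lines
show it (a sketch only; ipamaster.foo.net and replica1.foo.net stand in for our
real hostnames):

dig +short @ipamaster.foo.net ipamaster.foo.net A   # master no longer answers DNS queries
dig +short @replica1.foo.net ipamaster.foo.net A    # replicas answer, but with stale data
ipactl status                                       # services all claim to be running
ipa-replica-manage list                             # agreements as the master sees them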

So I'll ask again -- can anyone see a way to preserve just the actual DNS
and authentication data within IPA while dumping its other data
(replication and so on), restart it cleanly, verify it's working in all
respects, and set up the replicas from scratch again? I'm hearing rumblings
about going back to passwd files and host tables (which is what we were
doing until about 12 months ago when I brought IPA in) and I'd really
rather not go back to the stone ages....
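
To make the question concrete, the kind of export I have in mind (untested, and
the filenames are only placeholders) would be something like:

kinit admin
# snapshot the data we actually care about before touching replication
ipa user-find --all --raw > users-backup.txt
ipa dnszone-find --all --raw > dnszones-backup.txt
# plus a full LDIF dump of the directory tree as a safety net
ldapsearch -xLLL -D "cn=directory manager" -W -b "dc=foo,dc=com" > full-dump.ldif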

Thanks!




*Bret Wortman*

http://damascusgrp.com/
http://about.me/wortmanbret


On Tue, Aug 20, 2013 at 11:15 AM, Bret Wortman <bret.wortman at damascusgrp.com
> wrote:

> If I were going to attempt to restore to an old backup, what
> directories/files should I make sure to restore? I've got a backup script
> that tars up:
>
> /usr/share/ipa
> /usr/lib64/ipa
> /var/lib/ipa
> /var/lib/ipa-client
> /var/lib/dirsrv
> /etc
>
> Is that enough to "roll back" to a few days ago before I started down this
> path? I'm now seeing messages that the maximum number of CleanAllRUV tasks
> (4) has been reached and no more can be enqueued. So I'm really stuck now
> and don't know how soon I can get the files requested over to Rich for
> analysis.
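>
> (For what it's worth, the wrapper I'd probably use is roughly the following;
> adding /var/kerberos/krb5kdc for the KDC config is my own guess, not a
> verified list:)
>
> # rough sketch of a cold backup before making any further changes
> ipactl stop
> tar czf /root/ipa-backup-$(date +%F).tar.gz \
>     /usr/share/ipa /usr/lib64/ipa /var/lib/ipa /var/lib/ipa-client \
>     /var/lib/dirsrv /var/kerberos/krb5kdc /etc
> ipactl start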
>
>
> *Bret Wortman*
>
> http://damascusgrp.com/
> http://about.me/wortmanbret
>
>
> On Tue, Aug 20, 2013 at 9:46 AM, Rich Megginson <rmeggins at redhat.com> wrote:
>
>>  On 08/20/2013 05:55 AM, Bret Wortman wrote:
>>
>> Okay, now I'm thinking I need to dump all my replicas and start them
>> fresh. My /var/log/slapd-FOO-COM/errors is filled with messages like this:
>>
>>  NSMMReplicationPlugin - changelog program - agmt="cn=meTogood1.foo.com" (good1:389): CSN 520a49640000001d0000 not found, we aren't as up to date, or we purged
>> agmt="cn=meTogood1.foo.com" (good1:389) - Can't locate CSN 520a49640000001d0000 in the changelog (DB rc=-30988). The consumer may need to be reinitialized.
>>
>>  I assume the "consumer" is the replica, right? At present, I have two
>> replicas known to my master that are simply gone. Another is there but they
>> can't talk. Three more have good communication but I'm getting errors like
>> these. Is there a good, clean way to just clobber all the replicas and
>> start over without trashing the DNS and other identity data that is inside
>> my master and which *is* working? Deleting them from the master hasn't
>> been working; it tends to hang the master's DNS and other services until I
>> Ctrl-C out and "ipactl restart" it.
>>
>>  I'm afraid to venture out without a net here and make things worse....
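>>
>> (Untested sketch of the clobber-and-rebuild sequence I keep circling around,
>> with bad1/replica1/ipamaster as placeholder hostnames:)
>>
>> # on the master: drop the dead replica and clean its RUV entries
>> ipa-replica-manage del bad1.foo.net --force --cleanup
>> # on the master: prepare a fresh replica info file
>> ipa-replica-prepare replica1.foo.net
>> # on the new replica, after copying that file over: install from it
>> ipa-replica-install /var/lib/ipa/replica-info-replica1.foo.net.gpg
>> # or, where an agreement still exists but is stale, on that replica:
>> ipa-replica-manage re-initialize --from ipamaster.foo.net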
>>
>>
>> This looks like https://fedorahosted.org/389/ticket/47386
>>
>> We've never been able to reproduce this in a "controlled" environment.
>>
>> The original reporter has been able to get this to work in some cases by
>> restarting ipa (ipactl restart).
>>
>> Before you do that, would you be able to provide some information for me?
>>
>> On the supplier and consumer:
>> ldapsearch -xLLL -D "cn=directory manager" -W -b "dc=FOO,dc=COM" \
>>   '(&(objectclass=nstombstone)(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff))' \
>>   > ruv.ldif
>>
>> ldapsearch -xLLL -D "cn=directory manager" -W -b "cn=config" \
>>   '(objectclass=nsds5replicationagreement)' > agmt.ldif
>>
>> dbscan -f /var/lib/dirsrv/slapd-FOO-COM/cldb/*.db4 | head -200 > cldb.txt
>>
>> Be sure to obscure any sensitive data in ruv.ldif, agmt.ldif, and
>> cldb.txt - you can either attach to
>> https://fedorahosted.org/389/ticket/47386 or email to me directly.
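>>
>> (If it helps, one way to do that scrubbing could be a simple substitution
>> pass; example.com/example.net are placeholders for whatever needs hiding:)
>>
>> sed -i 's/foo\.com/example.com/g; s/foo\.net/example.net/g' ruv.ldif agmt.ldif cldb.txt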
>>
>>
>>
>>
>>
>>  *Bret Wortman*
>>
>>  http://damascusgrp.com/
>>  http://about.me/wortmanbret
>>
>>
>> On Mon, Aug 19, 2013 at 2:21 PM, Bret Wortman <
>> bret.wortman at damascusgrp.com> wrote:
>>
>>>    On my master (where this error is occurring), I've got, in
>>> /etc/hosts:
>>>
>>>  127.0.0.1 localhost localhost.localdomain
>>> ::1      localhost localhost.localdomain
>>> 1.2.3.4    ipamaster.foo.net ipamaster
>>>
>>>  So that should be okay, right?
>>>
>>>  # host ipamaster.foo.net
>>> ipamaster.foo.net has address 1.2.3.4
>>> # host ipamaster
>>> ipamaster.foo.net has address 1.2.3.4
>>> # host localhost
>>> localhost has address 127.0.0.1
>>> localhost has IPv6 address ::1
>>>  #
>>>
>>>  I checked the other system (the one I can't connect to) to be safe,
>>> and its /etc/hosts is similarly configured. It even has the master listed
>>> with its correct IP address.
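>>>
>>> (The resolution checks I would run on both boxes are along these lines; the
>>> reverse lookup is the one to watch, since a PTR record coming back as
>>> localhost would match what Simo described:)
>>>
>>> getent hosts ipamaster.foo.net   # what NSS (hosts file + DNS) actually returns
>>> hostname -f                      # should print the FQDN, not localhost
>>> dig +short -x 1.2.3.4            # reverse lookup should name the master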
>>>
>>>
>>>
>>> *Bret Wortman*
>>>
>>>  http://damascusgrp.com/
>>>  http://about.me/wortmanbret
>>>
>>>
>>> On Mon, Aug 19, 2013 at 2:02 PM, Simo Sorce <simo at redhat.com> wrote:
>>>
>>>> On Mon, 2013-08-19 at 13:51 -0400, Bret Wortman wrote:
>>>> > So, any idea how to fix the Kerberos problem?
>>>> >
>>>>
>>>>  If your server is trying to get a tgt for ldap/localhost it probably
>>>> means your /etc/hosts file is broken and has a line like this:
>>>>
>>>> 1.2.3.4 localhost my.real.name
>>>>
>>>> When GSSAPI tries to resolve my.real.name it gets back that 'localhost'
>>>> is the canonical name, so it tries to get a TGT with that name and it
>>>> fails.
>>>>
>>>> If /etc/hosts is fine, then the DNS server may be returning an IP address
>>>> that later resolves to localhost again.
>>>>
>>>> To unbreak it, make sure that if your fully qualified name is in
>>>> /etc/hosts, it is on its own line, pointing at the right IP address, with
>>>> the FQDN first on the line, e.g.:
>>>>
>>>> this is ok:
>>>> 1.2.3.4 server.full.name server
>>>>
>>>> this is not:
>>>> 1.2.3.4 server server.full.name
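>>>>
>>>> (As an illustration of the ordering rule, a quick check could be:)
>>>>
>>>> getent hosts 1.2.3.4            # first name printed should be server.full.name
>>>> getent hosts server.full.name   # and the forward lookup should agree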
>>>>
>>>> Simo.
>>>> >
>>>> > Bret Wortman
>>>> >
>>>> >
>>>> > http://damascusgrp.com/
>>>> >
>>>> > http://about.me/wortmanbret
>>>> >
>>>> >
>>>> >
>>>> > On Mon, Aug 19, 2013 at 12:19 PM, Bret Wortman
>>>> > <bret.wortman at damascusgrp.com> wrote:
>>>> >         ...and I got the web UI, authentication and sudo back via:
>>>> >
>>>> >
>>>> >         # ipactl stop
>>>> >         # ipactl start
>>>> >
>>>> >
>>>> >         Not sure why that worked, but it did. I was grasping at
>>>> >         straws, honestly.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >         Bret Wortman
>>>> >
>>>> >
>>>>  >         http://damascusgrp.com/
>>>> >
>>>> >         http://about.me/wortmanbret
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >         On Mon, Aug 19, 2013 at 12:18 PM, Bret Wortman
>>>> >         <bret.wortman at damascusgrp.com> wrote:
>>>> >                 Digging further, I think this log entry might be the
>>>> >                 problem between the two servers that aren't talking:
>>>> >
>>>> >
>>>> >                 slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id[] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Server ldap/localhost at SPX.NET not found in Kerberos database)) errno 2 (No such file or directory)
>>>> >
>>>> >
>>>> >                 Did I build something incorrectly when that server was
>>>> >                 set up originally?
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >                 Bret Wortman
>>>> >
>>>> >
>>>>  >                 http://damascusgrp.com/
>>>> >
>>>> >                 http://about.me/wortmanbret
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >                 On Mon, Aug 19, 2013 at 12:02 PM, Bret Wortman
>>>> >                 <bret.wortman at damascusgrp.com> wrote:
>>>> >                         I ran it on a good master, against a bad one.
>>>> >                         As in, I ran this command on my master IPA
>>>> >                         node:
>>>> >
>>>> >
>>>> >                         # ipa-replica-manage del --force bad1.foo.net --cleanup
>>>> >
>>>> >
>>>> >                         Was that wrong? I was trying to delete the bad
>>>> >                         replica from the master, so I figured the
>>>> >                         command needed to be run on the master. But
>>>> >                         again, my master is now in a state where it's
>>>> >                         not resolving DNS, user logins, or sudo at the
>>>> >                         very least.
>>>> >
>>>> >
>>>> >                         Oh, and I checked the node that it was
>>>> >                         complaining about earlier. The network
>>>> >                         connection to it is the pits, but it's there.
>>>> >                         And it resolves.
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >                         Bret Wortman
>>>> >
>>>> >
>>>>  >                         http://damascusgrp.com/
>>>> >
>>>> >                         http://about.me/wortmanbret
>>>> >
>>>> >
>>>> >
>>>> >                         On Mon, Aug 19, 2013 at 11:58 AM, Rob
>>>> >                         Crittenden <rcritten at redhat.com> wrote:
>>>> >                                 Rob Crittenden wrote:
>>>> >                                         Bret Wortman wrote:
>>>> >                                                 Well, my master ground to a halt and wasn't responding. I
>>>> >                                                 rebooted the system and now I can't access the web UI or
>>>> >                                                 ssh to the master either. I have console access but that's it.
>>>> >
>>>> >                                                 The services all say they're running, but the web UI gives
>>>> >                                                 an "Unknown Error" dialog and ssh fails with
>>>> >                                                 "ssh_exchange_identification: Connection closed by remote
>>>> >                                                 host" whenever I try to ssh to ipamaster. I think something
>>>> >                                                 has gone really wrong inside my master. Any ideas? Even
>>>> >                                                 after the reboot, --cleanup isn't helping and just hangs.
>>>> >
>>>> >                                                 The logfiles end (as of the time I ^C'd the process) with:
>>>> >
>>>> >                                                 NSMMReplicationPlugin - agmt="cn=meTogood3.spx.net" (good3:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Cannot determine realm for numeric host address))
>>>> >                                                 NSMMReplicationPlugin - CleanAllRUV Task: Replica not online (agmt="cn=meTogood3.foo.net" (good3:389))
>>>> >                                                 NSMMReplicationPlugin - CleanAllRUV Task: Not all replicas online, retrying in 160 seconds...
>>>> >
>>>> >                                                 So it looks like it's having trouble talking with one of my
>>>> >                                                 replicas and is doggedly trying to get the job done. Any
>>>> >                                                 idea how to get the master back working again while I
>>>> >                                                 troubleshoot this connectivity issue?
>>>> >
>>>> >                                         That suggests a DNS problem, and it might explain ssh as well
>>>> >                                         depending on your configuration.
>>>> >
>>>> >                                 To be clear, you ran --cleanup against one of the bad masters, not a
>>>> >                                 good one, right?
>>>> >
>>>> >                                 rob
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>> >
>>>>   > _______________________________________________
>>>> > Freeipa-users mailing list
>>>> > Freeipa-users at redhat.com
>>>> > https://www.redhat.com/mailman/listinfo/freeipa-users
>>>>
>>>>
>>>>  --
>>>> Simo Sorce * Red Hat, Inc * New York
>>>>
>>>>
>>>
>>>
>>
>>
>> _______________________________________________
>> Freeipa-users mailing list
>> Freeipa-users at redhat.com
>> https://www.redhat.com/mailman/listinfo/freeipa-users
>>
>>
>>
>

