[linux-lvm] LVM snapshot merge and corrupted file
guilherme.moro at gmail.com
Mon Dec 2 15:46:36 UTC 2013
Thanks for the response.
On Mon, Dec 2, 2013 at 2:39 PM, Mike Snitzer <snitzer at redhat.com> wrote:
> On Mon, Dec 02 2013 at 6:41am -0500,
> Guilherme Moro <guilherme.moro at gmail.com> wrote:
>> I know this is too broad a question, but please be kind ;)
>> The scenario:
>> RHEL 6.2 - snapshot a disk mounted over a multipath device-mapper device
>> Upgrade the system to RHEL 6.4
>> Merge the snapshot to return the system to its previous state.
>> The system becomes unstable and reboots in a loop (not reaching user
>> level, at least the logs don't show it)
>> Spotted a file with roughly 1200 bytes corrupted (mostly turned to 0).
> The first rollback attempt was done in production?
No, this is a test system, and the actual procedure was tested dozens
of times without any issue (we never checksummed the files, but the
system never got into a failed state before), which is why we think it
is probably hardware related.
>> Sadly, I got called to the machine too late to recover the console
>> output of the reboot (it's a blade and no console logging was
>> configured), and could not figure out whether a hardware failure happened.
>> As I don't have proper logs to investigate further, my questions are:
>> - Are there any known issues around snapshotting in these conditions
>> (RHEL 6.2 -> RHEL 6.4, multipath)?
> Not aware of any.
This is great, the main reason for the e-mail was to confirm that no
known issue exists.
>> - Is there any chance of this being a software failure (bug?), and does
>> the restore procedure warn me in the logs (/var/log/messages?) about
>> any failure during the restore (even if hardware related)?
>> My main suspicion for now is a hardware failure somewhere, but I was
>> kindly asked to be sure that this can't be a bug.
>> Any thoughts or pointers (docs, pieces of code, testing reports) would
>> be appreciated, so don't be shy :)
> The lvm2 testsuite has support for testing snapshot-merge; but it
> doesn't test layering a snapshot on top of multipath.
I assumed so, just confirming :)
> Without context (e.g. logs) for what happened it is really hard to say
> definitively whether or not you hit some software bug or if your problem
> was hardware failure like you suspect.
A snippet of the messages log is here http://pastebin.com/3k1y358N
But I couldn't spot anything weird, besides the fact that the logs
never go past that point until some 4 hours later. (The syslog error
goes away after 2 hours; probably the right file gets delivered by
puppet in the meantime. I don't know how, though, and even that is not
enough to get logs past that point immediately.) Anyway, I didn't send
the logs before because they seemed useless :)
Just on the other question, does LVM print any output if things
go wrong during the restore?
We are hooking a snapshot -> upgrade -> restore test into our CI,
with proper file checksums in place, so let's see if we can ever
reproduce it in normal operation.
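
The checksum part of that test could look roughly like the sketch below.
It is only an illustration, not the actual CI code: the function names
build_manifest and compare_manifests are hypothetical, SHA-256 is an
assumed choice of hash, and the snapshot, upgrade and merge steps
themselves are left out. The idea is to record a digest for every file
before taking the snapshot, record them again after the merge, and report
any path whose content changed.

```python
import hashlib
import os


def build_manifest(root):
    """Map each regular file under root (by relative path) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[os.path.relpath(path, root)] = digest.hexdigest()
    return manifest


def compare_manifests(before, after):
    """Return sorted lists of (changed, missing, added) relative paths."""
    changed = sorted(p for p in before if p in after and before[p] != after[p])
    missing = sorted(p for p in before if p not in after)
    added = sorted(p for p in after if p not in before)
    return changed, missing, added
```

In the test, build_manifest would run once before the snapshot is taken
and once after the merge completes, and the run would fail if
compare_manifests returns anything in its changed or missing lists.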