[Linux-cluster] Problem with GFS2 - Kernel Panic - Can NOT erase directory
Theophanis Kontogiannis
theophanis_kontogiannis at yahoo.gr
Wed Jul 2 10:20:43 UTC 2008
Hello again,
Wondering why only one service fails, I tried to track down the root
cause.
I found that the files in a single directory (where the failing service keeps
its files) are corrupted.
Trying to ls -l in the directory gives the following output:
ls: reading directory .: Input/output error
total 192
?--------- ? ? ? ? ? account_boinc.bakerlab.org_rosetta.xml
?--------- ? ? ? ? ? account_climateprediction.net.xml
?--------- ? ? ? ? ? account_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? all_projects_list.xml
-rw-r--r-- 1 boinc boinc 159796 Jun 22 22:47 client_state_prev.xml
?--------- ? ? ? ? ? client_state.xml
-rw-r--r-- 1 boinc boinc 5141 Jun 13 23:21 get_current_version.xml
?--------- ? ? ? ? ? get_project_config.xml
-rw-r--r-- 1 boinc boinc 899 Apr 4 17:06 global_prefs.xml
?--------- ? ? ? ? ? gui_rpc_auth.cfg
?--------- ? ? ? ? ? job_log_boinc.bakerlab.org_rosetta.txt
?--------- ? ? ? ? ? job_log_predictor.chem.lsa.umich.edu.txt
?--------- ? ? ? ? ? lockfile
?--------- ? ? ? ? ? lookup_account.xml
?--------- ? ? ? ? ? lookup_website.html
?--------- ? ? ? ? ? master_boinc.bakerlab.org_rosetta.xml
?--------- ? ? ? ? ? master_climateprediction.net.xml
?--------- ? ? ? ? ? master_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? projects
?--------- ? ? ? ? ? sched_reply_boinc.bakerlab.org_rosetta.xml
?--------- ? ? ? ? ? sched_reply_climateprediction.net.xml
?--------- ? ? ? ? ? sched_reply_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? sched_request_boinc.bakerlab.org_rosetta.xml
-rw-r--r-- 1 boinc boinc 6766 Jun 22 21:27 sched_request_climateprediction.net.xml
?--------- ? ? ? ? ? sched_request_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? slots
?--------- ? ? ? ? ? statistics_boinc.bakerlab.org_rosetta.xml
?--------- ? ? ? ? ? statistics_climateprediction.net.xml
?--------- ? ? ? ? ? statistics_predictor.chem.lsa.umich.edu.xml
?--------- ? ? ? ? ? stderrdae.txt
?--------- ? ? ? ? ? stdoutdae.txt
?--------- ? ? ? ? ? time_stats_log
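The `?---------` entries are names the kernel returns from the directory listing but refuses to stat(), which is why ls prints question marks for every attribute. As a minimal probe (the mount point `/mnt/gfs2/boinc` is a placeholder; adjust it to the real path), one could walk the entries and separate the readable files from the broken ones:

```shell
# Report which directory entries fail stat(), i.e. the ones
# that ls shows as "?---------".
probe_dir() {
    local f
    for f in "$1"/*; do
        if stat "$f" >/dev/null 2>&1; then
            echo "ok:  $f"
        else
            echo "bad: $f"   # entry whose inode can no longer be read
        fi
    done
}

probe_dir /mnt/gfs2/boinc    # placeholder path for the affected directory
```

This only narrows down which inodes are damaged; it does not repair anything.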
At the same moment the kernel reports the messages quoted below (from the
previous e-mail, attached).
Trying to rm -rf the directory fails with the same kernel message.
Any ideas on how to erase the problematic directory?
Also, the other node (the one on which I am not performing any actions on
the file system in question) gives the following message:
GFS2: fsid=tweety:gfs2-00.0: jid=1: Trying to acquire journal lock...
GFS2: fsid=tweety:gfs2-00.0: jid=1: Busy
And the file system remains inaccessible indefinitely. Does anyone know why that is?
Thank you all for your time
T. Kontogiannis
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Theophanis
Kontogiannis
Sent: Monday, June 30, 2008 5:52 PM
To: 'linux clustering'
Subject: [Linux-cluster] Problem with GFS2 - Kernel Panic
Hello all,
I have a two node cluster with DRBD running in Primary/Primary.
Both nodes are running:
· Kernel 2.6.18-92.1.6.el5.centos.plus
· GFS2 fsck 0.1.44
· cman_tool 2.0.84
· Cluster LVM daemon version: 2.02.32-RHEL5 (2008-03-04)
Protocol version: 0.2.1
· DRBD Version: 8.2.6 (api:88)
After a corruption (the result of updating and rebooting with the FS mounted,
a network interruption during the reboot, and similar issues), I keep
getting the following on one node:
Jun 30 00:13:40 tweety1 clurgmgrd[5283]: <notice> stop on script "BOINC" returned 1 (generic error)
Jun 30 00:13:40 tweety1 clurgmgrd[5283]: <info> Services Initialized
Jun 30 00:13:40 tweety1 clurgmgrd[5283]: <info> State change: Local UP
Jun 30 00:13:45 tweety1 clurgmgrd[5283]: <notice> Starting stopped service service:BOINC-t1
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: fatal: invalid metadata block
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: bh = 21879736 (magic number)
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 332
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: about to withdraw this file system
Jun 30 00:13:45 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: telling LM to withdraw
Jun 30 00:13:46 tweety1 clurgmgrd[5283]: <notice> Service service:BOINC-t1 started
Jun 30 00:13:46 tweety1 kernel: GFS2: fsid=tweety:gfs2-00.0: withdrawn
Jun 30 00:13:46 tweety1 kernel:
Jun 30 00:13:46 tweety1 kernel: Call Trace:
Jun 30 00:13:46 tweety1 kernel: [<ffffffff88629146>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
Jun 30 00:13:46 tweety1 kernel: [<ffffffff800639de>] __wait_on_bit+0x60/0x6e
Jun 30 00:13:46 tweety1 kernel: [<ffffffff80014eec>] sync_buffer+0x0/0x3f
Jun 30 00:13:46 tweety1 kernel: [<ffffffff80063a58>] out_of_line_wait_on_bit+0x6c/0x78
Jun 30 00:13:46 tweety1 kernel: [<ffffffff8009d1bb>] wake_bit_function+0x0/0x23
Jun 30 00:13:46 tweety1 kernel: [<ffffffff8863af7f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
Jun 30 00:13:46 tweety1 kernel: [<ffffffff8862ca06>] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15e
Jun 30 00:13:46 tweety1 kernel: [<ffffffff8862795a>] :gfs2:gfs2_inode_refresh+0x22/0x2ca
Jun 30 00:13:46 tweety1 kernel: [<ffffffff8009d1bb>] wake_bit_function+0x0/0x23
Jun 30 00:13:46 tweety1 kernel: [<ffffffff88626d9c>] :gfs2:inode_go_lock+0x29/0x57
Jun 30 00:13:47 tweety1 kernel: [<ffffffff88625f04>] :gfs2:glock_wait_internal+0x1d4/0x23f
Jun 30 00:13:47 tweety1 kernel: [<ffffffff8862611d>] :gfs2:gfs2_glock_nq+0x1ae/0x1d4
Jun 30 00:13:47 tweety1 kernel: [<ffffffff88632053>] :gfs2:gfs2_lookup+0x58/0xa7
Jun 30 00:13:47 tweety1 kernel: [<ffffffff8863204b>] :gfs2:gfs2_lookup+0x50/0xa7
Jun 30 00:13:47 tweety1 kernel: [<ffffffff80022663>] d_alloc+0x174/0x1a9
Jun 30 00:13:47 tweety1 kernel: [<ffffffff8000cbb4>] do_lookup+0xd3/0x1d4
Jun 30 00:13:47 tweety1 kernel: [<ffffffff80009f73>] __link_path_walk+0xa01/0xf42
Jun 30 00:13:47 tweety1 kernel: [<ffffffff8861fd37>] :gfs2:compare_dents+0x0/0x57
Jun 30 00:13:47 tweety1 kernel: [<ffffffff8000e782>] link_path_walk+0x5c/0xe5
Jun 30 00:13:47 tweety1 kernel: [<ffffffff88624d6f>] :gfs2:gfs2_glock_put+0x26/0x133
After that, the machine freezes completely. The only way to recover is to
power-cycle / reset.
"gfs2-fsck -vy /dev/mapper/vg0-data0" ends (not terminates, it just look
like it finishes) with:
Pass5 complete
Writing changes to disk
gfs2_fsck: buffer still held for block: 21875415 (0x14dcad7)
After remounting the file system and starting a service that has its
files on this gfs2 filesystem, the kernel again crashes with the same
message and the node freezes up.
Unfortunately, due to bad handling, I failed to run a DRBD invalidate on the
problematic node, which would have made it the sync target (and theoretically
would have solved the problem, since the good node would then have synced the
bad one). Instead, I made the bad node the sync source, and now both nodes
have the same issue :-(
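For reference, the intended recovery path described above could be sketched as follows (the resource name `r0` is an assumption; substitute the actual DRBD resource, and note these commands cannot be verified outside a real DRBD 8.x cluster):

```shell
# Run ON THE NODE WITH THE BAD DATA, while the peer holds the good copy.
drbdadm invalidate r0    # discard local data: this node becomes SyncTarget
cat /proc/drbd           # watch the resync progress from the good node
```

Running `invalidate` on the wrong node (or using the bad node as sync source) propagates the corruption to the peer, which is exactly what happened here.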
Any ideas of how can I resolve this issue?
Sincerely,
Theophanis Kontogiannis