From mita at miraclelinux.com Fri Sep 9 08:42:14 2005 From: mita at miraclelinux.com (Akinobu Mita) Date: Fri, 9 Sep 2005 17:42:14 +0900 Subject: [PATCH 0/6] jbd cleanup Message-ID: <20050909084214.GB14205@miraclelinux.com> The following 6 patches clean up the jbd code and remove about 200 lines. The first 4 patches apply to 2.6.13-git8 and 2.6.13-mm2. The rest apply to 2.6.13-mm2. fs/jbd/checkpoint.c | 179 +++++++++++-------------------------------- fs/jbd/commit.c | 101 ++++++++++-------------- fs/jbd/journal.c | 11 +- fs/jbd/revoke.c | 158 ++++++++++++++----------------------- fs/jbd/transaction.c | 113 +++++---------------------- include/linux/jbd.h | 28 +++--- include/linux/journal-head.h | 4 7 files changed, 201 insertions(+), 393 deletions(-) From mita at miraclelinux.com Fri Sep 9 08:43:42 2005 From: mita at miraclelinux.com (Akinobu Mita) Date: Fri, 9 Sep 2005 17:43:42 +0900 Subject: [PATCH 1/6] jbd: remove duplicated debug print In-Reply-To: <20050909084214.GB14205@miraclelinux.com> References: <20050909084214.GB14205@miraclelinux.com> Message-ID: <20050909084342.GC14205@miraclelinux.com> remove duplicated debug print Signed-off-by: Akinobu Mita --- commit.c | 2 -- 1 files changed, 2 deletions(-) --- 2.6-mm/fs/jbd/commit.c.orig 2005-09-02 00:53:49.000000000 +0900 +++ 2.6-mm/fs/jbd/commit.c 2005-09-02 00:54:11.000000000 +0900 @@ -425,8 +425,6 @@ write_out_data: journal_write_revoke_records(journal, commit_transaction); - jbd_debug(3, "JBD: commit phase 2\n"); - /* * If we found any dirty or locked buffers, then we should have * looped back up to the write_out_data label. 
If there weren't From mita at miraclelinux.com Fri Sep 9 08:44:41 2005 From: mita at miraclelinux.com (Akinobu Mita) Date: Fri, 9 Sep 2005 17:44:41 +0900 Subject: [PATCH 2/6] jbd: use hlist for the revoke tables In-Reply-To: <20050909084214.GB14205@miraclelinux.com> References: <20050909084214.GB14205@miraclelinux.com> Message-ID: <20050909084441.GD14205@miraclelinux.com> use struct hlist_head and hlist_node for the revoke tables. Signed-off-by: Akinobu Mita --- revoke.c | 56 ++++++++++++++++++++++++++------------------------------ 1 files changed, 26 insertions(+), 30 deletions(-) diff -Nurp 2.6.13-mm1.old/fs/jbd/revoke.c 2.6.13-mm1/fs/jbd/revoke.c --- 2.6.13-mm1.old/fs/jbd/revoke.c 2005-09-04 21:46:35.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/revoke.c 2005-09-04 21:50:25.000000000 +0900 @@ -79,7 +79,7 @@ static kmem_cache_t *revoke_table_cache; struct jbd_revoke_record_s { - struct list_head hash; + struct hlist_node hash; tid_t sequence; /* Used for recovery only */ unsigned long blocknr; }; @@ -92,7 +92,7 @@ struct jbd_revoke_table_s * for recovery. Must be a power of two. 
*/ int hash_size; int hash_shift; - struct list_head *hash_table; + struct hlist_head *hash_table; }; @@ -119,7 +119,6 @@ static inline int hash(journal_t *journa static int insert_revoke_hash(journal_t *journal, unsigned long blocknr, tid_t seq) { - struct list_head *hash_list; struct jbd_revoke_record_s *record; repeat: @@ -129,9 +128,9 @@ repeat: record->sequence = seq; record->blocknr = blocknr; - hash_list = &journal->j_revoke->hash_table[hash(journal, blocknr)]; spin_lock(&journal->j_revoke_lock); - list_add(&record->hash, hash_list); + hlist_add_head(&record->hash, + &journal->j_revoke->hash_table[hash(journal, blocknr)]); spin_unlock(&journal->j_revoke_lock); return 0; @@ -148,19 +147,16 @@ oom: static struct jbd_revoke_record_s *find_revoke_record(journal_t *journal, unsigned long blocknr) { - struct list_head *hash_list; + struct hlist_node *node; struct jbd_revoke_record_s *record; - hash_list = &journal->j_revoke->hash_table[hash(journal, blocknr)]; - spin_lock(&journal->j_revoke_lock); - record = (struct jbd_revoke_record_s *) hash_list->next; - while (&(record->hash) != hash_list) { + hlist_for_each_entry(record, node, + &journal->j_revoke->hash_table[hash(journal, blocknr)], hash) { if (record->blocknr == blocknr) { spin_unlock(&journal->j_revoke_lock); return record; } - record = (struct jbd_revoke_record_s *) record->hash.next; } spin_unlock(&journal->j_revoke_lock); return NULL; @@ -219,7 +215,7 @@ int journal_init_revoke(journal_t *journ journal->j_revoke->hash_shift = shift; journal->j_revoke->hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct hlist_head), GFP_KERNEL); if (!journal->j_revoke->hash_table) { kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]); journal->j_revoke = NULL; @@ -227,7 +223,7 @@ int journal_init_revoke(journal_t *journ } for (tmp = 0; tmp < hash_size; tmp++) - INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]); + 
INIT_HLIST_HEAD(&journal->j_revoke->hash_table[tmp]); journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); if (!journal->j_revoke_table[1]) { @@ -246,7 +242,7 @@ int journal_init_revoke(journal_t *journ journal->j_revoke->hash_shift = shift; journal->j_revoke->hash_table = - kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL); + kmalloc(hash_size * sizeof(struct hlist_head), GFP_KERNEL); if (!journal->j_revoke->hash_table) { kfree(journal->j_revoke_table[0]->hash_table); kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]); @@ -256,7 +252,7 @@ int journal_init_revoke(journal_t *journ } for (tmp = 0; tmp < hash_size; tmp++) - INIT_LIST_HEAD(&journal->j_revoke->hash_table[tmp]); + INIT_HLIST_HEAD(&journal->j_revoke->hash_table[tmp]); spin_lock_init(&journal->j_revoke_lock); @@ -268,7 +264,7 @@ int journal_init_revoke(journal_t *journ void journal_destroy_revoke(journal_t *journal) { struct jbd_revoke_table_s *table; - struct list_head *hash_list; + struct hlist_head *hash_list; int i; table = journal->j_revoke_table[0]; @@ -277,7 +273,7 @@ void journal_destroy_revoke(journal_t *j for (i=0; i<table->hash_size; i++) { hash_list = &table->hash_table[i]; - J_ASSERT (list_empty(hash_list)); + J_ASSERT (hlist_empty(hash_list)); } kfree(table->hash_table); @@ -290,7 +286,7 @@ void journal_destroy_revoke(journal_t *j for (i=0; i<table->hash_size; i++) { hash_list = &table->hash_table[i]; - J_ASSERT (list_empty(hash_list)); + J_ASSERT (hlist_empty(hash_list)); } kfree(table->hash_table); @@ -445,7 +441,7 @@ int journal_cancel_revoke(handle_t *hand jbd_debug(4, "cancelled existing revoke on " "blocknr %llu\n", (unsigned long long)bh->b_blocknr); spin_lock(&journal->j_revoke_lock); - list_del(&record->hash); + hlist_del(&record->hash); spin_unlock(&journal->j_revoke_lock); kmem_cache_free(revoke_record_cache, record); did_revoke = 1; @@ -488,7 +484,7 @@ void journal_switch_revoke_table(journal journal->j_revoke = journal->j_revoke_table[0]; for (i = 
0; i < journal->j_revoke->hash_size; i++) - INIT_LIST_HEAD(&journal->j_revoke->hash_table[i]); + INIT_HLIST_HEAD(&journal->j_revoke->hash_table[i]); } /* @@ -504,7 +500,6 @@ void journal_write_revoke_records(journa struct journal_head *descriptor; struct jbd_revoke_record_s *record; struct jbd_revoke_table_s *revoke; - struct list_head *hash_list; int i, offset, count; descriptor = NULL; @@ -516,16 +511,16 @@ void journal_write_revoke_records(journa journal->j_revoke_table[1] : journal->j_revoke_table[0]; for (i = 0; i < revoke->hash_size; i++) { - hash_list = &revoke->hash_table[i]; + struct hlist_head *hash_list = &revoke->hash_table[i]; - while (!list_empty(hash_list)) { - record = (struct jbd_revoke_record_s *) - hash_list->next; + while (!hlist_empty(hash_list)) { + record = hlist_entry(hash_list->first, + struct jbd_revoke_record_s, hash); write_one_revoke_record(journal, transaction, &descriptor, &offset, record); count++; - list_del(&record->hash); + hlist_del(&record->hash); kmem_cache_free(revoke_record_cache, record); } } @@ -686,7 +681,7 @@ int journal_test_revoke(journal_t *journ void journal_clear_revoke(journal_t *journal) { int i; - struct list_head *hash_list; + struct hlist_head *hash_list; struct jbd_revoke_record_s *record; struct jbd_revoke_table_s *revoke; @@ -694,9 +689,10 @@ void journal_clear_revoke(journal_t *jou for (i = 0; i < revoke->hash_size; i++) { hash_list = &revoke->hash_table[i]; - while (!list_empty(hash_list)) { - record = (struct jbd_revoke_record_s*) hash_list->next; - list_del(&record->hash); + while (!hlist_empty(hash_list)) { + record = hlist_entry(hash_list->first, + struct jbd_revoke_record_s, hash); + hlist_del(&record->hash); kmem_cache_free(revoke_record_cache, record); } } From mita at miraclelinux.com Fri Sep 9 08:46:00 2005 From: mita at miraclelinux.com (Akinobu Mita) Date: Fri, 9 Sep 2005 17:46:00 +0900 Subject: [PATCH 3/6] jbd: cleanup for initializing/destroying the revoke tables In-Reply-To: 
<20050909084214.GB14205@miraclelinux.com> References: <20050909084214.GB14205@miraclelinux.com> Message-ID: <20050909084600.GE14205@miraclelinux.com> use loop counter for initializing/destroying a pair of the revoke tables. Signed-off-by: Akinobu Mita --- revoke.c | 116 ++++++++++++++++++++++----------------------------------------- 1 files changed, 42 insertions(+), 74 deletions(-) diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/revoke.c 2.6.13-mm1/fs/jbd/revoke.c --- 2.6.13-mm1.old/fs/jbd/revoke.c 2005-09-05 03:21:00.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/revoke.c 2005-09-05 11:16:04.000000000 +0900 @@ -193,108 +193,76 @@ void journal_destroy_revoke_caches(void) int journal_init_revoke(journal_t *journal, int hash_size) { - int shift, tmp; + int shift = 0; + int tmp = hash_size; + int i; + /* Check that the hash_size is a power of two */ + J_ASSERT ((hash_size & (hash_size-1)) == 0); J_ASSERT (journal->j_revoke_table[0] == NULL); - shift = 0; - tmp = hash_size; - while((tmp >>= 1UL) != 0UL) + while ((tmp >>= 1UL) != 0UL) shift++; - journal->j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); - if (!journal->j_revoke_table[0]) - return -ENOMEM; - journal->j_revoke = journal->j_revoke_table[0]; - - /* Check that the hash_size is a power of two */ - J_ASSERT ((hash_size & (hash_size-1)) == 0); + for (i = 0; i < 2; i++) { + struct jbd_revoke_table_s *table; - journal->j_revoke->hash_size = hash_size; + table = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); + if (!table) + goto nomem; + + table->hash_size = hash_size; + table->hash_shift = shift; + table->hash_table = kmalloc(hash_size * sizeof(struct hlist_head), GFP_KERNEL); + if (!table->hash_table) { + kmem_cache_free(revoke_table_cache, table); + goto nomem; + } - journal->j_revoke->hash_shift = shift; + for (tmp = 0; tmp < hash_size; tmp++) + INIT_HLIST_HEAD(&table->hash_table[tmp]); - journal->j_revoke->hash_table = - kmalloc(hash_size * sizeof(struct hlist_head), 
GFP_KERNEL); - if (!journal->j_revoke->hash_table) { - kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]); - journal->j_revoke = NULL; - return -ENOMEM; - } - - for (tmp = 0; tmp < hash_size; tmp++) - INIT_HLIST_HEAD(&journal->j_revoke->hash_table[tmp]); - - journal->j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, GFP_KERNEL); - if (!journal->j_revoke_table[1]) { - kfree(journal->j_revoke_table[0]->hash_table); - kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]); - return -ENOMEM; + journal->j_revoke_table[i] = table; } - journal->j_revoke = journal->j_revoke_table[1]; + spin_lock_init(&journal->j_revoke_lock); - /* Check that the hash_size is a power of two */ - J_ASSERT ((hash_size & (hash_size-1)) == 0); - - journal->j_revoke->hash_size = hash_size; - - journal->j_revoke->hash_shift = shift; + return 0; - journal->j_revoke->hash_table = - kmalloc(hash_size * sizeof(struct hlist_head), GFP_KERNEL); - if (!journal->j_revoke->hash_table) { - kfree(journal->j_revoke_table[0]->hash_table); - kmem_cache_free(revoke_table_cache, journal->j_revoke_table[0]); - kmem_cache_free(revoke_table_cache, journal->j_revoke_table[1]); - journal->j_revoke = NULL; - return -ENOMEM; +nomem: + while (i--) { + kfree(journal->j_revoke_table[i]->hash_table); + kmem_cache_free(revoke_table_cache, journal->j_revoke_table[i]); } - for (tmp = 0; tmp < hash_size; tmp++) - INIT_HLIST_HEAD(&journal->j_revoke->hash_table[tmp]); - - spin_lock_init(&journal->j_revoke_lock); - - return 0; + return -ENOMEM; } /* Destoy a journal's revoke table. The table must already be empty! 
*/ void journal_destroy_revoke(journal_t *journal) { - struct jbd_revoke_table_s *table; - struct hlist_head *hash_list; - int i; + int j; - table = journal->j_revoke_table[0]; - if (!table) - return; + journal->j_revoke = NULL; - for (i=0; i<table->hash_size; i++) { - hash_list = &table->hash_table[i]; - J_ASSERT (hlist_empty(hash_list)); - } + for (j = 0; j < 2; j++) { + int i; + struct jbd_revoke_table_s *table = journal->j_revoke_table[j]; - kfree(table->hash_table); - kmem_cache_free(revoke_table_cache, table); - journal->j_revoke = NULL; + if (!table) + return; - table = journal->j_revoke_table[1]; - if (!table) - return; + for (i = 0; i < table->hash_size; i++) { + struct hlist_head *hash_list = &table->hash_table[i]; + J_ASSERT (hlist_empty(hash_list)); + } - for (i=0; i<table->hash_size; i++) { - hash_list = &table->hash_table[i]; - J_ASSERT (hlist_empty(hash_list)); + kfree(table->hash_table); + kmem_cache_free(revoke_table_cache, table); } - - kfree(table->hash_table); - kmem_cache_free(revoke_table_cache, table); - journal->j_revoke = NULL; } - #ifdef __KERNEL__ /* From mita at miraclelinux.com Fri Sep 9 08:47:23 2005 From: mita at miraclelinux.com (Akinobu Mita) Date: Fri, 9 Sep 2005 17:47:23 +0900 Subject: [PATCH 4/6] jbd: use list_head for the list of buffers on a transaction's data In-Reply-To: <20050909084214.GB14205@miraclelinux.com> References: <20050909084214.GB14205@miraclelinux.com> Message-ID: <20050909084723.GF14205@miraclelinux.com> use struct list_head for doubly-linked list of buffers on a transaction's data, metadata or forget queue. 
Signed-off-by: Akinobu Mita --- fs/jbd/checkpoint.c | 12 ++-- fs/jbd/commit.c | 79 ++++++++++++++++-------------- fs/jbd/journal.c | 1 fs/jbd/transaction.c | 110 ++++++++----------------------------------- include/linux/jbd.h | 20 +++---- include/linux/journal-head.h | 2 6 files changed, 80 insertions(+), 144 deletions(-) diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/checkpoint.c 2.6.13-mm1/fs/jbd/checkpoint.c --- 2.6.13-mm1.old/fs/jbd/checkpoint.c 2005-09-05 03:15:17.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/checkpoint.c 2005-09-05 03:15:35.000000000 +0900 @@ -684,12 +684,12 @@ void __journal_drop_transaction(journal_ } J_ASSERT(transaction->t_state == T_FINISHED); - J_ASSERT(transaction->t_buffers == NULL); - J_ASSERT(transaction->t_sync_datalist == NULL); - J_ASSERT(transaction->t_forget == NULL); - J_ASSERT(transaction->t_iobuf_list == NULL); - J_ASSERT(transaction->t_shadow_list == NULL); - J_ASSERT(transaction->t_log_list == NULL); + J_ASSERT(list_empty(&transaction->t_metadata_list)); + J_ASSERT(list_empty(&transaction->t_syncdata_list)); + J_ASSERT(list_empty(&transaction->t_forget_list)); + J_ASSERT(list_empty(&transaction->t_io_list)); + J_ASSERT(list_empty(&transaction->t_shadow_list)); + J_ASSERT(list_empty(&transaction->t_logctl_list)); J_ASSERT(transaction->t_checkpoint_list == NULL); J_ASSERT(transaction->t_checkpoint_io_list == NULL); J_ASSERT(transaction->t_updates == 0); diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/commit.c 2.6.13-mm1/fs/jbd/commit.c --- 2.6.13-mm1.old/fs/jbd/commit.c 2005-09-05 03:16:12.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/commit.c 2005-09-05 03:15:35.000000000 +0900 @@ -250,8 +250,9 @@ void journal_commit_transaction(journal_ * that multiple journal_get_write_access() calls to the same * buffer are perfectly permissable. 
*/ - while (commit_transaction->t_reserved_list) { - jh = commit_transaction->t_reserved_list; + while (!list_empty(&commit_transaction->t_reserved_list)) { + jh = list_entry(commit_transaction->t_reserved_list.next, + struct journal_head, b_list); JBUFFER_TRACE(jh, "reserved, unused: refile"); /* * A journal_get_undo_access()+journal_release_buffer() may @@ -300,14 +301,9 @@ void journal_commit_transaction(journal_ * will be tracked for a new trasaction only -bzzz */ spin_lock(&journal->j_list_lock); - if (commit_transaction->t_buffers) { - new_jh = jh = commit_transaction->t_buffers->b_tnext; - do { - J_ASSERT_JH(new_jh, new_jh->b_modified == 1 || - new_jh->b_modified == 0); - new_jh->b_modified = 0; - new_jh = new_jh->b_tnext; - } while (new_jh != jh); + list_for_each_entry(jh, &commit_transaction->t_metadata_list, b_list) { + J_ASSERT_JH(jh, jh->b_modified == 1 || jh->b_modified == 0); + jh->b_modified = 0; } spin_unlock(&journal->j_list_lock); @@ -319,7 +315,7 @@ void journal_commit_transaction(journal_ err = 0; /* * Whenever we unlock the journal and sleep, things can get added - * onto ->t_sync_datalist, so we have to keep looping back to + * onto ->t_syncdata_list, so we have to keep looping back to * write_out_data until we *know* that the list is empty. */ bufs = 0; @@ -331,11 +327,12 @@ write_out_data: cond_resched(); spin_lock(&journal->j_list_lock); - while (commit_transaction->t_sync_datalist) { + while (!list_empty(&commit_transaction->t_syncdata_list)) { struct buffer_head *bh; - jh = commit_transaction->t_sync_datalist; - commit_transaction->t_sync_datalist = jh->b_tnext; + jh = list_entry(commit_transaction->t_syncdata_list.next, + struct journal_head, b_list); + list_move_tail(&jh->b_list, &commit_transaction->t_syncdata_list); bh = jh2bh(jh); if (buffer_locked(bh)) { BUFFER_TRACE(bh, "locked"); @@ -389,10 +386,11 @@ write_out_data: /* * Wait for all previously submitted IO to complete. 
*/ - while (commit_transaction->t_locked_list) { + while (!list_empty(&commit_transaction->t_locked_list)) { struct buffer_head *bh; - jh = commit_transaction->t_locked_list->b_tprev; + jh = list_entry(commit_transaction->t_locked_list.prev, + struct journal_head, b_list); bh = jh2bh(jh); get_bh(bh); if (buffer_locked(bh)) { @@ -431,7 +429,7 @@ write_out_data: * any then journal_clean_data_list should have wiped the list * clean by now, so check that it is in fact empty. */ - J_ASSERT (commit_transaction->t_sync_datalist == NULL); + J_ASSERT (list_empty(&commit_transaction->t_syncdata_list)); jbd_debug (3, "JBD: commit phase 3\n"); @@ -444,11 +442,12 @@ write_out_data: descriptor = NULL; bufs = 0; - while (commit_transaction->t_buffers) { + while (!list_empty(&commit_transaction->t_metadata_list)) { /* Find the next buffer to be journaled... */ - jh = commit_transaction->t_buffers; + jh = list_entry(commit_transaction->t_metadata_list.next, + struct journal_head, b_list); /* If we're in abort mode, we just un-journal the buffer and release it for background writing. */ @@ -460,7 +459,7 @@ write_out_data: * any descriptor buffers which may have been * already allocated, even if we are now * aborting. */ - if (!commit_transaction->t_buffers) + if (list_empty(&commit_transaction->t_metadata_list)) goto start_journal_io; continue; } @@ -569,7 +568,7 @@ write_out_data: let the IO rip! */ if (bufs == journal->j_wbufsize || - commit_transaction->t_buffers == NULL || + list_empty(&commit_transaction->t_metadata_list) || space_left < sizeof(journal_block_tag_t) + 16) { jbd_debug(4, "JBD: Submit %d IOs\n", bufs); @@ -601,8 +600,8 @@ start_journal_io: /* Lo and behold: we have just managed to send a transaction to the log. Before we can commit it, wait for the IO so far to complete. Control buffers being written are on the - transaction's t_log_list queue, and metadata buffers are on - the t_iobuf_list queue. 
+ transaction's t_logctl_list queue, and metadata buffers are on + the t_io_list queue. Wait for the buffers in reverse order. That way we are less likely to be woken up until all IOs have completed, and @@ -616,10 +615,11 @@ start_journal_io: * See __journal_try_to_free_buffer. */ wait_for_iobuf: - while (commit_transaction->t_iobuf_list != NULL) { + while (!list_empty(&commit_transaction->t_io_list)) { struct buffer_head *bh; - jh = commit_transaction->t_iobuf_list->b_tprev; + jh = list_entry(commit_transaction->t_io_list.prev, + struct journal_head, b_list); bh = jh2bh(jh); if (buffer_locked(bh)) { wait_on_buffer(bh); @@ -637,7 +637,7 @@ wait_for_iobuf: journal_unfile_buffer(journal, jh); /* - * ->t_iobuf_list should contain only dummy buffer_heads + * ->t_io_list should contain only dummy buffer_heads * which were created by journal_write_metadata_buffer(). */ BUFFER_TRACE(bh, "dumping temporary bh"); @@ -648,7 +648,8 @@ wait_for_iobuf: /* We also have to unlock and free the corresponding shadowed buffer */ - jh = commit_transaction->t_shadow_list->b_tprev; + jh = list_entry(commit_transaction->t_shadow_list.prev, + struct journal_head, b_list); bh = jh2bh(jh); clear_bit(BH_JWrite, &bh->b_state); J_ASSERT_BH(bh, buffer_jbddirty(bh)); @@ -666,16 +667,17 @@ wait_for_iobuf: __brelse(bh); } - J_ASSERT (commit_transaction->t_shadow_list == NULL); + J_ASSERT (list_empty(&commit_transaction->t_shadow_list)); jbd_debug(3, "JBD: commit phase 5\n"); /* Here we wait for the revoke record and descriptor record buffers */ wait_for_ctlbuf: - while (commit_transaction->t_log_list != NULL) { + while (!list_empty(&commit_transaction->t_logctl_list)) { struct buffer_head *bh; - jh = commit_transaction->t_log_list->b_tprev; + jh = list_entry(commit_transaction->t_logctl_list.prev, + struct journal_head, b_list); bh = jh2bh(jh); if (buffer_locked(bh)) { wait_on_buffer(bh); @@ -710,12 +712,12 @@ wait_for_iobuf: jbd_debug(3, "JBD: commit phase 7\n"); - 
J_ASSERT(commit_transaction->t_sync_datalist == NULL); - J_ASSERT(commit_transaction->t_buffers == NULL); + J_ASSERT(list_empty(&commit_transaction->t_syncdata_list)); + J_ASSERT(list_empty(&commit_transaction->t_metadata_list)); J_ASSERT(commit_transaction->t_checkpoint_list == NULL); - J_ASSERT(commit_transaction->t_iobuf_list == NULL); - J_ASSERT(commit_transaction->t_shadow_list == NULL); - J_ASSERT(commit_transaction->t_log_list == NULL); + J_ASSERT(list_empty(&commit_transaction->t_io_list)); + J_ASSERT(list_empty(&commit_transaction->t_shadow_list)); + J_ASSERT(list_empty(&commit_transaction->t_logctl_list)); restart_loop: /* @@ -723,11 +725,12 @@ restart_loop: * to this list we have to be careful and hold the j_list_lock. */ spin_lock(&journal->j_list_lock); - while (commit_transaction->t_forget) { + while (!list_empty(&commit_transaction->t_forget_list)) { transaction_t *cp_transaction; struct buffer_head *bh; - jh = commit_transaction->t_forget; + jh = list_entry(commit_transaction->t_forget_list.next, + struct journal_head, b_list); spin_unlock(&journal->j_list_lock); bh = jh2bh(jh); jbd_lock_bh_state(bh); @@ -811,7 +814,7 @@ restart_loop: * Now recheck if some buffers did not get attached to the transaction * while the lock was dropped... 
*/ - if (commit_transaction->t_forget) { + if (!list_empty(&commit_transaction->t_forget_list)) { spin_unlock(&journal->j_list_lock); spin_unlock(&journal->j_state_lock); goto restart_loop; diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/journal.c 2.6.13-mm1/fs/jbd/journal.c --- 2.6.13-mm1.old/fs/jbd/journal.c 2005-09-05 03:15:17.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/journal.c 2005-09-05 03:15:39.000000000 +0900 @@ -1761,6 +1761,7 @@ repeat: set_buffer_jbd(bh); bh->b_private = jh; jh->b_bh = bh; + INIT_LIST_HEAD(&jh->b_list); get_bh(bh); BUFFER_TRACE(bh, "added journal_head"); } diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/transaction.c 2.6.13-mm1/fs/jbd/transaction.c --- 2.6.13-mm1.old/fs/jbd/transaction.c 2005-09-05 03:15:17.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/transaction.c 2005-09-05 03:15:35.000000000 +0900 @@ -51,6 +51,14 @@ get_transaction(journal_t *journal, tran transaction->t_tid = journal->j_transaction_sequence++; transaction->t_expires = jiffies + journal->j_commit_interval; spin_lock_init(&transaction->t_handle_lock); + INIT_LIST_HEAD(&transaction->t_reserved_list); + INIT_LIST_HEAD(&transaction->t_locked_list); + INIT_LIST_HEAD(&transaction->t_metadata_list); + INIT_LIST_HEAD(&transaction->t_syncdata_list); + INIT_LIST_HEAD(&transaction->t_forget_list); + INIT_LIST_HEAD(&transaction->t_io_list); + INIT_LIST_HEAD(&transaction->t_shadow_list); + INIT_LIST_HEAD(&transaction->t_logctl_list); /* Set up the commit timer for the new transaction. */ journal->j_commit_timer->expires = transaction->t_expires; @@ -1414,64 +1422,12 @@ int journal_force_commit(journal_t *jour return ret; } -/* - * - * List management code snippets: various functions for manipulating the - * transaction buffer lists. - * - */ - -/* - * Append a buffer to a transaction list, given the transaction's list head - * pointer. - * - * j_list_lock is held. - * - * jbd_lock_bh_state(jh2bh(jh)) is held. 
- */ - -static inline void -__blist_add_buffer(struct journal_head **list, struct journal_head *jh) -{ - if (!*list) { - jh->b_tnext = jh->b_tprev = jh; - *list = jh; - } else { - /* Insert at the tail of the list to preserve order */ - struct journal_head *first = *list, *last = first->b_tprev; - jh->b_tprev = last; - jh->b_tnext = first; - last->b_tnext = first->b_tprev = jh; - } -} - -/* - * Remove a buffer from a transaction list, given the transaction's list - * head pointer. - * - * Called with j_list_lock held, and the journal may not be locked. - * - * jbd_lock_bh_state(jh2bh(jh)) is held. - */ - -static inline void -__blist_del_buffer(struct journal_head **list, struct journal_head *jh) -{ - if (*list == jh) { - *list = jh->b_tnext; - if (*list == jh) - *list = NULL; - } - jh->b_tprev->b_tnext = jh->b_tnext; - jh->b_tnext->b_tprev = jh->b_tprev; -} - /* * Remove a buffer from the appropriate transaction list. * * Note that this function can *change* the value of - * bh->b_transaction->t_sync_datalist, t_buffers, t_forget, - * t_iobuf_list, t_shadow_list, t_log_list or t_reserved_list. If the caller + * bh->b_transaction->t_syncdata_list, t_metadata_list, t_forget_list, + * t_io_list, t_shadow_list, t_logctl_list or t_reserved_list. If the caller * is holding onto a copy of one of thee pointers, it could go bad. * Generally the caller needs to re-read the pointer from the transaction_t. 
* @@ -1479,7 +1435,6 @@ __blist_del_buffer(struct journal_head * */ void __journal_temp_unlink_buffer(struct journal_head *jh) { - struct journal_head **list = NULL; transaction_t *transaction; struct buffer_head *bh = jh2bh(jh); @@ -1495,35 +1450,12 @@ void __journal_temp_unlink_buffer(struct switch (jh->b_jlist) { case BJ_None: return; - case BJ_SyncData: - list = &transaction->t_sync_datalist; - break; case BJ_Metadata: - transaction->t_nr_buffers--; - J_ASSERT_JH(jh, transaction->t_nr_buffers >= 0); - list = &transaction->t_buffers; - break; - case BJ_Forget: - list = &transaction->t_forget; - break; - case BJ_IO: - list = &transaction->t_iobuf_list; - break; - case BJ_Shadow: - list = &transaction->t_shadow_list; - break; - case BJ_LogCtl: - list = &transaction->t_log_list; - break; - case BJ_Reserved: - list = &transaction->t_reserved_list; - break; - case BJ_Locked: - list = &transaction->t_locked_list; + transaction->t_nr_metadata--; + J_ASSERT_JH(jh, transaction->t_nr_metadata >= 0); break; } - - __blist_del_buffer(list, jh); + list_del(&jh->b_list); jh->b_jlist = BJ_None; if (test_clear_buffer_jbddirty(bh)) mark_buffer_dirty(bh); /* Expose it to the VM */ @@ -1924,7 +1856,7 @@ int journal_invalidatepage(journal_t *jo void __journal_file_buffer(struct journal_head *jh, transaction_t *transaction, int jlist) { - struct journal_head **list = NULL; + struct list_head *list = NULL; int was_dirty = 0; struct buffer_head *bh = jh2bh(jh); @@ -1959,23 +1891,23 @@ void __journal_file_buffer(struct journa J_ASSERT_JH(jh, !jh->b_frozen_data); return; case BJ_SyncData: - list = &transaction->t_sync_datalist; + list = &transaction->t_syncdata_list; break; case BJ_Metadata: - transaction->t_nr_buffers++; - list = &transaction->t_buffers; + transaction->t_nr_metadata++; + list = &transaction->t_metadata_list; break; case BJ_Forget: - list = &transaction->t_forget; + list = &transaction->t_forget_list; break; case BJ_IO: - list = &transaction->t_iobuf_list; + list = 
&transaction->t_io_list; break; case BJ_Shadow: list = &transaction->t_shadow_list; break; case BJ_LogCtl: - list = &transaction->t_log_list; + list = &transaction->t_logctl_list; break; case BJ_Reserved: list = &transaction->t_reserved_list; @@ -1985,7 +1917,7 @@ void __journal_file_buffer(struct journa break; } - __blist_add_buffer(list, jh); + list_add_tail(&jh->b_list, list); jh->b_jlist = jlist; if (was_dirty) diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/include/linux/jbd.h 2.6.13-mm1/include/linux/jbd.h --- 2.6.13-mm1.old/include/linux/jbd.h 2005-09-05 03:15:24.000000000 +0900 +++ 2.6.13-mm1/include/linux/jbd.h 2005-09-05 03:15:35.000000000 +0900 @@ -459,39 +459,39 @@ struct transaction_s */ unsigned long t_log_start; - /* Number of buffers on the t_buffers list [j_list_lock] */ - int t_nr_buffers; + /* Number of buffers on the t_metadata_list [j_list_lock] */ + int t_nr_metadata; /* * Doubly-linked circular list of all buffers reserved but not yet * modified by this transaction [j_list_lock] */ - struct journal_head *t_reserved_list; + struct list_head t_reserved_list; /* * Doubly-linked circular list of all buffers under writeout during * commit [j_list_lock] */ - struct journal_head *t_locked_list; + struct list_head t_locked_list; /* * Doubly-linked circular list of all metadata buffers owned by this * transaction [j_list_lock] */ - struct journal_head *t_buffers; + struct list_head t_metadata_list; /* * Doubly-linked circular list of all data buffers still to be * flushed before this transaction can be committed [j_list_lock] */ - struct journal_head *t_sync_datalist; + struct list_head t_syncdata_list; /* * Doubly-linked circular list of all forget buffers (superseded * buffers which we can un-checkpoint once this transaction commits) * [j_list_lock] */ - struct journal_head *t_forget; + struct list_head t_forget_list; /* * Doubly-linked circular list of all buffers still to be flushed before @@ -509,20 +509,20 @@ struct transaction_s 
* Doubly-linked circular list of temporary buffers currently undergoing * IO in the log [j_list_lock] */ - struct journal_head *t_iobuf_list; + struct list_head t_io_list; /* * Doubly-linked circular list of metadata buffers being shadowed by log * IO. The IO buffers on the iobuf list and the shadow buffers on this * list match each other one for one at all times. [j_list_lock] */ - struct journal_head *t_shadow_list; + struct list_head t_shadow_list; /* * Doubly-linked circular list of control buffers being written to the * log. [j_list_lock] */ - struct journal_head *t_log_list; + struct list_head t_logctl_list; /* * Protects info related to handles diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/include/linux/journal-head.h 2.6.13-mm1/include/linux/journal-head.h --- 2.6.13-mm1.old/include/linux/journal-head.h 2005-09-05 03:15:24.000000000 +0900 +++ 2.6.13-mm1/include/linux/journal-head.h 2005-09-05 03:15:35.000000000 +0900 @@ -72,7 +72,7 @@ struct journal_head { * Doubly-linked list of buffers on a transaction's data, metadata or * forget queue. [t_list_lock] [jbd_lock_bh_state()] */ - struct journal_head *b_tnext, *b_tprev; + struct list_head b_list; /* * Pointer to the compound transaction against which this buffer From mita at miraclelinux.com Fri Sep 9 08:48:51 2005 From: mita at miraclelinux.com (Akinobu Mita) Date: Fri, 9 Sep 2005 17:48:51 +0900 Subject: [-mm PATCH 5/6] jbd: use list_head for the list of all transactions waiting for In-Reply-To: <20050909084214.GB14205@miraclelinux.com> References: <20050909084214.GB14205@miraclelinux.com> Message-ID: <20050909084851.GG14205@miraclelinux.com> use struct list_head for a linked circular list of all transactions waiting for checkpointing on a journal control structure. 
Signed-off-by: Akinobu Mita --- fs/jbd/checkpoint.c | 48 ++++++++++++++++++++---------------------------- fs/jbd/commit.c | 16 ++-------------- fs/jbd/journal.c | 9 +++++---- fs/jbd/transaction.c | 1 + include/linux/jbd.h | 4 ++-- 5 files changed, 30 insertions(+), 48 deletions(-) diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/checkpoint.c 2.6.13-mm1/fs/jbd/checkpoint.c --- 2.6.13-mm1.old/fs/jbd/checkpoint.c 2005-09-04 23:31:48.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/checkpoint.c 2005-09-05 00:23:28.000000000 +0900 @@ -180,8 +180,10 @@ static void __wait_cp_io(journal_t *jour this_tid = transaction->t_tid; restart: /* Didn't somebody clean up the transaction in the meanwhile */ - if (journal->j_checkpoint_transactions != transaction || - transaction->t_tid != this_tid) + if (list_empty(&journal->j_checkpoint_transactions) || + list_entry(journal->j_checkpoint_transactions.next, transaction_t, + t_cplist) != transaction || + transaction->t_tid != this_tid) return; while (!released && transaction->t_checkpoint_io_list) { jh = transaction->t_checkpoint_io_list; @@ -328,9 +330,10 @@ int log_do_checkpoint(journal_t *journal * and write it. */ spin_lock(&journal->j_list_lock); - if (!journal->j_checkpoint_transactions) + if (list_empty(&journal->j_checkpoint_transactions)) goto out; - transaction = journal->j_checkpoint_transactions; + transaction = list_entry(journal->j_checkpoint_transactions.next, + transaction_t, t_cplist); this_tid = transaction->t_tid; restart: /* @@ -338,8 +341,10 @@ restart: * done (maybe it's a new transaction, but it fell at the same * address). 
*/ - if (journal->j_checkpoint_transactions == transaction || - transaction->t_tid == this_tid) { + if ((!list_empty(&journal->j_checkpoint_transactions) && + list_entry(journal->j_checkpoint_transactions.next, + transaction_t, t_cplist) == transaction) || + transaction->t_tid == this_tid) { int batch_count = 0; struct buffer_head *bhs[NR_BATCH]; struct journal_head *jh; @@ -410,7 +415,7 @@ out: int cleanup_journal_tail(journal_t *journal) { - transaction_t * transaction; + transaction_t * transaction = NULL; tid_t first_tid; unsigned long blocknr, freed; @@ -423,7 +428,9 @@ int cleanup_journal_tail(journal_t *jour spin_lock(&journal->j_state_lock); spin_lock(&journal->j_list_lock); - transaction = journal->j_checkpoint_transactions; + if (!list_empty(&journal->j_checkpoint_transactions)) + transaction = list_entry(journal->j_checkpoint_transactions.next, + transaction_t, t_cplist); if (transaction) { first_tid = transaction->t_tid; blocknr = transaction->t_log_start; @@ -530,18 +537,11 @@ static int journal_clean_one_cp_list(str int __journal_clean_checkpoint_list(journal_t *journal) { - transaction_t *transaction, *last_transaction, *next_transaction; + transaction_t *transaction, *next_transaction; int ret = 0, released; - transaction = journal->j_checkpoint_transactions; - if (!transaction) - goto out; - - last_transaction = transaction->t_cpprev; - next_transaction = transaction; - do { - transaction = next_transaction; - next_transaction = transaction->t_cpnext; + list_for_each_entry_safe(transaction, next_transaction, + &journal->j_checkpoint_transactions, t_cplist) { ret += journal_clean_one_cp_list(transaction-> t_checkpoint_list, &released); if (need_resched()) @@ -557,7 +557,7 @@ int __journal_clean_checkpoint_list(jour t_checkpoint_io_list, &released); if (need_resched()) goto out; - } while (transaction != last_transaction); + } out: return ret; } @@ -673,15 +673,7 @@ void __journal_insert_checkpoint(struct void __journal_drop_transaction(journal_t 
*journal, transaction_t *transaction) { assert_spin_locked(&journal->j_list_lock); - if (transaction->t_cpnext) { - transaction->t_cpnext->t_cpprev = transaction->t_cpprev; - transaction->t_cpprev->t_cpnext = transaction->t_cpnext; - if (journal->j_checkpoint_transactions == transaction) - journal->j_checkpoint_transactions = - transaction->t_cpnext; - if (journal->j_checkpoint_transactions == transaction) - journal->j_checkpoint_transactions = NULL; - } + list_del(&transaction->t_cplist); J_ASSERT(transaction->t_state == T_FINISHED); J_ASSERT(list_empty(&transaction->t_metadata_list)); diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/commit.c 2.6.13-mm1/fs/jbd/commit.c --- 2.6.13-mm1.old/fs/jbd/commit.c 2005-09-04 23:31:48.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/commit.c 2005-09-04 23:41:01.000000000 +0900 @@ -835,20 +835,8 @@ restart_loop: if (commit_transaction->t_checkpoint_list == NULL) { __journal_drop_transaction(journal, commit_transaction); } else { - if (journal->j_checkpoint_transactions == NULL) { - journal->j_checkpoint_transactions = commit_transaction; - commit_transaction->t_cpnext = commit_transaction; - commit_transaction->t_cpprev = commit_transaction; - } else { - commit_transaction->t_cpnext = - journal->j_checkpoint_transactions; - commit_transaction->t_cpprev = - commit_transaction->t_cpnext->t_cpprev; - commit_transaction->t_cpnext->t_cpprev = - commit_transaction; - commit_transaction->t_cpprev->t_cpnext = - commit_transaction; - } + list_add_tail(&commit_transaction->t_cplist, + &journal->j_checkpoint_transactions); } spin_unlock(&journal->j_list_lock); diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/journal.c 2.6.13-mm1/fs/jbd/journal.c --- 2.6.13-mm1.old/fs/jbd/journal.c 2005-09-04 23:31:48.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/journal.c 2005-09-04 23:33:19.000000000 +0900 @@ -653,6 +653,7 @@ static journal_t * journal_init_common ( goto fail; memset(journal, 0, sizeof(*journal)); + 
INIT_LIST_HEAD(&journal->j_checkpoint_transactions); init_waitqueue_head(&journal->j_wait_transaction_locked); init_waitqueue_head(&journal->j_wait_logspace); init_waitqueue_head(&journal->j_wait_done_commit); @@ -1130,7 +1131,7 @@ void journal_destroy(journal_t *journal) /* Totally anal locking here... */ spin_lock(&journal->j_list_lock); - while (journal->j_checkpoint_transactions != NULL) { + while (!list_empty(&journal->j_checkpoint_transactions)) { spin_unlock(&journal->j_list_lock); log_do_checkpoint(journal); spin_lock(&journal->j_list_lock); @@ -1138,7 +1139,7 @@ void journal_destroy(journal_t *journal) J_ASSERT(journal->j_running_transaction == NULL); J_ASSERT(journal->j_committing_transaction == NULL); - J_ASSERT(journal->j_checkpoint_transactions == NULL); + J_ASSERT(list_empty(&journal->j_checkpoint_transactions)); spin_unlock(&journal->j_list_lock); /* We can now mark the journal as empty. */ @@ -1352,7 +1353,7 @@ int journal_flush(journal_t *journal) /* ...and flush everything in the log out to disk. 
*/ spin_lock(&journal->j_list_lock); - while (!err && journal->j_checkpoint_transactions != NULL) { + while (!err && !list_empty(&journal->j_checkpoint_transactions)) { spin_unlock(&journal->j_list_lock); err = log_do_checkpoint(journal); spin_lock(&journal->j_list_lock); @@ -1375,7 +1376,7 @@ int journal_flush(journal_t *journal) J_ASSERT(!journal->j_running_transaction); J_ASSERT(!journal->j_committing_transaction); - J_ASSERT(!journal->j_checkpoint_transactions); + J_ASSERT(list_empty(&journal->j_checkpoint_transactions)); J_ASSERT(journal->j_head == journal->j_tail); J_ASSERT(journal->j_tail_sequence == journal->j_transaction_sequence); spin_unlock(&journal->j_state_lock); diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/transaction.c 2.6.13-mm1/fs/jbd/transaction.c --- 2.6.13-mm1.old/fs/jbd/transaction.c 2005-09-04 23:31:47.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/transaction.c 2005-09-04 23:33:19.000000000 +0900 @@ -59,6 +59,7 @@ get_transaction(journal_t *journal, tran INIT_LIST_HEAD(&transaction->t_io_list); INIT_LIST_HEAD(&transaction->t_shadow_list); INIT_LIST_HEAD(&transaction->t_logctl_list); + INIT_LIST_HEAD(&transaction->t_cplist); /* Set up the commit timer for the new transaction. */ journal->j_commit_timer->expires = transaction->t_expires; diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/include/linux/jbd.h 2.6.13-mm1/include/linux/jbd.h --- 2.6.13-mm1.old/include/linux/jbd.h 2005-09-04 23:32:35.000000000 +0900 +++ 2.6.13-mm1/include/linux/jbd.h 2005-09-04 23:33:15.000000000 +0900 @@ -545,7 +545,7 @@ struct transaction_s * Forward and backward links for the circular list of all transactions * awaiting checkpoint. [j_list_lock] */ - transaction_t *t_cpnext, *t_cpprev; + struct list_head t_cplist; /* * When will the transaction expire (become due for commit), in jiffies? @@ -667,7 +667,7 @@ struct journal_s * ... and a linked circular list of all transactions waiting for * checkpointing. 
[j_list_lock] */ - transaction_t *j_checkpoint_transactions; + struct list_head j_checkpoint_transactions; /* * Wait queue for waiting for a locked transaction to start committing, From mita at miraclelinux.com Fri Sep 9 08:50:07 2005 From: mita at miraclelinux.com (Akinobu Mita) Date: Fri, 9 Sep 2005 17:50:07 +0900 Subject: [-mm PATCH 6/6] jbd: use list_head for a transaction checkpoint list In-Reply-To: <20050909084214.GB14205@miraclelinux.com> References: <20050909084214.GB14205@miraclelinux.com> Message-ID: <20050909085007.GH14205@miraclelinux.com> use struct list_head for doubly-linked list of buffers still remaining to be flushed before an old transaction can be checkpointed. Signed-off-by: Akinobu Mita --- fs/jbd/checkpoint.c | 119 +++++++------------------------------------ fs/jbd/commit.c | 4 - fs/jbd/journal.c | 1 fs/jbd/transaction.c | 2 include/linux/jbd.h | 4 - include/linux/journal-head.h | 2 6 files changed, 30 insertions(+), 102 deletions(-) diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/checkpoint.c 2.6.13-mm1/fs/jbd/checkpoint.c --- 2.6.13-mm1.old/fs/jbd/checkpoint.c 2005-09-05 03:21:20.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/checkpoint.c 2005-09-05 03:21:33.000000000 +0900 @@ -22,71 +22,7 @@ #include #include #include - -/* - * Unlink a buffer from a transaction checkpoint list. - * - * Called with j_list_lock held. - */ - -static void __buffer_unlink_first(struct journal_head *jh) -{ - transaction_t *transaction; - - transaction = jh->b_cp_transaction; - - jh->b_cpnext->b_cpprev = jh->b_cpprev; - jh->b_cpprev->b_cpnext = jh->b_cpnext; - if (transaction->t_checkpoint_list == jh) { - transaction->t_checkpoint_list = jh->b_cpnext; - if (transaction->t_checkpoint_list == jh) - transaction->t_checkpoint_list = NULL; - } -} - -/* - * Unlink a buffer from a transaction checkpoint(io) list. - * - * Called with j_list_lock held. 
- */ - -static inline void __buffer_unlink(struct journal_head *jh) -{ - transaction_t *transaction; - - transaction = jh->b_cp_transaction; - - __buffer_unlink_first(jh); - if (transaction->t_checkpoint_io_list == jh) { - transaction->t_checkpoint_io_list = jh->b_cpnext; - if (transaction->t_checkpoint_io_list == jh) - transaction->t_checkpoint_io_list = NULL; - } -} - -/* - * Move a buffer from the checkpoint list to the checkpoint io list - * - * Called with j_list_lock held - */ - -static inline void __buffer_relink_io(struct journal_head *jh) -{ - transaction_t *transaction; - - transaction = jh->b_cp_transaction; - __buffer_unlink_first(jh); - - if (!transaction->t_checkpoint_io_list) { - jh->b_cpnext = jh->b_cpprev = jh; - } else { - jh->b_cpnext = transaction->t_checkpoint_io_list; - jh->b_cpprev = transaction->t_checkpoint_io_list->b_cpprev; - jh->b_cpprev->b_cpnext = jh; - jh->b_cpnext->b_cpprev = jh; - } - transaction->t_checkpoint_io_list = jh; -} +#include /* * Try to release a checkpointed buffer from its transaction. 
@@ -185,8 +121,9 @@ restart: t_cplist) != transaction || transaction->t_tid != this_tid) return; - while (!released && transaction->t_checkpoint_io_list) { - jh = transaction->t_checkpoint_io_list; + while (!released && !list_empty(&transaction->t_checkpoint_io_list)) { + jh = list_entry(transaction->t_checkpoint_io_list.next, + struct journal_head, b_cplist); bh = jh2bh(jh); if (!jbd_trylock_bh_state(bh)) { jbd_sync_bh(journal, bh); @@ -288,7 +225,9 @@ static int __process_buffer(journal_t *j J_ASSERT_BH(bh, !buffer_jwrite(bh)); set_buffer_jwrite(bh); bhs[*batch_count] = bh; - __buffer_relink_io(jh); + list_del(&jh->b_cplist); + list_add(&jh->b_cplist, + &jh->b_cp_transaction->t_checkpoint_io_list); jbd_unlock_bh_state(bh); (*batch_count)++; if (*batch_count == NR_BATCH) { @@ -350,10 +289,11 @@ restart: struct journal_head *jh; int retry = 0; - while (!retry && transaction->t_checkpoint_list) { + while (!retry && !list_empty(&transaction->t_checkpoint_list)) { struct buffer_head *bh; - jh = transaction->t_checkpoint_list; + jh = list_entry(transaction->t_checkpoint_list.next, + struct journal_head, b_cplist); bh = jh2bh(jh); if (!jbd_trylock_bh_state(bh)) { jbd_sync_bh(journal, bh); @@ -488,20 +428,14 @@ int cleanup_journal_tail(journal_t *jour * Returns number of bufers reaped (for debug) */ -static int journal_clean_one_cp_list(struct journal_head *jh, int *released) +static int journal_clean_one_cp_list(struct list_head *head, int *released) { - struct journal_head *last_jh; - struct journal_head *next_jh = jh; + struct journal_head *jh, *next_jh; int ret, freed = 0; *released = 0; - if (!jh) - return 0; - last_jh = jh->b_cpprev; - do { - jh = next_jh; - next_jh = jh->b_cpnext; + list_for_each_entry_safe(jh, next_jh, head, b_cplist) { /* Use trylock because of the ranking */ if (jbd_trylock_bh_state(jh2bh(jh))) { ret = __try_to_free_cp_buf(jh); @@ -520,7 +454,7 @@ static int journal_clean_one_cp_list(str */ if (need_resched()) return freed; - } while (jh != 
last_jh); + } return freed; } @@ -542,7 +476,7 @@ int __journal_clean_checkpoint_list(jour list_for_each_entry_safe(transaction, next_transaction, &journal->j_checkpoint_transactions, t_cplist) { - ret += journal_clean_one_cp_list(transaction-> + ret += journal_clean_one_cp_list(&transaction-> t_checkpoint_list, &released); if (need_resched()) goto out; @@ -553,7 +487,7 @@ int __journal_clean_checkpoint_list(jour * t_checkpoint_list with removing the buffer from the list as * we can possibly see not yet submitted buffers on io_list */ - ret += journal_clean_one_cp_list(transaction-> + ret += journal_clean_one_cp_list(&transaction-> t_checkpoint_io_list, &released); if (need_resched()) goto out; @@ -596,11 +530,11 @@ int __journal_remove_checkpoint(struct j } journal = transaction->t_journal; - __buffer_unlink(jh); + list_del(&jh->b_cplist); jh->b_cp_transaction = NULL; - if (transaction->t_checkpoint_list != NULL || - transaction->t_checkpoint_io_list != NULL) + if (!list_empty(&transaction->t_checkpoint_list) || + !list_empty(&transaction->t_checkpoint_io_list)) goto out; JBUFFER_TRACE(jh, "transaction has no more buffers"); @@ -648,16 +582,7 @@ void __journal_insert_checkpoint(struct J_ASSERT_JH(jh, jh->b_cp_transaction == NULL); jh->b_cp_transaction = transaction; - - if (!transaction->t_checkpoint_list) { - jh->b_cpnext = jh->b_cpprev = jh; - } else { - jh->b_cpnext = transaction->t_checkpoint_list; - jh->b_cpprev = transaction->t_checkpoint_list->b_cpprev; - jh->b_cpprev->b_cpnext = jh; - jh->b_cpnext->b_cpprev = jh; - } - transaction->t_checkpoint_list = jh; + list_add(&jh->b_cplist, &transaction->t_checkpoint_list); } /* @@ -682,8 +607,8 @@ void __journal_drop_transaction(journal_ J_ASSERT(list_empty(&transaction->t_io_list)); J_ASSERT(list_empty(&transaction->t_shadow_list)); J_ASSERT(list_empty(&transaction->t_logctl_list)); - J_ASSERT(transaction->t_checkpoint_list == NULL); - J_ASSERT(transaction->t_checkpoint_io_list == NULL); + 
J_ASSERT(list_empty(&transaction->t_checkpoint_list)); + J_ASSERT(list_empty(&transaction->t_checkpoint_io_list)); J_ASSERT(transaction->t_updates == 0); J_ASSERT(journal->j_committing_transaction != transaction); J_ASSERT(journal->j_running_transaction != transaction); diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/commit.c 2.6.13-mm1/fs/jbd/commit.c --- 2.6.13-mm1.old/fs/jbd/commit.c 2005-09-05 03:21:20.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/commit.c 2005-09-05 03:21:33.000000000 +0900 @@ -714,7 +714,7 @@ wait_for_iobuf: J_ASSERT(list_empty(&commit_transaction->t_syncdata_list)); J_ASSERT(list_empty(&commit_transaction->t_metadata_list)); - J_ASSERT(commit_transaction->t_checkpoint_list == NULL); + J_ASSERT(list_empty(&commit_transaction->t_checkpoint_list)); J_ASSERT(list_empty(&commit_transaction->t_io_list)); J_ASSERT(list_empty(&commit_transaction->t_shadow_list)); J_ASSERT(list_empty(&commit_transaction->t_logctl_list)); @@ -832,7 +832,7 @@ restart_loop: journal->j_committing_transaction = NULL; spin_unlock(&journal->j_state_lock); - if (commit_transaction->t_checkpoint_list == NULL) { + if (list_empty(&commit_transaction->t_checkpoint_list)) { __journal_drop_transaction(journal, commit_transaction); } else { list_add_tail(&commit_transaction->t_cplist, diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/journal.c 2.6.13-mm1/fs/jbd/journal.c --- 2.6.13-mm1.old/fs/jbd/journal.c 2005-09-05 03:21:20.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/journal.c 2005-09-05 03:21:36.000000000 +0900 @@ -1763,6 +1763,7 @@ repeat: bh->b_private = jh; jh->b_bh = bh; INIT_LIST_HEAD(&jh->b_list); + INIT_LIST_HEAD(&jh->b_cplist); get_bh(bh); BUFFER_TRACE(bh, "added journal_head"); } diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/fs/jbd/transaction.c 2.6.13-mm1/fs/jbd/transaction.c --- 2.6.13-mm1.old/fs/jbd/transaction.c 2005-09-05 03:21:20.000000000 +0900 +++ 2.6.13-mm1/fs/jbd/transaction.c 2005-09-05 03:21:36.000000000 +0900 @@ 
-60,6 +60,8 @@ get_transaction(journal_t *journal, tran INIT_LIST_HEAD(&transaction->t_shadow_list); INIT_LIST_HEAD(&transaction->t_logctl_list); INIT_LIST_HEAD(&transaction->t_cplist); + INIT_LIST_HEAD(&transaction->t_checkpoint_list); + INIT_LIST_HEAD(&transaction->t_checkpoint_io_list); /* Set up the commit timer for the new transaction. */ journal->j_commit_timer->expires = transaction->t_expires; diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/include/linux/jbd.h 2.6.13-mm1/include/linux/jbd.h --- 2.6.13-mm1.old/include/linux/jbd.h 2005-09-05 03:21:20.000000000 +0900 +++ 2.6.13-mm1/include/linux/jbd.h 2005-09-05 03:21:33.000000000 +0900 @@ -497,13 +497,13 @@ struct transaction_s * Doubly-linked circular list of all buffers still to be flushed before * this transaction can be checkpointed. [j_list_lock] */ - struct journal_head *t_checkpoint_list; + struct list_head t_checkpoint_list; /* * Doubly-linked circular list of all buffers submitted for IO while * checkpointing. [j_list_lock] */ - struct journal_head *t_checkpoint_io_list; + struct list_head t_checkpoint_io_list; /* * Doubly-linked circular list of temporary buffers currently undergoing diff -X 2.6.13-mm1/Documentation/dontdiff -Nurp 2.6.13-mm1.old/include/linux/journal-head.h 2.6.13-mm1/include/linux/journal-head.h --- 2.6.13-mm1.old/include/linux/journal-head.h 2005-09-05 03:20:41.000000000 +0900 +++ 2.6.13-mm1/include/linux/journal-head.h 2005-09-05 03:21:33.000000000 +0900 @@ -86,7 +86,7 @@ struct journal_head { * before an old transaction can be checkpointed. 
 	 * [j_list_lock]
 	 */
-	struct journal_head *b_cpnext, *b_cpprev;
+	struct list_head b_cplist;
 };

 #endif /* JOURNAL_HEAD_H_INCLUDED */

From akpm at osdl.org Fri Sep 9 09:15:22 2005
From: akpm at osdl.org (Andrew Morton)
Date: Fri, 9 Sep 2005 02:15:22 -0700
Subject: [PATCH 0/6] jbd cleanup
In-Reply-To: <20050909084214.GB14205@miraclelinux.com>
References: <20050909084214.GB14205@miraclelinux.com>
Message-ID: <20050909021522.1a271e4b.akpm@osdl.org>

Akinobu Mita wrote:
>
> The following 6 patches cleanup the jbd code and kill about 200 lines.
>

Thanks, but I'm not inclined to apply them.

a) Maybe 70-80% of the Linux world uses this filesystem. We need to be
   very cautious in making changes to it.

b) A relatively large number of people are carrying quite large
   out-of-tree patches, some of which they're hoping to merge sometime.
   Admittedly more against ext3 than JBD, but there is potential here to
   cause those people trouble.

Plus the switch to list_heads in journal_s has some impact on type safety
and debuggability - I considered doing it years ago but decided not to
because I found I _used_ those pointers fairly commonly in development.
list_heads are a bit of a pain in gdb (kgdb and kernel core dumps), for
example.

From tytso at mit.edu Fri Sep 9 18:16:49 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Fri, 9 Sep 2005 14:16:49 -0400
Subject: [PATCH 1/6] jbd: remove duplicated debug print
In-Reply-To: <20050909084342.GC14205@miraclelinux.com>
References: <20050909084214.GB14205@miraclelinux.com> <20050909084342.GC14205@miraclelinux.com>
Message-ID: <20050909181649.GC24228@thunk.org>

On Fri, Sep 09, 2005 at 05:43:42PM +0900, Akinobu Mita wrote:
> remove duplicated debug print

> -	jbd_debug(3, "JBD: commit phase 2\n");
> -

If you're going to do this, please renumber the rest of the "commit
phase n" messages. Or the debugging messages will look very funny.
						- Ted

From mita at miraclelinux.com Sat Sep 10 14:36:04 2005
From: mita at miraclelinux.com (Akinobu Mita)
Date: Sat, 10 Sep 2005 23:36:04 +0900
Subject: [PATCH 1/6] jbd: remove duplicated debug print
In-Reply-To: <20050909181649.GC24228@thunk.org>
References: <20050909084214.GB14205@miraclelinux.com> <20050909084342.GC14205@miraclelinux.com> <20050909181649.GC24228@thunk.org>
Message-ID: <20050910143604.GA7593@miraclelinux.com>

On Fri, Sep 09, 2005 at 02:16:49PM -0400, Theodore Ts'o wrote:
> On Fri, Sep 09, 2005 at 05:43:42PM +0900, Akinobu Mita wrote:
> > remove duplicated debug print
>
> > -	jbd_debug(3, "JBD: commit phase 2\n");
> > -
>
> If you're going to do this, please renumber the rest of the "commit
> phase n" messages. Or the debugging messages will look very funny.

The second duplicated "commit phase 2" only does:

	J_ASSERT (commit_transaction->t_sync_datalist == NULL);

So I thought it might have been accidentally inserted.

diff -U 9 :

--- ./fs/jbd/commit.c.orig	2005-09-10 22:09:05.000000000 +0900
+++ ./fs/jbd/commit.c	2005-09-10 22:09:25.000000000 +0900
@@ -419,20 +419,18 @@ write_out_data:
 		cond_resched_lock(&journal->j_list_lock);
 	}
 	spin_unlock(&journal->j_list_lock);

 	if (err)
 		__journal_abort_hard(journal);

 	journal_write_revoke_records(journal, commit_transaction);

-	jbd_debug(3, "JBD: commit phase 2\n");
-
 	/*
 	 * If we found any dirty or locked buffers, then we should have
 	 * looped back up to the write_out_data label. If there weren't
 	 * any then journal_clean_data_list should have wiped the list
 	 * clean by now, so check that it is in fact empty.
 	 */
 	J_ASSERT (commit_transaction->t_sync_datalist == NULL);

 	jbd_debug (3, "JBD: commit phase 3\n");

From mita at miraclelinux.com Sat Sep 10 14:55:25 2005
From: mita at miraclelinux.com (Akinobu Mita)
Date: Sat, 10 Sep 2005 23:55:25 +0900
Subject: [PATCH 0/6] jbd cleanup
In-Reply-To: <20050909021522.1a271e4b.akpm@osdl.org>
References: <20050909084214.GB14205@miraclelinux.com> <20050909021522.1a271e4b.akpm@osdl.org>
Message-ID: <20050910145525.GB7593@miraclelinux.com>

On Fri, Sep 09, 2005 at 02:15:22AM -0700, Andrew Morton wrote:
> Akinobu Mita wrote:
> >
> > The following 6 patches cleanup the jbd code and kill about 200 lines.
> >
>
> Thanks, but I'm not inclined to apply them.
>
> a) Maybe 70-80% of the Linux world uses this filesystem. We need to be
>    very cautious in making changes to it.

And we need many eyeballs.

(I've tried to understand how the jbd works several times.
But I always failed.)

> b) A relatively large number of people are carrying quite large
>    out-of-tree patches, some of which they're hoping to merge sometime.
>    Admittedly more against ext3 than JBD, but there is potential here to
>    cause those people trouble.
>
> Plus the switch to list_heads in journal_s has some impact on type safety
> and debuggability - I considered doing it years ago but decided not to
> because I found I _used_ those pointers fairly commonly in development.
> list_heads are a bit of a pain in gdb (kgdb and kernel core dumps), for
> example.

About the debuggability of list_heads, how about adding the kind of
the following gdb macros in .gdbinit?

---

define list_entry
	set $ptr=$arg0
	p ($arg1 *)((char *)$ptr - (size_t) &(($arg1 *)0)->$arg2)
end

define list_entry_s
	set $ptr=$arg0
	p (struct $arg1 *)((char *)$ptr - (size_t) &((struct $arg1 *)0)->$arg2)
end

define to_journal_head
	list_entry_s $arg0 journal_head b_list
end

From akpm at osdl.org Sat Sep 10 21:58:48 2005
From: akpm at osdl.org (Andrew Morton)
Date: Sat, 10 Sep 2005 14:58:48 -0700
Subject: [PATCH 0/6] jbd cleanup
In-Reply-To: <20050910145525.GB7593@miraclelinux.com>
References: <20050909084214.GB14205@miraclelinux.com> <20050909021522.1a271e4b.akpm@osdl.org> <20050910145525.GB7593@miraclelinux.com>
Message-ID: <20050910145848.51881e61.akpm@osdl.org>

Akinobu Mita wrote:
>
> On Fri, Sep 09, 2005 at 02:15:22AM -0700, Andrew Morton wrote:
> > Akinobu Mita wrote:
> > >
> > > The following 6 patches cleanup the jbd code and kill about 200 lines.
> > >
> >
> > Thanks, but I'm not inclined to apply them.
> >
> > a) Maybe 70-80% of the Linux world uses this filesystem. We need to be
> >    very cautious in making changes to it.
>
> And we need many eyeballs.

True. And the only way to really learn code is to make changes to it.

> (I've tried to understand how the jbd works several times.
> But I always failed.)

It's very hard to reverse engineer the high-level design concepts from the
implementation. And the design concepts in JBD are really complex, which
is a problem for us.

When I first had to learn the thing 4-5 years back I sat down for a solid
week and wrote a 40-odd page how-it-works document for myself, just to
force it into my head. It was probably about 50% accurate, but it was a
useful exercise.

> About the debuggability of list_heads, how about adding the kind of
> the following gdb macros in .gdbinit?
>
> ---
>
> define list_entry
> 	set $ptr=$arg0
> 	p ($arg1 *)((char *)$ptr - (size_t) &(($arg1 *)0)->$arg2)
> end
>
> define list_entry_s
> 	set $ptr=$arg0
> 	p (struct $arg1 *)((char *)$ptr - (size_t) &((struct $arg1 *)0)->$arg2)
> end
>
> define to_journal_head
> 	list_entry_s $arg0 journal_head b_list
> end

Here's mine ;)

# list_entry list type member
define list_entry
	set $off = (int)&(((struct $arg1 *)0)->$arg2)
	set $addr = (int)$arg0
	set $res = $addr - $off
	printf "0x%x\n", (struct $arg1 *)$res
end

From myLC at gmx.net Sat Sep 17 15:27:18 2005
From: myLC at gmx.net (myLC at gmx.net)
Date: Sat, 17 Sep 2005 17:27:18 +0200
Subject: turning off journaling on the fly?
Message-ID: <432C35D6.9070402@gmx.net>

Dear penguin lovers, =)

I'm running Linux (2.6) on a satellite receiver with
harddrive. The latter is formatted in ext3.
So far, everything works fine.

Now here's the small problem:
The receiver is in the same room I sleep in and when it
records at night time I can hear the journaling going on
(heads clicking - even though the HD is set to silent mode
via hdparm). This is surely due to the journaling as during
playback the heads remain quiet.

Is there a way to disable journaling on the fly (some option
in /sys)?
Or can I remount the harddisk from ext3 to ext2 on the fly -
and does this work when it's being written to?
And - last but not least - would the solution (if there is
any) be "riskless"?

Thank you very much for any help!

myLC at gmx.net

PS.: I'm not subscribed to the mailing list, thus I can only
read direct replies.

From evilninja at gmx.net Mon Sep 19 13:35:15 2005
From: evilninja at gmx.net (evilninja)
Date: Mon, 19 Sep 2005 15:35:15 +0200
Subject: turning off journaling on the fly?
In-Reply-To: <432C35D6.9070402@gmx.net>
References: <432C35D6.9070402@gmx.net>
Message-ID: <432EBE93.1060509@gmx.net>

myLC at gmx.net wrote:
> Is there a way to disable journaling on the fly (some option
> in /sys)?

i'm not aware of such a switch, but two things come into my mind:

1) there is a mount option "commit":

   commit=nrsec
          Sync all data and metadata every nrsec seconds. The default
          value is 5 seconds. Zero means default.

2) the laptop-mode module [1]

both should reduce disk-activity by 1) enlarging the commit-interval and
2) by grouping write activities.

you can turn off the journal with tune2fs(8).

hth,
Christian.

[1] http://www.xs4all.nl/~bsamwel/laptop_mode/
--
BOFH excuse #78:

Yes, yes, its called a design limitation

From camilo at mesias.co.uk Tue Sep 20 11:47:41 2005
From: camilo at mesias.co.uk (Cam)
Date: Tue, 20 Sep 2005 12:47:41 +0100
Subject: ext3 incompatability between linux 2.4/ppc and linux 2.6/x86
Message-ID: <432FF6DD.5000803@mesias.co.uk>

Hi,

I'm using ext3 filesystems in embedded devices (storage is on 512Mb or
1Gb CF cards). A typical development cycle would see the filesystem
created on the desktop PC running linux 2.4 (eg. RedHat 9). The CF card
would be installed in the hardware and linux 2.4 (eg. Montavista Pro
3.1, on PPC) would boot from the CF.

Recently I tried a linux 2.6 desktop (CentOS) for the same task and
found problems. Specifically the embedded device won't boot from the CF
anymore. Since we use several partitions it's possible to boot from an
old partition. We can then mount the new partition but attempts to write
to it fail and the partition becomes RO mounted.

Here are the logs associated with those operations:

boot:

kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 212k init
?attempt to access beyond end of device
03:02: rw=0, want=841835629, limit=151200
attempt to access beyond end of device
03:02: rw=0, want=841835629, limit=151200
Kernel panic: No init found. Try passing init= option to kernel.
<0>Rebooting in 180 seconds..

mount/write:

e2fsck 1.35 (28-Feb-2004)
/dev/hda2 has gone 36663 days without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/hda2: 2297/37848 files (1.9% non-contiguous), 101563/151200 blocks
...
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.19, 19 August 2002 on ide0(3,2), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/hda2 on /file-system/root2 type ext3 (rw,noatime,errors=remount-ro)
...
# rm -rf /file-system/root2/*
EXT3-fs error (device ide0(3,2)): ext3_free_blocks: Freeing blocks not in datazone - block = 1752392034, count = 1
Aborting journal on device ide0(3,2).
Remounting filesystem read-only
ext3_reserve_inode_write: aborting transaction: Journal has aborted in __ext3_jdEXT3-fs error (device ide0(3,2)) in ext3_truncate: Journal has aborted
ext3_reserve_inode_write: aborting transaction: Journal has aborted in __ext3_jdEXT3-fs error (device ide0(3,2)) in ext3_orphan_del: Journal has aborted
ext3_reserve_inode_write: aborting transaction: Journal has aborted in __ext3_jdEXT3-fs error (device ide0(3,2)) in ext3_delete_inode: Journal has aborted
rm: cannot unlink `/file-system/root2/bin/chroot': Read-only file system
rm: cannot unlink `/file-system/root2/bin/run-parts': Read-only file system
rm: cannot unlink `/file-system/root2/bin/tempfile': Read-only file system

Looking at the versions, on the 2.4 desktop I have e2fsprogs-1.32-6, on
embedded I have e2fsprogs-1.27-1. On the 2.6 desktop it's e2fsprogs-1.35-12.

I built e2fsprogs-1.38 for the desktop and the result was the same.

I used dumpe2fs on the working and non-working filesystems and found
that the newer FS has different features:

< Filesystem features: has_journal filetype sparse_super
> Filesystem features: has_journal resize_inode filetype sparse_super

After writing to a new FS on the desktop a further feature is added,

< Filesystem features: has_journal resize_inode filetype sparse_super
> Filesystem features: has_journal ext_attr resize_inode filetype sparse_super

I'm not convinced the features are relevant though because if I mkfs
with -O to restrict the features, the result is the same. I wonder if it
could be an endianness issue?

What should I do to investigate this further? Are there known
incompatibilities with ext3 between different kernels? And are there any
tricks I can use in 2.6 to make a 2.4 compatible filesystem?

Thanks in advance for any help,

-Cam

--
camilo at mesias.co.uk <--

From adilger at clusterfs.com Tue Sep 20 13:26:18 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Tue, 20 Sep 2005 07:26:18 -0600
Subject: ext3 incompatability between linux 2.4/ppc and linux 2.6/x86
In-Reply-To: <432FF6DD.5000803@mesias.co.uk>
References: <432FF6DD.5000803@mesias.co.uk>
Message-ID: <20050920132618.GI12946@schatzie.adilger.int>

On Sep 20, 2005 12:47 +0100, Cam wrote:
> Looking at the versions, on the 2.4 desktop I have e2fsprogs-1.32-6, on
> embedded I have e2fsprogs-1.27-1. On the 2.6 desktop it's e2fsprogs-1.35-12.
>
> I built e2fsprogs-1.38 for the desktop and the result was the same.
>
> I used dumpe2fs on the working and non-working filesystems and found
> that the newer FS has different features:
>
> < Filesystem features: has_journal filetype sparse_super
> > Filesystem features: has_journal resize_inode filetype sparse_super

The resize_inode feature is relatively new, but _should_ be harmless for
a kernel that doesn't understand it (it is just a file in the filesystem).
That said, it is quite unlikely that you will ever need this for embedded
systems, so you can turn it off at mke2fs time or afterward with tune2fs
with "-O ^resize_inode".

> After writing to a new FS on the desktop a further feature is added,
>
> < Filesystem features: has_journal resize_inode filetype sparse_super
> > Filesystem features: has_journal ext_attr resize_inode filetype
> sparse_super

The ext_attr feature is probably from selinux. This can be a problem for
older kernels (quite sadly, as there is a "feature" which slipped in under
the radar). The problem is that selinux added support for EAs on symlinks,
but this confuses older kernels into thinking that a fast symlink (stored
in the inode) has an external block and is (wrongly) considered a slow
symlink. The older kernel then tries to decode the EA as a symlink. I
don't know if this is causing your problem though.

I'm not sure if there is some way to prevent selinux from tagging all
of the files in the filesystem or not (e.g. mount option or other).

There is a trivial change to the ext3 code to fix this for your embedded
platform - add ext3_inode_is_fast_symlink() to check for i_file_acl and
change ext3_read_inode() to use this instead of just checking i_blocks.

> I'm not convinced the features are relevant though because if I mkfs
> with -O to restrict the features, the result is the same. I wonder if it
> could be an endianness issue?

Note that in newer e2fsprogs you need to use "mke2fs -O none -O {features}"
to clear the default feature set. Also, it isn't clear whether this will
prevent selinux from enabling the ext_attr feature.

I would initially suspect an endian issue, but none of the values printed
in the error messages appear to be byte-swapped values. They instead look
like ASCII values (e.g. "md-2" and "bash").

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
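
A minimal sketch, not part of the original thread, of the check behind
the last observation above: unpack each suspicious number from the kernel
messages into its four bytes, least significant first (ext3 keeps its
on-disk fields little-endian), and see whether they are printable ASCII.
The helper name blocknr_to_ascii and the BLOCKNR_DEMO_MAIN guard are made
up for this illustration.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Unpack a 32-bit value into its four bytes, least significant byte
 * first, replacing anything that is not printable ASCII with '.'.
 * out must have room for 4 characters plus the terminating NUL.
 */
static void blocknr_to_ascii(uint32_t value, char out[5])
{
	for (int i = 0; i < 4; i++) {
		unsigned char c = (value >> (8 * i)) & 0xff;

		out[i] = (c >= 0x20 && c < 0x7f) ? (char)c : '.';
	}
	out[4] = '\0';
}

#ifdef BLOCKNR_DEMO_MAIN
int main(void)
{
	/* Values copied from the kernel messages quoted in this thread. */
	const uint32_t samples[] = { 841835629u, 1752392034u };
	char text[5];

	for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
		blocknr_to_ascii(samples[i], text);
		printf("%lu -> \"%s\"\n", (unsigned long)samples[i], text);
	}
	return 0;
}
#endif
```

Built with -DBLOCKNR_DEMO_MAIN, this prints "md-2" for 841835629 and
"bash" for 1752392034, which is what one would expect if text was read
where a block number should have been, rather than a byte-order mix-up.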
From camilo at mesias.co.uk  Tue Sep 20 14:26:02 2005
From: camilo at mesias.co.uk (Cam)
Date: Tue, 20 Sep 2005 15:26:02 +0100
Subject: ext3 incompatability between linux 2.4/ppc and linux 2.6/x86
In-Reply-To: <20050920132618.GI12946@schatzie.adilger.int>
References: <432FF6DD.5000803@mesias.co.uk> <20050920132618.GI12946@schatzie.adilger.int>
Message-ID: <43301BFA.4030701@mesias.co.uk>

Andreas

Thanks for the prompt and informative reply. It looks like this is a
'known fault'.

> The ext_attr feature is probably from selinux.
[...]
> I'm not sure if there is some way to prevent selinux from tagging all
> of the files in the filesystem or not (e.g. mount option or other).

Strangely, googling for "selinux mount ext_attr disable" gave a bugzilla
entry as the first result:

https://bugzilla.redhat.com/bugzilla/long_list.cgi?buglist=137068

> There is a trivial change to the ext3 code to fix this for your embedded
> platform - add ext3_inode_is_fast_symlink() to check for i_file_acl and
> change ext3_read_inode() to use this instead of just checking i_blocks).

Unfortunately a change to the embedded system in the field is
unattractive at the moment.

> Note that in newer e2fsprogs you need to use "mke2fs -O none -O {features}"
> to clear the default feature set.  Also, it isn't clear whether this will
> prevent selinux from enabling the ext_attr feature.

It doesn't, although disabling selinux is effective. Using your mke2fs
invocation and disabling selinux is a good workaround.

> none of the values printed
> in the error messages appear to be byte-swapped values.  They instead look
> like ASCII values (e.g. "md-2" and "bash").

I see your point. I missed that but will check in future!

Thanks again,

-Cam

--
camilo at mesias.co.uk <--

From tytso at mit.edu  Tue Sep 20 21:25:55 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Tue, 20 Sep 2005 17:25:55 -0400
Subject: turning off journaling on the fly?
In-Reply-To: <432C35D6.9070402@gmx.net>
References: <432C35D6.9070402@gmx.net>
Message-ID: <20050920212555.GC6179@thunk.org>

On Sat, Sep 17, 2005 at 05:27:18PM +0200, myLC at gmx.net wrote:
> Dear penguin lovers, =)
>
> I'm running Linux (2.6) on a satellite receiver with
> harddrive. The latter is formatted in ext3.
> So far, everything works fine.
>
> Now here's the small problem:
> The receiver is in the same room I sleep in and when it
> records at night time I can hear the journaling going on
> (heads clicking - even though the HD is set to silent mode
> via hdparm). This is surely due to the journaling as during
> playback the heads remain quiet.

Something on your system must be causing writes to the filesystem;
turning off journaling might lower the total number of writes, but it
won't make this problem go away altogether.  I'd check /var/log to see
what might be causing log messages (the most likely cause) and see if
you can disable or lower the syslog threshold so that they don't get
written to disk.

- Ted

From eric at lammerts.org  Thu Sep 22 04:53:44 2005
From: eric at lammerts.org (Eric Lammerts)
Date: Thu, 22 Sep 2005 00:53:44 -0400 (EDT)
Subject: repeated crashes
Message-ID:

Hello,
I've got a problem that is not solved after an e2fsck. What happens
is that the kernel (vanilla 2.6.12) does this:

journal_bmap: journal block not found at offset 1036 on hda6
Aborting journal on device hda6.
ext3_abort called.

The filesystem is mounted with errors=panic, so the system reboots. At
boot-up an e2fsck is run on /dev/hda6. Sometimes it finds errors,
sometimes not. Example:

e2fsck 1.35 (28-Feb-2004)
data: recovering journal
data contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong for group #73 (26, counted=0).
Fix? yes
Free blocks count wrong for group #74 (5071, counted=667).
Fix? yes
Free blocks count wrong for group #75 (3585, counted=2844).
Fix? yes
Free blocks count wrong (1503376, counted=1498205).
Fix? yes
data: ***** FILE SYSTEM WAS MODIFIED *****
data: 1960/1343488 files (34.2% non-contiguous), 1186650/2684855 blocks

But soon after that, the same kernel message happens again.
I've also tried a newer e2fsck, from the e2fsck-static 1.38-2 Debian
package, but that one didn't solve the problem either.

Dumpe2fs output:

# dumpe2fs -h /dev/hda6
dumpe2fs 1.35 (28-Feb-2004)
Filesystem volume name:   data
Last mounted on:
Filesystem UUID:          beb02481-d5a9-40b3-8d25-ff412629b14b
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal filetype needs_recovery sparse_super
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              1343488
Block count:              2684855
Reserved block count:     134242
Free blocks:              1550359
Free inodes:              1341562
First block:              0
Block size:               4096
Fragment size:            4096
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         16384
Inode blocks per group:   512
Filesystem created:       Wed Jan 2 22:35:26 2002
Last mount time:          Thu Sep 22 00:16:41 2005
Last write time:          Thu Sep 22 00:16:41 2005
Mount count:              1
Maximum mount count:      -1
Last checked:             Thu Sep 22 00:16:40 2005
Check interval:           0 ()
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      a1a3ccb8-023e-41ec-8af1-b2221c8da6b4
Journal backup:           inode blocks

Then when I look at the journal inode:

# debugfs /dev/hda6
debugfs 1.35 (28-Feb-2004)
debugfs: stat <8>
Inode: 8 Type: regular Mode: 0600 Flags: 0x0 Generation: 0
User: 0 Group: 0 Size: 33554432
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 8304
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x3c33d186 -- Wed Jan 2 22:35:34 2002
atime: 0x00000000 -- Wed Dec 31 19:00:00 1969
mtime: 0x3c33d186 -- Wed Jan 2 22:35:34 2002
BLOCKS:
(0-11):521-532, (IND):533, (12-1035):534-1557, (DIND):1558
TOTAL: 1038
debugfs: bmap <8> 1035
1557
debugfs: bmap <8> 1036
0

It seems a lot of blocks are not allocated! That is wrong, isn't it?
Shouldn't e2fsck repair this then?

Eric

From adilger at clusterfs.com  Thu Sep 22 05:26:54 2005
From: adilger at clusterfs.com (Andreas Dilger)
Date: Wed, 21 Sep 2005 23:26:54 -0600
Subject: repeated crashes
In-Reply-To:
References:
Message-ID: <20050922052654.GJ6289@schatzie.adilger.int>

On Sep 22, 2005  00:53 -0400, Eric Lammerts wrote:
> journal_bmap: journal block not found at offset 1036 on hda6
> Aborting journal on device hda6.
> ext3_abort called.
>
> The filesystem is mounted with errors=panic, so the system reboots. At
> boot-up an e2fsck is run on /dev/hda6. Sometimes it finds errors,
> sometimes not. Example:
>
> e2fsck 1.35 (28-Feb-2004)
> data: recovering journal
> data contains a file system with errors, check forced.
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Free blocks count wrong for group #73 (26, counted=0).
> Fix? yes
> Free blocks count wrong for group #74 (5071, counted=667).
> Fix? yes
> Free blocks count wrong for group #75 (3585, counted=2844).
> Fix? yes
> Free blocks count wrong (1503376, counted=1498205).
> Fix? yes
> data: ***** FILE SYSTEM WAS MODIFIED *****
> data: 1960/1343488 files (34.2% non-contiguous), 1186650/2684855
> blocks
>
> But soon after that, the same kernel message happens again.
> I've also tried a newer e2fsck, from the e2fsck-static 1.38-2 Debian
> package, but that one didn't solve the problem either.

This sounds a LOT like your disk is going bad.
Having e2fsck fix problems like this, then immediately getting errors
again, is something I've seen in the past, and it turned out that the
disk was flaky.  Try running "badblocks" on the disk in non-destructive
write mode and see what it finds.  I'd strongly recommend a backup at
this point if you don't already have it.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

From maillists at hosttuls.com  Fri Sep 23 21:43:29 2005
From: maillists at hosttuls.com (Brandon Evans)
Date: Fri, 23 Sep 2005 14:43:29 -0700
Subject: 17G File size limit?
Message-ID: <43347701.50603@hosttuls.com>

Hi everyone,
This is a strange problem I have been having. I'm not sure where the
problem is, so I figured I'd start here.

I was having problems with Bacula stopping on 17Gig Volume sizes, so I
decided to try to just dd a 50 gig file. Sure enough, once the file hit
17 gigs dd stopped and spit out an error:

(pandora bacula)# dd if=/dev/zero of=bigfile bs=1M count=50000
File size limit exceeded
(pandora bacula)#

(pandora bacula)# ll
total 20334813
-rw-r--r--  1 root root 17247252480 Sep 23 00:44 bigfile
-rw-r-----  1 root root   302323821 Sep 23 01:10 Default-0001
-rw-r-----  1 root root   156637059 Sep 18 01:08 Diff-wi0001
-rw-r-----  1 root root    46985831 Sep  6 19:38 Full-0001
-rw-r-----  1 root root    47126293 Sep  7 14:39 Full-0002
-rw-r-----  1 root root  2841621607 Sep 13 17:11 Full-wi0001
-rw-r-----  1 root root     1584252 Sep 18 01:05 Inc-0001
-rw-r-----  1 root root    97963834 Sep 14 01:05 Inc-wi0001

Filesystem            Size  Used Avail Use% Mounted on
/dev/hda2             9.7G  5.8G  3.4G  64% /
/dev/hda1              99M   20M   75M  21% /boot
/dev/hda4             102G  2.2G   94G   3% /home
/dev/md2              221G   90G  120G  43% /mnt/storage
none                 1014M     0 1014M   0% /dev/shm
/dev/mapper/lvg01-coraid
                      812G  693G  114G  86% /mnt/coraid

There are a few layers on this partition, so I figured I'd start at the
top with you guys and work my way down. The partition this size limit
is on looks like so...

/mnt/coraid
+--------+
|  Ext3  |
+--------+
| LVM 2  |
+--------+
| Raid 5 |
+--------+

So any one of these layers could be the problem. I was able to create a
100 Gig file on the /home partition, so perhaps ext3 is not the problem,
but I'm really not sure.

The system is CentOS 4.1 running 2.6.13.2 (also tried 2.6.12.2)

Any insight would be great

--

Thanks,
Brandon Evans

"I wouldn't recommend sex, drugs or insanity for everyone, but they've
always worked for me."
-Hunter S. Thompson

From matts at ksu.edu  Fri Sep 23 21:50:29 2005
From: matts at ksu.edu (Matt Stegman)
Date: Fri, 23 Sep 2005 16:50:29 -0500 (CDT)
Subject: 17G File size limit?
In-Reply-To: <43347701.50603@hosttuls.com>
Message-ID:

What does "ulimit -a" report for your maximum allowed file size?  Could
you have limited yourself somehow?

--
Matt Stegman

On Fri, 23 Sep 2005, Brandon Evans wrote:

> Hi everyone,
> This is a strange problem I have been having. I'm not sure where the
> problem is, so I figured I'd start here.
>
> I was having problems with Bacula stopping on 17Gig Volume sizes, so I
> decided to try to just dd a 50 gig file.
> Sure enough, once the file hit
> 17 gigs dd stopped and spit out an error
>
>
> (pandora bacula)# dd if=/dev/zero of=bigfile bs=1M count=50000
> File size limit exceeded
> (pandora bacula)#
>
>
> (pandora bacula)# ll
> total 20334813
> -rw-r--r--  1 root root 17247252480 Sep 23 00:44 bigfile
> -rw-r-----  1 root root   302323821 Sep 23 01:10 Default-0001
> -rw-r-----  1 root root   156637059 Sep 18 01:08 Diff-wi0001
> -rw-r-----  1 root root    46985831 Sep  6 19:38 Full-0001
> -rw-r-----  1 root root    47126293 Sep  7 14:39 Full-0002
> -rw-r-----  1 root root  2841621607 Sep 13 17:11 Full-wi0001
> -rw-r-----  1 root root     1584252 Sep 18 01:05 Inc-0001
> -rw-r-----  1 root root    97963834 Sep 14 01:05 Inc-wi0001
>
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/hda2             9.7G  5.8G  3.4G  64% /
> /dev/hda1              99M   20M   75M  21% /boot
> /dev/hda4             102G  2.2G   94G   3% /home
> /dev/md2              221G   90G  120G  43% /mnt/storage
> none                 1014M     0 1014M   0% /dev/shm
> /dev/mapper/lvg01-coraid
>                       812G  693G  114G  86% /mnt/coraid
>
>
>
> There are a few layers on this partition, so I figured I'd start at the
> top with you guys and work my way down. The partition this size limit
> is on looks like so...
>
> /mnt/coraid
> +--------+
> |  Ext3  |
> +--------+
> | LVM 2  |
> +--------+
> | Raid 5 |
> +--------+
>
>
> So any one of these layers could be the problem. I was able to create a
> 100 Gig file on the /home partition, so perhaps ext3 is not the problem,
> but I'm really not sure.
>
>
> The system is CentOS 4.1 running 2.6.13.2 (also tried 2.6.12.2)
>
> Any insight would be great
>
>

From maillists at hosttuls.com  Fri Sep 23 22:05:53 2005
From: maillists at hosttuls.com (Brandon Evans)
Date: Fri, 23 Sep 2005 15:05:53 -0700
Subject: 17G File size limit?
In-Reply-To:
References:
Message-ID: <43347C41.7060605@hosttuls.com>

Matt Stegman wrote:
> What does "ulimit -a" report for your maximum allowed file size?  Could
> you have limited yourself somehow?
>

ulimit -a shows

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
pending signals                 (-i) 16383
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 16383
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

--

Thanks,
Brandon Evans

"I wouldn't recommend sex, drugs or insanity for everyone, but they've
always worked for me."
-Hunter S. Thompson

From Richard.Wolber at boeing.com  Fri Sep 23 23:11:27 2005
From: Richard.Wolber at boeing.com (EXT-Wolber, Richard)
Date: Fri, 23 Sep 2005 16:11:27 -0700
Subject: Unmounted File Handle
Message-ID: <8C7C41A176AC0B468BEFB2EFD9BDAB992004EB@XCH-NW-5V2.nw.nos.boeing.com>

Is it practical to get a R/W file handle opened against an existing file
on an unmounted ext2 filesystem?

--
Chuck Wolber
Electronic Flight Bag
Crew Information Systems/
Linux Wonk
253.576.1154

"You can't connect the dots looking forward; you can only connect them
looking backwards."
--Steve Jobs

From matts at ksu.edu  Sat Sep 24 17:23:09 2005
From: matts at ksu.edu (Matt Stegman)
Date: Sat, 24 Sep 2005 12:23:09 -0500 (CDT)
Subject: 17G File size limit?
In-Reply-To: <43347C41.7060605@hosttuls.com>
Message-ID:

Hmm, OK.  The only place I've ever seen that error was when ulimited.

Have you looked through the system logs for error messages?  Does "dmesg"
report anything that might be related?

I notice that you've got three volumes with this much free space
available.  Do you get the same results on all three volumes?  Are they
all ext3 filesystems?

--
Matt Stegman

On Fri, 23 Sep 2005, Brandon Evans wrote:

> Matt Stegman wrote:
> > What does "ulimit -a" report for your maximum allowed file size?  Could
> > you have limited yourself somehow?
> >
>
> ulimit -a shows
>
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> file size               (blocks, -f) unlimited

From tytso at mit.edu  Sat Sep 24 19:52:16 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Sat, 24 Sep 2005 15:52:16 -0400
Subject: 17G File size limit?
In-Reply-To: <43347701.50603@hosttuls.com>
References: <43347701.50603@hosttuls.com>
Message-ID: <20050924195216.GA6443@thunk.org>

On Fri, Sep 23, 2005 at 02:43:29PM -0700, Brandon Evans wrote:
> Hi everyone,
> This is a strange problem I have been having. I'm not sure where the
> problem is, so I figured I'd start here.
>
> I was having problems with Bacula stopping on 17Gig Volume sizes, so I
> decided to try to just dd a 50 gig file. Sure enough, once the file hit
> 17 gigs dd stopped and spit out an error
>
> (pandora bacula)# dd if=/dev/zero of=bigfile bs=1M count=50000
> File size limit exceeded
> (pandora bacula)#
>
> (pandora bacula)# ll
> total 20334813
> -rw-r--r--  1 root root 17247252480 Sep 23 00:44 bigfile

If you are using a 1k filesystem, then a file can consist of twelve
direct blocks, plus 256 data blocks addressed via the indirect block,
plus 256*256 data blocks addressed from the double-indirect block, plus
256*256*256 data blocks from the triple-indirect block:

(12 + 256 + 256*256 + 256*256*256) * 1024 = 17247252480

Does that number look familiar?

So the problem is that you created the file system using a 1k
blocksize.  Filesystems with a 1k blocksize are horribly inefficient
for large files, and they max out at a little over 16 gigabytes.
(Note that 16 gigs is 17179869184 bytes, unless you are a disk drive
company in which case your marketing department calls it 17 gigs.  :-)

- Ted

> -rw-r-----  1 root root   302323821 Sep 23 01:10 Default-0001
> -rw-r-----  1 root root   156637059 Sep 18 01:08 Diff-wi0001
> -rw-r-----  1 root root    46985831 Sep  6 19:38 Full-0001
> -rw-r-----  1 root root    47126293 Sep  7 14:39 Full-0002
> -rw-r-----  1 root root  2841621607 Sep 13 17:11 Full-wi0001
> -rw-r-----  1 root root     1584252 Sep 18 01:05 Inc-0001
> -rw-r-----  1 root root    97963834 Sep 14 01:05 Inc-wi0001
>
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/hda2             9.7G  5.8G  3.4G  64% /
> /dev/hda1              99M   20M   75M  21% /boot
> /dev/hda4             102G  2.2G   94G   3% /home
> /dev/md2              221G   90G  120G  43% /mnt/storage
> none                 1014M     0 1014M   0% /dev/shm
> /dev/mapper/lvg01-coraid
>                       812G  693G  114G  86% /mnt/coraid
>
>
>
> There are a few layers on this partition, so I figured I'd start at the
> top with you guys and work my way down. The partition this size limit
> is on looks like so...
>
> /mnt/coraid
> +--------+
> |  Ext3  |
> +--------+
> | LVM 2  |
> +--------+
> | Raid 5 |
> +--------+
>
>
> So any one of these layers could be the problem. I was able to create a
> 100 Gig file on the /home partition, so perhaps ext3 is not the problem,
> but I'm really not sure.
>
>
> The system is CentOS 4.1 running 2.6.13.2 (also tried 2.6.12.2)
>
> Any insight would be great
>
> --
>
> Thanks,
> Brandon Evans
>
> "I wouldn't recommend sex, drugs or insanity for everyone, but they've
> always worked for me."
> -Hunter S. Thompson

From tytso at mit.edu  Sun Sep 25 02:19:59 2005
From: tytso at mit.edu (Theodore Ts'o)
Date: Sat, 24 Sep 2005 22:19:59 -0400
Subject: Unmounted File Handle
In-Reply-To: <8C7C41A176AC0B468BEFB2EFD9BDAB992004EB@XCH-NW-5V2.nw.nos.boeing.com>
References: <8C7C41A176AC0B468BEFB2EFD9BDAB992004EB@XCH-NW-5V2.nw.nos.boeing.com>
Message-ID: <20050925021959.GA19847@thunk.org>

On Fri, Sep 23, 2005 at 04:11:27PM -0700, EXT-Wolber, Richard wrote:
> Is it practical to get a R/W file handle opened against an existing file
> on an unmounted ext2 filesystem?

What do you mean by a "read/write file handle"?  Do you mean opening a
file descriptor using the open(2) system call?  Do you mean opening a
stdio stream handle using the fopen(3) library call?

In either case, no, you can only open() or fopen() a file on a
mounted filesystem, and it doesn't matter which filesystem you are
using.

There is a set of interfaces as part of the ext2fs library which
would allow you to manipulate a file on an unmounted filesystem.

- Ted
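As a footnote to the "17G File size limit?" thread above, the block-map
ceiling can be recomputed for any blocksize.  This is only a minimal
sketch in Python, assuming the classic ext2/ext3 layout of 12 direct
block pointers plus single-, double- and triple-indirect blocks holding
4-byte block numbers.  The helper name max_file_bytes is made up for
illustration, and the result ignores any other limit (for example the
sector count kept in i_blocks) that may apply first with larger
blocksizes.

```python
def max_file_bytes(block_size: int, direct_blocks: int = 12,
                   ptr_size: int = 4) -> int:
    """Largest file the direct/indirect block map alone can address."""
    ptrs = block_size // ptr_size  # block numbers that fit in one indirect block
    blocks = direct_blocks + ptrs + ptrs ** 2 + ptrs ** 3
    return blocks * block_size


# 1k blocksize: 256 pointers per indirect block, as in the thread.
print(max_file_bytes(1024))   # 17247252480, the size bigfile stopped at
print(16 * 2 ** 30)           # 17179869184, i.e. exactly 16 gigs
```

The 1k result matches the 17247252480-byte bigfile in the ll listing,
which is why the direct-block count has to be 12 for the sum to come out
right.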