Linux -> Ext4 -> OHOH

22.Jun.2015

USE YOUR QNAP ONLY TO STORE BACKUPS!

NEVER UNIQUE DATA OF WHICH NO BACKUP EXISTS! (like your holiday-pictures that you removed from your laptop harddisk to free some space… ARGH… and that you PLANNED to burn on DVD but never did.)

4.3 Ext4: delayed allocation will destroy your data

With ext4, the default is still ‘commit=5,data=ordered’, but unfortunately the safety guarantees of ext3 no longer hold.

To improve performance, ext4 uses delayed allocation by default: blocks are only allocated when the data is actually written out to disk. The rename idiom (write the new content to a temporary file, then ‘rename()’ it over the original) therefore no longer works as many applications expect: the ‘rename()’ call changes only metadata and may be committed before the data is written to disk, so after a crash the resulting file can be empty!
On ext3 this problem only arises with ‘data=writeback’, but on ext4 it also happens with ‘data=ordered’, because for ext4’s author ‘data=ordered’ covers not all data but only data whose blocks are already allocated…

So it works when the application overwrites existing blocks of a file, but not when the file grows and needs new blocks…
On ext4, ‘data=ordered’ and ‘data=writeback’ therefore behave much alike when a file grows, which is quite confusing and not clearly stated in the man pages.

Moreover, delayed allocation actually commits the data to disk only after 30-150 seconds (the exact window of data loss is not clearly documented), even though ‘commit=5’ is supposed (cf. ‘man mount’) to do it after 5 seconds.
In conclusion, after a crash ext4 with default options only guarantees the atomicity and consistency of filesystem (metadata) changes, with a maximum loss of 5 seconds of metadata changes.

Data changes may suffer a loss of 30-150 seconds, and in the majority of cases every file changed in this window will be truncated to zero bytes! Atomic file replacement via the rename idiom no longer works.

This dramatic situation caused a lot of anger. Ext4’s author argued that the safety guarantees provided by ext3 were unnecessary from a POSIX point of view and that the solution was to fix all the “broken” applications (including GNU fileutils like ‘mv’), because they should explicitly call ‘fsync()’ before each ‘rename()’… (Note: ‘fflush()’, or ‘flush()’ in many higher-level languages, moves the application’s file buffers into the kernel’s write cache; ‘fsync()’ then forces that file’s cached data to disk.)

But calling ‘fsync()’ would kill performance on ext3, and fixing 100’000 applications instead of fixing 1 filesystem is not practical.
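For an application that does want to follow the fsync()-before-rename() advice, the pattern can be sketched in Python as below. This is only an illustration under assumptions: the helper name ‘atomic_write’ and the ‘.tmp-’ prefix are made up for this example, and the directory fsync at the end is an extra precaution on Linux, not something the discussion above demands.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) so a crash leaves the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory, so that the rename
    # never has to cross a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # application buffers -> kernel write cache
            os.fsync(tmp.fileno())  # kernel write cache -> disk
        os.replace(tmp_path, path)  # rename() over the old file
    except BaseException:
        # Something failed before the rename: remove the leftover temp file.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Also sync the directory entry, so the rename itself reaches the disk.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that ‘mkstemp()’ creates the file readable by its owner only, so a real program may need to copy the permissions of the old file first. The price is exactly the one discussed above: one ‘fsync()’ per replaced file.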

4.4 Linus Torvalds’s angry reaction

This led Linus Torvalds to react in the following posts:

http://lwn.net/Articles/326342/

[Mr Ts’o shows considerable arrogance saying that virtually every application on the planet is “badly written” (including GNU fileutils, meaning most frequently used OS tools such as mv).

He also seems unaware of what we might call “Hot topics in filesystem design”, such as: “POSIX is not the bible of reliability it was never supposed to be” or “Users dislike empty files”.

This dangerous combination of arrogance and ignorance is leading Mr Ts’o to quickly damage ext4 reputation and place it next to XFS in users minds, and we all know how hard it is to revert that kind of reputation.

This may leave Linux users in many years to come between a rock and a hard place when it comes to filesystem performance: use the obsolete and slow ext3, or suffer the consequences of repeated slow fsync() calls in the much-needed ext4.

> Try ext4, I think you’ll like it.

> Failing that, data=writeback for single-user machines is probably your best bet.

Isn’t that the same fix? ext4 just defaults to the crappy “writeback” behavior, which is insane.

Sure, it makes things _much_ smoother, since now the actual data is no longer in the critical path for any journal writes, but anybody who thinks that’s a solution is just incompetent. We might as well go back to ext2 then.

If your data gets written out long after the metadata hit the disk, you are going to hit all kinds of bad issues if the machine ever goes down. Linus]

http://lwn.net/Articles/322823/

[Are we really saying that ext4 commits metadata changes to disk (potentially a long time) before committing the corresponding data change? That surely can’t be right.

Why on earth would you write metadata describing something which you know doesn’t exist yet – and may never exist? Especially when the existing metadata describes something that does.]

See also the explanation by ext4’s author at http://ostatic.com/blog/recent-bug-report-details-data-loss-in-ext4-tso-explains-cause-and-workarounds.

Ext4’s author then developed new mount options to fix the problem (cf. http://lwn.net/Articles/476478/):

‘nodelalloc’ disables the delayed allocation completely
‘auto_da_alloc’ tries to detect the rename idiom and forces the block allocation and data write to happen before the ‘rename()’ is committed

4.5 Solution ATTEMPT TO RECOVER A BROKEN EXT4 ON QNAP NAS (did not succeed yet)

In conclusion, ext4 should be mounted with ‘-o nodelalloc’ to make it safe against a server crash, and ext3 should use ‘-o barrier=1’ (barriers are disabled by default on ext3).

mount; # latest QNAP firmware does this by default (x > 4.1.2)
/dev/md0 on /share/MD0_DATA type ext4 (rw,usrjquota=aquota.user,jqfmt=vfsv0,user_xattr,data=ordered,nodelalloc,noacl)

mount; # older QNAP firmware does NOT!? Firmware 4.0.7 @ TS 559 PRO+
/dev/md0 on /share/MD0_DATA type ext4 (rw,usrjquota=aquota.user,jqfmt=vfsv0,user_xattr,data=ordered,delalloc,noacl)
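To check many volumes without reading the option lists by eye, the two ‘mount’ lines above can be classified with a few lines of Python. This is a rough sketch: the function name ‘delalloc_status’ and its return strings are invented for this example, and it assumes the classic ‘DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)’ output format with no spaces in the mount point.

```python
import re

# Matches the classic `mount` output: DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \((.*)\)$")


def delalloc_status(mount_line):
    """Classify one line of `mount` output by its delayed-allocation setting."""
    match = MOUNT_LINE.match(mount_line.strip())
    if match is None:
        raise ValueError("not a mount line: %r" % mount_line)
    _device, _mount_point, fs_type, options = match.groups()
    if fs_type != "ext4":
        return "not ext4"
    if "nodelalloc" in options.split(","):
        return "delayed allocation disabled"
    # Assumption: delalloc is the ext4 default, so a line that does not
    # mention nodelalloc is treated as having delayed allocation enabled.
    return "delayed allocation enabled"
```

Feeding it the line from the newer firmware gives "delayed allocation disabled", the line from firmware 4.0.7 gives "delayed allocation enabled".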

USE YOUR QNAP ONLY TO STORE BACKUPS!

NEVER UNIQUE DATA OF WHICH NO BACKUP EXISTS!

(IF YOU LOSE YOUR FILESYSTEM, THIS IS A DOWNER… BUT NOT A MAJOR ONE… HAVING TO RECREATE 2TB OF BACKUP TAKES A WHILE BUT IT’S BETTER THAN HAVING LOST 2TB OF DATA!)

nodelalloc

Disable delayed allocation.  Blocks are allocated
when the data is copied from userspace to the
page cache, either via the write(2) system call
or when an mmap'ed page which was previously
unallocated is written for the first time.

http://www.mjmwired.net/kernel/Documentation/filesystems/ext4.txt

4.6 A new serious problem of ext4

In theory, ext4 offers increased safety if you additionally use the options ‘-o journal_checksum,journal_async_commit’, but in October 2012 a serious bug was discovered that can lead to filesystem corruption with these options.

Ext4’s author called it the “Lance Armstrong bug”: when the code never fails a test, but evidence shows it’s not behaving as it should (cf. http://forums.opensuse.org/english/other-forums/news-announcements/tech-news/479881-stable-linux-kernel-hit-ext4-data-corruption-bug.html – post2499841).
The author of ext4 then wrote that these two options were experimental and dangerous and should not be used. It is hard to understand why this warning was not given earlier, knowing that ext4 had been used in stable Linux distros since 2009.
Despite these unfortunate stories, we should not conclude that ext4 is a bad filesystem. It definitely improves many things over ext3 and is better suited to large partitions with a huge number of files; the benefit is especially noticeable when you run fsck.
But it is important to know exactly which dangers exist with ext4 and how to overcome them by using the right mounting options.

source & creditz: http://www.pointsoftware.ch/en/4-ext4-vs-ext3-filesystem-and-why-delayed-allocation-is-bad/

ps:

[  120.359599] EXT4-fs (md0): Mount option "noacl" will be removed by 3.5
[  120.359604] Contact linux-ext4@vger.kernel.org if you think we should keep it.
