The year 2000 has passed: no flying cars yet, but at least faster (SSD and NVMe) storage 😀

All sorts of resources, including digital resources such as bandwidth, RAM and disk space, are STILL precious and should be treated as such. (one’s Android-based smartphone has 12 GBytes of RAM, because a lot of (Java) apps are bloated)

No matter whether on a server, notebook or smartphone, or on the whole planet: resources should never be wasted and cannot be wasted endlessly (thus: keep it simple + minimalism + recycle + reuse (almost) every atom).

“WhatsApp Using Up Your Phone Storage? Here’s How to Fix It” (there should be no such problems in the first place)

Imagine two very large identical files: it would be UTTERLY useless to waste disk blocks on both.
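
a minimal sketch demonstrating the waste (file names and sizes are made up for the test, assuming a classic filesystem like ext4):

# sketch: two identical 100 MB files really occupy 200 MB on disk
dd if=/dev/urandom of=big1.bin bs=1M count=100
cp big1.bin big2.bin
du -ch big1.bin big2.bin
# 100M    big1.bin
# 100M    big2.bin
# 200M    total <- the same data stored twice
# (on btrfs/XFS a modern cp may silently reflink-copy instead, avoiding the waste)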

when having duplicates makes sense: backups across devices

of course it’s IMPORTANT to have duplicates of important files across multiple devices (harddisks, flash storage, optical disks, tapes in different places in double-and-triple-metal-casing-to-shield-against-electromagnetic-interference) to avoid data loss.

But ON THE SAME DEVICE one copy of each file or data block would be enough, so it is a bit strange that storage waste through duplicate data is (still) a (massive?) issue daily and globally that needs manual intervention (surely Facebook, Amazon & Google have solved this problem server-side, otherwise even their datacenters would fill up fast X-D)

There are cool filesystems that do this automatically: ZFS, but it is really only recommended for servers with ECC RAM.

There are cool features of ext4 such as hardlinks and softlinks which can help avoid duplicates.

But softlinks are utterly useless for the deduplication task, as the connection between link and file is lost the moment the file is renamed or moved (see the sketch below).
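
a minimal sketch demonstrating this (file names are made up for the test):

# sketch: a softlink breaks the moment its target is renamed or moved
echo "some data" > original.txt
ln -sv original.txt softlink.txt
mv original.txt renamed.txt
cat softlink.txt
# cat: softlink.txt: No such file or directory <- the softlink is broken
# a hardlink would have survived the rename: it points to the inode, not to the path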

Cool: the hardlink program (hardlink.man.txt) can AUTOMATICALLY find duplicate files and simply replace each copy with a hardlink to the original data, thus, depending on how much duplicate data there is, saving massively on disk space.
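
a minimal sketch of how that could look (options as in the util-linux hardlink, the path is an example; always dry-run first and check the man page of the hardlink variant that is installed):

# sketch: dry-run first, only REPORT what would be hardlinked
hardlink --dry-run --verbose /home/user/Documents
# if the report looks sane: actually replace duplicates with hardlinks
hardlink --verbose /home/user/Documents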

filesystems that handle duplicate data efficiently?

(on a filesystem level):

ZFS

  • ZFS supports in-band block-based deduplication (see the sketch after this list)
  • It’s a filesystem/LVM hybrid with good support on Linux and FreeBSD
  • ZFS provides a lot of awesomeness: basically everything, from file data to filesystem metadata, is checksummed, so filesystem corruption can be detected immediately and even healed using RAID-Z (RAID managed solely by ZFS itself)
  • the price to pay:
    • performance won’t be as good as with traditional filesystems, ZFS is focused on reliability, not speed (unfortunately both are important X-D)
    • basic ZFS requires at least 1 GB of ECC (!!!) RAM
      • plus 1 GB of (ECC) RAM for each 1 TB of storage the user wishes to deduplicate (e.g. deduplicating an 8 TB pool needs roughly 8 GB of RAM for the dedup tables alone)
        • (information about available blocks must be stored somewhere for deduplication to be efficient)
        • it’s important that it is ECC RAM X-D (usually only found in servers): unlike with traditional filesystems, flipped bits won’t just damage data, e.g. a metadata checksum corrupted in RAM can irrecoverably damage the filesystem. ECC fixes this by ensuring RAM errors never make it to disk. (based on src)
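
a minimal sketch of what enabling deduplication on ZFS could look like (pool name “tank” and device names are just examples):

# sketch: create a mirrored pool and enable in-band deduplication (a per-dataset property)
zpool create tank mirror /dev/sda /dev/sdb
zfs set dedup=on tank
# later: check the achieved dedup ratio (DEDUP column)
zpool list tank
# the checksums also allow scrubbing: read everything, verify, self-heal
zpool scrub tank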

while this ZFS awesomeness might be cool for ECC-enabled servers, it is no help for all the other, mostly ext4-based computerz. solutions anyone?

btrfs: sort of on a filesystem level

https://btrfs.readthedocs.io/en/latest/Deduplication.html

There are two main deduplication types:

  • in-band (sometimes also called on-line) — all newly written data are considered for deduplication before writing
  • out-of-band (sometimes also called offline) — data for deduplication have to be actively looked for and deduplicated by the user application

Both have their pros and cons. BTRFS implements only the out-of-band type (see the sketch after the table below).

BTRFS provides the basic building blocks for deduplication allowing other tools to choose the strategy and scope of the deduplication. There are multiple tools that take different approaches to deduplication, offer additional features or make trade-offs. The following table lists tools that are known to be up-to-date, maintained and widely used.

Name         File based   Block based   Incremental
BEES         No           Yes           Yes
duperemove   Yes          No            Yes
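
a minimal sketch using duperemove on btrfs (the path is an example; without -d duperemove only reports, nothing is changed):

# sketch: out-of-band (offline) deduplication with duperemove
duperemove -r /mnt/btrfs/data
# -d actually submits the duplicate extents to the kernel for deduplication
duperemove -dr /mnt/btrfs/data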

deduplicate data with hardlinks or softlinks?

what’s actually the difference?

hardlinks are additional directory entries pointing to the same inode, i.e. to the same data on disk. if one of multiple hardlinks is deleted, the data can still be accessed via the other hardlinks.

not so with softlinks: a softlink only stores the path of its target, so when the original file is deleted (or renamed or moved), the softlink becomes “broken”.

# for testing purposes
# create temp dir
mkdir temp
cd temp
# create a file with data
echo "this is a small text file that is not empty" > small.text.testfile
# verify that there is data
xxd -b small.text.testfile |less
# by default, ln creates hardlinks
ln -v /path/file hardlinkname
# create an actual hardlink linking to the testfile
ln -v small.text.testfile hardlink.small.text.testfile
# verify it's a hardlink
ls -l hardlink.small.text.testfile
-rw-r--r-- 2 user user 44 2023-10-04 12:52 hardlink.small.text.testfile
# with the -s option, ln creates softlinks
ln -sv /path/file softlinkname
# create an actual softlink linking to the testfile
ln -sv small.text.testfile softlink.small.text.testfile

# verify it's a softlink
ls -l softlink.small.text.testfile
lrwxrwxrwx 1 user user 19 2023-10-04 12:53 softlink.small.text.testfile -> small.text.testfile

# when deleting the original file X-D
rm -rf small.text.testfile

# what is still there?
ls -lah
# this hardlink still works (connection to data still there)
-rw-r--r--  1 user user 44 2023-10-04 12:52 hardlink.small.text.testfile

# verify data still accessible
xxd -b hardlink.small.text.testfile|less

# this softlink is broken (connection to data lost)
lrwxrwxrwx  1 user user 19 2023-10-04 12:53 softlink.small.text.testfile -> small.text.testfile

# verify connection to data is lost (xxd fails with: No such file or directory)
xxd -b softlink.small.text.testfile|less
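
a minimal sketch to see WHY the hardlink still works: it shares the inode with the original file (format strings as in GNU stat):

# sketch: show inode number and hardlink count
stat -c '%i %h %n' hardlink.small.text.testfile
# %i = inode number, %h = number of hardlinks, %n = file name
# before the rm, the original file would have shown the SAME inode number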

how to go about manual deduplication on ext4-based systems then?

GNU Linux -> how to find and remove duplicate files
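
a minimal sketch of the manual approach (paths are examples; fdupes needs to be installed first, e.g. apt install fdupes):

# sketch: list duplicate files recursively with fdupes
fdupes -r /home/user/Downloads
# or with plain coreutils: hash every file, then show groups with identical hashes
find /home/user/Downloads -type f -exec md5sum {} + | sort | uniq -w32 --all-repeated=separate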
