GNU Linux (Debian) – how to – find the largest 30 duplicate files wasting disk space – multi line sorting madness (mlsm) – how to output x blocks of text separated by delimiter – build (Bill Poser’s and BSDs) msort from src

04.Oct.2023


This is actually VERY useful for finding files that waste disk space.

lsb_release -a; # tested on
Distributor ID:	Debian
Description:	Debian GNU/Linux 12 (bookworm)

the solution: czkawka_cli

First install Rust (no need to install it as root), then install czkawka_cli for the default non-root user:
```
cargo install czkawka_cli
```
After a bit of downloading and compiling, run it as that same non-root user (it is installed only for that user, not for root):

czkawka_cli dup --directories /where/to/search/for/duplicates/ | less
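
If the shell answers with "command not found" here, the binary is probably just not in PATH. A minimal sketch, assuming cargo used its default install location `$HOME/.cargo/bin` (the function name `ensure_cargo_bin_in_path` is made up for this example):

```bash
#!/bin/bash
# Prepend cargo's binary directory to PATH unless it is already listed.
# Assumption: cargo installed into its default location, $HOME/.cargo/bin.
ensure_cargo_bin_in_path() {
    local cargo_bin="$HOME/.cargo/bin"
    case ":$PATH:" in
        *":$cargo_bin:"*) ;;              # already present, nothing to do
        *) PATH="$cargo_bin:$PATH" ;;     # not present, put it in front
    esac
    export PATH
}

ensure_cargo_bin_in_path
# print the resolved path of the binary, or a hint if it is still missing
command -v czkawka_cli || echo "czkawka_cli not found in PATH" >&2
```

To make this permanent, the same PATH line can go into `~/.bashrc`.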

czkawka_cli AUTOMATICALLY sorts the results by file size (great job, everyone involved! :D)

To limit the output to 30 blocks, first save the full report to a file:

czkawka_cli dup --directories /where/to/search/for/duplicates/ > /scripts/find_largest_30_duplicates.sh.txt

then define one of the following 2 scripts

(the first is a generic helper that reads an existing report file, the second is a self-contained all-in-one script)

vim /scripts/output_x_amount_of_textblocks.sh 

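#!/bin/bash
# usage: output_x_amount_of_textblocks.sh FILE BLOCK_LIMIT DELIMITER
echo "=== output $2 amount of text blocks from $1 defined by delimiter $3 ==="
TEMPFILE="$1"
# TEMPFILE=/scripts/find_largest_30_duplicates.sh.txt
# fdupes -Sr "$1" | grep "bytes each" -A 2 > "$TEMPFILE"

BLOCK_LIMIT="$2"
BLOCK_COUNTER=0
DELIMITER="$3"

# IFS= keeps leading whitespace, -r keeps backslashes as they are
while IFS= read -r LINE; do
	# stop once BLOCK_LIMIT delimiter lines have been printed
	# (the old test "-le" printed one block too many)
	if [ "$BLOCK_COUNTER" -ge "$BLOCK_LIMIT" ]; then
		break
	fi
	# verbose output
	# echo "currently on block: $BLOCK_COUNTER"

	printf '%s\n' "$LINE"

	# every line containing the delimiter counts as one block
	if [[ "$LINE" == *"$DELIMITER"* ]]; then
		BLOCK_COUNTER=$((BLOCK_COUNTER + 1))
	fi
done < "$TEMPFILE"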


# then call it like this
/scripts/output_x_amount_of_textblocks.sh "/scripts/find_largest_30_duplicates.sh.txt" 30 "----"
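
The same "print the first x blocks" idea also fits into a few lines of awk, which avoids the bash loop. This is only a sketch: the function name `first_blocks` is made up here, and like the script above it counts every line that contains the delimiter as one block.

```bash
#!/bin/bash
# Print the first N delimiter-terminated blocks of text read from stdin.
# $1 = number of blocks to print, $2 = delimiter (plain substring, not a regex)
first_blocks() {
    awk -v limit="$1" -v delim="$2" '
        count >= limit  { exit }      # enough blocks printed, stop reading
                        { print }     # pass the current line through
        index($0, delim) { count++ }  # a line containing the delimiter counts as one block
    '
}

# usage (report file taken from the text above):
# first_blocks 30 "----" < /scripts/find_largest_30_duplicates.sh.txt | less
```

`index()` does a plain substring search, so delimiters like `----` need no escaping.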

all in one script:

vim /scripts/find_largest_30_duplicates.sh

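#!/bin/bash
# usage: find_largest_30_duplicates.sh /directory/to/scan
BLOCK_LIMIT=30
BLOCK_COUNTER=0
DELIMITER="----"
TEMPFILE=/scripts/find_largest_30_duplicates.sh.txt

# create temp file and make sure it's empty
: > "$TEMPFILE"
# verbose: monitor changes to tempfile
# tail -f "$TEMPFILE" &

# AUTOMATICALLY sorts by filesize 😀
czkawka_cli dup --directories "$1" > "$TEMPFILE"

# this script only takes $1, so report the values defined above
echo "=== output $BLOCK_LIMIT amount of text blocks from $1 defined by delimiter $DELIMITER ==="
# fdupes -Sr "$1" | grep "bytes each" -A 2 > "$TEMPFILE"

# IFS= keeps leading whitespace, -r keeps backslashes as they are
while IFS= read -r LINE; do
	# stop once BLOCK_LIMIT delimiter lines have been printed
	# (the old test "-le" printed one block too many)
	if [ "$BLOCK_COUNTER" -ge "$BLOCK_LIMIT" ]; then
		break
	fi
	# verbose output
	# echo "currently on block: $BLOCK_COUNTER"

	printf '%s\n' "$LINE"

	# every line containing the delimiter counts as one block
	if [[ "$LINE" == *"$DELIMITER"* ]]; then
		BLOCK_COUNTER=$((BLOCK_COUNTER + 1))
	fi
done < "$TEMPFILE"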

then run it like:

chmod +x /scripts/*.sh
/scripts/find_largest_30_duplicates.sh /where/large/duplicates/consume/valuable/disk/space | less
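
The script trusts its first argument blindly. A small sketch of a check that could sit at the top of the script, so a typo in the path fails early instead of producing an empty report (the function name `check_search_dir` is made up for this example):

```bash
#!/bin/bash
# Validate the command line: exactly one argument, and it must be a directory.
# Returns 0 if fine, 2 on wrong usage, 1 if the path is not a directory.
check_search_dir() {
    if [ "$#" -ne 1 ]; then
        echo "usage: find_largest_30_duplicates.sh /directory/to/scan" >&2
        return 2
    fi
    if [ ! -d "$1" ]; then
        echo "error: '$1' is not a directory" >&2
        return 1
    fi
}

# usage at the top of the script:
# check_search_dir "$@" || exit $?
```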

alternative with fdupes

# define script (same path as above, so this replaces the czkawka_cli version)
vim /scripts/find_largest_30_duplicates.sh

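#!/bin/bash
# usage: find_largest_30_duplicates.sh /directory/to/scan
TEMPFILE=/scripts/find_largest_30_duplicates.sh.txt
# keep each "bytes each" size line plus the 2 lines after it;
# grep separates the groups with a "--" line
fdupes -Sr "$1" | grep "bytes each" -A 2 > "$TEMPFILE"

BLOCK_LIMIT=30
BLOCK_COUNTER=0
DELIMITER="--"

# IFS= keeps leading whitespace, -r keeps backslashes as they are
while IFS= read -r LINE; do
	# stop once BLOCK_LIMIT delimiter lines have been printed
	# (the old test "-le" printed one block too many)
	if [ "$BLOCK_COUNTER" -ge "$BLOCK_LIMIT" ]; then
		break
	fi
	# verbose output
	# echo "currently on block: $BLOCK_COUNTER"

	printf '%s\n' "$LINE"

	# every line containing the delimiter counts as one block
	if [[ "$LINE" == *"$DELIMITER"* ]]; then
		BLOCK_COUNTER=$((BLOCK_COUNTER + 1))
	fi
done < "$TEMPFILE"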


# call it like this
/scripts/find_largest_30_duplicates.sh "/where/to/search/for/duplicates/"
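
To get a rough number for how much space the duplicates waste in total, the unfiltered `fdupes -Sr` output can be summed up. This is a sketch under an assumption about the output format: each set starts with a line like "1234 bytes each:" followed by one path per line, and sets are separated by an empty line. The function name `wasted_bytes` is made up here.

```bash
#!/bin/bash
# Sum up the reclaimable bytes from "fdupes -Sr" output read on stdin.
# Assumed format: "<size> bytes each:" header, one path per line, blank line
# between sets. Per set, all copies except one count as wasted.
wasted_bytes() {
    awk '
        /^[0-9]+ bytes? each/ { size = $1; n = 0; next }   # new set: remember its file size
        NF == 0 {                                          # blank line closes the set
            if (n > 1) waste += size * (n - 1)
            n = 0
            next
        }
        { n++ }                                            # any other line is one file path
        END {
            if (n > 1) waste += size * (n - 1)             # last set may lack a blank line
            printf "%.0f\n", waste
        }
    '
}

# usage:
# fdupes -Sr /where/to/search/for/duplicates/ | wasted_bytes
```

`printf "%.0f"` is used instead of `print` so that large totals are not shown in exponent notation by some awk implementations.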

compiling msort

WARNING! MULTIPLE versions of MSORT exist!

http://www.billposer.org/Software/msort.html

Bill Poser (billposer ÄT alum DOT mit DOT edu) (https://packages.debian.org/trixie/msort)

and the BSD msort

how to compile Bill Poser’s msort

manpage: msort.man.txt

Second, msort requires support for Unicode normalization. It can be compiled to use either libicu (International Components for Unicode), which may be obtained
from http://www.icu-project.org/, or libutf8proc, which may be obtained from http://www.flexiguided.de/publications.utf8proc.en.html.

ICU is fairly widely used, so you may already have it on your system.

To use it, give the option --disable-utf8proc to configure. msort defaults to using utf8proc because utf8proc is smaller and easier to install.

Of course, it is much easier (and also recommended) to just install the Debian package:

su - root
apt update
apt install msort

So because the utf option was missing, the compilation quest began …

lsb_release -a
Description:	Debian GNU/Linux 12 (bookworm)

# prepare requirements: tre

# recommended to compile this inside a vm
# because it will require a lot of packages
# build tre (https://github.com/laurikari/tre/)
su - root

# allow username to use sudo
usermod -a -G sudo username
apt install build-essential git wget autoconf automake gettext libtool zip autopoint libutf8proc-dev libuninum-dev;
reboot; # save all unsaved files and reboot to make user permission (sudo) changes active
# test if sudo works
sudo bash
# if yes, press Ctrl+D to leave the root shell
# and become non-root user again
cd ~;
mkdir -p software;
cd software;
git clone https://github.com/laurikari/tre.git;
cd tre;
./utils/autogen.sh;
./configure;
make;
make check;
sudo make install

# compile bill's msort from src
cd ~;
mkdir -p software;
cd software;
wget http://billposer.org/Software/Downloads/msort-8.53.tar.gz;
# here is a backup copy of the src and corresponding sha512sum file
tar fxvz msort-8.53.tar.gz;
cd msort-8.53;
./configure;
make;
sudo make install;

# if all went (almost) well, this should print something like
# the following, which means: CELEBRATE! IT WORKS :D
msort --version
msort 8.53
lib gmp not linked
lib utf8proc
lib tre 0.8.0
lib uninum not linked
glibc 2.36
Compiled Oct 4 2023 11:52:38 on x86_64
under Linux 6.1.0-12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.52-1 (2023-09-07)
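
Since the download runs over plain http, it is worth comparing the tarball against a published sha512 checksum before building. A minimal sketch: the function name `verify_sha512` is made up, and the expected value is a placeholder that has to come from the checksum file belonging to the archive.

```bash
#!/bin/bash
# Compare the sha512 checksum of a file with an expected hex string.
# $1 = file to check, $2 = expected checksum (hex); returns 0 only on a match
verify_sha512() {
    local file="$1" expected="$2" actual
    # sha512sum prints "<checksum>  <filename>", keep only the checksum
    actual="$(sha512sum "$file" | cut -d' ' -f1)"
    [ "$actual" = "$expected" ]
}

# usage (EXPECTED is a placeholder, not the real checksum):
# EXPECTED="paste-the-published-sha512-here"
# verify_sha512 msort-8.53.tar.gz "$EXPECTED" && echo "checksum ok" || echo "CHECKSUM MISMATCH"
```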

how to compile the bsd msort:

manpage: msort.bsd.man.txt


lsb_release -a
Description:	Debian GNU/Linux 12 (bookworm)

su - root;
apt update;
apt search autoreconf;
apt install dh-autoreconf;

# press Ctrl+D to log off root and become the non-root user again

cd ~

mkdir -p software

cd software

git clone https://github.com/mayank-02/msort.git
cd msort
# actually start building
autoreconf --install
./configure
make
# install binaries
sudo make install

# you can tell it is the BSD msort because --version fails
msort --version
msort: invalid option -- '-'
Refer README.md for more information.
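
Because both programs install a binary called msort, a script can use exactly this difference to find out which one it got. A sketch based only on the behaviour shown above (Bill Poser's msort answers `--version`, the BSD one rejects it). The function name `msort_flavor` and its three output words are made up for this example.

```bash
#!/bin/bash
# Guess which msort is installed by checking whether it accepts --version.
# $1 = command name to test (defaults to msort)
# prints "poser", "bsd" or "missing"; returns 1 if the command does not exist
msort_flavor() {
    local cmd="${1:-msort}"
    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "missing"
        return 1
    fi
    # stdin comes from /dev/null so the test never waits for input
    if "$cmd" --version </dev/null >/dev/null 2>&1; then
        echo "poser"    # accepts --version like Bill Poser's msort above
    else
        echo "bsd"      # rejects --version like the BSD msort above
    fi
}

# usage:
# msort_flavor
```

Any other program named msort that happens to accept `--version` would also be reported as "poser", so this is a heuristic and not a proof.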