GNU Linux – bash script – WebHTTrack and wget – Recursively download/backup entire Website

19.Jun.2015


wget way:

wget --no-check-certificate --limit-rate=5k --random-wait --recursive --no-clobber --page-requisites --html-extension --convert-links https://domain.com

# remove "--random-wait" and "--limit-rate" if it is your own website and bandwidth is no problem
vim /scripts/download_website.sh; # create new file and fill with this content

#!/bin/bash
# $1 = URL of the website to download (quoted so that special characters in the URL survive)
wget --no-check-certificate --limit-rate=5k --random-wait --recursive --no-clobber --page-requisites --html-extension --convert-links "$1"

chmod u+x /scripts/download_website.sh; # mark script executable

mkdir /offline_websites_workspace; # create new dir where to offline the website

cd /offline_websites_workspace;

/scripts/download_website.sh "https://domain.com"; # recursively download domain.com
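
a slightly more defensive variant of the script above could look like the following sketch. The function names and the POLITE variable are made up for this example (they are not wget features), and --no-check-certificate is left out on purpose; add it back if the site uses a self-signed certificate.

```shell
#!/bin/bash
# Sketch of a more defensive download script.
# wget_options, download_website and POLITE are hypothetical names.

# Print the wget options, one per line, for the mode "polite" or "fast".
wget_options() {
    if [ "$1" = "polite" ]; then
        # throttle only when downloading from someone else's server
        printf '%s\n' --limit-rate=5k --random-wait
    fi
    printf '%s\n' --recursive --no-clobber --page-requisites \
        --html-extension --convert-links
}

# Download one site into the current directory.
download_website() {
    local url="$1"
    local mode="${POLITE:-polite}"
    case "$url" in
        http://*|https://*) ;;
        *) echo "usage: download_website <http(s)://url>" >&2; return 1 ;;
    esac
    # word splitting is intended here: every option is a single word
    # shellcheck disable=SC2046
    wget $(wget_options "$mode") "$url"
}
```

usage: `cd /offline_websites_workspace; download_website "https://domain.com"` for the throttled download, or `POLITE=fast download_website "https://domain.com"` for your own server.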

options explained:

--mirror (alternative to --recursive, not used in the command above)
- Turn on options suitable for mirroring.
- This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings.
- It is currently equivalent to -r -N -l inf --no-remove-listing.
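
a mirror-based variant could be sketched like this. Because --mirror turns on time-stamping (-N), --no-clobber is dropped: wget refuses to combine the two. The function names are hypothetical.

```shell
#!/bin/bash
# Sketch: options for a --mirror based download.
# mirror_options and mirror_website are hypothetical names.

# Print the options, one per line. --no-clobber is deliberately absent,
# because --mirror implies -N and wget rejects -N together with --no-clobber.
mirror_options() {
    printf '%s\n' --mirror --page-requisites --html-extension --convert-links
}

# Mirror one site into the current directory.
mirror_website() {
    local url="$1"
    # word splitting is intended: every option is a single word
    # shellcheck disable=SC2046
    wget $(mirror_options) "$url"
}
```

re-running `mirror_website "https://domain.com"` later only fetches files that changed on the server, which is what time-stamping is for.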

--no-check-certificate (skips certificate verification, e.g. for self-signed certificates; HTTPS itself works without it)
- Don’t check the server certificate against the available certificate authorities. Also don’t require the URL host name to match the common name presented by the certificate.
--limit-rate=5k
- Limit the download speed to 5 KB per second.
--random-wait
- Some web sites may perform log analysis to identify retrieval programs such as Wget by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using the --wait option, in order to mask Wget’s presence from such analysis.
- A 2001 article in a publication devoted to development on a popular consumer platform provided code to perform this analysis on the fly. Its author suggested blocking at the class C address level to ensure automated retrieval programs were blocked despite changing DHCP-supplied addresses.
- The --random-wait option was inspired by this ill-advised recommendation to block many unrelated users from a web site due to the actions of one.
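--recursive
- follow links and download the linked pages recursively (wget's default maximum depth is 5 levels; use -l inf or --mirror for the entire website).
~~--domains website.org~~
- don’t follow links outside website.org.
~~--no-parent~~
- don’t ascend to the parent directory (in the credited article's example: don’t follow links outside the directory tutorials/html/).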
--page-requisites
- get all the elements that compose the page (images, CSS and so on).
--html-extension
- append .html to the names of downloaded HTML files that lack it (newer wget versions call this option --adjust-extension).
--convert-links
- convert links so that they work locally, off-line (./relative/paths/)
- After the download is complete, convert the links in the document to make them suitable for local viewing.
- This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
- Each link will be changed in one of the two ways:
  - The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.
  - The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.
~~--restrict-file-names=windows~~
- modify filenames so that they also work on Windows (not needed here).
--no-clobber
- easy resume if the connection breaks down: don’t overwrite any existing files (useful when the download is interrupted and resumed)

creditz: http://www.linuxjournal.com/content/downloading-entire-web-site-wget

manpage: wget.man.txt

WebHTTrack

httrack is a dedicated “download the entire website” tool, written solely for this purpose

warning: some websites have brute-force attack detection and prevention in place (anti-DDoS) and might block the copy process halfway

setup:

hostnamectl; # tested on
   Static hostname: DebianLaptop
  Operating System: Debian GNU/Linux 9 (stretch)
            Kernel: Linux 4.9.0-13-amd64
      Architecture: x86-64
su - root;
apt update;
apt install httrack webhttrack;
exit; # log off root (or press Ctrl+D)
# start the local web interface on port 8080
webhttrack

start browser and go to localhost:8080
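
the httrack package installed above also provides a command line tool, so the browser interface is optional. A minimal sketch, assuming the workspace directory from the wget section; the function names are made up for this example:

```shell
#!/bin/bash
# Sketch: run httrack from the command line instead of the web interface.
# target_dir and mirror_with_httrack are hypothetical names.

# Print the directory a site is stored in: <workspace>/<host name>
target_dir() {
    local url="$1" workspace="$2"
    local host="${url#*://}"   # strip the scheme, e.g. "https://"
    host="${host%%/*}"         # strip everything after the host name
    echo "$workspace/$host"
}

# Mirror one site; -O sets the output path, -v turns on verbose output.
mirror_with_httrack() {
    local url="$1" workspace="${2:-/offline_websites_workspace}"
    httrack "$url" -O "$(target_dir "$url" "$workspace")" -v
}
```

usage: `mirror_with_httrack "https://domain.com/"` would store the copy in /offline_websites_workspace/domain.com.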

problem:

there seems to be no option for webHTTrack to convert absolute (https://domain.com/image/file.jpg) into relative (./image/file.jpg) paths?

this kind of sucks (wget can do that!)
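
as a workaround, absolute links in an already downloaded copy can be rewritten afterwards with GNU sed. This is only a rough sketch and not an httrack feature: the function name is made up, only *.html files are touched, and the base URL is used as a plain sed pattern (a dot matches any character). Try it on a copy of the download first.

```shell
#!/bin/bash
# Sketch: turn https://domain.com/image/file.jpg into ./image/file.jpg
# (or ../image/file.jpg and so on for pages in subdirectories).
# relativize_links is a hypothetical helper.
relativize_links() {
    local root="${1%/}" base_url="${2%/}"
    local file rel slashes prefix
    while IFS= read -r -d '' file; do
        rel="${file#"$root"/}"        # path of the file relative to the site root
        slashes="${rel//[^\/]/}"      # keep only the slashes: one per directory level
        prefix="./"
        if [ -n "$slashes" ]; then
            prefix="${slashes//\//../}"   # one "../" per directory level
        fi
        # replace the absolute site prefix in place (GNU sed)
        sed -i "s|${base_url}/|${prefix}|g" "$file"
    done < <(find "$root" -type f -name '*.html' -print0)
}
```

usage: `relativize_links /offline_websites_workspace/domain.com "https://domain.com"`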

https://www.linux-magazine.com/Online/Features/WebHTTrack-Website-Copier


admin