wget way:

wget --no-check-certificate --limit-rate=5k --random-wait --recursive --no-clobber --page-requisites --html-extension --convert-links https://domain.com

# remove "--random-wait" and "--limit-rate" if it is your own website and bandwidth is not a concern
vim /scripts/download_website.sh; # create a new file and fill it with this content

#!/bin/bash
# recursively download the website passed as the first argument for offline use
wget --no-check-certificate --limit-rate=5k --random-wait --recursive --no-clobber --page-requisites --html-extension --convert-links "$1"

chmod u+x /scripts/download_website.sh; # mark script executable

mkdir /offline_websites_workspace; # create a new dir to hold the offline copies

cd /offline_websites_workspace;

/scripts/download_website.sh "https://domain.com"; # recursively download domain.com
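
optional: a slightly more defensive variant of the script (a sketch, assuming bash) that aborts with a usage hint when no URL is given:

#!/bin/bash
# abort with a usage hint if no URL argument was given
if [ -z "$1" ]; then
    echo "usage: $0 <url>" >&2
    exit 1
fi
wget --no-check-certificate --limit-rate=5k --random-wait --recursive --no-clobber --page-requisites --html-extension --convert-links "$1"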

options explained (a combined example invocation follows this list):

  • --mirror
    • Turn on options suitable for mirroring.
    • This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings.
    • It is currently equivalent to -r -N -l inf --no-remove-listing.
  • --no-check-certificate (basically: accept https even with broken or self-signed certificates)
    • Don't check the server certificate against the available certificate authorities. Also don't require the URL host name to match the common name presented by the certificate.
  • --limit-rate=5k
    • Limit the download speed to 5 KBytes per second.
  • --random-wait
    • Some web sites may perform log analysis to identify retrieval programs such as Wget by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using the --wait option, in order to mask Wget's presence from such analysis.
    • A 2001 article in a publication devoted to development on a popular consumer platform provided code to perform this analysis on the fly. Its author suggested blocking at the class C address level to ensure automated retrieval programs were blocked despite changing DHCP-supplied addresses.
    • The --random-wait option was inspired by this ill-advised recommendation to block many unrelated users from a web site due to the actions of one.
  • --recursive
    • download the entire Web site.
  • --domains website.org
    • don't follow links outside website.org.
  • --no-parent
    • don't ascend to the parent directory, i.e. don't follow links outside the starting directory (tutorials/html/ in the credited article's example).
  • --page-requisites
    • get all the elements that compose the page (images, CSS and so on).
  • --html-extension
    • save files with the .html extension.
    • (newer wget versions call this --adjust-extension; --html-extension is kept as a deprecated alias)
  • --convert-links
    • convert links so that they work locally, off-line (./relative/paths/)
    • After the download is complete, convert the links in the document to make them suitable for local viewing.
    • This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
    • Each link will be changed in one of the two ways:
      • The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.
      • The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.
  • --restrict-file-names=windows
    • modify filenames so that they will work on Windows as well (not needed here).
  • --no-clobber
    • allows easy resume if the connection breaks down: don't overwrite any existing files (useful when the download is interrupted and has to be resumed)

creditz: http://www.linuxjournal.com/content/downloading-entire-web-site-wget

manpage: wget.man.txt

WebHTTrack

httrack is a tool dedicated solely to one purpose: downloading entire websites

warning: some websites have brute force attack detection and prevention in place (anti-DDoS) and might block the copy process halfway through

setup:

hostnamectl; # tested on
   Static hostname: DebianLaptop
  Operating System: Debian GNU/Linux 9 (stretch)
            Kernel: Linux 4.9.0-13-amd64
      Architecture: x86-64
su - root;
apt update;
apt install httrack webhttrack;
Ctrl+D # logoff root
# start local webserver:8080
webhttrack

start browser and go to localhost:8080
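
alternatively httrack can also be run directly from the CLI without the web GUI; a minimal sketch (based on the httrack manpage flags -O output path, -A max transfer rate in bytes/s, -c number of connections; URL and path are placeholders) and throttling like this may also help against the anti-DDoS blocking mentioned above:

httrack "https://domain.com" -O /offline_websites_workspace/domain.com -A 5000 -c 2; # mirror with max 5000 bytes/s over 2 connections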

problem:

there seems to be no option in WebHTTrack to convert absolute paths (https://domain.com/image/file.jpg) into relative ones (./image/file.jpg)?

this kind of sucks (wget can do that!)
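
a crude workaround (an assumption/sketch, not a built-in WebHTTrack feature): post-process the mirrored .html files with sed to rewrite absolute URLs into relative ones; domain.com and the path are placeholders, and this naive substitution is only correct for pages at the mirror root (and may break URLs inside scripts):

cd /offline_websites_workspace/domain.com; # go to the mirrored site root (hypothetical path)
find . -name "*.html" -exec sed -i 's|https\?://domain\.com/|./|g' {} +; # rewrite absolute links to relative ones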

https://www.linux-magazine.com/Online/Features/WebHTTrack-Website-Copier
