vim /scripts/download_website.sh; # create new file and fill with this content
#!/bin/bash
wget --no-check-certificate --limit-rate=5k --random-wait --recursive --no-clobber --page-requisites --html-extension --convert-links "$1"
chmod u+x /scripts/download_website.sh; # mark script executable
mkdir /offline_websites_workspace; # create new dir where to offline the website
cd /offline_websites_workspace;
/scripts/download_website.sh "https://domain.com"; # recursively download domain.com
- --no-check-certificate (basically: let https downloads proceed even when the certificate cannot be verified)
- Don’t check the server certificate against the available certificate authorities. Also don’t require the URL host name to match the common name presented by the certificate.
- --limit-rate=5k: limit the download speed to 5 KBytes per second.
- --random-wait: some web sites may perform log analysis to identify retrieval programs such as Wget by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0.5 and 1.5 * wait seconds, where wait was specified using the --wait option, in order to mask Wget’s presence from such analysis.
- A 2001 article in a publication devoted to development on a popular consumer platform provided code to perform this analysis on the fly. Its author suggested blocking at the class C address level to ensure automated retrieval programs were blocked despite changing DHCP-supplied addresses.
- The --random-wait option was inspired by this ill-advised recommendation to block many unrelated users from a web site due to the actions of one.
- --recursive: download the entire web site.
- --domains website.org: don’t follow links outside website.org (not used in the script above).
- --no-parent: don’t follow links outside the directory tutorials/html/ (not used in the script above).
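- --page-requisites: get all the elements that compose the page (images, CSS and so on).
- --html-extension: save files with the .html extension (newer wget versions call this option --adjust-extension).
- --convert-links: convert links so that they work locally, off-line (absolute to relative).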
- After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
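- Each link will be changed in one of the two ways:
- The links to files that have been downloaded by Wget will be changed to refer to the file they point to as a relative link.
- The links to files that have not been downloaded by Wget will be changed to include host name and absolute path of the location they point to.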
- --restrict-file-names=windows: modify filenames so that they also work on Windows. Not needed here, so the script above leaves it out.
- --no-clobber: easy resume if the connection breaks down: don’t overwrite any existing files (used in case the download is interrupted and resumed).
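The option list above can be gathered into one commented sketch. The helper name wget_mirror_args is hypothetical, and --wait=2 is an assumption that is not in the original script: the description of --random-wait says the delay is a multiple of the --wait value, so a base wait is added here to give it something to vary (roughly 1 to 3 seconds). The function only prints the arguments, it does not download anything by itself.

```shell
#!/bin/bash
# Sketch only: prints the wget arguments from the script above, one per line.
# wget_mirror_args is a hypothetical helper name, not part of wget.

wget_mirror_args() {
    local url="$1"
    if [ -z "$url" ]; then
        echo "usage: wget_mirror_args URL" >&2
        return 1
    fi
    local args=(
        --no-check-certificate   # do not verify the server certificate
        --limit-rate=5k          # limit download speed to 5 KBytes per second
        --wait=2                 # ASSUMPTION: base delay in seconds, not in the original script
        --random-wait            # vary the delay between 0.5 and 1.5 * wait
        --recursive              # download the entire web site
        --no-clobber             # do not overwrite existing files (resume)
        --page-requisites        # images, CSS and so on
        --html-extension         # save files with the .html extension
        --convert-links          # make links work locally, off-line
        "$url"
    )
    printf '%s\n' "${args[@]}"
}
```

To actually start a download, the printed arguments can be read back into an array, for example with mapfile -t args < <(wget_mirror_args "https://domain.com") followed by wget "${args[@]}".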
httrack is a dedicated “download the entire website” tool, written solely for this purpose.
warning: some websites have brute-force attack detection and prevention in place (anti-DDoS) and might block the copy process halfway
hostnamectl; # tested on
Static hostname: DebianLaptop
Operating System: Debian GNU/Linux 9 (stretch)
Kernel: Linux 4.9.0-13-amd64
Architecture: x86-64
su - root;
apt update;
apt install httrack webhttrack;
Ctrl+D # logoff root
# start local webserver:8080
webhttrack
start browser and go to localhost:8080
there seems to be no option for webHTTrack to convert absolute (https://domain.com/image/file.jpg) into relative (./image/file.jpg) paths?
this kind of sucks (wget can do that!)
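If a finished mirror does end up with absolute links, one possible workaround is to rewrite them afterwards. The following is a minimal sketch, not a feature of httrack or webhttrack: absolute_to_relative is a hypothetical helper, and it only touches .html files in the top-level directory of the mirror, because that is the only place where ./ is the correct relative prefix. Files in subdirectories would need ../ prefixes and are deliberately skipped.

```shell
#!/bin/bash
# Sketch only: replace "https://domain.com/..." with "./..." in top-level
# .html files of a mirror. absolute_to_relative is a hypothetical helper.

absolute_to_relative() {
    local base_url="${1%/}"   # site URL, one trailing slash removed
    local mirror_dir="$2"     # directory that holds the downloaded files
    local escaped f

    # Escape characters that have a special meaning in a sed regex
    escaped="$(printf '%s' "$base_url" | sed 's/[]\$*.^[]/\\&/g')"

    for f in "$mirror_dir"/*.html; do
        [ -f "$f" ] || continue
        # Write to a temporary file first, then replace the original
        sed "s|${escaped}/|./|g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

Called as absolute_to_relative "https://domain.com" /path/to/mirror, this turns https://domain.com/image/file.jpg into ./image/file.jpg. It is best tried on a copy of the mirror first.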