Replacing the Wget utility with Wget2

12 July 2023 17:25

The Wget2 utility provides the same capabilities as Wget, but the speed is increased by 5-10 times due to parallel execution of requests.

To mirror an entire site, check an existing site for broken links (404), or download a single file, you can use the Wget command-line utility. A free, improved replacement for it exists: Wget2, written by another programmer.

The main improvement in Wget2 is multithreading, which makes downloading a site, or crawling it by following links, 5-10 times faster.

The program must be compiled from source code.

https://github.com/rockdaboot/wget2

  1. First I updated the packages:

    sudo apt-get update
    sudo apt-get upgrade

  2. Installed the packages required for compilation:

    sudo apt-get install autoconf autogen automake autopoint libtool python3 rsync tar pkg-config doxygen pandoc gettext libgnutls30 libidn12 flex libpsl5 libnghttp2-14 lcov

  3. Compilation and installation

Following the project's instructions, "Downloading and building from tarball":

    wget https://gnuwget.gitlab.io/wget2/wget2-latest.tar.gz

    tar xf wget2-latest.tar.gz
    cd wget2-
    ./configure
    make
    make check
    sudo make install

The program is installed as /usr/local/bin/wget2.

  4. Viewing the manual:

    man wget2

Using wget2 is no different from using wget: the program's options are the same.

  5. Example run: checking a site for broken links (404):

    wget2 -o out.log -m -l 3 --save-content-on "404" -T 2 http://example.com/

where:

-o - log file
-m - "mirror": a mirror copy with folders and files
-l 3 - maximum recursion depth when following links
--save-content-on "404" - save content only for responses with code 404
-T 2 - network timeout in seconds.

Only the 404 error pages are saved, that is, pages that were linked from within the site but not found during the crawl. Their addresses make it possible to identify the broken links.

With these settings, requests are sent back to back without pauses, which is optimal for local websites.
For other people's websites, I recommend adding a pause with -w 2, which waits 2 seconds between requests.
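
After such a run, the addresses of the missing pages can be read off the saved files. The function below is a minimal sketch, not part of Wget2: it assumes the crawl was started in the current directory, so the saved 404 pages sit under a directory named after the host, and it assumes file names map one-to-one onto URL paths (escaped characters or added index.html names would break that). The name list_saved_404 is hypothetical.

```shell
# Print every file saved under the host directory as a URL.
# Assumption: the directory layout mirrors the URL paths of the site.
list_saved_404() {
    host_dir=$1          # e.g. example.com (the directory wget2 created)
    scheme=${2:-http}    # scheme to prepend; http unless told otherwise
    find "$host_dir" -type f | sort | sed "s|^|$scheme://|"
}
```

For example, `list_saved_404 example.com` prints one URL per saved error page; searching the site's sources for those addresses then locates the broken links.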


Wget2 man page (translated from the Russian version)

WGET2(1) GNU Wget2 2.0.1 WGET2(1)

Name
wget2 - a recursive metalink/file/website downloader.

Synopsis
wget2 [options]... [URL]...

Description
GNU Wget2 is a free utility for non-interactively downloading files from the Internet. It supports HTTP and HTTPS protocols, as well as retrieving information through HTTP(s) proxies.

Wget2 is a non-interactive tool, meaning it can run in the background while the user is not logged in. This allows you to start a retrieval and disconnect from the system, letting Wget2 finish the work. By contrast, most web browsers require the user's constant presence, which can be a great hindrance when transferring large amounts of data.

Wget2 can follow links in HTML, XHTML, CSS, RSS, Atom and sitemap files to create local versions of remote websites, fully recreating the directory structure of the original site. This is sometimes called recursive downloading. While doing so, Wget2 respects the Robot Exclusion Standard (/robots.txt). Wget2 can be instructed to convert the links in downloaded files to point to local files for offline viewing.

Wget2 was designed for robustness over slow or unstable network connections: if a download fails due to a network problem, it keeps retrying until the whole file has been retrieved. If the server supports partial downloads, it continues from where it left off.

Wget2 Options

Options syntax

(Translator's note: options are optional additional parameters that control the behavior of the program.)

Every option has a long form and sometimes also a short one. Long options are easier to remember but take longer to type. You can freely mix different option styles. So you could write:

wget2 -r --tries=10 https://example.com/ -o log

The space between the option taking an argument and the argument itself can be omitted. Instead of -o log you can write -olog.

You can combine multiple options that require no arguments, like:

wget2 -drc URL

which is equivalent to:

wget2 -d -r -c URL

Since options may be specified after the arguments, you can terminate them with --. So the following command will try to download the URL -x, reporting the failure to log:

wget2 -o log -- -x URL

Options that accept comma-separated lists follow the convention that prefixing them with --no- clears their value. This can be useful for clearing .wget2rc settings. For example, if your .wget2rc sets exclude-directories to /cgi-bin, the following command will first reset this setting and then set it to exclude /priv and /trash. You can also clear the lists in the .wget2rc file.

wget2 --no-exclude-directories -X /priv,/trash

Most options that take no arguments are boolean options, so named because their state can be captured with a yes-or-no ("boolean") variable. A boolean option is either affirmative or negative (beginning with --no-). All such options share several properties.

Affirmative options can be negated by prepending --no- to the option name; negative options can be made affirmative by omitting the --no- prefix. This may seem redundant: if the default for an affirmative option is not to do something, why provide a way to turn it off explicitly? But the startup file may in fact change the default. For example, setting timestamping = on in .wget2rc makes Wget2 download updated files only. Using --no-timestamping is the only way to restore the factory default from the command line.

Basic launch options for Wget2

-V, --version
Display Wget2 version.

-h, --help
Print a help message describing all Wget2 command-line options.

-b, --background
Send the application to the background immediately after launch. If no output file is specified via -o, the output is redirected to the "wget-log" file.

-e, --execute=command
Execute command as if it were part of .wget2rc. A command invoked this way is executed after the commands in .wget2rc, thus taking precedence over them. If you need to specify more than one such command, use multiple instances of -e.

--hyperlink
Use hyperlinks instead of names of downloaded files so that they can be opened from the terminal by clicking on them. Only a few terminal emulators currently support hyperlinks. Enable this option if you know that your terminal supports hyperlinks.

Log and Input File Options

-o, --output-file=logfile
Log all messages to logfile. By default, messages are reported to standard error.

-a, --append-output=logfile
Append to logfile. This is the same as -o, except that it appends to the log instead of overwriting the old log file. If the log file does not exist, a new file is created.

-d, --debug
Turn on debug output, that is, various information important to the Wget2 developers if the program does not work as expected. Your system administrator may have chosen to compile Wget2 without debug support, in which case -d will not work. Note that compiling with debug support is always safe: Wget2 compiled with debug support will not print any debug information unless -d is given explicitly.

-q, --quiet
Turn off Wget2's output.

-v, --verbose
Turn on verbose output, with all the available data. The output is verbose by default.

-nv, --no-verbose
Turn off verbose output without being completely quiet (use -q for that), which means that error messages and basic information are still printed.

--report-speed=type
Report the download speed in units of type. The only accepted values for type are bytes (the default) and bits. This option only works if --progress=bar is also set.

-i, --input-file=file
Read URLs from a local or external file. If "-" is specified as file, the URLs are read from standard input. Use ./- to read from a file literally named "-".

If this option is used, no URLs need to be present on the command line. If there are URLs both on the command line and in the input file, those on the command line are retrieved first. The file is expected to contain one URL per line, unless one of the --force-* options specifies a different format.

If you specify --force-html, the document is treated as HTML. In that case you may have problems with relative links, which you can solve either by adding <base href="url"> to the document or by specifying --base=url on the command line.

If you specify --force-css, the document is treated as CSS.

If you specify --force-sitemap, the document is treated as an XML sitemap.

If you specify --force-atom, the document is treated as an Atom feed.

If you specify --force-rss, the document is treated as an RSS feed.

If you specify --force-metalink, the document is treated as a Metalink description.

If you have problems with relative links, use --base=url on the command line.

-F, --force-html
When input is read from a file via --input-file=file, force it to be treated as an HTML file. This lets you retrieve relative links from existing HTML files on your local disk, either by adding <base href="url"> to the HTML or by using the --base option on the command line.

--force-css
Read and parse the input file as CSS. This lets you retrieve links from existing CSS files on your local disk. You will need the --base option to handle relative links correctly.

--force-sitemap
Read and parse the input file as a Sitemap XML. This lets you retrieve links from existing sitemap files on your local disk. You will need the --base option to handle relative links correctly.

--force-atom
Read and parse the input file as an Atom XML feed. This lets you retrieve links from existing Atom feed files on your local disk. You will need the --base option to handle relative links correctly.

--force-rss
Read and parse the input file as an RSS XML feed. This lets you retrieve links from existing RSS feed files on your local disk. You will need the --base option to handle relative links correctly.

--force-metalink
Read and parse the input file as a Metalink file. This lets you retrieve links from existing Metalink files on your local disk. You will need the --base option to handle relative links correctly.

-B, --base=URL
Resolves relative links using URL as the point of reference when reading links from an HTML file specified via the -i/--input-file option (together with a --force-* option, or when the input file was fetched remotely from a server that describes it as HTML, CSS, Atom or RSS). This is equivalent to the presence of a "BASE" tag in the HTML input file, with URL as the value of its "href" attribute.

For example, if you specify https://example.com/bar/a.html as the base URL and Wget2 reads the link ../baz/b.html from the input file, it will be resolved to https://example.com/baz/b.html.

--config=FILE
Specify the location of the configuration files you want to use. If you specify more than one file, either as a comma-separated list or with multiple --config options, the files are read in left-to-right order. The files given in the environment variables $SYSTEM_WGET2RC and ($WGET2RC or ~/.wget2rc) are read in that order, followed by the user-supplied configuration files. If set, $WGET2RC replaces ~/.wget2rc.

--no-config
Clears the internal list of configuration files. So if you want to prevent any configuration file from being read, use --no-config on the command line.

--no-config followed by --config=file skips the standard files and reads only the given file.

On supported platforms, Wget2 attempts tilde expansion on file names written in the configuration file, resolving a leading ~ to the user's home directory. To use a file whose name starts with a literal "~", use "./~" or an absolute path.

--rejected-log=logfile [Not implemented yet]
Logs all URL rejections to logfile as comma-separated values. The values include the reason for the rejection, the URL, and the parent URL it was found in.

--local-db
Enables reading and writing of the local database files (default: on).

These are the files used by --hsts, --hpkp, --ocsp and so on.

With --no-local-db you can switch reading and writing off, which is convenient for testing.
This option does not affect the reading of configuration files.

--stats-dns=[FORMAT:]FILE
Save DNS statistics in format FORMAT to file FILE.
FORMAT can be "human" or "csv".
The output file "-" stands for stdout, and "h" is an abbreviation for the human-readable format.

The CSV output format is:
Hostname, IP, Port, Duration
where:
Duration - the response time in milliseconds.
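
As an illustration of how such a file might be used, the sketch below prints the slowest lookups from a DNS statistics CSV. It is not part of Wget2: the function name slowest_dns is hypothetical, and the code assumes that the columns appear in the order listed above and that a header line, if any, has a non-numeric fourth field.

```shell
# Print the N slowest DNS lookups as "duration<TAB>hostname".
# Assumed column order: Hostname,IP,Port,Duration (milliseconds).
slowest_dns() {
    file=$1
    n=${2:-5}
    # Keep rows whose 4th field is a number (this skips a header line),
    # then sort by duration, largest first.
    awk -F, '$4 ~ /^ *[0-9]+ *$/ { printf "%d\t%s\n", $4, $1 }' "$file" |
        sort -rn | head -n "$n"
}
```

A file produced with `--stats-dns=csv:dns.csv` could then be summarized with `slowest_dns dns.csv 10`.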

--stats-tls=[FORMAT:]FILE
Save TLS statistics (for secure connections) in format FORMAT to file FILE.
FORMAT can be "human" or "csv".
The output file "-" stands for stdout, and "h" is an abbreviation for the human-readable format.

The CSV output format is:
Hostname, TLSVersion, FalseStart, TFO, Resumed, ALPN, HTTPVersion, Certificates, Duration
where:
TLSVersion - 1, 2, 3, 4, 5 stand for SSL3, TLS1.0, TLS1.1, TLS1.2 and TLS1.3; -1 means "none".
FalseStart - whether the connection used TLS False Start; -1 if not applicable.
TFO - whether the connection used TCP Fast Open; -1 if TFO was disabled.
Resumed - whether the TLS session was resumed.
ALPN - the ALPN negotiation string.
HTTPVersion - 0 for HTTP 1.1 and 1 for HTTP 2.0.
Certificates - the size of the server's certificate chain.
Duration - the duration in milliseconds.

--stats-ocsp=[FORMAT:]FILE
Save OCSP statistics in format FORMAT to file FILE.
FORMAT can be "human" or "csv".
The output file "-" stands for stdout, and "h" is an abbreviation for the human-readable format.

The CSV output format is:
Hostname, Stapling, Valid, Revoked, Ignored
where:
Stapling - whether an OCSP response was stapled.
Valid - how many server certificates were valid according to OCSP.
Revoked - how many server certificates were revoked according to OCSP.
Ignored - how many server certificates were ignored with regard to OCSP.

--stats-server=[FORMAT:]FILE
Save server statistics in format FORMAT to file FILE.
FORMAT can be "human" or "csv".
The output file "-" stands for stdout, and "h" is an abbreviation for the human-readable format.

The CSV output format is:
Hostname, IP, Scheme, HPKP, NewHPKP, HSTS, CSP
where:
Scheme - 0, 1, 2 mean none, http and https respectively.
HPKP - HTTP Public Key Pinning: 0, 1, 2, 3 mean 'No HPKP', 'HPKP matched', 'HPKP did not match', 'HPKP error'.
NewHPKP - whether the server sent an HPKP (Public-Key-Pins) header.
HSTS - whether the server sent an HSTS (Strict-Transport-Security) header.
CSP - whether the server sent a CSP (Content-Security-Policy) header.

--stats-site=[FORMAT:]FILE
Save site statistics in format FORMAT to file FILE.
FORMAT can be "human" or "csv".
The output file "-" stands for stdout, and "h" is an abbreviation for the human-readable format.

The CSV output format is:
ID, ParentID, URL, Status, Link, Method, Size, SizeDecompressed, TransferTime, ResponseTime, Encoding, Verification
where:
ID - a unique ID for the statistics record.
ParentID - the ID of the parent document, relevant in --recursive mode.
URL - the URL (address) of the document.
Status - the HTTP response code, or 0 if not applicable.
Link - 1 means 'direct link', 0 means 'redirection link'.
Method - 1, 2, 3 mean the GET, HEAD and POST request types.
Size - the size of the downloaded body (a theoretical value for HEAD requests).
SizeDecompressed - the size of the decompressed body (0 for HEAD requests).
TransferTime - the time in milliseconds between the start of the request and the completion of the download.
ResponseTime - the time in milliseconds between the start of the request and the first response packet.
Encoding - 0, 1, 2, 3, 4, 5 mean the server-side compression was 'identity', 'gzip', 'deflate', 'lzma/xz', 'bzip2', 'brotli', 'zstd', 'lzip'
Verification - the PGP verification status. 0, 1, 2, 3 mean 'none', 'valid', 'invalid', 'bad', 'missing'.
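
The site statistics offer another way to find broken links: filter the CSV by the Status column. The function below is a sketch under stated assumptions, not a Wget2 feature. The name urls_with_status is hypothetical, the column order is taken from the list above (URL third, Status fourth), and the simple comma split assumes that URLs in the file contain no commas.

```shell
# Print the URLs whose HTTP status equals the requested code.
# Assumed column order: ID,ParentID,URL,Status,...
urls_with_status() {
    file=$1
    status=$2
    awk -F, -v want="$status" '$4 == want { print $3 }' "$file"
}
```

With a file written by `--stats-site=csv:site.csv`, `urls_with_status site.csv 404` would list the addresses that returned 404.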

Download options

--bind-address=ADDRESS
When creating TCP/IP client connections, bind to a specific IP address on the local machine. The address can be specified as a hostname or an IP address. This option can be useful if your machine has multiple IP addresses.

--bind-interface=INTERFACE
When creating TCP/IP client connections, bind to an interface on the local machine. Interface can be specified as the name of the network interface. This option can be useful if your computer has multiple network interfaces. However, the option only works when wget2 is running with elevated privileges (on GNU/Linux: root/sudo or sudo setcap cap_net_raw+ep "path to wget | wget2").

-t, --tries=number
Set the number of attempts. Specify 0 or Inf to repeat indefinitely. By default, the program will retry requests 20 times, except for fatal errors such as "connection refused" or "not found" (404), which are not retried.

--retry-on-http-error=list
Provide a comma-separated list of HTTP status codes on which Wget2 will retry the download. List elements may contain wildcards. If an HTTP code is prefixed with !, it will not be retried. This is useful when you want to retry on everything except a few codes. For example, to retry every failed download except those that returned 404:
wget2 --retry-on-http-error=*,!404 https://example.com/

Please keep in mind that "200" is the only forbidden code. If it is included in the list, Wget2 will ignore it. The maximum number of download attempts is set with the --tries option.

-O, --output-document=file
The documents will not be written to their respective separate files; instead, all of them are concatenated and written to file. If "-" is used as file, the documents are printed to standard output, which disables link conversion. Use ./- to print to a file literally named "-". To keep Wget2 status messages from being mixed into the file contents, use -q in combination with "-" (this differs from how Wget 1.x behaves).

Using -r or -p with -O may not work as you expect: Wget2 will not download the first file to file and then download the rest to their normal names. All downloaded content will be placed in file.

The combination with -nc is only accepted if the given output file does not exist.

When used together with -c, Wget2 attempts to continue downloading the file whose name is passed to the option, whether or not the actual file already exists on disk. This allows users to download a file under a temporary name alongside the actual file.

Note that the combination with -k is only permitted when downloading a single document, since in that case it simply converts all relative URIs to external ones; -k makes no sense for multiple URIs that are all downloaded into a single file; -k can only be used when the output is a regular file.

Compatibility note: Wget 1.x used to treat -O as analogous to shell redirection. Wget2 does not handle the option the same way, so the file is not always newly created. The file's timestamps are not touched unless it is actually written to. As a result, both -c and -N are now supported in combination with this option.

-nc, --no-clobber
If a file is downloaded more than once into the same directory, Wget2's behavior depends on a few options, including -nc.

In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.

When Wget2 is run without -N, -nc, -r or -p, downloading the same file into the same directory preserves the original copy and names the second copy file.1. If the file is downloaded yet again, the third copy is named file.2, and so on. (This is also the behavior with -nd, even if -r or -p are in effect.) Use --keep-extension to use an alternative file naming pattern.

When -nc is specified, this behavior is suppressed, and Wget2 will refuse to download newer copies of the file. Therefore, "no-clobber" is actually a misnomer in this mode: it is not clobbering that is prevented (the numeric suffixes were already preventing that), but rather the saving of multiple versions.

When Wget2 is run with -r or -p, but without -N, -nd or -nc, re-downloading a file results in the new copy simply overwriting the old one. Adding -nc prevents this: the original version is preserved and any newer copies on the server are ignored.

When Wget2 is run with -N, with or without -r or -p, the decision whether to download a newer copy of a file depends on the local and remote timestamps and the size of the file. -nc may not be specified at the same time as -N. The combination with -O/--output-document is only accepted if the given output file does not exist.

Note that when -nc is specified, files with the suffixes .html or .htm are loaded from the local disk and parsed as if they had been retrieved from the Web.
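
The default numbering scheme described above (file, file.1, file.2, ...) can be modeled in a few lines. This is only a sketch of the documented behavior, not Wget2's actual code, and the function name next_free_name is hypothetical.

```shell
# Return the first name among NAME, NAME.1, NAME.2, ... that does not exist.
next_free_name() {
    name=$1
    # If nothing is in the way, the plain name is used.
    [ -e "$name" ] || { printf '%s\n' "$name"; return; }
    i=1
    while [ -e "$name.$i" ]; do
        i=$((i + 1))
    done
    printf '%s\n' "$name.$i"
}
```

With -nc, by contrast, no new name is chosen at all: the existing file is kept and the download is skipped.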

--backups=backups
Before overwriting a file, back up the existing file by adding the suffix .1 to its name. Such backup files are rotated to .2, .3, and so on, up to the number given by backups (and lost beyond that).

-c, --continue
Continue receiving the partially downloaded file. This is useful when you want to finish a download started by a previous instance of WGET2 or another program. For example:

wget2 -c https://example.com/tarball.gz

If there is a file named tarball.gz in the current directory, Wget2 will assume that it is the first portion of the remote file and will ask the server to continue the retrieval from an offset equal to the length of the local file.

Note that you do not need to specify this option if you just want the current invocation of Wget2 to retry downloading a file should the connection be lost midway through. This is the default behavior. -c only affects the resumption of downloads that were started before this invocation of Wget2 and whose local files are still present.

Without -c, the previous example would simply download the remote file to tarball.gz.1, leaving the truncated tarball.gz file alone.

If you use -c on a non-empty file and it turns out that the server does not support continued downloading, Wget2 will refuse to start the download from scratch, because that would effectively destroy the existing contents. If you really want the download to start from scratch, remove the file.

If you use -c on a file that is the same size as the one on the server, Wget2 will refuse to download the file and print an explanatory message. The same happens when the file on the server is smaller than the local one (presumably because it has changed on the server since your last download attempt). Because "continuing" is not meaningful, no download occurs.

On the other side of the coin, with -c any file that is bigger on the server than locally is considered an incomplete download, and only "(length(remote) - length(local))" bytes are downloaded and appended to the end of the local file. This behavior can be desirable in certain cases. For example, you can use wget2 -c to download just the new portion that has been appended to a data collection or a log file.

However, if the file is bigger on the server because it has been changed, as opposed to just appended to, you will end up with a garbled file. Wget2 has no way of verifying that the local file is really a valid prefix of the remote file. You need to be especially careful about this when using -c together with -r, since every file will be considered an "incomplete download" candidate.

Another case where you will get a garbled file with -c is a broken HTTP proxy that inserts a "transfer interrupted" string into the local file. In the future a "rollback" option may be added to deal with this case.

Note that -c only works with HTTP servers that support the "Range" header.
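
The size comparison behind -c, as described above, can be summarized in a small decision function. This is an illustrative model of the documented rules only, not Wget2's implementation; the name continue_action and its output strings are hypothetical.

```shell
# Describe what a "-c" run would do, given the local and remote sizes in bytes.
continue_action() {
    local_size=$1
    remote_size=$2
    if [ "$local_size" -ge "$remote_size" ]; then
        # Same size, or the remote file is smaller: nothing to continue.
        echo "skip"
    else
        # Remote is larger: only the missing tail is requested and appended.
        echo "fetch $((remote_size - local_size)) bytes from offset $local_size"
    fi
}
```

For example, `continue_action 100 250` reports that 150 bytes would be fetched starting at offset 100. The function cannot tell whether the local file is a true prefix of the remote one, which is exactly the limitation noted above.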

--start-pos=OFFSET
Start downloading at the zero-based position OFFSET. The offset can be expressed in bytes, kilobytes with the suffix 'k', megabytes with the suffix 'm', and so on.

--start-pos takes precedence over --continue. When both --start-pos and --continue are specified, WGET2 will issue a conflict warning.

The server must support continued downloading, otherwise --start-pos cannot help. See the description of -c for details.

--progress=type
Select the type of progress indicator you want to use. The supported indicator types are "none" and "bar".

The "bar" type draws an ASCII progress bar (a.k.a. "thermometer" display) indicating the status of the retrieval.

If the output is a TTY, "bar" is the default. Otherwise the progress bar is switched off unless --force-progress is used.

The "dot" type is currently not supported, but it does not raise an error, so as not to break Wget command lines in existing scripts.

The parameterized types "bar:force" and "bar:force:noscroll" add the effect of --force-progress. They exist for better compatibility with Wget.

--force-progress
Force Wget2 to display the progress bar.

By default, Wget2 only displays the progress bar in --verbose mode. However, you may want Wget2 to display the progress bar together with other modes such as --no-verbose or --quiet. This is often desirable when invoking Wget2 to download several small or large files. In such a case, Wget2 can simply be invoked with this option for cleaner output on the screen.

When used together with --output-file, this option also forces the progress bar to be printed to stderr.

-N, --timestamping
Turn on timestamping.

--no-if-modified-since
In -N mode, do not send the "If-Modified-Since" header. Send a preliminary HEAD request instead. This only has an effect in -N mode.

--no-use-server-timestamps
Do not set the local file's timestamp to the one on the server.

By default, when a file is downloaded, its timestamps are set to match those of the remote file. This allows the use of --timestamping on subsequent invocations of Wget2. However, it is sometimes useful to base the local file's timestamp on when it was actually downloaded; the --no-use-server-timestamps option is provided for that purpose.

-S, --server-response
Print response headers sent by HTTP servers.

--spider
When invoked with this option, Wget2 behaves like a web spider: it does not download the pages, it only checks that they are there. For example, you can use Wget2 to check the bookmarks.html file:

wget2 --spider --force-html -i bookmarks.html

This feature needs much more work for Wget2 to get close to the functionality of real web spiders.

-T seconds, --timeout=seconds
Set the network timeout to seconds. This is equivalent to specifying --dns-timeout, --connect-timeout and --read-timeout, all at the same time.

When interacting with the network, Wget2 can check for a timeout and abort the operation if it takes too long. This prevents anomalies such as hanging reads and endless connection attempts. The only timeout enabled by default is the read timeout of 900 seconds. Setting a timeout to 0 disables it altogether. Unless you know what you are doing, it is best not to change the default timeout settings.

All timeout-related options accept decimal values, including fractional ones. For example, 0.1 seconds is a legal (though unwise) timeout. Subsecond timeouts are useful for checking server response times or for testing network latency.

--dns-timeout=seconds
Sets DNS Timeout to seconds. DNS lookups that do not complete within the specified time will fail. By default, there is no timeout for DNS lookups, except for what is implemented by the system libraries.

--connect-timeout=seconds
Sets the connection timeout to seconds. TCP connections that take longer will be terminated. By default, there is no connection timeout, other than what is implemented by the system libraries.

--read-timeout=seconds
Sets the read (and write) timeout to seconds. The "time" of this timeout refers to idle time: if at any point during the download no data is received for more than the specified number of seconds, reading fails and the download is restarted. This option does not directly affect the duration of the entire download.

Of course, the remote server may decide to close the connection before this option takes effect. The default read timeout is 900 seconds.

--limit-rate=amount
Limits the download speed to amount bytes per second. The amount can be expressed in bytes, kilobytes with the suffix k, or megabytes with the suffix m. For example, --limit-rate=20k limits the retrieval rate to 20 KB/s. This is useful when, for whatever reason, you do not want Wget2 to consume all the available bandwidth.

This option accepts decimal numbers, usually in conjunction with the size suffixes; for example, --limit-rate=2.5k is a legal value.

Note that Wget2 implements the limit by sleeping for an appropriate amount of time after a network read that took less time than the specified rate allows. Eventually this strategy causes the TCP transfer to slow down to approximately the specified rate. However, it may take some time for this balance to be reached, so do not be surprised if the rate limit does not work well with very small files.
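
The throttling idea, sleeping after a read that finished faster than the target rate allows, comes down to simple arithmetic. The sketch below is an illustration of that arithmetic only, using whole milliseconds; it is not taken from Wget2's source, and the name throttle_sleep_ms is hypothetical.

```shell
# How long to sleep (in ms) after reading BYTES in ELAPSED_MS,
# so that the average speed does not exceed RATE bytes per second.
throttle_sleep_ms() {
    bytes=$1
    elapsed_ms=$2
    rate=$3
    # Time the read should have taken at the target rate.
    target_ms=$((bytes * 1000 / rate))
    if [ "$target_ms" -gt "$elapsed_ms" ]; then
        echo $((target_ms - elapsed_ms))
    else
        echo 0    # the read was already slow enough
    fi
}
```

For instance, at a limit of 20480 bytes per second (20k), a 20480-byte read that took 500 ms calls for a further 500 ms of sleep. A very small file may finish in a single read, which is why the limit has little visible effect there.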

-w seconds, --wait=seconds
Wait the specified number of seconds between requests. Using this option is recommended, as it lightens the server load by making the requests less frequent. Instead of seconds, the time can be specified in minutes using the suffix "m", in hours using the suffix "h", or in days using the suffix "d".

Specifying a large value for this option is useful if the destination network or host is down (in a state of failure) so that WGET2 can wait long enough to reasonably expect the network error to be corrected before retrying. The wait interval specified by this function is affected by the "--random-wait" option, if present.

--waitretry=seconds
If you do not want Wget2 to wait between every request, but only between retries of failed downloads, you can use this option. Wget2 uses linear backoff, waiting 1 second after the first failure on a given file, then 2 seconds after the second failure on that file, up to the maximum number of seconds you specify.

By default, Wget2 uses a value of 10 seconds.
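
The linear backoff described for --waitretry can be written out as a schedule of waits. The function below is a minimal model of that description, with a hypothetical name (waitretry_schedule); it prints the wait before each retry and does not call Wget2.

```shell
# Print the wait in seconds before each of RETRIES retries: 1, 2, 3, ...
# capped at MAX (10 here, matching the default stated above).
waitretry_schedule() {
    retries=$1
    max=${2:-10}
    i=1
    while [ "$i" -le "$retries" ]; do
        if [ "$i" -lt "$max" ]; then
            echo "$i"
        else
            echo "$max"
        fi
        i=$((i + 1))
    done
}
```

With `--waitretry=3`, five consecutive failures on one file would therefore wait 1, 2, 3, 3 and 3 seconds.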

--random-wait
Some websites may perform log analysis to identify retrieval programs such as Wget2 by looking for statistically significant similarities in the time between requests. This option causes the time between requests to vary between 0.5 and 1.5 times the number of seconds given with --wait. By using --random-wait you can try to mask Wget2's presence from such analysis.

--no-proxy[=exceptions]
If no argument is given, Wget2 tries to stay backward compatible with Wget 1.x and does not use proxies, even if the corresponding *_proxy environment variable is defined.

If the argument is a comma-separated list of exceptions (domains/IP addresses), these exceptions are accessed without a proxy. The option overrides the no_proxy environment variable.

-Q quota, --quota=quota
Specify a quota (volume limit) for automatic downloading. The quota value can be specified in bytes (default), kilobits (suffixed with "k"), or megabytes (suffixed with "M").

Please note that the quota will never affect the download of a single file. So if you specify

wget2 -Q10k https://example.com/bigfile.gz

the file bigfile.gz will be downloaded in full anyway. The same happens even when several URLs are specified on the command line. However, the quota is applied when downloading recursively or from a list in an input file. This way you can safely run the following command without fear of filling up your hard drive:

wget2 -Q2m -i website

The download will be aborted once the quota is exceeded.

Setting the quota to 0 or Inf removes the limit from the download quota.
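The quota rules above can be modeled with a short sketch. It is not Wget2's implementation: the names parse_quota and download_all are hypothetical, the multipliers are assumed to be powers of 1024, and the quota is assumed to be checked only between files.

```python
def parse_quota(text):
    """Return the quota in bytes, or None when there is no limit."""
    text = text.strip().lower()
    if text in ("0", "inf"):
        return None
    units = {"k": 1024, "m": 1024 ** 2}  # assumption: binary multiples
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def download_all(file_sizes, quota):
    """Return the sizes of the files that would be fetched.

    The check happens before each file starts, so a file that has
    started is always downloaded completely.
    """
    total, fetched = 0, []
    for size in file_sizes:
        if quota is not None and total > quota:
            break  # quota exceeded: abort the remaining downloads
        fetched.append(size)
        total += size
    return fetched


print(parse_quota("10k"))                              # bytes in 10 kilobytes
print(download_all([5_000_000], parse_quota("10k")))   # single file is still fetched
print(download_all([4000, 8000, 3000], parse_quota("10k")))
```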

--restrict-file-names=modes
Configure which characters found in remote URLs should be replaced with escape sequences when creating local filenames. Characters prohibited by this option are escaped, that is, replaced with %HH, where HH is the hexadecimal number corresponding to the prohibited character. This option can also be used to force all alphabetic case to be converted to lower or upper case.

By default, Wget2 escapes characters that are not valid or safe as part of file names on your operating system, as well as control characters, which are typically not printable. This option is useful for changing these defaults, perhaps because you are downloading to a non-native partition, because you want to disable the escaping of control characters, or because you want to further restrict characters to those within the ASCII range.

Modes are a comma-separated set of text values. Valid values are "unix", "windows", "nocontrol", "ascii", "lowercase" and "uppercase". The values "unix" and "windows" are mutually exclusive (one overrides the other), as are "lowercase" and "uppercase". The last two are special cases in that they do not change the set of characters that will be escaped, but rather force local file paths to be converted to either lowercase or uppercase.

When "unix" is specified, Wget2 escapes the character / and the control characters in the ranges 0–31 and 128–159. This is the default on Unix-like operating systems.

When "windows" is specified, Wget2 escapes the characters \, |, /, :, ?, ", *, <, >, and the control characters in the ranges 0–31 and 128–159. In addition, Wget2 in Windows mode uses + instead of : to separate the hostname from the port in local file names, and uses @ instead of ? to separate the query portion of the file name from the rest. Therefore, a URL that would be saved as www.xemacs.org:4300/search.pl?input=blah in Unix mode would be saved as www.xemacs.org+4300/search.pl@input=blah in Windows mode. This is the default mode on Windows.

If you specify nocontrol, control character escaping is also disabled. This setting may make sense when you are downloading URLs whose names contain UTF-8 characters on a system that can store and display file names in UTF-8 (some of the possible byte values used in UTF-8 byte sequences fall within the range of values designated by Wget2 as "control characters").

The ascii mode is used to indicate that any bytes whose values fall outside the ASCII character range (that is, greater than 127) should be escaped. This can be useful when saving file names whose encoding does not match the one used locally.
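The %HH escaping for these modes can be sketched for a single file name component. This is a simplified model, not Wget2's code: escape_component is a hypothetical name, the character sets are the ones given in the Wget manual (/ for unix; \ | / : ? " * < > for windows), and the special + and @ separators of Windows mode as well as the lowercase/uppercase modes are deliberately left out.

```python
UNIX_UNSAFE = b"/"
WINDOWS_UNSAFE = b'\\|/:?"*<>'


def escape_component(name, modes=("unix",)):
    """Escape one file name component, byte by byte, as %HH sequences."""
    unsafe = WINDOWS_UNSAFE if "windows" in modes else UNIX_UNSAFE
    out = bytearray()
    for byte in name.encode("utf-8"):
        control = byte <= 31 or 128 <= byte <= 159
        if (byte in unsafe
                or (control and "nocontrol" not in modes)
                or (byte > 127 and "ascii" in modes)):
            out += b"%%%02X" % byte   # replace the byte with %HH
        else:
            out.append(byte)
    return bytes(out)


print(escape_component("a:b?c", ("windows",)))
print(escape_component("a:b?c", ("unix",)))
# U+0410 is encoded as D0 90; byte 0x90 lies in the 128-159 "control" range.
print(escape_component("\u0410", ("unix",)))
print(escape_component("\u0410", ("unix", "nocontrol")))
```

The last two calls show why nocontrol matters for UTF-8 names: without it, one byte of a perfectly ordinary letter is escaped.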

-4, --inet4-only, -6, --inet6-only
Force the program to connect only to IPv4 or only to IPv6 addresses. With --inet4-only or -4, Wget2 will connect only to IPv4 hosts, ignoring AAAA records in DNS and refusing to connect to IPv6 addresses specified in URLs. Conversely, with --inet6-only or -6, Wget2 will connect only to IPv6 hosts and will ignore A records and IPv4 addresses.

Neither option is usually needed. By default, Wget2 with IPv6 support will use the address family specified in the host's DNS record. If DNS responds with IPv4 and IPv6 addresses, Wget2 will try them sequentially until it finds one it can connect to. (See also the "--prefer-family" option described below.)

These settings can be used to intentionally force the use of IPv4 or IPv6 address families on dual-family systems, typically to facilitate debugging or correct network misconfigurations. Only one of the --inet6-only or --inet4-only options can be specified at a time. Neither option is available in Wget2 compiled without IPv6 support.

--prefer-family=none/IPv4/IPv6

When given a choice of multiple IP addresses, connect to addresses with the specified address family first. By default, the address order returned by DNS is used unchanged.

This avoids spurious errors and connection attempts when accessing hosts that resolve to both IPv6 and IPv4 addresses from IPv4 networks. For example, www.kame.net resolves to 2001:200:0:8002:203:47ff:fea5:3085 and 203.178.141.194. When the preferred family is "IPv4", the IPv4 address is used first; when the preferred family is "IPv6", the IPv6 address is used first; if set to "none", the order of addresses returned by DNS is used unchanged.

Unlike -4 and -6, this option does not deny access to any address family, it only changes the order in which addresses are accessed. Also note that the reordering performed by this option is stable. This does not affect the order of addresses within the same family. That is, the relative order of all IPv4 addresses and all IPv6 addresses remains the same in all cases.
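The stable reordering can be illustrated with a short sketch. It is not Wget2's code; order_addresses is a hypothetical name, and the two addresses from the 2001:db8::/32 and 192.0.2.0/24 documentation ranges are made up for the example.

```python
import ipaddress


def order_addresses(addresses, prefer="none"):
    """Move the preferred family to the front, keeping relative order.

    Python's sorted() is stable, so addresses of the same family stay
    in the order in which DNS returned them.
    """
    if prefer == "none":
        return list(addresses)
    version = 4 if prefer == "IPv4" else 6
    return sorted(addresses,
                  key=lambda a: ipaddress.ip_address(a).version != version)


dns_answer = [
    "2001:200:0:8002:203:47ff:fea5:3085",
    "203.178.141.194",
    "2001:db8::1",   # hypothetical extra address
    "192.0.2.1",     # hypothetical extra address
]
print(order_addresses(dns_answer, "IPv4"))
print(order_addresses(dns_answer, "IPv6"))
```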

--tcp-fastopen
Enables TCP Fast Open (TFO) support (default: enabled).

TFO reduces connection latency by one round trip on "hot" connections (a second connection to the same host within a certain time).
This currently works on recent Linux and OS X kernels, for HTTP and HTTPS.

--dns-cache-preload=file
Load a list of IP/name tuples from the given file into the DNS cache.

The file format is similar to /etc/hosts: IP-address space Hostname

This saves time searching for a domain name, which in some cases is a bottleneck. Additionally, this option can be used to simulate the use of the HOSTALIASES environment variable (which is not portable to other OSes).

--dns-cache
Allow DNS caching (default: enabled).

Typically Wget2 remembers the IP addresses it looks up in DNS, so it doesn't have to repeatedly contact the DNS server for the same (usually small) set of hosts it retrieves from. This cache exists only in memory; a new run of Wget2 will contact the DNS again.

However, there have been reports that in some situations it is undesirable to cache hostnames, even for the duration of a short-running application such as Wget2. With the --no-dns-cache option, Wget2 performs a new DNS lookup (more precisely, a new "gethostbyname" or "getaddrinfo" call) every time it establishes a new connection. Note that this setting will not affect caching that may be performed by a resolver library or an external caching layer such as NSCD.

--retry-connrefused
Treat "connection refused" as a temporary error and try again. Normally Wget2 gives up on a URL when it cannot connect to the site, because a connection failure is taken as a sign that the server is not running at all and that retrying will not help. This option is intended for mirroring unreliable sites whose servers tend to disappear for short periods of time.

--user=user, --password=password
Specify a username and password to access files via HTTP. This overrides the search for credentials in the .netrc file (the --netrc option is enabled by default). These two options can be overridden by using the --http-user and --http-password options for HTTP(S) connections.

If neither --http-proxy-user nor --http-proxy-password is specified, these settings are also used for proxy authentication.

--ask-password

Display the password prompt on the command line. Overrides the password set by --password (if specified).

--use-askpass=command
Prompts for username and password using the specified command. Overrides the user and/or password set by --user/--password (if specified).

--no-iri
Disable support for internationalized URIs (IRIs). Use --iri to enable. IRI is enabled by default.

You can set the default IRI support state using the "iri" command in .wget2rc. This setting can be overridden from the command line.

--local-encoding=encoding
Force Wget2 to use the specified encoding as the system encoding. This affects how Wget2 converts URLs given as arguments from the local encoding to UTF-8 for IRI support.

Wget2 uses the "nl_langinfo()" function and then the "CHARSET" environment variable to get the locale. If this fails, ASCII is used.

--remote-encoding=encoding
Force Wget2 to use the encoding as the default remote server encoding. This affects how Wget2 converts URIs found in files from the remote encoding to UTF-8 during recursive fetching. These options are only useful for IRI support, for interpreting non-ASCII characters.

For HTTP, the remote encoding can be obtained from the HTTP header field “Content-Type” and from the HTML meta tag with http-equiv set to “Content-Type”.

--input-encoding=encoding
Use the specified encoding for the input file --input-file with a list of URLs. By default, the local encoding is used.

--unlink
Make Wget2 unlink a file instead of overwriting an existing file. (See -nc above for details on overwriting.) This option is useful for downloading into a directory that contains hard links.

--cut-url-get-vars
Remove HTTP GET parameters from URLs. For example, "main.css?v=123" will be changed to "main.css". Be aware that this may have unintended side effects, for example "image.php?name=sun" will be changed to "image.php". The parameters are cut before the URL is added to the download queue.

--cut-file-get-vars
Remove HTTP GET variables from file names. For example, “main.css?v=123” will be replaced with “main.css”.

Be aware that this may have unintended side effects, for example "image.php?name=sun" will be changed to "image.php". The variables are cut when the file is saved, after downloading.

Filenames derived from the "Content-Disposition" header are not affected by this option (see --content-disposition) and may be a workaround for this problem.

When "--trust-server-names" is used, this setting affects the URL when redirecting.

--chunk-size=size
Download large files in multithreaded chunks. This option specifies the chunk size, given in bytes unless another byte-multiple unit is specified. The default is 0 (disabled).

--max-threads=number
Specifies the maximum number of simultaneous download threads for a resource. The default is 5, but if you want to allow more or less, use this option.

-s, --verify-sig[=fail|no-fail]
Enables PGP signature verification (if there is no "no-" prefix). When enabled, Wget2 will attempt to download PGP signatures and verify them against their corresponding files. Any downloaded file whose content type begins with application/ will cause Wget2 to request a signature for that file.

The signature file name is calculated by adding the extension to the full path of the file that was just downloaded.
The extension used is determined by the "--signature-extensions" option. If the content type for the signature request is application/pgp-signature, Wget2 will try to check the signature against the source file. By default, if the signature file cannot be found (that is, a request to it receives a 404 status code), Wget2 will exit with an error code.

This behavior can be configured using the following arguments:

  • fail: This is the default, that is, the value used when you specify the flag without an argument. Missing signature files cause Wget2 to exit with an error code.
  • no-fail: This value allows missing signature files. The 404 message will still be issued, but the program will continue to operate normally (assuming there are no unrelated errors).

In addition, --no-verify-sig disables signature verification altogether. --no-verify-sig does not accept any arguments.

--signature-extensions
Specifies file extensions for signature files without a leading ".". You can list multiple extensions as a comma separated list. All provided extensions will be tried simultaneously when searching for the signature file. Default is "sig".

--gnupg-homedir
Specifies the gnupg home directory to use when verifying PGP signatures of downloaded files. The default is your home directory on the operating system.

--verify-save-failed
Tells Wget2 to save files that fail PGP signature verification. By default, files that fail PGP verification are deleted.

--xattr
Save document metadata as “user POSIX Extended Attributes” (default: enabled). This feature only works if the file system supports it. More information at https://freedesktop.org/wiki/CommonExtendedAttributes.

Currently Wget2 sets the attributes

  • user.xdg.origin.url
  • user.xdg.referrer.url
  • user.mime_type
  • user.charset

To display a file's extended attributes (on Linux): getfattr -d «file»

--metalink
Follow/process Metalink URLs without saving them (default: enabled).

Metalink files describe downloads, including mirrors, files, checksums, and signatures. This allows chunked downloads, automatic selection of the nearest mirror, and integrity checking of the downloaded file.

--fsync-policy
Enables execution of the sync command after each file has finished downloading (default: disabled).

--http2-request-window=number
Sets the maximum number of parallel streams per HTTP/2 connection (default: 30).

--keep-extension
This option changes the behavior when creating a unique file name if the file already exists.

The default pattern for file names is “filename”.“N”. The new pattern is “basename”_“N”.“ext”.
The idea is that such files can be used without renaming when the application that opens them is chosen by file extension, as on Windows.

This option does not change the behavior of --backups.
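The two naming patterns can be compared with a minimal sketch. unique_name is a hypothetical helper, not a Wget2 function, and the handling of names without an extension is an assumption.

```python
import os


def unique_name(filename, n, keep_extension=False):
    """Build the n-th alternative name for an already existing file."""
    if not keep_extension:
        return f"{filename}.{n}"          # default: filename.N
    base, ext = os.path.splitext(filename)
    return f"{base}_{n}{ext}"             # --keep-extension: basename_N.ext


print(unique_name("report.pdf", 1))
print(unique_name("report.pdf", 1, keep_extension=True))
```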

Directory Options

-nd, --no-directories
Do not create a directory hierarchy when recursively extracting. If this option is enabled, all files will be saved in the current directory without overwriting (if the name appears more than once, the file names will have the extension .n, where n is an integer).

-x, --force-directories
The opposite of -nd: create a hierarchy of directories even if one would not have been created otherwise.
For example, wget2 -x https://example.com/robots.txt will save the downloaded file as example.com/robots.txt.

-nH, --no-host-directories
Disables the creation of directories named as host prefixes. By default, calling Wget2 with the recursion switch -r https://example.com/ will create a folder structure starting with example.com/. This option disables this behavior.

--protocol-directories
Use the protocol name as part of the name for the directory with local files. For example, with this option: wget2 -r --protocol-directories https://example.com will save to https/example.com/... and not just example.com/....

--cut-dirs=number
Ignore the given number of directory components. This is useful for gaining precise control over the directory in which the results of a recursive download will be saved.

Take, for example, the directory "https://example.com/pub/sub/". If you get it with -r, it will be saved locally at "example.com/pub/sub/". Although the -nH option may remove the "example.com/" part, you will still get "pub/sub/".

This is where --cut-dirs comes in handy: it makes Wget2 not "see" the given number of remote directory components. Here are some examples of how --cut-dirs works.

No options          -> example.com/pub/sub/
--cut-dirs=1        -> example.com/sub/
--cut-dirs=2        -> example.com/

-nH                 -> pub/sub/
-nH --cut-dirs=1    -> sub/
-nH --cut-dirs=2    -> .

If you just want to get rid of the directory structure, this option is like a combination of -nd and -P. However, unlike -nd, --cut-dirs will not lose subdirectories. For example, with -nH --cut-dirs=1 subdirectory beta/ will be placed in sub/beta/, as you would expect.
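The mapping shown in the examples above can be reproduced with a small sketch. It only models how the local directory name is derived; local_dir is a hypothetical name, not part of Wget2.

```python
def local_dir(host, remote_dir, cut_dirs=0, no_host=False):
    """Return the local directory for a remote directory path."""
    # Split the remote path into components and drop the first cut_dirs.
    parts = [p for p in remote_dir.split("/") if p][cut_dirs:]
    if not no_host:
        parts.insert(0, host)   # host prefix unless -nH is given
    return "/".join(parts) + "/" if parts else "."


print(local_dir("example.com", "/pub/sub/"))
print(local_dir("example.com", "/pub/sub/", cut_dirs=1))
print(local_dir("example.com", "/pub/sub/", cut_dirs=2, no_host=True))
print(local_dir("example.com", "/pub/sub/beta/", cut_dirs=1, no_host=True))
```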

-P prefix, --directory-prefix=prefix
Set the directory prefix to "prefix". The directory prefix is the directory in which all other files and subdirectories will be saved, i.e. the top of the retrieval tree. The default is ".", the current directory. If the directory prefix does not exist, it will be created.

HTTP and HTTPS options

See Part 2 of the article.

Recursive Extraction Options

-r, --recursive
Enable recursive extraction. The default maximum extraction depth is 5.

-l depth, --level=depth
Specify the maximum recursion depth in depth levels

--delete-after
This option tells Wget2 to delete every file it downloads after doing so. This can be useful for preloading popular pages through proxy servers, for example:

wget2 -r -nd --delete-after https://example.com/~popular/page/

Here -r enables recursive extraction and -nd prevents the creation of directories.

Note that when --delete-after is specified, --convert-links is ignored, so .orig files are simply not created in the first place.

-k, --convert-links
Once the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only visible hyperlinks, but any part of the document that links to external content, such as inline images, links to style sheets, hyperlinks to non-HTML content, etc.

Each link will be changed in one of two ways:

  1. Links to files that have been downloaded by Wget2 will be changed to refer to the file they point to as a relative link.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif, which was also downloaded, then the link in doc.html will be changed to point to ../bar/img.gif. This kind of transformation works reliably for any combination of directories.

  2. Links to files that have not been downloaded by Wget2 will be changed to include the hostname and absolute path of the location they point to.

Example: if the downloaded file /foo/doc.html links to /bar/img.gif (or to ../bar/img.gif), then the link in doc.html will be changed to point to https://example.com/bar/img.gif.

This makes local browsing work reliably: if a linked file has been downloaded, the link will refer to its local name; if it hasn't been downloaded, the link will point to its full internet address rather than a broken link. The fact that the former links are converted to relative links ensures that you can move the downloaded folder hierarchy to another directory.

Note that only at the end of the download can Wget2 know which links have been downloaded. Because of this, the work done by -k is performed at the end of all downloads.
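The two conversion rules can be sketched as follows. This is an illustration of the rules as described, not Wget2's implementation; convert_link and its parameters are hypothetical.

```python
import posixpath


def convert_link(doc_path, target_path, downloaded, base_url):
    """Rewrite one link found in the document saved at doc_path.

    downloaded is the set of server paths that were fetched.
    """
    if target_path in downloaded:
        # Rule 1: downloaded targets become relative links.
        return posixpath.relpath(target_path, posixpath.dirname(doc_path))
    # Rule 2: everything else becomes an absolute URL.
    return base_url.rstrip("/") + target_path


print(convert_link("/foo/doc.html", "/bar/img.gif",
                   {"/bar/img.gif"}, "https://example.com"))
print(convert_link("/foo/doc.html", "/bar/img.gif",
                   set(), "https://example.com"))
```

Because the set of downloaded files is only complete after the last transfer, a function like this can only be applied once the whole download has finished.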

--convert-file-only
This option converts only the file name portion of the URLs, leaving the rest of each URL untouched. This part of the file name is sometimes referred to as the “base name,” although we avoid that term here to avoid confusion.

This option works especially well in combination with --adjust-extension, although this pairing is not enforced. It proves useful for populating Internet caches with files downloaded from different hosts.

Example: If some link points to //foo.com/bar.cgi?xyz , with the --adjust-extension option specified, its local destination would be assumed to be ./foo.com/bar.cgi?xyz.css, then the link would be changed to //foo.com/bar.cgi?xyz.css. Please note that only the file name portion of the link will be changed. The rest of the URL will remain unchanged, including the network path (“//”) that would otherwise be processed by Wget2 and converted to the applicable scheme (for example, “https://”).

-K, --backup-converted
When converting a file, save the original file with the suffix .orig. This affects the behavior of the -N (timestamps) option.

-m, --mirror
Includes options suitable for mirroring. This option enables recursion and timestamps, and sets the recursion levels to infinite depth.
Currently it is equal to the combination of options -r -N -l inf.

-p, --page-requisites
This option causes Wget2 to download all the files that are needed to display the web page correctly. This includes such things as embedded images, sounds, and referenced style sheets.

Normally, when a single HTML page is downloaded, any documents that may be required to display it correctly are not downloaded.
Using -r together with -l can help, but since Wget2 does not usually distinguish between external and embedded documents, you are generally left with “leaf documents” that lack their required resources.

For example, suppose the document 1.html contains an “IMG” tag referring to 1.gif and an “A” tag pointing to the external document 2.html. Let 2.html be similar, with a link to the image 2.gif and a link to the document 3.html. Let this continue indefinitely.

If we run the command:

wget2 -r -l 2 https://«site»/1.html

then 1.html, 1.gif, 2.html, 2.gif, and 3.html will be downloaded. As you can see, 3.html is left without its required 3.gif, because Wget2 simply counts the number of "hops" (up to 2) away from the initial 1.html to determine where to stop the recursion. However, with the command:

wget2 -r -l 2 -p https://«site»/1.html

all the above files and 3.gif, the image required by 3.html, will be downloaded.

Similarly, the command

wget2 -r -l 1 -p https://«site»/1.html

will load 1.html, 1.gif, 2.html, and 2.gif. Some may think that:

wget2 -r -l 0 -p https://«site»/1.html

will only load 1.html and 1.gif, but unfortunately this is not the case, because -l 0 is the equivalent of -l inf, which is infinite recursion.
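The hop counting in these examples can be modeled with a toy crawler over the 1.html, 2.html, 3.html chain. It is a simplified sketch, not Wget2's algorithm: crawl and last_page are hypothetical, the chain is cut off at last_page so that the unlimited case terminates, and the order of the returned list is not meaningful.

```python
def crawl(level, page_requisites=False, last_page=10):
    """Return the files fetched when starting from 1.html.

    Page n.html embeds n.gif and links to (n+1).html.
    """
    fetched = []

    def visit(n, depth):
        fetched.append(f"{n}.html")
        # Without -p the image counts as one more hop; with -p it is
        # fetched even when the page sits at the maximum depth.
        if depth < level or page_requisites:
            fetched.append(f"{n}.gif")
        if depth < level and n < last_page:
            visit(n + 1, depth + 1)

    visit(1, 0)
    return fetched


print(crawl(2))                                   # -r -l 2
print(crawl(2, page_requisites=True))             # -r -l 2 -p
print(crawl(1, page_requisites=True))             # -r -l 1 -p
print(crawl(float("inf"), True, last_page=4))     # -r -l 0 -p, chain cut at 4.html
```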

To load a single HTML page (or, for convenience, all those specified on the command line or in the -i URL input file) and its (or their) required resources, simply remove the -r and -l:

wget2 -p https://«site»/1.html

Note: Wget2 will behave as if -r had been specified, but only that single page and the resources needed to display it will be downloaded.
Links from that page to other external documents will not be followed. In fact, to download a single web page with all its required elements (even if they exist on different websites), and to be sure that the page displays correctly from the local disk, this author likes to use a few options in addition to -p:

wget2 -E -H -k -K -p https://«site»/«document»

where:

-E - add extensions to file names
-H - span to other hosts when fetching
-k - convert links for local viewing
-K - save backup copies of the original files with the .orig extension
-p - fetch all files necessary for displaying the page

To finish this topic, it's worth knowing that Wget2 considers any URL specified in an “A”, “AREA”, or “LINK” tag to be an external link, except for “LINK REL="stylesheet"”.

--strict-comments
Deprecated option for compatibility with Wget1.x. Wget2 always terminates comments at the first occurrence of -->, just like popular browsers.

--robots
Enable adherence to the robot exclusion standard (default: enabled).

For each domain you visit, follow the rules listed in /robots.txt. You should respect the domain owner's rules and only disable it for very good reasons.

Whether enabled or not, the robots.txt file is loaded and scanned for sitemaps. These are lists of pages/files available for download that are not necessarily available through a recursive crawl.

This behavior can be disabled using --no-follow-sitemaps.

Recursive accept/reject options

-A acclist, --accept=acclist
-R rejlist, --reject=rejlist

Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters *, ?, [, ] appears in an element of acclist or rejlist, it will be treated as a pattern rather than a suffix.
In this case, you must enclose the pattern in quotes to prevent your shell from expanding it, for example -A "*.mp3" or -A '*.mp3'.
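The suffix-versus-pattern rule can be sketched as follows. The helper names are hypothetical, and Python's fnmatch module stands in for whatever matcher Wget2 uses internally, so edge cases may differ.

```python
from fnmatch import fnmatchcase

WILDCARDS = "*?[]"


def matches(filename, element):
    """Match one acclist/rejlist element against a file name."""
    if any(c in element for c in WILDCARDS):
        return fnmatchcase(filename, element)   # treated as a pattern
    return filename.endswith(element)           # treated as a suffix


def accepted(filename, acclist):
    """True if any element of the comma-separated list matches."""
    return any(matches(filename, e) for e in acclist.split(","))


print(accepted("song.mp3", "*.mp3"))
print(accepted("notes.txt", "mp3,txt"))
print(accepted("song.MP3", "*.mp3"))   # case-sensitive unless --ignore-case
```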

--accept-regex=urlregex
--reject-regex=urlregex

Specify a regular expression to accept or reject file names.

--regex-type=regextype
Specify the regular expression type. Possible types: posix or pcre. Note that to use the pcre type, wget2 must be compiled with libpcre support.

--filter-urls
Apply accept and reject filters to the URL before starting the download.

-D domain-list, --domains=domain-list
Set the domains to be followed. domain-list is a comma-separated list of domains. Note that this option does not turn on -H.

--exclude-domains=domain-list
Specify a list of domains whose links the program will not follow.

--follow-sitemaps
Parse the sitemaps listed in robots.txt and follow their links (default: enabled).

This option is enabled for recursive downloads regardless of whether you specify --robots or --no-robots. Following the URLs found in sitemaps can be disabled using --no-follow-sitemaps.

--follow-tags=list
Wget2 has an internal table of HTML tag/attribute pairs that it considers when searching for related documents during a recursive search. However, if the user wants only a subset of these tags to be considered, he or she should specify such tags in a comma-separated list using this option.

--ignore-tags=list
This is the opposite of the --follow-tags option. To skip certain HTML tags when recursively searching for documents to download, list them separated by commas.

In the past, this option was the best choice for downloading a single page and its required resources from the command line, for example:

wget2 --ignore-tags=a,area -H -k -K -r https://site/document

However, the author of this option came across a page with tags like "<LINK REL="home" HREF="/">" and came to the conclusion that specifying tags to ignore was not enough. You can't just tell Wget2 to ignore "<LINK>", because then the style sheets won't be downloaded. Now the best choice for downloading a single page and its required resources is the dedicated option --page-requisites.

--ignore-case
Ignore case when matching files and directories. This affects the behavior of the -R, -A, -I, and -X options. For example, with this option, -A "*.txt" will match file1.txt as well as file2.TXT, file3.TxT, and so on.

-H, --span-hosts
Allows transitions between hosts when performing recursive extraction.

-L, --relative [Not implemented yet]
Follow relative links only. Useful when retrieving a specific home page without any distractions, even from the same hosts.

-I list, --include-directories=list
Specify a comma-separated list of directories you want to follow when downloading. List elements can contain wildcards.

wget2 -r https://webpage.domain --include-directories=*/pub/*/

Please keep in mind that */pub/*/ is the same as /*/pub/*/ and that it matches directories, not strings. This means that /pub does not affect files contained in, for example, /directory/something/pub, but /pub/ matches every subdirectory of /pub.

-X list, --exclude-directories=list
Specify a comma-separated list of directories that you want to exclude from downloading. List elements can contain wildcards.

wget2 -r https://gnu.org --exclude-directories=/software

-I / -X combinations
Specify, in a single command, comma-separated lists of directories that Wget2 should and should not follow when downloading. List elements can contain wildcards.
Keep in mind that with this combination of flags Wget2 behaves slightly differently from Wget1.x.
If -I is given first, the default is "exclude all". If -X is given first, the default is "include all".
Multiple -I/-X options are processed from first to last. The last match is the one that applies.

Example: wget2 -I /pub -X /pub/trash will download everything from /pub/ except /pub/trash.
Example: wget2 -X /pub -I /pub/important will download everything except /pub, of which only /pub/important will be downloaded.
To reset the list (i.e. to ignore -I/-X from .wget2rc files), use --no-include-directories or --no-exclude-directories.
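The "first option sets the default, last match wins" rule can be modeled with a short sketch. It is a simplified illustration, not Wget2's code: the names dir_matches and allowed are hypothetical, and directory matching is approximated with Python's fnmatch applied to the path and each of its parent directories.

```python
from fnmatch import fnmatchcase


def dir_matches(path, pattern):
    """True if path equals pattern or lies below it (rough approximation)."""
    pattern = pattern.rstrip("/")
    parts = path.rstrip("/").split("/")
    # Try every leading portion of the path: /a, /a/b, /a/b/c, ...
    return any(fnmatchcase("/".join(parts[:i]), pattern)
               for i in range(2, len(parts) + 1))


def allowed(path, rules):
    """rules is the ordered list of ("I" or "X", pattern) from the command line."""
    if not rules:
        return True
    # -I first: start from "exclude all"; -X first: start from "include all".
    verdict = rules[0][0] == "X"
    for kind, pattern in rules:
        if dir_matches(path, pattern):
            verdict = kind == "I"   # a later match overrides earlier ones
    return verdict


first = [("I", "/pub"), ("X", "/pub/trash")]
second = [("X", "/pub"), ("I", "/pub/important")]
print(allowed("/pub/docs", first), allowed("/pub/trash/old", first))
print(allowed("/pub/docs", second), allowed("/pub/important/x", second))
```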

-np, --no-parent
Never go up to the parent directory when loading with recursion. This is a useful option because it ensures that only files below a certain hierarchy are loaded.

--filter-mime-type=list
Specify a comma-separated list of MIME types that will be downloaded. List elements can contain wildcards. If a MIME type starts with the character "!", it will not be downloaded, which is useful for downloading everything except certain types. If the server does not specify the MIME type of a file, it is treated as 'application/octet-stream'. For example, to download everything except images:
wget2 -r https://site/document --filter-mime-type=*,\!image/*

It is also useful for downloading files that are compatible with an application on your system. For example, to download every LibreOffice Writer compatible file from a website using recursive mode:

wget2 -r https://site/document --filter-mime-type=$(sed -r '/^MimeType=/!d;s/^MimeType=//;s/;/,/g' /usr/share/applications/libreoffice-writer.desktop)
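How such a list might be evaluated can be sketched as follows. This is only a model under stated assumptions, not Wget2's implementation: mime_allowed is a hypothetical name, later entries are assumed to override earlier ones, and matching is assumed to be case-insensitive.

```python
from fnmatch import fnmatchcase


def mime_allowed(mime_type, filter_list):
    """Decide whether a response with this MIME type would be downloaded."""
    # A missing type is treated as application/octet-stream.
    mime_type = (mime_type or "application/octet-stream").lower()
    verdict = False
    for entry in filter_list.split(","):
        negate = entry.startswith("!")
        if fnmatchcase(mime_type, entry.lstrip("!").lower()):
            verdict = not negate   # assumption: the last matching entry wins
    return verdict


print(mime_allowed("text/html", "*,!image/*"))
print(mime_allowed("image/png", "*,!image/*"))
print(mime_allowed(None, "*,!image/*"))
```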

Plugin options, exit codes, Wget2 debugging

See second part of the article.



Related publications