LinkChecker 5.2 commandline options

May 18, 2010

USAGE linkchecker [options] [file-or-url]…

Options:

-h, --help Show this help message and exit.

General options:

-f FILENAME, --config=FILENAME

Use FILENAME as configuration file. By default

LinkChecker first searches

/etc/linkchecker/linkcheckerrc and then

~/.linkchecker/linkcheckerrc (under Windows

<path-to-program>\linkcheckerrc).

-I, --interactive Ask for a URL if none is given on the command line.

-t NUMBER, --threads=NUMBER

Generate no more than the given number of threads.

Default number of threads is 10. To disable threading

specify a non-positive number.

--priority Run with normal thread scheduling priority. By

default LinkChecker runs with low thread priority to

be suitable as a background job.

-V, --version Print version and exit.

--allow-root Do not drop privileges when running as root user on

Unix systems.

--stdin Read a list of whitespace-separated URLs to check from

stdin.

Output options:

-v, --verbose Log all URLs. Default is to log only errors and

warnings.

--complete Log all URLs, including duplicates. Default is to log

duplicate URLs only once.

--no-warnings Don’t log warnings. Default is to log warnings.

-W REGEX, --warning-regex=REGEX

Define a regular expression which prints a warning if

it matches any content of the checked link. This

applies only to valid pages, so we can get their

content.

Use this to check for pages that contain some form of

error message, for example ‘This page has moved’ or

‘Oracle Application Server error’.

--warning-size-bytes=NUMBER

Print a warning if content size info is available and

exceeds the given number of bytes.

--check-html Check syntax of HTML URLs with local library (HTML

tidy).

--check-html-w3 Check syntax of HTML URLs with W3C online validator.

--check-css Check syntax of CSS URLs with local library

(cssutils).

--check-css-w3 Check syntax of CSS URLs with W3C online validator.

--scan-virus Scan content of URLs with ClamAV virus scanner.

-q, --quiet Quiet operation, an alias for ‘-o none’. This is only

useful with -F.

-o TYPE[/ENCODING], --output=TYPE[/ENCODING]

Specify output as ‘xml’, ‘none’, ‘gml’, ‘text’,

‘blacklist’, ‘html’, ‘gxml’, ‘sql’, ‘csv’, ‘dot’.

Default output type is text. The ENCODING specifies

the output encoding, the default is that of your

locale. Valid encodings are listed at

http://docs.python.org/lib/standard-encodings.html.

-F TYPE[/ENCODING][/FILENAME], --file-output=TYPE[/ENCODING][/FILENAME]

Output to a file linkchecker-out.TYPE,

$HOME/.linkchecker/blacklist for ‘blacklist’ output,

or FILENAME if specified. The ENCODING specifies the

output encoding, the default is that of your locale.

Valid encodings are listed at

http://docs.python.org/lib/standard-encodings.html.

The FILENAME and ENCODING parts of the ‘none’ output

type will be ignored. Otherwise, if the file already exists,

it will be overwritten. Valid file output types are ‘xml’,

‘none’, ‘gml’, ‘text’, ‘blacklist’, ‘html’, ‘gxml’,

‘sql’, ‘csv’, ‘dot’. You can specify this option

multiple times to output to more than one file.

Default is no file output. Note that you can suppress

all console output with the option ‘-o none’.

--no-status Do not print check status messages.

-D STRING, --debug=STRING

Print debugging output for the given logger. Available

loggers are ‘all’, ‘thread’, ‘checking’, ‘gui’,

‘cache’, ‘cmdline’, ‘dns’. Specifying ‘all’ is an

alias for specifying all available loggers. The option

can be given multiple times to debug with more than

one logger.

For accurate results, threading will be disabled

during debug runs.

--trace Print tracing information.

--profile Write profiling data into a file named

linkchecker.prof in the current working directory. See

also --viewprof.

--viewprof Print out previously generated profiling data. See

also --profile.
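
As a rough illustration of the TYPE[/ENCODING][/FILENAME] argument taken by -F, the following Python sketch splits such a value into its parts and applies the defaults described above. The function name parse_file_output and the exact splitting rules (for example, treating an empty ENCODING part as "use the locale default") are assumptions made for this example; this is not LinkChecker's own code.

```python
import os


def parse_file_output(spec):
    """Split a -F style value into (type, encoding, filename).

    Illustrative only: encoding is None when the locale default would apply.
    """
    # Split at most twice so that a FILENAME may itself contain slashes.
    parts = spec.split("/", 2)
    out_type = parts[0]
    if out_type == "none":
        # ENCODING and FILENAME are ignored for the 'none' output type.
        return out_type, None, None
    encoding = parts[1] if len(parts) > 1 and parts[1] else None
    filename = parts[2] if len(parts) > 2 and parts[2] else None
    if filename is None:
        if out_type == "blacklist":
            filename = os.path.join(
                os.path.expanduser("~"), ".linkchecker", "blacklist")
        else:
            filename = "linkchecker-out." + out_type
    return out_type, encoding, filename


if __name__ == "__main__":
    for value in ("html", "csv/utf-8", "xml/utf-8/reports/site.xml", "none/utf-8/x"):
        print(value, "->", parse_file_output(value))
```

On the command line the corresponding calls would look like `linkchecker -F html -F csv/utf-8 http://www.example.org/`, which writes one file per -F option.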

Checking options:

-r NUMBER, --recursion-level=NUMBER

Check recursively all links up to given depth. A

negative depth will enable infinite recursion. Default

depth is infinite.

--no-follow-url=REGEX

Check but do not recurse into URLs matching the given

regular expression. This option can be given multiple

times.

--ignore-url=REGEX Only check syntax of URLs matching the given regular

expression. This option can be given multiple times.

-C, --cookies Accept and send HTTP cookies according to RFC 2109.

Only cookies which are sent back to the originating

server are accepted. Sent and accepted cookies are

provided as additional logging information.

--cookiefile=FILENAME

Read a file with initial cookie data. The cookie data

format is explained below.

-a, --anchors Check HTTP anchor references. Default is not to check

anchors. This option enables logging of the warning

‘url-anchor-not-found’.

--no-anchor-caching

This option is deprecated and does nothing. It will be

removed in a future release.

-u STRING, --user=STRING

Try the given username for HTTP and FTP authorization.

For FTP the default username is ‘anonymous’. For HTTP

there is no default username. See also -p.

-p STRING, --password=STRING

Try the given password for HTTP and FTP authorization.

For FTP the default password is ‘anonymous@’. For HTTP

there is no default password. See also -u.

--timeout=NUMBER Set the timeout for connection attempts in seconds.

The default timeout is 60 seconds.

-P NUMBER, --pause=NUMBER

Pause the given number of seconds between two

subsequent connection requests to the same host.

Default is no pause between requests.

-N STRING, --nntp-server=STRING

Specify an NNTP server for ‘news:…’ links. Default

is the environment variable NNTP_SERVER. If no host is

given, only the syntax of the link is checked.

--no-proxy-for=REGEX

This option is deprecated and does nothing. It will be

removed in a future release.
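
To make the effect of -r more concrete, here is a minimal Python sketch of a depth-limited crawl over a small, made-up link graph. The LINKS table and the function urls_to_check are hypothetical. The precise meaning of each depth value (here: with depth 0 only the start URL is checked, with depth 1 the links found on it are checked but not followed further) is an assumption of this sketch, not a statement about LinkChecker's internals.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": [],
    "/a/1": ["/a/1/x"],
    "/a/1/x": [],
}


def urls_to_check(start, max_depth):
    """Return URLs in breadth-first order, honouring a recursion limit.

    A negative max_depth means unlimited recursion.
    """
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        # Stop following links once the limit is reached (never for negative limits).
        if 0 <= max_depth <= depth:
            continue
        for child in LINKS.get(url, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return order


if __name__ == "__main__":
    for limit in (0, 1, -1):
        print(limit, urls_to_check("/", limit))
```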

EXAMPLES

The most common use checks the given domain recursively, plus any

single URL pointing outside of the domain:

linkchecker http://www.example.org/

Beware that this checks the whole site, which can have several hundred

thousand URLs. Use the -r option to restrict the recursion depth.

Don’t connect to mailto: hosts, only check their URL syntax. All other

links are checked as usual:

linkchecker --ignore-url=^mailto: http://www.example.org

Checking local HTML files on Unix:

linkchecker ../bla.html subdir/blubber.html

Checking a local HTML file on Windows:

linkchecker c:\temp\test.html

You can skip the "http://" URL part if the domain starts with "www.":

linkchecker www.example.de

You can skip the "ftp://" URL part if the domain starts with "ftp.":

linkchecker -r0 ftp.example.org
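
The two shortcuts above amount to guessing a scheme from the start of the argument. A minimal Python sketch of that idea follows; the function name guess_url is made up for this example, and how LinkChecker itself treats arguments that match neither rule is not covered here (the sketch simply returns them unchanged).

```python
def guess_url(arg):
    """Add a scheme to 'www.' and 'ftp.' arguments; leave everything else alone."""
    if "://" in arg:
        return arg  # already a full URL
    if arg.startswith("www."):
        return "http://" + arg
    if arg.startswith("ftp."):
        return "ftp://" + arg
    return arg  # e.g. a local file name such as ../bla.html


if __name__ == "__main__":
    for arg in ("www.example.de", "ftp.example.org", "../bla.html"):
        print(arg, "->", guess_url(arg))
```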

OUTPUT TYPES

Note that by default only errors and warnings are logged.

You should use the --verbose option to see valid URLs,

and --complete when outputting a sitemap graph format.

text Standard text output, logging URLs in keyword: argument fashion.

html Log URLs in keyword: argument fashion, formatted as HTML.

Additionally has links to the referenced pages. Invalid URLs have

HTML and CSS syntax check links appended.

csv Log check result in CSV format with one URL per line.

gml Log parent-child relations between linked URLs as a GML sitemap

graph.

dot Log parent-child relations between linked URLs as a DOT sitemap

graph.

gxml Log check result as a GraphXML sitemap graph.

xml Log check result as machine-readable XML.

sql Log check result as SQL script with INSERT commands. An example

script to create the initial SQL table is included as create.sql.

blacklist

Suitable for cron jobs. Logs the check result into a file

~/.linkchecker/blacklist which only contains entries with invalid

URLs and the number of times they have failed.

none Logs nothing. Suitable for debugging or checking the exit code.

REGULAR EXPRESSIONS

Only Python regular expressions are accepted by LinkChecker.

See http://www.amk.ca/python/howto/regex/ for an introduction to

regular expressions.

The only addition is that a leading exclamation mark negates

the regular expression.
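
The negation rule can be sketched in a few lines of Python. The helper compile_pattern below is hypothetical and only illustrates the described behaviour: a leading exclamation mark is stripped and the match result inverted. Using re.search rather than re.match is an assumption of this sketch.

```python
import re


def compile_pattern(pattern):
    """Return a predicate for a regex with optional leading '!' negation."""
    negate = pattern.startswith("!")
    regex = re.compile(pattern[1:] if negate else pattern)

    def matches(url):
        found = regex.search(url) is not None
        # With negation, a URL "matches" when the regex does NOT match.
        return found != negate

    return matches


if __name__ == "__main__":
    is_mail = compile_pattern(r"^mailto:")
    is_external = compile_pattern(r"!^https?://www\.example\.org/")
    print(is_mail("mailto:someone@example.org"))
    print(is_external("http://other.example.com/"))
```

With such a rule, an option value like `'!^https?://www\.example\.org/'` would select every URL outside www.example.org. Quote the value in the shell, because `!` and `^` can have special meaning there.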

COOKIE FILES

A cookie file contains standard RFC 805 header data with the following

possible names:

Scheme (optional)

Sets the scheme the cookies are valid for; default scheme is ‘http’.

Host (required)

Sets the domain the cookies are valid for.

Path (optional)

Gives the path the cookies are valid for; default path is ‘/’.

Set-cookie (optional)

Set cookie name/value. Can be given more than once.

Multiple entries are separated by a blank line.

The example below will send two cookies to all URLs starting with

‘http://example.org/hello/’ and one to all URLs starting

with ‘https://example.com/’:

Host: example.org
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"

Scheme: https
Host: example.com
Set-cookie: baggage="elitist"; comment="hologram"
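
To show how the format fits together, here is a small Python sketch that reads such a file into a list of entries, filling in the documented defaults for Scheme and Path. The function parse_cookie_file and the dictionary layout are inventions of this example, and treating header names case-insensitively is an assumption; LinkChecker's own parser may differ.

```python
import re

COOKIE_FILE = '''\
Host: example.org
Path: /hello
Set-cookie: ID="smee"
Set-cookie: spam="egg"

Scheme: https
Host: example.com
Set-cookie: baggage="elitist"; comment="hologram"
'''


def parse_cookie_file(text):
    """Parse blank-line separated header blocks into a list of dicts."""
    entries = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Documented defaults: scheme 'http', path '/'.
        entry = {"scheme": "http", "host": None, "path": "/", "cookies": []}
        for line in block.splitlines():
            name, _, value = line.partition(":")
            name, value = name.strip().lower(), value.strip()
            if name == "set-cookie":
                entry["cookies"].append(value)  # may occur more than once
            elif name in ("scheme", "host", "path"):
                entry[name] = value
        if entry["host"] is None:
            raise ValueError("cookie entry is missing the required Host header")
        entries.append(entry)
    return entries


if __name__ == "__main__":
    for entry in parse_cookie_file(COOKIE_FILE):
        print(entry)
```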
