Apache/Nginx Directory Index Recursion Tool (ANDIRT)

Author

ichenh (physchen.com)

Version

0.1.0

Description

ANDIRT is a specialized Python crawler that recursively scrapes open directory listings of the kind typically exposed by Apache/Nginx servers. It downloads documents in common academic formats (PDF, DOCX, TXT, etc.) and archives (ZIP, TAR, 7Z, etc.) while ignoring non-essential web assets such as HTML and CSS files.

Features

  • Recursive Downloading: Walks open directory listings recursively, mirroring every matching file under the target directory tree.
  • Allowed File Types: Configurable filter to download only academic documents and archives.
  • Concurrency: Uses multiple threads to download files simultaneously, improving performance.
  • Logging: Includes detailed logging for monitoring progress and errors.
  • Automatic Skipping: Skips files that are already downloaded to prevent duplication.
  • Respectful Crawling: Implements a 0.1-second delay between requests to minimize server load.
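
The filtering and skip behaviour described above is implemented in andirt.py; as a rough, hypothetical sketch (the constant and function names below are illustrative, not taken from the source), it amounts to:

    import os
    from urllib.parse import urlparse

    # Illustrative set; the real ALLOWED_EXTENSIONS in andirt.py may differ.
    ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".zip", ".tar", ".7z"}

    def is_allowed(url):
        """True if the linked file has an extension we want to download."""
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        return ext in ALLOWED_EXTENSIONS

    def should_skip(local_path):
        """True if the file already exists locally (Automatic Skipping)."""
        return os.path.exists(local_path)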

Requirements

  • Python 3.6+
  • requests
  • beautifulsoup4

Installation

  1. Clone this repository or download the script:

    git clone https://github.com/ichenh/ANDIRT.git
    cd ANDIRT
  2. Install dependencies:

    pip install -r requirements.txt
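
  requirements.txt is expected to list the two packages from the Requirements section; installing them directly is equivalent:

    pip install requests beautifulsoup4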

Usage

To use the tool, run the following command:

python andirt.py URL [--output OUTPUT_DIR]

Arguments:

  • url (required): The target URL of the directory listing to scrape. The URL must begin with http:// or https://.
  • --output or -o (optional): Specifies the local directory to save the downloaded files. Defaults to ./downloads in the current directory.
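
The argument handling described above corresponds to a small argparse interface; a minimal sketch (not the actual andirt.py source) looks like this:

    import argparse

    parser = argparse.ArgumentParser(
        description="Recursively download files from an open directory listing.")
    parser.add_argument("url",
                        help="Directory listing URL (must start with http:// or https://)")
    parser.add_argument("--output", "-o", default="./downloads",
                        help="Directory to save downloaded files (default: ./downloads)")
    args = parser.parse_args()

    # Reject URLs without an explicit scheme, as required above.
    if not args.url.startswith(("http://", "https://")):
        parser.error("URL must begin with http:// or https://")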

Examples:

  • Basic usage (saves to ./downloads):

    python andirt.py http://archive.ubuntu.com/ubuntu/indices/
  • Specify a custom download folder:

    python andirt.py http://example.com/public-datasets/ --output "D:/Research_Data"

Notes:

  • The script implements a 0.1-second delay between requests to reduce the risk of server blocking.
  • Files that already exist locally are automatically skipped.
  • Only files with extensions listed in ALLOWED_EXTENSIONS will be downloaded.
  • Supports downloading multiple file types including PDF, DOCX, TXT, ZIP, TAR, and others.
  • For large repositories, consider adjusting the concurrency (the ThreadPoolExecutor configuration) to suit your network capacity; a sketch of that configuration follows below.
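
As a rough guide to the kind of adjustment meant in the last point, a ThreadPoolExecutor-based download loop could be configured like this (hypothetical names such as download and file_list are for illustration only; the real configuration lives in andirt.py):

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    REQUEST_DELAY = 0.1   # matches the 0.1-second politeness delay noted above
    MAX_WORKERS = 4       # raise or lower to match your network capacity

    def download(url, local_path):
        """Fetch a single file; a stand-in for the downloader in andirt.py."""
        time.sleep(REQUEST_DELAY)
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        with open(local_path, "wb") as fh:
            fh.write(resp.content)

    # file_list holds (url, local_path) pairs gathered by the crawler.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        for url, path in file_list:
            pool.submit(download, url, path)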

Warning:

Using this tool to download copyrighted material without permission is strictly prohibited. Users are responsible for complying with the target website's terms of service.

License

This project is licensed under the MIT License.
