Apache/Nginx Directory Index Recursion Tool (ANDIRT)

Author

ichenh (physchen.com)

Version

0.1.0

Description

ANDIRT is a specialized Python crawler that recursively scrapes open directory listings of the kind typically exposed by Apache/Nginx servers. It downloads documents in common academic formats (PDF, DOCX, TXT, etc.) and archives (ZIP, TAR, 7Z, etc.) while ignoring non-essential web assets such as HTML and CSS files.

Features

  • Recursive Downloading: Walks open directory listings recursively, mirroring every matching file under the target directory tree.
  • Allowed File Types: Configurable filter to download only academic documents and archives.
  • Concurrency: Uses multiple threads to download files simultaneously, improving performance.
  • Logging: Includes detailed logging for monitoring progress and errors.
  • Automatic Skipping: Skips files that are already downloaded to prevent duplication.
  • Respectful Crawling: Implements a 0.1-second delay between requests to minimize server load.
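
The filtering and skip behaviour described above is implemented in andirt.py; as a rough, hypothetical sketch (the constant and function names below are illustrative, not taken from the source), it amounts to:

    import os
    from urllib.parse import urlparse

    # Illustrative set; the real ALLOWED_EXTENSIONS in andirt.py may differ.
    ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".zip", ".tar", ".7z"}

    def is_allowed(url):
        """True if the linked file has an extension we want to download."""
        ext = os.path.splitext(urlparse(url).path)[1].lower()
        return ext in ALLOWED_EXTENSIONS

    def should_skip(local_path):
        """True if the file already exists locally (Automatic Skipping)."""
        return os.path.exists(local_path)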

Requirements

  • Python 3.6+
  • requests
  • beautifulsoup4

Installation

  1. Clone this repository or download the script:

    git clone https://github.com/ichenh/ANDIRT.git
    cd ANDIRT
  2. Install dependencies:

    pip install -r requirements.txt
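
  requirements.txt is expected to list the two packages from the Requirements section; installing them directly is equivalent:

    pip install requests beautifulsoup4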

Usage

To use the tool, run the following command:

python andirt.py URL [--output OUTPUT_DIR]

Arguments:

  • url (required): The target URL of the directory listing to scrape. The URL must begin with http:// or https://.
  • --output or -o (optional): Specifies the local directory to save the downloaded files. Defaults to ./downloads in the current directory.
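
The argument handling described above corresponds to a small argparse interface; a minimal sketch (not the actual andirt.py source) looks like this:

    import argparse

    parser = argparse.ArgumentParser(
        description="Recursively download files from an open directory listing.")
    parser.add_argument("url",
                        help="Directory listing URL (must start with http:// or https://)")
    parser.add_argument("--output", "-o", default="./downloads",
                        help="Directory to save downloaded files (default: ./downloads)")
    args = parser.parse_args()

    # Reject URLs without an explicit scheme, as required above.
    if not args.url.startswith(("http://", "https://")):
        parser.error("URL must begin with http:// or https://")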

Examples:

  • Basic usage (saves to ./downloads):

    python andirt.py http://archive.ubuntu.com/ubuntu/indices/
  • Specify a custom download folder:

    python andirt.py http://example.com/public-datasets/ --output "D:/Research_Data"

Notes:

  • The script implements a 0.1-second delay between requests to reduce the risk of server blocking.
  • Files that already exist locally are automatically skipped.
  • Only files with extensions listed in ALLOWED_EXTENSIONS will be downloaded.
  • Supports downloading multiple file types including PDF, DOCX, TXT, ZIP, TAR, and others.
  • For large repositories, consider adjusting the concurrency (the ThreadPoolExecutor configuration) to suit your network capacity; a sketch of that configuration follows below.
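
As a rough guide to the kind of adjustment meant in the last point, a ThreadPoolExecutor-based download loop could be configured like this (hypothetical names such as download and file_list are for illustration only; the real configuration lives in andirt.py):

    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    REQUEST_DELAY = 0.1   # matches the 0.1-second politeness delay noted above
    MAX_WORKERS = 4       # raise or lower to match your network capacity

    def download(url, local_path):
        """Fetch a single file; a stand-in for the downloader in andirt.py."""
        time.sleep(REQUEST_DELAY)
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        with open(local_path, "wb") as fh:
            fh.write(resp.content)

    # file_list holds (url, local_path) pairs gathered by the crawler.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        for url, path in file_list:
            pool.submit(download, url, path)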

Warning:

Using this tool to download copyrighted material without permission is strictly prohibited. Users are responsible for complying with the target website's terms of service.

License

This project is licensed under the MIT License.
