ichenh (physchen.com)
0.1.0
ANDIRT is a specialized Python crawler designed to recursively scrape open directory listings, typically found on Apache/Nginx servers. The tool downloads academic document formats (PDF, DOCX, TXT, etc.) and archives (ZIP, TAR, 7Z, etc.) while ignoring non-essential web assets like HTML and CSS files.
- Recursive Downloading: Handles open directory listings, mirroring all files in specified directories.
- Allowed File Types: Configurable filter to download only academic documents and archives.
- Concurrency: Uses multiple threads to download files simultaneously, improving performance.
- Logging: Includes detailed logging for monitoring progress and errors.
- Automatic Skipping: Skips files that are already downloaded to prevent duplication.
- Respectful Crawling: Implements a 0.1-second delay between requests to minimize server load.
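The features above (recursive listing traversal, extension filtering, and the 0.1-second delay) can be sketched roughly as follows. This is a minimal illustration built on the tool's documented requests/beautifulsoup4 stack; the function name `crawl` and the exact filtering rules are assumptions, not ANDIRT's actual code:

```python
"""Minimal sketch of a recursive open-directory crawl (illustrative only)."""
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# Assumed set of academic/archive extensions; ANDIRT's real list may differ.
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt", ".zip", ".tar", ".7z"}

def crawl(url, visited=None):
    """Yield downloadable file URLs from an open directory listing."""
    if visited is None:
        visited = set()
    if url in visited:
        return
    visited.add(url)
    html = requests.get(url, timeout=30).text
    time.sleep(0.1)  # respectful 0.1-second delay between requests
    for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        href = link["href"]
        if href.startswith(("?", "/")) or href == "../":
            continue  # skip column-sort links and the parent-directory entry
        target = urljoin(url, href)
        if href.endswith("/"):
            yield from crawl(target, visited)  # recurse into subdirectory
        elif any(href.lower().endswith(ext) for ext in ALLOWED_EXTENSIONS):
            yield target  # matches the allowed-extensions filter
```

HTML, CSS, and other web assets never match the extension filter, so only document and archive links are yielded.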
- Python 3.6+
- requests
- beautifulsoup4
- Clone this repository or download the script:

  git clone https://github.com/ichenh/ANDIRT.git
  cd ANDIRT

- Install dependencies:

  pip install -r requirements.txt
To use the tool, run the following command:
python andirt.py [URL] [--output OUTPUT_DIR]

- `url` (required): The target URL of the directory listing to scrape. The URL must begin with `http://` or `https://`.
- `--output` or `-o` (optional): The local directory in which to save downloaded files. Defaults to `./downloads` in the current directory.
- Basic usage (saves to `./downloads`):

  python andirt.py http://archive.ubuntu.com/ubuntu/indices/

- Specify a custom download folder:

  python andirt.py http://example.com/public-datasets/ --output "D:/Research_Data"
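The command-line interface described above could be wired up with `argparse` roughly as follows. This is a sketch; the `build_parser` helper is hypothetical, and the script's actual option handling may differ:

```python
"""Hypothetical sketch of ANDIRT's CLI definition using argparse."""
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Recursively download files from an open directory listing."
    )
    parser.add_argument(
        "url", help="target URL; must begin with http:// or https://"
    )
    parser.add_argument(
        "--output", "-o", default="./downloads",
        help="local directory for downloaded files (default: ./downloads)",
    )
    return parser

# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["http://example.com/files/", "-o", "data"])
print(args.url, args.output)  # → http://example.com/files/ data
```

A validation step (rejecting URLs that do not start with `http://` or `https://`) would naturally follow parsing.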
- The script implements a 0.1-second delay between requests to reduce the risk of server blocking.
- Files that already exist locally are automatically skipped.
- Only files with extensions listed in `ALLOWED_EXTENSIONS` will be downloaded.
- Supports multiple file types, including PDF, DOCX, TXT, ZIP, TAR, and others.
- For large repositories, consider adjusting the concurrency (the `ThreadPoolExecutor` configuration) to suit your network capacity.
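The notes above (skip-if-exists, extension filtering, tunable `ThreadPoolExecutor` concurrency) could combine into a download stage like the sketch below. The names `download_file` and `download_all` are illustrative assumptions, not the script's actual functions:

```python
"""Sketch of a concurrent download stage with skip-if-exists (illustrative)."""
import os
from concurrent.futures import ThreadPoolExecutor

import requests

def download_file(url, output_dir="./downloads"):
    """Download one file, skipping it if it already exists locally."""
    os.makedirs(output_dir, exist_ok=True)
    dest = os.path.join(output_dir, url.rsplit("/", 1)[-1])
    if os.path.exists(dest):
        return dest  # automatic skipping: already downloaded
    with requests.get(url, stream=True, timeout=30) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=8192):
                fh.write(chunk)  # stream to disk to bound memory use
    return dest

def download_all(file_urls, max_workers=4):
    # Raise max_workers for large repositories if your network can handle it.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_file, file_urls))
```

Streaming with `iter_content` keeps memory flat even for large archives, and `pool.map` preserves the input order of results.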
Use of this tool to download copyrighted material without permission is strictly prohibited. Users are responsible for complying with the terms of service of any target website.
This project is licensed under the MIT License.