This project demonstrates a web crawler built with Cheerio.js, Puppeteer, and Socket.IO. The crawler extracts URLs from web pages and streams real-time progress updates to the user interface. The `citations.html` file serves as an example of how the project can be used to document contributions and iterative development.
- Dynamic Web Crawling: Extracts URLs from web pages using Cheerio.js and Puppeteer (see the sketch after this list).
- Real-Time Updates: Displays crawling progress dynamically using Socket.IO.
- Frontend GUI: Provides a user-friendly interface for configuring and starting the crawler.
- Exportable Results: Supports exporting discovered URLs to a CSV file (optional enhancement).
- Compliance with `robots.txt`: Ensures ethical crawling practices using the `robots-txt-parser` library.
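To make the first two features concrete, here is a minimal sketch of how Puppeteer and Cheerio can be combined to extract links from a rendered page. The function name `extractUrls` and its structure are illustrative assumptions, not the project's actual code:

```javascript
// Hypothetical sketch: render a page with Puppeteer, then parse its links with Cheerio.
// `extractUrls` is an illustrative name, not necessarily the project's own.
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

async function extractUrls(pageUrl) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(pageUrl, { waitUntil: 'networkidle2' });
  const html = await page.content(); // fully rendered HTML, including JS-injected links
  await browser.close();

  const $ = cheerio.load(html);
  const urls = new Set();
  $('a[href]').each((_, el) => {
    try {
      // Resolve relative hrefs against the page URL; the Set de-duplicates results
      urls.add(new URL($(el).attr('href'), pageUrl).href);
    } catch {
      // Skip hrefs that cannot be parsed as URLs
    }
  });
  return [...urls];
}
```

Rendering with Puppeteer first means links injected by client-side JavaScript are visible to Cheerio, which only parses static HTML.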
Before running the project, ensure you have the following installed:
- Node.js (v12 or higher): download from https://nodejs.org/
- npm (bundled with Node.js)
- Git: download from https://git-scm.com/
- Clone the repository:
  ```bash
  git clone https://github.com/your-username/web-crawler.git
  cd web-crawler
  ```
- Install dependencies:
  ```bash
  npm install
  ```
- Run the server: start the backend (a sketch of such a server follows these steps):
  ```bash
  node server.js
  ```
- Open the frontend: open the `citations.html` file in your browser, or navigate to http://localhost:3000 if the server is running.
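The repository's actual `server.js` is not reproduced here. The following is a minimal sketch, assuming an Express app with Socket.IO attached, of how a backend could emit crawl progress to the frontend; the event names `start-crawl`, `progress`, and `done` are illustrative assumptions:

```javascript
// Minimal sketch of an Express + Socket.IO backend (not the project's actual server.js).
// Event names 'start-crawl', 'progress', and 'done' are assumptions for illustration.
const express = require('express');
const http = require('http');
const { Server } = require('socket.io');

const app = express();
const server = http.createServer(app);
const io = new Server(server);

app.use(express.static(__dirname)); // serve citations.html and other static assets

io.on('connection', (socket) => {
  socket.on('start-crawl', async ({ baseUrl, pageUrl, maxUrls }) => {
    // A real implementation would crawl here; this loop only simulates progress events.
    for (let i = 1; i <= maxUrls; i++) {
      socket.emit('progress', { discovered: i, total: maxUrls });
    }
    socket.emit('done', { urls: [] });
  });
});

server.listen(3000, () => console.log('Listening on http://localhost:3000'));
```

Attaching Socket.IO to the same HTTP server that serves the static frontend keeps everything on one port.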
- Open the web interface (`citations.html`) in your browser.
- Enter the following details:
  - Base URL: The root domain of the website to crawl (e.g., https://www.nasa.gov).
  - Page URL to Crawl: The specific page to start crawling (e.g., https://www.nasa.gov/page).
  - Max URLs to Discover: The maximum number of unique URLs to extract.
- Click the "Start Crawling" button.
- Monitor the progress in the Progress Log section.
- View the results in the Results Log section.
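On the frontend, the Progress Log and Results Log can be driven by Socket.IO events. Here is a minimal client-side sketch, again assuming the illustrative `start-crawl`, `progress`, and `done` event names from the server sketch above; the element IDs are also assumptions, not necessarily those in `citations.html`:

```javascript
// Hypothetical client-side wiring for the Progress Log and Results Log.
// Assumes the Socket.IO client script is loaded; element IDs are illustrative.
const socket = io('http://localhost:3000');

document.querySelector('#start-button').addEventListener('click', () => {
  socket.emit('start-crawl', {
    baseUrl: document.querySelector('#base-url').value,
    pageUrl: document.querySelector('#page-url').value,
    maxUrls: Number(document.querySelector('#max-urls').value),
  });
});

socket.on('progress', ({ discovered, total }) => {
  // Append each update so the Progress Log shows the crawl history
  document.querySelector('#progress-log').textContent +=
    `Discovered ${discovered} of ${total} URLs\n`;
});

socket.on('done', ({ urls }) => {
  document.querySelector('#results-log').textContent = urls.join('\n');
});
```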
The `citations.html` file is included in this repository as an example of how to document contributions and iterative development. It highlights the collaboration between Joshua Greenfield (Prompt Engineer) and Qwen2.5-Max (Large Language Model).
- Cheerio.js Template: Provided a foundational template for web scraping.
- Frontend Development: Designed a static HTML, CSS, and JavaScript frontend.
- Backend Integration: Developed a Node.js backend using Express.
- Enhancements: Added features like real-time progress updates and `robots.txt` compliance (see the sketch below).
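For context on the `robots.txt` enhancement, here is a minimal sketch of a pre-crawl permission check. Note that it uses the `robots-parser` package as a stand-in for illustration; the project itself uses `robots-txt-parser`, whose API differs:

```javascript
// Illustrative robots.txt check using the `robots-parser` package as a stand-in
// (the project uses `robots-txt-parser`; this is NOT that library's API).
const robotsParser = require('robots-parser');

async function isAllowed(pageUrl, userAgent = 'web-crawler-demo') {
  const robotsUrl = new URL('/robots.txt', pageUrl).href;
  const res = await fetch(robotsUrl); // built-in fetch requires Node 18+
  const body = res.ok ? await res.text() : '';
  const robots = robotsParser(robotsUrl, body);
  // isAllowed returns true/false/undefined; a missing robots.txt permits crawling
  return robots.isAllowed(pageUrl, userAgent) !== false;
}
```

Running a check like this before each request is what keeps the crawler within a site's stated crawling policy.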
To view the citations page:
- Clone the repository.
- Open `citations.html` in your browser.
We welcome contributions to improve this project! If you'd like to contribute, please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix:
  ```bash
  git checkout -b feature-name
  ```
- Commit your changes:
  ```bash
  git commit -m "Add feature or fix"
  ```
- Push your changes to GitHub:
  ```bash
  git push origin feature-name
  ```
- Submit a pull request.
Special thanks to Qwen2.5-Max for their expertise, patience, and iterative approach to problem-solving. Their contributions were instrumental in bringing this project to life.
This project is licensed under the MIT License. See the LICENSE file for details.