mirror of
https://github.com/jnstockley/web-scraper.git
synced 2026-06-05 11:37:59 -05:00
Customizable web scraper that sends alerts when criteria are met on websites.
WebScraper
- This program can scrape data from websites using different scrapers and send an email when a match or change is found, depending on the scraper used
- There are 2 types of scrapers:
  - Generic: Can scrape any website, but might not be as exact
  - Specific: Can scrape only specific websites, but will be more exact
Generic Scrapers
- Text
- Diff
Specific Scrapers
- Cars.com
How to use
Text
- Set these specific env variables:
  SCRAPER=text # Scraper to use
  URL=<URL> # URL to scrape
  TEXT=<TEXT> # Text to look for
- Ensure all other required env variables are set
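As a rough illustration of what a text check involves, the sketch below fetches a page and reports whether the `TEXT` value appears in it. The function names `fetch` and `page_contains`, the timeout, and the case-insensitive comparison are assumptions made for this example, not the project's actual implementation.

```python
import urllib.request


def fetch(url: str, timeout: float = 30.0) -> str:
    """Download a page and decode it as UTF-8, replacing undecodable bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def page_contains(html: str, text: str) -> bool:
    """Return True if `text` occurs in `html`.

    The comparison ignores case; that is an assumption of this sketch.
    """
    return text.casefold() in html.casefold()
```

With `URL` and `TEXT` read from the environment, a single check would be `page_contains(fetch(os.environ["URL"]), os.environ["TEXT"])`.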
Diff
- Set these specific env variables:
  SCRAPER=diff # Scraper to use
  URL=<URL> # URL to scrape
  PERCENTAGE=<PERCENTAGE_DIFF> # Percentage difference to look for
- Ensure all other required env variables are set
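The README does not say how the percentage difference is calculated. One plausible way to compare two snapshots of a page is shown below, using the similarity ratio from Python's standard `difflib`. The function names and the "greater than or equal" comparison against `PERCENTAGE` are assumptions for illustration only.

```python
import difflib


def percent_changed(old: str, new: str) -> float:
    """Percentage by which `new` differs from `old`, from 0.0 to 100.0.

    Uses difflib's similarity ratio (1.0 means identical); the real
    scraper may measure change differently.
    """
    similarity = difflib.SequenceMatcher(None, old, new).ratio()
    return (1.0 - similarity) * 100.0


def should_alert(old: str, new: str, threshold: float) -> bool:
    """True when the change meets or exceeds the PERCENTAGE threshold."""
    return percent_changed(old, new) >= threshold
```

Under this measure, `PERCENTAGE=5` would trigger an alert once at least 5% of the page content no longer matches the previous snapshot.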
Cars.com
- Set these specific env variables:
  SCRAPER=cars_com # Scraper to use
  URL=https://www.cars.com/shopping/results/ # URL to scrape, must be on the results page for a specific search
- Ensure all other required env variables are set
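Because the Cars.com scraper only works on a search results page, it can help to check the `URL` value before starting the container. The helper below is a hypothetical pre-flight check, not part of the project; it only verifies that the URL points at the `/shopping/results` path on `www.cars.com`.

```python
from urllib.parse import urlparse


def is_cars_com_results_url(url: str) -> bool:
    """Return True if `url` looks like a Cars.com search results page."""
    parsed = urlparse(url)
    return (
        parsed.scheme == "https"
        and parsed.netloc == "www.cars.com"
        and parsed.path.startswith("/shopping/results")
    )
```

The check ignores the query string, so any search filters appended to the results URL pass through unchanged.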
Required env variables
SLEEP_TIME_SEC= # Time to sleep between each scrape
SENDER_EMAIL= # Email to send from
FROM_EMAIL= # Name to send from, e.g. '"Web Scraper" <no-reply@jstockley.com>'
RECEIVER_EMAIL= # Email to send to
PASSWORD= # Password for the sender's email
SMTP_SERVER= # SMTP server to use
SMTP_PORT= # SMTP port to use
TLS= # True/False to use TLS
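To show how these variables fit together, here is a minimal sketch that builds and sends an alert email with Python's standard `smtplib`. It is not the project's code: the function names are made up for this example, and it assumes `TLS=True` means upgrading the connection with STARTTLS, which the README does not confirm.

```python
import smtplib
from email.message import EmailMessage


def build_message(env: dict, subject: str, body: str) -> EmailMessage:
    """Create the alert email from the FROM_EMAIL and RECEIVER_EMAIL values."""
    msg = EmailMessage()
    msg["From"] = env["FROM_EMAIL"]
    msg["To"] = env["RECEIVER_EMAIL"]
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def send(env: dict, msg: EmailMessage) -> None:
    """Log in as SENDER_EMAIL and deliver the message over SMTP."""
    # Assumption: TLS=True means STARTTLS rather than implicit SSL.
    use_tls = env["TLS"].strip().lower() == "true"
    with smtplib.SMTP(env["SMTP_SERVER"], int(env["SMTP_PORT"])) as smtp:
        if use_tls:
            smtp.starttls()
        smtp.login(env["SENDER_EMAIL"], env["PASSWORD"])
        smtp.send_message(msg, from_addr=env["SENDER_EMAIL"])
```

Passing `dict(os.environ)` as `env` would use the variables exactly as set for the container.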
Running multiple of the same scraper
To run two or more scrapers of the same type, e.g. two diff scrapers, make sure each one maps a different host folder to /app/data/
Ex:
  diff-scraper-1:
    image: jnstockley/web-scraper:latest
    volumes:
      - ./diff-scraper-1-data/:/app/data/
    environment:
      - TZ=America/Chicago
      - SCRAPER=diff
      - URL=https://google.com
      - PERCENTAGE=5
      - SLEEP_TIME_SEC=21600
  diff-scraper-2:
    image: jnstockley/web-scraper:latest
    volumes:
      - ./diff-scraper-2-data/:/app/data/
    environment:
      - TZ=America/Chicago
      - SCRAPER=diff
      - URL=https://yahoo.com
      - PERCENTAGE=5
      - SLEEP_TIME_SEC=21600
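A quick way to catch an accidental shared folder is to compare the host side of each volume mapping. The helper below is a hypothetical check written for this example; it takes the service names and their `volumes` entries as plain Python data rather than parsing the compose file itself.

```python
from collections import defaultdict


def shared_host_folders(services: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each host folder used by more than one service to those services."""
    users = defaultdict(list)
    for name, volumes in services.items():
        for volume in volumes:
            # "host:container" -> keep the host part, ignoring a trailing slash
            host_path = volume.split(":", 1)[0].rstrip("/")
            users[host_path].append(name)
    return {path: names for path, names in users.items() if len(names) > 1}
```

An empty result means every scraper has its own data folder, as in the example above.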