This tool is designed to extract email addresses from a list of websites. It uses a two-step approach:
- First, it tries a fast method using `requests` and `BeautifulSoup`
- If that fails, it falls back to a more robust method using `Selenium` with the Chrome WebDriver
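A minimal sketch of the fast path, with an illustrative helper name and regex rather than the script's actual API; when this returns nothing (or the request fails), the script falls back to the Selenium method sketched at the end of this document:

```python
import re

import requests
from bs4 import BeautifulSoup

EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"  # illustrative pattern

def extract_emails_fast(url):
    """Fast path: fetch the page with requests and scan its visible text with a regex."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return set(re.findall(EMAIL_REGEX, soup.get_text()))
```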
Key features:
- Two-step scraping approach for maximum effectiveness
- Automatically checks contact pages for additional emails
- Filters out false positives such as image filenames that contain an @ symbol (see the sketch after this list)
- Creates example URLs file if none exists
- Saves results to CSV for easy analysis
- Detailed console output with progress information
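The false-positive filter can be as simple as rejecting matches that end in an image extension; a sketch with hypothetical helper names, not the script's exact logic:

```python
# Illustrative filter: drop regex matches that are really asset filenames such as logo@2x.png.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

def looks_like_email(candidate):
    """Reject matches whose 'domain' part is actually an image filename."""
    return not candidate.lower().endswith(IMAGE_EXTENSIONS)

def filter_matches(matches):
    """Keep only matches that pass the image-filename check."""
    return {m for m in matches if looks_like_email(m)}
```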
Requirements:
- Python 3.6 or higher
- Chrome browser installed (for Selenium fallback method)
- Required Python packages (see Installation)
Installation:
- Make sure Python is installed (with the "Add to PATH" option checked on Windows)
- Install the required packages: `pip install -r requirements.txt`
- Or install them individually: `pip install selenium pandas beautifulsoup4 requests webdriver-manager`
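If you need to create `requirements.txt` yourself, a file matching the command above would simply list those packages (left unpinned here; pin versions as you prefer):

```
selenium
pandas
beautifulsoup4
requests
webdriver-manager
```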
Usage:
- Create a file named `urls.txt` with one URL per line, for example:
  `https://example.com`
  `https://example.org`
- Run the script: `python local_scraper.py`
- The script will create a file named `extracted_emails.csv` with the results
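One quick way to inspect the output with pandas (the column name used below is an assumption; check the header of the generated CSV):

```python
import pandas as pd

# Load the scraper's output and summarize it; "Email" is an assumed column name.
df = pd.read_csv("extracted_emails.csv")
print(df.head())
print(f"{df['Email'].nunique()} unique email addresses found")
```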
How it works:
- For each URL in your list, the scraper first tries the fast method using `requests`
- If no emails are found, it automatically switches to the more powerful `Selenium` method
- Both methods also check for contact pages and scan them for additional emails
- All unique emails are saved to a CSV file with their source URLs
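The contact-page step could look roughly like this sketch, which collects links whose text or href mentions "contact" (the helper name and matching rule are assumptions, not the script's exact code):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

def find_contact_pages(base_url, html):
    """Return absolute URLs of links that look like contact pages."""
    soup = BeautifulSoup(html, "html.parser")
    contact_links = set()
    for anchor in soup.find_all("a", href=True):
        label = (anchor.get_text() or "").lower()
        if "contact" in anchor["href"].lower() or "contact" in label:
            contact_links.add(urljoin(base_url, anchor["href"]))
    return contact_links
```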
You can modify the following variables at the top of the script:
- `URLS_FILE`: change the input file name (default: `urls.txt`)
- `OUTPUT_CSV`: change the output file name (default: `extracted_emails.csv`)
- `EMAIL_REGEX`: modify the regular expression used to find emails
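The top of `local_scraper.py` presumably defines something along these lines; the defaults match the documentation above, while the regex shown is a common pattern that may differ from the script's actual one:

```python
# Configuration constants at the top of the script.
URLS_FILE = "urls.txt"               # input file, one URL per line
OUTPUT_CSV = "extracted_emails.csv"  # where results are written
EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"  # illustrative email pattern
```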
If you encounter issues with Selenium:
- Make sure Chrome is installed on your system
- Try updating Chrome to the latest version
- If you're on Linux, you might need to install additional system libraries that Chrome depends on
- The script includes a 3-second delay when using Selenium to allow JavaScript to load
- A 1-second delay is added between URLs to avoid overloading servers
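For reference, a minimal sketch of the Selenium fallback with that delay, using `webdriver-manager` to provision ChromeDriver (the function name and option choices are assumptions, not the script's exact code):

```python
import re
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

EMAIL_REGEX = r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"  # illustrative pattern

def extract_emails_selenium(url):
    """Render the page in headless Chrome, wait for JavaScript, then scan the page source."""
    options = Options()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
    try:
        driver.get(url)
        time.sleep(3)  # give JavaScript time to load, as noted above
        return set(re.findall(EMAIL_REGEX, driver.page_source))
    finally:
        driver.quit()
```

When looping over many URLs, a `time.sleep(1)` between calls keeps the request rate polite, matching the second note above.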