Learn how to use proxies with multithreading in Python to speed up web scraping and avoid IP blocks.
Web scraping at scale often requires speed, efficiency, and proxy management to avoid IP bans or rate-limiting. Using proxies with multithreading allows you to scrape websites faster by sending multiple requests concurrently, while rotating proxies to stay undetected.
In this guide, we’ll demonstrate how to combine Python’s multithreading capabilities with proxy rotation for large-scale web scraping.
Before we start, ensure you have:
A list of proxies in IP:Port format (or authenticated proxies).
The Python libraries requests, concurrent.futures (built-in), and random.
Install requests if needed:
pip install requests
Setting Up the Proxy List
Prepare a list of proxies:
proxies = [
    "http://user:pass@IP1:PORT",
    "http://user:pass@IP2:PORT",
    "http://user:pass@IP3:PORT",
]
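If you'd rather keep the proxy list out of your code, you can load it from a text file instead. A minimal sketch, assuming one proxy URL per line in a file called proxies.txt (a filename chosen here for illustration):
# Load proxies from a text file, one URL per line (proxies.txt is an assumed name)
with open("proxies.txt") as f:
    proxies = [line.strip() for line in f if line.strip()]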
Use Python’s ThreadPoolExecutor for managing multiple threads:
from concurrent.futures import ThreadPoolExecutor
import requests
import random

# Proxy list
proxies = [
    "http://user:pass@IP1:PORT",
    "http://user:pass@IP2:PORT",
    "http://user:pass@IP3:PORT",
]

# Target URL
url = "https://httpbin.org/ip"

# Function to send a request using a proxy
def fetch(url, proxy):
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        print(f"Proxy {proxy} returned: {response.json()}")
    except Exception as e:
        print(f"Proxy {proxy} failed: {e}")

# Multithreading execution
with ThreadPoolExecutor(max_workers=5) as executor:
    for _ in range(10):  # Adjust the number of requests
        proxy = random.choice(proxies)
        executor.submit(fetch, url, proxy)
ThreadPoolExecutor: Manages a pool of threads (max_workers) for concurrent execution.
random.choice: Picks a random proxy from the list for each request.
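Printing from inside the worker keeps the example short, but in a real scraper you usually want the responses back in the main thread. Here is one way to collect them with concurrent.futures.as_completed, reusing the url and proxies defined above (fetch_result and the results list are additions for this sketch, not part of the original code):
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import requests

def fetch_result(url, proxy):
    # Return the response body instead of printing it
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        return response.json()
    except Exception as e:
        return {"error": str(e), "proxy": proxy}

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch_result, url, random.choice(proxies)) for _ in range(10)]
    results = [f.result() for f in as_completed(futures)]

print(results)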
Instead of random proxies, use itertools.cycle to rotate through proxies sequentially:
from itertools import cycle
proxy_pool = cycle(proxies)
def fetch_with_cycle(url):
    proxy = next(proxy_pool)
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        print(f"Proxy {proxy} returned: {response.json()}")
    except Exception as e:
        print(f"Proxy {proxy} failed: {e}")

# Execute with threading
with ThreadPoolExecutor(max_workers=5) as executor:
    for _ in range(10):
        executor.submit(fetch_with_cycle, url)
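One caveat: every worker thread calls next() on the same shared cycle iterator, which is not guaranteed to be thread-safe. If you want to be strict about it, wrap the call in a lock. A small sketch, with get_next_proxy as a helper name introduced here for illustration:
import threading

proxy_lock = threading.Lock()

def get_next_proxy():
    # Serialize access to the shared cycle iterator
    with proxy_lock:
        return next(proxy_pool)

def fetch_with_cycle_safe(url):
    proxy = get_next_proxy()
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        print(f"Proxy {proxy} returned: {response.json()}")
    except Exception as e:
        print(f"Proxy {proxy} failed: {e}")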
Check proxy latency and success rate to remove slow or dead proxies:
def verify_proxy(proxy):
    try:
        response = requests.get("https://httpbin.org/ip", proxies={"http": proxy, "https": proxy}, timeout=3)
        return proxy, response.status_code == 200
    except requests.RequestException:
        return proxy, False

valid_proxies = [proxy for proxy, success in map(verify_proxy, proxies) if success]
print(f"Working proxies: {valid_proxies}")
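The check above runs proxies one at a time and does not actually measure latency. Here is a sketch that does both, reusing ThreadPoolExecutor to verify in parallel and timing each request with time.perf_counter (the 2-second cutoff is an arbitrary threshold chosen for this example):
import time
from concurrent.futures import ThreadPoolExecutor

def verify_proxy_timed(proxy):
    # Returns (proxy, ok, latency_in_seconds)
    start = time.perf_counter()
    try:
        response = requests.get("https://httpbin.org/ip",
                                proxies={"http": proxy, "https": proxy}, timeout=3)
        return proxy, response.status_code == 200, time.perf_counter() - start
    except requests.RequestException:
        return proxy, False, None

with ThreadPoolExecutor(max_workers=10) as executor:
    checks = list(executor.map(verify_proxy_timed, proxies))

# Keep proxies that responded successfully in under 2 seconds
fast_proxies = [proxy for proxy, ok, latency in checks if ok and latency < 2]
print(f"Fast proxies: {fast_proxies}")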
Combine multithreading, proxy rotation, and error handling to scrape a website like quotes.toscrape.com:
def scrape_quotes(url, proxy):
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        if response.status_code == 200:
            print(f"Scraped with {proxy}: {response.text[:100]}")
    except Exception as e:
        print(f"Error with {proxy}: {e}")

# Scraping URLs
urls = [f"http://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]

# Execute with multithreading
with ThreadPoolExecutor(max_workers=5) as executor:
    for url in urls:
        proxy = random.choice(proxies)
        executor.submit(scrape_quotes, url, proxy)
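The error handling above only reports a failure and moves on. If you'd rather retry a failed page through a different proxy, a simple retry loop works; scrape_with_retry and the default of 3 attempts are additions for this sketch, not part of the original example:
def scrape_with_retry(url, retries=3):
    # Try up to `retries` different random proxies before giving up
    for _ in range(retries):
        proxy = random.choice(proxies)
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
            if response.status_code == 200:
                return response.text
        except Exception as e:
            print(f"Retrying {url} after error with {proxy}: {e}")
    return None

with ThreadPoolExecutor(max_workers=5) as executor:
    pages = list(executor.map(scrape_with_retry, urls))

print(f"Successfully scraped {sum(page is not None for page in pages)} of {len(urls)} pages")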