Learn how to use proxies with multithreading in Python to speed up web scraping and avoid IP blocks.
Web scraping at scale often requires speed, efficiency, and proxy management to avoid IP bans or rate-limiting. Using proxies with multithreading allows you to scrape websites faster by sending multiple requests concurrently, while rotating proxies to stay undetected.
In this guide, we’ll demonstrate how to combine Python’s multithreading capabilities with proxy rotation for large-scale web scraping.
Before we start, ensure you have:
A list of proxies in IP:Port format (or authenticated proxies).
The Python libraries requests, concurrent.futures (built-in), and random.
Install requests if needed:
pip install requests
Setting Up the Proxy List
Prepare a list of proxies:
proxies = [
    "http://user:pass@IP1:PORT",
    "http://user:pass@IP2:PORT",
    "http://user:pass@IP3:PORT",
]
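If you'd rather keep the proxy list out of your code, you can load it from a text file instead. A minimal sketch, assuming one proxy URL per line in a file called proxies.txt (a filename chosen here for illustration):
# Load proxies from a text file, one URL per line (proxies.txt is an assumed name)
with open("proxies.txt") as f:
    proxies = [line.strip() for line in f if line.strip()]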
Use Python’s ThreadPoolExecutor for managing multiple threads:
from concurrent.futures import ThreadPoolExecutor
import requests
import random

# Proxy list
proxies = [
    "http://user:pass@IP1:PORT",
    "http://user:pass@IP2:PORT",
    "http://user:pass@IP3:PORT",
]

# Target URL
url = "https://httpbin.org/ip"

# Function to send a request using a proxy
def fetch(url, proxy):
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        print(f"Proxy {proxy} returned: {response.json()}")
    except Exception as e:
        print(f"Proxy {proxy} failed: {e}")

# Multithreading execution
with ThreadPoolExecutor(max_workers=5) as executor:
    for _ in range(10):  # Adjust the number of requests
        proxy = random.choice(proxies)
        executor.submit(fetch, url, proxy)
ThreadPoolExecutor: Manages a pool of threads (max_workers) for concurrent execution.
random.choice: Picks a random proxy from the list for each request.
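Printing from inside the worker keeps the example short, but in a real scraper you usually want the responses back in the main thread. Here is one way to collect them with concurrent.futures.as_completed, reusing the url and proxies defined above (fetch_result and the results list are additions for this sketch, not part of the original code):
from concurrent.futures import ThreadPoolExecutor, as_completed
import random
import requests

def fetch_result(url, proxy):
    # Return the response body instead of printing it
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        return response.json()
    except Exception as e:
        return {"error": str(e), "proxy": proxy}

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = [executor.submit(fetch_result, url, random.choice(proxies)) for _ in range(10)]
    results = [f.result() for f in as_completed(futures)]

print(results)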
Instead of random proxies, use itertools.cycle to rotate through proxies sequentially:
from itertools import cycle
proxy_pool = cycle(proxies)
def fetch_with_cycle(url):
    proxy = next(proxy_pool)
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        print(f"Proxy {proxy} returned: {response.json()}")
    except Exception as e:
        print(f"Proxy {proxy} failed: {e}")

# Execute with threading
with ThreadPoolExecutor(max_workers=5) as executor:
    for _ in range(10):
        executor.submit(fetch_with_cycle, url)
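One caveat: every worker thread calls next() on the same shared cycle iterator, which is not guaranteed to be thread-safe. If you want to be strict about it, wrap the call in a lock. A small sketch, with get_next_proxy as a helper name introduced here for illustration:
import threading

proxy_lock = threading.Lock()

def get_next_proxy():
    # Serialize access to the shared cycle iterator
    with proxy_lock:
        return next(proxy_pool)

def fetch_with_cycle_safe(url):
    proxy = get_next_proxy()
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        print(f"Proxy {proxy} returned: {response.json()}")
    except Exception as e:
        print(f"Proxy {proxy} failed: {e}")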
Check proxy latency and success rate to remove slow or dead proxies:
def verify_proxy(proxy):
    try:
        response = requests.get("https://httpbin.org/ip", proxies={"http": proxy, "https": proxy}, timeout=3)
        return proxy, response.status_code == 200
    except requests.RequestException:
        return proxy, False

valid_proxies = [proxy for proxy, success in map(verify_proxy, proxies) if success]
print(f"Working proxies: {valid_proxies}")
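The check above runs proxies one at a time and does not actually measure latency. Here is a sketch that does both, reusing ThreadPoolExecutor to verify in parallel and timing each request with time.perf_counter (the 2-second cutoff is an arbitrary threshold chosen for this example):
import time
from concurrent.futures import ThreadPoolExecutor

def verify_proxy_timed(proxy):
    # Returns (proxy, ok, latency_in_seconds)
    start = time.perf_counter()
    try:
        response = requests.get("https://httpbin.org/ip",
                                proxies={"http": proxy, "https": proxy}, timeout=3)
        return proxy, response.status_code == 200, time.perf_counter() - start
    except requests.RequestException:
        return proxy, False, None

with ThreadPoolExecutor(max_workers=10) as executor:
    checks = list(executor.map(verify_proxy_timed, proxies))

# Keep proxies that responded successfully in under 2 seconds
fast_proxies = [proxy for proxy, ok, latency in checks if ok and latency < 2]
print(f"Fast proxies: {fast_proxies}")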
Combine multithreading, proxy rotation, and error handling to scrape a website like quotes.toscrape.com:
def scrape_quotes(url, proxy):
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
        if response.status_code == 200:
            print(f"Scraped with {proxy}: {response.text[:100]}")
    except Exception as e:
        print(f"Error with {proxy}: {e}")

# Scraping URLs
urls = [f"http://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]

# Execute with multithreading
with ThreadPoolExecutor(max_workers=5) as executor:
    for url in urls:
        proxy = random.choice(proxies)
        executor.submit(scrape_quotes, url, proxy)
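The error handling above only reports a failure and moves on. If you'd rather retry a failed page through a different proxy, a simple retry loop works; scrape_with_retry and the default of 3 attempts are additions for this sketch, not part of the original example:
def scrape_with_retry(url, retries=3):
    # Try up to `retries` different random proxies before giving up
    for _ in range(retries):
        proxy = random.choice(proxies)
        try:
            response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)
            if response.status_code == 200:
                return response.text
        except Exception as e:
            print(f"Retrying {url} after error with {proxy}: {e}")
    return None

with ThreadPoolExecutor(max_workers=5) as executor:
    pages = list(executor.map(scrape_with_retry, urls))

print(f"Successfully scraped {sum(page is not None for page in pages)} of {len(urls)} pages")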