How to Use Proxies in Scrapy: A Complete Guide

Learn how to use proxies in Scrapy!

Dec 17, 2024

Scrapy is a powerful tool for web scraping—versatile and effective, but it often needs a bit of help to navigate challenges like IP bans. That’s where proxies come in. By integrating proxies into your Scrapy projects, you can scrape data more efficiently and reduce bans. Let’s explore how to use proxies in Scrapy.

Why Use Proxies in Scrapy?

Proxies act as intermediaries between your scraper and the target website, offering several key advantages:

  • Avoid IP Bans: Proxies let you change your IP address to avoid detection and bans.
  • Distribute Traffic: Spread requests across multiple proxies to minimize server load and avoid rate limits.

Configuring a Proxy in Scrapy

Scrapy makes it easy to configure proxies. Here are two straightforward methods:

Setting a Static Proxy

If you need to use a single proxy for all requests, follow these steps:

Open your Scrapy project’s settings.py file.

Add the following:

# settings.py
PROXY = 'http://PROXY_ADDRESS:PORT'

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

Replace PROXY_ADDRESS and PORT with your proxy details. Note that there is no Proxy request header: the built-in HttpProxyMiddleware applies a proxy only from each request’s meta['proxy'] key or from the standard http_proxy/https_proxy environment variables. The PROXY setting above is simply a central place to store the address so your own code can attach it to requests.
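
To make every request actually use the PROXY value from settings.py, a small custom downloader middleware is a reliable approach. Below is a minimal sketch; the StaticProxyMiddleware name and its registration priority are our own conventions, not part of Scrapy itself:

```python
# middlewares.py
class StaticProxyMiddleware:
    """Attach the PROXY address from settings.py to every outgoing request.

    Hypothetical helper: register it in DOWNLOADER_MIDDLEWARES, e.g.
    'myproject.middlewares.StaticProxyMiddleware': 350.
    """

    def __init__(self, proxy):
        self.proxy = proxy

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy builds middlewares via from_crawler; read PROXY here.
        return cls(crawler.settings.get('PROXY'))

    def process_request(self, request, spider):
        # The built-in HttpProxyMiddleware honors this meta key later
        # in the middleware chain.
        request.meta['proxy'] = self.proxy
```

With this in place, none of your spiders need proxy-specific code: every request is routed through the configured address.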

Using a Proxy in a Custom Spider

To set a proxy for specific requests:

import scrapy

class ProxySpider(scrapy.Spider):
    name = 'proxy_spider'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={
                'proxy': 'http://PROXY_ADDRESS:PORT'
            })

This allows you to customize proxy usage for each request.

Handling Proxy Authentication

Some proxies require authentication with a username and password. Scrapy supports this out of the box: embed the credentials in the proxy URL you pass through meta, and the built-in HttpProxyMiddleware sets the Proxy-Authorization header for you:

import scrapy

class AuthenticatedProxySpider(scrapy.Spider):
    name = 'authenticated_proxy_spider'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    'proxy': 'http://USERNAME:PASSWORD@PROXY_ADDRESS:PORT'
                }
            )

Replace USERNAME, PASSWORD, PROXY_ADDRESS, and PORT with your credentials.
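
One caveat: if the username or password contains characters such as @ or :, embedding them raw would break the URL, so they should be percent-encoded first. A small sketch using the standard library (the credentials shown are placeholders, not real values):

```python
from urllib.parse import quote

# Placeholder credentials containing characters that would break the URL
username = 'user@example'
password = 'p@ss:word'

# Percent-encode each part before embedding it in the proxy URL
proxy = 'http://{}:{}@PROXY_ADDRESS:PORT'.format(
    quote(username, safe=''),
    quote(password, safe=''),
)
print(proxy)  # http://user%40example:p%40ss%3Aword@PROXY_ADDRESS:PORT
```

The resulting string can be passed to meta['proxy'] exactly like the example above.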

Implementing Rotating Proxies

Rotating proxies can help distribute traffic and avoid bans, especially for large-scale scraping projects.

Using a List of Proxies

First, define a list of proxies in your settings.py (the spider below reads it from there):

PROXY_LIST = [
    'http://PROXY1_ADDRESS:PORT',
    'http://PROXY2_ADDRESS:PORT',
    'http://PROXY3_ADDRESS:PORT',
]

Then modify your spider to select proxies randomly:

import random
import scrapy

class RotatingProxySpider(scrapy.Spider):
    name = 'rotating_proxy_spider'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            proxy = random.choice(self.settings.get('PROXY_LIST'))
            yield scrapy.Request(url, meta={'proxy': proxy})
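
Note that random.choice can hand the same proxy to several consecutive requests. If you want an even spread instead, itertools.cycle gives a simple round-robin over the same list (placeholder addresses, as in PROXY_LIST above):

```python
from itertools import cycle

# Placeholder proxies, mirroring the PROXY_LIST setting above
PROXY_LIST = [
    'http://PROXY1_ADDRESS:PORT',
    'http://PROXY2_ADDRESS:PORT',
    'http://PROXY3_ADDRESS:PORT',
]

proxy_pool = cycle(PROXY_LIST)  # endless round-robin iterator

# Each request takes the next proxy in turn, wrapping around at the end
for i in range(5):
    print(i, next(proxy_pool))
```

Inside a spider, you would build the cycle once (e.g. in __init__) and call next(self.proxy_pool) in start_requests in place of random.choice.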

Using a Proxy Rotation Library

For more convenience, you can use a middleware library such as Scrapy-Proxy-Pool, which finds and rotates proxies for you:

Install the library:

pip install scrapy-proxy-pool

Enable it and register its middlewares in your settings.py:

PROXY_POOL_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy_proxy_pool.middlewares.ProxyPoolMiddleware': 610,
    'scrapy_proxy_pool.middlewares.BanDetectionMiddleware': 620,
}

This setup simplifies managing and rotating proxies.

Conclusion

Proxies are a vital tool for efficient and effective web scraping with Scrapy. Whether you’re using a single proxy, rotating multiple proxies, or leveraging a rotation library, Scrapy’s flexibility makes integration simple. By configuring proxies properly and respecting each site’s terms, you can gather data reliably while keeping the risk of bans low.

If you’re looking for reliable proxies to pair with Scrapy, check out the plans at Stat Proxies. Our solutions are designed to keep your scraping fast, efficient, and undetected. Explore your options here.
