In this tutorial, you will learn what Node Unblocker is, what benefits it offers for web scraping projects, and how to use it.
In the world of web scraping, developers often face challenges such as internet censorship, geo-restrictions, and rate limiting. These obstacles can significantly hinder data collection efforts, making it difficult to gather crucial information from websites. Enter Node Unblocker, a powerful tool that can help overcome these challenges and streamline your web scraping projects.
Node Unblocker is an open-source web proxy designed to bypass internet censorship and access geo-restricted content. Originally created as a censorship circumvention tool, it has evolved into a versatile library for proxying and rewriting remote webpages. This makes it an excellent choice for web scraping projects that require accessing restricted content or maintaining anonymity.
Let's walk through the process of setting up Node Unblocker and using it for a web scraping project.
First, create a new directory for your project and initialize a new Node.js project:
mkdir node-unblocker-scraper
cd node-unblocker-scraper
npm init -y
Now, install the required dependencies:
npm install express unblocker puppeteer
Create a new file called proxy-server.js and add the following code:
const express = require('express');
const Unblocker = require('unblocker');
const app = express();
const unblocker = new Unblocker({prefix: '/proxy/'});
// Use unblocker middleware
app.use(unblocker);
const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
    console.log(`Proxy server running on http://localhost:${PORT}/proxy/`);
}).on('upgrade', unblocker.onUpgrade);
This code sets up a basic Express server with Node Unblocker middleware. The /proxy/ prefix will be used for all proxied requests.
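You can sanity-check the proxy before writing any scraper code: start the server and request a page through the prefix, for example with curl (example.com is just a placeholder target):

node proxy-server.js
curl http://localhost:3000/proxy/https://example.com/

If everything is wired up correctly, the second command should return the HTML of the target page.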
Now, let's create a web scraper that uses our proxy server. Create a new file called scraper.js and add the following code:
const puppeteer = require('puppeteer');
async function scrapeWithProxy(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Use the proxy for this request
    const proxiedUrl = `http://localhost:3000/proxy/${url}`;
    await page.goto(proxiedUrl, {waitUntil: 'networkidle0'});

    // Example: Scrape all paragraph texts
    const paragraphs = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('p')).map(p => p.textContent);
    });

    await browser.close();
    return paragraphs;
}

// Usage
scrapeWithProxy('https://example.com')
    .then(data => console.log(data))
    .catch(error => console.error('Scraping failed:', error));
This script uses Puppeteer to open a web page through our proxy server and scrape all paragraph texts.
To run your web scraping setup:
1. Start the proxy server:
node proxy-server.js
2. In a new terminal, run the scraper:
node scraper.js
One of the powerful features of Node Unblocker is its support for custom middleware. This allows you to modify requests and responses, adding functionality such as request throttling, user agent rotation, or content modification.
Here's an example of how to add custom middleware to rotate user agents:
const express = require('express');
const Unblocker = require('unblocker');

const app = express();

const userAgents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0'
];

function rotateUserAgent(data) {
    const randomUserAgent = userAgents[Math.floor(Math.random() * userAgents.length)];
    data.headers['user-agent'] = randomUserAgent;
}

const unblocker = new Unblocker({
    prefix: '/proxy/',
    requestMiddleware: [
        rotateUserAgent
    ]
});

app.use(unblocker);

const PORT = process.env.PORT || 3000;

app.listen(PORT, () => {
    console.log(`Proxy server with user agent rotation running on http://localhost:${PORT}/proxy/`);
}).on('upgrade', unblocker.onUpgrade);
This middleware will randomly select a user agent for each request, helping to make your scraping activities less detectable.
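User agent rotation is only one use of requestMiddleware; the same hook can adjust any outgoing request header. As a minimal sketch, here is a second middleware function that sets a fixed Accept-Language header (the header value is just an illustration), which you would add to the same requestMiddleware array alongside rotateUserAgent:

// Set a fixed Accept-Language header on every proxied request
function setAcceptLanguage(data) {
    data.headers['accept-language'] = 'en-US,en;q=0.9';
}

// In the Unblocker config:
// requestMiddleware: [rotateUserAgent, setAcceptLanguage]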
While Node Unblocker provides a solid foundation for web scraping, you may still encounter some common challenges. Let's explore how to address these issues:
Many modern websites use JavaScript to load content dynamically. To scrape such sites effectively, you need to wait for the content to load before extracting data. Here's how you can modify your scraper to handle dynamic content:
const puppeteer = require('puppeteer');

async function scrapeWithProxy(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    const proxiedUrl = `http://localhost:3000/proxy/${url}`;
    await page.goto(proxiedUrl, {waitUntil: 'networkidle0'});

    // Wait for a specific element to load
    await page.waitForSelector('.dynamic-content');

    // Now scrape the dynamic content
    const dynamicContent = await page.evaluate(() => {
        return document.querySelector('.dynamic-content').textContent;
    });

    await browser.close();
    return dynamicContent;
}
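Note that waitForSelector throws if the element never appears, so for less predictable pages you may want to cap the wait and handle the failure yourself. A minimal variation (the 10-second timeout and the .dynamic-content selector are placeholders):

// Wait up to 10 seconds for the element; bail out gracefully if it never appears
try {
    await page.waitForSelector('.dynamic-content', { timeout: 10000 });
} catch (error) {
    console.warn('Dynamic content did not load in time:', error.message);
    await browser.close();
    return null;
}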
Many websites split their content across multiple pages. Here's how you can modify your scraper to handle pagination:
async function scrapeWithPagination(url) {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    let allData = [];
    let currentPage = 1;
    let hasNextPage = true;

    while (hasNextPage) {
        // Route each page request through the local Node Unblocker proxy
        await page.goto(`http://localhost:3000/proxy/${url}?page=${currentPage}`);

        // Scrape data from the current page (replace with your own logic;
        // paragraph texts are used here as an example)
        const pageData = await page.evaluate(() => {
            return Array.from(document.querySelectorAll('p')).map(p => p.textContent);
        });
        allData = allData.concat(pageData);

        // Check if there's a next page
        hasNextPage = await page.evaluate(() => {
            return !!document.querySelector('.next-page');
        });
        currentPage++;
    }

    await browser.close();
    return allData;
}
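When looping over many pages like this, it's also good practice to pause between requests so you don't overwhelm the target server. A simple sketch (the one-second delay is arbitrary; tune it to the site you're scraping):

// Small helper to wait between page loads
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Inside the while loop, after scraping each page:
await delay(1000);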
As your web scraping needs grow, you may need to scale your operations. Here are some strategies to consider:
You can distribute your scraping tasks across multiple machines or cloud instances to increase throughput. Tools like Apache Airflow or Celery can help manage distributed tasks.
Implement a queueing system like RabbitMQ or Redis to manage scraping tasks efficiently, especially when dealing with large numbers of URLs.
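As a rough illustration of this approach, here is a minimal sketch using BullMQ on top of Redis (npm install bullmq); the queue name, the job payload, and the call to scrapeWithProxy from earlier are assumptions, and a Redis instance is expected to be running locally:

const { Queue, Worker } = require('bullmq');

const connection = { host: 'localhost', port: 6379 };
const scrapeQueue = new Queue('scrape', { connection });

// Producer: enqueue URLs to scrape
async function enqueueUrls(urls) {
    for (const url of urls) {
        await scrapeQueue.add('scrape-url', { url });
    }
}

// Consumer: process queued URLs using the scraper built earlier
const worker = new Worker('scrape', async (job) => {
    return scrapeWithProxy(job.data.url);
}, { connection });

worker.on('failed', (job, err) => console.error(`Job ${job.id} failed:`, err.message));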
For large-scale scraping operations, consider using databases designed for big data, such as MongoDB or Cassandra, to store your scraped data efficiently.
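For example, persisting the scraped results to MongoDB could look roughly like this (the database and collection names are placeholders, and it assumes a local MongoDB instance plus the official mongodb driver installed via npm):

const { MongoClient } = require('mongodb');

async function saveResults(results) {
    const client = new MongoClient('mongodb://localhost:27017');
    try {
        await client.connect();
        const collection = client.db('scraping').collection('pages');
        // One document per scraped page, with a timestamp for later analysis
        await collection.insertMany(results.map((r) => ({ ...r, scrapedAt: new Date() })));
    } finally {
        await client.close();
    }
}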
While Node Unblocker provides an excellent foundation for web scraping, the choice of proxy can significantly impact the success and efficiency of your scraping operations. This is where Stat Proxies' residential ISP proxies come into play.
Integrating Stat Proxies with your Node Unblocker setup is straightforward. Here's a basic example:
const fs = require('fs');
const express = require('express');
const Unblocker = require('unblocker');

// Load the proxy list (one host:port:username:password entry per line)
const statProxiesList = fs.readFileSync('statproxies.txt', 'utf8').split('\n').filter(Boolean);

const app = express();

const unblocker = new Unblocker({
    prefix: '/proxy/',
    requestMiddleware: [
        (data) => {
            // Pick a random proxy from the list for each request
            const [host, port, username, password] = statProxiesList[Math.floor(Math.random() * statProxiesList.length)].split(':');
            data.proxy = {
                host,
                port,
                auth: {
                    username,
                    password
                }
            };
        }
    ]
});

app.use(unblocker);

app.listen(3000, () => {
    console.log('Proxy server with Stat Proxies integration running on http://localhost:3000/proxy/');
});
This setup ensures that each request going through Node Unblocker uses a fresh residential ISP proxy from Stat Proxies, maximizing your chances of successful scraping.
Node Unblocker, combined with Stat Proxies' residential ISP proxies, provides a powerful solution for your web scraping needs. By leveraging the flexibility of Node Unblocker and the reliability of our proxy network, you can overcome common scraping challenges and scale your operations effectively.
Remember to always scrape responsibly, respecting website terms of service and implementing rate limiting to avoid overwhelming target servers. With the right tools and practices, web scraping can be an incredibly valuable source of data for your projects and business intelligence needs.
Ready to supercharge your web scraping? Sign up for Stat Proxies today and experience the difference our residential ISP proxies can make in your data collection efforts!