Need Advice for a Large-scale Web Scraping Project: VPNs vs. Proxies

Hi everyone,

I’m embarking on a substantial web scraping project and could use some guidance. I’m targeting a website with around 4 million product pages, available in two languages, which brings the total to about 8 million pages. The silver lining is that these pages can be accessed via .json links, offering a way to minimize traffic impact. The site is protected by Cloudflare, but I’ve managed to bypass this using a VPN provider so far.

In my tests, I’ve successfully run 5 Docker containers concurrently under a single VPN account, each using different IPs. However, this is just the start. I plan to scale my operations to scrape up to 100 million pages. This scaling brings me to a crossroads: should I invest in 3-4 VPN accounts, or would proxies be a better route? I’ve never used proxies before, so I’m particularly interested in insights about their effectiveness, cost, and how they might compare to using multiple VPN accounts for a project of this scale.

Any advice, experiences, or tips you can share would be incredibly valuable, especially regarding handling large-scale scraping projects, managing IP rotation, and staying under the radar of protections like Cloudflare.

Thanks in advance for your help!

I think public proxies are relatively cheap to access and easy to rotate programmatically, and you can also randomize proxies across your Docker containers.

You also need to consider whether it’s against the ToS of the VPN. It might end up getting blocked or costing you more in the long run.

What’s your budget for proxies/VPNs? I run a pretty large scraping project for a client that revolves around targeted ads and SEO. We use residential and mobile proxies, and mobile hotspots like NetGear Nighthawks.

Typically our scrapers will hit an intermediate proxy that we run internally using something like squid. Squid will be listening on a range of ports. Depending on the port the scraper’s web request comes in on, the traffic will get sent out to a specific proxy, e.g. port 3177’s next cache peer might be a mobile proxy in LA. Using this technique we’ve had success scraping various data across the web. Each request made by a scraper goes out a different random proxy unless there is a specific need for a specific proxy or location.
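For anyone who hasn’t set this up before, the port-to-peer mapping described above can be sketched in squid.conf roughly like this (the port number, peer hostname, and names are placeholders for illustration, not the actual setup):

```
# Listen on a dedicated port per upstream proxy.
http_port 3177 name=port3177

# Define an upstream (cache peer), e.g. a mobile proxy in LA.
cache_peer mobile-la.proxyprovider.example parent 8080 0 no-query proxy-only name=la_mobile

# Route traffic arriving on port 3177 out through that peer only.
acl on_port3177 myportname port3177
cache_peer_access la_mobile allow on_port3177
cache_peer_access la_mobile deny all

# Force all traffic through a peer rather than going direct.
never_direct allow all
```

Repeat the `http_port`/`cache_peer`/`acl` trio for each upstream, and the scrapers only ever need to know which local port maps to which exit.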

We also rotate the types of scrapers we use, ranging from Selenium Grid via Docker Swarm, to Playwright, to simple Python programs using HTTPX or niquests.
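On the HTTPX side, the per-request random proxy selection is only a few lines. A minimal sketch, assuming a pool of local squid ports like the ones above (the addresses are placeholders):

```python
import random

# Placeholder proxy endpoints -- swap in your real squid ports/providers.
PROXIES = [
    "http://127.0.0.1:3177",
    "http://127.0.0.1:3178",
    "http://127.0.0.1:3179",
]

def pick_proxy() -> str:
    """Pick a random proxy for the next request."""
    return random.choice(PROXIES)

def fetch(url: str):
    """Fetch a URL, sending each call out a different random proxy."""
    # Requires httpx (`pip install httpx`); imported lazily so the
    # selection logic above is usable on its own.
    import httpx
    with httpx.Client(proxy=pick_proxy(), timeout=30.0) as client:
        return client.get(url)
```

Because the proxy is chosen per `fetch()` call, consecutive requests naturally spread across the pool without any shared state between workers.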

PM me if you want more info.

If you have not already done it, check whether you need all the product pages. If you are collecting the URLs from, for example, the sitemap, then some of the products may be expired, etc.

On some of the sites I scrape, every old and new product sits under the same path, so my scraper includes a product-status parameter. If a product is expired, the script appends the URL to an expired CSV file that it checks against on the next run. Rinse, repeat. Once every month or three months I do a full check, just in case some products are “un-expired”.
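That expired-list bookkeeping can be sketched in a few lines of Python (the filename and the periodic full-check flag are illustrative, not the actual script):

```python
import csv
from pathlib import Path

EXPIRED_FILE = Path("expired.csv")  # illustrative filename

def load_expired() -> set:
    """URLs already recorded as expired on previous runs."""
    if not EXPIRED_FILE.exists():
        return set()
    with EXPIRED_FILE.open(newline="") as f:
        return {row[0] for row in csv.reader(f) if row}

def mark_expired(url: str) -> None:
    """Append a newly-expired URL so future runs skip it."""
    with EXPIRED_FILE.open("a", newline="") as f:
        csv.writer(f).writerow([url])

def should_scrape(url: str, expired: set, full_check: bool = False) -> bool:
    # On the periodic full check, revisit everything in case
    # some products were "un-expired" since last time.
    return full_check or url not in expired
```

A normal run loads the set once, skips anything in it, and appends new expirations; the monthly/quarterly run just passes `full_check=True`.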

Have you tried it with a large number of requests yet? My guess is that after a couple thousand the VPN will start to fail.

Interesting, any direction to point me in?

Hmm, I didn’t take that into consideration, but I’m going to check. Thanks for the tip!

What volume are you handling?

Good tip for the expired ones. I’m going to add this feature to my script, too; I didn’t take it into consideration. Thanks!

Nice read, probably one of the best resources I’ve seen on the topic, especially with the sub-articles about each anti-bot provider linked from it.


No, that’s why I wanted to do a bit of research before doing anything, to set everything up properly rather than start my project and hit a wall afterward.

DM me for more info.

Usually scraping via VPN can get your account banned immediately if it’s against the provider’s ToS.

What do you mean by volume? Clients, sites, ads???

Thanks, I think the most important thing to note is that all anti-bot services usually calculate a trust score that decides whether to let you in. So, if you have a lot of time for this project, you can slowly chip away at it with limited resources.

This is one upside of how VPNs could work for long-term, slow scraping projects, since the IP is shared with real users who could raise your trust score. In reality, though, all the big anti-bot teams can just pull the VPN IP ranges (or track the ASNs) and mark them, so maybe it could work with some lesser-known but decent VPNs? If you’re really strapped for resources, that could be an interesting option to try. Check out wireproxy if you want to try VPNs as proxies.
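For context, wireproxy takes a standard WireGuard client config and exposes the tunnel as a local SOCKS5 proxy, so your scrapers can treat the VPN like any other proxy endpoint. A config sketch (keys, endpoint, and addresses are placeholders, not a working setup):

```
# wireproxy config: a WireGuard client section plus a local SOCKS5 listener.
[Interface]
PrivateKey = <your-wireguard-private-key>
Address = 10.0.0.2/32
DNS = 1.1.1.1

[Peer]
PublicKey = <vpn-server-public-key>
Endpoint = vpn.example.com:51820
AllowedIPs = 0.0.0.0/0

[Socks5]
BindAddress = 127.0.0.1:25344
```

With that running, pointing a scraper at `socks5://127.0.0.1:25344` sends its traffic out through the VPN without touching the host’s routing table, and you can run one wireproxy instance per VPN exit.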