I’m embarking on a substantial web scraping project and could use some guidance. I’m targeting a website with around 4 million product pages, available in two languages, which brings the total to about 8 million pages. The silver lining is that these pages can be accessed via .json links, offering a way to minimize traffic impact. The site is protected by Cloudflare, but so far I’ve managed to bypass this using a VPN provider.
In my tests, I’ve successfully run 5 Docker containers concurrently under a single VPN account, each using different IPs. However, this is just the start. I plan to scale my operations to scrape up to 100 million pages. This scaling brings me to a crossroads: should I invest in 3-4 VPN accounts, or would proxies be a better route? I’ve never used proxies before, so I’m particularly interested in insights about their effectiveness, cost, and how they might compare to using multiple VPN accounts for a project of this scale.
Any advice, experiences, or tips you can share would be incredibly valuable, especially regarding handling large-scale scraping projects, managing IP rotation, and staying under the radar of protections like Cloudflare.
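For context, each container currently runs something along these lines (a simplified sketch; the .json URL pattern and local proxy address are placeholders, not the real site, and it assumes a recent httpx):

```python
import time
import httpx

# Simplified version of what each container runs; URL pattern and proxy are placeholders.
PROXY = "http://127.0.0.1:8118"                          # the container's VPN/proxy endpoint
URL = "https://example.com/{lang}/products/{pid}.json"   # hypothetical .json endpoint

def fetch_products(product_ids, lang="en"):
    with httpx.Client(proxy=PROXY, timeout=30.0,
                      headers={"User-Agent": "Mozilla/5.0"}) as client:
        for pid in product_ids:
            resp = client.get(URL.format(lang=lang, pid=pid))
            if resp.status_code == 200:
                yield pid, resp.json()
            time.sleep(0.5)                              # crude politeness delay
```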
I think public proxies are relatively cheap to access and easy to rotate programmatically, and you can also randomize which proxy each Docker container uses.
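Per-request rotation in Python could look roughly like this (the proxy URLs are made up; with public proxies you’d also want health checks and retries):

```python
import random
import httpx

# Hypothetical proxy pool; each Docker container could get its own slice via an env var.
PROXIES = [
    "http://user:pass@proxy1.example.net:8080",
    "http://user:pass@proxy2.example.net:8080",
    "http://user:pass@proxy3.example.net:8080",
]

def get_with_random_proxy(url: str) -> httpx.Response:
    proxy = random.choice(PROXIES)                  # different exit IP per request
    with httpx.Client(proxy=proxy, timeout=30.0) as client:
        return client.get(url)
```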
What’s your budget for proxies/VPNs? I run a pretty large scraping project for a client that revolves around targeted ads and SEO. We use residential and mobile proxies, and mobile hotspots like NetGear Nighthawks.
Typically our scrapers will hit an intermediate proxy that we run internally using something like squid. Squid will be listening on a range of ports. Depending on the port the scraper’s web request comes in on, the traffic will get sent out to a specific proxy, e.g. port 3177’s next cache peer might be a mobile proxy in LA. Using this technique we’ve had success scraping various data across the web. Each request made by a scraper goes out a different random proxy unless there is a specific need for a specific proxy or location.
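To make the routing concrete, here’s a rough sketch of the scraper side, assuming squid maps each listening port to a specific cache_peer (the host and port range are invented):

```python
import random
import httpx

SQUID_HOST = "10.0.0.5"                 # internal squid box (made-up address)
SQUID_PORTS = list(range(3170, 3180))   # each port maps to one cache_peer in squid.conf

def fetch_via_squid(url: str, port: int | None = None) -> httpx.Response:
    # Random exit per request unless a specific location is needed,
    # e.g. fetch_via_squid(url, port=3177) to force the LA mobile proxy.
    port = port or random.choice(SQUID_PORTS)
    with httpx.Client(proxy=f"http://{SQUID_HOST}:{port}", timeout=60.0) as client:
        return client.get(url)
```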
We also rotate the types of scrapers we use ranging from Selenium Grid via Docker swarm, to playwright, to simple Python programs using HTTPX or niquests.
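Since HTTPX and niquests expose nearly the same API, the plain-client side of that rotation can be as small as this sketch (browser-based scrapers are kept for pages that actually need JavaScript):

```python
import random
import httpx
import niquests  # drop-in requests-compatible client

def fetch(url: str, proxy: str) -> str:
    # Alternate between two plain HTTP clients at random.
    if random.random() < 0.5:
        with httpx.Client(proxy=proxy, timeout=30.0) as client:
            return client.get(url).text
    return niquests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30).text
```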
If you have not already done it, check whether you actually need all the product pages. If you are collecting the URLs from, for example, the sitemap, then some of the products may be expired, etc.
On some of the sites I scrape, every old and new product sits under the same sitemap, so my scraper includes a product-status check. If a product is expired, it appends the URL to an expired CSV file that the script checks against the next time. Rinse and repeat. Once every month or three months I do a full check, just in case some products are “un-expired”.
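In rough Python, the skip-list flow looks something like this (the "status" field name is a guess; the monthly/quarterly full pass just runs with an empty skip-list):

```python
import csv
import httpx

EXPIRED_FILE = "expired.csv"   # skip-list that grows across runs

def load_expired() -> set[str]:
    try:
        with open(EXPIRED_FILE, newline="") as f:
            return {row[0] for row in csv.reader(f)}
    except FileNotFoundError:
        return set()

def scrape(urls):
    expired = load_expired()
    with httpx.Client(timeout=30.0) as client, open(EXPIRED_FILE, "a", newline="") as out:
        writer = csv.writer(out)
        for url in urls:
            if url in expired:
                continue                             # known expired, skip this run
            data = client.get(url).json()
            if data.get("status") == "expired":      # field name is a guess
                writer.writerow([url])               # remember for next time
                continue
            yield url, data
```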
Nice read, probably one of the best resources I’ve seen on the topic out there, especially with the sub-articles linked for each anti-bot protection provider.
Thanks, I think the most important thing to note is that all anti-bot services usually calculate a trust score which decides whether to let you in. So, if you have a lot of time for this project you can slowly chip away at it with limited resources.
This is one upside of VPNs for long-term, slow scraping projects: the IP is shared with real users, which could raise your trust score. In reality, though, all the big anti-bot teams can just pull the VPN IP ranges (or track the ASNs) and flag them, so maybe it could work with some lesser-known but still decent VPNs? If you’re really strapped for resources, that could be an interesting option to try. Check out wireproxy if you want to use VPNs as proxies.
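For example, wireproxy can expose a WireGuard tunnel as a local SOCKS5 endpoint, and the scraper just treats it like any other proxy (the bind address below is whatever you put in wireproxy’s config; httpx needs the socks extra installed):

```python
import httpx

# wireproxy exposes the WireGuard tunnel as a local SOCKS5 proxy;
# the address below is whatever BindAddress you set in its config.
WIREPROXY_SOCKS = "socks5://127.0.0.1:25344"

# Requires: pip install "httpx[socks]"
with httpx.Client(proxy=WIREPROXY_SOCKS, timeout=30.0) as client:
    resp = client.get("https://example.com/en/products/123.json")
    print(resp.status_code, len(resp.content))
```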