I’m embarking on a substantial web scraping project and could use some guidance. I’m targeting a website with around 4 million product pages, available in two languages, which brings the total to about 8 million pages. The silver lining is that these pages can be accessed via .json links, offering a way to minimize traffic impact. The site is protected by Cloudflare, but so far I’ve managed to bypass this using a VPN provider.
In my tests, I’ve successfully run 5 Docker containers concurrently under a single VPN account, each using different IPs. However, this is just the start. I plan to scale my operations to scrape up to 100 million pages. This scaling brings me to a crossroads: should I invest in 3-4 VPN accounts, or would proxies be a better route? I’ve never used proxies before, so I’m particularly interested in insights about their effectiveness, cost, and how they might compare to using multiple VPN accounts for a project of this scale.
Any advice, experiences, or tips you can share would be incredibly valuable, especially regarding handling large-scale scraping projects, managing IP rotation, and staying under the radar of protections like Cloudflare.
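For context, each container currently runs something along these lines (a simplified sketch; the .json URL pattern and local proxy address are placeholders, not the real site, and it assumes a recent httpx):

```python
import time
import httpx

# Simplified version of what each container runs; URL pattern and proxy are placeholders.
PROXY = "http://127.0.0.1:8118"                          # the container's VPN/proxy endpoint
URL = "https://example.com/{lang}/products/{pid}.json"   # hypothetical .json endpoint

def fetch_products(product_ids, lang="en"):
    with httpx.Client(proxy=PROXY, timeout=30.0,
                      headers={"User-Agent": "Mozilla/5.0"}) as client:
        for pid in product_ids:
            resp = client.get(URL.format(lang=lang, pid=pid))
            if resp.status_code == 200:
                yield pid, resp.json()
            time.sleep(0.5)                              # crude politeness delay
```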
I think public proxies are relatively cheap to access and easy to rotate programmatically, and you can also randomize which proxy each Docker container uses.
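Per-request rotation in Python could look roughly like this (the proxy URLs are made up; with public proxies you’d also want health checks and retries):

```python
import random
import httpx

# Hypothetical proxy pool; each Docker container could get its own slice via an env var.
PROXIES = [
    "http://user:pass@proxy1.example.net:8080",
    "http://user:pass@proxy2.example.net:8080",
    "http://user:pass@proxy3.example.net:8080",
]

def get_with_random_proxy(url: str) -> httpx.Response:
    proxy = random.choice(PROXIES)                  # different exit IP per request
    with httpx.Client(proxy=proxy, timeout=30.0) as client:
        return client.get(url)
```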
What’s your budget for proxies/VPNs? I run a pretty large scraping project for a client that revolves around targeted ads and SEO. We use residential and mobile proxies, and mobile hotspots like NetGear Nighthawks.
Typically our scrapers will hit an intermediate proxy that we run internally using something like squid. Squid will be listening on a range of ports. Depending on the port the scraper’s web request comes in on, the traffic will get sent out to a specific proxy, e.g. port 3177’s next cache peer might be a mobile proxy in LA. Using this technique we’ve had success scraping various data across the web. Each request made by a scraper goes out a different random proxy unless there is a specific need for a specific proxy or location.
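To make the routing concrete, here’s a rough sketch of the scraper side, assuming squid maps each listening port to a specific cache_peer (the host and port range are invented):

```python
import random
import httpx

SQUID_HOST = "10.0.0.5"                 # internal squid box (made-up address)
SQUID_PORTS = list(range(3170, 3180))   # each port maps to one cache_peer in squid.conf

def fetch_via_squid(url: str, port: int | None = None) -> httpx.Response:
    # Random exit per request unless a specific location is needed,
    # e.g. fetch_via_squid(url, port=3177) to force the LA mobile proxy.
    port = port or random.choice(SQUID_PORTS)
    with httpx.Client(proxy=f"http://{SQUID_HOST}:{port}", timeout=60.0) as client:
        return client.get(url)
```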
We also rotate the types of scrapers we use ranging from Selenium Grid via Docker swarm, to playwright, to simple Python programs using HTTPX or niquests.
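Since HTTPX and niquests expose nearly the same API, the plain-client side of that rotation can be as small as this sketch (browser-based scrapers are kept for pages that actually need JavaScript):

```python
import random
import httpx
import niquests  # drop-in requests-compatible client

def fetch(url: str, proxy: str) -> str:
    # Alternate between two plain HTTP clients at random.
    if random.random() < 0.5:
        with httpx.Client(proxy=proxy, timeout=30.0) as client:
            return client.get(url).text
    return niquests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30).text
```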
If you have not already done it, check whether you actually need all the product pages. If you are collecting the URLs from, for example, the sitemap, then some of the products may be expired, etc.
On some of the sites I scrape, every old and new product sits under the same sitemap, so my scraper includes a product-status check. If a product is expired, it appends the URL to an expired CSV file that the script checks against the next time. Rinse and repeat. Once every month or three months I do a full check, just in case some products are “un-expired”.
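In rough Python, the skip-list flow looks something like this (the "status" field name is a guess; the monthly/quarterly full pass just runs with an empty skip-list):

```python
import csv
import httpx

EXPIRED_FILE = "expired.csv"   # skip-list that grows across runs

def load_expired() -> set[str]:
    try:
        with open(EXPIRED_FILE, newline="") as f:
            return {row[0] for row in csv.reader(f)}
    except FileNotFoundError:
        return set()

def scrape(urls):
    expired = load_expired()
    with httpx.Client(timeout=30.0) as client, open(EXPIRED_FILE, "a", newline="") as out:
        writer = csv.writer(out)
        for url in urls:
            if url in expired:
                continue                             # known expired, skip this run
            data = client.get(url).json()
            if data.get("status") == "expired":      # field name is a guess
                writer.writerow([url])               # remember for next time
                continue
            yield url, data
```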
Nice read, probably one of the best resources I’ve seen on the topic out there, especially with the sub-articles linked for each anti-bot protection provider.
Thanks, I think the most important thing to note is that all anti-bot services usually calculate a trust score which decides whether to let you in. So, if you have a lot of time for this project you can slowly chip away at it with limited resources.
This is one upside of VPNs for long-term, slow scraping projects: the IP is shared with real users, which could raise your trust score. In reality, though, all the big anti-bot teams can just pull the VPN IP ranges (or track the ASNs) and flag them, so maybe it could work with some lesser-known but still decent VPNs? If you’re really strapped for resources, that could be an interesting option to try. Check out wireproxy if you want to use VPNs as proxies.
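For example, wireproxy can expose a WireGuard tunnel as a local SOCKS5 endpoint, and the scraper just treats it like any other proxy (the bind address below is whatever you put in wireproxy’s config; httpx needs the socks extra installed):

```python
import httpx

# wireproxy exposes the WireGuard tunnel as a local SOCKS5 proxy;
# the address below is whatever BindAddress you set in its config.
WIREPROXY_SOCKS = "socks5://127.0.0.1:25344"

# Requires: pip install "httpx[socks]"
with httpx.Client(proxy=WIREPROXY_SOCKS, timeout=30.0) as client:
    resp = client.get("https://example.com/en/products/123.json")
    print(resp.status_code, len(resp.content))
```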