Question about scraping

I’m trying to create an app that scans a supermarket receipt and returns the data as a JavaScript object. The idea is to let the user scan a QR code or enter the receipt link directly. The link is then sent to the Node server, which makes a request to the URL, scrapes the HTML response with jsdom, and converts the resulting data into a JavaScript object.
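On the server the flow is roughly this (a simplified sketch, not the actual ticketParser.js; the selectors are placeholders for the real receipt markup):

```js
// Simplified sketch of the server-side step: fetch the receipt page,
// parse it with jsdom, and return a plain JavaScript object.
// The selectors (".product-row", ".name", ".price") are placeholders.
import { JSDOM } from "jsdom";

async function parseReceipt(url) {
  const response = await fetch(url); // native fetch on Node 18
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  const html = await response.text();

  const { document } = new JSDOM(html).window;
  const items = [...document.querySelectorAll(".product-row")].map((row) => ({
    name: row.querySelector(".name")?.textContent.trim(),
    price: Number(row.querySelector(".price")?.textContent.replace(",", ".")),
  }));

  return { url, items };
}
```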

The app runs smoothly when executed locally. However, when deployed, the request made to the backend returns a timeout error. The error message appears as follows:

When using node-fetch 2.6.11:

```
FetchError: request to https://url/xx/xxxxx/xxxxxx/ failed, reason: connect ETIMEDOUT 2xx.xx.xx.xx:xxx
```

When using native fetch:

```
TypeError: fetch failed
    at Object.fetch (node:internal/deps/undici/undici:11413:11)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async getData (/opt/render/project/src/src/v1/parsers/ticketParser.js:12:24)
    at async Object.scanNewDiscoTicket (/opt/render/project/src/src/v1/parsers/ticketParser.js:77:20)
    at async Object.scanNewDiscoTicket (/opt/render/project/src/src/v1/services/ticketService.js:7:27)
    at async scanDiscoTicket (/opt/render/project/src/src/v1/controllers/ticketController.js:19:27) {
  cause: ConnectTimeoutError: Connect Timeout Error
      at onConnectTimeout (node:internal/deps/undici/undici:8380:28)
      at node:internal/deps/undici/undici:8338:50
      at Immediate._onImmediate (node:internal/deps/undici/undici:8369:13)
      at process.processImmediate (node:internal/timers:476:21) {
    code: 'UND_ERR_CONNECT_TIMEOUT'
  }
}
```

I have deployed the app on both Render and Railway, but encountered the same issue on both platforms. I suspect that the ticket URL may not respond to requests originating from data center IPs. However, I’m not completely certain about this, as the error I’m receiving doesn’t seem to be related to authentication or bad request problems.

How can I confirm this? Are there any workarounds you would recommend?
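For instance, would routing the scrape through a proxy, so the egress IP isn’t a data-center address, make sense? A rough sketch with undici’s ProxyAgent (the proxy URL is a placeholder):

```js
// Sketch of a possible workaround: send the request through a proxy so the
// egress IP is not a Render/Railway data-center address.
// PROXY_URL is a placeholder, e.g. "http://user:pass@proxy.example.com:8080".
import { fetch, ProxyAgent } from "undici";

async function fetchThroughProxy(url) {
  const dispatcher = new ProxyAgent(process.env.PROXY_URL);
  const res = await fetch(url, { dispatcher });
  return res.text();
}
```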

here’s the repo: https://github.com/luissimosa199/ticketscanner

and here’s the file where the fetch happens: https://github.com/luissimosa199/ticketscanner/blob/master/src/v1/parsers/ticketParser.js

node v18.15.0

Are you connected to a VPN locally?

If you are, try connecting to the URL without the VPN. Given that the URL seems to resolve to an IP, I’m guessing the domain is public, but there may be no public route to the host you’re trying to reach.

One thing you could check is whether your hosting provider allows external network access at all. Just create a simple Node.js script that fetches www.google.com and your ticket site, and see whether you get a response back from both URLs.

If both fail, then you know your hosting provider is blocking the network; if www.google.com succeeds and the other fails, then the ticket URL’s host could be blocking traffic from your host, or vice versa.
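Something along these lines (a sketch; replace the second URL with your real ticket link, and save it as an .mjs file so top-level await works):

```js
// check.mjs — run this on the hosting provider.
// Replace the second URL with the real ticket link.
const urls = ["https://www.google.com", "https://ticket-url.example"];

for (const url of urls) {
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(10_000) });
    console.log(`${url} -> HTTP ${res.status}`);
  } catch (err) {
    console.error(`${url} -> ${err.cause ?? err}`);
  }
}
```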

No VPN. I’m not sure I follow, but the same URL works just fine locally (even when accessing the link directly; it’s a basic HTML file). I get the same result in both Postman and the browser.

I will check it, thanks!

The connection timeout shows the hostname resolving to an IP (200.x.x.x), so there must be a public DNS entry for it, since you’re in the public cloud.

What I’m curious about is whether DNS for the ticketing system you’re trying to reach might not be hosted publicly, which is why I asked if you were on a VPN locally.
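A quick way to compare is to resolve the hostname from both your machine and the deployed host, e.g. with node:dns (a sketch; the hostname is a placeholder, run as an .mjs file):

```js
// resolve.mjs — compare what DNS returns locally vs. on the deployed host.
// The hostname is a placeholder for the real receipt host.
import { lookup, resolve4 } from "node:dns/promises";

const host = "tickets.example.com";

console.log(await lookup(host));   // what this machine's resolver returns
console.log(await resolve4(host)); // A records from the configured DNS servers
```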

The ticket is publicly accessible. I think the problem could be related to the data-center IP (Railway/Render) or, less likely, the location of the request. What intrigues me the most is the timeout error: literally zero info about why the request failed. I’m kinda new to web scraping, but I would expect at least an error code or something.
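To be fair, the `code: 'UND_ERR_CONNECT_TIMEOUT'` in the trace above does come from `err.cause`; something like this hypothetical sketch (not my actual parser code) logs it explicitly:

```js
// Hypothetical sketch (not the actual ticketParser.js): log err.cause so the
// undici error code is visible instead of just "TypeError: fetch failed".
async function fetchTicketHtml(ticketUrl) {
  try {
    const res = await fetch(ticketUrl);
    return await res.text();
  } catch (err) {
    // Native fetch puts the low-level failure on err.cause,
    // e.g. code 'UND_ERR_CONNECT_TIMEOUT' for a connect timeout.
    console.error(err.message, err.cause?.code, err.cause?.message);
    throw err;
  }
}
```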

Anyways, thanks for commenting!