Some websites can’t be scraped

This topic is: resolved

 


Viewing 1 reply thread
  • Author
    Posts
    • #3217


      teddychu2001
      Participant
      Post count: 9

      Below are some example websites that can't be scraped:

      https://gsuitetips.com/news/ (This one appears blank when using Visual Selector)

      bestforandroid.com (This one shows the error below in Crawling Helper)

      https://techcult.com/ (This one shows the error below in Crawling Helper)

      I've checked the last two sites in Crawling Helper. It shows "Error in page crawling. Please try again/other webpage."

      How can I scrape websites like these?

    • #3218


      Szabi – CodeRevolution
      Keymaster
      Post count: 4556

      Hello,

      First of all, thank you for your purchase.

      The websites you linked use JavaScript to load their content dynamically, after the page has been loaded in the browser. This dynamic content is not visible to conventional scrapers, because it is not part of the HTML response the server returns for the page; it is added to the page afterwards by JavaScript.
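      As a rough illustration of the difference (this is not how the plugin works internally), the Python sketch below guesses whether a raw HTML response is only a JavaScript "shell". It counts the text found outside script and style tags and flags pages that contain scripts but almost no text. The names VisibleTextCounter and looks_js_rendered and the 200-character threshold are assumptions made for this example.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts text characters found outside script/style-like tags."""

    # Tags whose content a visitor does not see as page text.
    SKIPPED = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped tag
        self.text_chars = 0    # visible text characters seen so far
        self.script_tags = 0   # number of <script> tags seen

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_js_rendered(html, min_text_chars=200):
    """Heuristic: the page has scripts but almost no text in its raw HTML.

    min_text_chars is an arbitrary threshold chosen for this sketch.
    """
    parser = VisibleTextCounter()
    parser.feed(html)
    parser.close()
    return parser.script_tags > 0 and parser.text_chars < min_text_chars
```

      If the HTML returned by a plain HTTP request is flagged by a check like this, the content is most likely being added by JavaScript, and a headless browser is needed to see it. This is only a heuristic, so it can misjudge pages that are genuinely short.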

      The good news is that the plugin can also scrape content from these pages if you combine it with a headless browser, such as Puppeteer or PhantomJS (installed on your server), or with HeadlessBrowserAPI (a service I implemented to handle dynamic content parsing without the need to install a headless browser on your server).

      Please check these tutorial videos for details on this:

      Puppeteer example: https://www.youtube.com/watch?v=g99IlDkt_SY

      HeadlessBrowserAPI example: https://www.youtube.com/watch?v=205EinBQAoo&list=PLEiGTaa0iBIjDrfexapWc3M28iHwJI5tT&index=2

      OnlyFans example: https://www.youtube.com/watch?v=TXAdvsVCuy8

      I hope this info helped.

      Regards,

      Szabi – CodeRevolution.


The topic ‘Some websites can’t be scraped’ is closed to new replies.