Weird Feed Failure

This topic is: resolved

 


This topic has 13 replies, 2 voices, and was last updated 1 year, 10 months ago by richelo.

Viewing 13 reply threads
    • #6541


      richelo
      Participant
      Post count: 8

      Hi,

      I have this feed: https://www.constantcontact.com/blog/feed/

      When you go to it directly, it loads fine. When you load it in an RSS reader, it loads fine.

      When you load it into Echo RSS, you get these errors:

      [4-Jan-2023 19:17:11 UTC] Error in parsing RSS feed (custom method): cURL error 22: The requested URL returned error: 403 for https://www.constantcontact.com/blog/feed/
      [4-Jan-2023 19:17:11 UTC] Exception thrown Failed: https://www.constantcontact.com/blog/feed/

      All other feeds on my site work perfectly; only this one has the issue.

      Thanks

      Rich

    • #6542


      Szabi – CodeRevolution
      Keymaster
      Post count: 4577

      Hello,

      First of all, thank you for your purchase.

      I checked, and this RSS feed is protected by a scraping protection mechanism: it loads only in a browser, not when the plugin downloads the page.
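To check for this kind of blocking outside the plugin, here is a minimal Python sketch that requests a feed as a non-browser client and reports the HTTP status. The helper names (`build_request`, `fetch_status`, `explain_status`) are made up for illustration and have nothing to do with Echo RSS internals. Sending a browser-like User-Agent is only a diagnostic step and may not get past the protection.

```python
import urllib.error
import urllib.request


def build_request(url, user_agent=None):
    """Build a GET request, optionally with a custom User-Agent header."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(url, user_agent=None, timeout=10):
    """Return the HTTP status code for url (performs a real network call)."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful.
        return err.code


def explain_status(status):
    """Give a rough, hypothetical interpretation of a status code."""
    if status == 200:
        return "feed is reachable for this client"
    if status == 403:
        return "request was blocked; the site may only serve real browsers"
    return f"unexpected status {status}"


# Example (not run here because it needs network access):
# print(explain_status(fetch_status("https://www.constantcontact.com/blog/feed/")))
```

If the status is 403 here but the feed opens in a browser, the feed itself is fine and the block is on the server side, which matches the log above.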

      To scrape this site, I recommend you also check the Crawlomatic plugin, which can use Puppeteer (a headless browser that needs to be installed on your server) or HeadlessBrowserAPI (a cloud service that renders websites and gets around their scraping protection).

      Please check this tutorial video for details: https://www.youtube.com/watch?v=ZljpMpmi_dU

      Settings I used in the Crawlomatic plugin which worked for me:

       

      Scraper Start (Seed) URL / Keywords
      https://www.constantcontact.com/blog/

      Content Scraping Method To Use:
      Puppeteer

      Do Not Scrape Seed URL:
      checked

      Seed Page Crawling Query Type:
      Class

      Seed Page Crawling Query String:
      post-title
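For readers wondering what a "Class" query with the string post-title selects, here is a rough Python sketch that collects links found inside elements carrying that class. It is not Crawlomatic's code: the `links_by_class` helper and the sample HTML are invented, and the real blog markup may differ.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags inside elements with a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0   # greater than 0 while inside a matching element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.depth == 0 and self.class_name not in classes:
            return  # outside any matching element
        if tag == "a" and attr_map.get("href"):
            self.links.append(attr_map["href"])
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def links_by_class(html, class_name):
    collector = ClassLinkCollector(class_name)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical seed-page fragment, for illustration only.
SAMPLE = """
<h2 class="post-title"><a href="/blog/first-post/">First</a></h2>
<p><a href="/about/">About</a></p>
<h2 class="post-title entry"><a href="/blog/second-post/">Second</a></h2>
"""

print(links_by_class(SAMPLE, "post-title"))
```

Only the two links inside the post-title headings are collected; the unrelated /about/ link is ignored.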

       

      I hope this info helps.

      Regards, Szabi – CodeRevolution.

    • #6545


      richelo
      Participant
      Post count: 8

      Thank you for your detailed response.

      Can Crawlomatic create excerpts of the scraped posts, or does it only import complete posts? I am not talking about a summary from something like TLDRThis; I mean just an excerpt of the original content.

      Thanks

      Rich.

    • #6546


      Szabi – CodeRevolution
      Keymaster
      Post count: 4577

      Hello,

      Yes, Crawlomatic can automatically create an excerpt based on the content of the post. You can use the %%item_description%% shortcode in the ‘Generated Post Content’ settings field for this.

      Regards.
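As a rough illustration of what "an excerpt of the original content" usually means (the first few words with the markup removed, not a summary), here is a minimal Python sketch. It is not how the %%item_description%% shortcode is implemented; the `make_excerpt` name and the 55-word default are assumptions chosen for the example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Gather the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def make_excerpt(html, max_words=55):
    """Return the first max_words words of the content, without markup."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    words = " ".join(extractor.parts).split()
    if len(words) <= max_words:
        return " ".join(words)
    # Mark the cut so readers can see the text continues.
    return " ".join(words[:max_words]) + "..."


print(make_excerpt("<p>One two <b>three</b> four five</p>", max_words=3))
```

Joining text nodes with a space is a simplification: a word split by an inline tag would be counted as two words.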

    • #6556


      richelo
      Participant
      Post count: 8

      I set up everything exactly as you said, using HeadlessBrowserAPI, but it fails and gives me this in the logs:

      [5-Jan-2023 16:53:17 UTC] An error occurred while getting content from HeadlessBrowserAPI: https://headlessbrowserapi.com/apis/scrape/v1/puppeteer?apikey=MyValidKey&url=https%3A%2F%2Fwww.constantcontact.com%2Fblog%2F&custom_user_agent=Mozilla%2F5.0+%28Windows+NT+6.3%3B+Win64%3B+x64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F60.0.3112.113+Safari%2F537.36&custom_cookies=default&user_pass=default&timeout=default&proxy_url=default&proxy_auth=default&solvecaptcha=1&enableadblock=1 – puppeteer Unhandled Rejection
      Unhandled Rejection, reason: Error: net::ERR_TUNNEL_CONNECTION_FAILED at https://www.constantcontact.com/blog/
          at navigate (/var/www/html/wp-content/plugins/custom-scraper-api/res/puppeteer/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
          at process._tickCallback (internal/process/next_tick.js:68:7)
      /var/www/html/wp-content/plugins/custom-scraper-api/res/puppeteer/puppeteer.js:33
      process.on('unhandledRejection', up => { console.error('Unhandled Rejection, reason:', up);throw up })
      ^
      Error: net::ERR_TUNNEL_CONNECTION_FAILED at https://www.constantcontact.com/blog/
          at navigate (/var/www/html/wp-content/plugins/custom-scraper-api/res/puppeteer/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
          at process._tickCallback (internal/process/next_tick.js:68:7)
      [5-Jan-2023 16:53:22 UTC] Failed to get source web page, importing will not run from this URL! https://www.constantcontact.com/blog/
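The request URL in the log is URL-encoded, which makes it hard to see which options were actually sent. A small Python sketch that decodes it with the standard library is shown below; the URL is copied from the log line above (including the poster's placeholder API key), and the `decode_params` helper name is just for this example.

```python
from urllib.parse import parse_qsl, urlsplit

# Request URL copied from the log entry, split over several lines for readability.
LOGGED_URL = (
    "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer"
    "?apikey=MyValidKey"
    "&url=https%3A%2F%2Fwww.constantcontact.com%2Fblog%2F"
    "&custom_user_agent=Mozilla%2F5.0+%28Windows+NT+6.3%3B+Win64%3B+x64%29"
    "+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29"
    "+Chrome%2F60.0.3112.113+Safari%2F537.36"
    "&custom_cookies=default&user_pass=default&timeout=default"
    "&proxy_url=default&proxy_auth=default&solvecaptcha=1&enableadblock=1"
)


def decode_params(url):
    """Return the query-string parameters of url as a plain dict."""
    return dict(parse_qsl(urlsplit(url).query))


params = decode_params(LOGGED_URL)
for name, value in params.items():
    print(f"{name:>18} = {value}")
```

Decoded this way, the log shows the target page, the user agent, and the proxy-related parameters (`proxy_url`, `proxy_auth`) that went out with the request.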

    • #6557


      Szabi – CodeRevolution
      Keymaster
      Post count: 4577

      This is a proxy issue: the proxies are not able to connect to the site you are scraping. Please set ‘Main Settings’ menu -> ‘Web Proxy Address List’ to:

       

      disabled

    • #6558


      richelo
      Participant
      Post count: 8

      That worked PERFECTLY, thank you!

      There is ONE issue though … The scraper uses today’s date for each post instead of the post’s publish date. I also set the main settings to not import anything before 1 Jan 2023, but this one imported a bunch of posts from December 2022.

      Sorry for being a pain; I am really new to all this. There is one more site, and it does not even have an RSS feed.

      I tried the same settings, but they do not work.

      Could you help with the scraping settings for this one please: https://convertkit.com/resources/

       

    • #6559


      richelo
      Participant
      Post count: 8

      I also just noticed that the Convertkit one does not have dates on the posts/articles. OUCH!

    • #6560


      Szabi – CodeRevolution
      Keymaster
      Post count: 4577

      I am glad to hear that it worked! Yes, when dates are not available, the current date will be used.

      Regards.
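The fallback described above can be sketched in a few lines of Python. This is not Crawlomatic's code, and the function names are hypothetical. It only illustrates the general pattern: a post whose date could not be read gets the current date, and a "do not import before" cutoff cannot exclude a post whose date is unknown.

```python
from datetime import datetime, timezone


def resolve_publish_date(scraped_date, now=None):
    """Use the scraped date when present, otherwise fall back to 'now'."""
    if scraped_date is not None:
        return scraped_date
    return now or datetime.now(timezone.utc)


def should_import(scraped_date, cutoff):
    """Return True unless the post is known to be older than the cutoff."""
    return scraped_date is None or scraped_date >= cutoff


cutoff = datetime(2023, 1, 1, tzinfo=timezone.utc)
december_post = datetime(2022, 12, 15, tzinfo=timezone.utc)

print(should_import(december_post, cutoff))  # date known and too old
print(should_import(None, cutoff))           # date unknown, cutoff cannot apply
```

Under this assumption, a December post would be skipped only if its date was actually extracted from the page.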

    • #6561


      richelo
      Participant
      Post count: 8

      You missed out on a few points …

      • The ConstantContact one HAS dates, but they were all published with today’s date.
      • I set the main settings to not import anything before 1 January 2023, but for ConstantContact, it imported a bunch of posts from December 2022, and yes, the posts have dates.
      • I need help getting https://convertkit.com/resources/ to work in the scraper.
      • Convertkit is the one that does not have dates. I am kind of okay publishing those on the days scraped.
      • One last thing … In the RSS plugin, there is a URL that needs to be run in cron with wget to have the rules run. I don’t see this in the scraping plugin. Does it just put itself in the WP cron?

      That’s all for now. Thank you so much for your help!

       

    • #6562


      Szabi – CodeRevolution
      Keymaster
      Post count: 4577

      Are you using Crawlomatic to scrape https://www.constantcontact.com/blog/? I am still not sure of this, as we spoke both about Echo RSS and Crawlomatic. If yes, please send me a screenshot with your current plugin settings (to kisded@yahoo.com).

      Regards.

    • #6563


      richelo
      Participant
      Post count: 8

      Sorry, yes, I am using Crawlomatic along with HeadlessBrowserAPI. I bought the plugin on Envato today and signed up for an API subscription today as well.

      I will get that screenshot over to you a little later.

    • #6564


      Szabi – CodeRevolution
      Keymaster
      Post count: 4577

      Ok, sure.

    • #6566


      richelo
      Participant
      Post count: 8

      Email sent.


The topic ‘Weird Feed Failure’ is closed to new replies.