Weird Feed Failure | CodeRevolution Support

This topic is: resolved


This topic has 13 replies, 2 voices, and was last updated 2 years, 5 months ago by richelo.

- January 4, 2023 at 7:22 pm #6541
  
  richelo
  Participant
  
  Post count: 8
  
  Hi,
  
  I have this feed: https://www.constantcontact.com/blog/feed/
  
  When you go to it directly, it loads fine. When you load it in an RSS reader, it loads fine.
  
  When you load it into Echo RSS, you get these errors:
  
  [4-Jan-2023 19:17:11 UTC] Error in parsing RSS feed (custom method): cURL error 22: The requested URL returned error: 403 for https://www.constantcontact.com/blog/feed/
  [4-Jan-2023 19:17:11 UTC] Exception thrown Failed: https://www.constantcontact.com/blog/feed/
  
  All other feeds on my site work perfectly; only this one has this issue.
  
  Thanks
  
  Rich
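
One way to reproduce this difference outside the plugin is to request the feed twice, once with the HTTP client's default identity and once with a browser-like User-Agent, and compare the status codes. The sketch below uses only Python's standard library. The helper names are made up for illustration, the User-Agent string is just an example of a browser-like value, and a server may still refuse the request for reasons other than the User-Agent.

```python
import urllib.error
import urllib.request

FEED_URL = "https://www.constantcontact.com/blog/feed/"

# Example browser-like User-Agent (an assumption; any current browser string would do).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
)


def build_request(url, user_agent=None):
    """Build a request; without user_agent, urllib sends its own default identity."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def fetch_status(request, timeout=15):
    """Return the HTTP status code, including error codes such as 403."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def describe(status):
    """Turn a status code into a short human-readable verdict."""
    if status == 403:
        return "refused (403): the server is rejecting this client"
    if 200 <= status < 300:
        return "ok"
    return f"other ({status})"


# Usage (performs real network requests, so it is left commented out):
# for ua in (None, BROWSER_UA):
#     print(ua or "default client", "->", describe(fetch_status(build_request(FEED_URL, ua))))
```

If both requests come back as 403, the block is probably based on something other than the User-Agent header, and a real browser engine may be needed.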
  
- January 4, 2023 at 7:39 pm #6542
  
  Szabi – CodeRevolution
  Keymaster
  
  Post count: 4855
  
  Hello,
  
  First of all, thank you for your purchase.
  
  I checked, and this RSS feed is protected by a scraping protection mechanism: it loads only in a browser (and not when the page is downloaded by the plugin).
  
  To scrape this site, I recommend you also check out the Crawlomatic plugin, which can use either Puppeteer (a headless browser that needs to be installed on your server) or HeadlessBrowserAPI (a cloud service that renders websites and gets around their scraping protection).
  
  Please check this tutorial video for details: https://www.youtube.com/watch?v=ZljpMpmi_dU
  
  Settings I used in the Crawlomatic plugin which worked for me:
  
  Scraper Start (Seed) URL / Keywords
  https://www.constantcontact.com/blog/
  
  Content Scraping Method To Use:
  Puppeteer
  
  Do Not Scrape Seed URL:
  checked
  
  Seed Page Crawling Query Type:
  Class
  
  Seed Page Crawling Query String:
  post-title
  
  I hope this info helps.
  
  Regards, Szabi – CodeRevolution.
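
To illustrate what a "Class" query with the string post-title selects on a seed page, here is a minimal sketch that collects links found on or inside elements carrying that class. It is not Crawlomatic's implementation; the class and function names are hypothetical and the sample HTML is invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags on or inside elements with a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.links = []
        self._depth = 0  # greater than 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        classes = (attr.get("class") or "").split()
        if self._depth > 0 or self.class_name in classes:
            if tag == "a" and attr.get("href"):
                self.links.append(attr["href"])
            if tag not in VOID_TAGS:
                self._depth += 1

    def handle_endtag(self, tag):
        if self._depth > 0 and tag not in VOID_TAGS:
            self._depth -= 1


def seed_links(html, seed_url, class_name):
    """Return absolute URLs of the links selected by the class query."""
    collector = ClassLinkCollector(class_name)
    collector.feed(html)
    collector.close()
    return [urljoin(seed_url, href) for href in collector.links]


# Invented sample markup, not the real page source.
SAMPLE_HTML = """
<nav><a href="/pricing/">Pricing</a></nav>
<h2 class="post-title"><a href="/blog/first-post/">First post</a></h2>
<h2 class="post-title featured"><a href="/blog/second-post/">Second post</a></h2>
"""

print(seed_links(SAMPLE_HTML, "https://www.constantcontact.com/blog/", "post-title"))
```

The navigation link is skipped because it is not inside a post-title element, which is the point of narrowing the crawl with a class query.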
  
- January 5, 2023 at 4:00 am #6545
  
  richelo
  Participant
  
  Post count: 8
  
  Thank you for your detailed response.
  
  Can Crawlomatic create excerpts of the scraped posts, or only complete posts? I am not talking about a summary made with something like TLDRThis; I mean just an excerpt of the original content.
  
  Thanks
  
  Rich.
  
- January 5, 2023 at 8:39 am #6546
  
  Szabi – CodeRevolution
  Keymaster
  
  Post count: 4855
  
  Hello,
  
  Yes, Crawlomatic can automatically create an excerpt based on the content of the post. You can use the %%item_description%% shortcode in the ‘Generated Post Content’ settings field for this.
  
  Regards.
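
As a rough picture of how a shortcode such as %%item_description%% gets filled in, the sketch below truncates some text into an excerpt and substitutes it into a template. How Crawlomatic actually derives the excerpt is not stated in this thread; the word-count truncation, the function names, and the sample template are all assumptions made for illustration.

```python
import re

# Matches placeholders of the form %%name%%.
SHORTCODE = re.compile(r"%%(\w+)%%")


def make_excerpt(text, max_words=30):
    """Keep the first max_words words and mark the cut with an ellipsis."""
    words = text.split()
    if len(words) <= max_words:
        return " ".join(words)
    return " ".join(words[:max_words]) + "..."


def render(template, values):
    """Replace known shortcodes; leave unknown ones untouched."""
    return SHORTCODE.sub(lambda m: values.get(m.group(1), m.group(0)), template)


post_body = "Email marketing works best when every message has one clear goal."
template = "<p>%%item_description%%</p>"
print(render(template, {"item_description": make_excerpt(post_body, max_words=5)}))
```

Leaving unknown shortcodes in place makes a typo in the template visible in the generated post instead of silently producing an empty string.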
  
- January 5, 2023 at 4:57 pm #6556
  
  richelo
  Participant
  
  Post count: 8
  
  I set up everything exactly as you said, using HeadlessBrowserAPI, and it fails and gives me this in the logs:
  
  [5-Jan-2023 16:53:17 UTC] An error occurred while getting content from HeadlessBrowserAPI: https://headlessbrowserapi.com/apis/scrape/v1/puppeteer?apikey=MyValidKey&url=https%3A%2F%2Fwww.constantcontact.com%2Fblog%2F&custom_user_agent=Mozilla%2F5.0+%28Windows+NT+6.3%3B+Win64%3B+x64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F60.0.3112.113+Safari%2F537.36&custom_cookies=default&user_pass=default&timeout=default&proxy_url=default&proxy_auth=default&solvecaptcha=1&enableadblock=1 – puppeteer Unhandled Rejection Unhandled Rejection, reason: Error: net::ERR_TUNNEL_CONNECTION_FAILED at https://www.constantcontact.com/blog/
    at navigate (/var/www/html/wp-content/plugins/custom-scraper-api/res/puppeteer/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at process._tickCallback (internal/process/next_tick.js:68:7)
  /var/www/html/wp-content/plugins/custom-scraper-api/res/puppeteer/puppeteer.js:33
  process.on('unhandledRejection', up => { console.error('Unhandled Rejection, reason:', up);throw up })
  ^
  Error: net::ERR_TUNNEL_CONNECTION_FAILED at https://www.constantcontact.com/blog/
    at navigate (/var/www/html/wp-content/plugins/custom-scraper-api/res/puppeteer/node_modules/puppeteer/lib/cjs/puppeteer/common/FrameManager.js:115:23)
    at process._tickCallback (internal/process/next_tick.js:68:7)
  [5-Jan-2023 16:53:22 UTC] Failed to get source web page, importing will not run from this URL! https://www.constantcontact.com/blog/ –
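
The request URL in that log entry is percent-encoded, which makes it hard to see what was actually sent. A small sketch with Python's standard library decodes the query string into readable parameters. The URL is copied from the log above; what a value of "default" means for each parameter is defined by the service and is not explained in this thread.

```python
from urllib.parse import parse_qs, urlsplit

# Request URL as it appears in the log entry (API key placeholder included).
LOGGED_URL = (
    "https://headlessbrowserapi.com/apis/scrape/v1/puppeteer"
    "?apikey=MyValidKey"
    "&url=https%3A%2F%2Fwww.constantcontact.com%2Fblog%2F"
    "&custom_user_agent=Mozilla%2F5.0+%28Windows+NT+6.3%3B+Win64%3B+x64%29"
    "+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29"
    "+Chrome%2F60.0.3112.113+Safari%2F537.36"
    "&custom_cookies=default&user_pass=default&timeout=default"
    "&proxy_url=default&proxy_auth=default&solvecaptcha=1&enableadblock=1"
)


def decode_params(url):
    """Return the query parameters of url as a plain dict of decoded strings."""
    query = urlsplit(url).query
    return {name: values[0] for name, values in parse_qs(query).items()}


for name, value in decode_params(LOGGED_URL).items():
    print(f"{name} = {value}")
```

Reading the decoded list makes it easier to confirm that the target URL and the proxy-related fields are what you expected before looking deeper into the stack trace.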
  
- January 5, 2023 at 5:00 pm #6557
  
  Szabi – CodeRevolution
  Keymaster
  
  Post count: 4855
  
  This is a proxy issue: the proxies are not able to connect to the site you are scraping. Please set ‘Main Settings’ menu -> ‘Web Proxy Address List’ to:
  
  disabled
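
The two failures seen so far in this thread have distinct log signatures, so a small helper can sort log lines by likely cause. This is only a sketch: the labels and hints are hypothetical, the patterns cover just the two messages quoted in this thread, and the hints restate the advice given here for this specific site.

```python
import re

# (pattern, label, hint) for the two error messages quoted in this thread.
SIGNATURES = [
    (re.compile(r"cURL error 22: .*returned error: 403"),
     "http_403",
     "The site refuses the plain download; try a headless browser method."),
    (re.compile(r"net::ERR_TUNNEL_CONNECTION_FAILED"),
     "proxy_tunnel_failed",
     "The proxy could not reach the site; try disabling the proxy list."),
]


def classify(line):
    """Return (label, hint) for a log line, or ('unknown', '') if nothing matches."""
    for pattern, label, hint in SIGNATURES:
        if pattern.search(line):
            return label, hint
    return "unknown", ""


sample = ("[4-Jan-2023 19:17:11 UTC] Error in parsing RSS feed (custom method): "
          "cURL error 22: The requested URL returned error: 403 for "
          "https://www.constantcontact.com/blog/feed/")
print(classify(sample))
```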
  
- January 5, 2023 at 5:18 pm #6558
  
  richelo
  Participant
  
  Post count: 8
  
  That worked PERFECTLY, thank you!
  
  There is ONE issue, though … The scraper uses today’s date for each post, and not the post’s publish date. I also set the main settings to not import anything before 1 Jan 2023, but this one imported a bunch of posts from December 2022.
  
  Sorry for being a pain. I am really new to all this. I have one more site, which does not even have an RSS feed.
  
  I tried it with the same settings, but it is not working.
  
  Could you help with the scraping settings for this one please: https://convertkit.com/resources/
  
- January 5, 2023 at 5:29 pm #6559
  
  richelo
  Participant
  
  Post count: 8
  
  I also just noticed that the ConvertKit one does not have dates on the posts/articles. OUCH!
  
- January 5, 2023 at 5:35 pm #6560
  
  Szabi – CodeRevolution
  Keymaster
  
  Post count: 4855
  
  I am glad to hear that it worked! Yes, when dates are not available, the current date will be used.
  
  Regards.
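
The fallback rule described in this reply can be sketched in a few lines. The sketch also shows one possible, unconfirmed way a date cut-off could let older posts through: if the publish date is never extracted, the fallback date is what gets compared against the cut-off. The function names and the order of operations are assumptions, not Crawlomatic's code.

```python
from datetime import date


def effective_date(published, today):
    """Use the publish date when one was extracted, otherwise fall back to today."""
    return published if published is not None else today


def passes_cutoff(published, cutoff, today):
    """True when the date that would be assigned to the post is on or after the cut-off."""
    return effective_date(published, today) >= cutoff


cutoff = date(2023, 1, 1)
today = date(2023, 1, 5)

# A December 2022 post whose date was extracted is filtered out.
print(passes_cutoff(date(2022, 12, 15), cutoff, today))
# The same post with no extracted date is treated as published today and passes.
print(passes_cutoff(None, cutoff, today))
```

Under these assumptions, both symptoms (today's date on every post, and older posts getting past the cut-off) would follow from the publish date not being picked up during scraping.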
  
- January 5, 2023 at 5:40 pm #6561
  richelo
  Participant
  
  Post count: 8
  You missed out on a few points …
  - The ConstantContact one HAS dates, but they were all published with today’s date.
  - I set the main settings to not import anything before 1 January 2023, but for ConstantContact, it imported a bunch of posts from December 2022, and yes, the posts have dates.
  - I need help getting https://convertkit.com/resources/ to work in the scraper.
  - ConvertKit is the one that does not have dates. I am kind of okay with publishing those on the days they are scraped.
  - One last thing … In the RSS plugin, there is a URL that needs to be run in cron with wget to have the rules run. I don’t see this in the scraping plugin. Does it just put itself in the WP cron?
  That’s all for now. Thank you so much for your help!
  
- January 5, 2023 at 6:26 pm #6562
  
  Szabi – CodeRevolution
  Keymaster
  
  Post count: 4855
  
  Are you using Crawlomatic to scrape https://www.constantcontact.com/blog/? I am still not sure, as we talked about both Echo RSS and Crawlomatic. If yes, please send me a screenshot of your current plugin settings (to kisded@yahoo.com).
  
  Regards.
  
- January 5, 2023 at 6:42 pm #6563
  
  richelo
  Participant
  
  Post count: 8
  
  Sorry, yes, I am using Crawlomatic along with HeadlessBrowserAPI. I bought the plugin today on Envato and signed up for a subscription to the API today as well.
  
  I will get that screenshot over to you a little later.
  
- January 5, 2023 at 6:43 pm #6564
  
  Szabi – CodeRevolution
  Keymaster
  
  Post count: 4855
  
  Ok, sure.
  
- January 5, 2023 at 7:12 pm #6566
  
  richelo
  Participant
  
  Post count: 8
  
  Email sent.
  

The topic ‘Weird Feed Failure’ is closed to new replies.