Can’t crawl the pageurl via visual selector for kolin.com

This topic is: resolved


This topic has 3 replies, 2 voices, and was last updated 1 year, 5 months ago by Szabi – CodeRevolution.


Author

Posts
- December 22, 2023 at 6:48 am #9448
  
  acluke
  Participant
  
  Post count: 11
  
  Hi, I recently purchased this plugin and am not sure how to use it well.
  
  For example, https://kolin.com.tw/product/fridge
  
  I tried to use the XPath //div[@class='thumbnail']/a or the visual selector to get the URLs of all the fridges, but it failed.
  
  I am not sure whether I need to enable headless or any extra features to make it work well.
  
  Could you please take a look and give me some advice on using this plugin?
  
  Thanks so much.
  
  Luke
  
- December 22, 2023 at 8:30 am #9449
  
  Szabi – CodeRevolution
  Keymaster
  
  Post count: 4854
  
  Hello,
  
  First of all, thank you for your purchase.
  
  To scrape products from this specific site, please use the config below:
  
  Do Not Scrape Seed URL: checked
  
  Seed Page Crawling Query Type: Class
  
  Seed Page Crawling Query String: col-sm-4 contant-box
  
  Regards, Szabi – CodeRevolution.
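
To picture what a class-based seed query like the one above selects, here is a minimal Python sketch. It is not the plugin's code: it assumes the query matches every element carrying both classes and that the links inside those elements are the ones queued for scraping. The sample markup, the `/product/fridge/519` URL, and the names `ClassLinkCollector` and `extract_links` are made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags inside elements that carry all wanted classes."""

    def __init__(self, wanted_classes: str):
        super().__init__()
        self.wanted = set(wanted_classes.split())
        self.depth = 0   # > 0 while the parser is inside a matching element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif self.wanted <= set((attr.get("class") or "").split()):
            self.depth = 1   # entered a matching element
        if self.depth and tag == "a" and attr.get("href"):
            self.links.append(attr["href"])

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def extract_links(html: str, wanted_classes: str) -> list:
    collector = ClassLinkCollector(wanted_classes)
    collector.feed(html)
    return collector.links


# Hypothetical markup, NOT copied from the real page.
SAMPLE = """
<div class="row">
  <div class="col-sm-4 contant-box">
    <div class="thumbnail"><a href="/product/fridge/518"><img src="a.png"></a></div>
  </div>
  <div class="col-sm-4 contant-box">
    <div class="thumbnail"><a href="/product/fridge/519"><img src="b.png"></a></div>
  </div>
  <div class="footer"><a href="/contact">Contact</a></div>
</div>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE, "col-sm-4 contant-box"))
    # ['/product/fridge/518', '/product/fridge/519']
```

The footer link is skipped because it sits outside any element with both classes, which is the reason for choosing a selector that wraps only the product boxes.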
  
- December 25, 2023 at 7:42 am #9458
  
  acluke
  Participant
  
  Post count: 11
  
  Thanks for replying!
  
  I’ve encountered another issue while crawling a content page.
  
  Here is the log:
  
  [25-Dec-2023 15:26:12 Etc/GMT-8] Failed to exec curl in crawlomatic_curl_exec_utf8! https://kolin.com.tw/assets/uploads/files/product/3fridge_cate/KR-258V05/KR-258V05_%E5%95%86%E8%AA%AA02.png – err: Connection timed out after 10001 milliseconds – 28 url: https://kolin.com.tw/assets/uploads/files/product/3fridge_cate/KR-258V05/KR-258V05_%E5%95%86%E8%AA%AA02.png
  
  The page I crawled: https://kolin.com.tw/product/fridge/518, with the XPath: //div[@class='row pdt_content']
  
  It seems the content image is too big and the request timed out. How can I solve this kind of problem?
  
  BTW, I would also like to ask what Crawled Pages Crawling Query is for and when I would need to use it.
  
  Thanks and merry x’mas,
  
  Luke
  
- December 25, 2023 at 9:00 am #9459
  
  Szabi – CodeRevolution
  Keymaster
  
  Post count: 4854
  
  Hello,
  
  The log above does not indicate that the image is too large. It reports a connection timeout, meaning that the connection to the image (to get its first byte) could not be established within 10 seconds. This usually means that the image is inaccessible because a firewall rule is blocking the connection (this can be on your server’s side or on the target server’s side, where the image is hosted).
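
One way to check this is to run a plain TCP connection test from the server that hosts WordPress. The sketch below is a generic Python probe, not part of the plugin. The function name `probe_tcp` is made up, port 443 is assumed because the image URL uses https, and the 10-second default mirrors the timeout in the log.

```python
import socket


def probe_tcp(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Try to open a TCP connection and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No answer at all within the timeout: typical of a firewall dropping packets.
        return "timed out"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar errors.
        return f"error: {exc}"


if __name__ == "__main__":
    # Host taken from the log in this thread.
    print(probe_tcp("kolin.com.tw"))
```

If this reports "timed out" from the web server but "connected" from another machine, that would suggest the block applies to the web server's IP address rather than to the image itself.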
  
  The ‘Crawled Pages Crawling Query’ settings control which links are extracted, for further scraping, from the URLs where the plugin has already scraped content and created posts (usually these are posts). Using this feature, you can continue to scrape links such as those usually found on the right side of blog posts (recommended posts in the right column). This feature is optional.
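
The difference between the seed query and the crawled pages query can be pictured with a toy model. This is a simplified sketch under assumptions, not the plugin's algorithm: the `crawl` function, the `SITE` table, and the query names `products` and `related` are all hypothetical.

```python
from collections import deque


def crawl(seed_url, get_links, seed_query, crawled_query=None, max_pages=10):
    """Return the pages scraped, in order.

    seed_query picks links on the seed page; crawled_query (optional)
    picks further links on every page that has already been scraped.
    """
    seen = {seed_url}
    to_scrape = deque()
    for link in get_links(seed_url, seed_query):
        if link not in seen:
            seen.add(link)
            to_scrape.append(link)

    scraped = []
    while to_scrape and len(scraped) < max_pages:
        url = to_scrape.popleft()
        scraped.append(url)  # the real plugin would create a post here
        if crawled_query is not None:
            for link in get_links(url, crawled_query):
                if link not in seen:
                    seen.add(link)
                    to_scrape.append(link)
    return scraped


# Hypothetical site: (page, query) -> links that the query matches on that page.
SITE = {
    ("seed", "products"): ["/p/1", "/p/2"],
    ("/p/1", "related"): ["/p/3", "/p/2"],
    ("/p/3", "related"): ["/p/1"],
}


def get_links(url, query):
    return SITE.get((url, query), [])


if __name__ == "__main__":
    print(crawl("seed", get_links, "products"))             # ['/p/1', '/p/2']
    print(crawl("seed", get_links, "products", "related"))  # ['/p/1', '/p/2', '/p/3']
```

Without the second query only the links found on the seed page are scraped. With it, `/p/3` is picked up from the "related" links on an already scraped page.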
  
  Regards.
  
The topic ‘Can’t crawl the pageurl via visual selector for kolin.com’ is closed to new replies.