Thank you for contacting me. Please note that I live in the GMT+3 time zone, so responses may be delayed accordingly.
Tagged: images
February 20, 2022 at 11:27 am #4639
When scraping the URL https://techcult.com/how-to-install-kodi/, everything is at its default setting except that “Copy Images From Content Locally” is ticked.
In the result, some images are downloaded but some are not. Please see the screenshot.
Can you please check how I can download all the images?
Attachments:
February 20, 2022 at 11:30 am #4641
This reply has been marked as private.
February 20, 2022 at 12:03 pm #4642
Hello,
First of all, thank you for your purchase.
This site uses lazy loading for the images in its content. To fix this, I added the following in the importing rule settings for rule ID 81:
Lazy Loading Images HTML Tag: data-full
Now the images should be able to be scraped correctly, please check.
Tutorial video for this feature: https://www.youtube.com/watch?v=BMzJWZdodlo
Also: https://www.youtube.com/watch?v=AzadF_dAAco
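For illustration only, here is a minimal TypeScript sketch (not the plugin's actual code) of what such a lazy-loading fix boils down to, assuming the page keeps the real image URL in a data-full attribute:

```ts
// Sketch only: rewrite lazy-loaded images so the real URL ends up in src.
// Assumes the full-size URL is stored in a "data-full" attribute.
import * as cheerio from "cheerio";

function fixLazyImages(html: string): string {
  const $ = cheerio.load(html);
  $("img[data-full]").each((_, img) => {
    const realUrl = $(img).attr("data-full");
    if (realUrl) {
      $(img).attr("src", realUrl);   // point src at the real image
      $(img).removeAttr("data-full"); // drop the lazy-loading attribute
    }
  });
  return $.html();
}
```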
Regards, Szabi – CodeRevolution.
February 21, 2022 at 8:30 am #4643
Hello,
Thanks for your prompt response. I have actually tried data-full for the Lazy Loading Images setting before, but it doesn’t work.
Can you please have another look at rule ID 81 and its post? You will see that only about half of the images were scraped, not all of them.
February 21, 2022 at 4:24 pm #4645
Hello,
I checked again and indeed, this issue is caused by the scraped site limiting access to its images: when image requests are made too quickly one after another, a scraping limiter kicks in on their side and denies access to some of the images.
I tried to get around this limitation by adding the following in the importing rule settings for rule ID 81: ‘Delay Between Multiple Requests (ms)’ -> 1000, and also ‘Set Custom Curl User Agent’ -> Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36
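For reference only, a rough TypeScript sketch (not the plugin's PHP implementation) of what that combination of a request delay and a custom User-Agent amounts to; the 1000 ms delay and the UA string mirror the settings above, everything else is illustrative:

```ts
// Sketch only: download images with a pause between requests and a
// browser-like User-Agent, to stay under a rate limiter.
import { writeFile } from "node:fs/promises";
import { basename } from "node:path";

const USER_AGENT =
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
  "(KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36";

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function downloadImages(urls: string[], delayMs = 1000): Promise<void> {
  for (const url of urls) {
    const res = await fetch(url, { headers: { "User-Agent": USER_AGENT } });
    if (!res.ok) {
      console.warn(`Blocked or failed: ${url} (${res.status})`);
    } else {
      const data = Buffer.from(await res.arrayBuffer());
      await writeFile(basename(new URL(url).pathname), data);
    }
    await sleep(delayMs); // wait before the next request
  }
}
```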
However, unfortunately none of the above helped scrape all the images correctly.
I am not yet sure which content scraping protection they are using, but I suspect that getting around it would be possible only by installing a headless browser (like Puppeteer) on your server and combining the plugin with it. However, I am not 100% sure about this, nor that it will help; it depends on how aggressive their scraping protection system is.
Please check details on the above, here: https://www.youtube.com/watch?v=g99IlDkt_SY
How to install Puppeteer on your server (VPS only): https://www.youtube.com/watch?v=KNOIJA4pTQo
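As a very rough idea of that approach (a sketch only, not the plugin's Puppeteer integration; the selector and the waitUntil option are assumptions):

```ts
// Sketch only: load the page in a headless browser so the site serves it
// exactly as it would to a normal visitor, then collect the image URLs
// that the browser actually resolved after lazy loading ran.
import puppeteer from "puppeteer";

async function collectImageUrls(pageUrl: string): Promise<string[]> {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(pageUrl, { waitUntil: "networkidle2" });
    return await page.$$eval("img", (imgs) =>
      imgs
        .map((img) => (img as HTMLImageElement).currentSrc || (img as HTMLImageElement).src)
        .filter(Boolean)
    );
  } finally {
    await browser.close();
  }
}
```

The point of this route is that the site serves its images to a real browser engine, so whatever protection keys on request patterns or missing browser behaviour is less likely to block the downloads.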
Please check.
Regards, Szabi – CodeRevolution.
The topic ‘Some images can’t be downloaded’ is closed to new replies.