June 23rd, 2018, 02:12 PM   #359
deepsepia
Moderator

Quote:
Originally Posted by shasta
Could you provide me with a template you use? I've taken a look at HTTrack in the past, but I found it too complex. I'm trying a program called Cyotek WebCopy at the moment, but I think I might have to spend a lot of time figuring it out. I did a test run on one page and it downloaded way too much, so I had to stop it. There are lots of things that would have to be excluded from the crawl.
I don't use that particular program, so I can't give any specific advice about it, but here are the general issues with web spiders:

1) How deep will they search?
2) Will they stay on the same server, or follow links to other servers?

Each spider will have some kind of config panel where you set these options, and they're important. Additionally, most spiders default to "polite" behavior, obeying robots exclusion rules; you usually have to override robots exclusion to rip a whole site. The sketch below shows all three of these knobs in one place.
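To make that concrete, here's a minimal sketch in Python (standard library only) of what any spider is doing under the hood. The URL, depth limit, and names here are my own illustration, not anything from Cyotek WebCopy or HTTrack; a real mirroring tool also saves the files and rewrites links so the copy browses offline.

Code:
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

MAX_DEPTH = 2          # knob 1: how deep to search
SAME_HOST_ONLY = True  # knob 2: stay on the starting server
OBEY_ROBOTS = True     # knob 3: set False to override robots exclusion

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url):
    start_host = urlparse(start_url).netloc
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(start_url, "/robots.txt"))
    robots.read()
    seen = {start_url}
    queue = [(start_url, 0)]
    while queue:
        url, depth = queue.pop(0)
        if OBEY_ROBOTS and not robots.can_fetch("*", url):
            continue  # the "polite" default: skip disallowed URLs
        try:
            with urllib.request.urlopen(url) as resp:
                if "html" not in resp.headers.get("Content-Type", ""):
                    continue  # only parse HTML pages for links
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        print("  " * depth + url)
        if depth >= MAX_DEPTH:
            continue  # knob 1: don't follow links any deeper
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if urlparse(link).scheme not in ("http", "https"):
                continue  # skip mailto:, javascript:, etc.
            if SAME_HOST_ONLY and urlparse(link).netloc != start_host:
                continue  # knob 2: ignore links to other servers
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))

crawl("https://example.com/")

In a GUI tool those same three settings are usually a "depth" or "levels" box, a "stay on domain" checkbox, and a robots.txt option buried in the project settings; finding them is most of the learning curve.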

As web technology has become more sophisticated, sites have deployed all sorts of countermeasures against scraping (JavaScript-rendered pages, login walls, rate limits, and so on). If you can see it in your browser, it can almost certainly be downloaded somehow, but it may take some doing, and you'll have to understand your particular application.
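As one hedged illustration of the "some doing": a very common first hurdle is a server that refuses requests whose User-Agent doesn't look like a browser. The URL below is a placeholder; which trick a given site actually needs is something you discover case by case.

Code:
import urllib.request

# Placeholder URL; the User-Agent string mimics a browser so servers
# that reject script-looking clients will answer. This is only the most
# common countermeasure; each site may need different handling.
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status, "->", len(resp.read()), "bytes")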