Old October 10th, 2015, 04:32 PM   #27
deepsepia
Moderator

Quick and Dirty webscraping with JDownloader

OK, a new tip, and a very useful one.

Have you ever wanted to get all the links from a thread on a board? There are some huge threads here and on other web forums. Let's say you want to get all the links from a 200 page thread . . . how are you going to do it?

You could go to each page and right click on every link . . . you'll get the links, and carpal tunnel syndrome as well. I've previously pointed out that you can grab all the links from a single page, with JD, by doing a Ctrl-A & Ctrl-C [Select All, Copy] on the page, because JD will parse links from stuff you put on the clipboard. But there are some very big threads out there, hundreds of pages . . . that could still take some time.

There's gotta be an easier way to do this, you'd think, and you'd be right.

If you write PHP or Python, there are a lot of ways to do it, but let's say you don't write code.

JDownloader is your friend. You can take the address of any page from a thread and paste it into JD's Linkgrabber's "Analyze and add links" feature (the "plus" [+] button on the lower left corner of Linkgrabber), and JD will analyze the page and look for download links.

Ah, but what if you have a thread with a hundred pages? Suppose you have a thread:
[someforum]/Girlswithbigbooks-p[Pagenumber].html, where the pages are numbered 1 through 100?

Simple. I used a little BASIC interpreter, but you can use whatever you know (a spreadsheet would be fine) to generate a text file that reads

[someforum]/Girlswithbigbooks-p1.html
[someforum]/Girlswithbigbooks-p2.html
[someforum]/Girlswithbigbooks-p3.html

and so on up to 100, or whatever number of pages there are. Note that different forums handle page numbers in a number of different ways, so you just have to copy the format used in the thread that you're interested in.
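If you do happen to write a little Python, here is a minimal sketch of generating that list. The function name `page_urls` and the `{n}` placeholder are my own choices for illustration, and the address is just the placeholder from the example above, so swap in the real pattern from your thread.

```python
def page_urls(template, first, last):
    """Return one URL per page number, from first to last inclusive.

    `template` is the thread address with {n} where the page number goes.
    """
    return [template.format(n=n) for n in range(first, last + 1)]


# Placeholder address from the example above, NOT a real URL.
# Copy the real pattern from the thread you are interested in.
TEMPLATE = "[someforum]/Girlswithbigbooks-p{n}.html"

if __name__ == "__main__":
    # Print one address per line, ready to select and copy.
    for url in page_urls(TEMPLATE, 1, 100):
        print(url)
```

Run it, then select and copy everything it prints. If the forum uses some other scheme (say `?page=3`), only the template string needs to change.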

Now copy the whole list and paste it into JD's "Add New Links" utility. JD will most likely pop up a screen saying that it couldn't find any links, and ask if you want it to perform "Deep Link Analysis", which basically means looking at each item on the page and seeing if it links through to a downloadable file. Click "Yes" to this.

JD will iterate through the pages, looking for all links & images (you can restrict JD to looking for links, to ignore images, and so on). Some sites hide links behind "Code" tags. I'm not sure if JD picks those up, as I haven't tried yet.
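For the coders in the audience, the general idea of "look through a page for links and images" can be sketched in a few lines of Python. To be clear, this is not how JD does it internally (I don't know its internals); it is just a rough illustration using the standard library's `html.parser`, and the names `LinkCollector` and `collect_links` are made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags and, optionally, src from <img> tags."""

    def __init__(self, include_images=True):
        super().__init__()
        self.include_images = include_images
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and self.include_images and attrs.get("src"):
            self.links.append(attrs["src"])


def collect_links(html, include_images=True):
    """Return the links found in an HTML string, in document order."""
    parser = LinkCollector(include_images)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    # Made-up sample page, for illustration only.
    sample = '<p><a href="http://example.com/file.zip">zip</a> <img src="pic.jpg"></p>'
    print(collect_links(sample))
    print(collect_links(sample, include_images=False))
```

The `include_images` switch mirrors the option of telling JD to ignore images. You would still need to fetch each page yourself before feeding it in, which is exactly the chore JD handles for you.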

It's a bit slow. For a massive thread, it's the sort of job to leave JD humming on overnight.

But it works very well, and if utilities like "WGET" leave you saying "huh?", then this approach, which experienced web coders will tell you is an amateur kludge, is going to be the easiest.


Last edited by deepsepia; October 10th, 2015 at 06:52 PM.