Old November 4th, 2017, 04:01 PM   #178
halvar
Future Plans for forum-backup

I will continue to improve it gradually. If there is something that could make things easier for you, drop me a note. Often little things achieve much.
Of course, bugs have priority: if you find one, drop me a line and I will look into it.

Long term topics are:
Duplicate Detection
  • adding an MD5 hash for each image to the database
  • try image fingerprinting to find duplicates that have the same content but a different resolution or quality. This is something that interests me from a mathematical point of view. Maybe http://phash.org/ could be of use here.
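
A minimal sketch of the MD5 idea, assuming an SQLite database; the table name `image_hash` and all function names are made up for illustration and are not part of forum-backup. Note that MD5 only finds byte-identical files. Duplicates with a different resolution or quality would need fingerprinting instead.

```python
import hashlib
import sqlite3


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_images(conn, paths):
    """Store path and MD5 for each image (hypothetical table layout)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image_hash "
        "(path TEXT PRIMARY KEY, md5 TEXT NOT NULL)"
    )
    rows = [(str(p), md5_of_file(p)) for p in paths]
    conn.executemany(
        "INSERT OR REPLACE INTO image_hash (path, md5) VALUES (?, ?)", rows
    )
    conn.commit()


def find_duplicates(conn):
    """Return lists of paths that share one MD5, i.e. byte-identical files."""
    groups = {}
    query = "SELECT md5, path FROM image_hash ORDER BY md5, path"
    for md5, path in conn.execute(query):
        groups.setdefault(md5, []).append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `index_images(conn, paths)` once per run and then `find_duplicates(conn)` would list the candidates. What to do with them (delete, keep, hard-link) is a separate decision.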
Dealing with edited/updated posts
I do not know of a way to search specifically for updated posts or to get a list of them. So one way to deal with it could be:
  • rename the old thread page files with a date suffix like 't1234-p3-x.html.20171104'
  • re-download all thread pages (this is something that should be avoided)
  • parse the posts again and compare the result.
  • If the result (the image links) differs, then
  • add the links to the database, but with an offset on the index to distinguish them from the original links, like 1001, 1002, ...
  • an alternative to the offset could be storing thread edit date with the links
  • download the new links. This would probably lead to duplicates, but I want to keep the original file.
  • I could do an md5 or content compare of old and new files and delete the new files if they are identical. But what if the new files have nicer names? So I would rather keep all the files. Having is better than needing ;-)
  • Since this is a lengthy process that also causes load on the server, it should be started rarely (once a year per thread?). Or only started manually if one needs it.
  • Do edits matter anyway?
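
The rename and compare steps above could look roughly like this. It is only a sketch: the function names and the offset constant of 1000 are assumptions chosen to match the "1001, 1002, ..." example, and the download and database parts are left out.

```python
from datetime import date
from pathlib import Path

# Assumption: original links of a post use indices below this value.
EDIT_INDEX_OFFSET = 1000


def archive_page(path, today=None):
    """Rename a saved page file by appending a YYYYMMDD suffix; return the new path."""
    today = today or date.today()
    source = Path(path)
    target = source.with_name(f"{source.name}.{today:%Y%m%d}")
    source.rename(target)
    return target


def new_links_with_offset(old_links, new_links, offset=EDIT_INDEX_OFFSET):
    """Return (index, link) pairs for links that appear only after the edit.

    Links already known from the old parse are skipped, so the original
    entries in the database stay untouched.
    """
    known = set(old_links)
    added = [link for link in new_links if link not in known]
    return [(offset + i, link) for i, link in enumerate(added, start=1)]
```

An empty result from `new_links_with_offset` means the edit did not change the image links, so nothing has to be downloaded for that post.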
Dealing with Deleted Posts
Last week a couple of pages were deleted from a thread due to a DMCA takedown:
http://vintage-erotica-forum.com/sho...&postcount=718
http://vintage-erotica-forum.com/t58...n-phoenix.html

So locally there were pages 1 to 9, but on the server only 1 to 6. If new posts were added, pages 6-8 would not be downloaded again.

A workaround for this is
  • renaming page files 6 to 9 (append a date suffix like t1234-p6-x.html.20171104)
  • forcing the processing of the thread during the next run. This can be achieved by updating the thread using the webconsole without changing the page number. This deletes the last process date of the thread, so the thread is processed during the next run.
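
The renaming part of this workaround could be scripted along these lines. This is a sketch under assumptions: the file name pattern is guessed from the 't1234-p6-x.html' example, and the function name and its parameters are invented here. The forced reprocessing through the webconsole is not covered.

```python
import re
from datetime import date
from pathlib import Path


def archive_stale_pages(thread_dir, thread_id, server_last_page, today=None):
    """Append a date suffix to local pages numbered >= server_last_page.

    Pages below the server's last page are assumed to be unchanged.
    Returns the new file names of the renamed pages.
    """
    today = today or date.today()
    # Assumed naming scheme: t<thread id>-p<page number>-<anything>.html
    pattern = re.compile(rf"^t{thread_id}-p(\d+)-.*\.html$")
    renamed = []
    for path in sorted(Path(thread_dir).iterdir()):
        match = pattern.match(path.name)
        if match and int(match.group(1)) >= server_last_page:
            target = path.with_name(f"{path.name}.{today:%Y%m%d}")
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

For the case above that would be `archive_stale_pages(thread_dir, 1234, 6)`, with 1234 standing in for the real thread id. Files that already carry a date suffix no longer match the pattern, so running it twice does not add a second suffix.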


I am planning to do some of this in my week off work in January.