Old September 15th, 2014, 09:50 PM   #17
electile disfunction
Vintage Member
 
Join Date: Oct 2008
Location: Somewhere flat, that's either hot, cold, or windy ... Canada?
Posts: 1,966
Thanks: 42,100
Thanked 21,351 Times in 1,903 Posts
Alternate process

I have been considering your entire issue about finding and deleting duplicates. Mrs e.d. & I are in a very similar situation to you as we have hundreds of thousands of picture files, spread over many broad categories, collected over years.

I'll give you (and everyone reading) a brief outline of how we sort our files, and how we look for duplicate & similar pictures. This might give you a few ideas to get a good start at what you need and what will work for you!

In our collection we have: known duplicates which we want to keep for various reasons; many similar pictures--some to keep, some to delete; pictures that are radically different but that computers see as similar; and additions & deletions almost constantly. Files range in size from the smallest computer icons of tens of bytes to deep space astronomy photographic tifs larger than 150 MB each. Picture subject and sort categories range from a small nephew making rude noises at his sister, to works of incredibly complex but sublime digital art--there is no "easy" way to sort our collection ... trust me, I've tried.

Tools:
You need some good but basic software to keep your sanity. You can get and use all of it for free; at some time you may wish to purchase some (and/or make donations to the creators).

a) Font--if you have any respect for your sanity, choose a well designed, sans serif, professional font that has all the letters and punctuation you use and (most importantly!!!) different glyphs for each of these characters:
1 i I L O 0 o F f $ S s T t W V U w v u
You need this to see thousands of file names easily and constantly. Use this font either universally on your computer or just set it up in your favourite file sorting program.
Every operating system comes with at least one (this is how programmers can see what they are typing): try Segoe UI, Trebuchet, or Tahoma, and compare them to others if you wish. (Note: Trebuchet is used by many hard-of-seeing people for all their computer and print needs.)

b) Graphic file viewer--it needs to load quickly and display accurately: jpg, gif, png, tif files and maybe others. It is best if you can have many instances of it running simultaneously (for visually comparing similar pictures). IrfanView is awesome; ACDSee used to do this (I haven't looked at it in a decade). The programs that come with Windows are mostly useless for this as they are slow, resource-hogging, and try so hard to be "helpful" that most of the time you don't know if what you are seeing is real or not.
You will likely want a program that allows you to quickly find the metadata and exif info of your picture files.

c1) A program to find duplicate computer files, regardless of file name; and
c2) A program to find similar picture files, that creates & reuses its own databases.
There are hundreds of programs to find duplicate computer files spread over many drives. Find one you like; if it works with--or is part of--your favourite computer file sort & searcher (like Windows Explorer, etc.) so much the better. Before any visual-type search of new files, use this to find the obvious duplicates, as these programs are almost always far faster than any graphic-search software.
Programs to find similar picture files are much more difficult to find--VEF members have been looking for and testing them for years with widely varying results. Many have been suggested for you in this thread already. I have found Anti-Dupl.net is very good and fast, it's also free of charge and reliable, and it works on our (3) computers without fussing with the OS, but YMMV.
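To give an idea of what a c1-type program does under the hood, here is a minimal Python sketch, not a replacement for a real tool. It assumes the usual approach of grouping files by size first and then by a content hash (SHA-256 here), so file names never matter. The function names are made up for illustration, and it only reports groups; it never deletes anything.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_exact_duplicates(roots):
    """Group byte-identical files under the given folders, ignoring names.

    Returns a sorted list of path groups. Nothing is deleted or changed.
    """
    by_size = defaultdict(list)
    seen = set()  # guards against counting a file twice if roots overlap
    for root in roots:
        for dirpath, _dirs, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                real = os.path.realpath(path)
                if real in seen or not os.path.isfile(real):
                    continue
                seen.add(real)
                by_size[os.path.getsize(real)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[file_digest(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)
```

Checking sizes first means most files are never read at all, which is a big part of why this kind of search is so much quicker than a visual comparison.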

d) (optional) Computer file renamer: you may not wish to fully rename any files, but adding a couple of words at the beginning of file names can greatly cut down on your work. Example: almost all of our visual artists have created self portraits--I use a renamer to add "Autoportrait_" to the beginning of those files' names ... Therefore the prefix is easy to remove, the original file names remain intact, and the files are easy to search for and separate from the other works. Groups of pictures (from many different scanners or sources) will always be together, and any additional or new files can easily join the group with "copy and paste" commands.
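If you would rather script the prefix trick than use a renamer program, a rough Python sketch could look like this. The function names are hypothetical, and it defaults to a dry run that only prints what it would do, so nothing on disk changes until you say so.

```python
import os


def plan_prefix_renames(paths, prefix):
    """Return (old, new) pairs that put `prefix` in front of each file name.

    Files that already start with the prefix are skipped, so re-running is safe.
    """
    plan = []
    for old in paths:
        folder, name = os.path.split(old)
        if name.startswith(prefix):
            continue
        plan.append((old, os.path.join(folder, prefix + name)))
    return plan


def apply_renames(plan, dry_run=True):
    """Print each rename; touch the disk only when dry_run is False."""
    for old, new in plan:
        if os.path.exists(new):
            print(f"SKIP (target exists): {new}")
            continue
        print(f"{old} -> {new}")
        if not dry_run:
            os.rename(old, new)


def strip_prefix(name, prefix):
    """Recover the original file name by removing the prefix."""
    return name[len(prefix):] if name.startswith(prefix) else name
```

Because the original name is kept whole after the prefix, strip_prefix("Autoportrait_scan01.jpg", "Autoportrait_") gives back "scan01.jpg".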

e) (optional) Software to check csv files for known collections (if you need this you already know what it is). VEF posting Help Guides probably have suggestions for these, and they get reviewed/recommended regularly in other VEF threads.

Process:
1) Sort your collection of pics any way you wish. Our hierarchy usually includes these levels of folders (or more depending on many variables):
i. general subject (examples: our photos, animals, art, astronomy, hair styles, info.); then
ii. creator/artist/source (examples: vacations, mammalia; Leonardo Da Vinci, NASA, Helmut Newton (photographer), wife's hair); then
iii. subsection(s) (examples: camping 2011, aquatic, architecture, solar system, photoshoot name, long styles).
Also at the subsection level we usually start including appropriate "miscellaneous" directories--examples: in nudes we sort whatever pictures remain by the models' surnames, but in astronomy we sort leftovers by wavelength and/or date.

2) Find any duplicate files. Keep or destroy them as necessary.
NEVER tell your programs to delete any file automatically unless YOU know exactly what you are doing!

3) Tell your "similar picture finder" software where it should look for and compare picture files. Sometimes I search individual directories or drives, but usually (for nudes especially) I'll ask the computer to compare all new files to all known files on a regular basis--it will ignore mistakes it made in the past and add any new images to its databases as necessary.

4) You should be able to sort the "similar picture finder" results alphabetically and/or by directory and/or by % similarity and/or by type of difference, so that your 'known & sorted' file names occur together and any new resultant names are easy to see beside them. Open any pictures you need to see for comparison, tell your software the results (Anti-Dupl, for example, continually deletes files and updates filenames & directories if you change them), and go on to the next results.
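For the curious, here is one way a "% similarity" figure and a sorted result list could be produced. This is only a sketch using a simple "average hash", which is my own choice for illustration; I don't know what method Anti-Dupl or any other program really uses. It assumes each picture has already been decoded and shrunk to a small grayscale grid (rows of 0-255 numbers) by some image library, which is not shown, and all the function names are invented.

```python
def average_hash(gray_grid):
    """Turn a small grayscale grid (rows of 0-255 ints) into a tuple of bits.

    Each bit is 1 when that pixel is brighter than the grid's mean.
    """
    pixels = [p for row in gray_grid for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)


def similarity_percent(hash_a, hash_b):
    """Percentage of matching bits between two equal-length hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    same = sum(a == b for a, b in zip(hash_a, hash_b))
    return 100.0 * same / len(hash_a)


def rank_matches(hashes, threshold=90.0):
    """Compare every pair in a {name: hash} dict.

    Returns (percent, name_a, name_b) rows at or above the threshold,
    most similar first, then alphabetically by file name.
    """
    names = sorted(hashes)
    rows = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            pct = similarity_percent(hashes[a], hashes[b])
            if pct >= threshold:
                rows.append((pct, a, b))
    rows.sort(key=lambda r: (-r[0], r[1], r[2]))
    return rows
```

Comparing every pair like this gets slow for a big collection, which is one reason the real programs keep databases and smarter indexes.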

5) Once your "similar picture finder" has its initial database(s) and file of erroneous matches to ignore, all subsequent duplicate searches will be much faster. (Right now, we can search 100,000 files over three hard drives in 3 or 4 minutes, where the initial database creation and search may have taken 20 minutes.)
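Why does a saved database help so much? A plausible explanation, sketched below in Python, is that the program stores a signature for each file and only recomputes it when the file's size or modification time changes, and that it remembers pairs you have marked as false matches. This is my guess at the general idea, not a description of any particular program; the JSON layout and function names are invented for the example.

```python
import json
import os


def load_db(db_path):
    """Load the saved database, or start an empty one."""
    if os.path.exists(db_path):
        with open(db_path, "r", encoding="utf-8") as f:
            return json.load(f)
    return {"signatures": {}, "ignored_pairs": []}


def save_db(db, db_path):
    """Write the database back to disk as JSON."""
    with open(db_path, "w", encoding="utf-8") as f:
        json.dump(db, f)


def get_signature(db, path, compute):
    """Return the cached signature unless the file's size or mtime changed.

    `compute` is whatever slow function builds a JSON-friendly signature.
    """
    st = os.stat(path)
    entry = db["signatures"].get(path)
    if entry and entry["size"] == st.st_size and entry["mtime"] == st.st_mtime:
        return entry["sig"]
    sig = compute(path)  # the slow step, done once per new or changed file
    db["signatures"][path] = {"size": st.st_size, "mtime": st.st_mtime, "sig": sig}
    return sig


def ignore_pair(db, a, b):
    """Remember that a and b were a false match, in either order."""
    pair = sorted([a, b])
    if pair not in db["ignored_pairs"]:
        db["ignored_pairs"].append(pair)


def is_ignored(db, a, b):
    """True if this pair was previously marked as a false match."""
    return sorted([a, b]) in db["ignored_pairs"]
```

On the second run only new or changed files go through the slow step, and the known false matches never show up in the results again.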

6) Your initial review of the results found by your programs will also take longer than any subsequent search, but you will remove a lot of junk that first time. And you'll learn a lot, too!

That's the overview; I hope it gives you some ideas for what you want and need for your files. I will add corrections and other ideas as they occur to me.

Please feel free to send me a Private Message if you think I can help with something.

e.d.