I don't really know where to share this, but I'm both slightly embarrassed by how bodged it is and pretty proud that I managed to get it working. I've hacked together an automatic image tagger using a couple of AI models and a useful GitHub tool.
I'm gonna approach this in a problem:solution sort of format, since I think that effectively gives the timeline of events as well as what exactly I've managed to do.
PROBLEM: I have 250k images. Some are personal screenshots, photos, etc., but most of it is my autistic downloading of various art, animations, and other related stuff online. Most of it I rarely look at, and I don't have a whole lot of free time to sort and tag it all by hand. The PTR helps, but it doesn't get everything, and it often misses a lot of the more content-specific tags.
SOLUTION: AI tagging has slowly been improving over the past few years. My understanding is that it's mostly done to help with image generation, but it works just as well for tagging already existing images. In the past, the tool of choice (as far as I know) was hydrus_dd, but...
PROBLEM: DeepDanbooru isn't the most accurate AI tagging model, and it has some major flaws. Chief among them is its inability to tag loli/shota. I don't know if this came from a place of moral disagreement over the contents of those images, or from an inability to download images with those tags from Danbooru in the first place, but it's problematic for several reasons. It makes that content hard to find and hard to filter, which effectively helps neither those who hate loli/shota nor those who love it.
SOLUTION: In the past year, a new tool has been created:
https://github.com/abtalerico/wd-hydrus-tagger. It uses a much more accurate model, and notably it's able to tag loli/shota images.
PROBLEM: The tool dumps all generated tags into the "My Tags" service by default, diluting that service with possibly inaccurate tags. It's better than dumping them into the PTR, but it's not ideal.
SOLUTION: It's not documented, but looking at the code inside
main.py reveals that the tool actually does support sending tags to a specific tag service. Running it with the --tag-service argument lets you pick which service the tags get dumped into. Note that for some reason this only works when reading hashes from a hashes.txt file, but truthfully I don't think anyone is going to be running this on individual images anyway, and you can always just copy the hash of a single image into the file.
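For anyone trying to reproduce this, the command ends up looking roughly like the one below. Everything except --tag-service and hashes.txt is a placeholder or written from memory, so check main.py for the real argument names on your copy:

    REM Sketch only; the API-key and hashes options may be named differently in the
    REM actual tool, and "A.I. Tags" is just an example local tag service in Hydrus.
    python main.py --api-key YOUR_HYDRUS_API_KEY --hashes hashes.txt --tag-service "A.I. Tags"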
PROBLEM: The tool is automatically tagging now, but it's hard to keep track of which files have already been through it and which haven't.
SOLUTION: I managed to figure out how to append another tag onto the dictionary that stores the generated tags before they're dumped into Hydrus. I'm not educated in either Python or AI tagging models, but I was able to work out that the tags are stored as key:value pairs, with the key identifying the tag and the value being the confidence the model has in that tag being accurate. I simply added another pair, with the key 'wd-hydrus-tagger_ai_generated_tags' and an abnormally high value, to ensure it always gets added.
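The change is tiny. Treat this as a sketch of the idea rather than the literal patch, since the variable names and the exact spot in main.py are assumptions:

    # Somewhere in the tool the model's output ends up as a dict of {tag: confidence}
    # before it gets sent to Hydrus. This dict is just an example of what it might hold.
    tags = {"1girl": 0.97, "horns": 0.82}

    # Add a marker tag with an absurdly high "confidence" so it survives whatever
    # confidence threshold the tool applies before uploading.
    tags["wd-hydrus-tagger_ai_generated_tags"] = 99.0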
PROBLEM: Now I know which images have been tagged and which haven't. Things are looking good, except that I'm noticing a specific problem with furry art. Anything with sharp horns is being tagged as "demon-girl", amongst other inaccuracies. The model simply isn't made to identify furry art that well.
SOLUTION: Thankfully, someone has made a dedicated furry AI tagging model with help from the developer who originally made the model that wd-hydrus-tagger uses. This means it's entirely compatible with the tool. I don't know if that was just really good luck or if it's some standard way of designing these models, but I'm acting a little more carefully around dangerous machinery nowadays, just in case all my good karma has run out. However...
PROBLEM: The tool has no clear way to run models locally. There's nothing in the documentation, and looking at the code reveals that the model is downloaded and stored in a cache before running. I'm not entirely sure, but it looks like the code downloads models from huggingface.co and runs them from there, whereas the specific model for furry art is hosted in a Discord server, which puts my archivist heart in pain.
SOLUTION: The solution here is actually pretty simple, but it stumped me for quite a while. I have very little experience with Python, and so far I've made it through on good guesses and some overlapping, if underutilized, C++ knowledge. At first I figured out that I could just replace the WD model in the cache folder with the furry model, rename it to match, and it would work alright. But that led to having to navigate back to that folder and swap the models any time I wanted to tag something that was/wasn't furry. This quickly became tiring and confusing, so I started digging through the various Python files to see how the tool actually locates its models. It turns out that by simply pointing the tool at a specific location for the model, instead of calling the download function and getting the cache location back, you bypass that entire process. This meant I was able to copy the folder the tool was in, rename it to e621-hydrus-tagger, and point it specifically at the furry model. Then it's just a matter of running the code from the e621-hydrus-tagger folder instead of the wd-hydrus-tagger folder. The only flaw so far is that if I move the repos out of the folder I currently have them in, the tool wouldn't be able to find the model files. I doubt that's an impossible problem to fix, though; I probably just need to read the documentation on the path() function more.
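In code terms, the change amounts to something like this. The names and paths below are made up, since I don't have the repo's exact ones in front of me; the point is just replacing the hub download call with a hard-coded local path:

    from pathlib import Path

    # Before (roughly): the tool asks huggingface_hub for the model and gets back
    # a path inside the user's cache folder, e.g. something like
    #   model_path = huggingface_hub.hf_hub_download(MODEL_REPO, MODEL_FILENAME)

    # After: skip the download entirely and point straight at a local copy.
    # The path and file name are just examples of where a local model might sit.
    model_path = Path(r"C:\tools\e621-hydrus-tagger\model\model.onnx")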
PROBLEM: I now have two different models, files are getting tagged, furry art is getting more accurate tags, and all is good. But it turns out the furry model doesn't tag content ratings, meaning there's no quick way to filter out pornography.
SOLUTION: I copied the folder again, pointed it at the WD model to get a rating tag, and discarded all the other tags so it only returns the rating. Then I named the new folder "ratings-hydrus-tagger" and ran it just like the others.
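The filtering itself is only a couple of lines once you've found where the tag dictionary gets built. Again, names are illustrative, and the exact rating tag names depend on the model, so check what yours actually emits:

    # Example of what the model might return, mixing ratings with regular tags.
    tags = {"explicit": 0.93, "canine": 0.88, "solo": 0.75}

    # The usual WD rating names; double-check these against your own output.
    RATING_TAGS = {"general", "sensitive", "questionable", "explicit"}

    # Keep only the rating tags and drop everything else before sending to Hydrus.
    tags = {tag: conf for tag, conf in tags.items() if tag in RATING_TAGS}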
PROBLEM: Now I have 3 different folders, each with a custom version of the tool. I don't want to get confused and accidentally run the wrong model, and storing all the relevant commands in a notepad isn't exactly ideal.
SOLUTION: I created several .bat files to handle all the execution. Plus, this makes sure I don't accidentally erase the Hydrus API key or the tag service from the commands when I run them. I created a start.bat that initializes the venv, and I have wd.bat, e621.bat, and ratings.bat to run their specific versions. I also have e621.bat automatically execute ratings.bat, since that is almost always run directly afterwards. Running the tool is as easy as copying all the hashes into hashes.txt, opening start.bat, typing wd.bat or e621.bat, and pressing enter.
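For anyone curious, the .bat files amount to something like the sketch below. Paths, the venv name, and the python arguments are placeholders (see the command example earlier), so this is just the shape of it:

    REM start.bat - open a console with the venv active (adjust paths to your setup)
    cd /d C:\tools\wd-hydrus-tagger
    call venv\Scripts\activate.bat
    cmd /k

    REM e621.bat - tag with the furry model, then run the ratings pass right after
    cd /d C:\tools\e621-hydrus-tagger
    python main.py --api-key YOUR_HYDRUS_API_KEY --hashes hashes.txt --tag-service "A.I. Tags"
    call ratings.bat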
That about wraps up my experience. Potential improvements include a way to run the various versions of the tool through a menu system instead of having to type out the specific .bat files, and fixing the path()s so the models are found within the local folder, letting me move the whole thing without breaking anything. I'm considering figuring out a way to release this, but it's so hacked together I'm not sure how useful it would be on other people's systems. I'm also running Windows, so I'm not sure how easy it'd be to make it work on other OSes.
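As for the path() fix, the usual trick is to resolve the model location relative to the script itself rather than hard-coding an absolute path, so the whole folder can be moved or renamed freely. A sketch, with a made-up folder layout:

    from pathlib import Path

    # Resolve the model relative to wherever this script lives; "model/model.onnx"
    # is just an example layout, not necessarily what the repo uses.
    model_path = Path(__file__).resolve().parent / "model" / "model.onnx"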
I don't like the idea of hoarding this to myself, since I can imagine it'd help a lot of people, but I'm just not sure how to go about releasing this. Any advice, criticisms, ideas, etc. are all welcome.