This has come up a couple of times on the main general. It's pretty messy, and while some of the pieces work on their own, nothing ties them into a straight automated pipeline, or even something turnkey and hassle-free to run, at the moment. Since everyone here is more or less determined and committed to the craft, I figured we could get some of the best minds together, with a little push from ChatGPT, and get this working to streamline the process of turning anime episodes into datasets.
https://github.com/cyber-meow/anime_screenshot_pipeline
Let me provide some of my notes and observations from what I have done so far:
With frame extraction, as stated in the github, you are taking a 24-minute episode of about 34k frames and condensing it to roughly 4k, 6k, or 9k non-frozen/dead frames, depending on the show, episode, studio, or era of the source. The work is done by ffmpeg's `mpdecimate` filter, whose purpose is to "drop frames that do not differ greatly from the previous frame in order to reduce frame rate."
The ffmpeg frame extraction command provided in the github works fine. The issue is that the repo author's bulk file script, `extract_frames.py`, doesn't play nice: it only creates the folders, and the ffmpeg call inside it fails to execute. Based on some errors I ran into earlier, I did consider that the video file naming could be what breaks the script, but since running ffmpeg directly has no such problem, I sidestepped the bulk script.
Since I already compiled the datasets I'm currently working on by running the command manually, I haven't needed to go back and retry the script with any modifications. ChatGPT did offer some suggestions, but it needed a copy of the error output to review, which I no longer had and didn't have time to reproduce.
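For anyone who wants to batch the manual command without the repo's script, here is a minimal sketch of one way to wrap it in Python. To be clear, this is not the repo's `extract_frames.py` and not its exact ffmpeg flags: the extension list, the output layout, and the function names `build_ffmpeg_command` and `extract_all` are made up for illustration, and the filter string is just a plain `mpdecimate` call. Arguments are passed as a list so odd characters in file names need no shell quoting (whether that is what breaks the original script is only a guess), and ffmpeg's stderr is printed on failure so there is output to review afterwards.

```python
import subprocess
from pathlib import Path

# Hypothetical list of container formats to pick up; adjust to your files.
VIDEO_EXTENSIONS = {".mkv", ".mp4", ".avi"}


def build_ffmpeg_command(video, out_root):
    """Return the ffmpeg argument list that dumps de-duplicated frames of one video."""
    out_dir = Path(out_root) / Path(video).stem
    pattern = out_dir / f"{Path(video).stem}_%d.png"
    return [
        "ffmpeg",
        "-i", str(video),
        # mpdecimate drops frames that barely differ from the previous one;
        # setpts renumbers the survivors so ffmpeg does not pad the gaps
        # with repeated frames when writing the image sequence.
        "-vf", "mpdecimate,setpts=N/FRAME_RATE/TB",
        str(pattern),
    ]


def extract_all(src_dir, out_root):
    """Run the command on every video in src_dir, one output folder per video."""
    for video in sorted(Path(src_dir).iterdir()):
        if video.suffix.lower() not in VIDEO_EXTENSIONS:
            continue
        (Path(out_root) / video.stem).mkdir(parents=True, exist_ok=True)
        cmd = build_ffmpeg_command(video, out_root)
        # No shell is involved, so spaces and brackets in names are passed verbatim.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            # Keep the tail of ffmpeg's log so the failure can be diagnosed later.
            print(f"ffmpeg failed on {video.name}:\n{result.stderr[-2000:]}")
```

Calling `extract_all("episodes", "frames")` would then process every video in a folder named `episodes` (both folder names are placeholders). One caveat: a literal `%` in a video's file name would be read by ffmpeg as part of the numbering pattern, so rename such files first.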
Similar image removal: the application running the filter is `FiftyOne`, an open-source computer vision toolkit for building and curating image datasets, with one of its recent uses being to build clean visual datasets for vehicle autopilot AI. Running `remove_similar.ipynb` in Jupyter Notebook gives a second round of filtering that removes duplicate or very similar frames above a certain threshold across the entire dataset, instead of only between sequential frames as mpdecimate does. This catches cases where the animation is stretched out, such as talking scenes where only the mouth moves, standing shots where the camera isn't being panned, etc.
The script's default threshold for what is considered a duplicate is `0.985`. Even at this value I've noticed some frames being flagged as duplicates and purged that shouldn't have been, but that's what manual review is for if you need that higher accuracy in a dataset.
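To give a feel for what a threshold like `0.985` does, here is a toy sketch of threshold-based duplicate removal. It assumes the frames are compared by cosine similarity between embedding vectors, which I have not confirmed against the notebook's actual code, and the three-number "embeddings" and the names `cosine_similarity` and `keep_indices` are invented for illustration (real embeddings have hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def keep_indices(embeddings, threshold=0.985):
    """Greedily keep a frame only if it is below the threshold against every kept frame."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: frame 1 is almost identical to frame 0,
# frame 3 is similar to frame 0 but noticeably different.
toy_frames = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.3, 0.0],
]

print(keep_indices(toy_frames))                  # default threshold
print(keep_indices(toy_frames, threshold=0.94))  # more aggressive
```

At the default threshold only the near-identical frame 1 is dropped, while lowering the threshold to 0.94 also drops frame 3. That is the trade-off in a nutshell: a lower threshold purges more aggressively and takes more frames you may have wanted to keep.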
The main issue I ran into was that with my dataset (could be a personal issue) the duplicate image detection notebook was painfully slow, reading at about 1 sample/s. That's one and a half hours to sort through a single 24-minute episode's worth of already filtered frames.
Through some trial and error and ChatGPT QA, I found that switching the model used in the script provided much faster results.
If you want to test your luck, switch out the following in Cell 2:
`model = foz.load_zoo_model("mobilenet-v2-imagenet-torch")`
with
`model = foz.load_zoo_model("alexnet-imagenet-torch")`
I was getting 4.9~5.1 samples/s after the adjustment, or roughly 15 minutes per episode. That's about a 5x improvement in speed.
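If you want to estimate how long your own episodes will take, the arithmetic is just frames divided by throughput. A quick sketch, using the frame counts from the extraction step and the two speeds I measured (the function name `minutes_per_episode` is made up, and your own throughput will depend on your hardware):

```python
def minutes_per_episode(frame_count, samples_per_second):
    """Minutes needed to run the similarity model over one episode's frames."""
    return frame_count / samples_per_second / 60


# Typical post-mpdecimate frame counts, at the slow and fast model speeds.
for frame_count in (4000, 6000, 9000):
    slow = minutes_per_episode(frame_count, 1.0)
    fast = minutes_per_episode(frame_count, 5.0)
    print(f"{frame_count} frames: {slow:.0f} min at 1 sample/s, "
          f"{fast:.0f} min at 5 samples/s")
```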
Other models can be found on:
https://docs.voxel51.com/user_guide/model_zoo/models.html
The recommendation was to stick to "imagenet" models but feel free to explore.
The github recommends 2 other alternatives for this task, but I have not checked them out myself.
https://github.com/ryanfwy/image-similarity
https://github.com/ChsHub/SSIM-PIL
I haven't proceeded further than this because, until just recently, I had a bit of an issue installing the face detector. The github links additional documentation on setting up the face detection, as well as other commands by kohya_ss, the same person who makes SD-Scripts, which would just need to be DeepL'd for those of us who only read English.
https://github.com/hysts/anime-face-detector
https://note.com/kohya_ss/n/nad3bce9a3622
The face detection docs also include regularization instructions, including rotating the face images into the proper orientation for training.
Tagging is being done with wd-1-4-vit.
Face detection can be trained on the subjects, which I assume is for an automated filtering process and for later dreambooth weight calculations.
From there the rest is a bit of a blur. I'm admittedly not sharp enough to go through this on my own, so I am kind of asking for help, but I felt that asking without providing some sort of primer with fixes first would be rude. Hopefully this helps everyone who's trying to build up LoRA or even model datasets.