/hydrus/ - Hydrus Network

Archive for bug reports, feature requests, and other discussion for the hydrus network.

(14.78 KB 480x360 NBpReDYG1Fk.jpg)

Version 354 hydrus_dev 05/29/2019 (Wed) 22:54:51 Id: 273809 No. 12755
https://www.youtube.com/watch?v=NBpReDYG1Fk

windows
zip: https://github.com/hydrusnetwork/hydrus/releases/download/v354/Hydrus.Network.354.-.Windows.-.Extract.only.zip
exe: https://github.com/hydrusnetwork/hydrus/releases/download/v354/Hydrus.Network.354.-.Windows.-.Installer.exe
os x
app: https://github.com/hydrusnetwork/hydrus/releases/download/v354/Hydrus.Network.354.-.OS.X.-.App.dmg
linux
tar.gz: https://github.com/hydrusnetwork/hydrus/releases/download/v354/Hydrus.Network.354.-.Linux.-.Executable.tar.gz
source
tar.gz: https://github.com/hydrusnetwork/hydrus/archive/v354.tar.gz

I had a great week. The first duplicates storage update is done, and I got some neat misc fixes in as well.

false positives and alternates

The first version of the duplicates system did not store 'false positive' and 'alternates' relationships very efficiently. Furthermore, it was not until we used it in real scenarios that we found the way we wanted to logically apply these states was also not being served well. This changes this week!

So, 'false positive' (until recently called 'not dupes') and 'alternates' (meaning 'these files are related but not duplicates', currently sitting in a holding pattern for a future big job that will let us process them better) are now managed by a more intelligent storage system. On update, your existing relationships will be auto-converted. This system uses significantly less space, particularly for large groups of alts like many game cg collections, and applies relationships transitively by its very structure.

Alternates are now completely transitive, so if you have an A-alt-B set (meaning one group of duplicate files A has an alternate relationship to another group of duplicate files B) and then apply A-alt-C, the relationship B-alt-C will also apply without you having to do anything.
False positive relationships are trickier, but they are stored significantly more efficiently and also apply at the 'alternates' level, so if you have A-alt-B and add A-fp-M, B-fp-M will be automatically inferred. It may sound odd at first that something false positive to A must also be false positive to B, but consider: if our M were same/better/worse than B, it would be in B's group and hence transitively alternate to A, which it cannot be, as we already determined it was not related to A.

Previously allowable mistaken states, such as a false positive relationship within an alternates group, will be corrected on update. Alternates take precedence, and any subsequently invalid false positives will be considered mistakes and discarded. If you know there are some problems here, there is unfortunately no easy way at the moment to cancel or undo an alternate or false positive relationship, but once the whole duplicates system is moved over, I will be able to write a suite of logically correct and more reliable reset/break/dissolve commands for all the various states in which files can be related to each other. These 'reset/set none' tools never worked well in the old system.

The particularly good news about these changes is that they cut down on filtering time. Many related 'potential' duplicates can now be auto-resolved from a single alternate or false positive decision, so if you have set many alts or false positives previously, you will see your pending potentials queue shrink considerably after the update, and as you continue to process. With this more logically consistent design, alternate and false positive counts and results are more sensible and consistent when searched for or opened from the advanced mode thumbnail right-click menu.
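To make the group-level bookkeeping concrete, here is a minimal, illustrative sketch of how transitive alternates and group-level false positives could be stored with a union-find structure. This is not hydrus's actual SQLite-backed code; every class and method name here is invented.

```python
class AlternateIndex:
    """Toy in-memory sketch of group-level alternate/false-positive storage.

    Alternates are merged into one group (union-find), so transitivity is
    free; false positives are stored as pairs of group roots, so they are
    automatically shared by every member of both groups.
    """

    def __init__(self):
        self._parent = {}              # union-find: file -> parent file
        self._false_positives = set()  # frozensets of two group roots

    def _find(self, f):
        self._parent.setdefault(f, f)
        while self._parent[f] != f:
            # path halving keeps lookups near-constant time
            self._parent[f] = self._parent[self._parent[f]]
            f = self._parent[f]
        return f

    def set_alternates(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if frozenset((ra, rb)) in self._false_positives:
            raise ValueError('conflicting false positive relationship')
        if ra != rb:
            self._parent[rb] = ra
            # re-point any false positives that referenced the merged root
            self._false_positives = {
                frozenset(ra if r == rb else r for r in pair)
                for pair in self._false_positives
            }

    def set_false_positive(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra == rb:
            raise ValueError('files are alternates; cannot be false positive')
        self._false_positives.add(frozenset((ra, rb)))

    def are_false_positive(self, a, b):
        return frozenset((self._find(a), self._find(b))) in self._false_positives
```

Setting A-alt-B and A-fp-M then makes `are_false_positive('B', 'M')` report True automatically, matching the inference described above.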
All the internal operations are cleaner, and I feel much better about working on an alternates workflow in future so we can start setting 'WIP' and 'costume change'-style labels on our alternates and browsing them more conveniently in the media viewer.

The actual remaining duplicate relations, 'potential', 'same quality', and 'this is better', are still running on the old, inefficient system. They will be the next to be worked on. I would like to have 'same quality' and 'this is better' done for 356, after E3.

the rest

The tag blacklists in downloaders' tag import options now apply to the page's tags after tag sibling processing. So, if you banned 'high heels', say, but the site suddenly starts delivering 'high-heel shoes', then as long as 'high-heel shoes' gets mapped to 'high heels' in one of your tag services, it should now be correctly filtered. The unprocessed tags are still checked as before, so this is really a double round of checking to cover more valid ground. If you were finding yourself chasing many different synonyms of the tags you do not like, please let me know how this works for you now.

The annoying issue where a handful of thumbnails would sometimes stop fading in during heavy scrolling seems to be fixed! Also, the thumbnail 'waterfall' system is now cleverer about how it schedules thumbnails that need regeneration. You may notice the new file maintenance manager kicking in more often with thumbnail work.

Another long-term annoying issue was the 'pending' menu update-flickering while a lot of tag activity was going on; it would become difficult to interact with. I put some time into this this week, cleaning up a bunch of related code, and I think I figured out a reliable way to stop the menus on the main gui updating while one is open. It won't flicker and should let you start a tags commit even while other things are going on.
The other menus (like the 'services' menu, which updates on various service update info) will act the same way. Overall menu-related system stability should be improved for certain Linux users as well.

The new 'fix siblings and parents' button on manage tags is now a menu button that lets you apply siblings and parents from all services or just from the service you are looking at. These commands overrule your 'apply sibs/parents across all services' settings. So, if you ever accidentally applied a local sibling to the PTR or vice versa, please try the specific-service option here.

full list

- duplicates important:
- duplicates 'false positive' and 'alternates' pairs are now stored in a new, more efficient structure that is better suited to larger groups of files
- alternate relationships are now implicitly transitive: if A is alternate B and A is alternate C, B is now alternate C
- false positive relationships remain correctly non-transitive, but they are now implicitly shared amongst alternates: if A is alternate B and A is false positive with C, B is now false positive with C. and further, if C alt D, then A and B are implicitly fp D as well!
- your existing false positive and alternates relationships will be migrated on update. alternates will apply first, so in the case of conflicts due to previous non-excellent filtering workflow, formerly invalid false positives (i.e. false positives between now-transitive alternates) will be discarded. invalid potentials will also be cleared out
- attempting to set a 'false positive' or 'alternates' relationship on files that already have a conflicting relation (e.g. setting false positive on two files that are already alternates) now does nothing. in future, this will have graceful failure reporting
- the false positive and alternate transitivity clears out potential dupes at a faster rate than previously, speeding up duplicate filter workflow and reducing redundancy on the human end
- unfortunately, as potential and better/worse/same pairs have yet to be updated, the system may report that a file has the same file as both alternate and same-quality partner. this will be automatically corrected in the coming weeks
- when selecting 'view this file's duplicates' from thumbnail right-click, the focus file will now be the first file displayed in the next page
- .
- duplicates boring details:
- setting 'false positive' and 'alternates' status now accounts for the new data storage, and a variety of follow-on assumptions and transitive properties (such as implying other false positive relationships or clearing out potential dupes between two groups of merging alternates) are now dealt with more rigorously (and will be more so when I move the true 'duplicate' file relationships over)
- fetching file duplicate status counts, file duplicate status hashes, and searching for system:num_dupes now account for the new data storage r.e. false positives and alternates
- new potential dupes are culled when they conflict with the new transitive alternate and false positive relationships
- removed the code that fudges explicit transitive 'false positive' and 'alternate' relationships based on existing same/better/worse pairs when setting new dupe pairs.
this temporary gap will be filled back in over the coming weeks (clearing out way more potentials too)
- several specific advanced duplicate actions are now cleared out to make way for future streamlining of the filter workflow:
- removed the 'duplicate_media_set_false_positive' shortcut, which is an action only appropriate when viewing confirmed potentials through the duplicate filter (or after the 'show random pairs' button)
- removed the 'duplicate_media_remove_relationships' shortcut and menu action ('remove x pairs … from the dupes system'), which will return as multiple more precise and reliable 'dissolve' actions in the coming weeks
- removed the 'duplicate_media_reset_to_potential' shortcut and menu action ('send the x pairs … to be compared in the duplicates filter') as it was always buggy and led to bloating of the filter queue. it is likely to return as part of the 'dissolve'-style reset commands as above
- fixed an issue where hitting the 'duplicate_media_set_focused_better' shortcut with no focused thumb would throw an error
- started proper unit tests for the duplicates system and filled in the phash search, basic current better/worse, and false positive and alternate components
- various incidences of duplicate 'action options' and similar phrasing are now unified to 'metadata merge options'
- cleaned up 'unknown/potential' phrasing in duplicate pair code and some related duplicate filter code
- cleaned up wording and layout of the thumbnail duplicates menu
- .
- the rest:
- tag blacklists in downloaders' tag import options now apply to the parsed tags both before and after a tag sibling collapse.
it uses the combined tag sibling rules, so feedback on how well this works irl would be appreciated
- I believe I fixed the annoying issue where a handful of thumbnails would sometimes inexplicably not fade in during thumbgrid scrolling (typically on first thumb load: this problem was aggravated by the scroll/thumb-render speed ratio)
- when to-be-regenerated thumbnails are taken off the thumbnail waterfall queue due to fast scrolling or page switching, they are now queued up in the new file maintenance system for idle-time work!
- the main gui menus will now no longer try to update while they are open! uploading pending tags while lots of new tags are coming in is now much more reliable. let me know if you discover a way to get stuck in this frozen state!
- cleaned up some main gui menu regeneration code, reducing the total number of stub objects created and deleted, particularly when the 'pending' menu refreshes its label frequently while uploading many pending tags. should be a bit more stable for some linux flavours
- the 'fix siblings and parents' button on manage tags is now a menu button with two options: fixing according to the 'all services combined' siblings and parents, or just for the current panel's service. this overrides the 'apply sibs/parents across all services' options. this will be revisited in future when more complicated sibling application rules are added
- the 'hide and anchor mouse' check under 'options->media' is no longer windows-only, if you want to test it, and the previous touchscreen-detecting override (which unhid and unanchored on vigorous movement) is now optional, defaulting to off
- greatly reduced typical and max repository pre-processing disk cache time and reworked stop calculations to ensure some work always gets done
- fixed an issue with 'show some random dupes' thumbnails not hiding on manual trashing, if that option is set.
'show some random dupes' thumbnail panels will now inherit their file service from the current duplicate search domain
- repository processing will now never run for more than an hour at once. this mitigates some edge-case disastrous ui-hanging outcomes and generally gives a chance for hydrus-level jobs like subscriptions, and even other programs like defraggers, to run even when there is a gigantic backlog of processing to do
- added yet another CORS header to improve Client API CORS compatibility, and fixed an overauthentication problem
- setting a blank string on the new local booru external port override option will now forgo the host:port colon in the resultant external url. a tooltip on the control repeats this
- reworded and coloured the pause/play sync button in the review services repository panel to be clearer about current paused status
- fixed a problem when closing the gui when the popup message manager has already been closed by clever OS-specific means
- misc code cleanup
- updated sqlite on windows to 3.28.0
- updated upnpc exe on windows to 2.1

next week

The duplicates work this week took more time than I expected. I still have many small jobs I want to catch up on, so I am shunting my rotating schedule down a week and doing a repeat. I will add some new shortcuts, some new tab commands, and hopefully a clipboard URL watcher and a new way of adding OR search predicates.
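As a rough illustration of the changelog's double-round blacklist check, here is a sketch. The dict-based `sibling_map` is an invented stand-in for hydrus's real combined tag sibling storage, not its actual API:

```python
def tag_is_blacklisted(tag, blacklist, sibling_map):
    """Check a parsed tag against the blacklist both before and after
    tag sibling collapse, as the changelog describes.

    sibling_map maps a tag to its preferred sibling; tags with no
    sibling collapse to themselves.
    """
    if tag in blacklist:
        return True  # first round: the raw parsed tag
    collapsed = sibling_map.get(tag, tag)
    return collapsed in blacklist  # second round: after sibling collapse
```

So banning 'high heels' also catches a site's 'high-heel shoes' once a sibling mapping 'high-heel shoes' -> 'high heels' exists in some tag service.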
Have ordered parent images and siblings been implemented yet? I want image A, image B, image C, image D to be ordered like so, as they are directly related to each other, so I would like hydrus to sort this set of related images in this order, like a motion picture book. Can it do this?
(114.40 KB 1367x1052 Untitled.png)

I used the temp_dir command to set the temp dir for Hydrus but it is still writing to my Windows temp folder while going through my subscriptions. I noticed because it's triggering Windows Defender with what I assume is false positives while downloading from a yiff.party subscription. Pic related.
>>12755 >- tag blacklists in downloaders' tag import options now apply to the parsed tags both before and after a tag sibling collapse. it uses the combined tag sibling rules, so feedback on how well this works irl would be appreciated Not sure about blacklists, but whitelists are working fine now. Before, whitelisting a specific namespaced tag like "rating:safe" didn't work, only the entire namespace "rating:" could be pulled.
I might be misunderstanding what you wrote, but let's say there are images A, B, C, and you assign both A and B, and C and B, as alternates. Will A and C be automatically set as alternates? This is bad logic, as A and C aren't necessarily alternates; they could also be duplicates. Here's an example situation where images will be invalidly set as alternates:
1. let A and C be duplicates
2. let B be an alternate of both A and C (since they're the same)
3. the duplicate filter shows A and B, and the user correctly sets them as alternates
4. the duplicate filter shows B and C, and the user correctly sets them as alternates
5. the system will set A and C as alternates, while in fact they're duplicates
>>12767 got it wrong, all I asked was if there was a way to set images in order: a, b, c, 1, 2, 3, 4. If you name one folder 1, another folder 2, and another 3, the windows file system will order the folders 1, 2, then 3. Similarly with pictures, I want them ordered in hydrus as if they're a picture book
>>12769 I meant that for hydrus_dev, not you.
(85.53 KB 400x267 1351796209766.jpg)

yo hydev, someone's using all the fucking bandwidth in the first week of every month and we can't upload tags. am i gonna have to hunt a bitch here.
>>12771 Right, I was also going to upload. Did the user base / amount of tags uploaded increase this much, or is it some kind of DoS/defective client?
>>12771 My 70k tags are uploading fine. Either my internet is so shit I'm already used to it, or it's fixed. Either way, it was probably a bunch of clients uploading and downloading shit at the same time the second the bandwidth limit reset.
>>12767 I think I may have just experienced this with a CG set that got downloaded again in lower quality. It made me compare the new low quality CGs to each other (all alternates) and then only compared one new low quality > old high quality, and then set them all as alternates. Although, I may have gone full retard and right clicked instead of left clicked on that one pair, but I don't think so.
(18.38 KB 473x222 abc.jpg)

>>12767 >>12775 Well, I can confirm this now, with your exact example. A and B were already set as alternates. Duplicate filter showed me C and B, but never A and C.
Would it be possible to get a rating service that has multiple states in just one pip? Click once for red, twice for yellow, thrice for green and so on? It would work much better for tracking workflows than like/dislike and take less space than the numeric ratings.
>>12772 >>12771 >>12773 I think PTR tag uploads should be limited to tags of a certain popularity, at least for a limited time, to save bandwidth. This also should prevent shitty tags being uploaded
>>12771 maybe it's not a problem of mass tag uploads, but of mass tag downloads. is this bandwidth limit really necessary? it's day 1 of the month and there's already 9GB used
>>12775 >>12767 >>12776 Thank you very much for these reports. I agree this is a problem, and a mistake in my alternate-setting code: it is too aggressive in how it clears out potential pairs. I am ok with the current structure of transitively applying alternates, even amongst potential duplicates, but the problem in the 354 code is that it clears out all potentials between all alts in an alternate group. Instead, I should only clear out potentials between the members of the set alternate pair, which leaves the potentials that have not yet been considered still queued up in the duplicate filter, to be merged if they are duplicates. I will fix the over-aggressive potential clearing for 355.

Once the rewrite is done, there will be manual commands to merge alternates by right-clicking thumbnails, and for manually requeueing files in the duplicate filter. Can you rate how significantly this problem has affected you? Would you like me to automatically requeue your alts into the duplicate filter, giving you more decisions to make but covering all situations, or would you rather fix it manually later on?

I just replied to an email on the same issue, so I'll spam the relevant bit of my answer here as well:

I am going to define A and B as costume alternates. Let's say A is a picture of a girl in a red dress. B is a picture of the same girl in a yellow dress. C is the same yellow dress as B, but higher quality. You have already set A alt B.

Starting with:

A alt B
A pot C
B pot C

Original 354 code: Setting A alt C makes:

A alt B alt C

Better 355 code: Setting A alt C makes:

A alt B alt C
B pot C

Hence in 355 a B/C pair will still be queued up for processing in the duplicate filter. Setting B < C would then merge B into C, so we would have the situation:

A alt C

I believe this fixes the issue. Please let me know if you think it is insufficient.
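The 355 fix described above amounts to discarding only the one potential pair touched by the new alternate relationship. A toy sketch, with invented names (`group_of` maps each file to its duplicate-group id; nothing here is hydrus's actual code):

```python
def clear_potentials_on_alt(potential_pairs, group_of, a, b):
    """When a and b are set as alternates, discard only the potential pair
    between a's and b's duplicate groups. Other potentials (e.g. B pot C)
    stay queued for the duplicate filter, unlike the over-aggressive 354
    behaviour that cleared all potentials across the whole alternate group.
    """
    ga, gb = group_of[a], group_of[b]
    return {p for p in potential_pairs if p != frozenset((ga, gb))}
```

With the worked example above, setting A alt C removes only the A/C potential, so B pot C survives to be shown in the duplicate filter and merged if they turn out to be duplicates.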
>>12756 I did consider this when making the decision to elevate false positive to apply to whole alternate groups, but I ultimately concluded that the 'progression' issue is an edge case. The value of not having to true-false-positive a file against each of 150 game cgs with 150 decisions (or worse, 50 alts against 150 alts for something like 7,500 decisions) is, I think, much greater than resolving a handful of false-false-positives within unusual groups of alternates. Once this whole overhaul is done, I expect to have comprehensive thumbnail right-click actions to clean up mistakes, and I think that is the best avenue here, at least for now. I think in the current scheme, your situation would ultimately end up with two alternate groups with a single false-false-positive relation between them. Once you discovered this, it would be a single two-thumbnail selection and right-click action to fix and merge the two groups. I am prepared to change this if it proves to be a common problem IRL. Please give me feedback as this system matures if you actually encounter this situation, and how it tends to happen. If it is a problem, I suspect there may be other remedies, such as choosing the sort order of potential pairs in the duplicates filter in some cleverer alternate-aware way.
>>12757 >>12769 No, file alternates do not have a rich storage system or workflow yet. At the moment, they are set as simple 'alternates', basically in a holding area away from the actual duplicate storage so that we can revisit them once I have written some alternate workflow and presentation (things like notifying and quick-navigating to alts in the media viewer). Adding rich file alternate relationships will be a big job that will go in the regular big job polls. This current duplicates db overhaul is in part a preparation for it. I expect it to be popular. I would like to have labels and indices in the file alt metadata, so you'll be able to set, say, a 'WIP' (work in progress) label to four art files and 1, 2, 3, 4 so you have a timeline.
>>12759 Thank you for this report. Can you just confirm a couple of things for me? You don't have to give the paths or take screenshots, just confirm which string is correct where, if any. Hit help->debug->report modes->subprocess report mode. Then help->about. Is the 'temp dir' listed there correct, or is it your AppData\Local\Temp? The subprocess report mode should have spammed a big popup. If you look down it (or check what it wrote to your log file under install_dir/db/client - 2019-06.log), it should have one or both of the variables TMP and TEMP, with the path written right after. Are those incorrect? I will look at this code this week.
>>12761 That's interesting. I am not sure if my change here did that, as this made the tests more restrictive. If you don't mind, can you explain a bit more, maybe with an example URL? Was it that a URL with 'rating:safe' was not getting through before, despite the whitelist rule? -or- Was it that a URL without 'rating:safe' was getting through, despite the whitelist rule? This week's change does sibling stuff, so is there any chance the site you are pulling from instead gives something like 'sfw' that is being mapped in your client to 'rating:safe'?
>>12771 >>12772 >>12773 >>12779 >>12778 I think it has reset now. Maybe you were close to some timezone midnight somewhere and hadn't seen it reset yet. But yeah, this problem is on my mind. I hope to address it completely after this duplicate work, likely in 6-8 weeks. I will explore using IPFS and hydrus-specific mirror repositories that some users have kindly offered to run to help me with bandwidth. I am not certain what I will do yet. I will also do a round of network updates to clear out a little of the more obvious non-useful tags (stuff like 'banned artist', which comes from boorus and has no use for us), and help you shape what tags you see better. Unless we accidentally finally reach tag-plateau-nirvana, I think the last weeks of the month will run out of bandwidth again. Please bear with the situation for a couple more months!
>>12772 It has been a slow growth. We have been getting closer and closer to my original 256GB/month limit for a long time, and then a combination of some new users (who are going to eat more bandwidth per day on average as they catch up) and some new parsing gluts from new sites being parsed accelerated us over the limit a little earlier than I expected. Being increasingly popular and active is a nice problem to have, but still a problem. Thankfully, due to the way hydrus works (all update files are the same for all users), it is technically not super difficult to solve–I just need a few weeks to figure out the solution(s).
>>12777 That's an interesting idea, and I like it. A multistate single-pip rating service is basically just like/dislike with n states instead of two, so perhaps I could generalise the like/dislike code into that? I will think about this a bit and make a proper job for it. It is probably a bit too complicated to fit into small weekly work, but perhaps I can sneak it into an 'ongoing' week.
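As a sketch of that generalisation, a single pip cycling through n states on click might look like this. Purely illustrative, not hydrus's rating service code; all names are invented:

```python
class MultiStatePip:
    """A single rating pip with n states that cycle on click:
    unrated -> state 0 -> state 1 -> ... -> unrated.

    Like/dislike is just the n == 2 special case of this.
    """

    def __init__(self, states):
        self.states = list(states)  # e.g. ['red', 'yellow', 'green']
        self.index = None           # None means unrated

    def click(self):
        if self.index is None:
            self.index = 0
        elif self.index + 1 < len(self.states):
            self.index += 1
        else:
            self.index = None       # cycle back to unrated
        return self.current()

    def current(self):
        return None if self.index is None else self.states[self.index]
```

For a workflow tracker, `MultiStatePip(['red', 'yellow', 'green'])` gives one click for red, two for yellow, three for green, and a fourth to clear.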
>>12786 What's the limit even for, is it data caps or just a protection against getting the DB bloated with a fuckton of requests? If it's the latter, wouldn't a daily or weekly limit be better?
(243.17 KB 1292x617 Capture.png)

>>12785 Consider that I was downloading from gelbooru (no search limit), so my queries were extensive and all the whitelisted tags were guaranteed to be there. I only wanted the whitelisted tags/namespaces to be allowed into my local tags repo and to ignore the rest. Despite the images having the tags, they were being ignored and only the highlighted ones were being stored. Not even sure if it is intended to work like this desu. All I know is that after the update the tags started to show in my tags repo.
>>12781 Is the scenario in the 355 code a typo? Shouldn't it be 'setting A alt C' rather than 'setting A alt B', as we already started with A alt B? Otherwise, the logic seems sound.

My understanding of the most efficient you can make this without falsely defining a duplicate relationship is the following: False positives can be defined transitively to alternate, equal, and better/worse images. Alternates cannot be defined transitively to alternate images; however, they can be defined transitively to equal and better/worse images, as those are effectively all the same image. Equals cannot be defined transitively to alternates or better/worse images, but they can be defined transitively to equal images.

Better/worse pairs can be partially defined transitively: essentially, all the images worse than the worse image should be defined as worse than the better image, and all the images better than the better image should be defined as better than the worse image. The rest of the better/worse relations are ambiguous and need user intervention, plus some more logic that's somewhat pointless to define, as most users will not care that a particular image is the 3rd best or 8th worst image, or at least I do not. Further, most better/worse logic defined in this paragraph is pointless, as most users delete the worse image.
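The poster's propagation rules can be condensed into a small lookup table. This is a toy restatement of the reasoning above under the poster's model (which, note, keeps alt-alt non-transitive, unlike 354), not hydrus's implementation; symmetric compositions are omitted for brevity:

```python
# Keys are (relation A-B, relation B-C); values are what can safely be
# inferred for A-C, with None meaning 'ambiguous, needs a user decision'.
INFERRED = {
    ('alt', 'equal'): 'alt',         # alternates extend across equal files
    ('alt', 'alt'): None,            # A and C might actually be duplicates
    ('fp', 'equal'): 'fp',           # false positives extend across equals
    ('fp', 'alt'): 'fp',             # ...and across alternates
    ('equal', 'equal'): 'equal',
    ('better', 'better'): 'better',  # better-than-better is better
}

def infer(rel_ab, rel_bc):
    """Return the inferable A-C relation, or None if it needs the user."""
    return INFERRED.get((rel_ab, rel_bc))
```

The `('alt', 'alt') -> None` entry is exactly the A/B/C scenario reported earlier in the thread: two alternate decisions do not by themselves settle whether A and C are duplicates.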
>>12784
>Is the 'temp dir' listed there correct, or your AppData\Local\Temp
It's wrong; it lists AppData\Local\Temp.
>Are those incorrect?
No, they are both correctly set to my wanted temp dir. Thanks.
(266.19 KB 1920x1080 muh dupe schema.png)

>>12818 Shit, yes, that was a typo, thanks for pointing it out. I edited the post.

Yeah, my original stab at this system attempted to rank duplicate images completely from worst to best in a chain. My first thought on this update was to improve the storage system to better support the logic and storage of that chain and deal with equal-value pairs neatly and so on, but looking at the workflow and processing hell it would take to truly deal with that, when what we really care about 98% of the time is which is the best of a set of dupes, I decided to move to a 'king' system.

Basically, for file dupes, I will group them into a single media group that has one 'king', meaning the best quality file of the group. Setting something equal or worse than the king will merge it into the group, and setting something better than the king merges it into the group and sets a new king. Alternate groups are groups of media groups, and our false positive relations are pairs of alternate group identifiers. I will extend the system in a future iteration to add more alternate metadata, probably an optional label and index, so you can do 'WIP 3' and 'costume change child' kinds of thing.

I've done other versions of pic related before as a shit-tier demonstration of the new system, but it is actually what I am aiming for. Everything is moving to groups, and there are two tiers of group: duplicates groups and alternates groups. The different relationships between the files are now stored more implicitly (rather than in the old system, which is a rats' nest of pairs all competing at the same level).
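A minimal sketch of the described 'king' model, with invented names (this is an illustration of the idea, not hydrus's schema):

```python
class DuplicateGroup:
    """A duplicate group keeps one best-quality file, the 'king'.

    Equal/worse decisions absorb the new file; a 'better' decision absorbs
    it and crowns it. Alternates would then be groups of these groups.
    """

    def __init__(self, king):
        self.king = king
        self.members = {king}

    def set_same_or_worse(self, file):
        # equal or worse than the king: just absorb into the group
        self.members.add(file)

    def set_better(self, file):
        # better than the current king: absorb and crown the newcomer
        self.members.add(file)
        self.king = file
```

This is why a single 'this is better' decision can resolve a whole pile of pairwise questions: every member's standing is expressed relative to the king rather than to each other member.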
>>12807 The PTR is a laptop under my desk, and I have a monthly bandwidth cap here of 1TB a month. I don't have much spare at the end of the month, so I can't easily raise the rule. I would still like to solve the distribution problem to better decentralise the PTR (having mirrors or IPFS support will help if the PTR goes down for a long time, and so on), but I also have the option of paying for more bandwidth. I am seriously considering this, especially as my other bandwidth drains are increasing as well. If 4k starts being legit on youtube or whatever in the next couple of years and I find I like it, I guess I'll have to bite the bullet anyway. I'm debating whether to add a ~8GB/day rule just for the next couple of months to smooth out the end-of-month cutting-out problem. I am not sure if it will make things more aggravating for some part of the network (like, say, people in timezone x who want to upload tags).
>>12817 Ah, I think I explained it incorrectly. The fixed 'blacklist' is the section in your screenshot a little above ('set file blacklist'). There you can stop files from being downloaded and imported if they have 'scat' or 'gore' or whatever you don't like. That system now checks both pre- and post-siblings collapse, in case you don't like 'gender:futanari' but the booru serves up 'futa', say. That's interesting that your local tag filter whitelist is working ok. I think I remember changing some similar tag sibling stuff in there as well recently. Perhaps that site is delivering some variant of 'rating:safe' like 'rating:sfw' and there is or isn't a tag sibling now being applied. Please keep me updated on how this goes in future! Tag siblings have no unified pipeline and are semi-logic-hell in how they are applied in many places, so bug reports on this stuff until I can properly rewrite the system are great.
>>12820 Thank you, I will check again how this is being applied.
>>12824 If it's datacaps there's nothing to do. Making the period smaller would work against fags bloating the db on purpose, but seeing as that isn't the case there wouldn't be much difference. >I am not sure if it will make some part of the network (like say people in timezone x who want to upload tags) more aggravating. If you're worried about that, how about weekly limits instead? It'd start getting full towards the end, but the first few days would be smooth sailing regardless of timezone.
>>12831 Sure, let's give it a go. 64GB/week. You can see it in your review services if you refresh account right now. If it goes funky for some odd bandwidth-cycling reason, I can always reset it.
Ok hydev, still going through video files for removal, and this is the 8th fucking video that is either the same or damn near the same as the others. So this got me thinking: if I had the ability, how would I check for duplicate videos? While I'm not coming up with good answers, I do have ideas. First, it could take an image every second or so and use them to fuzzy match, and if something fuzzy matches, it goes a bit more process-intensive to see if it is a real match. This would allow both exact duplicates and clips from larger files to be found. Given what the dup finder can do with literal garbage, I think this could work, if it's even an option in the first place. Depending on how big the files it would pull are, it could bloat things out, but ultimately I think it would be an overall good.
>>12841 Yeah, I do plan to add videos to the duplicate system, and I originally designed it to eventually support them. The recent file maintenance system was a step forward in prepping for the CPU work we'll need to do to retroactively crunch the data on this in a reasonable way. I plan to do something like what you propose. The duplicates system currently works by comparing still images' shapes with each other, and it allows multiple still image 'phashes' per file, so my task is selecting a good number of useful frames from videos that will match with others. If it is reasonably possible, I would like to do something cleverer than just picking one frame per x time units or frames. That would line up right if our two vids were exact conversions or resizes, but some codec changes drop a frame at the start or do 29.97fps vs 30fps bullshit that would desync our comparison. My original duplicates system did add vids by using the first frame, but so many have a black/flat colour first frame that it led to a billion false positive dupes. Vids are no longer included, and I also drop anything that looks too much like a flat colour image from the system entirely. If I could instead find the x most 'interesting' frames of a video, then 2-second gif clips of 20-second webms would have a higher chance of being matched, and 30/60fps conversions would too. I don't know, though. That is probably beyond me to do well, or maybe I can hack something that is good enough. I could do something like generating a phash for every frame in the vid and then have them compete with each other to remove similar-looking frames/phashes until the 20 most unique were left. It might pick up a bunch of false positives again with, say, a black screen with a bit of white text (like an 'intro' title card), though.
Still, I am almost ready to do this now, and dupe work is proceeding, including more efficient storage of potential dupes, so maybe the answer here is to get a simple system in and then iterate on it.
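The 'compete to keep the most distinct frames' idea floated above could be sketched as a greedy filter over per-frame phashes. Everything here (function name, 64-bit integer phashes, both thresholds) is an assumption for the demo, not hydrus code:

```python
def pick_distinct_frames(frame_phashes, max_frames=20, min_distance=8):
    """Greedily keep frames whose phash differs from every already-kept
    phash by at least min_distance bits (Hamming distance), so runs of
    near-identical frames (static scenes, fades) collapse to one
    representative. Thresholds are made up for illustration.
    """
    kept = []
    for ph in frame_phashes:
        # XOR then popcount gives the Hamming distance between two hashes
        if all(bin(ph ^ k).count('1') >= min_distance for k in kept):
            kept.append(ph)
            if len(kept) >= max_frames:
                break
    return kept
```

A real version would also need the title-card caveat from the post: a mostly-black frame with a little text is 'distinct' within its own video but matches similar frames in thousands of unrelated videos, so such frames would likely need filtering out before this step.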
>>12846 When I say literal garbage, I mean the image is drastically different, to the point I can't even see how it thought they were dups, but those are from the asshole who specifically fucked with dup detection when creating trash images. The dup detector has enough wiggle room that even if the images aren't lining up perfectly, it may spit out something useable. I used 1 second just because, no matter what I watch, 1 second isn't enough time for a 100% scene change, so it should pick up some duplicates from that. On the title card, you could make a generic 'here is a black frame with text' and have a few variations of it; this could be used as a 'compare to X image' check, so it would automatically know that everything matching it would be seen as a duplicate. If you are able to, try to get in contact with the people from What Anime Is This and see how they did theirs; it may give some ideas.
>>12846 Could always check how Video Comparer works. It's the best video dupe finder software I've used.

