/t/ - Technology

Discussion of Technology



(43.05 KB 618x656 ChannelChangerLogo_Avatar.png)

(133.69 KB 1510x924 ChannelChangerLogo.png)

ChannelChanger Development & Support Anonymous 09/06/2020 (Sun) 18:37:48 No. 1257
This is the official development and support thread for ChannelChanger. Please request help, post bugs, or offer suggestions here.

What is ChannelChanger?
A cross-platform, multi-site scraper and importer. It allows anyone to back up a board and then import it to their own website.
https://gitgud.io/Codexx/channel_changer

What do I need to run this?
Python 3.8+ and most of the dependencies listed in requirements.txt. A basic set-up guide is provided in the readme. This software was developed and tested exclusively on Linux. I intend to support both OSX and Windows. If you use either of these platforms and encounter any issues, please let me know.

Can I scrape a board from [site] with this?
Probably. There is explicit support for LynxChan, Vichan, and JSChan websites.
Some Vichan sites may have issues with thumbnails because their APIs do not expose thumbnail extensions; I have added an override, but you may need to run two scrapes of boards on some sites to get all of the thumbnails. Vichan's API matches 4chan's with some extensions, so the scraper might work on other sites which clone the 4chan API, but this is untested. Many Vichan sites have customized frontends, such as OpenIB, Lainchan, or Kissue. I've tested and confirmed these work, but can't always guarantee full compatibility with each of these, especially if they decide to alter the API or where files are stored.
LynxChan sites should work fine, since the direct paths for both the thumbnail and the file are in the JSON.
JSChan works, but its API is presumably unstable. If it changes, please alert me and I will make the necessary tweaks.

Can I import these boards to my own website?
Sure, but for the moment only importing to LynxChan from LynxChan or Vichan sites has any support. Importing is currently undergoing a heavy refactor. Once it is done, it will be possible to import from any board to a LynxChan website. Imports to other imageboard engines are planned.

Can I view the board offline?
Easily? No, but I am looking into an option to do this. You will have a local copy of the threads and files, but the data is not modified for local viewing.

I will continue to iterate and refactor. The code is a bit of a mess at the moment, but I plan to simplify it and make it PEP8-compliant soon. It's very likely there are still some big kinks to work out. Your feedback is incredibly valuable!
Is there a way to know the total size of a board before downloading it?
>>2725 Not currently, but I could add that in. Note that it would be a lowball estimate, because thumbnails would not be included in the count.
>>2728 A rough estimate is fine.
>>2729 I pushed an update. The script will now halt after gathering thread data, display the file size, and ask if you want to proceed. I've also added an override option of -y or --yes to skip this prompt.
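For anyone curious what a size prompt like this might look like, here is a minimal sketch, not ChannelChanger's actual code. It assumes the gathered thread data is a list of dicts with a "posts" list, each post carrying a "files" list whose entries report a byte count under "size"; those field names, and the helper names `total_file_size`, `human_size`, `confirm`, and `build_parser`, are hypothetical. Only the -y / --yes flag comes from the post above.

```python
import argparse


def total_file_size(threads):
    """Sum the reported size of every file in every post.

    Assumes each file entry has a "size" field in bytes (hypothetical name).
    Thumbnails are not counted, so the result is a lowball estimate.
    """
    total = 0
    for thread in threads:
        for post in thread.get("posts", []):
            for entry in post.get("files", []):
                total += entry.get("size", 0)
    return total


def human_size(num_bytes):
    """Format a byte count using binary units."""
    for unit in ("B", "KB", "MB", "GB"):
        if num_bytes < 1024:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024
    return f"{num_bytes:.2f} TB"


def confirm(total_bytes, assume_yes=False, ask=input):
    """Return True if the scrape should proceed."""
    if assume_yes:
        # Mirrors the -y / --yes override: never prompt.
        return True
    prompt = (
        f"About {human_size(total_bytes)} of files "
        "(thumbnails not counted). Proceed? [y/N] "
    )
    return ask(prompt).strip().lower() in ("y", "yes")


def build_parser():
    """Hypothetical command line with only the override flag."""
    parser = argparse.ArgumentParser(description="size prompt sketch")
    parser.add_argument(
        "-y", "--yes", action="store_true",
        help="skip the size confirmation prompt",
    )
    return parser
```

A caller would parse the arguments, compute the total once thread data is gathered, and stop if `confirm(total, assume_yes=args.yes)` returns False.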
>>2730
>OSError: Not enough disk space for all files.
It spat out an error confirming there isn't enough disk space, which is good, but it doesn't say how much space is needed.
>>2731 Good catch. I've added the space requirements and a notice about thumbnails not being included to the error.
>>2745 It appears to be erroring out because my home partition doesn't have enough space, even though the drive I'm pointing the download to does.
>>2752 Forgot to pass it the working directory. Should be good now.
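The bug described here (checking free space on the wrong partition) is easy to reproduce in general. Below is a hedged sketch of a space check that measures the volume holding the download directory rather than the current or home directory. The function name `check_disk_space` and the exact error wording are assumptions for illustration, not the tool's real implementation.

```python
import os
import shutil


def check_disk_space(target_dir, required_bytes):
    """Raise OSError if the volume holding target_dir lacks room.

    Returns the free byte count when there is enough space.
    """
    # The download folder may not exist yet, so walk up to the nearest
    # existing parent; it lives on the same volume the files will land on.
    probe = os.path.abspath(target_dir)
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:
            break  # reached the filesystem root
        probe = parent

    free = shutil.disk_usage(probe).free
    if free < required_bytes:
        raise OSError(
            f"Not enough disk space for all files: need {required_bytes} "
            f"bytes, {free} bytes free on the volume holding {probe} "
            "(thumbnails not included in the estimate)."
        )
    return free
```

Passing the target directory explicitly is the whole point: `shutil.disk_usage` reports on whichever filesystem contains the path it is given.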
If I download a board to the same folder more than once, will it add new content while keeping the remaining content intact? What if posts or threads have been deleted since last backing up the board?
>>2837 I'd also wonder if the src and thumbnails could be organized into folders based on their thread number, or potentially thread subject if available? I know the intent of the tool was for importing into other sites' databases and the like, but its main use for me is archiving.
>>2837 If you have a copy of a thread on disk, and you scrape it again, the json and html retrieved will entirely overwrite the last one. So a post deleted between scrapes will go missing. Expired/deleted threads will remain as they were, since they will not be overwritten. If you kept scraping to the same directory over a long period of time, you'd effectively build up a catalog of thousands of threads, and all their media. More than would fit on a live board. It effectively accumulates. You're really only at risk of losing deleted posts inside of threads. If you have a great need to retain those, too, my recommendation would be to put the json and html into source control.
>>2838 I'll see what I can do, but the current system is designed to reduce redundant downloads. The directory structure mimics a Vichan server, which stores media per-board. That said, I would like the tool to be useful as a general-purpose scraper. Ideally I'd like a way to do it without making archives unusable for import later. Let me know what else you'd need for archival purposes and I'll keep it in mind when I revisit scraper functionality.
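On the point about posts going missing when a thread's JSON is overwritten: source control is the recommendation given above. As a lighter-weight alternative (my own sketch, not something the tool does), you could compare the old and new copies of a thread before overwriting and set aside whatever vanished. This assumes LynxChan-style thread JSON where replies sit in a "posts" list and each carries a "postId"; both field names are assumptions and would differ on other engines.

```python
def post_ids(thread):
    """Collect the IDs of all replies in a thread dict."""
    return {post["postId"] for post in thread.get("posts", [])}


def missing_posts(old_thread, new_thread):
    """Return posts present in the old copy but absent from the new one.

    These are the posts that would be lost if the new JSON simply
    overwrote the old file.
    """
    still_there = post_ids(new_thread)
    return [
        post for post in old_thread.get("posts", [])
        if post["postId"] not in still_there
    ]
```

The returned list could be appended to a separate file per thread before the overwrite happens, which keeps the main JSON identical to what the site served and so still usable for import.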
>>2850
>If you have a copy of a thread on disk, and you scrape it again, the json and html retrieved will entirely overwrite the last one. So a post deleted between scrapes will go missing. Expired/deleted threads will remain as they were, since they will not be overwritten. If you kept scraping to the same directory over a long period of time, you'd effectively build up a catalog of thousands of threads, and all their media. More than would fit on a live board. It effectively accumulates. You're really only at risk of losing deleted posts inside of threads. If you have a great need to retain those, too, my recommendation would be to put the json and html into source control.
Interesting, that's workable information. I assume the files/media don't get deleted alongside the posts then?
>That said, I would like the tool to be useful as a general-purpose scraper. Ideally I'd like a way to do it without making archives unusable for import later.
>Let me know what else you'd need for archival purposes and I'll keep it in mind when I revisit scraper functionality.
It's basically completely functional for archiving, it's just a matter of convenience in how files are organized for offline/personal viewing. I'd agree that you shouldn't impede ease of importing in achieving this though, since being able to re-import the content as is down the line is just as important for archival.
>>2879
> I assume the files/media don't get deleted alongside the posts then?
Exactly. It always checks if a conflicting file exists, and if so, it will skip the download. It does this for both files and thumbnails. I believe it relies on the filename for this, which is an sha256 hash on LynxChan, JSChan, and some Vichan forks, so outside of some Vichan instances you're guaranteed files are unique. The upside is you can download a huge board and then, once finished, grab an updated copy and only retrieve missing files.
>it's just a matter of convenience in how files are organized for offline/personal viewing.
It's possible I don't actually need the HTML for anything at the moment (it's mostly there to mirror Vichan's folders) so once I finish up importing I'll look at an option to modify it for local viewing. The JSON is far more important for imports.
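The skip-if-present behaviour described above can be sketched in a few lines. This is an illustration under assumptions, not the tool's code: the helper names are made up, and the hash check assumes the filename stem is the lowercase hex sha256 digest of the file contents, which the post says holds on LynxChan, JSChan, and some Vichan forks but not everywhere.

```python
import hashlib
import os


def needs_download(dest_dir, filename):
    """True when no file with this name exists in dest_dir yet."""
    return not os.path.exists(os.path.join(dest_dir, filename))


def plan_downloads(dest_dir, filenames):
    """Keep only the names that are still missing on disk."""
    return [name for name in filenames if needs_download(dest_dir, name)]


def matches_hash_name(path):
    """Check that a file's sha256 digest equals its filename stem.

    Only meaningful on sites that name files by their sha256 hash.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large media files do not sit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    stem = os.path.splitext(os.path.basename(path))[0]
    return digest.hexdigest() == stem.lower()
```

Checking by name alone is what makes a second scrape cheap. The optional hash check is one way to spot a truncated file left behind by an interrupted download, since an existing but incomplete file would otherwise be skipped.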
this would be like 10 lines in bash if you used wget
>The 9th Circuit has defended the right to scrape publicly-accessible data (Archive). In the same ruling, they explicitly disallow measures which discriminate against scrapers, holding that they have the right to access any data a user with a web browser does. Who the fuck is the 9th circuit and why should anyone care? And also you're wrong faggot, you don't have the right to rape web servers with automated requests because that stupid article lies on a fallacy that bot activity = "a user with a web browser" which is clearly wrong unless you limit your tool to 20KB/s which I doubt is the case so fuck off.
>>3513 https://infogalactic.com/info/United_States_Court_of_Appeals_for_the_Ninth_Circuit Maybe you should've actually looked it up instead of fagging out about it. >And also you're wrong faggot, you don't have the right to rape web servers with automated requests. Explain what the actual issue with scraping public data is, aside from "muh requests", get a better VPS or connection then if you're hosting public data that's starting to be scraped and you otherwise have no objections to such scraping.
By request, I've added experimental support for scraping Yotsuba on a side branch. It's hacky and I probably won't add it into master, but if anyone wants to use the tool for scraping halfchan it should now be able to handle that.
How to import threads from 8chan.moe into jschan? The fatchan admin directed me here. Please help
>>6715 cry
>>6715 just regex the differences in the html files
>>6715 Supporting this is on my todo list. I need to refactor importing and then learn the JSChan database structure.
>>2601 It's torsocks nowadays
Could not find module "gridfs"
Error happens on macOS 10.13 and 10.15.
>>3319 It's complex to download 8chan.moe's files with wget because of the Referer header or something like that. Does someone know another way?
>>7446 I had to wine ChromeCacheExtracter by Nir Soft. The file I wanted, an mp3, was cut into several parts with range headers. Disgusting.
>>7446 Can't you just pass the header? You can on curl and it's a pretty basic operation so I don't know why not. Otherwise just use curl.
>>7446 Should be doable with curl. ChannelChanger should be able to do it. You can use the whitelist to just download a single thread.
>>7445 Probably forgot to update the requirements.txt. I'll look at it.
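For the header question above, here is a minimal sketch of passing a Referer with Python's standard library instead of curl. Whether the site actually checks Referer is unconfirmed in this thread, so treat that as an assumption; the URLs in the usage note and the function names `build_request` and `fetch` are placeholders.

```python
import urllib.request


def build_request(file_url, referer, user_agent="Mozilla/5.0"):
    """Build a request that carries Referer and User-Agent headers."""
    return urllib.request.Request(
        file_url,
        headers={"Referer": referer, "User-Agent": user_agent},
    )


def fetch(file_url, referer, dest_path):
    """Download file_url to dest_path, sending the given Referer."""
    request = build_request(file_url, referer)
    with urllib.request.urlopen(request, timeout=30) as response:
        with open(dest_path, "wb") as out:
            # Stream in chunks so large files are not held in memory.
            while True:
                chunk = response.read(65536)
                if not chunk:
                    break
                out.write(chunk)
```

Usage would be along the lines of `fetch("https://example.org/.media/abc.png", "https://example.org/t/res/1.html", "abc.png")`, with the thread page as the referer. The rough curl equivalent is the `-e` / `--referer` option.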
(32.56 KB 1047x523 ksnip_20220307-220941.png)

Getting this in Ubuntu 20.04. Already tried installing Python 3.8.12 from source, which fixed the errors pip displayed when attempting to install grapheme. Still not working though.
>>7736 Looks like the progress bar library I'm using has updated a lot over the past year, and they must have renamed their animations at some point. I've updated the requirements.txt to force it to use the specific version I last tested on. In the future, I'll look into updating it to use the newer names and features. For now, go ahead and pull the latest version and run the pip command again.
can this grab threads from 4chan and plebs/desu ?
>>10628 Not yet. Their API is totally different these days. But I intend to support it.
Hey Codexx, was hoping you could provide some updates on your future plans(?) for Channel Changer. I have a lynxchan board I'd really like to get over to jschan but my brain isn't as big as yours. Hopefully my penis is bigger, because knowing you're smarter AND have a bigger wiener might mean I'll have to kys myself. But that's neither here nor there. import to jschan when? and what's the best way to get in touch with you privately? pls respond and thanks.
>>11585 I've remade a simple CC clone using chatgpt scripts, it's obviously fucked beyond belief when it comes to reuploading content but I will keep you posted.
>>11612 Please do. I've been trying to figure things out on my own and made minor progress, but jschan seems to have a lot of moving parts when it comes to importing data. If any experts here want to chime in with advice on taking scraped data and shoving it deep down inside jschan, the advice wouldn't go unappreciated.
>>11585 Email me at codexx at cock.li and we can talk. If you have a jschan instance already running then I can help you probe and explain some of the holdups.
>>11612 You're probably going to need manual tweaks based on the DB structure. ChatGPT is going to struggle with logic problems and undocumented data structures.
>>11634 Sent ;)

