
AI Content Archive
Overview
An archive of AI-generated content, including still images, video clips, music, voice-overs, and concept-art trailers made with AI tools such as Suno, Udio, Hailuo, and Luma. All media is hosted on-chain on a scalable, decentralized backend.
Project Steps
- Scrape material sources.
- Primarily files from 4chan.org/wsg threads.
- Script:
scraping/chan/dl.py
- See the blog post ‘web-archiving-with-python-and-wget’ for the scraper implementation details.
- Scraped threads are saved in the layout below, so that the offline HTML file can be opened in a browser with every file served locally and the page rendering identically to the live site.
tree scrape_output/4chan/wsg/5601498
├── boards.4chan.org
│   └── wsg
│       └── thread
│           └── 5601498.html
├── files
│   ├── file1_real_name.gif
│   ├── file2_real_name.webm
│   ├── truncated
├── i.4cdn.org
│   └── wsg
│       ├── file1_thumb.jpg
│       └── file2_thumb.jpg
├── is2.4chan.org
│   └── wsg
│       ├── file1_4chan_name.gif
│       └── file2_4chan_name.webm
├── links.txt
├── s.4cdn.org
│   ├── css
│   │   └── truncated
│   ├── image
│   │   └── truncated
│   └── js
│       └── truncated
└── timestamp.txt
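As a rough illustration of how a layout like the one above can come out of wget, here is a minimal sketch that only builds the command line. It is not the contents of scraping/chan/dl.py; the helper names `thread_output_dir` and `build_wget_command` are hypothetical, and the choice of flags is an assumption. With `--page-requisites` and `--span-hosts`, wget writes one directory per host, which is consistent with the per-host folders in the tree. The `files` directory, `links.txt`, and `timestamp.txt` are not produced by this sketch.

```python
def thread_output_dir(root: str, board: str, thread_id: int) -> str:
    """Hypothetical helper: per-thread output directory, e.g. scrape_output/4chan/wsg/5601498."""
    return f"{root}/4chan/{board}/{thread_id}"


def build_wget_command(thread_url: str, out_dir: str) -> list[str]:
    """Build (but do not run) a wget invocation that mirrors one thread page."""
    return [
        "wget",
        "--page-requisites",             # also fetch the CSS, JS and images the page needs
        "--convert-links",               # rewrite links so the saved page works offline
        "--adjust-extension",            # save the thread page with an .html suffix
        "--span-hosts",                  # allow assets hosted on other hosts
        "--domains=4chan.org,4cdn.org",  # assumption: limit host spanning to these domains
        f"--directory-prefix={out_dir}",
        thread_url,
    ]


out_dir = thread_output_dir("scrape_output", "wsg", 5601498)
cmd = build_wget_command("https://boards.4chan.org/wsg/thread/5601498", out_dir)
```

The resulting list can be passed to `subprocess.run(cmd)`. Full-size media linked from the page are not page requisites, so they would need a separate download pass.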
- Use a regular expression to find the exact content or threads you are looking for by scanning the thread subjects.
- Script:
scraping/chan/scan_scraped_threads.py
- The regex search string I use to find AI-related threads is:
(^| )(ai|hailou|luma|aicg)($| )
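To show how this pattern behaves, here is a minimal sketch that applies it to a list of thread subjects. It is not the contents of scraping/chan/scan_scraped_threads.py; the function name `matching_subjects` and the sample subjects are made up, and case-insensitive matching is an assumption. The `(^| )` and `($| )` groups require each keyword to be bounded by a space or by the start or end of the subject, so "ai" inside "rain" does not match.

```python
import re

# The search string from the text; re.IGNORECASE is an assumption.
AI_THREAD_PATTERN = re.compile(r"(^| )(ai|hailou|luma|aicg)($| )", re.IGNORECASE)


def matching_subjects(subjects: list[str]) -> list[str]:
    """Return the subjects that contain one of the keywords as a whole word."""
    return [s for s in subjects if AI_THREAD_PATTERN.search(s)]


# Made-up sample subjects for illustration only.
samples = [
    "AI video thread",
    "rain sounds",
    "Luma dream machine gens",
    "best of hailou",
    "maintenance music",
]
hits = matching_subjects(samples)
```

Because only spaces count as boundaries, a keyword wrapped in punctuation, such as "/aicg/", would not match this pattern as written.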
- Point the import script at the thread directory you wish to import.
- Script:
scraping/chan/convert_thread_to_hashed.py
- This script copies each file to a new directory and renames it to its SHA-256 checksum. For each scraped file, it generates a JSON sidecar file containing data about the file, such as its tags, source thread, MIME type, and size. If the scraped file is a video, the script also uses ffmpeg to generate a thumbnail whose name is suffixed with _thumb.jpg.
- An imported file will look like this:
# json sidecar file
e35ad205ed53f7a2716db94b1cfac7c81032e891216ba8e178a235ce925acb43.json
# thumbnail
e35ad205ed53f7a2716db94b1cfac7c81032e891216ba8e178a235ce925acb43_thumb.jpg
# scraped file
e35ad205ed53f7a2716db94b1cfac7c81032e891216ba8e178a235ce925acb43.webm
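The import step described above can be sketched roughly as follows. This is not the contents of scraping/chan/convert_thread_to_hashed.py: the function names, the sidecar field names (`sha256`, `original_name`, `source_thread`, `tags`, `mime_type`, `size`), and the ffmpeg arguments are all assumptions chosen for illustration.

```python
import hashlib
import json
import mimetypes
import shutil
import subprocess
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB blocks so large videos are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def import_file(src: Path, dest_dir: Path, source_thread: str, tags: list[str]) -> Path:
    """Copy src into dest_dir under its SHA-256 name and write a JSON sidecar next to it."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    checksum = sha256_of(src)
    dest = dest_dir / f"{checksum}{src.suffix.lower()}"
    shutil.copy2(src, dest)

    mime_type, _ = mimetypes.guess_type(src.name)
    sidecar = {  # hypothetical field names
        "sha256": checksum,
        "original_name": src.name,
        "source_thread": source_thread,
        "tags": tags,
        "mime_type": mime_type,
        "size": dest.stat().st_size,
    }
    (dest_dir / f"{checksum}.json").write_text(json.dumps(sidecar, indent=2))

    # Videos only: grab the first frame as a thumbnail, if ffmpeg is installed.
    if mime_type and mime_type.startswith("video/") and shutil.which("ffmpeg"):
        thumb = dest_dir / f"{checksum}_thumb.jpg"
        subprocess.run(
            ["ffmpeg", "-y", "-loglevel", "error", "-i", str(dest),
             "-frames:v", "1", str(thumb)],
            check=True,
        )
    return dest
```

Naming files by checksum means the same file scraped from two threads lands on the same name, so duplicates collapse naturally. Merging the tags of such duplicates is not handled in this sketch.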
- The file and thumbnail are then chunked and loaded into the ‘media’ partition canister map of the CanDB backend, with the JSON sidecar populating the matching file’s metadata entry.
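The chunking step can be illustrated with a small in-memory sketch. It does not talk to a canister, and everything about the record layout is an assumption: the 1 MiB chunk size, the `pk`/`sk` field names, and the `<sha256>#chunk#<index>` sort-key format are hypothetical and may differ from the real backend schema.

```python
CHUNK_SIZE = 1024 * 1024  # assumption: 1 MiB per chunk


def chunk_bytes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Split data into consecutive chunks of at most chunk_size bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def chunk_records(sha256_hex: str, data: bytes, chunk_size: int = CHUNK_SIZE) -> list[dict]:
    """Wrap each chunk in a record keyed by the file's checksum and the chunk index."""
    chunks = chunk_bytes(data, chunk_size)
    return [
        {
            "pk": "media",                            # hypothetical partition key
            "sk": f"{sha256_hex}#chunk#{index:06d}",  # hypothetical sort key
            "total_chunks": len(chunks),
            "data": chunk,
        }
        for index, chunk in enumerate(chunks)
    ]


def reassemble(records: list[dict]) -> bytes:
    """Rebuild the original bytes by concatenating chunks in sort-key order."""
    return b"".join(r["data"] for r in sorted(records, key=lambda r: r["sk"]))
```

Zero-padding the chunk index keeps the sort keys in lexicographic order, so a range scan over keys sharing the file's checksum prefix would return the chunks in the order needed to rebuild the file.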