DappJak Labs

Archiving sites with Python and Wget

Published: 2023-07-27
Updated: 2023-07-27

Overview

This repo is used for scraping 4chan. It leverages Python and wget to store threads locally, retaining their HTML, CSS, and media files so that a page can be viewed offline exactly as it appears on the live site.

Execution

source venv/bin/activate.fish
python -m dl <FLAG> <BOARD|URL>

Examples

source venv/bin/activate.fish
python -m dl -b 'wsg'
python -m dl -t 'https://boards.4chan.org/wsg/thread/5686898'

Flags

parser = ArgumentParser()
parser.add_argument('-b', '--board', help='board', default=None, required=False)
parser.add_argument('-t', '--thread', help='thread', default=None, required=False)
parser.add_argument('-f', '--from_file', help='from_file', default=None, required=False)
parser.add_argument('-e', '--external_links', help='external_links', default=None, required=False)
  • -b 'board'
    • Scan the entire board catalog and scrape all active threads
  • -t 'thread'
    • Scrape a single thread by its thread URL
  • -f 'from_file'
    • Scrape the URLs listed in a file
  • -e 'external_links'
    • Download any external links found in the thread (e.g. YouTube)

How it works

  1. When using the board flag, the function below scans the catalog of the given board via the 4chan API and collects all the active thread URLs.
    • More information about 4chan's read-only JSON API can be found in the 4chan-API documentation on GitHub
import json
from urllib.request import Request, urlopen

def get_threads(board):
    thread_list = 'https://a.4cdn.org/'+board+'/threads.json'
    threads = []
    with urlopen(thread_list) as url:
        data = json.load(url)
        for page in data:
            for thread in page['threads']:
                thread_url = 'https://boards.4chan.org/'+board+'/thread/'+str(thread['no'])
                threads.append(thread_url)
    return threads
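threads.json returns a list of catalog pages, each with a 'threads' array whose entries carry the thread number in a 'no' field. The URL construction above can be illustrated offline (the sample JSON below is fabricated for shape only; real responses carry many more fields):

```python
import json

# Miniature sample in the shape of a threads.json response.
sample = json.loads('[{"page": 1, "threads": [{"no": 5686898}, {"no": 5700001}]}]')

# Same URL construction as get_threads, without the network call.
urls = ['https://boards.4chan.org/wsg/thread/' + str(t['no'])
        for page in sample for t in page['threads']]
```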
  2. Once the array of thread URLs is constructed, it is looped over and each URL is passed into the download_thread function. This function uses wget to mirror the page locally, downloading all files hosted on the specified CDNs.
import subprocess

def download_thread(thread_url, scrape_path):
    board = thread_url.split('/')[-3]   # e.g. biz
    thread = thread_url.split('/')[-1]  # e.g. 1234
    local_thread_path = scrape_path+'4chan/'+board+'/'+thread
    html_path = local_thread_path+'/boards.4chan.org/'+board+'/thread/'+thread+'.html'
    cdns = 'i.4cdn.org,is2.4chan.org,i.4pcdn.org,s.4cdn.org'
    args = ['wget',
      '-P', local_thread_path, # Directory where downloaded files are saved
      '-r', # Enable recursive downloading
      '-np', # Never ascend to the parent directory
      '-k', # Convert links in the documents for local viewing
      '--adjust-extension', # Append '.html' to the local filename where needed
      '-e', 'robots=off', # Ignore the norobots convention
      '--wait', '2', '--random-wait', '--continue', # Randomly wait between requests; resume partial downloads
      '--domains', cdns, # Restrict downloads to these domains
      '-H', # Enable spanning across hosts during recursive retrieval
      '--user-agent', "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36", # User agent sent to the HTTP server
      thread_url # Thread URL to download
    ]
    job = subprocess.Popen(args, stdout=subprocess.PIPE)
    out, err = job.communicate()
    return local_thread_path, html_path
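The path bookkeeping inside download_thread can be checked in isolation. This pure helper (illustrative only, not part of the repo) shows how the URL segments map onto the local mirror layout:

```python
def thread_paths(thread_url, scrape_path):
    # Same derivation as download_thread: board and thread id come
    # from fixed positions in the URL, and the saved HTML lives under
    # the mirrored boards.4chan.org directory tree.
    board = thread_url.split('/')[-3]
    thread = thread_url.split('/')[-1]
    local = scrape_path + '4chan/' + board + '/' + thread
    html = local + '/boards.4chan.org/' + board + '/thread/' + thread + '.html'
    return local, html
```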
  3. Optional: save the files separately under their real names.
  • This function takes the thread's local raw HTML and builds a lookup table mapping each CDN file name (a string of digits) to the actual file name.
import os
import pathlib
from shutil import copyfile

def get_real_images(soup, dirpath):
    link_dict = dict()
    files_path = dirpath+'/files/'
    for link in soup.find_all('div', {'class': 'fileText'}):
        # Path of the locally mirrored CDN file
        relative_file_url = link.find('a')['href']
        file_url = '/'.join(relative_file_url.split('/')[3:])
        file_path = dirpath+'/'+file_url
        # Real file name: when 4chan truncates the link text,
        # the full name lives in the anchor's title attribute
        anchor = link.find('a')
        file_name = anchor.get('title') or anchor.getText()
        link_dict[file_path] = files_path+file_name
    pathlib.Path(files_path).mkdir(parents=True, exist_ok=True)
    for key, value in link_dict.items():
        if not os.path.isfile(value):
            try:
                copyfile(key, value)
            except OSError:
                print('failed to copy\n'+key+' to\n'+value)
    return files_path
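After wget's -k link conversion, the href on a fileText anchor is a relative path such as ../../../i.4cdn.org/wsg/&lt;digits&gt;.webm, which is why the code above drops the first three path segments. That mapping step can be sketched as a pure function (the name and signature are illustrative, not the repo's API):

```python
def map_cdn_to_real(cdn_href, title, dirpath):
    # Drop the leading '../../../' (three path segments) to recover the
    # mirrored CDN path relative to the thread directory, then pair it
    # with the real-name destination under files/.
    file_url = '/'.join(cdn_href.split('/')[3:])
    cdn_path = dirpath + '/' + file_url
    return cdn_path, dirpath + '/files/' + title
```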
  4. After these steps complete for each scraped thread, the output thread folder looks like this:
tree scrape_output/4chan/biz/5601498
├── boards.4chan.org
│   └── biz
│       └── thread
│           └── 5601498.html
├── files
│   ├── file1_real_name.gif
│   ├── file2_real_name.webm
│   ├── truncated
├── i.4cdn.org
│   └── wsg
│       ├── file1_thumb.jpg
│       └── file2_thumb.jpg
├── is2.4chan.org
│   └── wsg
│       ├── file1_4chan_name.gif
│       └── file2_4chan_name.webm
├── links.txt
├── s.4cdn.org
│   ├── css
│   │   └── truncated
│   ├── image
│   │   └── truncated
│   └── js
│       └── truncated
└── timestamp.txt
  • Opening boards.4chan.org/biz/thread/5601498.html in a browser displays the locally saved thread, with all files served from disk.