
Why is multithreading Selenium lousy on macOS?

This blog post might be the start of a series, depending on how much bandwidth I have to investigate this further...

I've been working on a new data problem that requires using Selenium to extract information quickly. To speed things up further (because I'm impatient as hell), I decided to use ThreadPoolExecutor from the concurrent.futures module in my Python script to spin up a bunch of Chrome instances, like this:

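import traceback
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def setup_driver():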
    chrome_options = Options()
    chrome_options.add_argument("--headless=new")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--window-size=1920,1080")
    return webdriver.Chrome(options=chrome_options)
    
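def search_range(range_tuple):
    start_num, end_num, thread_num = range_tuple
    driver = setup_driver()
    try:
        ...  # scraping logic for entries start_num through end_num (not shown in this post)
    finally:
        driver.quit()  # always shut the browser down, even if the scrape raises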
    
    
def main():
    start_entry = 0
    end_entry = 5000
    max_threads = 10
    chunk_size = (end_entry - start_entry) // max_threads

    ranges = []
    for i in range(max_threads):
        range_start = start_entry + (i * chunk_size)
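        # Every chunk but the last ends just before the next one starts; the last runs through end_entry.
        range_end = range_start + chunk_size - 1 if i < max_threads - 1 else end_entry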
        ranges.append((range_start, range_end, i))

    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        futures = [executor.submit(search_range, range_) for range_ in ranges]
        for future in futures:
            try:
                future.result()
            except Exception as e:
                print(f"Thread crashed with error: {e}")
                traceback.print_exc()


if __name__ == "__main__":
    main()
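
The range-splitting arithmetic in main() is easy to get subtly wrong, and it can be checked on its own without launching a single browser. Here is a minimal sketch, assuming a hypothetical helper named split_ranges (not part of the script above) and assuming there are at least as many entries as workers:

```python
def split_ranges(start_entry, end_entry, workers):
    """Split the inclusive span [start_entry, end_entry] into `workers` contiguous chunks.

    Hypothetical helper for illustration; assumes the span holds at least
    `workers` entries so that every chunk is non-empty.
    """
    chunk_size = (end_entry - start_entry) // workers
    ranges = []
    for i in range(workers):
        range_start = start_entry + i * chunk_size
        # The last chunk absorbs any remainder so that end_entry is always covered.
        is_last = i == workers - 1
        range_end = end_entry if is_last else range_start + chunk_size - 1
        ranges.append((range_start, range_end, i))
    return ranges


if __name__ == "__main__":
    for chunk in split_ranges(0, 5000, 10):
        print(chunk)
```

With the values used in main(), the first chunk is (0, 499, 0) and the last is (4500, 5000, 9), so every entry from 0 through 5000 lands in exactly one chunk.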

The chrome_options specified are mainly to optimize performance since I am running it headless.

I have two machines with similar specs (similar-ish, though now I'm doubting this), both bought around the same time in 2022: an M1 Mac running macOS and a ThinkPad running Linux.

The M1's performance with the above code was terrible (I think it's the first time I've really heard my fans spin up). What htop showed was bewildering, especially once I looked at the ThinkPad running the exact same script.
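
To put numbers on the difference instead of eyeballing htop, one option would be to time the same pool at several worker counts on each machine. Below is a rough sketch, not something I've run against Selenium: time_pool and fake_search are hypothetical names, and fake_search just sleeps instead of driving Chrome. Swapping in the real search_range is what would actually exercise the browsers.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def time_pool(task, items, max_workers):
    """Run `task` over `items` in a thread pool; return (elapsed_seconds, results)."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order and re-raises any worker exception here.
        results = list(executor.map(task, items))
    return time.perf_counter() - started, results


def fake_search(range_tuple):
    """Hypothetical stand-in for search_range: sleeps instead of launching Chrome."""
    start_num, end_num, thread_num = range_tuple
    time.sleep(0.05)
    return thread_num


if __name__ == "__main__":
    work = [(i * 500, i * 500 + 499, i) for i in range(10)]
    for workers in (1, 2, 5, 10):
        elapsed, _ = time_pool(fake_search, work, workers)
        print(f"{workers:>2} workers: {elapsed:.2f}s")
```

If wall-clock time stops improving (or gets worse) well before 10 workers on one machine but not the other, that would at least narrow down where the two diverge.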

macOS

At startup

[Screenshot: macOS at startup]

Running script

[Screenshot: macOS running script]

Linux

At startup

[Screenshot: Linux at startup]

Running script

[Screenshot: Linux running script]

Interesting Observations

Next Steps?

I don't have time to dig into this right now, but if I manage to revisit it, I think the first step would be to try replicating the results in containers. It looks like there's actually a macOS VM via Docker-OSX, so that might be a good place to start. A bit of googling also revealed this issue, but seeing as it was resolved over 2 years ago, I doubt this is still the problem.

For now, I'd say proceed with caution if you're going to try multithreading with Selenium on an M1 Mac (or use the opportunity to warm your lap in the dead of winter).