This blog post might be the start of a series, depending on how much bandwidth I have to investigate this further...
I've been working on a new data problem that has necessitated using Selenium to extract information expediently. To further speed up the process because I'm impatient as hell, I decided to utilize the ThreadPoolExecutor from the concurrent.futures module in my Python script to spin up a bunch of Chrome instances like this:
import traceback
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def setup_driver():
    # Headless Chrome with flags aimed at keeping each instance lightweight
    chrome_options = Options()
    chrome_options.add_argument("--headless=new")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--window-size=1920,1080")
    return webdriver.Chrome(options=chrome_options)


def search_range(range_tuple):
    start_num, end_num, thread_num = range_tuple
    # One Chrome instance per worker thread
    driver = setup_driver()
    # (the rest of the scraping logic is omitted here)


def main():
    start_entry = 0
    end_entry = 5000
    max_threads = 10
    chunk_size = (end_entry - start_entry) // max_threads
    ranges = []
    for i in range(max_threads):
        range_start = start_entry + (i * chunk_size)
        # The last chunk runs through end_entry to absorb any remainder
        range_end = range_start + chunk_size - 1 if i < max_threads - 1 else end_entry
        ranges.append((range_start, range_end, i))
    with ThreadPoolExecutor(max_workers=max_threads) as executor:
        futures = [executor.submit(search_range, range_) for range_ in ranges]
        for future in futures:
            try:
                future.result()
            except Exception as e:
                print(f"Thread crashed with error: {str(e)}")
                traceback.print_exc()
The chrome_options specified are mainly to optimize performance since I am running it headless.
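The chunking arithmetic in main() is easy to get subtly wrong, so here is a minimal standalone sketch of it that can be checked without launching any browsers. It assumes the intent is for the bounds to be inclusive and for the last chunk to run through end_entry; split_ranges is a hypothetical helper name, not something from the original script.

```python
def split_ranges(start_entry, end_entry, max_threads):
    """Split [start_entry, end_entry] into max_threads contiguous chunks.

    Returns (range_start, range_end, thread_num) tuples with inclusive bounds.
    The last chunk absorbs any remainder, so it always ends at end_entry.
    """
    chunk_size = (end_entry - start_entry) // max_threads
    ranges = []
    for i in range(max_threads):
        range_start = start_entry + i * chunk_size
        if i < max_threads - 1:
            range_end = range_start + chunk_size - 1
        else:
            range_end = end_entry
        ranges.append((range_start, range_end, i))
    return ranges


ranges = split_ranges(0, 5000, 10)
print(ranges[0])   # (0, 499, 0)
print(ranges[-1])  # (4500, 5000, 9)
```

Pulling the arithmetic into a pure function like this makes it possible to confirm that no entries are skipped or duplicated before ten Chrome instances are involved.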
I have two machines with similar(ish, though now I'm doubting this) specs, both bought around the same time in 2022: an M1 Mac running macOS and a Thinkpad running Linux.
The M1 performance with the above code was terrible (I think it's the first time I've really heard my fans spin up). Inspecting the performance in htop was practically bewildering, especially when I then looked at the Thinkpad running the exact same script.
Two things stood out. On the Thinkpad, htop barely registered any running processes (though the script was obviously running, and I could see many Chrome processes listed in htop). On macOS this consistently showed up at 10 while I was running the script.
Load average was also substantially higher on macOS vs Linux.
I don't have time to dig into this right now, but if I manage to revisit it, I think the first step would be to try replicating the results in containers. It looks like there's actually a macOS VM via Docker-OSX, so that might be a good place to start. A bit of googling also revealed this issue, but seeing as it was resolved over 2 years ago, I doubt this is still the problem.
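For anyone wanting to make the load-average comparison a bit more systematic than eyeballing htop, here is a minimal stdlib-only sketch that samples the 1-minute load average and normalizes it by logical core count. The function name sample_load and its parameters are made up for illustration. Load average is also not computed identically on Linux and macOS, so treat the per-core number as a rough comparison at best.

```python
import os
import time


def sample_load(samples=3, interval=1.0):
    """Collect (1-min load average, load per logical core) readings."""
    cores = os.cpu_count() or 1
    readings = []
    for _ in range(samples):
        # os.getloadavg() is available on Unix-like systems (Linux, macOS)
        one_min, _five_min, _fifteen_min = os.getloadavg()
        readings.append((one_min, one_min / cores))
        time.sleep(interval)
    return readings


# Run this in a second terminal while the Selenium script is working
for load, per_core in sample_load(samples=2, interval=0.5):
    print(f"load1={load:.2f} per_core={per_core:.2f}")
```

Logging these readings on both machines during the same run would at least show whether the gap holds up once core count is taken into account.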
For now I'd say, proceed with caution if you're going to try multithreading with Selenium on a Mac M1 (or use the opportunity to warm your lap in the dead of winter).