If you’ve spent any time coding in InfoSec, you’ve probably used a ton of curl to pull websites, check them for various issues or attributes, etc.
For example, this follows redirects (-L) and sends a non-curl User-Agent (-A):
curl -LA 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5' reddit.com
This used to work quite well, but now—not so much.
For one, curl doesn’t parse and render JavaScript, and that’s what the internet is made out of. But perhaps even worse, many companies are employing technologies to outright detect and block curl because it’s often used for scraping.
Either way, if you use curl to pull a lot of sites en masse, you’re likely to have a massive failure rate in getting the HTML you’re looking for.
What we’ve needed for quite some time is something like curl, i.e., command-line and relatively simple, but that renders sites fully.
I’ve been using Chromium (the open-source browser project that Chrome is built on) to solve this problem for years, and I wanted to pass along the syntax for others.
I am usually doing things from Ubuntu, but you can get this to work on most UNIXy systems.
cat domains.txt | xargs -I {} -P 4 sh -c "timeout 25s chromium-browser --headless --no-sandbox --user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' --dump-dom https://{} 2> /dev/null > {}.html"
That’s a lot to unravel, so:
piping domains.txt into xargs with -P 4 tells xargs to run up to 4 commands in parallel, one per domain from your list, which makes things go quite fast.
timeout 25s kills any Chromium process that runs longer than 25 seconds, so one slow or unresponsive site can’t stall the whole run.
headless means don’t display a GUI. no-sandbox turns off Chromium’s sandbox, which Chromium generally insists on before it will run as root; that is a security risk, so be careful with it.
dump-dom means print the DOM as it stands after the page has rendered, JavaScript included, to standard output.
the {} bits are placeholders that xargs replaces with the current domain from the list.
the 2> /dev/null is there because Chromium can be noisy on stderr.
the > {}.html writes the output to a file named after the domain coming from domains.txt.
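If the one-liner is hard to maintain, the same pieces can be unrolled into a small shell function. This is only a sketch under a few assumptions: render_all, BROWSER, and UA are names I made up for illustration, and the browser binary is assumed to be called chromium-browser as it is above. It also passes each domain to sh as a positional argument ($1) instead of pasting it into the command string, so an odd line in domains.txt is less likely to be interpreted as shell syntax.

```shell
# Hypothetical names: BROWSER, UA, and render_all are illustrative only.
# BROWSER can be overridden, e.g. BROWSER=chromium on systems that name it so.
BROWSER="${BROWSER:-chromium-browser}"
UA='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
export BROWSER UA

# Render every domain listed in the file given as $1, up to 4 at a time.
# Each rendered DOM is written to <domain>.html in the current directory.
render_all() {
  xargs -I {} -P 4 sh -c '
    # $1 is the domain handed over by xargs; 25s is the per-site time limit.
    timeout 25s "$BROWSER" --headless --no-sandbox --user-agent="$UA" \
      --dump-dom "https://$1" 2> /dev/null > "$1.html"
  ' _ {} < "$1"
}
```

Usage would then be: render_all domains.txt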
What you basically end up with, assuming you have a decent machine to run this on, is hundreds of nicely rendered HTML files being created very quickly. Chromium is the browser underneath Chrome, so you’re getting the full rendering of the JavaScript and all the goodness that comes with that.
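One thing worth checking after a run: the shell creates each .html file before Chromium starts, so a domain that times out or fails will typically leave an empty file behind. Here is a rough sketch for finding those; failed_renders is a hypothetical helper name, and it assumes the output files sit directly in the directory you point it at.

```shell
# Hypothetical helper: list the domains whose output file is empty.
# $1 is the directory holding the .html files (defaults to the current one).
failed_renders() {
  find "${1:-.}" -maxdepth 1 -name '*.html' -size 0 \
    -exec basename {} .html \; | sort
}
```

Something like failed_renders . > retry.txt would give you a list you can feed back through the same xargs command.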
Anyway, I hope this helps someone who’s smashing their face on the desk because of curl.