If you’ve spent any time coding in InfoSec, you’ve probably used a ton of `curl` to pull websites, check them for various issues or attributes, etc.
This will follow redirects and provide a non-curl User Agent.
```
curl -LA 'Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5' reddit.com
```
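If you want to check whether a given site is treating `curl` differently based on the User-Agent, you can compare status codes directly with curl’s `-w` output format (reddit.com here is just the example target from above):

```
# Compare how a site responds to curl's default UA vs. a browser UA.
curl -sL -o /dev/null -w 'default UA: %{http_code}\n' https://reddit.com
curl -sL -o /dev/null -w 'browser UA: %{http_code}\n' \
  -A 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)' https://reddit.com
```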
This used to work quite well, but now—not so much.
For one, `curl` doesn’t parse and render JavaScript, and that’s what the internet is made out of. But perhaps even worse, many companies are employing technologies to outright detect and block `curl` because it’s often used for scraping.

Either way, if you use `curl` to pull a lot of sites en masse, you’re likely to have a massive failure rate in getting the HTML you’re looking for.
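If you want to see the scale of the problem for yourself, here’s a rough sketch that tallies response codes across a list with plain `curl` (it assumes a domains.txt with one bare domain per line, the same format used below):

```
# Tally HTTP status codes across a domain list; 000 means the request
# failed outright (timeout, connection reset, TLS error, etc.).
while read -r domain; do
  curl -sL -o /dev/null -w '%{http_code}\n' --max-time 10 "https://$domain"
done < domains.txt | sort | uniq -c
```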
What we’ve needed for quite some time is something like `curl`, i.e., command-line and relatively simple, but that renders sites fully.
I’ve been using `chromium` (the open-source project that Chrome is built from) to solve this problem for years, and I wanted to pass along the syntax for others.
I am usually doing things from Ubuntu, but you can get this to work on most UNIXy systems.
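Before the batch version, here’s the single-site form so you can see the moving parts (the binary may be `chromium`, `chromium-browser`, or `google-chrome` depending on your system; example.com is just a stand-in):

```
# Render one page headlessly and write the resulting DOM to a file.
chromium-browser --headless --dump-dom https://example.com > example.com.html
```

And here’s the full batch version: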
```
cat domains.txt | xargs -I {} -P 4 sh -c "timeout 25s chromium-browser --headless --no-sandbox --user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' --dump-dom https://{} 2> /dev/null > {}.html"
```
That’s a lot to unravel, so:

- piping domains.txt into `xargs` with `-P 4` means spin up 4 parallel processes to work through the incoming domains from your list, which makes things go quite fast.
- the `timeout` of 25 seconds keeps a single slow or unresponsive site from hanging one of those processes forever while you’re waiting for it to respond.
- `--headless` means don’t display a GUI, and `--no-sandbox` disables Chromium’s sandbox, which you need if you’re running as root, but it’s a security risk, so be careful with that.
- `--dump-dom` means print everything that comes back from the render (the full DOM) to stdout.
- the `{}` bits are placeholders that `xargs` fills in with the current domain on each cycle.
- the `2> /dev/null` is because Chromium can be noisy on stderr.
- the `> {}.html` writes each result to a file named after the domain coming from domains.txt.
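If the one-liner is hard to parse, here’s the same pipeline spelled out as a script, a sketch that’s meant to behave identically to the command above (same assumed domains.txt, one bare domain per line):

```
#!/bin/sh
# Batch-render every domain in domains.txt with headless Chromium, 4 at a time.
UA='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'

cat domains.txt | xargs -I {} -P 4 sh -c "
  timeout 25s chromium-browser --headless --no-sandbox \
    --user-agent='$UA' \
    --dump-dom 'https://{}' > '{}.html' 2> /dev/null
"
```

Pulling the User-Agent into a variable also makes it easy to swap out if sites start flagging that particular string.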
What you basically end up with—assuming you have a decent machine to run this on—is hundreds of nicely rendered HTML files being created very quickly. Chromium is Chrome, so you’re getting the full rendering of the JavaScript and all the goodness that comes with that.
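And because the output is just HTML on disk, ordinary text tools work on the results; for example, grabbing the rendered page title from every file you just fetched:

```
# Print the <title> from each rendered file (output is filename:match).
grep -o '<title>[^<]*</title>' *.html
```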
Anyway, I hope this helps someone who’s smashing their face on the desk because of `curl`.