Node makes scraping images off the web extremely easy using a couple of handy packages: Xray and Download. Simply scrape the img tags, grab all the src attributes, filter out the images you don't want, then hand them over to Download to fetch them.
Let's say I put my application on an application server. How will the downloads work then? Won't the images be downloaded onto the server? If so, how would I be able to download them to the client's PC?
Hi. You can't create (or download) files on a client machine because of browser security restrictions: https://en.wikipedia.org/wiki/JavaScript#Security
FYI, if you're scraping large files like MP3s rather than small images, you might not want to start the downloads in a simple forEach. I don't know exactly what happens if you attempt to download 250 large files at once, but it probably isn't good! :) Another reason to avoid this is so you don't accidentally DoS the site if it's a small mom-and-pop server rather than Google.
A function like async's parallelLimit will allow you to say "download in parallel, but only 5 at a time", which may work better for you and the site operator.
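To make the "only 5 at a time" idea concrete, here is a minimal sketch that uses plain promises instead of the async library itself. The names runWithLimit and fakeDownload are hypothetical and exist only for this illustration; fakeDownload just waits briefly instead of fetching anything, so you would swap in your real download call.

```javascript
// Run async tasks with at most `limit` of them in flight at once.
// `tasks` is an array of functions that each return a promise.
function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker keeps pulling the next unstarted task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  // Start `limit` workers (or fewer if there are not that many tasks).
  const workerCount = Math.min(limit, tasks.length);
  const workers = [];
  for (let i = 0; i < workerCount; i++) {
    workers.push(worker());
  }
  return Promise.all(workers).then(() => results);
}

// Hypothetical stand-in for a real download call; it only waits briefly.
function fakeDownload(url) {
  return new Promise((resolve) => {
    setTimeout(() => resolve(url + ' done'), 10);
  });
}

const urls = ['a.mp3', 'b.mp3', 'c.mp3', 'd.mp3', 'e.mp3', 'f.mp3', 'g.mp3'];
const tasks = urls.map((url) => () => fakeDownload(url));

runWithLimit(tasks, 5).then((results) => {
  console.log(results.length + ' downloads finished');
});
```

Results come back in the same order as the input tasks, and if any task rejects, the returned promise rejects as well. The real parallelLimit takes callback-style tasks, so treat this as the same idea rather than a drop-in replacement.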
Hmm, the download npm module doesn't seem to want to work in that for loop. I removed the loop and used the url-download module instead, passing in the whole array to download:
var download = require("url-download");
download(results, './images').on('close', function (err, url) {
  console.log(url + ' has been downloaded.');
});
Hope this helps someone reading this.
This lesson doesn't appear to work at all anymore. I copied the code and ran it on Node v5.3.0, and it fails without printing any errors.
{
  "name": "xray-tuts",
  "version": "1.0.0",
  "description": "",
  "main": "app.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "download": "^5.0.2",
    "x-ray": "^2.3.1"
  }
}
Well, apparently the Download package has been updated. I checked their docs and fixed the code accordingly:
// ... ^^ imports and xray config
(function(err, result) {
  var images = result.filter(function(img) {
    return img.width > 100;
  })
  .map(function(img) {
    // Here is the new download code.
    // Download takes the asset URL and the download destination.
    // I used map() here, but forEach would work just as well,
    // since the return value isn't used.
    Download(img.src, './images');
  });
  // Write the original result to a JSON file
  fs.writeFile('./results.json', JSON.stringify(result, null, '\t'), function(err) {
    if (err) console.error(err);
  });
});
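One caveat with the snippet above: it ignores whatever each Download call returns, so a failed download goes unnoticed. Assuming the download function returns a promise (which is my reading of the updated package's docs, so please check the version you have installed), a rough sketch of waiting for every download and reporting failures could look like this. downloadAll and stubDownload are hypothetical names made up for the example; stubDownload fakes a failure for one URL so the reporting is visible without touching the network.

```javascript
// Wait for every download and report failures instead of dropping them.
// `downloadFn` is whatever function performs one download and returns a promise.
function downloadAll(images, dest, downloadFn) {
  var jobs = images.map(function (img) {
    // Start from a resolved promise so a synchronous throw also becomes a rejection.
    return Promise.resolve()
      .then(function () { return downloadFn(img.src, dest); })
      .then(
        function () { return { src: img.src, ok: true }; },
        function (err) { return { src: img.src, ok: false, error: String(err && err.message || err) }; }
      );
  });
  return Promise.all(jobs);
}

// Hypothetical stand-in for the real download call; fails for "broken" URLs.
function stubDownload(url, dest) {
  if (url.indexOf('broken') !== -1) {
    return Promise.reject(new Error('404 for ' + url));
  }
  return Promise.resolve(dest + '/' + url.split('/').pop());
}

var sampleImages = [
  { src: 'http://example.com/a.png' },
  { src: 'http://example.com/broken.png' }
];

downloadAll(sampleImages, './images', stubDownload).then(function (report) {
  report.forEach(function (r) {
    console.log(r.src, r.ok ? 'ok' : 'failed: ' + r.error);
  });
});
```

In the lesson code you would pass the filtered image list and the real download function in place of sampleImages and stubDownload.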
Thanks Vinny!