[GIVEAWAY] Content crawler Node.js

rafongol · Jan 6, 2021

Hello Babiato community, this is my first contribution here, so I hope I'm in the right forum.

Today I'll share a simple script that I created to scrape content from websites, especially ones without .htaccess restrictions. I'll also try to explain exactly how it works so it's easy for everyone.

First of all, I want to mention that this method does the same thing as the wget method we usually use to crawl content from a website, but the script makes it easier.

What do I need to know before I use the script?
Absolutely nothing!

What do I need to start scraping web pages?
You only need to download and install Node.js.

After installing Node.js, download the script from the attached files.
Then save the script in a directory. We will call this directory nodeAPP, for example, so we will have this tree:

|--nodeAPP(This is our directory that we created)
|----index.js(This is our script that we already downloaded)

After that, open CMD (the default command-line interpreter on Windows; use the terminal if you are on Linux).
Then go to your nodeAPP directory in CMD and type there: npm install website-scraper

Once the module is installed, open the script with any text editor and change line 4:
urls: ['https://www.hereyouputyoururl.com'],

Here you put the website that you want to crawl.

Then change line 5:
directory: './directory_name',
Here you put the name of the directory that the script will create. This directory will hold all the content that you download with the script.
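
For anyone who can't open the attachment, here is a rough sketch of what a script built around these two settings could look like. It is not the attached index.js, only an illustration under a few assumptions: the helper name buildOptions and the idea of passing the URL and directory on the command line (instead of editing lines 4 and 5) are mine, and I am assuming that website-scraper exposes its scrape function as the default export and refuses to write into a directory that already exists.

```javascript
// Sketch only: NOT the attached script. buildOptions and the command-line
// arguments are hypothetical additions for illustration.
const fs = require('fs');
const path = require('path');

// Build the options object with the two settings the post edits by hand:
// `urls` (line 4) and `directory` (line 5).
function buildOptions(url, directoryName) {
  // new URL() throws on malformed input, which catches typos early.
  const parsed = new URL(url);
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error('Only http and https URLs are supported: ' + url);
  }
  // Assumption: the scraper creates the directory itself and rejects an existing one.
  const directory = path.resolve(directoryName);
  if (fs.existsSync(directory)) {
    throw new Error('Directory already exists: ' + directory);
  }
  return { urls: [parsed.href], directory };
}

async function main(url, directoryName) {
  const options = buildOptions(url, directoryName);
  // Assumption: the module's default export is the scrape function.
  const mod = await import('website-scraper');
  const scrape = mod.default;
  await scrape(options);
  console.log('Saved to ' + options.directory);
}

// Only run a scrape when both arguments are given on the command line.
if (process.argv.length >= 4) {
  main(process.argv[2], process.argv[3]).catch((err) => {
    console.error(err.message);
    process.exitCode = 1;
  });
}
```

With this sketch you would run, for example: node index.js https://www.hereyouputyoururl.com ./xwebsite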

After that, open CMD again, navigate to the nodeAPP directory, and type: node index.js
Wait a little bit, then... VOILA! You will have your crawled content in a directory with the name you set on line 5.
For example, if we put directory: './xwebsite' on line 5, the tree will look exactly like this:

|--nodeAPP(This is our directory that we created)
|----node_modules(This directory will be created automatically after you install the npm module)
|----xwebsite(This directory was created by the script and it contains all the crawled content)
|----index.js(This is our script that we already downloaded from babiato)
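
If you want to check what actually landed in the output directory, a small helper like the one below can print every downloaded file. This is only a sketch using Node's built-in fs and path modules; the name listFiles is hypothetical, and './xwebsite' is just the example directory name used above.

```javascript
const fs = require('fs');
const path = require('path');

// Recursively collect file paths under `root`, relative to `root`.
function listFiles(root, prefix = '') {
  const results = [];
  const entries = fs.readdirSync(path.join(root, prefix), { withFileTypes: true });
  for (const entry of entries) {
    const relative = path.join(prefix, entry.name);
    if (entry.isDirectory()) {
      // Descend into subdirectories such as css, js or images.
      results.push(...listFiles(root, relative));
    } else {
      results.push(relative);
    }
  }
  return results.sort();
}

// './xwebsite' is the example directory from the post; nothing is printed if it does not exist.
if (fs.existsSync('./xwebsite')) {
  for (const file of listFiles('./xwebsite')) {
    console.log(file);
  }
}
```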

That's it. I hope it was clear. It's not complicated at all, it's super easy, so give it a try. Cheers, mates!

Here you can find the website-scraper module on npm if you want to read about it.

kakalotfreedom · Jan 11, 2021

Thanks for sharing. But it looks like a kind of website downloader/cloner, not a content crawler, is that right?

rafongol · Jan 11, 2021

kakalotfreedom said:
Thanks for sharing. But it looks like a kind of website downloader/cloner, not a content crawler, is that right?

If by content crawler you mean that it can't crawl the website's data, that's true, but if you mean that you cannot get all the website's files and scripts, that's wrong.

hiyakazeman · Feb 6, 2021

thanks a lot for this!
