[GIVEAWAY] Content crawler Node.js

rafongol

Member
Dec 7, 2020
Hello Babiato community, this is my first contribution here, so I hope I'm in the right forum 🙏😂 Anyway, today I'll share a simple script that I created to scrape content from websites, especially ones without .htaccess restrictions. I'll also try to explain exactly how it works so it will be easy for everyone.

First of all, I want to mention that this method is the same as the wget method that we usually use to crawl content from a website, but this script makes it easier.

What do I need to know before I use the script?
Absolutely nothing!

What do I need to start scraping web pages?
You only need to download Node.js 👈👈

After downloading Node.js, download the script from the attached files.
Then save the script in a directory. We will call this directory nodeAPP, for example, so we will have this tree:

|--nodeAPP (this is the directory that we created)
|----index.js (this is the script that we already downloaded)

After that, open CMD (the default command-line interpreter on Windows, or the terminal if you are using Linux).
Then go to your nodeAPP directory in CMD, and there type: npm install website-scraper

Once the module is installed, open the script with any text editor and change line 4:
urls: ['https://www.hereyouputyoururl.com'], 👈 here you put the website that you want to crawl.

Then change line 5:
directory: './directory_name', 👈 here you put the name of the directory to be created. It will hold all the content that the script downloads.
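
For anyone who wants to see how those two settings fit into a whole script, here is a rough sketch of what an index.js like this could look like. It is not the attached file: the helper names buildOptions and crawl are made up for illustration, and it assumes a recent website-scraper release (version 5 and later are published as ES modules, so the sketch loads the package with a dynamic import() instead of require).

```javascript
// Hypothetical helper: builds the options object described in the post.
// `urls` and `directory` are the two settings you edit in the script.
function buildOptions(url, directory) {
  return {
    urls: [url],          // the website you want to crawl
    directory: directory, // folder that will hold the downloaded content
  };
}

// Hypothetical wrapper: loads website-scraper only when it is called,
// so the package must already be installed (npm install website-scraper).
// Assumption: the package exposes its scrape function as the default export.
async function crawl(url, directory) {
  const { default: scrape } = await import('website-scraper');
  return scrape(buildOptions(url, directory));
}

// Example call (left commented out so nothing is downloaded by accident):
// crawl('https://www.hereyouputyoururl.com', './directory_name')
//   .then(() => console.log('Done'))
//   .catch((err) => console.error('Crawl failed:', err));
```

Uncomment the last lines and swap in your own URL and directory name to try it. If the attached script uses require('website-scraper') and fails on a newer package version, this import() style may be worth a try.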

After that, open CMD again, navigate to the nodeAPP directory, and type: node index.js
Wait a little bit, then... VOILA! You will have your crawled content in the same directory, under the name you put on line 5.
So for example, if we put directory: './xwebsite' on line 5, then the tree will look exactly like this:

|--nodeAPP (this is the directory that we created)
|----node_modules (this directory is created automatically when you install the npm module)
|----xwebsite (this directory was created by the script and holds all the crawled content)
|----index.js (this is the script that we already downloaded from Babiato)
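
One thing to watch if you run the script a second time: as far as I know, website-scraper expects the target directory not to exist yet and stops with an error if it does. Below is a small sketch of one way around that, picking an unused name before crawling. pickFreeDirectory is a made-up helper name, not part of the package or the attached script.

```javascript
const fs = require('fs');

// Hypothetical helper: returns `base` if nothing exists at that path yet,
// otherwise tries base-1, base-2, ... until it finds an unused name.
function pickFreeDirectory(base) {
  if (!fs.existsSync(base)) return base;
  let n = 1;
  while (fs.existsSync(`${base}-${n}`)) n += 1;
  return `${base}-${n}`;
}

// Example: resolve a free name, then use it as the `directory` value.
const target = pickFreeDirectory('./xwebsite');
console.log('Content will be saved to:', target);
```

The simpler alternative is to delete or rename the old xwebsite directory by hand before running node index.js again.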

That's it. I hope it was clear. It's not very complicated, it's super easy, so give it a try. Cheers, mates!

Here you can find the website-scraper package on npm if you want to read about it.
 

Attachments

  • index.zip (322 bytes)
Thanks for sharing. But it looks like a kind of website downloader/cloner, not a content crawler, is that right?
If by "not a content crawler" you mean that it can't crawl the website's data, that's true. But if you mean that it cannot get all of the website's files and scripts, that's wrong.
 