
Write a monstrously fast web scraper

I have a lot of experience with web scraping. Over my development career I've written more web scrapers than I can count. My biggest project was LGU crawler, which scraped my university timetable, almost 1200 pages, in under 30 seconds. After reading this post you'll be able to scrape some simple websites too.

Before we begin...

If you don't know what web scraping is: it's a way to extract data from a website. Before scraping a website, it's important to understand which rendering strategy it uses. There are two main ways content gets rendered on a website.

  1. Server Side Rendering: In this strategy, when a browser requests a page, the server sends back a pre-built HTML page. This means we can easily extract data by selecting DOM nodes from that HTML.

    Ways to check whether a page is rendered on the server:

    • Go to the website and open dev tools using `ctrl + shift + i`, then navigate to the Network tab and reload the page. After the reload, check the response of the first request and see whether the server is returning the data you are looking for.

    • You can also use the `curl` command from a terminal (Git Bash, for example) to see the HTML the server sends back:

      curl https://example.com/

  2. Client Side Rendering: In this strategy, the server sends just basic HTML and JavaScript to the browser. Once the JavaScript has loaded, it requests the data from the server. If a website uses client-side rendering, it is harder to scrape, because we need to execute all the JavaScript on the client side before scraping the data. There is another approach too: scraping the data directly from the JSON endpoints the page calls. This approach offers several advantages, such as greater flexibility and control. However, it can become cumbersome if the server has numerous or overly complex endpoints.
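The manual checks above can also be scripted. Here is a minimal sketch, assuming Node.js 18 or later (for the built-in `fetch`). The helper names `htmlContainsText` and `isServerRendered` and the two sample HTML strings are made up for illustration. The idea is simply: download the raw HTML without running any JavaScript and see whether the text you care about is already in it.

```javascript
// Returns true if `needle` appears in the raw HTML the server sent back.
// Whitespace is collapsed and case is ignored to make the match less brittle.
function htmlContainsText(html, needle) {
  const normalize = (s) => s.replace(/\s+/g, ' ').toLowerCase();
  return normalize(html).includes(normalize(needle));
}

// Fetches `url` without executing any JavaScript and checks for `needle`.
// Relies on the global fetch available in Node.js 18 and later.
async function isServerRendered(url, needle) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return htmlContainsText(await response.text(), needle);
}

// Offline demonstration with two made-up responses (hypothetical markup).
const ssrHtml = '<html><body><h1>Timetable</h1><td>Math 101</td></body></html>';
const csrHtml = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>';

console.log(htmlContainsText(ssrHtml, 'Math 101')); // true
console.log(htmlContainsText(csrHtml, 'Math 101')); // false

// Against a live site (performs a real network request):
// isServerRendered('https://example.com/', 'Example Domain').then(console.log);
```

If the check comes back false for data you can clearly see in the browser, the page is most likely filling it in on the client side.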

ℹ FAQ:

You might be wondering: can we scrape social media sites? The answer is both yes and no. When we send a request to a server, it can also track our IP address, and sending too many requests from a single IP can get that IP banned. Most of the time we can rotate our IP using a VPN connection, but modern sites have other ways to detect bots, like captchas that ask you to solve puzzles to prove you are human. There are still some advanced techniques for scraping that data, but unfortunately I'm not going to cover them :)

Scraping server-side rendered pages

Let's start with a very simple website: we're going to scrape https://example.com/. I know it's basic, but it will help you understand how to select elements from the page, extract them, and convert them into a serializable form like JSON. For the code examples I'm using JavaScript and the cheerio library.

To create a new project, run the following commands one by one (make sure you have Node.js installed):

mkdir web-scrapper && cd web-scrapper
npm init -y
npm i cheerio
mkdir src && cd src && echo > index.js

continue...