Not-So-Obvious Popular JavaScript Libraries for Web Scraping in 2020

Category: IT Technology · Published: 3 years ago


We’d like to continue our series of posts on the Top 5 Popular Libraries for Web Scraping in 2020 with a new programming language: JavaScript.

JavaScript is a well-known language with wide adoption and strong community support. It can be used for both client-side and server-side scripting, which makes it well suited to writing scrapers and crawlers.

Most of the benefits these libraries offer are also available through our API, and some of the libraries can be used together with it.

So let’s check them out.

The 5 Top JavaScript Web Scraping Libraries in 2020

1. Axios

Axios is a promise-based HTTP client for the browser and Node.js.

But why this library in particular? There are plenty of alternatives to the well-known request library: got, superagent, node-fetch. Axios, however, works not only in Node.js but in the browser too.

Its usage is simple, as shown below:

const axios = require('axios');

// Make a request for a user with a given ID
axios.get('/user?ID=12345')
  .then(function (response) {
    // handle success
    console.log(response);
  })
  .catch(function (error) {
    // handle error
    console.log(error);
  })
  .then(function () {
    // always executed
  });

Promises are cool, aren’t they?

To get the library, use whichever of the following you prefer:
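For scraping from Node.js, the same kind of request is often written with async/await and a few request options. The sketch below is only an illustration: fetchHtml, the 10-second timeout and the User-Agent string are assumptions of ours, not part of Axios, and the client is passed in as a parameter so that any Axios instance can be used.

```javascript
// Sketch only: `fetchHtml` is a hypothetical helper, not an Axios API.
// `client` is expected to be axios itself or an instance from axios.create().
async function fetchHtml(client, url) {
  const response = await client.get(url, {
    timeout: 10000,                              // give up after 10 seconds (arbitrary value)
    responseType: 'text',                        // ask for the body as text rather than parsed JSON
    headers: { 'User-Agent': 'my-scraper/1.0' }, // made-up identifier; honored in Node.js
  });
  // By default Axios rejects on non-2xx statuses, so reaching here means success.
  return response.data;
}

// Usage, assuming `npm install axios`:
// fetchHtml(require('axios'), 'https://example.com')
//   .then((html) => console.log(html.length))
//   .catch((error) => console.error(error.message));
```

Passing the client in also makes it easy to swap in a preconfigured instance (for example, one with a proxy or base URL).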

Using npm:

npm install axios

Using bower:

bower install axios

Using yarn:

yarn add axios

GitHub repository: https://github.com/axios/axios

2. Cheerio

Cheerio implements a subset of core jQuery. Put simply, you can carry your jQuery knowledge and selectors over to web scraping. Like Axios, it runs in Node.js, although Cheerio is designed primarily for the server, where no browser DOM is available.

For a usage sample, see another of our articles: Amazon Scraping. Relatively easy.

3. Selenium

Selenium is one of the most popular browser automation tools, with WebDriver bindings for most programming languages. Quality assurance engineers, automation specialists, developers, data scientists: most of them have used this tool at least once. For web scraping it is like a Swiss Army knife: no additional libraries are needed, because any action can be performed in the browser as a real user would, including opening pages, clicking buttons, filling in forms, solving captchas and much more.

Selenium may be installed via npm with:

npm install selenium-webdriver

And the usage is simple too:

const {Builder, By, Key, until} = require('selenium-webdriver');

(async function example() {
  let driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get('http://www.google.com/ncr');
    await driver.findElement(By.name('q')).sendKeys('webdriver', Key.RETURN);
    await driver.wait(until.titleIs('webdriver - Google Search'), 1000);
  } finally {
    await driver.quit();
  }
})();

Official docs URL: https://www.selenium.dev/documentation/

GitHub repository: https://github.com/SeleniumHQ/selenium

4. Puppeteer

There is a lot to say about Puppeteer: it is a reliable, production-ready library with great community support. Puppeteer is a Node.js library that offers a simple and efficient API for controlling Google’s Chrome or Chromium browser. As with Selenium, this lets you execute a site’s JavaScript and scrape single-page applications built with Vue.js, React.js, Angular, etc.

We have a great example of using Puppeteer to scrape an Angular-based site; you can check it out here: AngularJS site scraping. Easy deal?

We also suggest checking out a great curated list of awesome Puppeteer resources: https://github.com/transitive-bullshit/awesome-puppeteer

Useful official resources:
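For a self-contained look at the API, the sketch below opens a page, waits for it to settle and collects the text of matching elements. scrapeTexts and its parameters are hypothetical names of ours; only launch, newPage, goto, $$eval and close are Puppeteer calls. The Puppeteer module is passed in as an argument, which is our design choice so that the helper can be exercised with a stub instead of a real browser.

```javascript
// Sketch only: `scrapeTexts` and its parameters are hypothetical, not part of Puppeteer.
async function scrapeTexts(puppeteer, url, selector) {
  const browser = await puppeteer.launch(); // headless by default
  try {
    const page = await browser.newPage();
    // Wait until network activity settles so client-side rendering can finish.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // The callback runs in the page context and returns the text of every match.
    return await page.$$eval(selector, (els) => els.map((el) => el.textContent.trim()));
  } finally {
    // Always close the browser, even if navigation or evaluation throws.
    await browser.close();
  }
}

// Usage, assuming `npm install puppeteer`:
// scrapeTexts(require('puppeteer'), 'https://example.com', 'h1').then(console.log);
```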

Official docs URL: https://developers.google.com/web/tools/puppeteer

GitHub repository: https://github.com/GoogleChrome/puppeteer

5. Playwright

Playwright is not as well known as Puppeteer, but it could be called Puppeteer 2, since it is maintained by former Puppeteer contributors. Unlike Puppeteer, it supports Chrome, Chromium, WebKit and Firefox backends.

To install it, just run the following command:

npm install playwright

To see that the API is pretty much the same, take a look at the official example:

const playwright = require('playwright');

(async () => {
  for (const browserType of ['chromium', 'firefox', 'webkit']) {
    const browser = await playwright[browserType].launch();
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto('http://whatsmyuseragent.org/');
    await page.screenshot({ path: `example-${browserType}.png` });
    await browser.close();
  }
})();

Official docs URL: https://github.com/microsoft/playwright/blob/master/docs/README.md

GitHub repository: https://github.com/microsoft/playwright

Conclusion

It’s always up to you to decide what to use for your particular web scraping case, but it’s also clear that the amount of data on the Internet is growing rapidly and that data mining is becoming a crucial instrument for business growth.

But remember: instead of choosing a fancy tool that may not be of much use, focus on finding the tool that best suits your requirements.

