Crawlee / Puppeteer / Playwright - web scraping and browser automation

Costas · Aug 23, 2022

Web Scraping and Generating PDFs Using C# and .NET (mirror) -- use of AngleSharp

Costas · Oct 25, 2022

scraper - Google Puppeteer (also downloads a compatible version of Chromium)
https://developer.chrome.com/docs/puppeteer/

berstend/puppeteer-extra & puppeteer-extra-plugin-stealth & puppeteer-extra-plugin-recaptcha (via 2captcha)
https://github.com/berstend/puppeteer-extra/issues/399
https://github.com/berstend/awesome-puppeteer
hellosurbhi/puppeteer-boilerplate
C# implementation - hardkoded/puppeteer-sharp --- tutorial (it is a .NET Standard library, but can be compiled as .NET Standard for .NET Framework 4.6.x)

testing suite - microsoft Playwright (headless, patched versions of Chromium and Firefox and WebKit)

the difference between a 'scraper' and a 'test suite': the latter gives us a UI to select the elements

+ in the case of Playwright, it produces the script in any of its supported languages..

testing suites - https://www.cypress.io/ - https://www.npmjs.com/package/cypress
alternative

- https://github.com/electron/electron/issues/27577
https://www.npmjs.com/package/electron
Web scraping with Electron
On Migrating from Cypress to Playwright
specflow - enhance your automated tests [2]

Web Automation: Don't Use Selenium, Use Playwright - Playwright records your steps and gives you a running Python script

intro to playwright

installation :
create a new dir c:\x, go into it and execute in Windows CMD :

Bash:

#set this variable as is.. - https://www3.ntu.edu.sg/home/ehchua/programming/howto/Environment_Variables.html
set PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD=1

#will create the package.json
npm init -y

#because of the PLAYWRIGHT_SKIP_BROWSER_DOWNLOAD we set at the beginning, this installs bare-bones playwright (without browsers)
npm i -D playwright

#install chromium+ffmpeg only (if they do not already exist in C:\Users\%username%\AppData\Local\ms-playwright\)
#this way we exclude firefox (78mb) + webkit (73mb)
npx playwright install chromium

#you can even skip downloading chromium and use an existing OS browser https://playwright.dev/docs/browsers#google-chrome--microsoft-edge

#init a project - this will ask you to 'Install Playwright browsers', choose NO!
npm init playwright

#now you can use the famous 'codegen': an empty browser will open, browse to the target page and do what is needed
#everything you do in this Chromium instance is recorded to a script, which afterwards can be in any of the following languages
#in the end go to the 'inspector window' and save the script (ex hi.js) as /library/
npx playwright codegen

#executing it as follows will do the job
node hi.js

in any variant chosen, your application has to reference the playwright library ( .net / python / java / npm ).

Sounds interesting I KNOW! read more - https://playwright.dev/docs/cli

Chromium is ahead of the branded browsers: when the world is on Google Chrome N, Playwright / Puppeteer support Chromium N+1, which will be released as Google Chrome a few weeks later.

JavaScript:

const { chromium } = require('playwright');

(async () => {
  //show browser + slow mode 100ms
  const browser = await chromium.launch({ headless: false, slowMo: 100 });
  const page = await browser.newPage();

  await page.goto('https://test.com');
  //insert value into the input field with delay 50ms per char
  const email = await page.$('#email');
  await email.type('test@test.com', { delay: 50 });

  //click an element
  await page.click('#table > tr > td > a');

  //select value from /select/
  const dropdown = await page.$('#dropdown');
  //by value
  await dropdown.selectOption({ value: '1' });
  //by text
  await dropdown.selectOption({ label: '1' });
  //by index (a number, zero-based)
  await dropdown.selectOption({ index: 1 });

  //loop through /select/ items
  const dropdownItems = await dropdown.$$('option');
  for (let i = 0; i < dropdownItems.length; i++)
    console.log(await dropdownItems[i].innerText());

  await browser.close();
})();

//more at - https://testautomationu.applitools.com/js-playwright-tutorial/

--

#You can also create tests
#https://playwright.dev/docs/intro
#https://playwright.dev/docs/running-tests

#'npm init playwright' will create the playwright.config.js > edit it and leave only chromium.
#it will create the project and lastly will output these :
#Runs the end-to-end tests
npx playwright test

#Runs the tests only on Desktop Chrome.
npx playwright test --project=chromium

#will start the test.specs.js with the 'inspector' and run the script line by line
npx playwright test --debug
#or, without --debug, you can pause from code with
#await page.pause();

#will start the browser, so you can see what it is doing
npx playwright test --headed
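
For the 'leave only chromium' step above, a trimmed playwright.config.js could look roughly like the sketch below. It assumes the plain-object config form and a ./tests directory (the testDir value and the variable name config are assumptions; the generated file differs between Playwright versions).

```javascript
// playwright.config.js - a sketch, not the generated file verbatim
// assumption: the tests live in ./tests
const config = {
  testDir: './tests',
  // a single project, so 'npx playwright test' runs on Chromium only
  projects: [
    { name: 'chromium', use: { browserName: 'chromium' } },
  ],
};

module.exports = config;
```

The project name is what --project=chromium refers to.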

Pywright - Playwright API Server - API service written in Python that uses Playwright to render JavaScript websites.

Fitter CLI - powerful command line scraper with browser support

Playwright OAuth2 Access Token Acquisition

Costas · Oct 26, 2022

intro to puppeteer

installation :
create a new dir c:\test, go into it and execute :

npm i puppeteer

this will also download a standalone chromium.
by running :

console.log(puppeteer.executablePath());
return;

we can discover where it is stored, in my case C:\Users\%username%\.cache\puppeteer\chrome\

in case of a firewall, only chrome.exe needs to be added to the whitelist.

create a new file c:\test\test.js - reference

JavaScript:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://pipiscrew.com');
  await page.screenshot({ path: 'example.png', fullPage: true });

  await browser.close();
})();

execute

node test.js

this will produce example.png next to test.js

This is headless. If you want to see what is happening, replace the browser variable with

JavaScript:

const browser = await puppeteer.launch({headless: false, devtools: true});

tip: with devtools: true and a debugger; statement added to the code, the browser will stop there. source (ndb)

selector

before running anything, always use :

JavaScript:

const resultsSelector = '#example';
await page.waitForSelector(resultsSelector);

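Spread (...) operator = expands an iterable (e.g. the NodeList returned by querySelectorAll) into its individual items, so [...nodeList] gives a real array with map / find etc.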
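
A small sketch of what the spread operator does in the snippets of this post. The snippet is meant to run in plain Node.js, so a Set stands in for the NodeList (that substitution is an assumption for illustration only; both are iterable and neither has map()).

```javascript
// a Set is iterable but, like a NodeList, has no map() method
const nodeListStandIn = new Set(['first', 'second', 'third']);

// spread expands the iterable into the individual items of a new array
const asArray = [...nodeListStandIn];

// now array methods such as map() are available
const upper = asArray.map(text => text.toUpperCase());
console.log(upper);
```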

page.$eval() - this method runs document.querySelector within the page and passes the result as the first argument to the pageFunction.

JavaScript:

const inner_html = await page.$eval('#example', element => element.innerHTML);
console.log(inner_html);

page.$$eval - runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to the pageFunction. The following lists the href of every selected element

JavaScript:

  const list = await page.$$eval('li.example>a', links => links.map(a => a.href));
  console.log(list);

page.evaluate() - evaluates a function in the page's context and returns the result. The return value must be serializable. Everything inside the function runs in the browser context, so any console.log output appears in the browser console.

JavaScript:

//simple
const inner_html = await page.evaluate(() => document.querySelector('#example').innerHTML);

//advanced
const data = await page.evaluate(() => {
    const li = Array.from(document.querySelectorAll('li.title>a'));
 
    return li.map(td => {
        return td.innerHTML;
    });
});

page.$ / page.$$ - runs document.querySelector or document.querySelectorAll within the page.

JavaScript:

//1
  const element = await page.$('div#example');
  const element_property = await element.getProperty('innerHTML');
  const inner_html = await element_property.jsonValue();

//2
  const item = await page.$(resultsSelector);
  const data = await (await item.getProperty('textContent')).jsonValue();
  console.log(data);

//3 - ref https://qiita.com/go_sagawa/items/85f97deab7ccfdce53ea
  const item = await page.$(resultsSelector);
  const data = {
        href: await (await item.getProperty('href')).jsonValue(),
        textContent: await (await item.getProperty('textContent')).jsonValue(),
        innerHTML: await (await item.getProperty('innerHTML')).jsonValue()
  };
  console.log(data);

//4
  const list = await page.$$(resultsSelector);
  const datas = [];
  for (let i = 0; i < list.length; i++) {
    datas.push(await (await list[i].getProperty('textContent')).jsonValue())
  }
  console.log(datas);
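
The getProperty / jsonValue pair repeats in every variant above, so it can be wrapped once. A minimal sketch, where readProps is a hypothetical helper name (not part of Puppeteer); it relies only on ElementHandle.getProperty() and JSHandle.jsonValue() as used above.

```javascript
// hypothetical helper: read several DOM properties from one ElementHandle
async function readProps(handle, names) {
  const result = {};
  for (const name of names) {
    // getProperty returns a JSHandle, jsonValue() brings the value to Node.js
    const prop = await handle.getProperty(name);
    result[name] = await prop.jsonValue();
  }
  return result;
}

// usage inside a puppeteer script (resultsSelector as above):
//   const item = await page.$(resultsSelector);
//   console.log(await readProps(item, ['href', 'textContent', 'innerHTML']));
```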

click

page.$$eval - runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to the pageFunction. Everything inside the function runs in the browser context, so any console.log output appears in the browser console.

JavaScript:

//1
await page.$$eval('._qv64e', elements => elements[3].click());

//2
//loop through items, find the one with specific text and click it
await page.$$eval(resultsSelector, elements => {
  // const element = elements.find(element => element.firstElementChild);
  // const element = elements.find(element => element.innerHTML === '<h1>Hello, world!</h1>');
  // console.log(element);
  // element.click();
  elements[0].click();
});

page.evaluate() - evaluates a function in the page's context and returns the result.

JavaScript:

//1
await page.evaluate(() => { document.querySelector('div.example>button').click(); });

//2
await page.evaluate(() => { document.querySelectorAll('._qv64e')[80].click(); });

//3
  const dd = await page.evaluate(() => {
    const elements = [...document.querySelectorAll('div#example>button')];
    // const element = elements.find(element => element.firstChild);
    const element = elements.find(element => element.textContent == 'Press for more');
    element.click();
  });

page.$ / page.$$ - runs document.querySelector or document.querySelectorAll within the page.

JavaScript:

//1
  const x = await page.$('div.example>button');
  if (!x)
    console.log("not found");
  else {
    //page.$ returns a single ElementHandle (or null), so there is no length to read
    await x.click();
  }

//2
  const selectAll = await page.$$('div.example>button');
  console.log(selectAll.length);
  await selectAll[0].click();
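
The null check from //1 can be folded into a small reusable function. A minimal sketch, where clickIfPresent is a hypothetical helper name; it only assumes that page.$ resolves to null when nothing matches.

```javascript
// hypothetical helper: click the first match, report whether anything was clicked
async function clickIfPresent(page, selector) {
  const handle = await page.$(selector); // null when nothing matches
  if (!handle) return false;
  await handle.click();
  return true;
}

// usage inside a puppeteer script:
//   if (!(await clickIfPresent(page, 'div.example>button'))) console.log('not found');
```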

page.evaluateHandle - the only difference between page.evaluate and page.evaluateHandle is that evaluateHandle returns the value wrapped in an in-page object (a JSHandle).

JavaScript:

  const element = await page.evaluateHandle(() => {
    const elements = [...document.querySelectorAll('div#example>button')];
    //const element = elements.find(element => element.firstChild);
    const element = elements.find(element => element.textContent == 'Press for more');
    return element;
  });
 
  await element.click();

//2
//ref - https://github.com/osfunapps/os-puppeteer-helper-npm/blob/master/index.js
 const last = await page.$('.item:last-child');
 const prev = await page.evaluateHandle(el => el.previousElementSibling, last);
 console.log(await (await prev.getProperty('innerHTML')).jsonValue());

misc

page.waitForSelector - wait for the selector to appear in the page and handle the case where it does not.

JavaScript:

//https://docs.apify.com/tutorials/scraping-dynamic-content#timeout-and-errors
  let exitSwitch = 0;
  await page.waitForSelector('my-selector', { timeout: 10000 })
    .catch(() => {console.log('Wait for my-selector timed out');  exitSwitch=1;}
  );

  if (exitSwitch==1)
  {
    await browser.close();
    return;
  }
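
The same idea without the exitSwitch flag, as a minimal sketch. waitOrNull is a hypothetical helper name; note that it treats any rejection of waitForSelector as 'not found', not only timeouts.

```javascript
// hypothetical helper: resolve to the element handle, or to null when the wait fails
async function waitOrNull(page, selector, timeout = 10000) {
  try {
    return await page.waitForSelector(selector, { timeout });
  } catch (err) {
    console.log(`Wait for ${selector} failed: ${err.message}`);
    return null;
  }
}

// usage inside a puppeteer script:
//   const found = await waitOrNull(page, 'my-selector');
//   if (!found) { await browser.close(); return; }
```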

plain JS - parse all TD content

JavaScript:

  let all = [...document.querySelectorAll('td')].map(elem => elem.innerText);
  console.log(all);

Trusted events: events generated by users interacting with the page, e.g. using a mouse or keyboard.
Untrusted events: events generated by Web APIs, e.g. document.createEvent or element.click() methods. read more
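
The flag behind this distinction is event.isTrusted. A minimal sketch, runnable in plain Node.js (assuming a recent version, where EventTarget and Event are globals): an event created and dispatched from script reports isTrusted as false, which is how a page can tell it apart from a real user action.

```javascript
// a script-created event, dispatched by hand
const target = new EventTarget();
let seenTrusted = null;

target.addEventListener('click', event => {
  seenTrusted = event.isTrusted; // false for events dispatched from script
});

target.dispatchEvent(new Event('click'));
console.log('isTrusted:', seenTrusted);
```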

references :
https://pptr.dev/api/ (fed by github.com/puppeteer/puppeteer/blob/main/docs/index.md)
https://github.com/puppeteer/puppeteer/blob/v15.2.0/docs/api.md#pageselector-1
https://pptr.dev/api/puppeteer.browser
https://github.com/puppeteer/puppeteer/tree/main/examples
https://developer.chrome.com/docs/puppeteer/get-started/
https://developer.chrome.com/docs/puppeteer/examples/
https://github.com/checkly/puppeteer-examples
https://www.digitalocean.com/community/tutorials/how-to-scrape-a-website-using-node-js-and-puppeteer
https://stackoverflow.com/a/60713343
https://www.npmjs.com/package/puppeteer

go_sagawa - How to get element in puppeteer (2017) (mirror)
https://www.npmjs.com/package/mysql2#using-prepared-statements (how to add password)

Puppeteer in Node.js: Common Mistakes to Avoid

Costas · Oct 26, 2022

use a custom function inside the puppeteer page.$$eval function

JavaScript:

const puppeteer = require('puppeteer');

//puppeteer START
(async () => {
    const browser = await puppeteer.launch({ headless: false, devtools: true });
    const page = await browser.newPage();

    // https://stackoverflow.com/a/69127872
    await page.evaluateOnNewDocument(() => {
        window.parseFields = function parseFields(el) {
            //your function logic to be used inside page.$$eval
        }
    });

    // UPDATE - expose the function "logInNodeJs" for use from inside page context  - https://stackoverflow.com/a/73964712
    await page.exposeFunction('logInNodeJs', (value) => console.log(value));

    //browser to page
    await page.goto('https://yoursite.here');

    const rows = await page.$$eval('.table tr', list => {

        //runs in the page, the output is printed by Node.js
        logInNodeJs("check");

        //collects one object per table row
        const datas = [];

        for (let i = 0; i < list.length; i++) {

            const TDs = list[i].querySelectorAll('td');
            const col1 = parseFields(TDs[5].querySelectorAll('.inline-block'));

            const data = {
                col1: col1,
                cells: [...TDs].map(elem => elem.innerText),
            };

            datas.push(data);
        }

        return datas;

    });
})();

passing a variable into the puppeteer page.$$eval function

JavaScript:

//https://stackoverflow.com/a/65283313
//before
const rows = await page.$$eval('.table tr', list => {
//your code here
});

//after - theURL is a variable of the Node.js script, received in the page as currentURL
const rows = await page.$$eval('.table tr', (list, currentURL) => {
    console.log(currentURL);
}, theURL);
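
Why the extra argument is needed: the callback is executed inside the page, not in Node.js, so variables of the Node.js script do not travel with it. A rough simulation in plain Node.js follows. runDetached is a hypothetical stand-in that rebuilds the function from its source text and passes the arguments through JSON; it is an assumption made for illustration, not Puppeteer's real implementation.

```javascript
// hypothetical stand-in for "run this function somewhere else":
// rebuilding the function from its source text loses its closure,
// and the JSON round trip keeps only serializable arguments
function runDetached(fn, args = []) {
  const rebuilt = new Function(`return (${fn.toString()})`)();
  return rebuilt(...JSON.parse(JSON.stringify(args)));
}

function demo() {
  const theURL = 'https://example.com/list';

  let closureError = null;
  try {
    runDetached(() => theURL); // relies on the closure, which is gone
  } catch (err) {
    closureError = err.name;
  }

  // passing the value as an argument works
  const viaArgument = runDetached(url => url, [theURL]);
  return { closureError, viaArgument };
}

console.log(demo());
```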
