Web Scraping Drupal Websites with Node.js and Puppeteer

In the realm of web development, we often encounter repetitive tasks that can become tedious over time. Many developers wish for efficient ways to streamline these processes, either by accelerating their execution or eliminating the need for manual repetition altogether.

One day at work, I was tasked with scanning several Drupal websites to ensure that the Twitter icon was no longer present, since Twitter had rebranded to X and we wanted all icons updated accordingly. I started by manually inspecting each website, meticulously searching through different content types and blocks to locate the Twitter icon. It soon occurred to me that there had to be a more efficient approach: an automated script that scans every page on these websites and generates a report indicating which pages did or did not contain the Twitter icon.

I decided to use Puppeteer, a Node.js library, to accomplish the task. Puppeteer facilitates the automation of web page interactions, enabling actions like navigation, form submission, DOM manipulation, and screen capturing, all programmatically using JavaScript or Node.js scripts. This powerful tool streamlines tasks that would otherwise be manual and time-consuming when working with web applications.

In this article, I will detail the step-by-step process I undertook to develop a web scraping script using Puppeteer, highlighting the valuable lessons I learned along the way.

  1. First, you'll need to download and install Node.js. Once installed, you can verify the installation by entering the following commands in your terminal:
node -v
npm -v
  2. Next, create your project directory and initialize your Node.js project:

    1. Open a terminal.

    2. Navigate to your project directory using the cd command.

    3. Run npm init -y to initialize a new package.json file with default settings.

  3. Install the following packages:

npm i puppeteer dotenv

This command installs the Puppeteer and dotenv packages. dotenv is particularly useful for managing environment variables within your project.

  4. Set up the following file structure:
/index.js
/utils
  /retrieveLinks.js
  /wordSearch.js
/.env
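
On macOS or Linux, one quick way to scaffold this structure from your project root (any equivalent approach works):

mkdir utils
touch index.js .env utils/retrieveLinks.js utils/wordSearch.js
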
  5. Let's begin by defining an asynchronous function in index.js that calls retrieveLinks():
require("dotenv").config();
const retrieveLinks = require("./utils/retrieveLinks");

(async () => {
  const links = await retrieveLinks(`${process.env.EXAMPLE_URL}`);
  return links;
})();

Inside this anonymous async function, we define a constant named links, which stores the result of calling retrieveLinks(), a function we will create shortly. The idea is that by running this function and passing in the URL of your website, you will retrieve all the links from that site. The statement const retrieveLinks = require("./utils/retrieveLinks") imports the retrieveLinks() function from the module located at ./utils/retrieveLinks so it can be used within your application. The require("dotenv").config() statement loads and parses environment variables from a .env file into the process.env object in your Node.js application, letting you securely store and access sensitive configuration settings without hardcoding them into your codebase.

  6. In your .env file, you may have a variable declared like this:
EXAMPLE_URL="https://www.example.com"
  7. Let's now transition to the retrieveLinks.js file. Here, we will begin by defining an asynchronous function that takes a URL as a parameter and logs it to the console. We'll also ensure that this function can be accessed outside of its scope by using module.exports.
async function retrieveLinks(websiteUrl) {
  console.log(websiteUrl);
}

module.exports = retrieveLinks;
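
With both files saved, run the script from your project root:

node index.js
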

You should see the following output in your console:

https://www.example.com

Let's integrate Puppeteer into our project by adding the library with the command const puppeteer = require("puppeteer"). We'll then launch a Chrome browser instance using puppeteer.launch(). To run the browser in headed mode (with a visible window, which is helpful while developing), we'll include the headless: false option. Additionally, to slow down the browser's operations for easier debugging, we'll use the slowMo: 50 option, which delays each operation by 50 milliseconds.

  8. Here's how you can set up Puppeteer with these configurations in your Node.js project:
const puppeteer = require("puppeteer");

async function retrieveLinks(websiteUrl) {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch({
    headless: false,
    slowMo: 50, // slow down each operation by 50ms
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto(`${websiteUrl}`);
  await browser.close();
}

module.exports = retrieveLinks;

The page variable holds a new browser page created with browser.newPage(). In this scenario, the page's viewport is set to a screen size of 1920 x 1080 pixels. The browser then navigates to the specified URL, and once the operations are complete, the browser instance is closed.

Once you've confirmed that the correct webpage is opening successfully, you can proceed with the login process. Different organizations use various methods to log in to their Drupal sites. In my case, we use SAML for authentication, and I'll demonstrate this process in the article. However, if your organization uses a different login method, you may need to adapt the script accordingly. The fundamental principles of automating the login process remain consistent regardless of the authentication method employed.

const puppeteer = require("puppeteer");

async function retrieveLinks(websiteUrl) {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch({
    headless: false,
    slowMo: 50, // slow down each operation by 50ms
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto(`${websiteUrl}/user/login`);
  await page.waitForSelector("#user-login-form");
  await page.click("#edit-samlauth-login-link");
  await page.waitForSelector("#username");
  await page.type("#username", `${process.env.USERNAME}`);
  await page.click("#password");
  await page.type("#password", `${process.env.PASSWORD}`);
  await page.click(".idp3_form-submit");
  await page.waitForSelector("#dont-trust-browser-button");
  await page.click("#dont-trust-browser-button");
  await browser.close();
}

module.exports = retrieveLinks;

In the updated configuration, the URL has been changed to "${websiteUrl}/user/login" to reach the login page. The next command uses the waitForSelector method, which pauses script execution until a designated HTML element, in this case the element with the user-login-form ID, becomes visible on the page. Once the form appears, Puppeteer clicks the login link, identified as edit-samlauth-login-link, and authenticates using the provided username and password credentials. These credentials should be added to the previously discussed .env file as shown below:

EXAMPLE_URL="https://www.example.com"
USERNAME="exampleUsername"
PASSWORD="examplePassword"

Finally, after entering the username and password, the script clicks the login button, which redirects you to the homepage of your Drupal website. One caveat: on Windows the operating system already defines a USERNAME environment variable, and dotenv does not override existing variables by default, so if login fails with the wrong username, consider more distinctive names such as DRUPAL_USERNAME.
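
If your site uses Drupal core's standard login form rather than SAML, the same pattern applies with different selectors. Here is a minimal sketch, assuming core's default form IDs (#edit-name, #edit-pass, #edit-submit), which you should verify against your own markup:

  // Hypothetical variant for Drupal core's default login form
  await page.goto(`${websiteUrl}/user/login`);
  await page.waitForSelector("#user-login-form");
  await page.type("#edit-name", `${process.env.USERNAME}`);
  await page.type("#edit-pass", `${process.env.PASSWORD}`);
  await page.click("#edit-submit");
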

In Drupal websites, you can access the URL ${websiteUrl}/admin/content to manage content. This page contains a list of every page created on your website, which is essential for obtaining links to all website pages for this project. Once you are on the homepage of your Drupal website, you will need to add the following two commands to your script in order to select the content button on the administration toolbar, which will direct you to this page.

  await page.waitForSelector(".toolbar-icon-system-admin-content");
  await page.click(".toolbar-icon-system-admin-content");

Initially, when developing this script, I encountered an unexpected error indicating that the content button on the administration toolbar could not be located. Upon investigation, I discovered that some of my websites did not automatically open the administration toolbar, requiring the "Manage" button to be selected first. If you encounter an issue where your script cannot click the .toolbar-icon-system-admin-content class, please incorporate the following two commands before attempting to select the content button:

  await page.waitForSelector("#toolbar-item-administration");
  await page.click("#toolbar-item-administration");
  await page.waitForSelector(".toolbar-icon-system-admin-content");
  await page.click(".toolbar-icon-system-admin-content");

If your script has executed successfully up to this point, you should now land on the content overview page.

Upon reaching the content overview page, we will enhance the retrieveLinks function by introducing two variables: contentPageUrl and allLinks.

  • contentPageUrl will store the URL of the content page.

  • allLinks will be an array containing the URLs of every page on the website. This array will be returned at the end of the function.

These additions will facilitate the retrieval and organization of page URLs from the Drupal website.

const contentPageUrl = page.url();
const allLinks = [];

Depending on the number of pages on your Drupal website, you may need to paginate to access the next list of pages. This can be done by selecting the appropriate page number in the toolbar located at the bottom of the page.

To determine how many pages the pagination toolbar contains, you can inspect the "Last" button element. This button is an <a> tag with an href attribute structured as "?page=number". Because Drupal's pager is zero-indexed (the first page is ?page=0), the value of number after the equals sign is the index of the last page. Inspecting this markup gives you everything you need to drive pagination from your script.

To retrieve the href property from the "Last" button link, grab its element handle with page.$() and then call evaluate() on the handle:

const lastPageButton = await page.$(".pager__item--last > a");
const lastPage = await lastPageButton.evaluate((el) => el.href);

The constant lastPageButton selects the button element representing the last page in the pagination toolbar, and lastPage evaluates that element to retrieve its href property. Once lastPage holds a string ending in "?page=number", you can extract the number part and convert it into a numeric value. Here's how you can accomplish this in JavaScript:

  const numberOfPages = Number(lastPage.substring(lastPage.indexOf("=") + 1));
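
As an aside, if the pager link ever carries additional query parameters, parsing the href with the WHATWG URL API is more robust than taking a substring; a small alternative sketch:

  // Alternative: parse the page number from the href's query string
  const numberOfPages = Number(new URL(lastPage).searchParams.get("page"));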

To retrieve all the page links from the first page, we can run the following command:

  const firstPagelinks = await page.$$(".views-field > a");

In Puppeteer, the $$ method queries the DOM for all elements that match a specified CSS selector; here, it collects every content link provided in the list. During the development of this script, I observed that besides obtaining links to regular pages, it was also fetching links to admin and user pages, which are not required for our purposes. To address this, I implemented the following for loop to filter out those unnecessary links. Each remaining link is appended to the allLinks array, which will be utilized at a later stage in the process.

 for (let i = 0; i < firstPagelinks.length; i++) {
    const link = await firstPagelinks[i].evaluate((el) => el.href);
    if (
      !link.includes(`${websiteUrl}/admin/`) &&
      !link.includes(`${websiteUrl}/user/`)
    ) {
      allLinks.push(`${link}\n`);
    }
  }

Next, we will use the numberOfPages variable to iterate through the remaining pages. Within this loop, we will employ the same approach as before to retrieve the links from the content list on each page. The loop starts at 1 because the first page (?page=0) has already been collected.

  for (let x = 1; x <= numberOfPages; x++) {
    await page.goto(`${contentPageUrl}?page=${x}`);
    await page.waitForSelector("tbody");
    const newPagelinks = await page.$$(".views-field > a");
    for (let i = 0; i < newPagelinks.length; i++) {
      const link = await newPagelinks[i].evaluate((el) => el.href);
      if (
        !link.includes(`${websiteUrl}/admin/`) &&
        !link.includes(`${websiteUrl}/user/`)
      ) {
        allLinks.push(`${link}\n`);
      }
    }
  }
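
Depending on how your content view is themed, the .views-field > a selector can match more than just the title link (operation links also live inside views fields), and the same URL can occasionally be collected twice. If that happens, deduplicating with a Set before returning is a cheap safeguard; if you adopt this, return uniqueLinks instead of allLinks:

  // Optional: drop duplicate URLs collected across pages
  const uniqueLinks = [...new Set(allLinks)];
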

Ensure that at the end of the retrieveLinks function, you close the Puppeteer browser instance and return the allLinks array.

const puppeteer = require("puppeteer");

async function retrieveLinks(websiteUrl) {
  // Launch the browser and open a new blank page
  const browser = await puppeteer.launch({
    headless: false,
    slowMo: 50, // slow down each operation by 50ms
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 1080 });
  await page.goto(`${websiteUrl}/user/login`);
  await page.waitForSelector("#user-login-form");
  await page.click("#edit-samlauth-login-link");
  await page.waitForSelector("#username");
  await page.type("#username", `${process.env.USERNAME}`);
  await page.click("#password");
  await page.type("#password", `${process.env.PASSWORD}`);
  await page.click(".idp3_form-submit");
  await page.waitForSelector("#dont-trust-browser-button");
  await page.click("#dont-trust-browser-button");
  await page.waitForSelector(".toolbar-icon-system-admin-content");
  await page.click(".toolbar-icon-system-admin-content");
  const contentPageUrl = page.url();
  const allLinks = [];
  const lastPageButton = await page.$(".pager__item--last > a");
  const lastPage = await lastPageButton.evaluate((el) => el.href);
  const numberOfPages = Number(lastPage.substring(lastPage.indexOf("=") + 1));
  const firstPagelinks = await page.$$(".views-field > a");
  for (let i = 0; i < firstPagelinks.length; i++) {
    const link = await firstPagelinks[i].evaluate((el) => el.href);
    if (
      !link.includes(`${websiteUrl}/admin/`) &&
      !link.includes(`${websiteUrl}/user/`)
    ) {
      allLinks.push(`${link}\n`);
    }
  }
  for (let x = 1; x <= numberOfPages; x++) {
    await page.goto(`${contentPageUrl}?page=${x}`);
    await page.waitForSelector("tbody");
    const newPagelinks = await page.$$(".views-field > a");
    for (let i = 0; i < newPagelinks.length; i++) {
      const link = await newPagelinks[i].evaluate((el) => el.href);
      if (
        !link.includes(`${websiteUrl}/admin/`) &&
        !link.includes(`${websiteUrl}/user/`)
      ) {
        allLinks.push(`${link}\n`);
      }
    }
  }
  await browser.close();
  return allLinks;
}

module.exports = retrieveLinks;

We will now proceed to create the wordSearch function, which is designed to search each page for occurrences of the word "twitter".

// Function to search for a specific word on a webpage
async function wordSearch(url, word, page, browser) {
  let results;
  try {
    // Trim the trailing newline that retrieveLinks stored for file output
    await page.goto(url.trim());

    const bodyText = await page.evaluate(
      () => document.getElementsByTagName("html")[0].innerHTML
    );
    const wordCount = countOccurrences(bodyText, word);
    if (wordCount > 0) {
      results = url;
    } else {
      results = `The word "${word}" is not found on ${url}`;
    }
  } catch (error) {
    // Record the failure instead of returning undefined, so the caller
    // can still read results.results
    console.error(error);
    results = `Error scanning ${url}`;
  }
  return { page, browser, results };
}

// Helper function to count occurrences of a word in a string
function countOccurrences(text, word) {
  const regex = new RegExp(word, "gi");
  const matches = text.match(regex);
  return matches ? matches.length : 0;
}

module.exports = wordSearch;

The wordSearch function accepts four arguments: url, word, page, and browser. Here is the purpose of each argument:

  • url: This represents the URL of the page being scanned for the specified word.

  • word: This is the word that the function will search for within the HTML content of the page.

  • page and browser: These are the Puppeteer page and browser instances, which the function reuses to load and evaluate the HTML content of each page.

Passing page and browser as arguments into the wordSearch function instead of declaring them within the function itself is done for performance optimization reasons. By passing these objects as arguments, we avoid the overhead of initializing a new browser and page instance every time the function is called. This approach conserves processing power, especially when scanning a large number of pages, as it eliminates the need to repeatedly set up new browser and page instances for each function invocation.

The try-catch statement initially navigates to the provided URL and then evaluates all the HTML content using the following command:

const bodyText = await page.evaluate(
      () => document.getElementsByTagName("html")[0].innerHTML
    );

Next, we will search for the specified word within the HTML content by utilizing the countOccurrences function. This function employs a regular expression to perform a global search for the word, and if any matches are found, it returns the total number of occurrences.

// Helper function to count occurrences of a word in a string
function countOccurrences(text, word) {
  const regex = new RegExp(word, "gi");
  const matches = text.match(regex);
  return matches ? matches.length : 0;
}
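
One caveat: countOccurrences builds a RegExp directly from the search term, which works fine for "twitter" but would misbehave for terms containing regex metacharacters (for example, "c++"). If you ever need to search for such terms, escape them first; a small helper sketch:

// Hypothetical helper: escape regex metacharacters in the search term
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Then build the regex with: new RegExp(escapeRegExp(word), "gi")
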

If the word is not found, the following message is stored in the results instead:

      results = `The word "${word}" is not found on ${url}`;

Finally, we will complete this project by incorporating the remaining code into our index.js file. As previously discussed, we aim to pass the browser and page instances to the wordSearch function for reusability. To achieve this, we will introduce a helper function called createBrowser. Please add the following function to your index.js file, ensuring it is defined outside of any asynchronous functions.

async function createBrowser() {
  // initiate puppeteer browser/page
  let browser = await puppeteer.launch({ headless: true });
  let page = await browser.newPage();
  return { page, browser };
}

Be sure to include the require statement for Puppeteer at the top of your index.js file:

const puppeteer = require("puppeteer");

To integrate the createBrowser function within an asynchronous function in your index.js file, you can incorporate it as follows:

require("dotenv").config();
const retrieveLinks = require("./utils/retrieveLinks");
const puppeteer = require("puppeteer");

(async () => {
  const links = await retrieveLinks(`${process.env.EXAMPLE_URL}`);
  let { page, browser } = await createBrowser();
  return links;
})();

async function createBrowser() {
  // initiate puppeteer browser/page
  let browser = await puppeteer.launch({ headless: true });
  let page = await browser.newPage();
  return { page, browser };
}

To save the results of your script to a text file, we will use the fs module, which enables file system operations including reading from and writing to files and directories in Node.js. Because we will be await-ing these operations, require the promise-based API at the top of index.js:

const fs = require("fs").promises;

To create a blank file using the fs.writeFile() method, you can specify an empty string ("") as the file content. (FILE_PATH is one more variable to define in your .env file; it points at the directory where the report should be written.) Here's how you can create an empty file:

  await fs.writeFile(`${process.env.FILE_PATH}/webpage_scan_results.txt`, "");

Next, we will implement a loop to iterate through all the links retrieved from the retrieveLinks function and append each result to the same file. During each iteration, we will utilize the wordSearch function to search for the specified word within each page, so be sure to also require it at the top of index.js with const wordSearch = require("./utils/wordSearch");. Note that the promise-based appendFile takes no callback:

  for (let i = 0; i < links.length; i++) {
    let results = await wordSearch(links[i], "twitter", page, browser);
    await fs.appendFile(
      `${process.env.FILE_PATH}/webpage_scan_results.txt`,
      results.results
    );
  }

Ensure to close the Puppeteer browser instance at the end of the asynchronous function to release resources and maintain efficient execution.

(async () => {
  const links = await retrieveLinks(`${process.env.EXAMPLE_URL}`);
  let { page, browser } = await createBrowser();

  await fs.writeFile(`${process.env.FILE_PATH}/webpage_scan_results.txt`, "");

  for (let i = 0; i < links.length; i++) {
    let results = await wordSearch(links[i], "twitter", page, browser);
    await fs.appendFile(
      `${process.env.FILE_PATH}/webpage_scan_results.txt`,
      results.results
    );
  }
  await browser.close();
})();
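
For reference, here is how the completed index.js fits together once all the pieces above are combined, using the same file layout and environment variables assumed throughout this article:

require("dotenv").config();
const puppeteer = require("puppeteer");
const fs = require("fs").promises;
const retrieveLinks = require("./utils/retrieveLinks");
const wordSearch = require("./utils/wordSearch");

(async () => {
  // Collect every content link from the Drupal site
  const links = await retrieveLinks(`${process.env.EXAMPLE_URL}`);
  // Reuse a single browser/page pair across all word searches
  let { page, browser } = await createBrowser();

  // Start each run with an empty results file
  await fs.writeFile(`${process.env.FILE_PATH}/webpage_scan_results.txt`, "");

  for (let i = 0; i < links.length; i++) {
    let results = await wordSearch(links[i], "twitter", page, browser);
    await fs.appendFile(
      `${process.env.FILE_PATH}/webpage_scan_results.txt`,
      results.results
    );
  }
  await browser.close();
})();

async function createBrowser() {
  // initiate puppeteer browser/page
  let browser = await puppeteer.launch({ headless: true });
  let page = await browser.newPage();
  return { page, browser };
}
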

In conclusion, building a Drupal web scraper using tools like Puppeteer can offer significant advantages in both professional and personal contexts. From a professional standpoint, automating data extraction from Drupal websites can optimize workflows and save valuable time on tasks such as content monitoring, SEO analysis, or data aggregation for reporting. By leveraging web scraping techniques, developers and data analysts can streamline repetitive processes and focus on more strategic aspects of their work.

Moreover, beyond its practical applications, creating a Drupal web scraper can serve as an engaging portfolio project. It allows developers to showcase their skills in web development, data manipulation, and automation. Sharing a well-documented scraper project on GitHub not only demonstrates technical proficiency but also fosters collaboration within the developer community. It can inspire others to explore similar projects, contribute enhancements, or adapt the scraper for different use cases.

Ultimately, whether used to enhance professional productivity or as a creative endeavor, a Drupal web scraper built with Puppeteer exemplifies the intersection of technology, innovation, and problem-solving. Embracing web scraping tools opens up a world of possibilities for efficiency gains, data-driven insights, and collaborative learning in the dynamic landscape of web development. So, dive in, experiment, and leverage the power of web scraping to make meaningful contributions to your projects and share your journey with fellow enthusiasts.

If you're interested in exploring the complete code, please feel free to visit my GitHub repository at https://github.com/Ej1seven/Drupal_Web_Scraper.