In the realm of web development, we often encounter repetitive tasks that can become tedious over time. Many developers wish for efficient ways to streamline these processes, either by accelerating their execution or eliminating the need for manual repetition altogether.
One day at work, I was tasked with scanning several Drupal websites to ensure that the Twitter icon was no longer present, as the company had rebranded to X and wanted all icons updated accordingly. Initially, I started manually inspecting each website, meticulously searching through different content types and blocks to locate the Twitter icon. However, it soon occurred to me that there had to be a more efficient approach. That's when I realized the potential of creating an automated script to scan all webpages on these websites and generate a report indicating which pages did or did not contain the Twitter icon. I decided to use Puppeteer, a Node.js library, to accomplish the task. Puppeteer facilitates the automation of web page interactions, enabling actions like navigation, form submission, DOM manipulation, and screen capturing, all programmatically using JavaScript or Node.js scripts. This powerful tool streamlines the process of automating tasks that would otherwise be manual and time-consuming when working with web applications.
In this article, I will detail the step-by-step process I undertook to develop a web scraping script using Puppeteer, highlighting the valuable lessons I learned along the way.
- First, you'll need to download and install Node.js. Once installed, you can verify the installation by entering the following commands in your terminal:
node -v
npm -v
Next, create your project directory and initialize your Node.js project by running the following commands in your terminal:
Open a terminal and navigate to your project directory using the cd command. Then run npm init -y to initialize a new package.json file with default settings.
Install the following packages:
npm i puppeteer dotenv
This command installs the Puppeteer and dotenv packages. dotenv is particularly useful for managing environment variables within your project.
- Set up the following file structure:
/index.js
/utils
    /retrieveLinks.js
    /wordSearch.js
/.env
- Let's begin by defining an asynchronous function in index.js that calls retrieveLinks():
require("dotenv").config();
const retrieveLinks = require("./utils/retrieveLinks");
(async () => {
const links = await retrieveLinks(`${process.env.EXAMPLE_URL}`);
return links;
})();
In index.js, we define a constant named links that invokes retrieveLinks(), which we will create shortly. The idea is that by running this function and passing in the URL of your webpage, you will retrieve all the links from your website. The statement const retrieveLinks = require("./utils/retrieveLinks") is how Node.js imports the retrieveLinks() function from the module located at ./utils/retrieveLinks, allowing you to use it within your application. The require("dotenv").config() statement loads and parses environment variables from a .env file into the process.env object of your Node.js application, which lets you securely store and access sensitive configuration settings without hardcoding them into your codebase.
- In your .env file, you may have a variable declared like this:
EXAMPLE_URL="https://www.example.com"
- Let's now transition to the retrieveLinks.js file. Here, we will begin by defining an asynchronous function that takes a URL as a parameter and logs it to the console. We'll also ensure that this function can be accessed outside of its scope by using module.exports.
async function retrieveLinks(websiteUrl) {
console.log(websiteUrl);
}
module.exports = retrieveLinks;
Run node index.js, and you should see the following output in the console:
https://www.example.com
Let's integrate Puppeteer into our project by adding the library with const puppeteer = require("puppeteer"). We'll then create a Chrome browser instance using puppeteer.launch(). To run the browser in non-headless mode (i.e., with a visible window), we'll include the headless: false option. Additionally, to slow down the browser's operations for easier debugging, we'll use the slowMo: 50 option.
- Here's how you can set up Puppeteer with these configurations in your Node.js project:
const puppeteer = require("puppeteer");
async function retrieveLinks(websiteUrl) {
// Launch the browser and open a new blank page
const browser = await puppeteer.launch({
headless: false,
slowMo: 50, // slow down each operation by 50ms
});
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });
await page.goto(`${websiteUrl}`);
await browser.close();
}
module.exports = retrieveLinks;
The page variable holds a newly created browser page. In this scenario, the page's viewport is set to 1920 x 1080 pixels. The browser then navigates to the specified URL, and once the operations are complete, the browser instance is closed.
Once you've confirmed that the correct webpage is opening successfully, you can proceed with the login process. Different organizations use various methods to log in to their Drupal sites. In my case, we use SAML for authentication, and I'll demonstrate this process in the article. However, if your organization uses a different login method, you may need to adapt the script accordingly. The fundamental principles of automating the login process remain consistent regardless of the authentication method employed.
const puppeteer = require("puppeteer");
async function retrieveLinks(websiteUrl) {
// Launch the browser and open a new blank page
const browser = await puppeteer.launch({
headless: false,
slowMo: 50, // slow down each operation by 50ms
});
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });
await page.goto(`${websiteUrl}/user/login`);
await page.waitForSelector("#user-login-form");
await page.click("#edit-samlauth-login-link");
await page.waitForSelector("#username");
await page.type("#username", `${process.env.USERNAME}`);
await page.click("#password");
await page.type("#password", `${process.env.PASSWORD}`);
await page.click(".idp3_form-submit");
await page.waitForSelector("#dont-trust-browser-button");
await page.click("#dont-trust-browser-button");
await browser.close();
}
module.exports = retrieveLinks;
In the updated configuration, the URL has been changed to "${websiteUrl}/user/login" to direct the browser to the login page. The next command uses the waitForSelector method, which pauses script execution until the element with the user-login-form ID appears on the page. Once the form appears, Puppeteer clicks the login link, identified by edit-samlauth-login-link, and then authenticates with the provided username and password. These credentials should be added to the previously discussed .env file as shown below. (One caution: on some systems USERNAME already exists as an operating-system environment variable, and dotenv does not override existing variables by default, so you may prefer a more distinctive name.)
EXAMPLE_URL="https://www.example.com"
USERNAME="exampleUsername"
PASSWORD="examplePassword"
Finally, after entering the username and password, the script proceeds to select the login button, which then redirects you to the homepage of your Drupal website.
Drupal websites expose a content overview at ${websiteUrl}/admin/content. This page lists every page created on your website, which is exactly what we need to collect links to all pages for this project. From the homepage of your Drupal website, add the following two commands to your script to click the Content button on the administration toolbar, which takes you to this page.
await page.waitForSelector(".toolbar-icon-system-admin-content");
await page.click(".toolbar-icon-system-admin-content");
Initially, when developing this script, I encountered an unexpected error indicating that the content button on the administration toolbar could not be located. Upon investigation, I discovered that some of my websites did not automatically open the administration toolbar, requiring the "Manage" button to be selected first. If you encounter an issue where your script cannot click the .toolbar-icon-system-admin-content class, please incorporate the following two commands before attempting to select the content button:
await page.waitForSelector("#toolbar-item-administration");
await page.click("#toolbar-item-administration");
await page.waitForSelector(".toolbar-icon-system-admin-content");
await page.click(".toolbar-icon-system-admin-content");
If your script has executed successfully up to this point, you should now land on the Drupal content overview page (admin/content).
Upon reaching the content overview page, we will enhance the retrieveLinks function by introducing two variables: contentPageUrl, which stores the URL of the content page, and allLinks, an array that will hold the URL of every page on the website and is returned at the end of the function. These additions facilitate the retrieval and organization of page URLs from the Drupal website.
const contentPageUrl = page.url();
const allLinks = [];
Depending on the number of pages on your Drupal website, you may need to paginate to access the next list of pages. This can be done by selecting the appropriate page number in the toolbar located at the bottom of the page.
To determine the total number of pages in the pagination toolbar, inspect the "Last" button element. This button is an <a> tag whose href attribute is structured as "?page=number"; the number after the equals sign corresponds to the index of the last page. Inspecting this markup gives us the information needed for pagination in the script.
To retrieve the href property of that last-page link using Puppeteer's evaluate method, you can use the following commands:
const lastPageButton = await page.$(".pager__item--last > a");
const lastPage = await lastPageButton.evaluate((el) => el.href);
The constant lastPageButton selects the link element for the last page in the pagination toolbar, and lastPage evaluates that element to retrieve its href property. Since lastPage contains a string ending in "?page=number", you can extract the number portion and convert it into a numeric value. Here's how to accomplish this in JavaScript:
const numberOfPages = Number(lastPage.substring(lastPage.indexOf("=") + 1));
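Taken in isolation, the extraction can be sketched with a hard-coded href (the URL below is a made-up example; in the script it comes from the "Last" pagination button):

```javascript
// Hypothetical href value, standing in for the one read from the
// "Last" pagination button.
const lastPage = "https://www.example.com/admin/content?page=12";

// Take everything after the first "=" and convert it to a number.
const numberOfPages = Number(lastPage.substring(lastPage.indexOf("=") + 1));

console.log(numberOfPages); // 12
```

Note that indexOf("=") finds the first equals sign, so this assumes page is the only query parameter; a more robust alternative is Number(new URL(lastPage).searchParams.get("page")).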
To retrieve all page links from the first page, we can run the following command:
const firstPagelinks = await page.$$(".views-field > a");
In Puppeteer, the $$ method queries the DOM for all elements matching a specified CSS selector; here, we are finding all of the content links in the list. During development, I noticed that besides links to regular pages, the script was also fetching links to admin and user pages, which we do not need. To filter these out, I added the following for loop. Each remaining link is appended to the allLinks array, which will be used later in the process.
for (let i = 0; i < firstPagelinks.length; i++) {
const link = await firstPagelinks[i].evaluate((el) => el.href);
if (
!link.includes(`${websiteUrl}/admin/`) &&
!link.includes(`${websiteUrl}/user/`)
) {
allLinks.push(`${link}\n`);
}
}
Next, we will use the numberOfPages variable to iterate through the remaining pages of the content list. Within this loop, we employ the same approach as before to retrieve the links on each page. The loop starts at x = 1 because the first page (page=0) has already been collected.
for (let x = 1; x < numberOfPages + 1; x++) {
await page.goto(`${contentPageUrl}?page=${x}`);
await page.waitForSelector("tbody");
const newPagelinks = await page.$$(".views-field > a");
for (let i = 0; i < newPagelinks.length; i++) {
const link = await newPagelinks[i].evaluate((el) => el.href);
if (
!link.includes(`${websiteUrl}/admin/`) &&
!link.includes(`${websiteUrl}/user/`)
) {
allLinks.push(`${link}\n`);
}
}
}
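Since the same admin/user filter appears in both loops, it can also be factored into a small pure helper. The sketch below is illustrative only; isContentLink is a name I'm introducing, not part of the original script:

```javascript
// Hypothetical helper mirroring the inline filter: keep a link only if
// it is not under the site's /admin/ or /user/ paths.
function isContentLink(link, websiteUrl) {
  return (
    !link.includes(`${websiteUrl}/admin/`) &&
    !link.includes(`${websiteUrl}/user/`)
  );
}

const base = "https://www.example.com";
console.log(isContentLink(`${base}/about`, base)); // true
console.log(isContentLink(`${base}/admin/content`, base)); // false
console.log(isContentLink(`${base}/user/login`, base)); // false
```

Extracting the condition this way makes it easy to unit-test the filtering logic without launching a browser.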
Ensure that at the end of the retrieveLinks function, you close the Puppeteer browser instance and return the allLinks array.
const puppeteer = require("puppeteer");
async function retrieveLinks(websiteUrl) {
// Launch the browser and open a new blank page
const browser = await puppeteer.launch({
headless: false,
slowMo: 50, // slow down each operation by 50ms
});
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });
await page.goto(`${websiteUrl}/user/login`);
await page.waitForSelector("#user-login-form");
await page.click("#edit-samlauth-login-link");
await page.waitForSelector("#username");
await page.type("#username", `${process.env.USERNAME}`);
await page.click("#password");
await page.type("#password", `${process.env.PASSWORD}`);
await page.click(".idp3_form-submit");
await page.waitForSelector("#dont-trust-browser-button");
await page.click("#dont-trust-browser-button");
await page.waitForSelector(".toolbar-icon-system-admin-content");
await page.click(".toolbar-icon-system-admin-content");
const contentPageUrl = page.url();
const allLinks = [];
const lastPageButton = await page.$(".pager__item--last > a");
const lastPage = await lastPageButton.evaluate((el) => el.href);
const numberOfPages = Number(lastPage.substring(lastPage.indexOf("=") + 1));
const firstPagelinks = await page.$$(".views-field > a");
for (let i = 0; i < firstPagelinks.length; i++) {
const link = await firstPagelinks[i].evaluate((el) => el.href);
if (
!link.includes(`${websiteUrl}/admin/`) &&
!link.includes(`${websiteUrl}/user/`)
) {
allLinks.push(`${link}\n`);
}
}
for (let x = 1; x < numberOfPages + 1; x++) {
await page.goto(`${contentPageUrl}?page=${x}`);
await page.waitForSelector("tbody");
const newPagelinks = await page.$$(".views-field > a");
for (let i = 0; i < newPagelinks.length; i++) {
const link = await newPagelinks[i].evaluate((el) => el.href);
if (
!link.includes(`${websiteUrl}/admin/`) &&
!link.includes(`${websiteUrl}/user/`)
) {
allLinks.push(`${link}\n`);
}
}
}
await browser.close();
return allLinks;
}
module.exports = retrieveLinks;
We will now create the wordSearch function (in utils/wordSearch.js), which searches each page for occurrences of the word "twitter".
// Function to search for a specific word on a webpage
async function wordSearch(url, word, page, browser) {
let results;
try {
await page.goto(url);
const bodyText = await page.evaluate(
() => document.getElementsByTagName("html")[0].innerHTML
);
const wordCount = countOccurrences(bodyText, word);
if (wordCount > 0) {
results = url;
} else {
results = `The word "${word}" is not found on ${url}`;
}
} catch (error) {
// Log the error but still set a result string so the caller
// can safely read results.results
console.log(error);
results = `Error scanning ${url}\n`;
}
return { page, browser, results };
}
// Helper function to count occurrences of a word in a string
function countOccurrences(text, word) {
const regex = new RegExp(word, "gi");
const matches = text.match(regex);
return matches ? matches.length : 0;
}
module.exports = wordSearch;
The wordSearch function accepts four arguments: url, word, page, and browser. Here is the purpose of each argument:
- url: the URL of the page being scanned for the specified word.
- word: the word that the function will search for within the HTML content of the page.
- page and browser: the Puppeteer objects used to load and evaluate the HTML content of each page during execution.
Passing page and browser as arguments into the wordSearch function, rather than creating them inside the function, is a performance optimization. By reusing these objects we avoid the overhead of launching a new browser and opening a new page on every call, which matters when scanning a large number of pages.
The try-catch block first navigates to the provided URL and then captures the page's full HTML content with the following command:
const bodyText = await page.evaluate(
() => document.getElementsByTagName("html")[0].innerHTML
);
Next, we search for the specified word within the HTML content using the countOccurrences function. This function uses a case-insensitive global regular expression and, if any matches are found, returns the total number of occurrences.
// Helper function to count occurrences of a word in a string
function countOccurrences(text, word) {
const regex = new RegExp(word, "gi");
const matches = text.match(regex);
return matches ? matches.length : 0;
}
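As a quick sanity check, the helper can be exercised on a small HTML snippet (the markup below is invented for illustration). Note that because the word is interpolated directly into a regular expression, a word containing regex special characters would need escaping; plain words like "twitter" are safe:

```javascript
// Same helper as in wordSearch.js, exercised on an invented HTML fragment.
function countOccurrences(text, word) {
  const regex = new RegExp(word, "gi");
  const matches = text.match(regex);
  return matches ? matches.length : 0;
}

const html = '<a href="https://twitter.com/example"><span>Twitter</span></a>';
console.log(countOccurrences(html, "twitter")); // 2 (matches both "twitter" and "Twitter")
console.log(countOccurrences(html, "facebook")); // 0
```

Because the search runs over the raw HTML, it catches the word in attribute values (such as href) as well as in visible text, which is exactly what we want when hunting for leftover Twitter icons.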
If the word is not found, the following message will be logged:
results = `The word "${word}" is not found on ${url}`;
Finally, we will complete the project by adding the remaining code to our index.js file. As previously discussed, we want to pass the browser and page instances to the wordSearch function for reuse. To achieve this, we will introduce a helper function called createBrowser. Add the following function to your index.js file, ensuring it is defined outside of the asynchronous IIFE.
async function createBrowser() {
// initiate puppeteer browser/page
let browser = await puppeteer.launch({ headless: true });
let page = await browser.newPage();
return { page, browser };
}
Ensure the require statement for Puppeteer is included at the top of your index.js:
const puppeteer = require("puppeteer");
To use the createBrowser function within the asynchronous function in your index.js file, incorporate it as follows:
require("dotenv").config();
const retrieveLinks = require("./utils/retrieveLinks");
const puppeteer = require("puppeteer");
(async () => {
const links = await retrieveLinks(`${process.env.EXAMPLE_URL}`);
let { page, browser } = await createBrowser();
return links;
})();
async function createBrowser() {
// initiate puppeteer browser/page
let browser = await puppeteer.launch({ headless: true });
let page = await browser.newPage();
return { page, browser };
}
To save the results of your script to a text file, we will use Node's fs module, which provides file system operations such as reading from and writing to files and directories. Because our code awaits these calls, we use the promise-based API: const fs = require("fs").promises. To create a blank file with the fs.writeFile() method, specify an empty string ("") as the file content:
await fs.writeFile(`${process.env.FILE_PATH}/webpage_scan_results.txt`, "");
Next, we will implement a loop that iterates through all the links returned by the retrieveLinks function and appends each result to the same file. During each iteration, the wordSearch function searches for the specified word within the content of each link.
for (let i = 0; i < links.length; i++) {
let results = await wordSearch(links[i], "twitter", page, browser);
// The promise-based appendFile takes no callback; errors reject the promise
await fs.appendFile(
`${process.env.FILE_PATH}/webpage_scan_results.txt`,
results.results
);
}
Ensure to close the Puppeteer browser instance at the end of the asynchronous function to release resources and maintain efficient execution.
require("dotenv").config();
const puppeteer = require("puppeteer");
const fs = require("fs").promises;
const retrieveLinks = require("./utils/retrieveLinks");
const wordSearch = require("./utils/wordSearch");
(async () => {
const links = await retrieveLinks(`${process.env.EXAMPLE_URL}`);
let { page, browser } = await createBrowser();
await fs.writeFile(`${process.env.FILE_PATH}/webpage_scan_results.txt`, "");
for (let i = 0; i < links.length; i++) {
let results = await wordSearch(links[i], "twitter", page, browser);
await fs.appendFile(
`${process.env.FILE_PATH}/webpage_scan_results.txt`,
results.results
);
}
await browser.close();
})();
In conclusion, building a Drupal web scraper using tools like Puppeteer can offer significant advantages in both professional and personal contexts. From a professional standpoint, automating data extraction from Drupal websites can optimize workflows and save valuable time on tasks such as content monitoring, SEO analysis, or data aggregation for reporting. By leveraging web scraping techniques, developers and data analysts can streamline repetitive processes and focus on more strategic aspects of their work.
Moreover, beyond its practical applications, creating a Drupal web scraper can serve as an engaging portfolio project. It allows developers to showcase their skills in web development, data manipulation, and automation. Sharing a well-documented scraper project on GitHub not only demonstrates technical proficiency but also fosters collaboration within the developer community. It can inspire others to explore similar projects, contribute enhancements, or adapt the scraper for different use cases.
Ultimately, whether used to enhance professional productivity or as a creative endeavor, a Drupal web scraper built with Puppeteer exemplifies the intersection of technology, innovation, and problem-solving. Embracing web scraping tools opens up a world of possibilities for efficiency gains, data-driven insights, and collaborative learning in the dynamic landscape of web development. So, dive in, experiment, and leverage the power of web scraping to make meaningful contributions to your projects and share your journey with fellow enthusiasts.
If you're interested in exploring the complete code, please feel free to visit my GitHub repository at https://github.com/Ej1seven/Drupal_Web_Scraper.