Scraping iframes Using Playwright for Python

A quick tutorial on web scraping iframes with Playwright

I just got done with a web scraping job for a client that involved looking for specific popups on a list of websites. All of the websites were using iframes for their overlay popups. While working on the project, I found it very difficult to find accurate and current information on how to handle iframes with Playwright for Python. So I figured I would throw this quick write-up out into the interwebs.

An iframe, or inline frame, is an embedded HTML document. Unlike a shadow DOM, an iframe is a completely separate document with its own <html>, <head>, and <body> tags, which can make iframes a bit difficult to work with. I have included a very barebones example of iframes in the image below. I will be using a live site to generate the error codes, but we will be using this example for this article.
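For anyone following along in code, here is a rough stand-in for that kind of barebones page, written as a Python string. It is an assumption, not the exact page from the example: two iframes, each carrying its own small document through the srcdoc attribute. The names EXAMPLE_HTML and load_example are made up for illustration.

```python
# Hypothetical stand-in for a barebones page with two iframes.
# Each iframe is a separate document, supplied inline through srcdoc.
EXAMPLE_HTML = """
<!DOCTYPE html>
<html>
  <head><title>Parent page</title></head>
  <body>
    <h1>Parent document</h1>
    <iframe id="first" srcdoc="<html><body><p>First frame</p></body></html>"></iframe>
    <iframe id="second" srcdoc="<html><body><p>Second frame</p></body></html>"></iframe>
  </body>
</html>
"""


def load_example(page):
    """Render the example markup in an already-open Playwright page."""
    page.set_content(EXAMPLE_HTML)
```

Calling load_example(page) on an open Playwright page renders the markup without needing a web server.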

I like to break things, don’t you? So let’s get down to the nitty gritty and break this thing to see what happens when we don’t do things properly.
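A minimal sketch of the kind of too-loose code that breaks, assuming a page with two iframes that each contain a <p> element. The function name and both selectors are assumptions for illustration. Playwright documents frame locators as strict, so acting through one that matches several iframes is expected to raise rather than return text.

```python
def read_frame_text_loosely(page):
    """Try to read a paragraph through a frame locator that is too broad.

    The selector "iframe" matches every iframe on the page. Frame locators
    are strict, so on a page with more than one iframe this call is expected
    to raise an error instead of returning text.
    """
    return page.frame_locator("iframe").locator("p").inner_text()
```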

Great, everything looks ready to go, so let’s give it a run:

As you can see, that didn’t work like we expected. This brings us to the next thing we need to know about FrameLocator: it expects more specificity than page.locator(). A FrameLocator is strict, so its selector has to resolve to exactly one iframe, while a plain locator may match several elements. If we run the same code using a basic locator we will get a proper output like so:
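Here is a hedged sketch of the basic-locator version. TWO_FRAME_HTML, count_iframes, and main are names invented for this illustration, and the markup is only a stand-in for a page with two iframes.

```python
# Hypothetical markup with two iframes, used only for this illustration.
TWO_FRAME_HTML = """
<iframe srcdoc="<p>First frame</p>"></iframe>
<iframe srcdoc="<p>Second frame</p>"></iframe>
"""


def count_iframes(page):
    """Count iframe elements with a plain locator, which may match many."""
    return page.locator("iframe").count()


def main():
    # Imported here so count_iframes can be reused without launching a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(TWO_FRAME_HTML)
        print(count_iframes(page))
        browser.close()
```

Calling main() launches a headless Chromium, loads the stand-in markup, and prints the number of iframe elements found.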

The output of the above code on our example page would be “2”. If you are unable to provide a proper selector to the FrameLocator, you would have to use page.frames, which is not recommended but will be discussed further below. In order to get the FrameLocator to work on our example page, we need to pass a more precise selector like so:
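A minimal sketch of a more precise frame locator. The src value in FRAME_SELECTOR is a placeholder I made up, so swap in whatever attribute uniquely identifies your iframe. The function name is also an assumption.

```python
# Hypothetical selector: the src value is a placeholder, not a real page.
FRAME_SELECTOR = "iframe[src='frame_one.html']"


def paragraph_in_frame(page, frame_selector=FRAME_SELECTOR):
    """Return a locator for <p> elements inside the one matching iframe.

    Because the frame selector is meant to match a single iframe, the strict
    frame locator has exactly one document to enter.
    """
    return page.frame_locator(frame_selector).locator("p")
```

The returned locator behaves like any other, so paragraph_in_frame(page).count() or .inner_text() can be called on it.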

So is everything good to go now? Unfortunately not quite yet. If you are reading this, you probably already know that the Playwright API runs through commands very quickly. On the other hand, iframes take a moment to load as they are rendered on the client side. So now we will get an error like so:
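One way past that timing problem, sketched under assumptions, is to wait for the iframe element to be attached to the DOM before querying it. The function name, the default timeout, and the choice of state="attached" are my own picks for illustration. wait_for_selector raises a timeout error if the element never shows up.

```python
def count_frame_when_ready(page, frame_selector, timeout_ms=10_000):
    """Wait for the iframe element to exist, then count the matches."""
    # Block until an element matching the selector is attached to the DOM.
    page.wait_for_selector(frame_selector, state="attached", timeout=timeout_ms)
    return page.locator(frame_selector).count()
```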

Great! Everything works and the script will return “1”, as it has located our designated iframe. To access the contents of the iframe, just change the selector in the page.locator() to any element located in the iframe just like you would for normal Playwright operations.

You might be wondering why I am using page.locator().count() as an example. In my most recent project, I had to check for the presence of different types of iframes. I was not able to use “src” as a selector because the overlays all had different sources, so I used a broader selector like this: iframe[data-cy*='widget']. This allowed me to check each site’s overlay. Instead of waiting for a selector, I simulated brief user interaction with the page to trigger all the JavaScript functions on the page and then checked for iframes. Since I was using a broader selector, a few of the test cases returned more than one iframe, while some pages didn’t have them at all. Using count() and nth() together on the locator provided me with the functionality I was looking for like below:
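A hedged sketch of that count-and-nth pattern. The function name and the decision to collect each iframe’s src attribute are assumptions. A real check would put its own conditions inside the loop.

```python
WIDGET_SELECTOR = "iframe[data-cy*='widget']"


def inspect_widget_frames(page, selector=WIDGET_SELECTOR):
    """Return the src of every matching iframe, or an empty list if none exist."""
    frames = page.locator(selector)
    total = frames.count()  # 0 when the page has no matching iframe
    found = []
    if total > 0:
        for index in range(total):
            # nth() narrows the broad locator down to one iframe at a time.
            found.append(frames.nth(index).get_attribute("src"))
    return found
```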

This solution worked well for my specific needs as it allowed me to broadly test many differently styled pages without throwing any Playwright errors. If the iframe wasn’t there, count returned zero and the script moved on. If there was more than one iframe using the designated selector, the nested if/for statement allowed for further conditional checks against the iframes. However, you should be careful how you build your user page interaction simulation or you might get some strange things returned.

I found out during this project that if you don’t wait long enough or if an iframe isn’t properly triggered you can get some weird functionality. I used the Python time module to simulate the behavior below:
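A minimal sketch of that fixed-pause approach. The function name, its parameters, and the choice to read the iframe’s body text are assumptions. The point is only that time.sleep waits a fixed amount of time no matter what the page is doing.

```python
import time


def read_frame_after_pause(page, url, frame_selector, pause_seconds=3):
    """Visit a page, sleep for a fixed time, then read all text in an iframe."""
    page.goto(url)
    # A fixed sleep does not know whether the iframe's scripts have finished.
    time.sleep(pause_seconds)
    return page.frame_locator(frame_selector).locator("body").inner_text()
```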

In the code above, we first visit the page. Then Python sleeps for three seconds before looking for all the text inside the designated iframe. Three seconds is just long enough for the iframe to appear, so we don’t get the same error as shown earlier. However, it is not long enough for the JavaScript inside the iframe to fire, so we get the following as the inner text:

So there you have it, you can now access iframes and their contents with ease!

What if you want to return all of the iframes on a page, similar to page.locator("iframe") as listed above? Playwright makes that easy with the page.frames property as seen below:
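A short sketch of looping over page.frames. The function name is an assumption. Keep in mind that the list includes the page’s main frame as well as every iframe.

```python
def list_frames(page):
    """Print and return (name, url) for every frame attached to the page."""
    seen = []
    for frame in page.frames:
        # name and url are properties on Frame in the Python API.
        print(frame.name, frame.url)
        seen.append((frame.name, frame.url))
    return seen
```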

One final note on page.frames. If you choose to use this property, you are not being specific with Playwright. As discussed above, Playwright moves quickly, and since the script and Playwright are moving at different speeds, it is possible that a frame will detach before it is called to action by Python. Running the for loop in the code snippet above on the top secret website prints most of the elements, but as it approaches the last element it throws this error:

This happens because while Python is iterating through items in the page.frames list, a frame detaches before it is called by Python. There are workarounds for this that are outside of the scope of this article.

To wrap up, Playwright is the best tool to use for web testing and scraping as it makes a lot of things intuitive and straightforward. However, there are some nuances when combining the Playwright API with Python that must be kept in mind when writing your script. If you find yourself in need of scraping iframes, I highly suggest creating multiple specific conditions and using the FrameLocator over page.frames. I hope this article helps someone out there in the interwebs not beat their head against iframes like I had to. Thank you for reading and I highly value your feedback! Happy coding!
