Web scraping dynamic content using only Beautiful Soup

Scott Dallman
3 min read · Dec 8, 2021


Photo by Valery Sysoev on Unsplash

Beautiful Soup has one well-known limitation in web scraping: it can't deal with dynamic content directly. Because of that, many web scrapers use Selenium to trigger the dynamic content and then scrape the data they have identified. That is a perfectly good method, but if you aren't familiar with Selenium or don't want to add another library, don't worry: there is another way.

The extraction method

When web scraping, all we really care about is the data, and the first thing we need to do is identify where the data lives. In some cases, the data actually lives in JSON format embedded in a <script> tag. If that is the case, you can scrape the data as you normally would: identify the script tag and parse out the content you need.
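As a quick illustration of that case, here is a minimal sketch against a made-up page; the script tag's id and the JSON payload are invented for the example, not taken from any real site:

```python
import json
from bs4 import BeautifulSoup

# hypothetical page embedding its data as JSON in a <script> tag
html = """
<html><body>
<script id="page-data" type="application/json">
{"ticker": "TSLA", "cashflow": [1245, 2098, 5943]}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", id="page-data")  # locate the script tag
data = json.loads(script.string)              # parse its JSON payload
print(data["ticker"], data["cashflow"])
```

From there the parsed dictionary can be used like any other Python data.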

In most cases, though, the data you need will not be sitting in a script tag; it will instead be loaded by an asynchronous call using AJAX. In the example below, I will use the cash flow data from finviz.com.

When looking up a stock such as Tesla, the direct URL is https://finviz.com/quote.ashx?t=tsla. The middle of the page lists the income statement, balance sheet, and cash flow, but if you review the HTML, only the income statement is present. To retrieve the AJAX-loaded information, we need to find the URL the data is sourced from.

Open the browser's developer tools, inspect the website, and use the Network tab to locate the data source. Filter the requests by selecting only “Fetch/XHR”, then click the cash flow tab. Click it a few times and watch the requests that appear: you can see the URL, including the TSLA stock ticker. The full URL to fetch the data is https://finviz.com/api/statement.ashx?t=tsla&s=CA.
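Once you know the endpoint's shape, you can build the URL for any ticker with a standard query string. Here is a small sketch; note that s=CA is simply the parameter observed in DevTools for cash flow, and codes for the other statements would need to be checked the same way:

```python
from urllib.parse import urlencode

def statement_url(ticker, statement="CA"):
    # "CA" is the code observed in the cash flow request;
    # other statement codes would need to be confirmed in DevTools
    base = "https://finviz.com/api/statement.ashx"
    return base + "?" + urlencode({"t": ticker, "s": statement})

print(statement_url("tsla"))
# https://finviz.com/api/statement.ashx?t=tsla&s=CA
```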

Extraction

At this point you have located the source of the data, which is probably the hardest part. From here it is just a matter of scraping the data with Beautiful Soup if you are looking for an individual string:

import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
cashflow = 'https://finviz.com/api/statement.ashx?t=tsla&s=CA'
cf = requests.get(cashflow, headers=headers)
soup = bs(cf.content, 'html.parser')

# use whatever method you need to find the content
soup.find_all(???)
soup.find(???)
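To make the find_all step concrete, here is a small sketch against a sample fragment; the table markup below is invented for illustration, and the real finviz response may be structured differently:

```python
from bs4 import BeautifulSoup

# invented fragment standing in for the fetched statement HTML
html = "<table><tr><td>Net Income</td><td>5519</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# pull the text out of every table cell
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)  # ['Net Income', '5519']
```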

In my case, I was looking for the whole dataset. Since the response was already formatted as JSON, I loaded it with the response's json() method:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
cashflow = 'https://finviz.com/api/statement.ashx?t=tsla&s=CA'
cf = requests.get(cashflow, headers=headers)
cfdata = cf.json()
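The post doesn't show the shape of the returned JSON, so here is a hypothetical sketch of walking such a response once it has been loaded; the keys and layout below are invented for illustration, and the real payload may be organized differently:

```python
import json

# invented response shape standing in for cf.json()
cfdata = json.loads(
    '{"data": {"Period": ["2019", "2020"], "Net Income": ["-862", "721"]}}'
)

# iterate over each row of the statement
for row_name, values in cfdata["data"].items():
    print(row_name, values)
```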

Conclusion

And there you have it: data loaded from dynamic content. You don't always need Selenium to get the dynamic content you are looking for, so give this method a try and see if it fits your needs.

If you like the content I would appreciate a follow!

