


How To Do Nested For Loops While Scraping Data Python

Web Scraping is a method of extracting useful data from a website using computer programs without having to do it manually. This data can then be exported and categorically organized for various purposes. Some common places where Web Scraping finds its use are Market Research & Analysis Websites, Price Comparison Tools, Search Engines, Data Collection for AI/ML projects, etc.

Let's dive deep and scrape a website. In this article, we are going to take the GeeksforGeeks website and extract the titles of all the articles available on the Homepage using a Python script.

If you notice, there are thousands of articles on the website, and to extract all of them we will have to scrape through all the pages so that we don't miss out on any!

GeeksforGeeks Homepage

Scraping Multiple Pages of a Website Using Python

Now, there may arise various instances where you may want to get data from multiple pages of the same website or from multiple different URLs, and manually writing code for each webpage is a time-consuming and tiresome task. Plus, it defies all basic principles of automation. Duh!

To solve this exact problem, we will look at two main techniques that will help us extract data from multiple webpages:

  • The same website
  • Different website URLs

Approach:

The approach of the program will be fairly simple, and it will be easier to understand it in point format:

  • We'll import all the necessary libraries.
  • Set up our URL strings for making a connection using the requests library.
  • Parse the available data from the target page using the BeautifulSoup library's parser.
  • From the target page, identify and extract the classes and tags that contain the information that is valuable to us (a quick way to check this is sketched just after this list).
  • Prototype it for one page using a loop and then apply it to all the pages.
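
For that fourth step, a quick way to confirm which tag and class actually hold the article titles is to fetch the page once and look at the candidate <div> classes. The snippet below is a minimal sketch of that check, assuming the GeeksforGeeks homepage and the 'head' class used in the examples later in this article:

Python

import requests
from bs4 import BeautifulSoup as bs

# fetch the homepage once so we can inspect its markup
req = requests.get('https://www.geeksforgeeks.org/')
soup = bs(req.text, 'html.parser')

# list the distinct class names carried by <div> tags on the page
classes = {c for div in soup.find_all('div', class_=True) for c in div['class']}
print(sorted(classes))

# sanity check: the article titles should sit inside <div class="head"> elements
print(soup.find_all('div', attrs={'class': 'head'})[4].text)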

Example 1: Looping through the page numbers

page numbers at the bottom of the GeeksforGeeks website

Most websites have pages labeled from 1 to N. This makes it really simple for us to loop through these pages and extract data from them, as these pages have similar structures. For example:

notice the last section of the URL – page/4/

Here, we can see the page number at the end of the URL. Using this information we can easily create a for loop iterating over as many pages as we want (by putting page/(i)/ in the URL string and iterating "i" till N) and scrape all the useful data from them. The following code will give you more clarity on how to scrape data by using a for loop in Python.
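
First, to make the URL construction itself concrete, here is a minimal sketch that only builds and prints the paginated URLs; the base URL and N = 5 are assumed values for illustration:

Python

# build page/<i>/ URLs from an assumed base URL
BASE_URL = 'https://www.geeksforgeeks.org/'
N = 5  # number of pages to cover

urls = [f"{BASE_URL}page/{i}/" for i in range(1, N + 1)]
for url in urls:
    print(url)  # .../page/1/ through .../page/5/

With the URL pattern clear, the scraping code below applies it to a single page first.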

Python

import requests
from bs4 import BeautifulSoup as bs

# first page of the paginated article listing
URL = 'https://www.geeksforgeeks.org/page/1/'

req = requests.get(URL)
soup = bs(req.text, 'html.parser')

# the article titles live inside <div class="head"> elements
titles = soup.find_all('div', attrs={'class': 'head'})

print(titles[4].text)

Output:

Output for the above code

Now, using the above code, we can get the titles of all the articles by just sandwiching those lines with a loop.

Python

import requests
from bs4 import BeautifulSoup as bs

# base URL; the page number is appended on each iteration
URL = 'https://www.geeksforgeeks.org/page/'

for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    # titles[4:19] hold the 15 article headings on each page
    for i in range(4, 19):
        if page > 1:
            print(f"{(i - 3) + (page - 1) * 15} " + titles[i].text)
        else:
            print(f"{i - 3} " + titles[i].text)

Output:

Output for the above code

Note: The above code will fetch the first nine pages (page numbers 1 through 9) from the website and scrape the titles of the 135 articles that fall under those pages.
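
If, as mentioned in the introduction, you want to export or organize the scraped data afterwards instead of only printing it, the same loop can collect the titles into a list first. A minimal sketch along the lines of the code above, with the base URL assumed:

Python

import requests
from bs4 import BeautifulSoup as bs

URL = 'https://www.geeksforgeeks.org/page/'  # assumed base URL, as in the example above

all_titles = []
for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    # keep the same slice of <div class="head"> elements as the example above
    heads = soup.find_all('div', attrs={'class': 'head'})[4:19]
    all_titles.extend(h.text for h in heads)

print(len(all_titles))   # total number of titles collected
print(all_titles[:5])    # first few titles, now ready to be exported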

Example 2: Looping through a list of different URLs

The above technique is absolutely wonderful, but what if you need to scrape different pages and you don't know their page numbers? You'll need to scrape those different URLs one by one and manually code a script for every such webpage.

Instead, you could just make a list of these URLs and loop through them. By simply iterating the items in the list, i.e. the URLs, we will be able to extract the titles of those pages without having to write code for each page. Here's an example of how you can do it.

Python

import requests
from bs4 import BeautifulSoup as bs

# list of target URLs (example values; replace them with the pages you want to scrape)
URL = ['https://www.geeksforgeeks.org/',
       'https://www.geeksforgeeks.org/page/2/']

for url in range(0, 2):
    req = requests.get(URL[url])
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    for i in range(4, 19):
        if url + 1 > 1:
            print(f"{(i - 3) + url * 15} " + titles[i].text)
        else:
            print(f"{i - 3} " + titles[i].text)

Output:

Output for the above code

How to avoid getting your IP address banned?

Controlling the crawl rate is the most important thing to keep in mind when carrying out a very large extraction. Bombarding the server with multiple requests within a very short amount of time will most likely result in getting your IP address blacklisted. To avoid this, we can simply carry out our crawling in short random bursts of time. In other words, we add pauses or little breaks between crawling periods, which help us look like actual humans, as websites can easily identify a crawler because of the speed it possesses compared to a human trying to visit the website. This also helps avoid unnecessary traffic and overloading of the website servers. Win-win!

Now, how do we control the crawling rate? It's simple: by using two functions, randint() and sleep(), from the Python modules 'random' and 'time' respectively.

Python3

from random import randint
from time import sleep

print(randint(1, 10))

The randint() function will choose a random integer between the given lower and upper limits, in this case 1 and 10 respectively, for every iteration of the loop. Using the randint() function in combination with the sleep() function will help in adding short and random breaks to the crawling rate of the program. The sleep() function will basically stop the execution of the program for the given number of seconds. Here, the number of seconds will be fed randomly into the sleep() function by using the randint() function. Use the code given below for reference.

Python3

from random import randint
from time import sleep

for i in range(0, 3):
    # pick a random pause length between 2 and 5 seconds
    x = randint(2, 5)
    print(x)
    sleep(x)
    print(f'I waited {x} seconds')

Output

5
I waited 5 seconds
4
I waited 4 seconds
5
I waited 5 seconds

To give you a clear idea of this function in action, refer to the code given below.

Python3

import requests
from bs4 import BeautifulSoup as bs
from random import randint
from time import sleep

# base URL; the page number is appended on each iteration
URL = 'https://www.geeksforgeeks.org/page/'

for page in range(1, 10):
    req = requests.get(URL + str(page) + '/')
    soup = bs(req.text, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'head'})

    for i in range(4, 19):
        if page > 1:
            print(f"{(i - 3) + (page - 1) * 15} " + titles[i].text)
        else:
            print(f"{i - 3} " + titles[i].text)

    # pause for a random 2-10 seconds between pages to avoid hammering the server
    sleep(randint(2, 10))

Output:

The program has paused its execution and is waiting to resume

The output of the above code



Source: https://www.geeksforgeeks.org/how-to-scrape-multiple-pages-of-a-website-using-python/
