Extract web content from any website using PHP


Web scraping is the process of extracting data from an external website. It is especially useful when the website you want data from doesn’t offer an API or an RSS feed, so scraping is the only way to get it.

This technique has many uses, such as scraping weather information, temperatures, horoscopes, prices, or other specific data. The most common scraping techniques are:

  • Parsing through the DOM
  • Searching with regular expressions

In this tutorial, we will use the first one, as it is the most widely used and the easiest. You can also try the latter, but it’s generally a bad idea to parse HTML with regex.
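To make the difference concrete, here is a minimal sketch that pulls link targets out of the same snippet both ways. It uses PHP's built-in DOMDocument class rather than the library introduced below, and the sample markup and both function names are made up for illustration.

```php
<?php
// Hypothetical sample markup: the second link uses single quotes and an extra attribute.
$sampleHtml = '<p><a href="/one">One</a> <a class="x" href=\'/two\'>Two</a></p>';

// DOM approach: a real HTML parser copes with quoting style and attribute order.
function linksViaDom(string $html): array
{
    $doc = new DOMDocument();
    // Suppress warnings about imperfect markup, which is common on real pages.
    @$doc->loadHTML($html);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $links[] = $anchor->getAttribute('href');
    }
    return $links;
}

// Regex approach: a naive pattern that only expects <a href="...">.
function linksViaRegex(string $html): array
{
    preg_match_all('/<a href="([^"]*)"/', $html, $matches);
    return $matches[1];
}

print_r(linksViaDom($sampleHtml));   // finds both links
print_r(linksViaRegex($sampleHtml)); // misses the second link
```

The regex could be extended to cover single quotes and extra attributes, but every new variation in the markup needs another patch to the pattern, which is why the DOM route tends to be less fragile.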

Parsing through DOM

We will use an open source library called PHP Simple HTML DOM Parser. First, download it from its SourceForge page (https://sourceforge.net/projects/simplehtmldom/). Once downloaded, take a look at the simple_html_dom.php file: all the functions we will be using come from this file.

You can also check the examples that come with the library. Here is one:

<?php
include('../simple_html_dom.php');

// get DOM from URL or file
$html = file_get_html('http://www.google.com/');

// find all links
foreach($html->find('a') as $e)
    echo $e->href . '<br>';

// find all images
foreach($html->find('img') as $e)
    echo $e->src . '<br>';

// find all images with full tag
foreach($html->find('img') as $e)
    echo $e->outertext . '<br>';

// find all div tags with id=gbar
foreach($html->find('div#gbar') as $e)
    echo $e->innertext . '<br>';

// find all span tags with class=gb1
foreach($html->find('span.gb1') as $e)
    echo $e->outertext . '<br>';

// find all td tags with attribute align=center
foreach($html->find('td[align=center]') as $e)
    echo $e->innertext . '<br>';

// extract text from table
echo $html->find('td[align="center"]', 1)->plaintext . '<br><hr>';

// extract text from HTML
echo $html->plaintext;
?>

Today, we will use web scraping to extract the daily horoscope from HamroPatro. If you inspect the page, you will see that each horoscope is nested inside a <div class="item"></div> element, so we can extract the content of each of these elements.
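As a rough picture of what that structure means for extraction, the sketch below runs against a small made-up snippet shaped like the page: a title in an h3 and the text in a p inside each item block. It uses PHP's built-in DOMDocument and DOMXPath classes instead of the library, so it runs without any download; the sample markup and the extractItems name are assumptions for illustration only.

```php
<?php
// Hypothetical markup shaped like the page: an <h3> title and a <p> body per item.
$sampleHtml = <<<HTML
<div class="item"><h3>Aries</h3><p>First sample description.</p></div>
<div class="item featured"><h3>Taurus</h3><p>Second sample description.</p></div>
<div class="item"><p>No title here.</p></div>
HTML;

function extractItems(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // tolerate imperfect markup
    $xpath = new DOMXPath($doc);

    // Match every <div> whose class list contains the word "item".
    $query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' item ')]";

    $items = [];
    foreach ($xpath->query($query) as $div) {
        $title = $xpath->query('.//h3', $div)->item(0);
        $body  = $xpath->query('.//p', $div)->item(0);

        // Skip blocks that lack either element instead of failing on null.
        if ($title === null || $body === null) {
            continue;
        }
        $items[] = [
            'title'       => trim($title->textContent),
            'description' => trim($body->textContent),
        ];
    }
    return $items;
}

print_r(extractItems($sampleHtml));
```

The null check matters on real pages too: a block that matches the class but lacks a title or paragraph would otherwise cause an error when its text is read.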

To get started, import the simple_html_dom.php file into your PHP project.

require('simpleHtmlDom/simple_html_dom.php');

After importing the library, fetch the HTML of the page you want to extract data from. For this, we use the file_get_html function, which is defined in simple_html_dom.php.

$html = file_get_html('https://www.hamropatro.com/rashifal');

Note that you may get the following warning:

Warning: file_get_contents(): stream does not support seeking in /simple_html_dom.php

To resolve this, remove the $offset argument from line 75 of simple_html_dom.php. That is, the code on line 75:

$contents = file_get_contents($url, $use_include_path, $context, $offset);

becomes:

$contents = file_get_contents($url, $use_include_path, $context);
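If you would rather not edit the library file, one alternative is to download the page yourself and hand the resulting string to the library. The sketch below covers only the download step, using PHP's built-in stream functions; the function names, the timeout, and the User-Agent string are assumptions made for illustration.

```php
<?php
// Build an HTTP stream context with a timeout and a User-Agent header.
// The values are illustrative assumptions, not requirements of any site.
function buildScrapeContext(int $timeoutSeconds = 10)
{
    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            'timeout' => $timeoutSeconds,
            'header'  => "User-Agent: ExampleScraper/1.0\r\n",
        ],
    ]);
}

// Fetch the raw HTML as a string; returns null instead of false on failure.
function fetchHtml(string $url): ?string
{
    // No offset argument is passed, so the stream is never asked to seek.
    $contents = @file_get_contents($url, false, buildScrapeContext());
    return $contents === false ? null : $contents;
}
```

With the library loaded, the downloaded string can then be parsed with its str_get_html function, for example $html = str_get_html($rawHtml); after checking that $rawHtml = fetchHtml('https://www.hamropatro.com/rashifal') is not null.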

Once we have the HTML, we need to find every div with the item class and loop over them. The library makes this very easy:

$html->find('div.item')

Easy, isn’t it? 🙂 We can simply loop over the result of this call. I will create a new array to hold all the data; you can process it however you need. Our final code looks like this:

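<?php
require('simpleHtmlDom/simple_html_dom.php');

$html = file_get_html('https://www.hamropatro.com/rashifal');

$horoscope = [];

foreach ($html->find('div.item') as $rashifal) {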

    $horoscopeTitle       = $rashifal->find('h3', 0)->plaintext;
    $horoscopeDescription = $rashifal->find('p', 0)->plaintext;

    $horoscope[] = [
        'title'       => $horoscopeTitle,
        'description' => $horoscopeDescription
    ];
}

var_dump($horoscope);

This will output the result as below:


array(12) {
  [0]=>
  array(2) {
    ["title"]=>
    string(9) "मेष"
    ["description"]=>
    string(396) "मेष (चु, चे, चो, ला, लि, लु, ले, लो, अ) प्रेमीको सहयोग पाइनेछ । राजनीतिक कार्यमा सफलता मिल्नेछ । जोश जाँगर र हिम्मत बढ्नेछ । आँटेकोे र ताकेको कार्य पुरा होला । "
  }
  [1]=>
  array(2) {
    ["title"]=>
    string(9) "बृष"
    ["description"]=>
    string(432) "वृष (इ, उ, ए, ओ, वा, वि, वु, वे, वो) महत्वपूर्ण अवसर गुम्न सक्छ । अधिकार प्राप्तिका लागि संघर्श गर्नु पर्ला । यात्रामा मालसामान हराउन सक्छ । गरेका काममा ढिलासुस्ती हुनेछ । "
  }
  [2]=>
  array(2) {
    ["title"]=>
    string(15) "मिथुन"
    ["description"]=>
    string(405) "मिथुन (का, कि, कु, घ, ङ, छ, के, को, हा) अनुहारमा कान्ति र मनमा शान्ति छाउनेछ । शत्रु क्षय होलान् । नजिकको मित्रसँग भेटघाट हुनेछ । चिन्ता छाडेर चिन्तन गर्ने समयछ । "
  }
  [3]=>
  array(2) {
    ["title"]=>
    string(15) "कर्कट"
    ["description"]=>
    string(404) "कर्कट (हि, हु, हे, हो, डा, डि, डु, डे, डो) आम्दानी बढाउने काम सुरु गर्न सकिनेछ । चिताएको कामले तीव्रता लिनेछ । आफन्त र साथीभाइबाट सहयोग तथा हौसला प्राप्त हुनेछ । "
  }
  [4]=>
  array(2) {
    ["title"]=>
    string(12) "सिंह"
    ["description"]=>
    string(408) "सिंह (मा, मि, मु, मे, मो, टा, टि, टु, टे) आफन्तको सहयोग प्राप्त हुनेछ । साहित्यिक क्षेत्रमा रुची बढ्नेछ । धार्मिक यात्राको योगछ । कूटनीतिक नियोगको सहयोग मिल्ला । "
  }
  [5]=>
  array(2) {
    ["title"]=>
    string(15) "कन्या"
    ["description"]=>
    string(384) "कन्या (टो, पा, पि, पु, ष, ण, ठ, पे, पो) सानातिना समस्यामा अल्झिनुपर्नेछ । राजनैतिक कार्यमा बाधा आउनेछ । मुद्दा मामिला आइलाग्नेछ । प्रेममा मनमुटाव रहला ।"
  }
  [6]=>
  array(2) {
    ["title"]=>
    string(12) "तुला"
    ["description"]=>
    string(400) "तुला (रा, रि, रु, रे, रो, ता, ति, तु, ते) एकपछि अर्को अवसर आउनाले मन प्रसन्न रहनेछ । जोश जाँगर र हिम्मत बढ्नेछ । सुन्दर पहिरनले ब्यक्तित्वमा निखारता ल्याउनेछ । "
  }
  [7]=>
  array(2) {
    ["title"]=>
    string(21) "बृश्चिक"
    ["description"]=>
    string(451) "वृश्चिक (तो, ना, नि, नु, ने, नो, या, यि, यु) कला र गलाको प्रभाव बढ्नेछ । शुभ समाचार सुन्न पाईएला । वाक्चतुर्याईँले सङ्कल्प सिद्ध हुनेछ । पदीय जिम्मेवारी प्राप्त हुने सम्भावना छ । "
  }
  [8]=>
  array(2) {
    ["title"]=>
    string(9) "धनु"
    ["description"]=>
    string(413) "धनु (ये, यो, भा, भि, भु, धा, फा, ढा, भे) शुभारम्भको चर्चा चल्नेछ । भौतिक साधन जुटाउने समय छ । व्यवसायको सन्दर्भमा रमाइलो यात्रा होला । अतिथिको रूपमा सत्कार पाइएला । "
  }
  [9]=>
  array(2) {
    ["title"]=>
    string(9) "मकर"
    ["description"]=>
    string(425) "मकर (भो,जा,जि,जु,जे,जो,ख,खि,खु,खे,खो,गा,गि) अधिकार प्राप्तिका लागि संघर्श गर्नु पर्ला । कार्यमा बाधा आउनेछ । मनमा निरासा र शरीरमा आलस्य आउला । व्यवसायमा मन्दी आउनेछ । "
  }
  [10]=>
  array(2) {
    ["title"]=>
    string(15) "कुम्भ"
    ["description"]=>
    string(425) "कुम्भ (गु, गे, गो, सा, सि, सु, से, सो, दा) धनार्जनका नयाँ स्रोतहरू पत्ता लाग्नेछन् । रोमाञ्चक यात्रा होला । रोजगारको अवसर मिलनेछ । धर्म,कर्म तथा समाजसेवामा मन लाग्नेछ । "
  }
  [11]=>
  array(2) {
    ["title"]=>
    string(9) "मीन"
    ["description"]=>
    string(521) "मीन (दि, दु, थ, झ, ञ, दे, दो, चा, चि) पारिवार संग रमाइलो भेटघाट हुनेछ । रोकिएको काम दोहोर्याएर प्रयत्न गर्दा फाइदा हुनेछ । लाभदायक यात्रा हुनेछ । बिना प्रतिस्पर्धा फाइदा हुनेछ । - ज्यो.प. नारायणप्रसाद दुलाल "
  }
}
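If you want to pass this data to another program rather than dump it, encoding it as JSON is one option. The following is a minimal sketch: the two entries are shortened sample values in the same shape as the $horoscope array, and horoscopeToJson is a hypothetical helper name.

```php
<?php
// Two sample entries in the same shape as the $horoscope array built above.
$horoscope = [
    ['title' => 'मेष', 'description' => 'Sample description one.'],
    ['title' => 'मिथुन', 'description' => 'Sample description two.'],
];

function horoscopeToJson(array $horoscope): string
{
    // JSON_UNESCAPED_UNICODE keeps Devanagari text readable instead of \uXXXX escapes.
    $json = json_encode($horoscope, JSON_UNESCAPED_UNICODE | JSON_PRETTY_PRINT);
    if ($json === false) {
        throw new RuntimeException('JSON encoding failed: ' . json_last_error_msg());
    }
    return $json;
}

echo horoscopeToJson($horoscope), "\n";
```

Without JSON_UNESCAPED_UNICODE the output is still valid JSON, but every Nepali character is written as an escape sequence, which makes the result harder to read and larger.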

Note: You can also scrape with cURL. You can find out more about cURL in this article.
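For reference, a download with PHP's cURL extension might look like the sketch below. It assumes the extension is installed; the fetchWithCurl name, the timeout, and the User-Agent value are made up for illustration.

```php
<?php
// Download a page with the cURL extension; returns null on any failure.
function fetchWithCurl(string $url, int $timeoutSeconds = 10): ?string
{
    $handle = curl_init($url);
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'ExampleScraper/1.0', // hypothetical identifier
    ]);

    $body   = curl_exec($handle);
    $status = curl_getinfo($handle, CURLINFO_HTTP_CODE);

    // Treat transport errors and HTTP error statuses the same way.
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

The returned string is plain HTML, so it can be parsed in the same way as a page fetched by any other means.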

That’s it. You can find the source code in this GitHub repository (https://github.com/vijaymgr/web-scraping-for-horoscope). I hope this helps you get started with web scraping to extract content from any website. Please leave your feedback in the comments below.

Happy Coding!
