How to fetch data from a website using PHP?

25 replies
Hi,

I want to fetch data from one website and reproduce the same site on my new domain. The data should be stored in a MySQL database.

Please guide me if anyone knows how to do this.

Thanks.
#data #fetch #php #website
  • Profile picture of the author cgimaster
    I am not sure what you want here: do you simply want to fetch the data and post it to your site, do you want to clone the entire site, or do you want to set up a domain with some system/CMS behind it and feed it content from another site?

    Could you be a bit more specific and perhaps give us an example?

    Put simply: if you only want to fetch data, you can use PHP plus DOM to extract the data you want and repost it to your site. You could use a tool like HTTrack to clone an entire website as plain HTML, without any CMS or system behind it. Or you could use a CMS and copy the data in, or feed it automatically.
  • Profile picture of the author jeet020
    Hi,
    Thanks for the reply.

    I want to fetch data from another site into my site, the way we generally use an RSS feed to pull in and post data. I already do this on a few of my WordPress sites: a plugin fetches data via the RSS feed, and whenever the main site is updated, my site is updated too. So it becomes a completely autopilot site.

    Like the above, I want to fetch data using a PHP function or script instead of an RSS feed. Whenever the data is updated on the main site, the script should update the data on my site too.

    Here is an example I found of exactly what I want to do on my site:

    Example:

    Main site:
    3gpmobilemovies.com

    Clone site (a full copy of the site above):
    3gparena.in

    Please help me if you can do this.
    Thanks and regards..
  • Profile picture of the author SteveSRS
    This is called web scraping... and it may or may not be illegal, so be careful.

    Secondly, it is not the easiest topic. You can do it the 'stupid' way with file_get_contents() in PHP.
    But it is better to use cURL (just google it). Then, once you have the site data, you will have to use the DOM to analyze it and pick out the parts you want... this is the hardest part. If you aren't pretty advanced in PHP (or another language), I would recommend hiring somebody.
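
As a rough illustration of the cURL-then-DOM approach described in this post, here is a minimal sketch. The function names `fetch_page()` and `page_title()`, the user-agent string, and the timeout values are all made up for the example, and the usage part parses a literal string so that nothing touches the network. Check the target site's terms before pointing something like this at a real URL.

```php
<?php
// Sketch only: fetch_page() and page_title() are illustrative names,
// not functions from PHP itself or from this thread.

/**
 * Download a page with cURL. Returns the body, or null on any failure
 * (connection error or a non-2xx HTTP status).
 */
function fetch_page(string $url, int $timeoutSeconds = 15): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 10,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'ExampleFetcher/0.1', // placeholder value
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status < 200 || $status >= 300) {
        return null;
    }
    return $body;
}

/**
 * Pull the <title> text out of an HTML string using the DOM extension.
 */
function page_title(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }
    $previous = libxml_use_internal_errors(true); // keep sloppy markup from emitting warnings
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $titles = $doc->getElementsByTagName('title');
    return $titles->length > 0 ? trim($titles->item(0)->textContent) : null;
}

// Offline usage: parse a literal string rather than a live site.
$sample = '<html><head><title> Demo page </title></head><body><p>Hi</p></body></html>';
echo page_title($sample), PHP_EOL; // prints "Demo page"
```

With a real URL the two steps would chain as `page_title(fetch_page($url) ?? '')`, which returns null when the download fails.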
  • Profile picture of the author cgimaster
    This plugin might work for you, but the ideal would be to write your own PHP using cURL and DOM to fetch the wanted data and post it back.

    WordPress › WP Web Scraper « WordPress Plugins
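
For the "post back" step, one possible shape is sketched below, assuming the scraped content goes into MySQL through PDO and that a content hash is used to notice when the source page has changed. The `pages` table, its columns, and the function names are hypothetical; none of them come from this thread or from any plugin.

```php
<?php
// Sketch only. Assumed (hypothetical) table:
//   CREATE TABLE pages (
//     url          VARCHAR(255) PRIMARY KEY,
//     content_hash CHAR(64)     NOT NULL,
//     body         MEDIUMTEXT   NOT NULL
//   );

/** SHA-256 fingerprint of the fetched content. */
function content_hash(string $body): string
{
    return hash('sha256', $body);
}

/** True when there is no stored copy yet, or the stored hash differs. */
function has_changed(string $newBody, ?string $storedHash): bool
{
    return $storedHash === null || !hash_equals($storedHash, content_hash($newBody));
}

/**
 * Insert or update the row for $url, but only when the content changed.
 * Returns true if a write happened. Uses MySQL's ON DUPLICATE KEY UPDATE.
 */
function save_if_changed(PDO $pdo, string $url, string $body): bool
{
    $select = $pdo->prepare('SELECT content_hash FROM pages WHERE url = :url');
    $select->execute([':url' => $url]);
    $stored = $select->fetchColumn(); // false when no row exists
    $storedHash = $stored === false ? null : (string) $stored;

    if (!has_changed($body, $storedHash)) {
        return false;
    }

    $hash = content_hash($body);
    $upsert = $pdo->prepare(
        'INSERT INTO pages (url, content_hash, body)
         VALUES (:url, :hash, :body)
         ON DUPLICATE KEY UPDATE content_hash = :hash_new, body = :body_new'
    );
    $upsert->execute([
        ':url'      => $url,
        ':hash'     => $hash,
        ':body'     => $body,
        ':hash_new' => $hash,
        ':body_new' => $body,
    ]);
    return true;
}

// Connection details below are placeholders:
// $pdo = new PDO('mysql:host=localhost;dbname=example;charset=utf8mb4', 'user', 'secret',
//                [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// save_if_changed($pdo, 'https://example.com/page', $html);
```

Run from cron, something like this would only rewrite rows whose source content actually changed, which is roughly the autopilot behaviour asked for above.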
  • Profile picture of the author WebThinker
    Yes, I agree with Steve: this is far more complicated than a simple command we can help you with. There are many things to consider (in some cases you might even be forced to spoof IP addresses, use different user agents, etc., so your script doesn't get blocked).

    So my advice is also: hire somebody to write such a script for you, custom-tailored to your needs.
    • Profile picture of the author jeet020
      Originally Posted by SteveSRS View Post

      This is called web scraping... and it may or may not be illegal, so be careful.

      Secondly, it is not the easiest topic. You can do it the 'stupid' way with file_get_contents() in PHP.
      But it is better to use cURL (just google it). Then, once you have the site data, you will have to use the DOM to analyze it and pick out the parts you want... this is the hardest part. If you aren't pretty advanced in PHP (or another language), I would recommend hiring somebody.
      Thanks for the help.

      Originally Posted by WebThinker View Post

      Yes, I agree with Steve: this is far more complicated than a simple command we can help you with. There are many things to consider (in some cases you might even be forced to spoof IP addresses, use different user agents, etc., so your script doesn't get blocked).

      So my advice is also: hire somebody to write such a script for you, custom-tailored to your needs.
      Okay. Can you suggest someone who could write this script for me?

      Thanks all of you for your help
  • Profile picture of the author cgimaster
    Do not use regex to parse HTML:

    html - RegEx match open tags except XHTML self-contained tags - Stack Overflow

    Instead, use a proper HTML parser available in the programming/scripting language you are using.

    For instance, in PHP you have DOM, in C# you have HTML Agility Pack, and so on.
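
To make the "use DOM" suggestion concrete, here is a small sketch using PHP's built-in `DOMDocument` and `DOMXPath`. The `movie` class name, the sample markup, and the function name `extract_items()` are invented for the example. One of the sample links is deliberately left unclosed, to show that the parser repairs that kind of sloppy markup instead of failing.

```php
<?php
// Sketch only: extract_items() and the "movie" class are hypothetical.

/**
 * Return title/url pairs for every link inside <div class="movie">.
 */
function extract_items(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input
    }
    $previous = libxml_use_internal_errors(true); // collect parse warnings silently
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $items = [];
    foreach ($xpath->query('//div[@class="movie"]//a[@href]') as $link) {
        $items[] = [
            'title' => trim($link->textContent),
            'url'   => $link->getAttribute('href'),
        ];
    }
    return $items;
}

// Sample markup; the second <a> is never closed.
$html = '<div class="movie"><a href="/m/1">First title</a></div>'
      . '<div class="movie"><a href="/m/2">Second title</div>'
      . '<div class="other"><a href="/x">Ignored</a></div>';

foreach (extract_items($html) as $item) {
    echo $item['url'], ' => ', $item['title'], PHP_EOL;
}
```

The XPath expression is the part that would change per site. Matching on `@class="movie"` only works when the attribute is exactly that string, so real pages with several class names would need a `contains()` test instead.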
    • Profile picture of the author lordspace
      Originally Posted by cgimaster View Post

      Do not use regex to parse HTML:

      html - RegEx match open tags except XHTML self-contained tags - Stack Overflow

      Instead, use a proper HTML parser available in the programming/scripting language you are using.

      For instance, in PHP you have DOM, in C# you have HTML Agility Pack, and so on.
      If the HTML is not well structured, the parsing can break. With regex you can focus on a few tags with random attributes, and that's it.
  • Profile picture of the author SteveSRS
    @seodude, that post has no value AT ALL in this topic..

    @FirstSocialApps
    yes, if you use regex for this you can be sure it will break.. probably before you even use it for the first time

    @Brandon LOL, nice one catching that. On the other hand, if he can create a system to prevent them from scraping (which isn't that hard to do btw, but might also be a little bad for SEO), chances are he knows how to make them...

    I've done many of those, but I don't have the time to help you. Just make sure it gets done with a good DOM parser that is stable and doesn't take too much of your server's memory (it is a pretty CPU-intensive job!). Simple HTML DOM (google it) is a very good one to start building a scraper with.
    • Profile picture of the author FirstSocialApps
      Originally Posted by SteveSRS View Post

      @FirstSocialApps
      yes, if you use regex for this you can be sure it will break.. probably before you even use it for the first time
      Steve, with respect, that comment makes no sense at all. If it breaks before the first use, then the script was never finished, so it can't break, since there was nothing finished to break. See the paradox in your statement?

      Regex is fine depending on what you want to pull out of the content. For example, pulling out the title would work fine and is not likely to break.

      Since the OP did not specify what he wants from the content, and is obviously new to this type of work, I provided him with the simplest solution.
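
The title case mentioned here might look like the sketch below. The function name `title_via_regex()` and the pattern are just one way to write it, and this is the narrow single-tag case under discussion, not a general HTML parser.

```php
<?php
// Sketch only: title_via_regex() is an illustrative name.

/**
 * Grab the contents of the first <title> element with a regular expression.
 * Flags: i = case-insensitive, s = let "." match newlines.
 */
function title_via_regex(string $html): ?string
{
    if (preg_match('~<title[^>]*>(.*?)</title>~is', $html, $m) === 1) {
        // Decode entities such as &amp; and strip surrounding whitespace.
        return trim(html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8'));
    }
    return null;
}

echo title_via_regex("<html><head><TITLE>\n  Tom &amp; Jerry </TITLE></head></html>"), PHP_EOL;
// prints "Tom & Jerry"
```

It stays short because a page normally has a single title element with no nested tags. The same pattern style gets fragile quickly once the target can nest or repeat.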
      • Profile picture of the author cgimaster
        Originally Posted by FirstSocialApps View Post

        Steve, with respect, that comment makes no sense at all. If it breaks before the first use, then the script was never finished, so it can't break, since there was nothing finished to break. See the paradox in your statement?

        Regex is fine depending on what you want to pull out of the content. For example, pulling out the title would work fine and is not likely to break.

        Since the OP did not specify what he wants from the content, and is obviously new to this type of work, I provided him with the simplest solution.
        I don't see why you would want to reinvent the wheel when there are plenty of libraries for most scripting/programming languages that will let you parse HTML without a bogus regex.

        Regex is awesome; it's just not worth using for this anymore. It was a common approach 10+ years ago.

        There are plenty of reasons to prefer an HTML parsing library over regex on HTML, including (but not limited to) the fact that with regex you have to create and test each rule yourself, while the libraries have been developed for years and have been through many tests and fixes along the way.
        • Profile picture of the author FirstSocialApps
          Originally Posted by cgimaster View Post

          I don't see why you would want to reinvent the wheel when there are plenty of libraries for most scripting/programming languages that will let you parse HTML without a bogus regex.
          I don't know why everything has to be such a big argument on this forum. It's almost as if everyone is more concerned with proving that their way is better than with actually helping the person who asks the question.

          It's obviously not 'reinventing the wheel' when, as you just said, doing this with regex is a very old method. Also, as I said, if you want to pull out something very simple, it's not a bad way to go: fast, simple, and very little code.

          I have already explained why I chose to tell the OP to do it this way: it is the simplest method and he is obviously new to programming. Either you didn't read my posts, didn't understand them, or just ignored them.

          I'm not going to argue over which is 'better', as better is a subjective term. Is better the way that gives the fastest execution, that requires the fewest lines of code, that is most reliable, that is most self-contained, that is most ... on and on and on?

          In practice, a programmer must assign a value to each of these things and weigh the options. In my answer to the OP I assigned maximum value to the option that is simplest to understand. It makes no sense to tell a newbie to parse the DOM with a library he has never heard of when he has only a basic understanding of what the DOM is and has never used a third-party library.
  • Profile picture of the author fortsolution
    By getting the HTML data from the pages.
  • Profile picture of the author seowonder56
    If you want to grab data or content from the site, you should use the cURL library to fetch it; you can then save the content to a file without any problem.
  • Profile picture of the author cgimaster
    "because it is the most simple method and he is obviously new to programming"
    All the more reason not to recommend regex: not only can it introduce a lot of bugs, depending on the regex, that you will then have to rework, but regex is also not noob-friendly.
  • Profile picture of the author FirstSocialApps
    I guess that depends on the noob, and on whether using regex is simpler than understanding the DOM and an unknown library.
  • Profile picture of the author SteveSRS
    @FirstSocialApps

    I have to agree with cgimaster.. there are online tools that make regex a bit easier, but it is just a pain in the ass. Compare it with the Simple_HTML_DOM class: using the documentation, you can do simple tasks in just a few simple lines (copy / paste), which is a lot easier than regex, which breaks super easily and fast. That was also the point of my 'paradox statement'. I also put a smiley there, because I was jokingly saying it to make a point about how much using regex for that task sucks.
  • Profile picture of the author wizwebtechno
    If you want to fetch data using PHP, you should use the client URL library, called cURL in PHP; it can get you the content of the site.
  • Profile picture of the author wayfarer
    I prefer splitting on a delimiter in many cases, but regular expressions can save you a huge amount of work if you understand them. Sometimes they are the right choice, like when you need to know boundaries.
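
A minimal sketch of the delimiter-splitting idea, assuming the text you want sits between two fixed, known marker strings. The helper name `between()` and the sample markup are made up for illustration, and both markers must be non-empty because `explode()` rejects an empty delimiter.

```php
<?php
// Sketch only: between() is an illustrative helper name.

/**
 * Return the text between the first occurrence of $start and the next
 * occurrence of $end, or null if either marker is missing.
 * Both markers are assumed to be non-empty strings.
 */
function between(string $text, string $start, string $end): ?string
{
    $parts = explode($start, $text, 2); // [before, after] when $start is present
    if (count($parts) < 2) {
        return null;
    }
    $rest = explode($end, $parts[1], 2); // [wanted, remainder] when $end is present
    if (count($rest) < 2) {
        return null;
    }
    return $rest[0];
}

$html = '<div><span id="price">42 USD</span></div>';
echo between($html, '<span id="price">', '</span>'), PHP_EOL; // prints "42 USD"
```

This only holds up while the markers stay byte-for-byte identical on the source page; an extra attribute or different quoting would make it return null.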