How to fetch data from a website using PHP?

25 replies
Hi,

I want to fetch data from one website and build exactly the same website on my new domain. The data should be stored in a MySQL database.

Please guide me if anyone knows how to do this.

Thanks.
#programming #data #fetch #php #website
  • I am unsure what you really want here: whether you want to simply fetch the data and post it to your site, to clone the entire site, or to set up a domain with some system/CMS behind it and pull the content from another site.

    Could you be a bit more specific and perhaps give us an example?

    To put it simply: if you only want to fetch data, you can use PHP with DOM to extract the data you want and repost it to your site. You could use a tool like HTTrack to clone an entire website as plain HTML, without any CMS or system behind it. Or you could use a CMS and copy the data over or feed it in automatically.
  • Hi,
    Thanks for the reply.

    I want to fetch data from another site into my site, the way we generally use an RSS feed to get data and post it to our own site. I already do this on a few of my WordPress sites: a plugin fetches data via the RSS feed, and whenever the main site is updated, the data on my site is updated too. So it becomes a complete autopilot site.

    In the same way, I want to fetch data using a PHP function or script instead of an RSS feed, so that whenever data is updated on the main site, the script updates the data on my site too.

    Here is an example I found of exactly what I want to do on my site:

    Example:

    Main site:
    3gpmobilemovies.com

    Clone site (Full copied site from above site):
    3gparena.in

    Please help me if you can do this.
    Thanks and regards..
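
One way the "update my copy whenever the main site changes" idea could be sketched is to hash each fetched page and only write to the database when the hash differs from the stored one. This is a minimal sketch, not a complete scraper: the table `scraped_pages`, its columns, and the helper `syncPage()` are all hypothetical names, and it assumes the page content has already been fetched and that you are allowed to copy it.

```php
<?php
// Assumed table (hypothetical name and columns), created beforehand in MySQL:
//   CREATE TABLE scraped_pages (
//       url          VARCHAR(255) PRIMARY KEY,
//       content_hash CHAR(64)     NOT NULL,
//       content      MEDIUMTEXT   NOT NULL
//   );

// Hypothetical helper: store $content for $url only when it has changed.
// Returns true when a row was inserted or updated, false when nothing changed.
function syncPage(PDO $pdo, string $url, string $content): bool
{
    $hash = hash('sha256', $content);

    // Look up the hash stored for this URL, if any.
    $select = $pdo->prepare('SELECT content_hash FROM scraped_pages WHERE url = ?');
    $select->execute([$url]);
    $stored = $select->fetchColumn(); // false when there is no row yet

    if ($stored === $hash) {
        return false; // same content as last time, nothing to do
    }

    if ($stored === false) {
        $sql = 'INSERT INTO scraped_pages (content_hash, content, url) VALUES (?, ?, ?)';
    } else {
        $sql = 'UPDATE scraped_pages SET content_hash = ?, content = ? WHERE url = ?';
    }
    $pdo->prepare($sql)->execute([$hash, $content, $url]);

    return true;
}

// Usage (placeholder credentials, run from cron or similar):
//   $pdo = new PDO('mysql:host=localhost;dbname=example;charset=utf8mb4', 'user', 'pass');
//   $changed = syncPage($pdo, 'https://example.com/page', $html);
```

The plain SELECT followed by INSERT or UPDATE is used here only because it is easy to read; a single MySQL upsert statement would be a reasonable alternative.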
  • This is called web scraping... and it might or might not be illegal, so be careful.

    Secondly, it is not the easiest topic. You can do it the 'stupid' way with file_get_contents() in PHP.
    But it is better to use cURL (just google it). Once you have the site data, you will have to use the DOM to analyze it and pick out the parts you want; this is the hardest part. If you aren't fairly advanced in PHP (or another language), I would recommend hiring somebody.
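
For reference, the cURL route mentioned above could look roughly like this. It is a sketch under stated assumptions: `fetchPage()` is a hypothetical helper name, the user agent string is a placeholder, and the timeout and redirect limits are arbitrary choices. It needs PHP's curl extension.

```php
<?php
// Hypothetical helper: download a page body with cURL.
// Returns the body as a string, or null on a transport error or HTTP status >= 400.
function fetchPage(string $url, int $timeoutSeconds = 10): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }

    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // follow redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,
        CURLOPT_TIMEOUT        => $timeoutSeconds,
        CURLOPT_USERAGENT      => 'ExampleFetcher/1.0', // placeholder value
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}

// Usage (placeholder URL):
//   $html = fetchPage('https://example.com/');
//   if ($html === null) { /* log the failure and retry later */ }
```

Compared with file_get_contents(), this gives explicit control over timeouts, redirects, and the user agent, which is most of what makes cURL the better choice here.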
  • This plugin might work for you, but the ideal would be to write your own PHP script using cURL and DOM to fetch the data you want and post it back.

    WordPress › WP Web Scraper « WordPress Plugins
  • Yes, I agree with Steve: this is far more complicated than a simple command we can help you with. There are many things to consider (in some cases you might even be forced to spoof IP addresses, use different user agents, etc., so your script doesn't get blocked).

    So my advice is also: hire somebody who will write such a script for you, custom-tailored to your needs.
    • Thanks for the help.

      Okay. Can you suggest someone who can make this script for me?

      Thanks, all of you, for your help.
  • Touché, Brandon! I hadn't even thought about it this way :)
  • Do not use REGEX to parse HTML:

    html - RegEx match open tags except XHTML self-contained tags - Stack Overflow

    Instead use a proper HTML Parser available within the programming/scripting language you are using.

    For instance, in PHP you have DOM, in C# you have HTML Agility Pack, and so on.
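
As an illustration of the parser approach, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath to pull every link out of a page. The helper name `extractLinks()` and the choice of links as the target are assumptions for the example; character encoding handling is left out.

```php
<?php
// Hypothetical helper: return the href and text of every <a href="..."> in $html.
function extractLinks(string $html): array
{
    if ($html === '') {
        return []; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();

    // Real-world HTML is often malformed; collect parser warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $links;
}

// Usage:
//   foreach (extractLinks($html) as $link) {
//       echo $link['href'], ' => ', $link['text'], PHP_EOL;
//   }
```

The parser repairs unclosed tags on its own, so the XPath query keeps working on markup that is not well formed.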
    • If the HTML is not well structured, the parsing can break. With regex you can focus on a few tags with arbitrary attributes and that's it.
  • @seodude, that post has no value AT ALL in this topic..

    @FirstSocialApps
    Yes, if you use regex for this, you can be sure it will break... probably before you even use it for the first time.

    @Brandon LOL, nice one for catching that. On the other hand, if he can create a system to prevent them from scraping (which isn't that hard to do, by the way, but might be a little bad for SEO), chances are he knows how to make them...

    I've done many of those, but I don't have the time to help you. Just make sure it is done with a good, stable DOM structure analyzer that doesn't take too much memory on your servers (it is a pretty CPU-intensive job!). Simple HTML DOM (google it) is a very good one to start building a scraper with.
    • Steve, with respect, that comment makes no sense at all. If it breaks before the first use, then the script was never finished, so it can't break. See the paradox in your statement?

      Regex is fine depending on what you want to pull out of the content. For example, pulling out the title would work fine and is not likely to break.

      Since the OP did not specify what he wants from the content and is obviously new to this type of work, I provided him with the simplest solution.
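
The title case mentioned above could be sketched with preg_match() as follows. The helper name `extractTitle()` is hypothetical, and the pattern simply takes the first `<title>` element it finds, so it would also pick up a `<title>` inside inline SVG if one appeared before the page title.

```php
<?php
// Hypothetical helper: return the text of the first <title> element, or null if none.
function extractTitle(string $html): ?string
{
    // i = case-insensitive, s = let "." match newlines inside the title
    if (preg_match('~<title\b[^>]*>(.*?)</title>~is', $html, $m) !== 1) {
        return null;
    }

    // Decode entities such as &amp; and collapse runs of whitespace.
    $title = html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $title));
}

// Usage:
//   $title = extractTitle($html) ?? '(no title)';
```

This works because a title contains only text, not nested markup; the same pattern style becomes fragile as soon as the target can contain other tags.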
  • By getting the HTML data from the pages.
  • If you want to grab data or content from the site, you should use the cURL library to get it; you can then save the content to a file without any problem.
  • More reason not to recommend regex: not only can it introduce a lot of bugs, depending on the pattern, that you will have to rework, but regex is also not noob-friendly.
  • I guess that depends on the noob, and on whether using regex is simpler than understanding the DOM and an unknown library.
  • @First SocialApps

    I have to agree with cgimaster. There are online tools that make regex a bit easier, but it is just a pain in the ass. Compare it with the Simple_HTML_DOM class: using the documentation, you can do simple tasks in a few simple lines (copy/paste), which is a lot easier than regex, which breaks super easily and fast. That was also the point of my 'paradox statement'. I also put a smiley there, because I was jokingly making a point about how much using regex for that task sucks.
    • OK, I concede Simple_HTML_DOM is easier. And I was being a smart #$$ when I mentioned the 'paradox statement'; that's why I put the smiley after it.
  • If you want to fetch data using PHP, you should use the client URL library, called cURL in PHP; it can get you the content of the site.
  • I prefer splitting on a delimiter in many cases, but regular expressions can save you a huge amount of work if you understand them. Sometimes they are the right choice, like when you need to know boundaries.
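
Splitting on a delimiter, as described above, could be sketched with explode(). The helper name `between()` and the sample markers are assumptions for illustration; it only works when the text around the value is stable.

```php
<?php
// Hypothetical helper: return the text between the first $start marker
// and the next $end marker after it, or null if either marker is missing.
function between(string $text, string $start, string $end): ?string
{
    if ($start === '' || $end === '') {
        return null; // explode() rejects an empty separator
    }

    // Split once at the start marker; the part after it is index 1.
    $afterStart = explode($start, $text, 2);
    if (count($afterStart) < 2) {
        return null;
    }

    // Split the remainder once at the end marker; the part before it is index 0.
    $beforeEnd = explode($end, $afterStart[1], 2);
    if (count($beforeEnd) < 2) {
        return null;
    }

    return $beforeEnd[0];
}

// Usage:
//   $price = between('<b>Price:</b> 42 USD<br>', '</b> ', '<br>');
```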
