Program that pulls information from Wikipedia.

6 replies
I have been looking for quite a long time for a program that would pull information, such as the birth dates of famous people, for a side project I have planned.

It would then put the results in a form with:
{Name} {Date of Birth}
{Date of Death}
{URL Link}

and report back. I used Wikipedia as an example. I wouldn't think of this as a scraper program (in the bad sense), since I would be linking back to the originating site and promoting it. All I want is the information reported back in that form.

The reason I used Wikipedia as an example is that it has no uniform format for birth dates or dates of death. Some pages use Month-Day-Year; others use "DOB" followed by the date. Would that be a serious problem?

I'm beginning to wonder whether this is even a viable option.

Thanks.
  • ionisis
    Ehhhhh, that's a nasty one. You would have to regex for all possible patterns, then take the matches and see if they can be converted to a standard date format (YYYY-MM-DD). If so, it's a date; if not, it was something else, so drop that particular record.
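
A rough Python sketch of that match-then-validate idea, for what it's worth. The `CANDIDATES` list and the `find_dates` name are made up for illustration, the three patterns are nowhere near exhaustive, and the month-name formats assume an English locale:

```python
import re
from datetime import datetime

# Hypothetical pattern list: each regex is paired with the strptime format
# used to validate whatever it matches. Extend it as new layouts turn up.
CANDIDATES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),
    (re.compile(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b"), "%d %B %Y"),
]


def find_dates(text):
    """Return every match that converts cleanly to YYYY-MM-DD."""
    found = []
    for pattern, fmt in CANDIDATES:
        for match in pattern.findall(text):
            try:
                found.append(datetime.strptime(match, fmt).date().isoformat())
            except ValueError:
                # Looked like a date but is not one, so drop it.
                continue
    return found
```

Something like "Section 31, 2001" matches the second pattern but fails the conversion, so it gets dropped instead of being stored as a date.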
    • K Meier
      Wikipedia doesn't have an API, and every Wikipedia page is written differently. There is no pattern in the code.
      Wikipedia also tries to stick to W3C standards, so you won't find any custom tags identifying people, birthdays, or ages.

      Wikipedia is the wrong source to pull that kind of information.
      • IMStudentforlife
        Thank you ionisis and Londrag for your replies.

        I wondered whether Wikipedia would be viable at all. They have tons of information, but as you say, it's not in any specific format. It's all over the place. For dates of birth, for example, there is no uniform way of posting them.

        Could it be possible, however, to pull the information from a site such as Wikipedia using several different algorithms, then arrange it all into a specific database once extracted?

        Is there even a program or programming language that could perform such a task?
        Signature
        Old School SEO and IM, 1MediaZone

        Running low on inspiration?
        The Strangest Secret in the World
    • ionisis
      Originally Posted by ionisis View Post

      You would have to regex for all possible patterns, then take the matches and see if they can be converted to a standard date format (YYYY-MM-DD). If so, it's a date; if not, it was something else, so drop that particular record.
      Aside from that, you're SOL. If something can't be done in one language, it usually cannot be done in ANY language. A language doesn't dictate what you can DO; it dictates how EASILY it can be done.
      • IMStudentforlife
        Originally Posted by ionisis View Post

        Aside from that, you're SOL. If something can't be done in one language, it usually cannot be done in ANY language. A language doesn't dictate what you can DO; it dictates how EASILY it can be done.
        I wasn't really considering using different languages to pull the information. What I actually had in mind was pulling all the information in its various forms, then conforming it to, for example, Month-Day-Year once I had it.

        I do realize it's looking like SOL. It's just a fun project with no money in it, mostly a way to do a better job of organizing all the information.

        Thanks for replying.
  • freehugs
    It can be done. However, it is a bit tedious.

    You can always display the string to a user as it appeared on the page and have them enter the date in a specified format (obviously not convenient if you are scraping 100,000 dates, but super easy if you are only doing fewer than a couple thousand).

    Here is an example of an issue you might run into: 05/04/1979. How do you know if it is April 5th or May 4th? The solution is to be context-aware. Are you pulling from en.wiki? Then it's most likely mm/dd/yyyy, so you try that first. If you are pulling from a European zone, then it will be dd/mm/yyyy. And if one of the first two numbers is bigger than 12, you immediately make that one the day.
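
Roughly what that could look like in Python. This is only a sketch: `resolve_numeric_date` and its `month_first` flag are names invented here, and it assumes the string is always three numbers separated by slashes with the year last:

```python
from datetime import date


def resolve_numeric_date(text, month_first=True):
    """Interpret 'a/b/yyyy', falling back on the source's convention
    only when the numbers themselves do not settle the question."""
    first, second, year = (int(part) for part in text.split("/"))
    if first > 12:
        # A value above 12 cannot be a month, so it must be the day.
        day, month = first, second
    elif second > 12:
        day, month = second, first
    elif month_first:
        # Ambiguous: trust the convention of the site being read.
        month, day = first, second
    else:
        day, month = first, second
    # date() raises ValueError if the combination is not a real date.
    return date(year, month, day)
```

With the default, `resolve_numeric_date("05/04/1979")` comes back as May 4th; pass `month_first=False` for a day-first source and the same string comes back as April 5th.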

    As for regular expressions, I would avoid them. Parse strings and attempt to form coherent dates by splitting on common delimiters and coming up with logical criteria for what a date really is (are you OK with only mm/yyyy?). This way, it's easier to understand what you are doing, because you break the parsing process down into bite-size chunks instead of trying a one-size-fits-all solution. Using this method, you can catch 99% of the dates out there without a problem.
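
A minimal Python sketch of the split-and-classify idea, with no regex involved. The function name, the delimiter set, and the `allow_missing_day` switch are assumptions made for the example, and it only covers dates that spell out the month in English; all-numeric strings need the separate day/month disambiguation step:

```python
from datetime import date

MONTH_NAMES = ["january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def parse_loose_date(text, allow_missing_day=False):
    """Return (year, month, day) or None; day may be None if allowed."""
    # Step 1: turn the common delimiters into spaces and split.
    for delimiter in "/-.,":
        text = text.replace(delimiter, " ")
    year = month = day = None
    # Step 2: classify each token by what it can plausibly be.
    for token in text.split():
        if token.lower() in MONTHS:
            month = MONTHS[token.lower()]
        elif token.isdigit() and len(token) == 4:
            year = int(token)
        elif token.isdigit() and day is None:
            day = int(token)
    # Step 3: apply the criteria for what counts as a date.
    if year is None or month is None:
        return None
    if day is None and not allow_missing_day:
        return None
    try:
        date(year, month, day or 1)  # rejects things like 30 February
    except ValueError:
        return None
    return (year, month, day)
```

The `allow_missing_day` switch is where the "are you OK with only mm/yyyy?" decision lives: leave it off and "March 1879" is rejected, turn it on and you get the year and month with no day.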
