Program that pulls information from Wikipedia.

I have been looking for quite a long time for a program that would pull information, such as the birth dates of famous people, for a sideline project I have planned.

It would then format the results like this:
{Name} {Date of Birth}
{Date of Death}
{URL Link}

and report back. I used Wikipedia as an example. I wouldn't consider this a scraper program (in the bad sense), since I would be linking back to the originating site and promoting it. All I want is for the information to be reported back in that form.

The reason I used Wikipedia as an example is that it has no real uniform format for dates of birth or death. Some pages use Month - Day - Year; others use "DOB" followed by the date. I wonder whether that would be a serious problem.

I'm beginning to wonder whether this is even a viable option.

Thanks.
  • Ehhhhh, that's a nasty one. You would have to regex for all possible patterns, then take the matches and see whether each one can be converted to a standard date format (YYYY-MM-DD). If it can, it's a date; if not, it was something else, so drop that particular record.
    • Wikipedia doesn't have an API, and every Wikipage is written differently. There is no pattern in the code.
      Wikipedia also tries to stick to W3C standards, so you won't find any custom tags that identify people, birthdays, or ages.

      Wikipedia is the wrong source to pull that kind of information.
    • Aside from that, you're SOL. If something can't be done in one language, it usually cannot be done in ANY language. A language doesn't dictate what you can DO, it dictates how EASILY it can be done.
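
For what it's worth, the regex-then-validate idea from the first reply could be sketched roughly like this in Python. The pattern list and the function name `extract_iso_dates` are made up for illustration and cover only three layouts; real pages would need more. It also assumes English month names, since `%B` depends on the locale.

```python
import re
from datetime import datetime

# Hypothetical (pattern, strptime format) pairs; extend as new layouts turn up.
CANDIDATES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),            # 1879-03-14
    (re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"), "%B %d, %Y"),  # March 14, 1879
    (re.compile(r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b"), "%d %B %Y"),    # 14 March 1879
]


def extract_iso_dates(text):
    """Return YYYY-MM-DD strings for every match that is a real calendar date."""
    found = []
    for pattern, fmt in CANDIDATES:
        for match in pattern.finditer(text):
            try:
                parsed = datetime.strptime(match.group(0), fmt)
            except ValueError:
                # Looked like a date but could not be converted: drop the record.
                continue
            found.append((match.start(), parsed.date().isoformat()))
    # Report the dates in the order they appear in the text.
    return [iso for _, iso in sorted(found)]


sample = "Born 14 March 1879, died April 18, 1955. Not a date: 31 February 1900."
print(extract_iso_dates(sample))
```

The `try`/`except` is the "see if it converts" step: anything that matches a pattern but is not a valid date, such as 31 February, is silently dropped.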
  • It can be done. However, it is a bit tedious.

    You can always display the string to a user as it appeared on the page and have them enter the date in a specified format (obviously not convenient if you are scraping 100,000 dates, but super easy if you are doing fewer than a couple thousand).

    Here is an example of an issue you might run into: 05/04/1979. How do you know if it is April 5th or May 4th? The solution is to be context-aware. Are you pulling from en.wiki? Then it's most likely mm/dd/yyyy, so you try that first. If you are pulling from a European-language wiki, it will most likely be dd/mm/yyyy. And if one of the first two numbers is bigger than 12, you immediately make that the day.

    As for regular expressions, I would avoid them. Parse strings and attempt to form coherent dates by splitting on common delimiters and coming up with logical criteria for what a date really is (are you OK with only mm/yyyy?). This way, it's easier to understand what you are doing, because you break the parsing process down into bite-size chunks instead of trying a one-size-fits-all solution. Using this method, you can catch 99% of the dates out there without a problem.
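
The split-on-delimiters approach from the last reply, together with its "bigger than 12 must be the day" rule, might look something like the sketch below. The function name `parse_numeric_date`, the `day_first` flag, and the delimiter set are assumptions chosen for illustration; it handles only purely numeric day/month/year strings and rejects anything else (including mm/yyyy) instead of guessing.

```python
from datetime import date

DELIMITERS = "/-."  # assumed set of common separators


def parse_numeric_date(text, day_first=False):
    """Parse strings like 05/04/1979; return a date, or None if incoherent.

    day_first is the context hint: False tries mm/dd/yyyy,
    True tries dd/mm/yyyy.
    """
    for ch in DELIMITERS:
        text = text.replace(ch, " ")
    parts = text.split()
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        return None
    first, second, year = (int(p) for p in parts)
    # A number above 12 cannot be a month, so it has to be the day.
    if first > 12:
        day, month = first, second
    elif second > 12:
        month, day = first, second
    elif day_first:
        day, month = first, second
    else:
        month, day = first, second
    try:
        return date(year, month, day)
    except ValueError:
        return None  # e.g. month 13 or 30 February


print(parse_numeric_date("05/04/1979"))                  # read as mm/dd/yyyy
print(parse_numeric_date("05/04/1979", day_first=True))  # read as dd/mm/yyyy
print(parse_numeric_date("25-12-1990"))                  # 25 can only be the day
```

Strings that come back as `None` are the ones you would hand to a person for manual entry, as suggested earlier in that reply.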
