16 replies
How do I do word splitting?

If I give buynow, it should give buy now.

If I give worldtraveltour, it should give world travel tour, and even the other words hidden inside it (world, travel, rave, our, tour) in such combinations.

If I give domainsitea, it should give domain site a.

etc. :p

Are any dictionary tools or class files available for this task?

thanks
#splitting #words
  • Profile picture of the author markfail
    hi,

    If you're using PHP, try the explode function: PHP: explode - Manual
    • Profile picture of the author Steve Diamond
      No, explode isn't going to help because you don't know where the words are divided. That's the whole point. The only way to do this is with a dictionary lookup, as the OP implied.

      I don't know of any existing classes that do this. It wouldn't be too hard to write one if you had a good dictionary, but the tricky part would be making it quick and efficient. (Obviously, Google is very good at it.)

      Steve
      • Profile picture of the author markfail
        Steve, you misunderstand.

        If you have an array of words already, you can use this array to check whether each word exists within the string and then extract it using explode.

        Either that or you can do it manually, but I know which one I would prefer...
        • Profile picture of the author chandan
          Originally Posted by markfail View Post

          Steve, you misunderstand.

          If you have an array of words already, you can use this array to check whether each word exists within the string and then extract it using explode.

          Either that or you can do it manually, but I know which one I would prefer...
          No, it's not possible to have the words in an array :p because the dictionary is too lengthy to put in an array.
          • Profile picture of the author Steve Diamond
            Originally Posted by chandan View Post

            No, it's not possible to have the words in an array :p because the dictionary is too lengthy to put in an array.
            Exactly. If you're thinking of PHP on a typical shared web server, the dictionary would be much too lengthy.

            If you have a dedicated server with plenty of RAM, you could possibly write a C application taking this approach. Or you could virtualize the array. Or you could pre-load only a subset of the most common words in the dictionary, then do a database lookup as a last resort.

            As I indicated in my first post, the tricky part is to do it quickly and efficiently.
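
The "pre-load a subset, then fall back to a database" idea could look roughly like the Python sketch below (the thread is about PHP, but the logic carries over). The word lists, the table name `words`, and the helper names `build_database` and `is_word` are all made up for illustration, and an in-memory SQLite database stands in for whatever database you would really use.

```python
import sqlite3

# Hypothetical tiny word lists; a real setup would load these from dictionary files.
COMMON_WORDS = {"buy", "now", "world", "travel", "tour"}
FULL_DICTIONARY = ["buy", "now", "world", "travel", "tour",
                   "rave", "our", "domain", "site"]


def build_database(words):
    """Create an in-memory SQLite table holding the full dictionary."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO words (word) VALUES (?)",
                     [(w,) for w in words])
    conn.commit()
    return conn


def is_word(candidate, conn):
    """Check the pre-loaded common words first, then fall back to the database."""
    candidate = candidate.lower()
    if candidate in COMMON_WORDS:
        return True
    row = conn.execute("SELECT 1 FROM words WHERE word = ?",
                       (candidate,)).fetchone()
    return row is not None


conn = build_database(FULL_DICTIONARY)
print(is_word("buy", conn))     # answered from the in-memory set
print(is_word("rave", conn))    # answered by the database lookup
print(is_word("ksadas", conn))  # not in either list
```

Whether this is actually fast enough depends on how many lookups each input string triggers, which is the tricky part mentioned above.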

            Steve
  • Profile picture of the author chandan
    thanks

    Actually, the input is random and can be anything, so the explode function doesn't fit.

    I just gave examples with buynow and worldtraveltour,

    but it can be a junk name like ksadas, which should be split to give the words sad and das too.
  • Profile picture of the author markfail
    How would it know which words to split?

    You can add the words you want to an array and then just check the array; if a word is found, split it.
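
For a small, known list of words, that suggestion might be sketched like this in Python (not PHP, but the idea ports directly). The function name `split_on_known_words` is invented for this example, and it only works when you already know which words to look for, which is the limitation raised elsewhere in the thread.

```python
import re


def split_on_known_words(text, known_words):
    """Split text around any known words it contains; leftovers are kept as-is."""
    if not known_words:
        return [text]
    # Put longer words first so a long word wins over a shorter word
    # that happens to be its prefix.
    ordered = sorted(known_words, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(w) for w in ordered) + ")"
    # re.split keeps the captured words; drop the empty strings between them.
    return [piece for piece in re.split(pattern, text) if piece]


print(split_on_known_words("buynow", ["buy", "now"]))
print(split_on_known_words("domainsitea", ["domain", "site"]))
```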
  • Profile picture of the author lisag
    Sometimes us programmers are guilty of trying to provide a solution to a problem we don't fully understand.

    Chandan, you told us WHAT you want to do, but not WHY you want to do it. If we understand why you are trying to do this, maybe a clear solution will pop up.
    • Profile picture of the author markfail
      Originally Posted by lisag View Post

      Sometimes us programmers are guilty of trying to provide a solution to a problem we don't fully understand.

      Chandan, you told us WHAT you want to do, but not WHY you want to do it. If we understand why you are trying to do this, maybe a clear solution will pop up.
      ah, very well said.
    • Profile picture of the author chandan
      Originally Posted by lisag View Post

      Sometimes us programmers are guilty of trying to provide a solution to a problem we don't fully understand.

      Chandan, you told us WHAT you want to do, but not WHY you want to do it. If we understand why you are trying to do this, maybe a clear solution will pop up.
      It will be used for name suggestions, e.g. when a user does a whois lookup on a domain, or a simple name search.
  • Profile picture of the author lisag
    I would start here:
    Eight word lists to help you creating the perfect word game : Emanuele Feronato

    Grab those keyword lists and build a MySQL table.

    Since you aren't looking for anagrams (that is, you don't want to find characters in random order, just in linear order), you need to iterate through the string one character at a time, concatenating the next character as you go.

    So, you take the string and look up its first character in the word list, then the first two characters, and so on. Whenever a word is found, you push it onto an array.

    Here's a matrix for the 11-character string: isthisright

    Character Position
    1
    1,2
    1,2,3
    1,2,3,4
    1,2,3,4,5
    1,2,3,4,5,6
    1,2,3,4,5,6,7
    1,2,3,4,5,6,7,8
    1,2,3,4,5,6,7,8,9
    1,2,3,4,5,6,7,8,9,10
    1,2,3,4,5,6,7,8,9,10,11
    2
    2,3
    2,3,4
    2,3,4,5
    2,3,4,5,6
    2,3,4,5,6,7
    2,3,4,5,6,7,8
    2,3,4,5,6,7,8,9
    2,3,4,5,6,7,8,9,10
    2,3,4,5,6,7,8,9,10,11
    3
    3,4
    3,4,5
    3,4,5,6
    3,4,5,6,7
    3,4,5,6,7,8
    3,4,5,6,7,8,9
    3,4,5,6,7,8,9,10
    3,4,5,6,7,8,9,10,11
    ...
    Continue through every start position until you have tested all the contiguous substrings against your word list.

    I think this is the correct progression order but someone feel free to chime in if I got it wrong.

    let's test: isthisright
    *= found word

    1 = I*
    1,2 = IS*
    1,2,3 = IST
    1,2,3,4 = ISTH
    1,2,3,4,5 = ISTHI
    1,2,3,4,5,6 = ISTHIS (ISTHIS is NOT a word). You already found Is, the word This will come later in the progression.

    1,2,3,4,5,6,7 = ISTHISR
    1,2,3,4,5,6,7,8 = ISTHISRI
    1,2,3,4,5,6,7,8,9 = ISTHISRIG
    1,2,3,4,5,6,7,8,9,10 = ISTHISRIGH
    1,2,3,4,5,6,7,8,9,10,11 = ISTHISRIGHT
    2 = S
    2,3 = ST
    2,3,4 = STH
    ...
    Continue through the matrix and you'll eventually make all the words.
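
The progression above can be sketched in Python as two nested loops over start and end positions. This is only an illustration: the function name `find_words` and the miniature word list are assumptions, and a plain in-memory set stands in for the MySQL table.

```python
def find_words(text, dictionary):
    """Return (start, end, word) for every contiguous substring that is a word.

    Positions are 1-based and inclusive, matching the matrix above.
    """
    text = text.lower()
    found = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in dictionary:
                found.append((start + 1, end, candidate))
    return found


# Hypothetical miniature word list, for illustration only.
WORDS = {"i", "is", "this", "his", "hi", "right", "rig"}

for start, end, word in find_words("isthisright", WORDS):
    print(start, end, word)
```

An n-character string has n(n+1)/2 substrings, so the 11-character example needs 66 lookups. That is cheap per string, but it adds up if every lookup is a database query.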
    • Profile picture of the author CMartin
      Whatever solution you use be careful with situations like:
      wordsexpress
      wordsexchange

      Doing it on a character-by-character basis to find dictionary words might give you some unexpected/undesired results.

      Even Google makes mistakes when analyzing/splitting these kinds of strings into words... and it was (don't know if it still is) one of the reasons that many domains were flagged as adult domains.

      Carlos
      • Profile picture of the author lisag
        Originally Posted by CMartin View Post

        Whatever solution you use be careful with situations like:
        wordsexpress
        wordsexchange

        Doing it on a character-by-character basis to find dictionary words might give you some unexpected/undesired results.

        Even Google makes mistakes when analyzing/splitting these kinds of strings into words... and it was (don't know if it still is) one of the reasons that many domains were flagged as adult domains.

        Carlos
        Good catch Carlos. It would be a simple process to build a "kill list" of words you don't want to display.
        • Profile picture of the author lisag
          Originally Posted by lisag View Post

          Good catch Carlos. It would be a simple process to build a "kill list" of words you don't want to display.
          Here's a list to get you started.

          ** WARNING **
          This link leads to a dirty word list that you may find offensive. Its intended use is to build a dirty word filter and not to cater to anyone's prurient interests. If dirty words offend you, don't click.

          http://drupal.org/files/issues/dirtywords.txt
        • Profile picture of the author CMartin
          Originally Posted by lisag View Post

          Good catch Carlos. It would be a simple process to build a "kill list" of words you don't want to display.
          The point of the examples I provided was not to "kill" words from the string, but to split them correctly:
          - wordsexpress should be split as: words express
          - wordsexchange should be split as: words exchange

          Hmmm... but then who can guarantee, to me or anyone else, that the way they are split is in fact the correct way? Maybe the domain owner really registered "word sex press" or "word sex change".

          In other words... there will always be domain strings that can be split in several ways with very different meanings. Developing an algorithm to deal with these (and many other) types of situations can be very complex if there's a need to be somewhat "perfect" when splitting domain strings into words, not to mention if there's also a need to optimize it for speed.
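
To make the ambiguity concrete, here is a small Python sketch that lists every way a string can be split completely into dictionary words. The function name `all_segmentations` and the word list are assumptions chosen to reproduce the two examples above. It does not attempt to pick the "right" split.

```python
def all_segmentations(text, dictionary):
    """Return every way to split text completely into dictionary words."""
    if not text:
        return [[]]  # one way to split nothing: no words at all
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep this word and split whatever remains after it.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


# Hypothetical word list, just large enough to show the problem.
WORDS = {"word", "words", "sex", "express", "press", "exchange", "change"}

for split in all_segmentations("wordsexpress", WORDS):
    print(" ".join(split))
```

Both splits come back as equally valid. Choosing between them needs something beyond a dictionary, such as word frequencies, and this plain recursion can get slow on long strings without caching.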

          Carlos
  • Profile picture of the author HomeComputerGames
    Interesting...

    Here are some dictionaries for you: Kevin's Word List Page

    Now find a good one and loop through the words, counting characters and picking out words from your concatenated string.
    In PHP, use strpos() to grab your first word, mark the position of the next character to start a new loop at, and check what is left over... then print these or assign them to your array and discard the garbage.
    Of course, you have to deal with someone trying xapplesandoranges.


    So for every first-word loop that finds an initial match, you have to run the inner pattern matching until you run out of characters or dictionary words.
    Then move to your next potential phrase.
    All while checking currently found phrases.
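
One simple reading of that loop is a greedy left-to-right scan: take the longest dictionary word at the current position, and if nothing matches, set that character aside as garbage and move on. The Python sketch below assumes that reading. `greedy_split` and the word list are invented names, and the post itself suggests PHP's strpos() rather than this approach.

```python
def greedy_split(text, dictionary):
    """Scan left to right, taking the longest dictionary word at each position.

    Characters that do not start any word are collected separately as junk.
    """
    words, junk = [], []
    pos = 0
    while pos < len(text):
        match = None
        # Try the longest candidate first, shrinking until a word is found.
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                match = text[pos:end]
                break
        if match:
            words.append(match)
            pos += len(match)
        else:
            junk.append(text[pos])
            pos += 1
    return words, junk


# Hypothetical word list for the example string from the post.
WORDS = {"apple", "apples", "and", "sand", "or", "oranges", "range"}
print(greedy_split("xapplesandoranges", WORDS))
```

Greedy matching is fast, but it commits to the first long word it sees and can miss a better split later in the string.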

    I think there are close to 3/4 of a million words in the English language, not counting slang. I'm not sure how many words are in any of those dictionaries.

    Of course, you would want to include a thesaurus so you can have related phrases sent back also (joking).

    Yeah, that would take some thought how to optimize....
    The system would need to "learn" somehow so it would record common phrases in order to become faster over time utilizing the dictionary less and less.
    Might make for an interesting project.
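
The simplest form of that "learning" is a cache: remember the result for each string already split, so a repeated request never touches the dictionary again. The Python sketch below shows only that. The class name `CachingSplitter` and its counter are made up for illustration, and a real system would presumably persist the cache between requests.

```python
class CachingSplitter:
    """Remember earlier results so repeated inputs skip the dictionary scan."""

    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.cache = {}            # input string -> words found earlier
        self.dictionary_scans = 0  # how many times the slow path ran

    def split(self, text):
        if text in self.cache:
            return self.cache[text]
        self.dictionary_scans += 1
        words, pos = [], 0
        while pos < len(text):
            # Longest match first; the else branch runs if no word was found.
            for end in range(len(text), pos, -1):
                if text[pos:end] in self.dictionary:
                    words.append(text[pos:end])
                    pos = end
                    break
            else:
                pos += 1  # no word starts here; skip one character
        self.cache[text] = words
        return words


splitter = CachingSplitter({"buy", "now"})
print(splitter.split("buynow"))
print(splitter.split("buynow"))   # second call is served from the cache
print(splitter.dictionary_scans)
```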

    We'll develop it on your servers though since it may take hours to run killing everything else while it ran LOL

    good luck
