Word splitting

How do I do word splitting?

If I give buynow, it should give buy now.

If I give worldtraveltour, then world travel tour, or even combinations such as world travel rave our tour.

If I give domainsitea, it should give domain site a.

etc. :p

Are any dictionary tools or class files available for this task?

Thanks
#programming #splitting #words
  • Hi,

    If you're using PHP, try the explode function: PHP: explode - Manual
    • No, explode isn't going to help because you don't know where the words are divided. That's the whole point. The only way to do this is with a dictionary lookup, as the OP implied.

      I don't know of any existing classes that do this. It wouldn't be too hard to write one if you had a good dictionary, but the tricky part would be making it quick and efficient. (Obviously, Google is very good at it.)

      Steve
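
      A minimal sketch of the dictionary-lookup idea, in Python rather than PHP. The name split_once and the tiny WORDS set are made up for illustration; a real word list would be far larger. It keeps lookups cheap by using a hash set and by never testing a candidate longer than the longest dictionary word. It returns only one valid split, not every combination.

```python
def split_once(text, dictionary):
    """Return one way to split text into dictionary words, or None."""
    words = set(dictionary)                     # hash set: O(1) average lookup
    longest = max(map(len, words), default=0)   # no word can be longer than this
    n = len(text)
    # back[i] holds the start index of a word that ends at i on a valid
    # split of text[:i], or None if text[:i] cannot be split at all.
    back = [None] * (n + 1)
    back[0] = 0                                 # the empty prefix is always valid
    for end in range(1, n + 1):
        for start in range(max(0, end - longest), end):
            if back[start] is not None and text[start:end] in words:
                back[end] = start
                break
    if back[n] is None:
        return None
    # Walk the back-pointers from the end to recover the words.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]


# Hypothetical stand-in for a real dictionary.
WORDS = {"buy", "now", "domain", "site", "a"}

print(split_once("buynow", WORDS))
print(split_once("domainsitea", WORDS))
print(split_once("ksadas", WORDS))
```

      With this word list, ksadas comes back as None because no sequence of dictionary words covers the whole string.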
  • Thanks.

    Actually, the input is random and can be anything, so the explode function doesn't fit.

    I only gave buynow and worldtraveltour as examples.

    But it can be a junk name like ksadas, which should also be split, giving the words sad and das.
  • How would it know which words to split?

    You can add the words you want to an array and then just check the array; if a word is found, split it.
  • Sometimes we programmers are guilty of trying to provide a solution to a problem we don't fully understand.

    Chandan, you told us WHAT you want to do, but not WHY you want to do it. If we understand why you are trying to do this, maybe a clear solution will pop up.
    • ah, very well said.
    • It will be used for name suggestions, for example when a user looks up the whois of a domain or does a simple name search.
  • I would start here:
    Eight word lists to help you creating the perfect word game : Emanuele Feronato

    Grab those keyword lists and build a MySQL table.

    Since you aren't looking for anagrams (that is, you don't want characters in random order, just linear order), you need to iterate through the string, starting at each character in turn and appending the next character as you go.

    So you take the string and look up the first character on its own, then the first two, and so on. Whenever the current substring is a word, you push it onto an array.

    Here's a matrix for the 11 character string: isthisright

    Character Position
    1
    1,2
    1,2,3
    1,2,3,4
    1,2,3,4,5
    1,2,3,4,5,6
    1,2,3,4,5,6,7
    1,2,3,4,5,6,7,8
    1,2,3,4,5,6,7,8,9
    1,2,3,4,5,6,7,8,9,10
    1,2,3,4,5,6,7,8,9,10,11
    2
    2,3
    2,3,4
    2,3,4,5
    2,3,4,5,6
    2,3,4,5,6,7
    2,3,4,5,6,7,8
    2,3,4,5,6,7,8,9
    2,3,4,5,6,7,8,9,10
    2,3,4,5,6,7,8,9,10,11
    3
    3,4
    3,4,5
    3,4,5,6
    3,4,5,6,7
    3,4,5,6,7,8
    3,4,5,6,7,8,9
    3,4,5,6,7,8,9,10
    3,4,5,6,7,8,9,10,11
    ...
    Continue through every start position until you have tested all the substrings against your word list.

    I think this is the correct progression order but someone feel free to chime in if I got it wrong.

    let's test: isthisright
    * = found word

    1 = I*
    1,2 = IS*
    1,2,3 = IST
    1,2,3,4 = ISTH
    1,2,3,4,5 = ISTHI
    1,2,3,4,5,6 = ISTHIS (ISTHIS is NOT a word). You already found IS; the word THIS will come later in the progression, once the start position reaches character 3.

    1,2,3,4,5,6,7 = ISTHISR
    1,2,3,4,5,6,7,8 = ISTHISRI
    1,2,3,4,5,6,7,8,9 = ISTHISRIG
    1,2,3,4,5,6,7,8,9,10 = ISTHISRIGH
    1,2,3,4,5,6,7,8,9,10,11 = ISTHISRIGHT
    2 = S
    2,3 = ST
    2,3,4 = STH
    ...
    Continue through the matrix and you'll eventually make all the words.
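
    The progression above can be sketched in a few lines of Python. This is only an illustration: find_words and the five-word WORDS set are invented for the example, and a real run would load one of the word lists instead. It reports each dictionary word together with the 0-based position where it starts.

```python
def find_words(text, dictionary):
    """Return (start, word) for every dictionary word found as a substring."""
    text = text.lower()
    found = []
    for start in range(len(text)):                  # rows of the matrix
        for end in range(start + 1, len(text) + 1):  # grow one character at a time
            candidate = text[start:end]
            if candidate in dictionary:
                found.append((start, candidate))
    return found


# Hypothetical stand-in for a real word list.
WORDS = {"i", "is", "this", "his", "right"}

for start, word in find_words("isthisright", WORDS):
    print(start, word)
```

    Note that this only lists candidate words. It does not yet decide which of them fit together to cover the whole string.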
    • Whatever solution you use, be careful with situations like:
      wordsexpress
      wordsexchange

      Matching dictionary words character by character might give you some unexpected or undesired results.

      Even Google makes mistakes when splitting this kind of string into words, and that was (I don't know if it still is) one of the reasons many domains were flagged as adult domains.

      Carlos
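
      To see the ambiguity concretely, here is a small Python sketch that lists every way a string can be covered by dictionary words. The name all_splits and the WORDS set are assumptions made for the example, not an existing library.

```python
from functools import lru_cache


def all_splits(text, dictionary):
    """Return every way to write text as a sequence of dictionary words."""
    words = frozenset(dictionary)

    @lru_cache(maxsize=None)
    def split_from(start):
        # All splits of text[start:], each as a list of words.
        if start == len(text):
            return [[]]
        results = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words:
                for rest in split_from(end):
                    results.append([piece] + rest)
        return results

    return [" ".join(parts) for parts in split_from(0)]


# Hypothetical stand-in for a real word list.
WORDS = {"word", "words", "sex", "express", "press", "exchange", "change"}

print(all_splits("wordsexpress", WORDS))
print(all_splits("wordsexchange", WORDS))
```

      Both readings are valid as far as the dictionary is concerned, so picking the intended one needs something extra, such as word frequencies or a blocklist.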
  • Interesting...

    Here are some dictionaries for you: Kevin's Word List Page

    Now find a good one and loop through the words, counting characters and picking out words from your concatenated string.
    In PHP, use strpos() to grab your first word, mark the position of the next character as the start of a new loop, and check what is left over. Then print the matches or assign them to your array, and discard the garbage.
    Of course, you have to deal with someone trying xapplesandoranges.


    So for every first-word loop that finds an initial match, you have to run inner pattern matching until you run out of characters or dictionary words.
    Then move on to your next potential phrase,
    all while checking the phrases found so far.
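
    One way to handle garbage such as the leading x, sketched in Python instead of PHP: allow any character to be skipped, and prefer the split that skips the fewest. This swaps the strpos() loop for a small dynamic-programming pass; best_split and WORDS are hypothetical names for this example.

```python
def best_split(text, dictionary):
    """Split text into dictionary words, skipping as few characters as possible.

    Returns (skipped_count, words).
    """
    n = len(text)
    # best[i] = (skipped_count, words) for the suffix text[i:]
    best = [None] * (n + 1)
    best[n] = (0, [])
    for start in range(n - 1, -1, -1):
        # Option 1: treat this character as garbage and skip it.
        skipped, pieces = best[start + 1]
        choice = (skipped + 1, pieces)
        # Option 2: start a dictionary word here, if that skips fewer characters.
        for end in range(start + 1, n + 1):
            word = text[start:end]
            if word in dictionary:
                skipped, pieces = best[end]
                if skipped < choice[0]:
                    choice = (skipped, [word] + pieces)
        best[start] = choice
    return best[0]


# Hypothetical stand-in for a real word list.
WORDS = {"apple", "apples", "and", "or", "orange", "oranges"}

skipped, words = best_split("xapplesandoranges", WORDS)
print(skipped, words)
```

    Ties between equally good splits are resolved by whichever is found first, so a real tool would probably want a better ranking.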

    I think there are close to three quarters of a million words in the English language, not counting slang. I'm not sure how many words are in any of those dictionaries.

    Of course, you would want to include a thesaurus so you can have related phrases sent back too. Just joking.

    Yeah, it would take some thought to work out how to optimize this.
    The system would need to "learn" somehow, recording common phrases so that it becomes faster over time and relies on the dictionary less and less.
    Might make for an interesting project.

    We'll develop it on your servers, though, since it may take hours to run and kill everything else while it runs LOL

    good luck