How to calculate uniqueness algorithmically?

10 replies
Does anyone know how to calculate the uniqueness of an article in spinning syntax? Or at least how to determine how 'different' two articles are?

Some sample code would be great (language doesn't matter).

My goal is to create x unique articles from an article in spinning syntax that are as 'different' as possible.
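
One possible way to approach this goal, as a rough sketch only: generate candidate articles, then greedily keep the candidate that is farthest from everything already chosen. The names `WordDistance` and `PickMostDifferent` are hypothetical, and the word-set distance is just an assumed stand-in for whatever uniqueness measure you end up using.

```php
<?php
// Hypothetical distance: percentage of distinct words not shared by both texts.
function WordDistance($text1, $text2)
{
    $a = array_flip(str_word_count(strtolower($text1), 1));
    $b = array_flip(str_word_count(strtolower($text2), 1));

    $union = count($a + $b);
    if ($union == 0)
        return 0.0;

    return 100.0 * (1 - count(array_intersect_key($a, $b)) / $union);
}

// Hypothetical helper: greedily picks up to $x candidates, each time taking the
// one whose smallest distance to the already chosen articles is largest.
function PickMostDifferent(array $candidates, $x)
{
    $candidates = array_values($candidates);
    if (count($candidates) == 0 || $x < 1)
        return array();

    // Start with the first candidate (an arbitrary choice)
    $chosen = array(array_shift($candidates));

    while (count($chosen) < $x && count($candidates) > 0) {
        $bestIndex = 0;
        $bestScore = -1.0;

        foreach ($candidates as $i => $candidate) {
            // Distance to the closest already chosen article
            $distances = array();
            foreach ($chosen as $article)
                $distances[] = WordDistance($candidate, $article);
            $score = min($distances);

            if ($score > $bestScore) {
                $bestScore = $score;
                $bestIndex = $i;
            }
        }

        $chosen[] = $candidates[$bestIndex];
        array_splice($candidates, $bestIndex, 1);
    }

    return $chosen;
}

$articles = array("the quick brown fox", "the quick brown dog", "a lazy cat sleeps");
print_r(PickMostDifferent($articles, 2));
```

The greedy pick is not guaranteed to find the best possible set, but it avoids comparing every combination of x articles.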
#algorithmically #calculate #uniqueness
  • Profile picture of the author SoftwareProjects
    Michael,

    You could create a hash of all words in each article, then compare two articles and score based on the percentage of words that don't appear in both articles.
    • Profile picture of the author dinushiya
      There is a website that does this for you: CopyScape.
      plagiarismchecker is also good.
      • Profile picture of the author Michael R.
        @SoftwareProjects:
        Looking at the occurrence of words alone is not enough to calculate uniqueness. That way, a sentence would not count as different from the same sentence with its words in reverse order.
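
        A minimal sketch of one order-sensitive alternative, assuming word pairs (bigram "shingles") and a Jaccard-style score are acceptable. The names `WordShingles` and `ShingleUniqueness` are hypothetical, and the cleanup step assumes plain ASCII text. A pure word-set comparison sees the two sentences below as identical, while this one does not:

        ```php
        <?php
        // Hypothetical helper: returns the set of word n-grams ("shingles") in $text.
        function WordShingles($text, $n = 2)
        {
            // Lowercase and keep only letters, digits and whitespace
            $text  = preg_replace('/[^a-z0-9\s]/', '', strtolower($text));
            $words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);

            // Using shingles as keys stores each one only once
            $shingles = array();
            for ($i = 0; $i + $n <= count($words); $i++)
                $shingles[implode(' ', array_slice($words, $i, $n))] = true;

            return $shingles;
        }

        // Hypothetical helper: uniqueness in percent = 100 * (1 - shared / union).
        function ShingleUniqueness($text1, $text2, $n = 2)
        {
            $a = WordShingles($text1, $n);
            $b = WordShingles($text2, $n);

            $union = count($a + $b);
            if ($union == 0)
                return 0.0;

            $shared = count(array_intersect_key($a, $b));
            return round(100.0 * (1 - $shared / $union), 1);
        }

        // Only "a sentence" is shared among 5 distinct word pairs
        echo ShingleUniqueness("This is a sentence.", "A sentence is this."), "\n"; // prints 80
        ```

        A larger `$n` makes the score more sensitive to word order, at the cost of treating small edits as big differences.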

        @dinushiya:
        I'm not looking for a service to calculate uniqueness, but for a way to implement the calculation. That's why I posted in 'Programming Talk'.
  • Profile picture of the author mojojuju
    Would something like PHP's similar_text function work for you?
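
    For reference, a short sketch of how `similar_text` could be called on two article variants. The sample strings are only an illustration; the third argument is filled in by reference with a similarity percentage.

    ```php
    <?php
    $article1 = "This is a sentence.";
    $article2 = "A sentence is this.";

    // similar_text() returns the number of matching characters and stores a
    // similarity percentage in the optional third argument.
    $matching = similar_text($article1, $article2, $percent);

    echo "Matching characters: $matching\n";
    echo "Similarity: " . round($percent, 1) . "%\n";
    echo "Uniqueness: " . round(100 - $percent, 1) . "%\n";
    ```

    Keep in mind that the comparison is character-based rather than word-based, that swapping the two arguments may give a different result, and that the PHP manual describes the algorithm as cubic in the length of the longer string, so it may be slow on long articles.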
  • Profile picture of the author SoftwareProjects
    Michael,

    I was suggesting comparing a hash of words, so that the order of words as well as the number of times each word appears will not be relevant.

    The score will be based on the percentage of unique words that appear in one article but not the other.
    • Profile picture of the author Michael R.
      Originally Posted by SoftwareProjects:

      Michael,

      I was suggesting comparing a hash of words, so that the order of words as well as the number of times each word appears will not be relevant.

      The score will be based on the percentage of unique words that appear in one article but not the other.

      Consider the following text in spinning syntax:
      Code:
      {This is a sentence.|A sentence is this.}
      According to your suggestion, the uniqueness of this text would be 0%, wouldn't it?

      However, TheBestSpinner calculates 80%, whatever that means...
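
      To compare variants at all, the spinning syntax first has to be expanded. A minimal sketch, assuming the usual `{option|option}` form with optional nesting; the name `ExpandSpintax` is hypothetical, and this is not meant to reflect how TheBestSpinner works internally:

      ```php
      <?php
      // Hypothetical helper: expands spinning syntax like "{a|b} c" into every
      // possible variant by repeatedly resolving the first innermost group.
      function ExpandSpintax($text)
      {
          // Find the first {...} group that contains no further braces
          if (!preg_match('/\{([^{}]*)\}/', $text, $m, PREG_OFFSET_CAPTURE))
              return array($text);   // nothing left to expand

          $start  = $m[0][1];
          $length = strlen($m[0][0]);

          $variants = array();
          foreach (explode('|', $m[1][0]) as $option) {
              // Substitute one option, then expand whatever groups remain
              $expanded = substr_replace($text, $option, $start, $length);
              $variants = array_merge($variants, ExpandSpintax($expanded));
          }

          // Drop duplicate variants and reindex
          return array_values(array_unique($variants));
      }

      print_r(ExpandSpintax("{This is a sentence.|A sentence is this.}"));
      ```

      Each pair of variants can then be passed to whatever comparison function you settle on. The number of variants grows exponentially with the number of groups, so for a full article it is probably more practical to sample random variants than to enumerate all of them.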
  • Profile picture of the author SoftwareProjects
    Hi Michael,

    Never said anything about a spinning syntax.

    Just wrote up this sample code for you:

    <?php
    $article1 = "This is a sentence.";
    $article2 = "A sentence is this";

    $hash1 = BuildWordHash($article1);
    $hash2 = BuildWordHash($article2);

    echo "The two articles are ".CompareWordHash($hash1, $hash2)."% alike\r\n";

    // Returns a hash (word => true) of the distinct lowercase words in $body
    function BuildWordHash($body)
    {
        $words = array();

        // Get rid of garbage (punctuation)
        $body = str_replace(array(",", ".", ":", ";"), "", $body);

        // Build hash: using words as keys makes order and repeat counts irrelevant
        foreach (preg_split('/\s+/', $body, -1, PREG_SPLIT_NO_EMPTY) as $word)
            $words[strtolower($word)] = true;

        return $words;
    }

    // Returns how alike two word hashes are, as a whole-number percentage
    function CompareWordHash($arr_words1, $arr_words2)
    {
        // Number of distinct words, counted per article
        $total_words = count($arr_words1) + count($arr_words2);

        // Two empty articles: treat as identical and avoid dividing by zero
        if ($total_words == 0)
            return "100";

        // Count the words that appear in one article but not in the other
        $unique_words = count(array_diff_key($arr_words1, $arr_words2))
                      + count(array_diff_key($arr_words2, $arr_words1));

        $percentage = number_format((1 - ($unique_words / $total_words)) * 100, 0);

        return $percentage;
    }
    ?>

    Enjoy!
    • Profile picture of the author Michael R.
      Originally Posted by SoftwareProjects:

      Hi Michael,

      Never said anything about a spinning syntax.

      Just wrote up this sample code for you:

      ....

      Enjoy!

      That's exactly what I said. The code calculates 0% uniqueness (or: "The two articles are 100% alike"), while TheBestSpinner says that the uniqueness is 80%.

      I wonder what this 80% means...

      Edit: I just realized that the calculation in TheBestSpinner doesn't make sense:

      Guess what the uniqueness of the following 'article' is?
      Code:
      {test test test test test test|test test test test test test}
      You were right! 86%!
  • Profile picture of the author SoftwareProjects
    Hi Michael,

    You're welcome :-)

    Not sure how good TheBestSpinner is, but the code I pasted works.
  • Profile picture of the author sweetseo
    I have spent a lot of time spinning articles, but I found it is just a waste of time. You can write articles in notepad2 instead; it will help you a lot in writing articles.
