Not the usual duplicate content question...

7 replies
How do companies like Copyscape determine the % difference between two pieces of writing? Is there a recognized algorithm?

For example, take these two sentences:

The car ran out of gas.

The anesthetist ran out of gas.

Is there a 20% difference because one word out of five has changed, or is there a 100% difference because the entire sentence has a changed meaning?

Martin
  • Profile picture of the author Andyhenry
    Not sure, but Google use something similar in their 'knols'. All knols have a 'similar content found' system which assesses whether they're original or not, I guess to be factored in somewhere along the line.

    Andy
    Signature

    nothing to see here.

  • Profile picture of the author Miguel Oliveira
    I'm not sure about this, but I think the computer has no way of knowing it has changed meaning, so I would guess it is 20% different. I would also suggest 25% different, as it is likely that the algorithm ignores common words, like "the" and "of".
    • Profile picture of the author Martin Avis
      Of course it would help if I could count - at least up to 6!

      In my example, the literal word difference is 16.6% (one in six) - not 20%.

      Martin
      Signature
      Martin Avis publishes Kickstart Newsletter - Subscribe free at http://kickstartnewsletter.com
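
The arithmetic in the posts above can be checked with a short sketch. It is only a literal word-by-word comparison, not Copyscape's method, and the `STOPWORDS` set (just "the" and "of", as Miguel suggested) and the function names are assumptions made for illustration.

```python
# Hypothetical helper names; this is a naive word-level comparison only.
STOPWORDS = frozenset({"the", "of"})  # assumption: the "common words" to ignore

A = "The car ran out of gas."
B = "The anesthetist ran out of gas."


def words(text, ignore=frozenset()):
    """Lowercase the text, strip simple punctuation, drop ignored words."""
    tokens = (w.strip(".,!?").lower() for w in text.split())
    return [w for w in tokens if w and w not in ignore]


def word_difference(a, b, ignore=frozenset()):
    """Fraction of word positions that differ (texts must be the same length)."""
    wa, wb = words(a, ignore), words(b, ignore)
    if len(wa) != len(wb):
        raise ValueError("this sketch only compares texts with equal word counts")
    changed = sum(x != y for x, y in zip(wa, wb))
    return changed / len(wa)


print(round(word_difference(A, B) * 100, 1))             # 16.7 (one word in six)
print(round(word_difference(A, B, STOPWORDS) * 100, 1))  # 25.0 (one word in four)
```

Counting all six words gives Martin's 16.6%, and dropping "the" and "of" gives Miguel's 25%.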
  • Profile picture of the author freddie_fireman
    This article dated Sep 2007 may be of interest ...

    How Does Copyscape find Plagiarism? - Web Promotion and Algo Cracking Blog

    Summary ...

    We used Phrase Mixer, one of the synonymizer software features, to see if Copyscape is fooled by phrase scrambling. It is not.

    Synonym replacement is a good alternative because it does not destroy the meaning of the phrase. However, any word will do. Copyscape does not discriminate between a meaningful replacement and a senseless one. Anything that prevents a 4-6 word text string from being an exact duplication of another prevents infringement. You can use any word or letter, but punctuation marks will not do the trick.
    Copyscape starts looking for plagiarism in texts longer than 14 words.

    If the duplicated text is more than 70% of the total document, you will need 1 every 6 words replaced. If your duplicated (borrowed, stolen, pirated, plagiarized) text is over 70% of the total web page, you will need 1 every 4 words replaced.
    Signature
    Water shapes its course according to the nature of the ground over which it flows. – Sun Tzu, 600 B.C.

    freddie fireman
  • Profile picture of the author grumpyb
    The car ran out of gas. or The anesthetist ran out of gas.

    I find it difficult to believe that, in the millions of publications online and elsewhere, these simple phrases would not have been used countless times.
    So even if you write a brand new piece, there must surely be a reasonable chance, statistically, that it would be very difficult to come up with something completely original and unique.
  • Profile picture of the author Jon Alexander
    They use shingling, as far as I'm aware. 3-word phrases seem to be their preferred target. They concentrate on text, not meaning, afaik. You want to 'pass' Copyscape? Misspell every 3rd word. Although why you'd want to is another question!
    Signature
    http://www.contentboss.com - automated article rewriting software gives you unique content at a few CENTS per article!. New - Put text into jetspinner format automatically! http://www.autojetspinner.com

    PS my PM system is broken. Sorry I can't help anymore.
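
The shingling idea Jon mentions can be sketched as follows. This is a generic w-shingling comparison scored with Jaccard similarity, under the assumption of 3-word shingles; it is not confirmed to be what Copyscape does, and the function names are hypothetical.

```python
# A minimal w-shingling sketch; names and the choice of k=3 are assumptions.

def shingles(text, k=3):
    """Return the set of all runs of k consecutive words in the text."""
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    tokens = [w for w in tokens if w]
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b, k=3):
    """Shared shingles divided by all distinct shingles (1.0 = identical)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # both texts are shorter than k words
    return len(sa & sb) / len(sa | sb)


A = "The car ran out of gas."
B = "The anesthetist ran out of gas."
print(round(jaccard(A, B), 3))  # 0.333: 2 shared shingles out of 6 distinct

ORIGINAL = "one two three four five six seven eight nine"
EVERY_THIRD = "one two XXX four five XXX seven eight XXX"
print(jaccard(ORIGINAL, EVERY_THIRD))  # 0.0: every 3-word window was touched
```

Under this measure the two example sentences are about 67% different rather than 16.6%, and changing every third word leaves no 3-word shingle intact, so the texts score as completely different even though two thirds of the words are unchanged.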
    • Profile picture of the author Martin Avis
      I'm not trying to get past Copyscape; I'm just curious as to how their percentage difference is calculated.

      If, as Jon suggests, you need to change every 3rd word, would that result in a document that is 100% different, or 33% different?

      In other words, if Copyscape report a 65% difference, for example, what does that really mean?

      Martin
      Signature
      Martin Avis publishes Kickstart Newsletter - Subscribe free at http://kickstartnewsletter.com