Help Deleting Duplicates In CSV File

31 replies
I have HUGE CSV files (1.5GB) that I need to delete duplicate records from.

I normally just use Excel 2010, and I can do it with the push of a button. But these files exceed the 1,000,000-row limit.

I've searched the internet, found a couple PHP scripts that are supposed to do it, but I'm not exactly the sharpest tool in the shed when it comes to programming.

Before I go through the hassle of hiring someone on oDesk and paying them for what should be 2 minutes of work, I thought I'd ask here.

I've attached an example of the CSV file (obviously a small one) in hopes that someone can help. I'd really appreciate it.

For the purpose of these files, anytime a keyword is used twice the record is considered a duplicate, even if other data in the row is different.

Hope that makes sense. The file is small, but I had to zip it because I guess you can't attach CSV files.

Thanks,
Paul
#csv #deleting #duplicates #file
  • Profile picture of the author tryinhere
    Sub deletemyduplicates()
        'tryinhere
        'Sort the data block on column A, then walk down the column and delete
        'any row whose column A value matches the row directly above it.
        Range("A2:Q2").Select
        Range(Selection, Selection.End(xlDown)).Select
        Selection.Sort Key1:=Range("A2"), Order1:=xlAscending, Header:=xlNo, _
            OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom, _
            DataOption1:=xlSortNormal
        Range("A3").Select
        Do Until ActiveCell.Value = ""
            If ActiveCell.Value = ActiveCell.Offset(-1, 0).Value Then
                ActiveCell.EntireRow.Delete
                ActiveCell.Offset(-1, 0).Select
            End If
            ActiveCell.Offset(1, 0).Select
        Loop
        MsgBox "You're good to go"
    End Sub
    Open up your CSV file in Excel and insert this macro, then run it. Make a copy first, just in case.
    To add the macro, open the file and go to Tools > Macro > Visual Basic Editor,
    then select Insert > Module.
    Paste the code above into that page, then save and close the editor.

    Return to your worksheet, select Tools > Macro > Macros > deletemyduplicates, and then Run.
    Signature
    | > Choosing to go off the grid for a while to focus on family, work and life in general. Have a great 2020 < |
    • Profile picture of the author packerfan
      Originally Posted by tryinhere View Post

      Open up your CSV file in Excel and insert this macro, then run it. Make a copy first, just in case.
      To add the macro, open the file and go to Tools > Macro > Visual Basic Editor,
      then select Insert > Module.
      Paste the code above into that page, then save and close the editor.

      Return to your worksheet, select Tools > Macro > Macros > deletemyduplicates, and then Run.
      This has to be done in Excel, right? The problem is that the entire file won't open because it exceeds a million rows.

      Is there another program that I can use to open it and still run it?
      Signature

      Nothing to see here

      • Profile picture of the author Boudewijn
        Originally Posted by packerfan View Post

        This has to be done in Excel, right? The problem is that the entire file won't open because it exceeds a million rows.

        Is there another program that I can use to open it and still run it?
        1 million rows doesn't sound like a lot. I could help you out here, but then you'd have to send the file to me (zipped).
        I can then remove the duplicates for you and return the CSV.
        b [dot] lutgerink [at] betuwe [dot] net
        Just let me know if you are interested.

        It should not take much time, probably about 10 minutes.

        BL
        • Profile picture of the author packerfan
          Scroll up - I posted a step-by-step guide on my blog showing how to do this with more than 1 million records.
          Signature

          Nothing to see here

  • Profile picture of the author telofint
    Thank you, I am also going to try it.
    • Profile picture of the author mojojuju
      I used the sed command on it. Learned how to do it from this page (ctrl+f "delete duplicate, nonconsecutive lines from a file").

      edit: scratch that. This is an even better way... (source)
      awk '!x[$0]++' oldfile > newfile
      awk and sed commands are typically found on Linux or UNIX systems, but if you're using Windows, you can install cygwin.
      Signature

      :)

      • Profile picture of the author packerfan
        Originally Posted by mojojuju View Post

        I used the sed command on it. Learned how to do it from this page (ctrl+f "delete duplicate, nonconsecutive lines from a file").

        Is there a way to do this on a Windows machine, or will I need to use my Mac?

        I don't have linux installed anywhere. I'd prefer a solution that could be done from my PC or Mac, but it's not the end of the world if I need to use only my Mac.

        So just to make sure I understand...

        Below is the command I run. I have no idea what any of that stuff means! Should I replace any of that with the directory/filename? Sorry for the dumb questions. I have no problem learning, I just have little to no experience with command-line scripting or anything.

        And assuming this works on my mac, do I just open up a terminal session and do it there?

        sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
        Signature

        Nothing to see here

        • Profile picture of the author mojojuju
          Originally Posted by packerfan View Post


          And assuming this works on my mac, do I just open up a terminal session and do it there?

          sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
          No, see my edit. Don't use that command. Just open up a terminal and do this:
          awk '!x[$0]++' oldfile > newfile

          where 'oldfile' is the name of the file that you want to check for duplicate lines. and 'newfile' is where you want to store the new version (the one with duplicates removed).

          I assume Macs have tools like awk. If it doesn't you can install cygwin and get everything you need to do this.

          Another option is to install linux, or run a linux live cd/usb, or install linux in virtualbox.
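          If installing cygwin turns out to be more trouble than it's worth, the same line-level de-dupe can be sketched in a few lines of Python (which your Mac almost certainly has already) - just a rough equivalent of that awk one-liner, with placeholder file names:

          # keep the first occurrence of each exact line, like awk '!x[$0]++'
          seen = set()
          with open("test.csv") as src, open("test1.csv", "w") as dst:
              for line in src:
                  if line not in seen:   # first time we've seen this exact line
                      seen.add(line)
                      dst.write(line)

          Like the awk version, it holds every unique line in memory, so a 1.5 GB file will want a decent chunk of RAM.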
          Signature

          :)

          • Profile picture of the author packerfan
            Originally Posted by mojojuju View Post

            No, see my edit. Don't use that command. Just open up a terminal and do this:
            awk '!x[$0]++' oldfile > newfile

            where 'oldfile' is the name of the file that you want to check for duplicate lines. and 'newfile' is where you want to store the new version (the one with duplicates removed).

            I assume Macs have tools like awk. If it doesn't you can install cygwin and get everything you need to do this.

            Another option is to install linux, or run a linux live cd/usb, or install linux in virtualbox.
            Okay, I'm installing cygwin now. I don't want to have to install linux just to do this. Thanks for all your help.

            One last question... Sorry for so many. Assume old file is called test.csv and new will be test1.csv

            awk '!x[$0]++' test.csv > test1.csv

            Or

            awk '!x[$0]++' c:\directory\test.csv > c:\directory\test1.csv

            Which would be correct?
            Signature

            Nothing to see here

            • Profile picture of the author packerfan
              Okay, never mind about where the file needs to be - that seems obvious after opening the program. However, I get this error when I try to run it. The picture clearly shows the file on my desktop (the screenshot shows the terminal and the file, both on my desktop).

              What am I doing wrong?
              Signature

              Nothing to see here

            • Profile picture of the author mojojuju
              Originally Posted by packerfan View Post

              One last question... Sorry for so many. Assume old file is called test.csv and new will be test1.csv

              awk '!x[$0]++' test.csv > test1.csv

              Or

              awk '!x[$0]++' c:\directory\test.csv > c:\directory\test1.csv

              Which would be correct?
              The first would be correct if you were in the same folder/directory as the test.csv file. The second won't work in cygwin.

              The easiest way to do it would be to copy or move the test.csv file to the root of your C drive so it's at c:\test.csv.

              Then open up the cygwin terminal and navigate to the root of your c: drive by typing:
              cd /cygdrive/c

              Then you'll be at the root of the c: drive and you can just run the awk command on your file like this (the same command you mentioned above):
              awk '!x[$0]++' test.csv > test1.csv

              Once you've done that, you'll have the file called test1.csv located at c:\test1.csv and that is the file with all duplicates removed.

              Originally Posted by packerfan View Post

              Okay, never mind about where the file needs to be - that seems obvious after opening the program. However, I get this error when I try to run it. The picture clearly shows the file on my desktop (the screenshot shows the terminal and the file, both on my desktop).

              What am I doing wrong?
              Do what I wrote above.

              But here's an explanation first....

              In the picture above, your file is clearly on the desktop, but your cygwin shell is not.

              Go ahead and type the following command into the cygwin shell:
              pwd
              It's probably going to say you're at '/home/Paul/' because that would be the default location that you go when you open the shell. (The reason you see it says Paul@Paul-desktop is because your username is 'Paul' and your computer name is 'Paul-desktop'. In other words, that's not your location.)

              To make a long story short, and if you just want to get the job done, do exactly what I wrote before I quoted you.

              Or you can just download the file I attached to a previous post (it's your initial file with all duplicates removed). edit: I see that it's not the file you want to work with, but a smaller example.
              Signature

              :)

  • Profile picture of the author mojojuju
    I forgot about Windows Powershell.

    You can do it like this...(source)
    gc $filename | sort | get-unique > $newfileName

    It works well, but I don't know how it would work on a file as big as 1.5 GB. You'd just have to try it and see.

    Here's another way to do it if you go the cygwin route:

    sort test.csv | uniq > result.csv
    Signature

    :)

    • Profile picture of the author packerfan
      Originally Posted by mojojuju View Post

      I forgot about Windows Powershell.

      You can do it like this...(source)
      gc $filename | sort | get-unique > $newfileName

      It works well, but I don't know how it would work on a file as big as 1.5 GB. You'd just have to try it and see.

      Here's another way to do it if you go the cygwin route:

      sort test.csv | uniq > result.csv
      Okay, getting closer, and I really appreciate all the help. Things are starting to make sense (honestly, I don't know how you programmer guys do this stuff!)

      I moved the file to C:, changed directory per your instructions, and got the following error:

      -bash: file1.csv: Permission denied

      So it would seem it found my file, but it doesn't think I have permission to write to the C: drive, which is crazy. I'm the administrator on the machine and just put the darn test file there.

      If we can get this figured out I'll be forever grateful.

      I was "playing" around with Windows PowerShell earlier trying to figure it out. I'll give that a shot, too.
      Signature

      Nothing to see here

      • Profile picture of the author packerfan
        Okay, the PowerShell thing works 99% how I want. If there's any way to make one small tweak, great. If not, I'll survive.

        I need it to act as if anytime a keyword is duplicated the entire record is a duplicate, regardless of whether the rest of the data is different. For example, I want
        "cats" 22,500 searches - and
        "cats" 15,000 searches
        to be considered the same thing, so I would keep one and remove the other.

        If that's possible with powershell, then I'm all set. If it's not, I'm all set anyway because it's not the end of the world.

        I can't thank you enough. If the powershell works on my huge files you've made my life like a million times easier!

        Paul
        Signature

        Nothing to see here

        • Profile picture of the author mojojuju
          Originally Posted by packerfan View Post

          Okay, the PowerShell thing works 99% how I want. If there's any way to make one small tweak, great. If not, I'll survive.

          I need it to act as if anytime a keyword is duplicated the entire record is a duplicate, regardless of whether the rest of the data is different. For example, I want
          "cats" 22,500 searches - and
          "cats" 15,000 searches
          to be considered the same thing, so I would keep one and remove the other.
          I don't know how you'd do that with powershell. What I'd do, and what I've done many times with large amounts of data from the Adwords keyword tool, is that I'd just import it all into a MySQL database.

          Once it's imported into MySQL, it's easy to do all kinds of queries on the data. For instance, if you wanted some easy pickings for keyword queries, you could select all keyword phrases with at least three words and which the ratio of competition divided by keyword volume is really low (as low as you want).

          If the data was in a database you could easily remove duplicates and also explore the data in many ways that even the best keyword tools won't allow. But you gotta learn MySQL first.


          Originally Posted by packerfan View Post

          If that's possible with powershell, then I'm all set. If it's not, I'm all set anyway because it's not the end of the world.
          I'm sure it is in some way or another. Maybe somebody else could chime in with an example of how to do it.
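          In the meantime, here's one way to do it outside PowerShell - a rough Python sketch that keeps only the first row it sees for each keyword, assuming the keyword is the first comma-separated column (the file names are just placeholders):

          import csv

          seen = set()
          with open("test.csv", newline="", encoding="utf-8") as src, \
               open("test1.csv", "w", newline="", encoding="utf-8") as dst:
              reader = csv.reader(src)
              writer = csv.writer(dst)
              for row in reader:
                  if not row:
                      continue                      # skip blank lines
                  keyword = row[0].strip().lower()  # treat "Cats" and "cats" as the same keyword
                  if keyword not in seen:
                      seen.add(keyword)
                      writer.writerow(row)

          It only holds the keywords themselves in memory, so it should cope with your big files. Lowercasing the keyword is my own assumption - drop the .lower() if case matters to you.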

          Originally Posted by packerfan View Post

          I can't thank you enough. If the powershell works on my huge files you've made my life like a million times easier!
          You're welcome. I hope you find the other solution you're looking for.
          Signature

          :)

          • Profile picture of the author packerfan
            Originally Posted by mojojuju View Post

            I don't know how you'd do that with powershell. What I'd do, and what I've done many times with large amounts of data from the Adwords keyword tool, is that I'd just import it all into a MySQL database.

            Once it's imported into MySQL, it's easy to do all kinds of queries on the data. For instance, if you wanted some easy pickings for keyword queries, you could select all keyword phrases with at least three words and which the ratio of competition divided by keyword volume is really low (as low as you want).

            If the data was in a database you could easily remove duplicates and also explore the data in many ways that even the best keyword tools won't allow. But you gotta learn MySQL first.




            I'm sure it is in some way or another. Maybe somebody else could chime in with an example of how to do it.



            You're welcome. I hope you find the other solution you're looking for.
            You're exactly right about what I want to do. I am building a big database. I'll eventually put it in MySQL (once I hack my way around figuring that out) and then put some sort of front end on it with PHP, if possible, so I can query the heck out of it. It will all be on my local machine, so I'm guessing processing won't be an issue.

            Anyway, if you could possibly point me to a guide somewhere that would be awesome! I hate to keep taking up your time in this thread.
            Signature

            Nothing to see here

            • Profile picture of the author Jeff Chandler
              When you create your database in MySql, if you set the field as the primary key, it will not allow duplicate records. Then when you do the import, the duplicate records should fail automatically because of a primary key violation and you'll have your de-duped database. If you have MS Access, you could do the same thing.
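              To illustrate the idea with something you can run right away - this is only a sketch using Python's built-in sqlite3 module rather than MySQL (MySQL's INSERT IGNORE or LOAD DATA ... IGNORE gives the same effect), and it assumes the CSV has keyword and searches columns in that order:

              import csv
              import sqlite3

              conn = sqlite3.connect("keywords.db")
              conn.execute("""CREATE TABLE IF NOT EXISTS keywords (
                                  keyword  TEXT PRIMARY KEY,   -- duplicate keywords are rejected
                                  searches TEXT)""")

              with open("test.csv", newline="", encoding="utf-8") as f:
                  for row in csv.reader(f):
                      if row:
                          # OR IGNORE silently skips any row whose keyword is already in the table
                          conn.execute("INSERT OR IGNORE INTO keywords (keyword, searches) VALUES (?, ?)",
                                       (row[0], row[1] if len(row) > 1 else None))
              conn.commit()
              conn.close()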
              • Profile picture of the author packerfan
                Originally Posted by Jeff Chandler View Post

                When you create your database in MySql, if you set the field as the primary key, it will not allow duplicate records. Then when you do the import, the duplicate records should fail automatically because of a primary key violation and you'll have your de-duped database. If you have MS Access, you could do the same thing.
                I set up WAMP on my machine last night, created my database, and a table.

                Went to import the file and got an error that it was too big (it's about 1.8gb)

                Is there a way to get around this limit? I know there is - I googled it, but I couldn't find something that made sense to me. I'm kinda dumb when it comes to this stuff.
                Signature

                Nothing to see here

                • Profile picture of the author Jeff Chandler
                  Originally Posted by packerfan View Post

                  I set up WAMP on my machine last night, created my database, and a table.

                  Went to import the file and got an error that it was too big (it's about 1.8gb)

                  Is there a way to get around this limit? I know there is - I googled it, but I couldn't find something that made sense to me. I'm kinda dumb when it comes to this stuff.
                  I'm not sure what the file size limit is, but the easiest way would be to just split it up into smaller files.
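                  If you go the splitting route, a throwaway Python script can do it - just a rough sketch, assuming a plain CSV with no header row; the chunk size and file names are only placeholders:

                  # split_csv.py - break a big CSV into pieces of at most 1,000,000 lines each
                  CHUNK = 1_000_000
                  with open("test.csv", "r", encoding="utf-8", errors="replace") as src:
                      part, count, dst = 0, 0, None
                      for line in src:
                          if dst is None or count == CHUNK:
                              if dst:
                                  dst.close()          # finished the previous piece
                              part += 1
                              count = 0
                              dst = open(f"test_part{part}.csv", "w", encoding="utf-8")
                          dst.write(line)
                          count += 1
                      if dst:
                          dst.close()

                  Each test_partN.csv piece should then be small enough to import one after another.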
  • Profile picture of the author mojojuju
    You should close your cygwin shell. Then go to:

    Start Menu > All Programs > cygwin

    Then, right click on "Cygwin Terminal" and in the menu that comes up, click on "Run as administrator". Once you do that and you're running as administrator you should have no problems doing whatever you want.

    Since you've already got your text file located at C:\test.csv and you're now running cygwin as administrator, you can pick up on the instructions I mentioned earlier at the point shown below:

    navigate to the root of your c: drive by typing:
    cd /cygdrive/c

    Then you'll be at the root of the c: drive and you can just run the awk command on your file like this (the same command you mentioned above):
    awk '!x[$0]++' test.csv > test1.csv

    Once you've done that, you'll have the file called test1.csv located at c:\test1.csv and that is the file with all duplicates removed.

    EDIT: The following command might work better and not choke on the large file or take so long because sorting the file first makes it easier (for your computer) to remove the duplicates.

    sort test.csv | uniq > result.csv
    Signature

    :)

  • Profile picture of the author pandadoodle
    I was going to say another quick workaround would be to set up a local server and import it directly into a MySQL database. You could format the data, delete what you need, and even run batch SQL statements.
    Signature
    Web Design Cardiff - Panda Doodle
    • Profile picture of the author packerfan
      Thanks for all your help!

      I've just written a step-by-step guide for the non-techies like me on how to do this. It's posted here: Remove Duplicates in CSV Using PowerShell | Import data into a spreadsheet.

      Full credit given to this thread, including the source mojojuju cited.

      You guys are awesome. Thanks!
      Signature

      Nothing to see here

    • Profile picture of the author packerfan
      Originally Posted by pandadoodle View Post

      I was going to say another quick workaround would be to set up a local server and import it directly into a MySQL database. You could format the data, delete what you need, and even run batch SQL statements.
      Is there an "idiots guide to mysql and php" that would show me how to do this? This is actually what I want to do long term is build a huge database so I can query how I want.

      I know how to set up WAMP and get a database set up, but other than that I'm clueless as how to import files, delete duplicates, etc during the process.
      Signature

      Nothing to see here

      • Profile picture of the author Joe Ray
        Originally Posted by packerfan View Post

        Is there an "idiots guide to mysql and php" that would show me how to do this? This is actually what I want to do long term is build a huge database so I can query how I want.

        I know how to set up WAMP and get a database set up, but other than that I'm clueless as how to import files, delete duplicates, etc during the process.
        If you really want to learn to manage your data programmatically, I recommend Python instead of PHP. You need to learn some basic Python, then you can start learning Pandas, which is a high-performance Python library for data structures and data analysis.

        There's definitely a learning curve if you don't have a programming background, but if you want to work with big data, it's a good idea to learn this stuff. It will take a while to learn the basics, though.

        If you don't want to take the time to learn, you can pay a data engineer to set up a structure that automates everything for you. Then you don't have to do anything: you'll have custom-made software that handles all your data structures and data analytics, exactly the way you want it.
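        For example, once you have pandas installed, dropping duplicate keywords is only a few lines - a sketch that assumes the CSV has a column literally named "keyword" and fits in memory (for the really big files you'd read it in chunks):

        import pandas as pd

        df = pd.read_csv("test.csv")               # load the CSV
        df = df.drop_duplicates(subset="keyword")  # keep the first row for each keyword
        df.to_csv("deduped.csv", index=False)      # write the de-duplicated copy back out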
  • Profile picture of the author dd1153
    Open the CSV and sort the column you want to remove duplicates from, smallest to largest. Insert a column next to that one and put this formula in, assuming the duplicates are in column A starting at cell A2:

    =EXACT(A2,A3)

    Drag that down to the end of the spreadsheet.

    Now filter column B and only show the 'TRUE' rows. Those are your duplicates.

    Delete those, problem solved.
    • Profile picture of the author packerfan
      Originally Posted by dd1153 View Post

      Open the CSV and sort the column you want to remove duplicates from, smallest to largest. Insert a column next to that one and put this formula in, assuming the duplicates are in column A starting at cell A2:

      =EXACT(A2,A3)

      Drag that down to the end of the spreadsheet.

      Now filter column B and only show the 'TRUE' rows. Those are your duplicates.

      Delete those, problem solved.
      Again, for small files you can easily do this in Excel. I'm working with files that have 5,000,000 rows of data. Excel 2010 can only handle 1,000,000.

      So your solution doesn't apply in this case.
      Signature

      Nothing to see here

    • Profile picture of the author PatriciaS
      Originally Posted by dd1153 View Post

      Open the CSV and sort the column you want to remove duplicates from, smallest to largest. Insert a column next to that one and put this formula in, assuming the duplicates are in column A starting at cell A2:

      =EXACT(A2,A3)

      Drag that down to the end of the spreadsheet.

      Now filter column B and only show the 'TRUE' rows. Those are your duplicates.

      Delete those, problem solved.
      THANK YOU! That sure works for me, whose goal it is to NEVER have a file exceeding 1 million records.
  • Profile picture of the author dd1153
    Do it 5 times?
    • Profile picture of the author packerfan
      Originally Posted by dd1153 View Post

      Do it 5 times?
      Wish it were that easy. When you import the file it imports a million records, then quits the import. It leaves you with your original file with 5 million records, and then the new one with 1 million.

      So you can't just "do it 5 times" believe me I wish it were that easy.
      Signature

      Nothing to see here

    • Profile picture of the author Jeff Chandler
      You couldn't just run the Excel macro 5 times, because you would only catch the duplicates within each individual spreadsheet, not duplicates that span the split files.
  • Profile picture of the author Justin Ford
    Did you use the DuplicateFilesDeleter software? I used it for deleting duplicate files. Hope that helps you. Thanks.
    Signature
    Less is more. Keeping it simple takes time and effort.
