Help Deleting Duplicates In CSV File

31 replies
I have HUGE CSV files (1.5GB) that I need to delete duplicate records from.

I normally just use Excel 2010, and I can do it with the push of a button. But these files exceed the 1,000,000-row limit.

I've searched the internet, found a couple PHP scripts that are supposed to do it, but I'm not exactly the sharpest tool in the shed when it comes to programming.

Before I go through the hassle of hiring someone on oDesk and paying them for what should be 2 minutes of work, I thought I'd ask here.

I've attached an example of the CSV file (obviously a small one) in hopes that someone can help. I'd really appreciate it.

For the purpose of these files, anytime a keyword is used twice the record is considered a duplicate, even if other data in the row is different.

Hope that makes sense. The file is small, but I had to zip it because I guess you can't attach CSV files.

Thanks,
Paul
#csv #deleting #duplicates #file
  • Profile picture of the author tryinhere
    Sub deletemyduplicates()
        'tryinhere
        'Sort the data block on column A, then walk down the column and delete
        'any row whose column A value matches the row directly above it.
        Range("A2:Q2").Select
        Range(Selection, Selection.End(xlDown)).Select
        Selection.Sort Key1:=Range("A2"), Order1:=xlAscending, Header:=xlNo, _
            OrderCustom:=1, MatchCase:=False, Orientation:=xlTopToBottom, _
            DataOption1:=xlSortNormal
        Range("A3").Select
        Do Until ActiveCell.Value = ""
            If ActiveCell.Value = ActiveCell.Offset(-1, 0).Value Then
                ActiveCell.EntireRow.Delete
                ActiveCell.Offset(-1, 0).Select
            End If
            ActiveCell.Offset(1, 0).Select
        Loop
        MsgBox "You're good to go"
    End Sub
    Open up your CSV file in Excel and insert this macro, then run it. Make a copy first, just in case.
    To add the macro, open the file and go to Tools > Macro > Visual Basic Editor,
    then select Insert > Module.
    Paste the code above into that page, then save and close the editor.

    Return to your worksheet, select Tools > Macro > Macros > deletemyduplicates, and then Run.
    Signature
    | > Choosing to go off the grid for a while to focus on family, work and life in general. Have a great 2020 < |
    • Profile picture of the author packerfan
      Originally Posted by tryinhere View Post

      Open up your CSV file in Excel and insert this macro, then run it. Make a copy first, just in case.
      To add the macro, open the file and go to Tools > Macro > Visual Basic Editor,
      then select Insert > Module.
      Paste the code above into that page, then save and close the editor.

      Return to your worksheet, select Tools > Macro > Macros > deletemyduplicates, and then Run.
      This has to be done in Excel, right? The problem is that the entire file won't open because it exceeds a million rows.

      Is there another program that I can use to open it and still run it?
      Signature

      Nothing to see here

      • Profile picture of the author Boudewijn
        Originally Posted by packerfan View Post

        This has to be done in Excel, right? The problem is that the entire file won't open because it exceeds a million rows.

        Is there another program that I can use to open it and still run it?
        1 million rows doesn't sound like a lot. I could help you out here, but then you'd have to send the file to me (zipped).
        I can then remove the duplicates for you and return the CSV.
        b [dot] lutgerink [at] betuwe [dot] net
        Just let me know if you are interested.

        It should not take much time, probably about 10 minutes.

        BL
        • Profile picture of the author packerfan
          Scroll up - I posted a step-by-step guide on my blog showing how to do this with more than 1 million records.
          Signature

          Nothing to see here

  • Profile picture of the author telofint
    Thank you, I am also going to try it.
    • Profile picture of the author mojojuju
      I used the sed command on it. Learned how to do it from this page (ctrl+f "delete duplicate, nonconsecutive lines from a file").

      edit: scratch that. This is an even better way... (source)
      awk '!x[$0]++' oldfile > newfile
      awk and sed commands are typically found on Linux or UNIX systems, but if you're using Windows, you can install cygwin.
      Signature

      :)

      • Profile picture of the author packerfan
        Originally Posted by mojojuju View Post

        I used the sed command on it. Learned how to do it from this page (ctrl+f "delete duplicate, nonconsecutive lines from a file").

        Is there a way to do this on a Windows machine, or will I need to use my Mac?

        I don't have linux installed anywhere. I'd prefer a solution that could be done from my PC or Mac, but it's not the end of the world if I need to use only my Mac.

        So just to make sure I understand...

        Below is the command I run. I have no idea what any of that stuff means! Should I replace any of that with the directory/filename? Sorry for the dumb questions. I have no problem learning, I just have little to no experience with command-line scripting or anything.

        And assuming this works on my mac, do I just open up a terminal session and do it there?

        sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
        Signature

        Nothing to see here

        • Profile picture of the author mojojuju
          Originally Posted by packerfan View Post


          And assuming this works on my mac, do I just open up a terminal session and do it there?

          sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P'
          No, see my edit. Don't use that command. Just open up a terminal and do this:
          awk '!x[$0]++' oldfile > newfile

          where 'oldfile' is the name of the file that you want to check for duplicate lines. and 'newfile' is where you want to store the new version (the one with duplicates removed).

          I assume Macs have tools like awk. If it doesn't you can install cygwin and get everything you need to do this.

          Another option is to install linux, or run a linux live cd/usb, or install linux in virtualbox.
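          If installing cygwin turns out to be more trouble than it's worth, the same line-level de-dupe can be sketched in a few lines of Python (which your Mac almost certainly has already) - just a rough equivalent of that awk one-liner, with placeholder file names:

          # keep the first occurrence of each exact line, like awk '!x[$0]++'
          seen = set()
          with open("test.csv") as src, open("test1.csv", "w") as dst:
              for line in src:
                  if line not in seen:   # first time we've seen this exact line
                      seen.add(line)
                      dst.write(line)

          Like the awk version, it holds every unique line in memory, so a 1.5 GB file will want a decent chunk of RAM.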
          Signature

          :)

          • Profile picture of the author packerfan
            Originally Posted by mojojuju View Post

            No, see my edit. Don't use that command. Just open up a terminal and do this:
            awk '!x[$0]++' oldfile > newfile

            where 'oldfile' is the name of the file that you want to check for duplicate lines. and 'newfile' is where you want to store the new version (the one with duplicates removed).

            I assume Macs have tools like awk. If it doesn't you can install cygwin and get everything you need to do this.

            Another option is to install linux, or run a linux live cd/usb, or install linux in virtualbox.
            Okay, I'm installing cygwin now. I don't want to have to install linux just to do this. Thanks for all your help.

            One last question... Sorry for so many. Assume old file is called test.csv and new will be test1.csv

            awk '!x[$0]++' test.csv > test1.csv

            Or

            awk '!x[$0]++' c:\directory\test.csv > c:\directory\test1.csv

            Which would be correct?
            Signature

            Nothing to see here

            • Profile picture of the author packerfan
              Okay, never mind about where the file needs to be - that seems obvious after opening the program. However, I get this error when I try to run it. The picture clearly shows the file on my desktop (the screenshot shows the terminal and the file, both on my desktop).

              What am I doing wrong?
              Signature

              Nothing to see here

            • Profile picture of the author mojojuju
              Originally Posted by packerfan View Post

              One last question... Sorry for so many. Assume old file is called test.csv and new will be test1.csv

              awk '!x[$0]++' test.csv > test1.csv

              Or

              awk '!x[$0]++' c:\directory\test.csv > c:\directory\test1.csv

              Which would be correct?
              The first would be correct if you were in the same folder/directory as the test.csv file. The second won't work in cygwin.

              The easiest way to do it would be to copy or move the test.csv file to the root of your C drive so it's at c:\test.csv.

              Then open up the cygwin terminal and navigate to the root of your c: drive by typing:
              cd /cygdrive/c

              Then you'll be at the root of the c: drive and you can just run the awk command on your file like this (the same command you mentioned above):
              awk '!x[$0]++' test.csv > test1.csv

              Once you've done that, you'll have the file called test1.csv located at c:\test1.csv and that is the file with all duplicates removed.

              Originally Posted by packerfan View Post

              Okay, never mind about where the file needs to be - that seems obvious after opening the program. However, I get this error when I try to run it. The picture clearly shows the file on my desktop (the screenshot shows the terminal and the file, both on my desktop).

              What am I doing wrong?
              Do what I wrote above.

              But here's an explanation first....

              In the picture above, your file is clearly on the desktop, but your cygwin shell is not.

              Go ahead and type the following command into the cygwin shell:
              pwd
              It's probably going to say you're at '/home/Paul/' because that would be the default location that you go when you open the shell. (The reason you see it says Paul@Paul-desktop is because your username is 'Paul' and your computer name is 'Paul-desktop'. In other words, that's not your location.)

              To make a long story short, and if you just want to get the job done, do exactly what I wrote before I quoted you.

              Or you can just download the file I attached to a previous post (it's your initial file with all duplicates removed). edit: I see that it's not the file you want to work with, but a smaller example.
              Signature

              :)

  • Profile picture of the author mojojuju
    I forgot about Windows Powershell.

    You can do it like this...(source)
    gc $filename | sort | get-unique > $newfileName

    It works well, but I don't know how it would work on a file as big as 1.5 GB. You'd just have to try it and see.

    Here's another way to do it if you go the cygwin route:

    sort test.csv | uniq > result.csv
    Signature

    :)

    • Profile picture of the author packerfan
      Originally Posted by mojojuju View Post

      I forgot about Windows Powershell.

      You can do it like this...(source)
      gc $filename | sort | get-unique > $newfileName

      It works well, but I don't know how it would work on a file as big as 1.5 GB. You'd just have to try it and see.

      Here's another way to do it if you go the cygwin route:

      sort test.csv | uniq > result.csv
      Okay, getting closer, and I really appreciate all the help. Things are starting to make sense (honestly, I don't know how you programmer guys do this stuff!)

      I moved the file to C:, changed directory per your instructions, and got the following error:

      -bash: file1.csv: Permission denied

      So it would seem it found my file, but it doesn't think I have permission to write to the C: drive, which is crazy. I'm the administrator on the machine and just put the darn test file there.

      If we can get this figured out I'll be forever grateful.

      I was "playing" around with Windows PowerShell earlier trying to figure it out. I'll give that a shot, too.
      Signature

      Nothing to see here

      • Profile picture of the author packerfan
        Okay, the PowerShell thing works 99% how I want. If there's any way to make one small tweak, great. If not, I'll survive.

        I need it to act as if anytime a keyword is duplicated the entire record is a duplicate, regardless of whether the rest of the data is different. For example, I want
        "cats" 22,500 searches - and
        "cats" 15,000 searches
        to be considered the same thing, so I would keep one and remove the other.

        If that's possible with powershell, then I'm all set. If it's not, I'm all set anyway because it's not the end of the world.

        I can't thank you enough. If the powershell works on my huge files you've made my life like a million times easier!

        Paul
        Signature

        Nothing to see here

        • Profile picture of the author mojojuju
          Originally Posted by packerfan View Post

          Okay, the PowerShell thing works 99% how I want. If there's any way to make one small tweak, great. If not, I'll survive.

          I need it to act as if anytime a keyword is duplicated the entire record is a duplicate, regardless of whether the rest of the data is different. For example, I want
          "cats" 22,500 searches - and
          "cats" 15,000 searches
          to be considered the same thing, so I would keep one and remove the other.
          I don't know how you'd do that with powershell. What I'd do, and what I've done many times with large amounts of data from the Adwords keyword tool, is that I'd just import it all into a MySQL database.

          Once it's imported into MySQL, it's easy to do all kinds of queries on the data. For instance, if you wanted some easy pickings for keyword queries, you could select all keyword phrases with at least three words and which the ratio of competition divided by keyword volume is really low (as low as you want).

          If the data was in a database you could easily remove duplicates and also explore the data in many ways that even the best keyword tools won't allow. But you gotta learn MySQL first.


          Originally Posted by packerfan View Post

          If that's possible with powershell, then I'm all set. If it's not, I'm all set anyway because it's not the end of the world.
          I'm sure it is in some way or another. Maybe somebody else could chime in with an example of how to do it.
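          In the meantime, here's one way to do it outside PowerShell - a rough Python sketch that keeps only the first row it sees for each keyword, assuming the keyword is the first comma-separated column (the file names are just placeholders):

          import csv

          seen = set()
          with open("test.csv", newline="", encoding="utf-8") as src, \
               open("test1.csv", "w", newline="", encoding="utf-8") as dst:
              reader = csv.reader(src)
              writer = csv.writer(dst)
              for row in reader:
                  if not row:
                      continue                      # skip blank lines
                  keyword = row[0].strip().lower()  # treat "Cats" and "cats" as the same keyword
                  if keyword not in seen:
                      seen.add(keyword)
                      writer.writerow(row)

          It only holds the keywords themselves in memory, so it should cope with your big files. Lowercasing the keyword is my own assumption - drop the .lower() if case matters to you.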

          Originally Posted by packerfan View Post

          I can't thank you enough. If the powershell works on my huge files you've made my life like a million times easier!
          You're welcome. I hope you find the other solution you're looking for.
          Signature

          :)

          • Profile picture of the author packerfan
            Originally Posted by mojojuju View Post

            I don't know how you'd do that with powershell. What I'd do, and what I've done many times with large amounts of data from the Adwords keyword tool, is that I'd just import it all into a MySQL database.

            Once it's imported into MySQL, it's easy to do all kinds of queries on the data. For instance, if you wanted some easy pickings for keyword queries, you could select all keyword phrases with at least three words and which the ratio of competition divided by keyword volume is really low (as low as you want).

            If the data was in a database you could easily remove duplicates and also explore the data in many ways that even the best keyword tools won't allow. But you gotta learn MySQL first.




            I'm sure it is in some way or another. Maybe somebody else could chime in with an example of how to do it.



            You're welcome. I hope you find the other solution you're looking for.
            You're exactly right about what I want to do. I am building a big database. I'll eventually put it in MySQL (once I hack my way around figuring that out) and then put some sort of front end on it with PHP, if possible, so I can query the heck out of it. It will all be on my local machine, so I'm guessing processing won't be an issue.

            Anyway, if you could possibly point me to a guide somewhere that would be awesome! I hate to keep taking up your time in this thread.
            Signature

            Nothing to see here

            • Profile picture of the author Jeff Chandler
              When you create your database in MySql, if you set the field as the primary key, it will not allow duplicate records. Then when you do the import, the duplicate records should fail automatically because of a primary key violation and you'll have your de-duped database. If you have MS Access, you could do the same thing.
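              To illustrate the idea with something you can run right away - this is only a sketch using Python's built-in sqlite3 module rather than MySQL (MySQL's INSERT IGNORE or LOAD DATA ... IGNORE gives the same effect), and it assumes the CSV has keyword and searches columns in that order:

              import csv
              import sqlite3

              conn = sqlite3.connect("keywords.db")
              conn.execute("""CREATE TABLE IF NOT EXISTS keywords (
                                  keyword  TEXT PRIMARY KEY,   -- duplicate keywords are rejected
                                  searches TEXT)""")

              with open("test.csv", newline="", encoding="utf-8") as f:
                  for row in csv.reader(f):
                      if row:
                          # OR IGNORE silently skips any row whose keyword is already in the table
                          conn.execute("INSERT OR IGNORE INTO keywords (keyword, searches) VALUES (?, ?)",
                                       (row[0], row[1] if len(row) > 1 else None))
              conn.commit()
              conn.close()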
              • Profile picture of the author packerfan
                Originally Posted by Jeff Chandler View Post

                When you create your database in MySql, if you set the field as the primary key, it will not allow duplicate records. Then when you do the import, the duplicate records should fail automatically because of a primary key violation and you'll have your de-duped database. If you have MS Access, you could do the same thing.
                I set up WAMP on my machine last night, created my database, and a table.

                Went to import the file and got an error that it was too big (it's about 1.8gb)

                Is there a way to get around this limit? I know there is - I googled it, but I couldn't find something that made sense to me. I'm kinda dumb when it comes to this stuff.
                Signature

                Nothing to see here

                • Profile picture of the author Jeff Chandler
                  Originally Posted by packerfan View Post

                  I set up WAMP on my machine last night, created my database, and a table.

                  Went to import the file and got an error that it was too big (it's about 1.8gb)

                  Is there a way to get around this limit? I know there is - I googled it, but I couldn't find something that made sense to me. I'm kinda dumb when it comes to this stuff.
                  I'm not sure what the file size limit is, but the easiest way would be to just split it up into smaller files.
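                  If you go the splitting route, a throwaway Python script can do it - just a rough sketch, assuming a plain CSV with no header row; the chunk size and file names are only placeholders:

                  # split_csv.py - break a big CSV into pieces of at most 1,000,000 lines each
                  CHUNK = 1_000_000
                  with open("test.csv", "r", encoding="utf-8", errors="replace") as src:
                      part, count, dst = 0, 0, None
                      for line in src:
                          if dst is None or count == CHUNK:
                              if dst:
                                  dst.close()          # finished the previous piece
                              part += 1
                              count = 0
                              dst = open(f"test_part{part}.csv", "w", encoding="utf-8")
                          dst.write(line)
                          count += 1
                      if dst:
                          dst.close()

                  Each test_partN.csv piece should then be small enough to import one after another.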
  • Profile picture of the author mojojuju
    You should close your cygwin shell. Then go to:

    Start Menu > All Programs > cygwin

    Then, right click on "Cygwin Terminal" and in the menu that comes up, click on "Run as administrator". Once you do that and you're running as administrator you should have no problems doing whatever you want.

    Since you've already got your text file located at C:\test.csv and you're now running cygwin as administrator, you can pick up on the instructions I mentioned earlier at the point shown below:

    navigate to the root of your c: drive by typing:
    cd /cygdrive/c

    Then you'll be at the root of the c: drive and you can just run the awk command on your file like this (the same command you mentioned above):
    awk '!x[$0]++' test.csv > test1.csv

    Once you've done that, you'll have the file called test1.csv located at c:\test1.csv and that is the file with all duplicates removed.

    EDIT: The following command might work better and not choke on the large file or take so long because sorting the file first makes it easier (for your computer) to remove the duplicates.

    sort test.csv | uniq > result.csv
    Signature

    :)

  • Profile picture of the author pandadoodle
    I was going to say another quick workaround would be to set up a local server and import it directly into a MySQL database. You could format the data, delete what you need, and even run batch SQL statements.
    Signature
    Web Design Cardiff - Panda Doodle
    • Profile picture of the author packerfan
      Thanks for all your help!

      I've just written a step-by-step guide for the non-techies like me on how to do this. It's posted here: Remove Duplicates in CSV Using PowerShell | Import data into a spreadsheet.

      Full credit given to this thread, including the source mojojuju cited.

      You guys are awesome. Thanks!
      Signature

      Nothing to see here

    • Profile picture of the author packerfan
      Originally Posted by pandadoodle View Post

      I was going to say another quick workaround would be to set up a local server and import it directly into a MySQL database. You could format the data, delete what you need, and even run batch SQL statements.
      Is there an "idiots guide to mysql and php" that would show me how to do this? This is actually what I want to do long term is build a huge database so I can query how I want.

      I know how to set up WAMP and get a database set up, but other than that I'm clueless as how to import files, delete duplicates, etc during the process.
      Signature

      Nothing to see here

      • Profile picture of the author Joe Ray
        Originally Posted by packerfan View Post

        Is there an "idiots guide to mysql and php" that would show me how to do this? This is actually what I want to do long term is build a huge database so I can query how I want.

        I know how to set up WAMP and get a database set up, but other than that I'm clueless as how to import files, delete duplicates, etc during the process.
        If you really want to learn to manage your data programmatically, I recommend Python instead of PHP. You need to learn some basic Python, then you can start learning Pandas, which is a high-performance Python library for data structures and data analysis.

        There's definitely a learning curve if you don't have a programming background, but if you want to work with big data, it's a good idea to learn this stuff. It will take a while to learn the basics, though.

        If you don't want to take the time to learn, you can pay a data engineer to set up a structure that automates everything for you. Then you don't have to do anything: you'll have custom-made software that handles all your data structures and data analytics, exactly the way you want it.
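        For example, once you have pandas installed, dropping duplicate keywords is only a few lines - a sketch that assumes the CSV has a column literally named "keyword" and fits in memory (for the really big files you'd read it in chunks):

        import pandas as pd

        df = pd.read_csv("test.csv")               # load the CSV
        df = df.drop_duplicates(subset="keyword")  # keep the first row for each keyword
        df.to_csv("deduped.csv", index=False)      # write the de-duplicated copy back out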
  • Profile picture of the author dd1153
    Open the CSV and sort the column you want to remove duplicates from, smallest to largest. Insert a column next to that one and put this formula in, assuming the duplicates are in column A starting at cell A2:

    =EXACT(A2,A3)

    Drag that down to the end of the spreadsheet.

    Now filter column B and only show the 'TRUE' rows. Those are your duplicates.

    Delete those, problem solved.
    • Profile picture of the author packerfan
      Originally Posted by dd1153 View Post

      Open the CSV and sort the column you want to remove duplicates from, smallest to largest. Insert a column next to that one and put this formula in, assuming the duplicates are in column A starting at cell A2:

      =EXACT(A2,A3)

      Drag that down to the end of the spreadsheet.

      Now filter column B and only show the 'TRUE' rows. Those are your duplicates.

      Delete those, problem solved.
      Again, for small files you can easily do this in Excel. I'm working with files that have 5,000,000 rows of data. Excel 2010 can only handle 1,000,000.

      So your solution doesn't apply in this case.
      Signature

      Nothing to see here

    • Profile picture of the author PatriciaS
      Originally Posted by dd1153 View Post

      Open the CSV and sort the column you want to remove duplicates from, smallest to largest. Insert a column next to that one and put this formula in, assuming the duplicates are in column A starting at cell A2:

      =EXACT(A2,A3)

      Drag that down to the end of the spreadsheet.

      Now filter column B and only show the 'TRUE' rows. Those are your duplicates.

      Delete those, problem solved.
      THANK YOU! That sure works for me, whose goal it is to NEVER have a file exceeding 1 million records.
  • Profile picture of the author dd1153
    Do it 5 times?
    • Profile picture of the author packerfan
      Originally Posted by dd1153 View Post

      Do it 5 times?
      Wish it were that easy. When you import the file it imports a million records, then quits the import. It leaves you with your original file with 5 million records, and then the new one with 1 million.

      So you can't just "do it 5 times" believe me I wish it were that easy.
      Signature

      Nothing to see here

    • Profile picture of the author Jeff Chandler
      You couldn't just run the Excel macro 5 times, because you would only catch the duplicates within each individual spreadsheet, not duplicates that span the split files.
  • Profile picture of the author Justin Ford
    Did you use the DuplicateFilesDeleter software? I used it for deleting duplicate files. Hope that helps you. Thanks.
    Signature
    Less is more. Keeping it simple takes time and effort.
