How would you handle this problem?

11 replies
I need to do the following with a CRON job running on server A

1) Pull a .zip file from Server B
2) Unzip file (it contains a CSV)
3) Read through CSV and enter into a database

Simple enough, but I'm concerned with the file size.
The zipped file is 220 MB; unzipped it's just over 1.3 GB.

I'm concerned that there will be performance / timeout issues because of the file
size; it contains over 1 million records. Just looping through them to insert into the DB will take a while, let alone downloading and unzipping.

Any ideas on how to efficiently tackle this problem?

PS) Will be using PHP / MySQL
  • SteveSRS
    hey,

    You forgot some vital info; how often will the cron job run?
  • FirstSocialApps
    Oops, sorry. Once every 1-2 days; I haven't decided which for sure. But either way, it's not like it will be running every ten minutes and killing the server. I still want to be as efficient as possible, though.
    My first thought was cURL to pull the file, unzip it, then loop it with fgetcsv(); I just don't know if doing it that way will cause issues. It certainly will be slow.
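    Roughly, that approach would look something like this in PHP. Everything below (the URL, file paths, credentials, and table/column names) is just a placeholder, not the real setup:
    -------------------
    <?php
    // 1) Pull the zip from Server B, streaming it straight to disk
    $zipPath = '/tmp/data.zip';
    $fp = fopen($zipPath, 'w');
    $ch = curl_init('http://serverb.example.com/export.zip');   // placeholder URL
    curl_setopt($ch, CURLOPT_FILE, $fp);              // write to file, not memory
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    curl_close($ch);
    fclose($fp);

    // 2) Unzip it (assumes the archive contains data.csv)
    $zip = new ZipArchive();
    if ($zip->open($zipPath) === true) {
        $zip->extractTo('/tmp');
        $zip->close();
    }

    // 3) Read the CSV row by row so the 1.3 GB file never sits in memory
    $pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $stmt = $pdo->prepare('INSERT INTO records (col1, col2) VALUES (?, ?)');

    $csv = fopen('/tmp/data.csv', 'r');
    while (($row = fgetcsv($csv)) !== false) {
        $stmt->execute([$row[0], $row[1]]);           // process/transform $row here as needed
    }
    fclose($csv);
    -------------------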
    • Brandon Tanner
      I don't think performance should be too much of a concern, so long as your script is running on a decent-enough box with plenty of RAM.

      Just make sure that the values of "max_execution_time" and "memory_limit" in the php.ini file are sufficient to complete the job, and you should be good to go (depending on your server, you may have to bump up one or both of those values considerably). FYI - it's a good idea to check with your host before changing those values!

      Also... it would be wise to wrap Try/Catch blocks around each individual task in a script like that, so that you can repeat any task that doesn't get executed properly (e.g. incomplete download, can't connect to the DB, etc). And keep in mind that any task that has to be repeated adds to the script's total execution time.
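      As a rough sketch of that idea (the limit values and the step function names below are made up for illustration, not part of the actual script):
      -------------------
      <?php
      // Raise the limits for this one long-running script, if the host allows it
      set_time_limit(0);                    // no max_execution_time for this run
      ini_set('memory_limit', '512M');      // only matters if large chunks are held in memory

      // Hypothetical step functions: downloadZip(), unzipFile(), importCsv()
      $tasks = ['downloadZip', 'unzipFile', 'importCsv'];

      foreach ($tasks as $task) {
          for ($attempt = 1; $attempt <= 3; $attempt++) {
              try {
                  $task();                  // run the step
                  break;                    // it worked, move on to the next task
              } catch (Exception $e) {
                  error_log("$task failed (attempt $attempt): " . $e->getMessage());
                  sleep(30);                // retrying adds to the total execution time
              }
          }
      }
      -------------------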
      • wayfarer
        Originally Posted by Brandon Tanner:

        Just make sure that the values of "max_execution_time" and "memory_limit" in the php.ini file are sufficient to complete the job...
        Instead of that, I would wrap the long-running processes (the download and unzip) in a shell script, then call whatever PHP process you need afterwards, right from the shell script. That keeps the heavy lifting at the OS level, without having to wrap those calls in a PHP script.


        ***EDIT***
        Oops, should have read to the bottom of the comments. 2 other people suggested a similar thing.
        What SteveSRS suggests is pretty much exactly what I was thinking actually.
  • FirstSocialApps
    Thanks Brandon,

    I think I'm also going to break it up into two CRON jobs: one for the download, one for the insert into the database, and then separate them by an hour or so.
  • SteveJohnson
    If #3 is accurate ('enter into a database') you should investigate using MySQL's LOAD DATA INFILE ( MySQL :: MySQL 5.1 Reference Manual :: 13.2.6 LOAD DATA INFILE Syntax ) on the CSV file itself. It's SO MUCH FASTER than using single INSERT INTO statements.
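    A minimal sketch of firing that from PHP with PDO (the file path, table, and columns are placeholders; LOCAL also needs local_infile enabled on both the client and the server):
    -------------------
    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass', [
        PDO::MYSQL_ATTR_LOCAL_INFILE => true,
    ]);

    // Bulk-load the whole CSV in one statement instead of a million INSERTs
    $pdo->exec("
        LOAD DATA LOCAL INFILE '/tmp/data.csv'
        INTO TABLE records
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
        (col1, col2, col3)
    ");
    -------------------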

    On the other hand, if you have to UPDATE existing data in the table, do as Brandon suggested and use try/catch blocks wrapped around an fgetcsv or fgets retrieval/storage loop, storing the file pointer each time so you can keep track of progress.
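    Something along these lines, as a sketch (paths, table, and column names are placeholders; saving the offset to a small file is just one way to keep track of progress):
    -------------------
    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $stmt = $pdo->prepare('UPDATE records SET col2 = ? WHERE col1 = ?');

    // Resume from wherever the last run stopped
    $progressFile = '/tmp/import.offset';
    $offset = is_file($progressFile) ? (int) file_get_contents($progressFile) : 0;

    $csv = fopen('/tmp/data.csv', 'r');
    fseek($csv, $offset);

    while (($row = fgetcsv($csv)) !== false) {
        try {
            $stmt->execute([$row[1], $row[0]]);
            file_put_contents($progressFile, ftell($csv));   // store the file pointer
        } catch (Exception $e) {
            error_log('Row failed: ' . $e->getMessage());
            break;                           // the next run picks up from the saved offset
        }
    }
    fclose($csv);
    -------------------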
  • FirstSocialApps
    Yes, directly loading the CSV is not an option. I have to process the data some before entering it into the database.

    My main concern was with memory usage and the CRON job. I have never used CRON or the zip functions on a file over 50 MB (just never had a need to). This script is going to run on a shared host. I wrote a script to loop through the records, process them, and insert into the database; 10K records took around 3200 ms, so I think I can do 1 million in around 6 minutes. That's not too bad, I think.
  • SteveSRS
    Hey,

    Do you know how to use bash scripting? I would recommend the following:

    A cron job to run a bash file, say
    transfercsv.sh

    In this bash script you execute a wget command (way faster than downloading via PHP, although you could also run the wget command via PHP's 'exec' or 'system' function).

    Then, using the same bash script, unzip the file via the command line. After that you can run another cron to process the data, or just call the PHP script directly from the bash script to avoid weird timing problems.

    Bash scripting is super easy (just like .bat on Windows, only with Unix commands, one per line)...

    Something like:
    -------------------
     #!/bin/sh

    cd /www/yourpathtosite/downloadfolder
    wget http://www.csvurl.com/filecsv.zip

     unzip filecsv.zip -d /tmp
     # /tmp is the extraction path; change it if you want, but /tmp is a good place for this
    #last one:
    php -f /path/to/yourcron.php

     # if this command doesn't work, check your crontab to see how PHP scripts are called there and copy that method.
    -------------------------

    Also, even if the data needs some modification, using MySQL's LOAD DATA is still a good idea.
    Otherwise, use multiple inserts / updates / replaces in one query where possible, instead of a while / for loop that runs one query per iteration.
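    For example, a sketch of batching the inserts in PHP (the table/column names and the batch size of 500 are arbitrary placeholders):
    -------------------
    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
    $csv = fopen('/tmp/data.csv', 'r');
    $batch = [];

    while (($row = fgetcsv($csv)) !== false) {
        $batch[] = $row;                    // process/transform $row here as needed
        if (count($batch) >= 500) {
            insertBatch($pdo, $batch);
            $batch = [];
        }
    }
    if ($batch) {
        insertBatch($pdo, $batch);          // flush whatever is left over
    }
    fclose($csv);

    // One multi-row INSERT per batch instead of one query per CSV row
    function insertBatch(PDO $pdo, array $rows)
    {
        $placeholders = rtrim(str_repeat('(?, ?),', count($rows)), ',');
        $stmt = $pdo->prepare("INSERT INTO records (col1, col2) VALUES $placeholders");
        $params = [];
        foreach ($rows as $r) {
            $params[] = $r[0];
            $params[] = $r[1];
        }
        $stmt->execute($params);
    }
    -------------------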
    • Don Art
      Yeah, for sure go with what Steve is saying,

      but one tip that may really save you from aggravation:

      The command line, PHP, and cron each run in a unique environment, so, for example, a script that works from the command line may fail when run from cron.

      so, do this:

      At your command line, use whereis to get the full path to each executable (like wget and php).

      Example: whereis php will give you something like: /usr/bin/php

      So then, in Steve's suggestion, edit:

      php -f /path/to/yourcron.php

      to use the full path instead:

      /usr/bin/php -f /path/to/yourcron.php

      ALSO:

      Make sure you test cron scripts with your working directory (or directories) at permission level 777.

      After everything is working, dial the permissions back to the minimum that still works.

      (You might even have to play with permissions to access /path/to/yourcron.php from cron.)
  • PermaNulled
    Not sure why you'd want to use PHP to process it, as you could unzip the file and import the records entirely from bash.

    But what Steve posted is certainly something you should look into, as some of it is going to require bash to handle. You're not going to want to try to download the file using PHP itself; that's just going to give you a headache, same with unzipping.

    And if it's a huge file, importing the records is going to take quite some time as well, which is why I'm not sure why you wouldn't handle the whole operation in a bash script.
  • TopicSpan
    FirstSocialApps - don't forget about MySQL's "INSERT DELAYED" feature. It doesn't work with InnoDB tables, but it will work 100% with MyISAM. If you want a MASSIVE performance boost, I would SERIOUSLY consider SteveJohnson's recommendation: use PHP to manipulate the CSV (line by line or whatever; 1 million lines is NOT a lot of data to get through) and then use LOAD DATA. Why? Read this article, it explains the gains of using LOAD DATA.
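    A rough sketch of that combination (everything below, from the file names to the strtoupper() "processing", is placeholder illustration only):
    -------------------
    <?php
    // Pass 1: read the raw CSV, massage each row in PHP, write a cleaned CSV
    $in  = fopen('/tmp/data.csv', 'r');
    $out = fopen('/tmp/data_clean.csv', 'w');

    while (($row = fgetcsv($in)) !== false) {
        $row[1] = strtoupper(trim($row[1]));   // stand-in for the real processing
        fputcsv($out, $row);
    }
    fclose($in);
    fclose($out);

    // Pass 2: bulk-load the cleaned file in a single LOAD DATA statement
    $pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass', [
        PDO::MYSQL_ATTR_LOCAL_INFILE => true,
    ]);
    $pdo->exec("
        LOAD DATA LOCAL INFILE '/tmp/data_clean.csv'
        INTO TABLE records
        FIELDS TERMINATED BY ','
        (col1, col2, col3)
    ");
    -------------------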