Uploading Files

Posted by Koz Thursday, April 23, 2009 23:46:00 GMT

Anyone who’s built a rails application that deals with large file uploads probably has a few horror stories to tell about it. While some people love to overstate the issues for their own purposes, it’s still something that can be quite challenging to do well.

What’s the Problem?

As I mentioned in the article on File Downloads, your rails processes are a scarce resource. You need them to be free to handle your application’s requests; if they’re all busy, your users will be left waiting. When we optimised the download process we made sure that we used our webservers instead of tying up a rails process to spoon-feed the file out over the network to your users. Dealing with uploads has a similar problem.

When a browser uploads a file, it encodes the contents in a format called ‘multipart mime’ (it’s the same format that gets used when you send an email attachment). In order for your application to do something with that file, rails has to undo this encoding. Doing this requires reading the huge request body and matching each line against a few regular expressions. This can be incredibly slow and use a huge amount of CPU and memory.
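To make the encoding concrete, here is a minimal sketch of a multipart body and a deliberately naive parser for it. This is not rails’ parser; the boundary string, field names and file contents are made up for illustration, and a real parser has to stream the body rather than hold it all in one string.

```ruby
# Illustrative only: a deliberately naive multipart/form-data parser.
# The boundary, field names and contents are invented for this sketch.
BOUNDARY = "sketchboundary42"

# Build the kind of body a browser would send for a form with one text
# field and one file field.
def build_body(boundary)
  [
    "--#{boundary}",
    'Content-Disposition: form-data; name="title"',
    "",
    "Holiday photo",
    "--#{boundary}",
    'Content-Disposition: form-data; name="upload"; filename="beach.jpg"',
    "Content-Type: image/jpeg",
    "",
    "...binary data would go here...",
    "--#{boundary}--",
    ""
  ].join("\r\n")
end

# Split the body on the boundary marker, then pull the field name and
# optional filename out of each part's headers with regular expressions.
def parse_multipart(body, boundary)
  parts = {}
  body.split("--#{boundary}").each do |chunk|
    # Skip the empty preamble and the closing "--" terminator.
    next if chunk.empty? || chunk.start_with?("--")

    headers, content = chunk.sub(/\A\r\n/, "").split("\r\n\r\n", 2)
    next unless content

    name = headers[/\bname="([^"]*)"/, 1]
    next unless name

    filename = headers[/\bfilename="([^"]*)"/, 1]
    parts[name] = { filename: filename, content: content.sub(/\r\n\z/, "") }
  end
  parts
end

parts = parse_multipart(build_body(BOUNDARY), BOUNDARY)
puts parts["upload"][:filename]
```

Even this toy version scans the whole body and runs a regular expression per part, which gives a feel for why the cost grows with the size of the upload.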

While this parsing is happening your rails process is busy, and can’t handle other requests. Pity the poor user stuck behind a request which contains a 100M upload!

What’s not the problem?

Some people seem to think that the File Upload problem with rails is that the entire process is blocked while the browser sends the encoded body to you. This isn’t true, and hasn’t been for a long time. Whether you’re using nginx + mongrel, apache + mongrel or apache + passenger, your web server buffers the entire request before rails locks itself for processing. So no matter how slow a user’s connection is, your application isn’t locked while they upload their file.

What can you do?

There are a number of unattractive options to work around this slow multipart-parsing. The most common I’ve seen is to send uploads to a non-rails process such as a CGI script or a merb/mongrel/rack application. CGI scripts have the obvious disadvantage that you need to write a script simple enough to start up quickly and featured enough to process your uploads. Doing it in rack leaves you relying on ruby’s threading to handle parallelism. This is probably not what you want and your throughput is probably much lower than it would be without that upload being processed.

What else can you do?

Because neither of these options was acceptable, Pratik Naik and I have built Mod Porter, an Apache module that does the heavy lifting for your file uploads. All of the hard stuff is done by libapreq though, so you don’t have to worry about using C code written by two ruby programmers!

Porter is essentially the inverse of X-SendFile. It parses the multipart post in C inside your apache process and writes the files to disk. Once that work is done it changes the request to look like a regular form POST which contains pointers to the temp files on disk. To maintain system security it also signs the modified parameters so people can’t attack your system like those old PHP apps.
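The signing step can be sketched in ruby as an HMAC over the rewritten parameters. This is an illustration of the general technique under assumptions, not Porter’s actual wire format: the secret, the parameter names and the way the parameters are joined are all hypothetical.

```ruby
require "openssl"

# Hypothetical shared secret; in practice this is a long random value
# known to both the web server module and the application.
SHARED_SECRET = "change-me-to-a-long-random-value"

# Sign the parameters with HMAC-SHA256. Sorting gives both sides the same
# canonical order. The "key=value&..." joining is an assumption made for
# this sketch only.
def sign(params, secret)
  canonical = params.sort.map { |key, value| "#{key}=#{value}" }.join("&")
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new("SHA256"), secret, canonical)
end

# Compare two strings without stopping at the first differing byte, so
# timing does not reveal how much of a forged signature was correct.
def secure_equal?(a, b)
  return false unless a.bytesize == b.bytesize

  a.bytes.zip(b.bytes).reduce(0) { |diff, (x, y)| diff | (x ^ y) }.zero?
end

# The application only trusts file paths whose signature checks out.
def trusted?(params, signature, secret)
  secure_equal?(sign(params, secret), signature)
end

params = { "path" => "/tmp/upload-1", "filename" => "beach.jpg" }
signature = sign(params, SHARED_SECRET)
puts trusted?(params, signature, SHARED_SECRET)
```

Without a check like this, anyone could post a plain form that claims a temp file lives at an arbitrary path on your server, which is the class of attack the signature is there to stop.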

This means that your rails processes don’t have to deal with anything more than a regular form post which is nice and fast. In addition to the apache module, Porter also includes a Rails Plugin which hides all of this from you. It makes an upload handled by Porter look just like a regular Rails Upload.
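As a rough idea of what “looks like a regular upload” involves, here is a sketch of a file-like wrapper around a file that is already on disk. The class name and its constructor arguments are hypothetical and are not the plugin’s real API; the point is only that application code tends to expect methods such as read and size on an upload.

```ruby
# Hypothetical wrapper, not the plugin's real class: it makes a file that
# the web server already wrote to disk respond like an uploaded file.
class DiskBackedUpload
  attr_reader :path, :original_filename, :content_type

  def initialize(path:, original_filename:, content_type:)
    @path = path
    @original_filename = original_filename
    @content_type = content_type
  end

  # Size in bytes, taken from the file system rather than the request.
  def size
    File.size(path)
  end

  # Delegate reads to a lazily opened binary file handle.
  def read(*args)
    to_file.read(*args)
  end

  def to_file
    @file ||= File.open(path, "rb")
  end

  # Close the handle if one was opened.
  def close
    @file&.close
    @file = nil
  end
end
```

A controller could then treat such an object much like any other upload, for example calling `upload.read` or `upload.original_filename`, without knowing the parsing happened in the web server.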

How fast is it?

The speed of upload parsing isn’t particularly relevant; the reduced locking is far more important. Your user’s internet connection is much more important for the round-trip upload performance than your upload handler’s parser.

Having said all that, Porter runs significantly faster than the equivalent pure-ruby parsing code. Depending on the size and number of uploads we’ve seen response times between 30 and 200 times as fast. That’s not just compared to rails’ upload parser, it’s that much faster than every other ruby mime parser we tried.

Isn’t this just like the Nginx module?

Kinda. We’ve been thinking about this module ever since we started using lighttpd’s X-SendFile header. When I saw the nginx module get released I decided to start planning the Apache equivalent. Porter is completely transparent to your application: you don’t need a special form action, and you don’t need to tell Porter what form fields to pass through to the web application. This means you can use Porter in production, and mongrel or thin in development, without any changes to your application.

The biggest improvement from this is that you don’t need to change your nginx config every time you add a new input to a form, or a new file upload to your application. This is extremely tedious and error prone, especially when making these changes involves a support ticket with your hosting provider. The major goal we have with Porter is to make sure it always ‘Just Works’, so you can put a file upload into any form without having to worry about your web server.

Getting Started

Porter is still beta software, so you’re strongly advised to test it first, but you already knew that. The porter website has the installation instructions. Once you’ve got that done you’ll need to add the rails plugin, and configure them to share a nice secure secret. Then, hopefully, your application will Just Work but your uploads will be much less painful.

If you have any issues getting it running, leave us a note on the GitHub issues page.


Comments

  1. Renaud Morvan April 24, 2009 @ 12:03 AM

    It looks nice, I was reluctant to write a rack asset server due to that dramatic ruby multipart parsing issue but looks like you just offer a good workaround for it.

    I will definitely give it a try.

    Thanx!

  2. Fabian April 24, 2009 @ 04:41 AM

    Looks great, I have been waiting for this post since you announced it. I’m running an application which features large file uploads (~ 20 – 100MB). I use paperclip with swfupload to handle these uploads, as a plain old file upload is a bit unresponsive with these file sizes. Can I just plug in your mod into apache, and it will just work? I also use the mimetypes-gem to parse the file type, as flash just returns application/octet-stream. Would I still need to do this? And if yes, is there any advantage from using the mod? Sorry for the questions, my understanding of web servers is pretty shallow.

    Anyway, thanks for your great work!

  3. Brian McManus April 24, 2009 @ 06:28 AM

    Paperclip is not liking this because it is getting a ModPorter::UploadedFile rather than just a File object so things Paperclip is expecting to call (to_tempfile, size, etc.) aren’t available… I might try to patch Paperclip to work and see what happens

  4. Thijs van der Vossen April 24, 2009 @ 08:26 AM

    I’m wondering why you guys didn’t do a libapreq patch for Rack instead? It seems this would have solved the slow multipart-parsing just as well.

  5. Manfred Stienstra April 24, 2009 @ 08:26 AM

    Why did you implement an Apache module and not just a plugin or monkey patch for Rack multipart parsing using libapreq2 Ruby bindings?

  6. Pratik Naik April 24, 2009 @ 11:18 AM

    @Manfred/Thijs : Because it’s not about performance. It’s about resource utilization. Multipart parsing can/will be slow no matter what you write it in. If porter takes 10 seconds for a 100 MB file and even if the Rack one takes 10 seconds, porter is still better because Apache processes are much cheaper than your Ruby ones. So you’ll be blocking a 2 MB process rather than your 50-60+ MB Rails process.

  7. Anil Wadghule April 24, 2009 @ 12:41 PM

    Nice! It’s a must-have thing.

  8. Dmytro Shteflyuk April 24, 2009 @ 12:48 PM

    I don’t know why you mention changing the nginx config when new fields are added to the form. Just add upload_pass_form_field ”” line to nginx config and all fields will be passed directly as is. Here at Scribd we are using Nginx upload module and are very happy about it.

  9. Roderick van Domburg April 24, 2009 @ 02:48 PM

    Very interesting! Two questions about the implementation: if user switching is used with Apache, then are the temporary files also owned by the users? And exactly what is the impact on applications that don’t use a plugin or shared secret?

  10. johnny iller April 24, 2009 @ 02:58 PM

    One approach I have taken with recent projects is to use amazon’s S3 and post files directly to that without going through rails at all. I realize you don’t get some of the intelligent processing you may require from your upload script, but you can always process the file by a different process entirely after it has been uploaded….

  11. Brian McManus April 24, 2009 @ 03:45 PM

    If you put a file directly to S3 and then need to process it you’d have to fetch it back from S3 which would bump up your S3 monetary costs needlessly.

  12. pete April 24, 2009 @ 05:20 PM

    If you’re using S3, you can also have the browser POST directly to S3.

  13. Neal Clark April 24, 2009 @ 11:12 PM

    @Brian McManus i don’t really have a strong opinion on uploading directly to S3 or not, but—

    if you’re storing uploads on S3, the order you do it in doesn’t make any difference: you’re either uploading to your application server and paying to transfer the file to S3, or you’re uploading to S3 and paying to transfer the file to your application server. it doesn’t matter which ‘direction’ the transfer is in—it costs the same amount. unless you don’t need to do any processing of file uploads at all, in which case you’re saving money by uploading directly to S3.

    of course, if your application server is running on EC2 (and your bucket is in the same availability zone), transfer in either direction is free.

  14. Koz April 25, 2009 @ 01:24 AM

    Uploading to S3 directly is a good fit for some requirements but, as mentioned, if you have to do anything to those files, it’s actually more complicated than local uploads.

    As for the nginx comments, it’s still not completely transparent to your rails applications, but I’m sure it works great. Our main motivation was that we wanted to use it with apache and passenger, and the nginx module is no help there :)

    The paperclip problems are our problem not theirs, can you lodge issues for us and we’ll get them fixed.

  15. Brian McManus April 25, 2009 @ 07:59 PM

    Issue opened on github for the Paperclip problem.

    @Neal Clark Yes, there would be a cost difference as far as I can see if you need to do post-processing on the file you’re uploading to S3.

    If you upload to S3 directly you have to:

    1) Pay the upload cost (which is higher than the download cost, not the same as you said).
    2) Download the file back to your system for processing (another transfer fee).
    3) Upload the file* back to S3 (again the higher push cost paid again).

    For a total of 3 S3 transfers unless I’m missing something.

    Local upload is:

    1) Upload the file to your server (no S3 fee).
    2) Do post-processing on the file.
    3) Upload the file* to S3.

    Doing it this way only needs one S3 transfer.

    * Could potentially be multiple pushes depending if your post-processing is creating alternate versions of the uploaded file but that would be the same N pushes in either case.

  16. Neal Clark April 26, 2009 @ 10:16 AM

    @Brian McManus

    you’re absolutely right.

    i could have sworn that S3 transfer cost the same regardless of the direction, but i looked at the S3 pricing page after reading your reply, and i was wrong.

    the last several projects i have worked on have been hosted on EC2, so i’m not used to thinking about the costs after the initial upload (since there are none). not running on EC2, like you said, you pay for:

    1. upload to s3 directly from the user (more expensive)
    2. download from s3 from your application (less expensive)
    3. re-upload to s3 after post-processing (more expensive)

    i prefaced my original post saying i don’t really have a strong position on uploading directly to S3, but after your clarification i think it’s an obviously bad idea for non-EC2 hosted apps. for EC2 hosted apps, it depends on the application.

    good call dude, my bad. thanks for the reply, i def learned something!

    -neal

  17. Cody Brimhall May 03, 2009 @ 01:00 AM

    God bless you people! I have been tearing my hair out for a while trying to deal with large uploads gracefully, and then out comes this, like Christmahanukwanzukkah in May!

  18. Millisami May 24, 2009 @ 10:22 PM

    Hey, guys. good discussion!! I’m feeling like I’ve got the Ultimate one that I’ve been thinking about the various upload processes on my head only. This made me feel like I’ve got the both. Gonna give it a try!

  19. ggstarling June 17, 2009 @ 05:36 PM

    Hello everyone! This evening I have found a very fresh video about the office plankton everyday life. I recommend this for all to pick up your mood. I helped)) You can watch this toon here: http://www.youtube.com/watch?v=neFYTsNFzmc

    Dont be sad - be happy))

  20. mod_porter user July 29, 2009 @ 02:26 PM

    This object ModPorter::UploadedFile passed does not recognize a read method

    undefined method `read’ for #

  21. Andrew Kuklewicz August 28, 2009 @ 07:57 PM

    We are currently changing how we do uploads (having used the mongrel upload progress plugin when that was state of the art).

    How does mod_porter scale?

    We work with files that are hella big – often > 100mb, and see days and times when many users are uploading at the same time. Currently this is a pretty big drain on our servers, so offloading the whole thing to the much faster and more scalable s3 post service seems attractive to me.

    We also do all of our file processing on ec2 anyway, so even though our main site is not on ec2, we get the definite advantage of not having to upload to our servers then also push to s3.

    Anyway, next time I need uploads, and I’m not using ec2/s3, this will be top of my list – nice to also see this apparently now works with the apache upload progress module (see comments: http://drogomir.com/blog/2008/6/18/upload-progress-bar-with-mod_passenger-and-apache).
