
This may be a silly question, but it's something that's bothered me ever since Scribd launched: How do you handle all the file format conversions?

It seems to me that you couldn't be wrangling all those formats yourselves: many are undocumented and/or hideously complex, you've got to do a lot more work than simply reading the files, and you don't have much time to do it. So, are there some off-the-shelf components to do this work (with reasonable license terms), or is the problem easier than it seems?

I'm most interested in the MSFT formats, which seem the trickiest.



My first guess would be Abiword. In the course of putting together an open source word processor that can handle a couple of different closed file formats, they've spun off their code into libraries. The wvWare library handles Word files. http://abiword.com/projects/

As far as I know, the whole Abiword project is GPL, though that shouldn't matter much for server-side code, unless you're letting your customers host the service themselves, like Versionate seems to be planning on doing... I guess you could just pipe the output from a thin wrapper around the library to the rest of your code.
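To make the "thin wrapper" idea concrete, here is a rough Python sketch of what piping through an external converter could look like. The function name convert_document and the converter command are placeholders I made up for illustration; you would substitute whatever command-line tool you actually wrap (for example one built on wvWare) after checking its real arguments. It only shows the shape of running the converter as a separate process and reading its output.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def convert_document(data: bytes, command: list[str],
                     suffix: str = ".doc", timeout: float = 60.0) -> bytes:
    """Run an external converter on `data` and return whatever it prints.

    `command` is a hypothetical converter invocation; the path of a
    temporary input file is appended as its last argument. The converter
    is assumed to write its result to stdout.
    """
    with tempfile.TemporaryDirectory() as workdir:
        src = Path(workdir) / f"input{suffix}"
        src.write_bytes(data)
        # Separate process: the converter's code never links into yours.
        result = subprocess.run(
            command + [str(src)],
            capture_output=True,
            timeout=timeout,   # don't let a bad upload hang the worker
            check=True,        # raise CalledProcessError on failure
        )
        return result.stdout


if __name__ == "__main__":
    # Stand-in "converter": a tiny Python script that upper-cases the file.
    fake_converter = [
        sys.executable, "-c",
        "import sys; sys.stdout.buffer.write(open(sys.argv[1], 'rb').read().upper())",
    ]
    print(convert_document(b"hello", fake_converter, suffix=".txt"))
```

Running the converter as its own process also means a crash on a malformed file takes down only that one conversion, not the whole app.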


It could be done this way, but frankly, Abiword's Word importer isn't very good. OpenOffice's, while not perfect, is much better. Unfortunately, unlike Abiword, OOo doesn't come with a nice command-line utility for doing those conversions. OOo has a VBA-esque language that allows you to automate tasks like that, but it's much better suited to "interactive" use than to running as part of another app's backend.

Another thing to note is that some of Scribd's backend is written in C#. Maybe if your app is Windows-based, there are Office API calls that let you do stuff like this. Just a guess, though.



