I studied a fair amount of NLP (a true passion of mine) at school, and after I graduated I spent several months working on tech that did this (and other things). It was intended to be a startup, but sadly, at the time, my business sense sucked and I couldn't decide on a good product to fit the tech to (the fact that I was developing tech before I had a strong sense of my product is already telling).
I have since started a completely different (and profitable) company, and the code has just been bit-rotting. I'm not sure what I should do with it. Keep it around in case I ever decide to pursue a business model like some of these companies (I probably don't have time for that)? Open source it (cleaning up the code is time-consuming, and what do I stand to gain from it)? I guess I could use the open-sourced stuff to help me find freelance contracts, but I just don't see a lot of remote NLP work being offered.
News article title extraction. News article relevant thumbnail extraction. News article text body extraction. Generating publicly traded stock symbols from business news articles. Some Techmeme-style document clustering.
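As a rough illustration only (this is not the code described above), title extraction can be sketched with nothing but the Python standard library. The assumptions here are mine: prefer an `og:title` meta tag when the page has one, otherwise fall back to the `<title>` element and drop anything after a " | " or " - " separator, which is often the site name. The names `TitleParser` and `extract_title` are hypothetical.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the og:title meta value and the text of the <title> element."""

    def __init__(self):
        super().__init__()
        self.og_title = None
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "og:title":
            self.og_title = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract_title(html):
    """Return a best-guess headline for an article page (heuristic only)."""
    parser = TitleParser()
    parser.feed(html)
    if parser.og_title:
        return parser.og_title.strip()
    title = parser.title.strip()
    # Assumption: text after the first separator is a site name, so drop it.
    for sep in (" | ", " - "):
        if sep in title:
            title = title.split(sep)[0].strip()
            break
    return title
```

A real extractor needs far more than this: headlines that legitimately contain " - " get truncated, and many pages put the site name first.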
I am working on a project (more of a public service than a startup) that needs this. I've looked through all of the resources linked in the articles above and nothing works as well as I need it to. The best performer is Readability, so I will probably go with the Python port of that.
I worked on something similar to this in Ruby last year, though it's currently in the middle of a re-architecture/rewrite due to its flaws: http://github.com/peterc/pismo
Decruft also has a couple of bug fixes to python-readability. They both need a lot of work, though. You'll have to do some spelunking to figure out how to actually call the libraries correctly.
Still, I hate seeing the code rot...