Which IM network are you using? I worked on an SEC crawler at PubSub from 2004 to 2006. As I remember, the SEC provides an SGML version of all filings. So in this case, it may be better to use an SGML parser to process the documents, since that version is structured.
Let me know, because I think any project that uses SEC data will be interesting.