Put together a 5-minute read on why object storage has been particularly nice to build on when training and serving models. This post unsurprisingly revolves around MinIO, an open source, high-performance object store, but these use cases apply just as well to your object storage of choice!
Apache Tika is time-tested and, by some, considered a legacy toolkit. With Tika running as a container and a set of Python bindings, it's possible to get a text extraction experience that is as easy to build with as newer frameworks like Unstructured, yet matches the extraction capability of dedicated extraction models like Nougat. Kind of surprising!
Furthermore, using a backing object store (e.g., MinIO) to hold the source documents is very useful, whether the extracted text feeds a RAG pipeline or an LLM training dataset.
Put together a document text extraction server using Apache Tika (in ~30 lines of code) that can vectorize text for retrieval-augmented generation or build LLM training datasets.
Much credit to the tika-python project for providing the Python bindings!
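To give a feel for how little glue this takes, here's a minimal sketch of the flow: pull a document out of MinIO, hand the bytes to a Tika server, and return cleaned plain text ready for chunking or vectorizing. The endpoints, credentials, bucket names, and helper names below are my illustrative assumptions, not the post's actual code.

```python
# Sketch: MinIO -> Tika -> plain text, assuming a Tika server container on
# port 9998 and a local MinIO deployment on port 9000 with dev credentials.

def clean_text(raw):
    """Collapse Tika's blank-line padding into single newlines."""
    lines = (ln.strip() for ln in (raw or "").splitlines())
    return "\n".join(ln for ln in lines if ln)

def extract(bucket, key):
    # Deferred imports so the cleaning helper above works without
    # `pip install minio tika` installed.
    from minio import Minio
    from tika import parser

    client = Minio("localhost:9000", access_key="minioadmin",
                   secret_key="minioadmin", secure=False)  # assumed dev creds
    obj = client.get_object(bucket, key)
    try:
        data = obj.read()
    finally:
        obj.close()
        obj.release_conn()
    # tika-python returns a dict with "content" (text) and "metadata".
    parsed = parser.from_buffer(data, serverEndpoint="http://localhost:9998")
    return clean_text(parsed.get("content"))
```

Keeping the source documents in the object store means the extraction step is stateless: any worker can fetch, extract, and move on.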
For a long time, the hardest part of building software was interfacing natural language with syntactic systems. Now, ironically, that might be one of the simplest parts. This 5-minute read explains what LLMs are good (and bad) at, and why.
A neat guide on how to perform feature extraction using the hidden states of LLMs. As a bonus, it also covers loading datasets securely, without uploading your data to the Hugging Face Hub.
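The core idea can be sketched roughly like this (not the guide's exact code): run text through a frozen model, take the last hidden state, and mean-pool over non-padding tokens to get a fixed-size feature vector. The model name and the pooling helper are my assumptions; any Transformer with accessible hidden states works.

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average token vectors, skipping padding. hidden: (seq, dim); mask: (seq,)."""
    m = mask[:, None].astype(hidden.dtype)
    return (hidden * m).sum(axis=0) / max(float(mask.sum()), 1e-9)

def embed(texts):
    # Deferred imports: requires `pip install transformers torch`.
    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "distilbert-base-uncased"  # assumed model; swap in your own
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name).eval()
    enc = tok(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)  # last_hidden_state: (batch, seq, dim)
    hidden = out.last_hidden_state.numpy()
    masks = enc["attention_mask"].numpy()
    return np.stack([mean_pool(h, m) for h, m in zip(hidden, masks)])
```

The resulting vectors can feed a simple downstream classifier, which is the whole appeal of treating the LLM as a frozen feature extractor.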
As for the detection model, I trained it on a dataset of fairly clear satellite imagery, so it should work even on aircraft at their normal cruising altitudes. For off-the-shelf drone altitudes, a slightly different dataset may be needed (I'm thinking ordinary photographs of aircraft types might do the trick).
All the other stuff, like the onboard deployment of MinIO and an onboard inference server, should work regardless of altitude -- at that point it becomes a question of the hardware involved.