The GitHub readme/repository doesn't give a fair overview of what this project really covers. It seems really ambitious and well made, at least from a quick glance. This page gives a better overview: https://docs.alephdata.org/how-aleph-is-used
Some problems they aim to solve:
> Easy data search for both structured and unstructured information (ie. documents and databases).
> Cross-referencing between different datasets ("Who are all the politicians in my country that are mentioned in this leak?")
> Access control and data compartmentalisation, but also flexible sharing within cross-border teams.
> Continuous crawling of hundreds of public data sources as background material for research.
> Visual exploration of investigative analysis.
I trialed Aleph recently and was impressed by its progress against an ambitious goal. My impressions as a user were as follows:
1. Aleph is excellent out-of-the-box for its
– OCR, via Tesseract or Google’s Vision API
– Full text search, via Elasticsearch
– Browser based UI, via React
2. Aleph does an okay job, but has room for improvement, with
– Entity extraction
– Language detection
where “okay” means it’s accurate enough to be useful for filtering by names, emails, languages, etc., but you’ll probably encounter occasional errors. (A rough sketch of how these pieces fit together follows below.)
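To make that concrete, here is a minimal, purely illustrative sketch of the kind of pipeline those components add up to, using Tesseract for OCR, langdetect for language detection, and Elasticsearch for full text search. This is not Aleph's actual code; the index and field names are invented for the example.

```python
# Illustrative pipeline built from the same off-the-shelf pieces Aleph uses.
# Not Aleph's internals; index and field names are invented for this sketch.
import pytesseract                       # OCR via Tesseract
from PIL import Image
from langdetect import detect            # simple language detection
from elasticsearch import Elasticsearch  # full text search backend

es = Elasticsearch("http://localhost:9200")

def ingest(path: str) -> None:
    # 1. OCR the scanned page into plain text
    text = pytesseract.image_to_string(Image.open(path))
    # 2. Guess the document language (short or noisy text is where errors creep in)
    lang = detect(text) if text.strip() else "unknown"
    # 3. Index it so it becomes searchable
    es.index(index="documents", document={"path": path, "text": text, "lang": lang})

def search(query: str):
    # Full text search over the OCR'd content
    return es.search(index="documents", query={"match": {"text": query}})
```

Entity extraction would slot in as one more step between OCR and indexing, and that is exactly where the "okay, with occasional errors" accuracy shows up.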
I also noticed some search latency in my own deployment and would love to try the Elasticsearch tips from last week's HN thread [1]. The latency does not appear in the Aleph team's production deployment.
The latency issue with ES probably comes from how we use the indexes: each type of entity (documents, emails, people, companies) ends up with its own index, 48 in total at the moment. This of course introduces a lot of overhead at the beginning, but with a larger dataset and a clustered ES it pays off at scale. We should consider re-adding a single-index operation mode for small deployments.
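For anyone unfamiliar with what that means on the query side: Elasticsearch lets one search request fan out over many indices by name pattern, which is roughly how a per-entity-type layout gets queried. A hedged sketch, with invented index names (the real deployment uses its own naming scheme):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Per-entity-type layout: one index per schema, queried together via a wildcard.
# Index names here are invented for illustration.
fan_out = es.search(
    index="entities-*",                     # would hit all per-type indices in one request
    query={"match": {"name": "John Doe"}},
)

# Hypothetical single-index mode: same data in one index, less per-shard overhead
# for small deployments, at the cost of less isolation between entity types.
single = es.search(
    index="entities",
    query={"match": {"name": "John Doe"}},
)
```

Each index carries at least one shard, so dozens of mostly small indices on a single node mean a fair amount of per-shard overhead, which would be consistent with the latency only showing up in small deployments.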
Given your trial, do you think it would be worthwhile to investigate moving a search solution for standards and regulations (in PDF and text documents) to Aleph? That is to say, is it a good enough search solution for structured and unstructured data that it would make sense to build on top of it instead of rolling your own?
Yes, definitely worth investigating if you need full text search as well as content extraction from PDFs. I found the production deployment installation [1] to be the most straightforward.
> Can Britain leave the European Union? Yes, it's possible; but complicated and will probably not make your life better in the way that you're expecting.
As a side note, I stumbled on this because German public television seems to be working with it too. I found that quite interesting to see, in addition to discovering the project itself.
I'm an outsider who doesn't have much knowledge about faith, so please humor me.
I looked up "Tetragrammaton" since I had never heard of it [0], but I don't see how it would constitute a logical proof, and I don't quite understand the significance. How do you reach your approximation from the Tetragrammaton?
The tetragrammaton is a 4-letter name of God, and a conceptualization of
- somethingness
- becoming observable/“contrasty”
- germinating
- and becoming self-aware (and an applied practice)
Like a stack overflow of reality. It is one of many names of God in Hebrew.
This essentially represents “the all”, as we can understand it. It’s interesting how ancient alphabets were built on base concepts and interactions. They have a sort of philosophy to them.
Markov chain text generation is pretty wild, because we can now build computer systems that endlessly generate their own many-letter “names of God”. God has many names in Hebrew; everything is a name of God, so to speak, like a DNA sequence.
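For anyone wondering what that reference means in practice, a word-level Markov chain just samples each next word from whatever followed the current word in some training text. A toy sketch (the corpus here is a placeholder, not anything from the thread):

```python
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    """Map each word to the list of words that followed it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain: dict, start: str, length: int = 20) -> str:
    """Walk the chain, sampling each next word from the observed successors."""
    word, output = start, [start]
    for _ in range(length):
        successors = chain.get(word)
        if not successors:  # dead end: this word never had a successor
            break
        word = random.choice(successors)
        output.append(word)
    return " ".join(output)

# Toy usage with a placeholder corpus
chain = build_chain("in the beginning was the word and the word was with the word")
print(generate(chain, "the"))
```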
Dear HN colleagues: let's be careful about swamping those Aleph folks with traffic. They probably have enemies around the web who would exploit any overload or outage. Slashdotting can definitely turn into an unintended DDoS attack.
Better yet: maybe somebody with access to some kind of attack-resistant CDN provider could help them migrate.
What do you mean? The actual Aleph server is an on-premises Docker deployment, so we can't overload the services people are actually using. The infrastructure is all GitHub or Docker, which already scales.