The GitHub readme/repository doesn't give a fair overview of what this project really covers. It seems really ambitious and well made, at least from a quick glance. This page gives a better overview: https://docs.alephdata.org/how-aleph-is-used
Some problems they aim to solve:
> Easy data search for both structured and unstructured information (ie. documents and databases).
> Cross-referencing between different datasets ("Who are all the politicians in my country that are mentioned in this leak?")
> Access control and data compartmentalisation, but also flexible sharing within cross-border teams.
> Continuous crawling of hundreds of public data sources as background material for research.
> Visual exploration of investigative analysis.
I trialed Aleph recently and was impressed by its progress against an ambitious goal. My impressions as a user were as follows:
1. Aleph is excellent out-of-the-box for its
– OCR, via Tesseract or Google’s Vision API
– Full text search, via Elasticsearch
– Browser based UI, via React
2. Aleph does an okay job, but has room for improvement, with
– Entity extraction
– Language detection
where “okay” means it’s accurate enough to be useful for filtering by names, emails, languages, etc., but you’ll probably encounter occasional errors. (A rough sketch of how these pieces fit together follows below.)
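To make that concrete, here is a minimal, purely illustrative sketch of the kind of pipeline those components add up to, using Tesseract for OCR, langdetect for language detection, and Elasticsearch for full text search. This is not Aleph's actual code; the index and field names are invented for the example.

```python
# Illustrative pipeline built from the same off-the-shelf pieces Aleph uses.
# Not Aleph's internals; index and field names are invented for this sketch.
import pytesseract                       # OCR via Tesseract
from PIL import Image
from langdetect import detect            # simple language detection
from elasticsearch import Elasticsearch  # full text search backend

es = Elasticsearch("http://localhost:9200")

def ingest(path: str) -> None:
    # 1. OCR the scanned page into plain text
    text = pytesseract.image_to_string(Image.open(path))
    # 2. Guess the document language (short or noisy text is where errors creep in)
    lang = detect(text) if text.strip() else "unknown"
    # 3. Index it so it becomes searchable
    es.index(index="documents", document={"path": path, "text": text, "lang": lang})

def search(query: str):
    # Full text search over the OCR'd content
    return es.search(index="documents", query={"match": {"text": query}})
```

Entity extraction would slot in as one more step between OCR and indexing, and that is exactly where the "okay, with occasional errors" accuracy shows up.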
I also noticed some search latency in my own deployment and would love to try the Elasticsearch tips from last week's HN thread [1]. The latency does not appear in the Aleph team's production deployment.
The latency issue with ES probably comes from how we use the indexes: each type of entity (documents, emails, people, companies) ends up with its own index, 48 in total at the moment. This of course introduces a lot of overhead at the beginning, but with a larger dataset and a clustered ES it pays off at scale. We should consider re-adding a single-index operation mode for small deployments.
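For anyone unfamiliar with what that means on the query side: Elasticsearch lets one search request fan out over many indices by name pattern, which is roughly how a per-entity-type layout gets queried. A hedged sketch, with invented index names (the real deployment uses its own naming scheme):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Per-entity-type layout: one index per schema, queried together via a wildcard.
# Index names here are invented for illustration.
fan_out = es.search(
    index="entities-*",                     # would hit all per-type indices in one request
    query={"match": {"name": "John Doe"}},
)

# Hypothetical single-index mode: same data in one index, less per-shard overhead
# for small deployments, at the cost of less isolation between entity types.
single = es.search(
    index="entities",
    query={"match": {"name": "John Doe"}},
)
```

Each index carries at least one shard, so dozens of mostly small indices on a single node mean a fair amount of per-shard overhead, which would be consistent with the latency only showing up in small deployments.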
Given your trial, do you think it would be worthwhile to investigate moving a search solution for standards and regulations (in PDF and text documents) to Aleph? That is to say, is it a good enough search solution for structured and unstructured data that it would make sense to build on top of it instead of rolling your own?
Yes, definitely worth investigating if you need full text search as well as content extraction from PDFs. I found the production deployment installation [1] to be the most straightforward.
> Can Britain leave the European Union? Yes, it's possible; but complicated and will probably not make your life better in the way that you're expecting.
As a side note, I stumbled on this because German public television seems to be working with it too. I found that quite interesting to see, in addition to discovering the project itself.
I'm an outsider who doesn't have much knowledge about faith, so please humor me.
I looked up "Tetragrammaton" since I had never heard of it [0], but I don't see how it would constitute a logical proof, and I don't quite understand the significance. How do you reach your approximation from the Tetragrammaton?
The tetragrammaton is a 4-letter name of God, and a conceptualization of
- somethingness
- becoming observable/“contrasty”
- germinating
- and becoming self-aware (and an applied practice)
Like a stack overflow of reality. It is one of many names of God in Hebrew.
This essentially represents “the all”, as we can understand it. It’s interesting how ancient alphabets were built on base concepts and interactions. They have a sort of philosophy to them.
Markov chain text generation is pretty wild, because we can now build computer systems that endlessly generate their own many-letter “names of God”. God has many names in Hebrew; everything is a name of God, so to speak, like a DNA sequence.
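For anyone wondering what that reference means in practice, a word-level Markov chain just samples each next word from whatever followed the current word in some training text. A toy sketch (the corpus here is a placeholder, not anything from the thread):

```python
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    """Map each word to the list of words that followed it in the corpus."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain: dict, start: str, length: int = 20) -> str:
    """Walk the chain, sampling each next word from the observed successors."""
    word, output = start, [start]
    for _ in range(length):
        successors = chain.get(word)
        if not successors:  # dead end: this word never had a successor
            break
        word = random.choice(successors)
        output.append(word)
    return " ".join(output)

# Toy usage with a placeholder corpus
chain = build_chain("in the beginning was the word and the word was with the word")
print(generate(chain, "the"))
```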
Dear HN colleagues: let's be careful about swamping those Aleph folks with traffic. They probably have enemies around the web who would exploit any overload or outage. Slashdotting can definitely turn into an unintended DDoS attack.
Better yet: maybe somebody with access to some kind of attack-resistant CDN provider could help them migrate.
What do you mean? The actual Aleph server is an on-premises Docker deployment, so we can't overload the services people are actually using. The infrastructure is all GitHub or Docker, which already scales.