Hacker News | josefcullhed's comments

Great article. I really think programmers should be more conservative when it comes to new technology. You don't always have to use fancy React stuff when making simple websites, and so on.

But I'm not sure about the whole "programming is simple, like planning a party" thing. When I think about my own projects with more than 150K lines of code, it's not like any party-planning process I have been involved in; it is actually a lot more complex. It is easy to forget how hard things are when you know them and have worked with them for many years.


Great job with Marginalia. Do you plan to open source your data as well or only the code?


If I can solve the logistics, maybe. I don't have the needed bandwidth and off-site storage at this point.


What sort of sizes are we talking about? I'm wondering whether it would be possible to "crowdfund" the storage costs for a requester-pays S3 bucket for it.


I'm probably producing around 250-500 GB of data per month at this point.


What's the cumulative size for the index to date? I'm not rich by any measure, but if it's within reach I'd probably fund the storage costs.


Reach out to Jason Scott -- textfiles.com -- and see if he knows anyone who would be interested.

He might know some folks.


Any idea how compressible that is?

If it's something that compresses really well (e.g. text data in a database), then filesystems with transparent compression (e.g. ZFS, likely others) could potentially help make that workable.


The data is either already compressed or dense binary soup, so no luck.


Idea: Host datasets as one or more torrents. Thoughts?


Founder here. We are using harmonic centrality instead of PageRank. But of course much more work needs to be done to make the search engine usable.


I'm curious as to which reasons/tradeoffs were involved in the decision to use harmonic centrality, if you wouldn't mind sharing.


Founder here,

I suggest you start by not implementing a crawler and using commoncrawl.org instead. The problem with starting a web crawler is that you will need a lot of money, and almost all big websites are behind Cloudflare, so you will be blocked pretty quickly. Crawling is a big undertaking, and most of the issues are non-technical.


I've heard from other people who run engines (Right Dao, Gigablast) that this is a major problem; Common Crawl does look helpful, but it's not continuously updated. FWIW, Right Dao uses Wikipedia as a starting point for crawling. Kiwix makes pre-indexed dumps of Wikipedia, StackExchange, and other sites available.

Some sort of partnership between crawlers could go a long way. Have you considered contributing content back towards the Common Crawl?


There seems to be a threshold where you get greylisted by Cloudflare. Not sure if it's requests per day or what they're doing, but I've been able to mostly circumvent it by crawling at a modest rate.


First off, nice work!

This seems like a reasonable fallback option, but it's also a weaker one. By "most of the issues are non-technical", do you mean that you need special permission from someone like Cloudflare to get "crawl rights"?


Hello,

My name is Josef Cullhed. I am the programmer of alexandria.org and one of its two founders. We want to build an open-source, non-profit search engine; right now we are developing it in our spare time and funding the servers ourselves. We are indexing Common Crawl, and the search engine is at a really early stage.

We would be super happy to find more developers who want to help us.


I was trying to learn more about the ranking algorithm that Alexandria uses, and I was a bit confused by the documentation on Github for it. Would I be correct in that it uses "Harmonic Centrality" (http://vigna.di.unimi.it/ftp/papers/AxiomsForCentrality.pdf) at least for part of the algorithm?


Hi,

Yes, our documentation is probably pretty confusing. It works like this: the base score for every URL on a given domain is that domain's harmonic centrality (HC). We have two indexes, one with URLs and one with links (we index the link text). A query first searches the links, then the URLs. We then update the score of the URLs based on the links with these formulas:

domain_score = expm1(5 * link.m_score) + 0.1;
url_score = expm1(10 * link.m_score) + 0.1;

Then we add the domain and URL scores to url.m_score, where link.m_score is the HC of the source domain.
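The update described above can be sketched roughly like this in Python (function and argument names here are illustrative, not the actual Alexandria code):

```python
from math import expm1

def update_url_score(base_hc: float, link_score: float) -> float:
    """Sketch of the described scoring: the base score is the domain's
    harmonic centrality (HC); each matching link boosts it via two
    expm1 terms, as in the formulas quoted above."""
    domain_score = expm1(5 * link_score) + 0.1
    url_score = expm1(10 * link_score) + 0.1
    return base_hc + domain_score + url_score

print(update_url_score(0.0, 0.0))  # 0.2 (just the two 0.1 offsets)
```

Because expm1 grows exponentially, links from high-HC domains dominate the final score very quickly.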


The main scoring function seems to be index_builder<data_record>::calculate_score_for_record() in line 296 of https://github.com/alexandria-org/alexandria/blob/main/src/i..., and it mentions support for BM25 (Spärck Jones, Walker and Robertson, 1976) and TFIDF (Spärck Jones, 1972) term weighting, pointing to the respective Wikipedia pages.


This is actually not used yet. We're working on implementing that as a factor.
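For reference, a textbook BM25 term weight, the kind of factor being discussed, could look like this (a generic sketch, not Alexandria's implementation; parameter values are the common defaults):

```python
from math import log

def bm25_term_weight(tf: int, df: int, n_docs: int,
                     doc_len: int, avg_doc_len: float,
                     k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 weight for one term in one document.
    tf: term frequency in the document, df: number of documents
    containing the term, n_docs: collection size."""
    idf = log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

Rare terms (low df) and terms frequent within a document score higher, with the doc-length normalization preventing long pages from winning by sheer size.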


Thanks for sharing this with the world. Did you manage to include all of a Common Crawl snapshot in an index? How long did it take to produce such an index? Is your index in-memory or on disk?

I'd consider contributing. Seems you have something here.


The index we are running right now contains all URLs in Common Crawl from 2021, but only URLs with direct links to them. This is mostly because we would need more servers to index more URLs, and that would increase the cost.

It takes us a couple of days to build the index, but we have been coding this for about a year.

All the indexes are on disk.


>> All the indexes are on disk.

Love it. Makes for cheaper infrastructure, since SSD is cheaper than RAM.

>> It takes us a couple of days to build the index

It's hard for me to see how that could be done much faster unless you find a way to parallelize the process, which in itself is a terrifyingly hard problem.

I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing? In your view, what kind of data structure allows for the fastest indexing, and how do you represent it on disk so that the index can be read in a forward-only mode, or "as fast as possible"?


Yes it would be impossible to keep the index in RAM.

>> It's hard for me to see how that could be done much faster unless you find a way to parallelize the process

We actually do parallelize the process, by splitting the URLs across three different servers and indexing them separately. Then we just run the searches on all three servers and merge the resulting URLs.
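That scatter-gather pattern can be sketched like this (a simplified single-process illustration with made-up data; the real system queries remote servers):

```python
import heapq

def merge_shard_results(per_shard_results, k=10):
    """Merge per-shard result lists into one ranked list.
    Each shard returns a list of (score, url) tuples already
    sorted best-first, so a k-way merge suffices."""
    merged = heapq.merge(*per_shard_results, reverse=True)
    return [url for _score, url in merged][:k]

shard_a = [(5.0, "example.net/z"), (3.2, "example.com/a")]
shard_b = [(2.7, "example.org/x"), (0.4, "example.net/y")]
print(merge_shard_results([shard_a, shard_b], k=3))
# ['example.net/z', 'example.com/a', 'example.org/x']
```

Because each shard holds a disjoint set of URLs, merging never has to deduplicate across shards.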

>> I haven't read your code yet, obviously, but could you give us a hint as to what kind of data structure you use for indexing?

It is not very complicated; we use hashes a lot to simplify things. The index is basically a really large hash table mapping word_hash -> [list of URL hashes]. If you search for "The lazy fox", we just take the intersection of the three lists of URL hashes to get all the URLs that contain all three words. That is the basic idea implemented right now, but we will of course try to improve it.

Details are here: https://github.com/alexandria-org/alexandria/blob/main/src/i...
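The structure described above can be sketched in a few lines (the real index lives on disk; this in-memory version with Python sets is only an illustration of the word_hash -> url_hashes idea):

```python
from collections import defaultdict

# A big hash table from word hash to a set of URL hashes,
# queried by intersecting the per-word sets.
index = defaultdict(set)

def add_document(url: str, text: str) -> None:
    for word in text.lower().split():
        index[hash(word)].add(hash(url))

def search(query: str) -> set:
    """Return URL hashes of documents containing ALL query words."""
    sets = [index.get(hash(w), set()) for w in query.lower().split()]
    return set.intersection(*sets) if sets else set()

add_document("a.com/1", "the lazy fox")
add_document("a.com/2", "the quick brown fox")
print(len(search("the fox")))  # both documents match
```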


I realize I'm asking for a free ride here, but could you explain what happens after the index scan? In a phrase search you'd need to intersect, union or remove from the results. Are you using roaring bitmaps or something similar?


We are currently just doing an intersection, and then we make a lookup in a forward index to get the URLs, titles, and snippets.

I actually don't know what roaring bitmaps are, please enlighten me :)


If you only need union or only need intersection, then roaring bitmaps are probably not a perfect solution to any of your problems.

There are algorithms optimized for intersect, union, and remove (OR, AND, NOT) that work extremely well on sorted lists, but the problem is usually how to efficiently sort the lists you wish to perform boolean operations on, so that you can then apply the roaring-bitmap algorithms to them.

https://roaringbitmap.org/
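The baseline these libraries improve on is a linear merge over two sorted posting lists, which runs in O(m + n):

```python
def intersect_sorted(a: list, b: list) -> list:
    """Intersect two ascending sorted lists with a two-pointer merge.
    Roaring bitmaps speed up the same operation by chunking the id
    space and picking array or bitmap containers per chunk."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect_sorted([1, 3, 5, 7, 9], [3, 4, 5, 9, 10]))  # [3, 5, 9]
```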


Roaring bitmaps are awesome. I use them when merging indices: I need to know which items to keep from the old index, so I calculate the intersection between two sets with a cardinality of around 500,000,000 each. Without breaking a sweat.


Oh boy, I have too many questions. I'd appreciate any answers you're able/willing to give:

1. Do you have any plans to support the parsing of any additional metadata (e.g. semantic HTML, microformats, schema.org structured data, Open Graph, Dublin Core, etc.)?

2. How do you plan to address duplicate content? Engines like Google and Bing filter out pages containing the same content, which is welcome due to the amount of syndication that occurs online. `rel="canonical"` is a start, but it alone is not enough.

3. With the ranking algorithm being open-source, is there a plan to address SEO spam that takes advantage of Alexandria's ranking algo? I know this was an issue for Gigablast, which is why some parts of the repo fell out of sync with the live engine.

4. What are some of your favorite search engines? Have you considered collaboration with any?


Hello, Ivan here (the other founder).

1. Yes, any structured data could definitely help improve the results, I personally like the Wikidata dataset. It's just a matter of time and resources :)

2. The first step will probably be to handle this in our "post processing". We query several servers when doing a search and often get many more results than we need and in this step we could quite easily remove identical results.

3. The ranking is currently heavily based on links (same as Google) so we will have similar issues. But hopefully we will find some ways to better determine what sites are actually trustworthy, perhaps with more manually verified sites if enough people would want to contribute.

4. I think Gigablast and Marginalia Search are really cool, and it is interesting to see how much can be done by a very small team.
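The duplicate-removal step mentioned in point 2 could be sketched like this (hypothetical names; hashing the page content only catches byte-identical duplicates, so near-duplicate detection would need something like shingling on top):

```python
import hashlib

def dedupe(results):
    """Drop results whose content hashes to a value already seen,
    keeping the first (highest-ranked) occurrence."""
    seen = set()
    out = []
    for url, content in results:
        digest = hashlib.sha256(content.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append((url, content))
    return out

hits = [("a.com/x", "same article"),
        ("mirror.com/x", "same article"),
        ("b.com/y", "different article")]
print([url for url, _ in dedupe(hits)])  # ['a.com/x', 'b.com/y']
```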


> Yes, any structured data could definitely help improve the results

Which syntaxes and vocabularies do you prefer? Microformats, as well as schema.org vocabularies represented as Microdata or JSON-LD, seem to be the most common according to the latest Web Data Commons Extraction Report[0]. The report is also powered by the Common Crawl.

[0]: http://webdatacommons.org/structureddata/2021-12/stats/stats...


The UI is amazing. Don't change it significantly!


Very impressive work so far!

Apologies if I missed it (and solely out of curiosity), but roughly how much does hosting Alexandria Search cost per month? (I'm assuming you've optimized for cost to avoid spending your own money!)

I have some other questions (around crawlers, parsing, and dependencies), but I need to read the other comments first (to see if my questions have already been answered).


Thanks!

The active index is running on 4 servers, and we have one server hosting the frontend and the API (the API is what the frontend uses, e.g. https://api.alexandria.org/?q=hacker%20news).

Then we have one file server storing raw data to be indexed. The cost for those six servers is around 520 USD per month.


I searched for a competitive keyword my SaaS business recently reached #1 on Google for. All of our competitors came up, but we were nowhere to be seen (I gave up after page 5).

Does this mean we're not in Common Crawl? Or are there factors you weight much more heavily than Google might?


What will be the USP that makes it the success we are all waiting for? At the moment I'm switching between DDG and Google.


I just think the timing is right. We are at a point in time where it does not cost billions of dollars to build a search engine like it did 20 years ago. The relevant part of the internet is probably shrinking, and Moore's Law is making computing exponentially cheaper, so there has to be an inflection point somewhere.

We hope we can become a useful search engine powered by open source and donations instead of ads.


Hello! I was studying B+ trees today. In the morning I browsed Hacker News and saw alexandria.org, opened the tab, and kept it open as I went about my day. Later, frustrated with my search results, I noticed the Alexandria tab and tried it: every result was meaningful. Well done.


Awesome work.

1. How do you plan to finance?

2. How will you avoid SEO?

3. What kind of help would be most welcome?


Hello, Ivan here (the other founder).

1. We would prefer to be funded with donations like Wikipedia.

2. I don't think we can avoid it completely, perhaps with volunteers helping us determine the trustworthiness of websites. Do you have any suggestions?

3. I think programmers and people with experience raising money for nonprofits could help the most right now. But if you see some other way you would want to contribute, please let us know!


Regarding raising money: I wouldn't be surprised if, given the current state of things in the EU, you could manage to get some funding. I have no experience with it, but there are companies that specialize in helping write grant proposals.


Interesting. In Sweden I only got this story when I made the same searches: https://imgur.com/a/k1Avbtm

