Oh boy, I have too many questions. I'd appreciate any answers you're able/willing to give:
1. Do you have any plans to support the parsing of any additional metadata (e.g. semantic HTML, microformats, schema.org structured data, Open Graph, Dublin Core, etc.)?
2. How do you plan to address duplicate content? Engines like Google and Bing filter out pages containing the same content, which is welcome due to the amount of syndication that occurs online. `rel="canonical"` is a start, but it alone is not enough.
3. With the ranking algorithm being open-source, is there a plan to address SEO spam that takes advantage of Alexandria's ranking algo? I know this was an issue for Gigablast, which is why some parts of the repo fell out of sync with the live engine.
4. What are some of your favorite search engines? Have you considered collaboration with any?
1. Yes, any structured data could definitely help improve the results; I personally like the Wikidata dataset. It's just a matter of time and resources :)
2. The first step will probably be to handle this in our post-processing. We query several servers when doing a search and often get many more results than we need, and in that step we could quite easily remove identical results.
3. The ranking is currently heavily based on links (same as Google), so we will have similar issues. But hopefully we will find ways to better determine which sites are actually trustworthy, perhaps with more manually verified sites if enough people want to contribute.
4. I think Gigablast and Marginalia Search are really cool; it's interesting to see how much can be done by a very small team.
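On point 1, consuming schema.org structured data can start very small. Here is a stdlib-only sketch that pulls JSON-LD blocks out of a page; the HTML snippet, class name, and fields are purely illustrative, not anything from Alexandria's codebase:

```python
import json
from html.parser import HTMLParser

class JSONLDExtractor(HTMLParser):
    """Collects the contents of <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples from HTMLParser.
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self.in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_jsonld = False

    def handle_data(self, data):
        if self.in_jsonld:
            self.blocks.append(json.loads(data))

html = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Example"}
</script>
</head><body></body></html>"""

parser = JSONLDExtractor()
parser.feed(html)
print(parser.blocks[0]["@type"])  # Article
```

A real crawler would need to tolerate malformed JSON and multiple blocks per page, but the extraction itself is this simple.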
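For the post-processing step in point 2, dropping identical results could be as simple as hashing normalized body text and keeping the first (i.e. highest-ranked) occurrence. This is only a sketch of the idea, not Alexandria's actual pipeline:

```python
import hashlib

def dedupe(results):
    """Drop results whose normalized body text hashes identically,
    keeping the first occurrence (e.g. the highest-ranked one)."""
    seen = set()
    unique = []
    for r in results:
        # Normalize whitespace and case so trivial variants collapse.
        normalized = " ".join(r["text"].lower().split())
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(r)
    return unique

results = [
    {"url": "https://a.example/post",  "text": "Same syndicated article."},
    {"url": "https://b.example/copy",  "text": "Same  syndicated article."},
    {"url": "https://c.example/other", "text": "Different content."},
]
print([r["url"] for r in dedupe(results)])
# ['https://a.example/post', 'https://c.example/other']
```

Exact hashing only catches byte-identical (after normalization) copies; catching near-duplicates from syndication would need something like shingling or SimHash on top.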
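And on point 3, "heavily based on links" ranking in the PageRank family boils down to a short power iteration. This is just an illustration of the general technique, not the engine's real algorithm:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a base share, plus damped rank flowing along links.
        new = {p: (1 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if not outgoing:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outgoing:
                    new[q] += damping * rank[p] / len(outgoing)
        rank = new
    return rank

links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # c (it receives links from both a and b)
```

The spam problem mentioned above follows directly from this structure: anyone who can manufacture inbound links can inflate a page's rank, which is why trust signals have to come from somewhere outside the link graph.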
> Yes, any structured data could definitely help improve the results
Which syntaxes and vocabularies do you prefer? microformats, as well as schema.org vocabularies represented as Microdata or JSON-LD, seem to be the most common according to the latest Web Data Commons Extraction Report[0]. The report is also powered by the Common Crawl.