This is most definitely not the same thing. The indexing requirements are not "This site must be an independent or personal site"; they're "This site must lean towards being a plain HTML document".
The Search My Site, from what I can tell, has the goal of surfacing personal/independent websites, while Wiby has the goal of surfacing minimally styled documents.
> In the early days of the web, pages were made primarily by hobbyists, academics, and computer savvy people about subjects they were personally interested in. Later on, the web became saturated with commercial pages that overcrowded everything else. All the personalized websites are hidden among a pile of commercial pages.
> […]
> The Wiby search engine is building a web of pages as it was in the earlier days of the internet.
I'm certainly not presenting Wiby as being the same thing, merely as something worth a mention, since it's likely to be of interest to anyone interested in Search My Site.
Pagefind uses an index that is created ahead of time and stored as numerous files on a static site. It then downloads just the part of the index needed to complete the search. This means that you can search vastly more data than could be loaded into the browser in one go.
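The chunked-index idea can be sketched in a few lines. This is a toy illustration, not Pagefind's actual format: the index is pre-split into shards (here, by first letter), and a "client" only loads the shard a query needs, the way a browser would fetch just one static file.

```python
# Toy sketch of a pre-built, sharded search index: the full index is
# split into small pieces ahead of time, and a search loads only the
# one piece it needs (imagine each shard as a separate static file).
from collections import defaultdict

def build_shards(pages):
    """pages: {url: text}. Returns {shard_key: {word: set(urls)}}."""
    shards = defaultdict(lambda: defaultdict(set))
    for url, text in pages.items():
        for word in text.lower().split():
            shards[word[0]][word].add(url)  # shard by first letter
    return shards

def search(shards, word):
    word = word.lower()
    shard = shards.get(word[0], {})  # "download" just one shard
    return sorted(shard.get(word, set()))

pages = {
    "/a": "static site search",
    "/b": "solr powers the search index",
}
print(search(build_shards(pages), "search"))  # both pages match
```

The point is only that total index size stops mattering to the client: a search touches one shard, however large the whole index grows.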
I think Pagefind is focused on the whole experience of searching pages, like with default UI widgets, easy page indexing, and handling larger sites. fuse.js seems to be a fuzzy-filter function on JS data, not handling the site integration.
Thanks for putting this together.
I wonder, is Postgres a bit of a large DB if it's just a personal website search tool?
I'll have to give it a go. We need more tools like this.
Postgres is just used for the site admin, i.e. keeping track of submissions, review status, subscriptions etc. The actual search index is in Apache Solr. In theory you could use Solr to store all the admin data too, but it is generally not recommended to use a Solr-style document store as the master copy of your data. I guess something more lightweight like SQLite could be used, but the system is intended to be deployed on servers, and Postgres isn't too resource intensive.
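The separation described above (relational store for admin data, separate engine for the search index) can be sketched with stand-ins. Here SQLite plays the Postgres role and a plain inverted index stands in for Solr; the table and field names are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Admin data (submissions, review status) lives in a relational store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE submissions (domain TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO submissions VALUES ('example.org', 'approved')")

# The searchable content lives in a separate index (Solr's role here).
index = defaultdict(set)
def index_page(url, text):
    for word in text.lower().split():
        index[word].add(url)

# Only approved domains get pushed into the search index.
rows = db.execute("SELECT domain FROM submissions WHERE status='approved'")
for (domain,) in rows:
    index_page(f"https://{domain}/", "a personal site about searching")

print(sorted(index["personal"]))
```

The relational side masters the data (and can be rebuilt into the index at any time); the index side only ever holds derived, re-creatable documents.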
I like this, thank you! I just lost an hour of time to the exact sort of random but considered personal websites that I think made the Web great in the first place.
Thanks for the great feedback :-) This is what searchmysite.net is attempting to do: help make "surfing the web" a fun leisure activity once more. It is good to see more people seem to get that point now. When it was on HN nearly 3 years ago[0], many people saw a search box and thought it must be a Google replacement, but were disappointed to find it wasn't. And I guess now more than ever it is useful to have a way of finding content on the web which has been made by humans rather than AI.
Ironically, given Google's stranglehold over the past decade, I strongly feel that one of the big winners in the AI space is going to be the backend search engine.
Modern web search has become so polluted, with so many tricks to get to the front page of Google, that a lot (most?) of the good content is lost.
Now that many of the big models are capable of calling out to the web, this bloat is appearing in AI search too. What is needed is a proper data-first engine: no ads, less focus on presentation, and more on structured data.
The LLM was for an experiment in retrieval augmented generation, i.e. "a chat with your website" style interface, using Apache Solr as the vector store. Results (on a small self-hosted LLM to keep costs manageable) weren't good enough for the functionality to be fully rolled out, so the LLM has been disabled and is likely to be fully removed.
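The retrieval-augmented setup described above can be sketched without any real LLM or vector store: embed the documents (here a crude bag-of-words vector stands in for a learned embedding), retrieve the most similar one for a question, and prepend it to the prompt. Everything below is a toy stand-in, not the site's actual pipeline.

```python
from collections import Counter
from math import sqrt

def embed(text):
    return Counter(text.lower().split())  # crude bag-of-words "embedding"

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "the site is built with apache solr and postgres",
    "submissions are reviewed before being indexed",
]

def retrieve(question):
    # In the real system this is a nearest-neighbour lookup in the vector store.
    q = embed(question)
    return max(docs, key=lambda d: cosine(q, embed(d)))

def build_prompt(question):
    # The retrieved passage is prepended as context for the LLM to answer from.
    return f"Context: {retrieve(question)}\nQuestion: {question}"

print(build_prompt("what is the site built with"))
```

With a small self-hosted model, answer quality hinges almost entirely on whether this retrieval step surfaces the right passage, which matches the mixed results described above.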
At a big corporate, we had an Apache Solr based search with some reasonably clever lemmatization, stats analysis, and spell-check config to suggest alternative searches when a query returned few results. One day someone reported an unfortunate edge case which caused a bit of a panic: if you searched "annual report" it returned "did you mean anal report?" (We were in the finance sector rather than the medical sector, but the corpus contained far more documents with words like analysts, analysis, analytics etc.) Anyway, the point is: yes, it is great to have that sort of functionality, but it does come at a cost, and a small project like this might prefer to keep it simple.
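That failure mode is easy to reproduce: if suggestions are drawn from the corpus vocabulary by string similarity, a vocabulary that is heavy in "anal*" stems and happens to lack "annual" produces exactly that suggestion. A minimal sketch using the standard library's difflib (the vocabulary is invented for illustration, not taken from any real system):

```python
import difflib

# Vocabulary as it might look in a finance corpus where "annual"
# happens to be absent but "anal*" stems are everywhere.
vocab = ["analysis", "analysts", "analytics", "anal", "report"]

def did_you_mean(query):
    suggestion = []
    for token in query.lower().split():
        if token in vocab:
            suggestion.append(token)  # known word: keep it
        else:
            # Suggest the closest corpus word by string similarity.
            close = difflib.get_close_matches(token, vocab, n=1)
            suggestion.append(close[0] if close else token)
    return " ".join(suggestion)

print(did_you_mean("annual report"))  # prints "anal report"
```

Real spell-check components add frequency weighting and result-count checks on top, but the underlying risk is the same: suggestions come from the corpus, not from a curated list.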
Generating suggestions from something other than what your users have already given you is inevitably going to result in something different and potentially offensive being shown to them.
One solution is to offer suggestions from a list of previous searches.
Also, that is very much a big-corporate problem: I imagine most searchmysite users are mature and stable enough not to have a meltdown at the word "anal".
But I agree with your point, sometimes seemingly small features take a disproportionate amount of support, and this could be one of them!
Most of the search engines you encounter fail here (press Ctrl+F in your browser and make a typo); it's web search that's different. Though even here it's easy to check without relying only on imagination: how often do you add quotes for literals?
This is exactly what I have been looking for. Like the other commenter, I am a bit surprised by having to drag along psql for this. I like the design of the site, though.
According to the site, the funding comes from its "Search as a Service" feature[0], where anybody can pay them in order to have a search service focused on their site (which does not have to be in the public index and thus doesn't have to be personal/independent).
So, in the sense that the funding (aims to) comes from larger companies, you are correct. It's not VC, but it does seem like it could end up relying on payments from large companies, making it potentially vulnerable.
That's right. Most search engines are funded by advertising, where there is the clear conflict of interest[0], not to mention incentive for spam etc. Alternative models include a subscription fee (which I don't think would work for a small niche search like this) and donations (which may or may not be sustainable). Looking through some of the support forums for the big search engines, I'm pretty sure that enough site owners would pay a fee for support to pay the running costs for a large search engine, although for a smaller search engine like this there needs to be something more than just support, hence the search as a service features.
[0] "Advertising funded search engines will be inherently biased towards the advertisers and away from the needs of consumers", to quote Sergey Brin and Lawrence Page in their "The Anatomy of a Large-Scale Hypertextual Web Search Engine" paper from 1998.
That too, but I was referring to the sites in the search themselves. If they're not generating revenue, they are not independent but in the pay of someone else.
As someone with a commercial digital garden with ads (and a $3/mo sub for the ad-averse), I like to point out the tension in that whenever possible.
Not necessarily. As a counterpoint, I have a site, https://www.automidiflip.com/ (which I posted to HN on launch 8 years ago[0]), which provides a service and does not have (and has never had) any ads whatsoever. (It inserts a reference to automidiflip.com in the MIDIs it creates as a credit, but no ads in the sense that most people mean.)
Granted, it's a niche service, but over the past year it's still been used to flip MIDIs about 23 times a day on average, so it's definitely not unknown. I don't see any need to monetise it, though.
"The Wiby search engine is building a web of pages as it was in the earlier days of the internet."
Its main indexing requirements are:
- "Pages must be simple in design. Simple HTML, non-commercial sites are preferred."
- "Pages should not use much scripts/css for cosmetic effect. Some might squeak through."
- "Don't use ads that are intrusive (such as ads that appear overtop of content)."
- "Don't submit a page which serves primarily as a portal to other bloated websites."
https://wiby.me
> The Search My Site, from what I can tell, has the goal of surfacing personal/independent websites, while Wiby has the goal of surfacing minimally styled documents.
> Two different goals.
> […]
> The Wiby search engine is building a web of pages as it was in the earlier days of the internet.
https://wiby.me/about/
Sounds to me like Wiby is more similar to Search My Site than your comment makes it sound.
It is relevant and has vaguely aligned intent.
- https://nownownow.com/
- https://omg.lol/
- https://indieweb.org/
- https://ooh.directory/
- https://neocities.org/
- https://aboutideasnow.com/
- https://indieblog.page/
- https://wiby.me/
- https://80.style/
Generally I crawl the internet to find pages. The result is in https://github.com/rumca-js/Internet-Places-Database. Personal pages are tagged with a "personal" tag.
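A crawl like that needs some heuristic for the "personal" tag. The signals below are invented for illustration (the linked repository may use entirely different criteria): hosting on known personal-site platforms, tilde home-directory URLs, and first-person about-page phrasing.

```python
import re

# Hypothetical signals for tagging a crawled page as "personal".
PERSONAL_HOSTS = ("neocities.org", "omg.lol", "tilde.club")

def looks_personal(url, text):
    if any(host in url for host in PERSONAL_HOSTS):
        return True
    if re.search(r"/~\w+", url):  # tilde home directories
        return True
    # First-person phrasing is a weak but useful signal.
    return bool(re.search(r"\bmy (blog|site|homepage)\b", text.lower()))

print(looks_personal("https://foo.neocities.org/", ""))           # True
print(looks_personal("https://corp.example/pricing", "Buy now"))  # False
```

In practice a tagger would combine several weak signals like these and accept some manual review, since no single rule separates personal sites from commercial ones.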
[0] https://news.ycombinator.com/item?id=31395231
My go to choice is https://marginalia-search.com/
(Disclaimer: I built it).
An LLM is loaded. What does it add to the solution?
digiatl
> No results found for digiatl.
It's been so long since I had one that really worked that way that I might turn out to hate it though.
> No results found for "digiatl". Did you mean to search for "digital" instead?
That's a good starting point.
That’s a pretty clever filter tbh. It seems so obvious that I’m amazed nobody thought of it before.
I’d love Kagi to have such an option.
I did. I posted it on HN as a comment. It was a very popular (by my standards) comment: https://news.ycombinator.com/item?id=40438288
The thread was interesting, with a lot of people posting rebuttals for why such a scheme would obviously not work. Equally obviously, someone else thought it was a good idea and went and implemented it.
Maybe I misunderstood, but it's not obvious to me that someone read your idea, thought it was good, and then went to implement it.
https://github.com/searchmysite/searchmysite.net/graphs/code... looks like the bulk of the code was added 2 - 4 years ago.
[0] https://searchmysite.net/pages/about/#search-as-a-service
[0] https://news.ycombinator.com/item?id=13553224