This is most definitely not the same thing. The indexing requirements are not "This site must be an independent or personal site"; they're "This site must lean towards being a plain HTML document".
The Search My Site, from what I can tell, has the goal of surfacing personal/independent websites, while Wiby has the goal of surfacing minimally styled documents.
> In the early days of the web, pages were made primarily by hobbyists, academics, and computer savvy people about subjects they were personally interested in. Later on, the web became saturated with commercial pages that overcrowded everything else. All the personalized websites are hidden among a pile of commercial pages.
> […]
> The Wiby search engine is building a web of pages as it was in the earlier days of the internet.
I'm certainly not presenting Wiby as being the same thing, merely as something worth a mention, since it's likely to be of interest to anyone interested in Search My Site.
Pagefind uses an index that is created ahead of time and stored as numerous files on a static site. It then downloads just the part of the index needed to complete the search. This means that you can search vastly more data than could be loaded into the browser in one go.
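The chunked-index idea can be sketched in a few lines. This is a toy illustration, not Pagefind's actual format: the index is pre-split into shards (here, by first letter), and a "client" only loads the shard a query needs, the way a browser would fetch just one static file.

```python
# Toy sketch of a pre-built, sharded search index: the full index is
# split into small pieces ahead of time, and a search loads only the
# one piece it needs (imagine each shard as a separate static file).
from collections import defaultdict

def build_shards(pages):
    """pages: {url: text}. Returns {shard_key: {word: set(urls)}}."""
    shards = defaultdict(lambda: defaultdict(set))
    for url, text in pages.items():
        for word in text.lower().split():
            shards[word[0]][word].add(url)  # shard by first letter
    return shards

def search(shards, word):
    word = word.lower()
    shard = shards.get(word[0], {})  # "download" just one shard
    return sorted(shard.get(word, set()))

pages = {
    "/a": "static site search",
    "/b": "solr powers the search index",
}
print(search(build_shards(pages), "search"))  # both pages match
```

The point is only that total index size stops mattering to the client: a search touches one shard, however large the whole index grows.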
I think Pagefind is focused on the whole experience of searching pages, like with default UI widgets, easy page indexing, and handling larger sites. fuse.js seems to be a fuzzy-filter function on JS data, not handling the site integration.
Thanks for putting this together.
I wonder, is Postgres a bit of a large DB if it's just a personal website search tool?
I'll have to give it a go. We need more tools like this.
Postgres is just used for the site admin, i.e. keeping track of submissions, review status, subscriptions etc. The actual search index is in Apache Solr. In theory you could use Solr to store all the admin data too, but it is generally not recommended to use a Solr-style document store as the master copy of your data. I guess something more lightweight like SQLite could be used, but the system is intended to be deployed on servers, and Postgres isn't too resource intensive.
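The separation described above (relational store for admin data, separate engine for the search index) can be sketched with stand-ins. Here SQLite plays the Postgres role and a plain inverted index stands in for Solr; the table and field names are invented for illustration.

```python
import sqlite3
from collections import defaultdict

# Admin data (submissions, review status) lives in a relational store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE submissions (domain TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO submissions VALUES ('example.org', 'approved')")

# The searchable content lives in a separate index (Solr's role here).
index = defaultdict(set)
def index_page(url, text):
    for word in text.lower().split():
        index[word].add(url)

# Only approved domains get pushed into the search index.
rows = db.execute("SELECT domain FROM submissions WHERE status='approved'")
for (domain,) in rows:
    index_page(f"https://{domain}/", "a personal site about searching")

print(sorted(index["personal"]))
```

The relational side masters the data (and can be rebuilt into the index at any time); the index side only ever holds derived, re-creatable documents.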
I like this, thank you! I just lost an hour of time to the exact sort of random but considered personal websites that I think made the Web great in the first place.
Thanks for the great feedback :-) This is what searchmysite.net is attempting to do: help make "surfing the web" a fun leisure activity once more. It is good to see more people seem to get that point now. When it was on HN nearly 3 years ago[0], many people saw a search box and thought it must be a Google replacement, but were disappointed to find it wasn't. And I guess now more than ever it is useful to have a way of finding content on the web which has been made by humans rather than AI.
Ironically, given Google's stranglehold over the past decade, I strongly feel that one of the big winners in the AI space is going to be the backend search engine.
Modern web search has become so polluted, with so many tricks to get to the front page of Google, that a lot (most?) of the good content is lost.
Now that many of the big models are capable of calling out to the web, this bloat is appearing in AI search too. What is needed is a proper data-first engine: no ads, less focus on presentation, and more on structured data.
The LLM was for an experiment in retrieval augmented generation, i.e. "a chat with your website" style interface, using Apache Solr as the vector store. Results (on a small self-hosted LLM to keep costs manageable) weren't good enough for the functionality to be fully rolled out, so the LLM has been disabled and is likely to be fully removed.
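The retrieval-augmented setup described above can be sketched without any real LLM or vector store: embed the documents (here a crude bag-of-words vector stands in for a learned embedding), retrieve the most similar one for a question, and prepend it to the prompt. Everything below is a toy stand-in, not the site's actual pipeline.

```python
from collections import Counter
from math import sqrt

def embed(text):
    return Counter(text.lower().split())  # crude bag-of-words "embedding"

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "the site is built with apache solr and postgres",
    "submissions are reviewed before being indexed",
]

def retrieve(question):
    # In the real system this is a nearest-neighbour lookup in the vector store.
    q = embed(question)
    return max(docs, key=lambda d: cosine(q, embed(d)))

def build_prompt(question):
    # The retrieved passage is prepended as context for the LLM to answer from.
    return f"Context: {retrieve(question)}\nQuestion: {question}"

print(build_prompt("what is the site built with"))
```

With a small self-hosted model, answer quality hinges almost entirely on whether this retrieval step surfaces the right passage, which matches the mixed results described above.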
At a big corporate, we had an Apache Solr based search with some reasonably clever lemmatization, stats analysis, and spell-check config to suggest alternative searches when a query returned few results. One day someone reported an unfortunate edge case which caused a bit of a panic: if you searched "annual report" it returned "did you mean anal report?" (We were in the finance sector rather than the medical sector, but the corpus contained far more documents with words like analysts, analysis, analytics etc.) Anyway, the point is: yes, it is great to have that sort of functionality, but it does come at a cost, and a small project like this might prefer to keep it simple.
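That failure mode is easy to reproduce: if suggestions are drawn from the corpus vocabulary by string similarity, a vocabulary that is heavy in "anal*" stems and happens to lack "annual" produces exactly that suggestion. A minimal sketch using the standard library's difflib (the vocabulary is invented for illustration, not taken from any real system):

```python
import difflib

# Vocabulary as it might look in a finance corpus where "annual"
# happens to be absent but "anal*" stems are everywhere.
vocab = ["analysis", "analysts", "analytics", "anal", "report"]

def did_you_mean(query):
    suggestion = []
    for token in query.lower().split():
        if token in vocab:
            suggestion.append(token)  # known word: keep it
        else:
            # Suggest the closest corpus word by string similarity.
            close = difflib.get_close_matches(token, vocab, n=1)
            suggestion.append(close[0] if close else token)
    return " ".join(suggestion)

print(did_you_mean("annual report"))  # prints "anal report"
```

Real spell-check components add frequency weighting and result-count checks on top, but the underlying risk is the same: suggestions come from the corpus, not from a curated list.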
Generating suggestions from something other than what your users have already given you is inevitably going to result in something different and potentially offensive being shown to them.
One solution is to offer suggestions from a list of previous searches.
Also, that is very much a big-corporate problem: I imagine most searchmysite users are mature and stable enough not to have a meltdown at the word "anal".
But I agree with your point, sometimes seemingly small features take a disproportionate amount of support, and this could be one of them!
Most of the search engines you encounter fail here (press Ctrl+F in your browser and make a typo); it's web search that's different. Though even here it's easy to check without relying only on imagination: how often do you add quotes for literals?
This is exactly what I have been looking for. Like the other commenter, I am a bit surprised by having to drag along psql for this. I like the design of the site, though.
According to the site, the funding comes from its "Search as a Service" feature[0], where anybody can pay them in order to have a search service focused on their site (which does not have to be in the public index and thus doesn't have to be personal/independent).
So, in the sense that the funding (aims to) comes from larger companies, you are correct. It's not VC, but it does seem like it could end up relying on payments from large companies, making it potentially vulnerable.
That's right. Most search engines are funded by advertising, where there is the clear conflict of interest[0], not to mention incentive for spam etc. Alternative models include a subscription fee (which I don't think would work for a small niche search like this) and donations (which may or may not be sustainable). Looking through some of the support forums for the big search engines, I'm pretty sure that enough site owners would pay a fee for support to pay the running costs for a large search engine, although for a smaller search engine like this there needs to be something more than just support, hence the search as a service features.
[0] "Advertising funded search engines will be inherently biased towards the advertisers and away from the needs of consumers", to quote Sergey Brin and Lawrence Page in their "The Anatomy of a Large-Scale Hypertextual Web Search Engine" paper from 1998.
That too, but I was referring to the sites in the search themselves. If they're not generating revenue, they are not independent but in the pay of someone else.
As someone with a commercial digital garden with ads (and a $3/mo sub for the ad-averse), I like to point out the tension in that whenever possible.
Not necessarily. As a counterpoint, I have a site, https://www.automidiflip.com/ (which I posted to HN on launch 8 years ago[0]), which provides a service and does not have (and has never had) any ads whatsoever. (It inserts a reference to automidiflip.com in the MIDIs it creates as a credit, but no ads in the sense that most people mean.)
Granted, it's a niche service, but over the past year it's still been used to flip MIDIs about 23 times a day on average, so it's definitely not unknown. I don't see any need to monetise it, though.
"The Wiby search engine is building a web of pages as it was in the earlier days of the internet."
Its main indexing requirements are:
- "Pages must be simple in design. Simple HTML, non-commercial sites are preferred."
- "Pages should not use much scripts/css for cosmetic effect. Some might squeak through."
- "Don't use ads that are intrusive (such as ads that appear overtop of content)."
- "Don't submit a page which serves primarily as a portal to other bloated websites."
https://wiby.me
> The Search My Site, from what I can tell, has the goal of surfacing personal/independent websites, while Wiby has the goal of surfacing minimally styled documents.
> Two different goals.
> […]
> The Wiby search engine is building a web of pages as it was in the earlier days of the internet.
https://wiby.me/about/
Sounds to me like Wiby is more similar to Search My Site than your comment makes it sound.
It is relevant and has vaguely aligned intent.
- https://nownownow.com/
- https://omg.lol/
- https://indieweb.org/
- https://ooh.directory/
- https://neocities.org/
- https://aboutideasnow.com/
- https://indieblog.page/
- https://wiby.me/
- https://80.style/
Generally I crawl the internet to find pages. The result is in https://github.com/rumca-js/Internet-Places-Database. Personal pages are tagged with a "personal" tag.
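A crawl like that needs some heuristic for the "personal" tag. The signals below are invented for illustration (the linked repository may use entirely different criteria): hosting on known personal-site platforms, tilde home-directory URLs, and first-person about-page phrasing.

```python
import re

# Hypothetical signals for tagging a crawled page as "personal".
PERSONAL_HOSTS = ("neocities.org", "omg.lol", "tilde.club")

def looks_personal(url, text):
    if any(host in url for host in PERSONAL_HOSTS):
        return True
    if re.search(r"/~\w+", url):  # tilde home directories
        return True
    # First-person phrasing is a weak but useful signal.
    return bool(re.search(r"\bmy (blog|site|homepage)\b", text.lower()))

print(looks_personal("https://foo.neocities.org/", ""))           # True
print(looks_personal("https://corp.example/pricing", "Buy now"))  # False
```

In practice a tagger would combine several weak signals like these and accept some manual review, since no single rule separates personal sites from commercial ones.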
[0] https://news.ycombinator.com/item?id=31395231
My go to choice is https://marginalia-search.com/
(Disclaimer: I built it).
An LLM is loaded. What does it add to the solution?
digiatl
> No results found for digiatl.
It's been so long since I had one that really worked that way that I might turn out to hate it though.
> No results found for "digiatl". Did you mean to search for "digital" instead?
That's a good starting point.
That’s a pretty clever filter tbh. It seems so obvious that I’m amazed nobody thought of it before.
I’d love Kagi to have such an option.
I did. I posted it on HN as a comment. It was a very popular (by my standards) comment: https://news.ycombinator.com/item?id=40438288
The thread was interesting, with a lot of people posting rebuttals for why such a scheme would obviously not work. Equally obviously, someone else thought it was a good idea and went and implemented it.
Maybe I misunderstood, but it's not obvious to me that someone read your idea, thought it was good, and then went to implement it.
https://github.com/searchmysite/searchmysite.net/graphs/code... looks like the bulk of the code was added 2 - 4 years ago.
[0] https://searchmysite.net/pages/about/#search-as-a-service
[0] https://news.ycombinator.com/item?id=13553224