Hi HN, author here. SHARP is Apple's recent single-image 3D Gaussian splatting model (https://arxiv.org/abs/2512.10685). Their reference code is PyTorch + a pretty heavy pipeline; I wanted to see if it could run in a browser with no server hop, so I exported the predictor to ONNX and ran it via onnxruntime-web with the WebGPU EP.
What works: drop in an image, get a .ply you can download or preview live, all on your machine — your image never leaves the tab. The model is large (~2.4 GB sidecar) so first load is slow on a cold cache, but inference itself is a few seconds on a recent Mac.
Caveats: SHARP's released weights are research-use only (Apple's model license, not the code's). I host the exported ONNX on R2 so the demo "just works", but you can also export your own from the upstream Apple repo and upload locally.
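For readers wondering what "onnxruntime-web with the WebGPU EP" looks like in practice, here is a minimal sketch. It is not the demo's actual code. It assumes the `onnxruntime-web` package, whose `InferenceSession.create` takes an `executionProviders` list; the helper name `pickExecutionProviders`, the WASM fallback, and the model URL in the comment are all hypothetical.

```typescript
// Sketch only: choose execution providers for onnxruntime-web.
// Assumption: prefer WebGPU and fall back to the WASM (CPU) provider.
type ExecutionProvider = "webgpu" | "wasm";

// `gpu` is whatever `navigator.gpu` evaluates to (undefined without WebGPU).
function pickExecutionProviders(gpu: unknown): ExecutionProvider[] {
  return gpu ? ["webgpu", "wasm"] : ["wasm"];
}

// In a browser, usage would look roughly like this (hypothetical URL,
// `imageTensor` prepared elsewhere):
//
//   import * as ort from "onnxruntime-web";
//   const session = await ort.InferenceSession.create("/models/sharp.onnx", {
//     executionProviders: pickExecutionProviders((navigator as any).gpu),
//   });
//   const outputs = await session.run({ [session.inputNames[0]]: imageTensor });
```

Whether a model this large would run at all on the fallback provider is a separate question; the sketch only shows how the provider list is passed.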
A *2.4 GB* ONNX? That is wild. This format continues to impress me. ONNX uses 32-bit single-precision floats I believe, so that's something like ~644M float params/constants. I recently dove deep into the 'traditional ML' side of the ONNX serialization format for the purposes of writing a JVM ML compiler for trees and regressions. ONNX is actually quite clever in the way it serializes trees into parallel arrays (which are then serialized using protobuf). My trees have capped out at < 32 MB. I haven't dived into the neural net side of things yet, mainly because I don't have any models to run in prod. (https://github.com/exabrial/petrify if anyone is interested.)
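The "trees as parallel arrays" idea can be illustrated with a small sketch. The layout below is a simplified, hypothetical one loosely modeled on the ONNX-ML tree ensemble attributes (`nodes_featureids`, `nodes_values`, `nodes_truenodeids`, `nodes_falsenodeids`). Real models also carry tree ids, several branch modes, and separate target arrays for leaf weights, all of which are left out here. The field names and the `predict` helper are illustrative assumptions.

```typescript
// One tree stored as parallel arrays: index i in every array describes node i.
interface FlatTree {
  featureIds: number[]; // feature tested at node i (ignored for leaves)
  thresholds: number[]; // split threshold at node i (ignored for leaves)
  trueIds: number[];    // next node when x[feature] <= threshold
  falseIds: number[];   // next node otherwise
  isLeaf: boolean[];    // true when node i is a leaf
  leafValues: number[]; // prediction at node i (ignored for branches)
}

// Walk from the root (index 0 by convention in this sketch) to a leaf.
function predict(tree: FlatTree, x: number[]): number {
  let node = 0;
  while (!tree.isLeaf[node]) {
    const goTrue = x[tree.featureIds[node]] <= tree.thresholds[node];
    node = goTrue ? tree.trueIds[node] : tree.falseIds[node];
  }
  return tree.leafValues[node];
}

// A depth-1 stump: "if x[0] <= 0.5 then 10 else 20".
const stump: FlatTree = {
  featureIds: [0, 0, 0],
  thresholds: [0.5, 0, 0],
  trueIds: [1, 0, 0],
  falseIds: [2, 0, 0],
  isLeaf: [false, true, true],
  leafValues: [0, 10, 20],
};

console.log(predict(stump, [0.2])); // 10
console.log(predict(stump, [0.9])); // 20
```

Because every array is flat and homogeneous, the whole tree maps directly onto repeated protobuf fields with no nested node objects.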
I vibecoded a simple web app using Sharp that allowed me to quickly browse any local image folder and view them as "almost" volumetric 3D scenes in a VR headset.
I precomputed and cached each one so it was nearly instant. The effect - although only a crude wrapper around what Sharp already does - was quite transformative and mesmerising. Just the ease of pointing it at any folder of photos and viewing them fully spatially.
It was a bit of a mess code-wise and kinda specific to my local setup - but I should really clean it up and deploy it somewhere for other people to try. Although I keep assuming someone else will do it before me and make a better job of it.
What are the requirements for running this? Chrome throws a whole bunch of "out of memory" errors into the console when I try to execute these. I'm guessing 4GiB of VRAM is not enough?
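WebGPU has no API that reports total VRAM, so one way to get some signal when debugging out-of-memory errors like these is to print the adapter's buffer limits. The sketch below assumes only the standard `navigator.gpu.requestAdapter()` call and the `maxBufferSize` / `maxStorageBufferBindingSize` limits; the helper names are hypothetical, and the numbers do not by themselves say whether this particular model will fit.

```typescript
// Minimal structural type so the sketch compiles without @webgpu/types.
interface LimitsLike {
  maxBufferSize: number;
  maxStorageBufferBindingSize: number;
}

interface LimitReport {
  maxBufferMiB: number;
  maxStorageBindingMiB: number;
}

// Pure helper: convert the byte limits into MiB for easier reading.
function summarizeLimits(limits: LimitsLike): LimitReport {
  const toMiB = (bytes: number): number => Math.round(bytes / (1024 * 1024));
  return {
    maxBufferMiB: toMiB(limits.maxBufferSize),
    maxStorageBindingMiB: toMiB(limits.maxStorageBufferBindingSize),
  };
}

// Browser-side wrapper: resolves to null when WebGPU or an adapter is missing.
async function reportWebGpuLimits(): Promise<LimitReport | null> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return null;
  const adapter = await gpu.requestAdapter();
  return adapter ? summarizeLimits(adapter.limits) : null;
}
```

In a browser console, `reportWebGpuLimits().then(console.log)` would print the two limits, or `null` if WebGPU is unavailable.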
Nice, I've also been doing some similarly neat things via ONNX web at https://intabai.dev (caution, just PoC tools atm, only Chrome tested, only some mobile devices work, no filters).
I think all-client-side in-browser AI imagery is becoming very doable and has lots of privacy benefits. However, ONNX web leaves a lot to be desired (I had to proto patch many PyTorch conversions because things like Conv3D ops had WebGPU issues IIRC). I have yet to try Apache TVM WebGPU approaches or any others, but I feel if the WebGPU space were more invested in, running these models would be even more feasible.
I don't like that it uses only a single photo. This means it is going to make up a lot of stuff. E.g. if I show it a photo of a poster, then it will make that poster 3D. With only two photos that problem would already be solved.
I haven't tried that specific case but - are you sure? It does get a lot of stuff right from context. I think it would probably depend on how much of the frame the poster took up.
More reference images from different angles are always going to give more accurate information in 3D. From a single 2D image there is a lot of ambiguity: several different shapes in 3D can be represented in identical ways in 2D. Additional context like lighting, shadows, etc. helps. But more real signal from more images will always be better.
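The ambiguity is easy to see with an ideal pinhole camera, which is an assumption made for this illustration and not anything specific to SHARP. Any two points on the same ray through the camera center land on the same pixel, so a single image cannot tell them apart. The `project` helper and the sample points are hypothetical.

```typescript
type Vec3 = { x: number; y: number; z: number };
type Vec2 = { u: number; v: number };

// Ideal pinhole camera at the origin looking down +z, with focal length f.
function project(p: Vec3, f: number = 1): Vec2 {
  return { u: (f * p.x) / p.z, v: (f * p.y) / p.z };
}

const near: Vec3 = { x: 1, y: 2, z: 4 };
const far: Vec3 = { x: 2, y: 4, z: 8 }; // same ray, twice as far away

console.log(project(near)); // { u: 0.25, v: 0.5 }
console.log(project(far));  // { u: 0.25, v: 0.5 }
```

A second view from a different position would project `near` and `far` to different pixels, which is the extra signal that multi-image methods use.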
1. There are many use cases where only a single photo is available.
2. There are many models similar to Sharp that do accept multiple photos - but Sharp is trying to solve a specific problem. If you have multiple photos - don't use Sharp.
This is cool. For practitioners, what's the current state of the art for free-form multi-picture-to-splat? The last time I looked at it the pipeline was pretty janky and included a few separate steps.
Did not work in Firefox on Linux, but it runs on Chrome.
Have to admit, I don't get it. I tried it with 3 landscape photos I have and the results were nowhere close to the results in the demo, but that just speaks to the model.
Regardless, it's very cool as a browser tech showcase.
Happy to talk about it in the comments :)