VGGT: Visual Geometry Grounded Transformer

(github.com)

177 points | by xnx 1 day ago

15 comments

  • w-m 22 hours ago
    I read the paper yesterday and would recommend it. Kudos to the authors for getting to these results, and also for presenting them in a polished way. It's nice to follow the arguments about the alternating attention (global across all tokens vs. only the tokens per camera), the normalization (normalizing the scene scale, done in the data, vs. DUST3R, which normalizes in the network), and the tokens (image tokens from DINOv2 + camera tokens + additional register tokens, with the first camera handled differently since it becomes the frame of reference). The results are amazing, and fine-tuning this model will be fun, e.g. for feed-forward 3DGS reconstruction; looking forward to that.
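
    To make the alternating-attention idea concrete, here is roughly how I picture one block, in PyTorch-style pseudocode (my own sketch, not the authors' code; the dims, the residual structure, and the omitted norms/MLPs are all guesses):

        import torch
        import torch.nn as nn

        class AlternatingAttentionBlock(nn.Module):
            """Frame-wise attention (only the tokens of one camera) followed by
            global attention (all tokens of all cameras). Norms/MLPs omitted."""

            def __init__(self, dim=1024, heads=16):
                super().__init__()
                self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
                self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

            def forward(self, tokens):
                # tokens: (batch, frames, tokens_per_frame, dim), where each frame
                # contributes DINOv2 patch tokens + a camera token + register tokens
                b, f, t, d = tokens.shape

                # frame-wise: each camera's tokens attend only to each other
                x = tokens.reshape(b * f, t, d)
                x = x + self.frame_attn(x, x, x, need_weights=False)[0]

                # global: every token attends to every token across all cameras
                x = x.reshape(b, f * t, d)
                x = x + self.global_attn(x, x, x, need_weights=False)[0]

                return x.reshape(b, f, t, d)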

    I'm sure getting to this point was quite difficult, and on the project page you can read how it involved discussions with lots and lots of smart and capable people. But there's no big "aha" moment in the paper, so it feels like another hit for The Bitter Lesson in the end: They used a giant bunch of [data], a year and a half of GPU time to [train] the final model, and created a model with a billion parameters that outperforms all specialized previous models.

    Or in the words of the authors, from the paper:

    > We also show that it is unnecessary to design a special network for 3D reconstruction. Instead, VGGT is based on a fairly standard large transformer [119], with no particular 3D or other inductive biases (except for alternating between frame-wise and global attention), but trained on a large number of publicly available datasets with 3D annotations.

    Fantastic to have this. But it feels... yes, somewhat bitter.

    [The Bitter Lesson]: http://www.incompleteideas.net/IncIdeas/BitterLesson.html (often discussed on HN)

    [data]: "Co3Dv2 [88], BlendMVS [146], DL3DV [69], MegaDepth [64], Kubric [41], WildRGB [135], ScanNet [18], HyperSim [89], Mapillary [71], Habitat [107], Replica [104], MVS-Synth [50], PointOdyssey [159], Virtual KITTI [7], Aria Synthetic Environments [82], Aria Digital Twin [82], and a synthetic dataset of artist-created assets similar to Objaverse [20]."

    [train]: "The training runs on 64 A100 GPUs over nine days", that would be around $18k on lambda labs in case you're wondering

    • SJC_Hacker 14 hours ago
      > They used a giant bunch of [data], a year and a half of GPU time to [train] the final model,

      >[train]: "The training runs on 64 A100 GPUs over nine days", that would be around $18k on lambda labs in case you're wondering

      How is that a "year and a half of GPU time"? Maybe on some exoplanet?

      • dragonwriter 14 hours ago
        > > [train]: "The training runs on 64 A100 GPUs over nine days",

        > How is that a "year and half of GPU time".

        64 GPUs × 9 days = 576 GPU-days ≈ 1.577 GPU-years

        • refulgentis 12 hours ago
          Doh, that's entirely fair: I haven't been in this thread until now, but I'd echo what I perceive as implicit puzzlement re: this amount of GPU time being described as bitter-lesson-y.
    • dleeftink 22 hours ago
      Doesn't the bitter lesson take the argument a bit too far by opposing search/learn to heuristics? Is the former not dependent on breakthroughs in the latter?
      • CooCooCaCha 20 hours ago
        The bitter lesson is the opposite. It argues that hand-crafted heuristics will eventually get beaten by more general learning algorithms that can take advantage of computing power.
        • porphyra 20 hours ago
          Indeed, even "classical chess engines" like Stockfish, which previously relied on handcrafted heuristics at leaf nodes, have in recent years switched to NNUE [1] [2], which greatly outperforms the handcrafted evaluation. Note that this is a completely different approach from the one AlphaZero takes, and modern Stockfish is significantly stronger than AlphaZero.

          [1] https://stockfishchess.org/blog/2020/introducing-nnue-evalua...

          [2] https://www.chessprogramming.org/Stockfish_NNUE

        • dleeftink 19 hours ago
          > eventually get beaten

          Brute forcing is bound to find paths beyond heuristics. What I'm getting at is that the path needs to be established first before it can be beaten, hence why I'm wondering whether one isn't an extension of the other rather than an opposing strategy.

          I.e., search and heuristics both have a time and place; it's not so much a bitter lesson as a common filter for the next iteration to pass through.

          • CooCooCaCha 14 hours ago
            That's like saying horse drawn carriages aren't opposed to cars because they needed to be developed first.
  • Workaccount2 23 hours ago
    More info and demos:

    https://vgg-t.github.io/

    • kavalg 6 hours ago
      And license: Creative Commons Attribution, Non Commercial
    • bhouston 22 hours ago
      You are the hero! Thank you! The main post link should be updated to this.
      • soulofmischief 17 hours ago
        And then everyone will ask for the source. :)
  • mk_stjames 1 hour ago
    I'd really like to see this coupled with some SLAM techniques to essentially allow really accurate, long-range outdoor scene mapping with nothing but a cell phone.

    A small panning video of a city street can, right now, generate a pretty damn accurate (for some use cases) point cloud, but the position accuracy falls off as you try to go any large distance away from the start point, due to the dead-reckoning drift that essentially happens here. But if you could pipe real GPS and synthesized heading (from gyros/accel/magnetometers) from the phone the images were captured on into the transformer along with the images, it would instantly and greatly improve the resultant accuracy, since those camera parameters would now be 'ground-truthed'.

    I think this technique could then start to rival what you currently need a $3-10k LIDAR camera to do. There are a lot of 'archival' and architecture-study fields where absolute precision isn't as important as just getting 'full' scans of an area without missing patches, and speed is a factor. Walking around with a LIDAR camera can really suck compared to just a phone, and this technique would have no problem with multiple people using multiple phones to generate the input.
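
    Very roughly, what I have in mind is turning the GPS fix and the fused heading into an extra per-frame conditioning token that rides along with that frame's image tokens. A purely hypothetical sketch (nothing from the paper; every name and shape here is made up):

        import torch
        import torch.nn as nn

        class PosePriorEncoder(nn.Module):
            """Hypothetical: embed a per-frame GPS position (converted to a local
            metric frame) plus a heading quaternion into one extra 'prior' token."""

            def __init__(self, dim=1024):
                super().__init__()
                self.mlp = nn.Sequential(
                    nn.Linear(3 + 4, dim),  # xyz + quaternion
                    nn.GELU(),
                    nn.Linear(dim, dim),
                )

            def forward(self, xyz_local, heading_quat):
                # xyz_local: (frames, 3) local ENU metres relative to the first fix
                # heading_quat: (frames, 4) fused from gyro/accel/magnetometer
                prior = torch.cat([xyz_local, heading_quat], dim=-1)
                return self.mlp(prior).unsqueeze(1)  # (frames, 1, dim)

    The model would then need fine-tuning with these prior tokens concatenated to each frame's token set, so the noisy-but-absolute GPS anchors the otherwise drift-prone relative geometry.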

  • sgnelson 22 hours ago
    I really wish someone would take this and combine it with traditional photogrammetry, to supplement it rather than just trying to replace it.

    This type of thing would be the killer app for phone based 3d scanners. You don't have to have a perfect scan because this will fill in the holes for you.

  • davedx 23 hours ago
    I'd love to hear what the use cases are for this. I was looking at Planet's website yesterday and although the technology is fascinating, I do sometimes struggle to understand what people actually do (commercially or otherwise) with the data? (Genuinely not snark, this stuff's just not my field!)
    • stevepotter 22 hours ago
      I'm working on a system that uses affordable hardware (iPhones) to make orthopedic surgery easier and more precise. Among other things, we have to track the position in space of surgical tools like drills. Models like this can play a pivotal role in that.

      As someone mentioned, this is great for gaussian splatting, which we also do.

      • nmfisher 14 hours ago
        My brother is an orthopaedic surgeon, so I’m curious to know more. Do you have a website?
    • vessenes 22 hours ago
      This is a super useful utility — until this there was nothing fast and easy that you could dump say 30 quick camera photos of a (room/object/place) into and get out a dense point cloud.

      Instead you had to run all these separate pipelines inferring camera location, etc. etc. before you could get any sort of 3D information out of your photos. I'd guess this is going into many, many workflows, where it will drop-in replace a bunch of jury-rigged pipelines.
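
      Going from memory of the repo's README (so the module paths and helper names below may be off, treat this as a sketch), the whole thing collapses to a single forward pass, roughly:

          import torch
          from vggt.models.vggt import VGGT
          from vggt.utils.load_fn import load_and_preprocess_images

          device = "cuda" if torch.cuda.is_available() else "cpu"
          model = VGGT.from_pretrained("facebook/VGGT-1B").to(device)

          # the ~30 quick photos of your room/object/place
          images = load_and_preprocess_images(["img01.jpg", "img02.jpg"]).to(device)

          with torch.no_grad():
              # one pass returns camera parameters, depth maps and a dense point map
              predictions = model(images)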

    • imbusy111 23 hours ago
      Architectural visualization is one. For example, the design phase of remodelling your house would be much easier if you had a 3D reconstruction of the current state already available.
    • the8472 18 hours ago
      The depth maps and point clouds are useful in CGI for turning a 2D image into a 3D environment that can then be incorporated into a raytracing renderer, e.g. a CAD-data-based foreground object placed in a generated environment.
    • Lerc 19 hours ago
      Seems like it would provide good data for training control nets for image generation.

      This would let you use any of the types of data this model can output as input for controlling image generation.

    • cluckindan 22 hours ago
      Collision meshes for Gaussian splats
  • bhouston 22 hours ago
    I'm a little suspicious of many of the outdoor examples given, though. They are of famous places that are likely in the training set:

    - Egyptian pyramids

    - Roman Colosseum

    These are the most iconic and most photographed things in the world.

    That said, there are other examples that are more novel. I am just going to focus on those to judge its quality.

    • ed 21 hours ago
      It’s worth trying the demo - I uploaded a low quality video of an indoor space and got decent results
    • kfarr 21 hours ago
      Use the Hugging Face Space with your own data; it's very good and outputs a .glb: https://huggingface.co/spaces/facebook/vggt
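
      If you want to poke at the result outside the browser, the .glb it gives you can be opened with e.g. trimesh (the file name below is just whatever you saved the download as):

          import trimesh

          # load the scene the Space exported; a .glb is a scene container
          scene = trimesh.load("vggt_output.glb")

          # list the geometries (point cloud / meshes) inside it
          for name, geom in scene.geometry.items():
              print(name, type(geom).__name__, len(getattr(geom, "vertices", [])))

          scene.show()  # quick interactive viewer (needs pyglet installed)
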
  • porphyra 22 hours ago
    It is cool to see recent research doing this to reconstruct scenes from fewer images, essentially using a transformer to guess what the scene structure is. Previously, you needed a ton of images and had to use COLMAP. All the fancy papers like NERF and Gaussian Splatting used COLMAP in the backend, and while it does a great job in terms of accuracy, it is slow and requires a lot of images with known calibration.
  • jdthedisciple 22 hours ago
    Interesting idea, I applaud it.

    However, I just tried it on Hugging Face and the result was... mediocre at best:

    The resulting point cloud missed about half the features from the input image.

    • porphyra 22 hours ago
      The final point cloud rendering might be the fault of a third-party renderer rather than VGGT itself.
  • vessenes 22 hours ago
    Looking at the output, which is impressive, I want to see this pipeline applied to splats. Dense point clouds lose a bunch of color and directional information needed for high-quality splats, but it seems easy to imagine this method would work well for splats. I wonder if the architecture could be fine-tuned for this or if you'd need to retrain an entire model.
    • w-m 22 hours ago
      Certainly possible; see "4.6. Finetuning for Downstream Tasks" in the paper, whose first subsection is "Feed-forward Novel View Synthesis". They chose to report their experiments on LVSM, which is not an explicit representation like 3D Gaussian Splatting, but they cite two feed-forward 3DGS approaches in their state-of-the-art listing.

      Should be quite exciting going forward, as fine-tuning might be possible on consumer hardware / single desktop machines (like it is with LLMs), so I would expect a lot of experiments coming out in this space soon-ish. If the results hold true, it'll be great to drop slow and cumbersome COLMAP processing and scene optimization for a single forward pass that takes a few seconds.

  • maelito 23 hours ago
    Can it be used to build Google Earth-like 3D scenes?
  • richard___ 20 hours ago
    We need camera poses in dynamic scenes
  • ninetyninenine 21 hours ago
    I feel AGI will be a patchwork of models melded together. Something like this would constitute a single model in the "perception" area.
  • fallingmeat 23 hours ago
    video or it didn't happen.
  • amelius 20 hours ago
    Please stop using keywords from electrical engineering.
    • juunpp 17 hours ago
      Someone's not grounded in reality.

      I'm ready to be grounded for this comment.