I indexed 669 GB of my GoPro videos using my M1 Max computer and local ML models
TLDR: I had 2,207 GoPro videos, and I needed to rewatch them to find interesting moments from my cycling journey. I built a project to index them locally on my M1 Max using open-source ML models, search for those moments, and send the best clips straight to my DaVinci Resolve timeline. I indexed 628 videos (668.68 GB, 15h 13m 18s of footage); more details are in the metrics table in the last section of the article.
Full article: https://iliashaddad.com/blog/i-indexed-669-gb-of-my-gopro-videos-using-my-m1-max-computer
> Then, run the frame analysis pipeline, which will divide the video into separate video scenes (1s each, or 1fps)
> (…)
> Frames analyzed 57,537
Aha, it makes total sense. This number sounds much more reasonable than “669 GB”, since the actual total size of processed frames would be like 10-30 GB.
(Not downplaying anything. Doing-at-home always requires some math on practicality)
> Total compute time 67h 40m 42s
I’m just curious tho: are there any paid options that can accelerate this kind of process? Just spin up GPU instances?
> Aha, it makes total sense. This number sounds much more reasonable than “669 GB”, since the actual total size of processed frames would be like 10-30 GB.
The reason is that “669 GB” is the total raw footage size. When doing the video processing, I downscale each frame to 720p to make processing much faster, since I don't need the full original quality to get accurate results (as far as I know from my experiments).
> I’m just curious tho: are there any paid options that can accelerate this kind of process? Just spin up GPU instances?
For now, I found that an NVIDIA GPU, for example an RTX 3060 with 12 GB of VRAM, was much faster than my M1 Max (I'm still working on optimizing for speed and accuracy).
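For a sense of scale, the figures quoted in this thread allow a rough throughput estimate. The sketch below only does arithmetic on those quoted numbers; the 3x speedup is a made-up placeholder, not a measured RTX 3060 result, and the total compute time presumably covers transcription and other steps as well as frame analysis.

```python
# Figures quoted in the thread (taken from the article's metrics).
frames_analyzed = 57_537
compute_seconds = 67 * 3600 + 40 * 60 + 42   # 67h 40m 42s of total compute
footage_seconds = 15 * 3600 + 13 * 60 + 18   # 15h 13m 18s of footage

# Average compute spent per analyzed frame, and slowdown relative to real time.
seconds_per_frame = compute_seconds / frames_analyzed
slowdown_vs_realtime = compute_seconds / footage_seconds


def hours_at_speedup(speedup: float) -> float:
    """Wall-clock hours if the whole pipeline ran `speedup` times faster.

    `speedup` is a hypothetical factor, not a benchmark result.
    """
    return compute_seconds / speedup / 3600


print(f"{seconds_per_frame:.2f} s per analyzed frame")
print(f"{slowdown_vs_realtime:.2f}x slower than real time")
print(f"{hours_at_speedup(3.0):.1f} h at an assumed 3x speedup")
```

That works out to a bit over 4 seconds of compute per analyzed frame, so any hardware change mostly matters through how much it shortens the per-frame model calls.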
What PAYG providers do people here recommend? Most powerful machine at home is an M1 MBA (16GB), so I too am interested in short term options where I can still benefit from the privacy of local models
Something I've enjoyed more than I expected is Google and Apple Photos sending me photo memories and compilations of various things in my life and my kids' lives over the last decade.
I'm really bullish on taking more video of my kids, with the thought that it will become easier and easier for AI to put them into little compilations I can enjoy later.
Take a fast, small and powerful LLM running locally to index my personal data like images, videos, documents and enrich them and tag with the enriched metadata.
Want to group by people - search tagged metadata and group it
Want to search an image by description - tagged metadata
Want to organize by anything - tagged metadata
This should (hopefully) put an end to my file clutter
DaVinci 21 has indexing built-in (AI IntelliSearch). Not to diminish the work you did, but this is now available to many users (probably only Studio users since it has AI in the name)
Yes, I didn’t look at it. But does it upload your videos to the cloud or process them locally? And does it let you provide custom face data to help label faces in your videos?
I think Adobe Premiere Pro has it as well, but cloud-processed.
Well done! I couldn't understand how you are building reels out of it via the agent. Is it some sort of AI tool calling that takes image links and builds a reel via some video editing tool? Or does it take a +/- time delta around each timestamp returned from the index for a given query and join them together?
Thank you! I'm using RAG: every video scene is indexed individually in the vector database. When I ask the agent, it uses an Ollama model to understand the request and then calls the available search tool (searching by transcription text, faces, visuals, audio, or a combination), similar to how Claude or ChatGPT uses a web search tool to find info online. Then I can filter the video scenes with the Ollama model to present accurate and unique scenes, and send those results to DaVinci Resolve through its API to create a timeline from those clips.
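The retrieval step described here can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Scene` class, the two-dimensional toy embeddings, the file names, and the `pad`/`gap` values are all assumptions, and the final hand-off to DaVinci Resolve is left out. It ranks per-scene embeddings by cosine similarity, then pads the hits and merges neighbouring ones into clips.

```python
import math
from dataclasses import dataclass


@dataclass
class Scene:
    video: str              # hypothetical file name
    start: float            # seconds
    end: float              # seconds
    embedding: list[float]  # toy vector standing in for a real embedding


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(scenes, query_embedding, top_k=3):
    """Return the top_k scenes most similar to the query embedding."""
    ranked = sorted(scenes, key=lambda s: cosine(s.embedding, query_embedding),
                    reverse=True)
    return ranked[:top_k]


def to_clips(hits, pad=1.0, gap=0.5):
    """Pad each hit and merge hits from the same video that overlap or nearly touch."""
    clips = []
    for s in sorted(hits, key=lambda s: (s.video, s.start)):
        start, end = max(0.0, s.start - pad), s.end + pad
        if clips and clips[-1][0] == s.video and start <= clips[-1][2] + gap:
            clips[-1] = (s.video, clips[-1][1], max(clips[-1][2], end))
        else:
            clips.append((s.video, start, end))
    return clips


scenes = [
    Scene("GX010001.MP4", 10, 11, [1.0, 0.0]),
    Scene("GX010001.MP4", 11, 12, [0.9, 0.1]),
    Scene("GX010002.MP4", 40, 41, [0.0, 1.0]),
    Scene("GX010002.MP4", 90, 91, [0.8, 0.2]),
]
clips = to_clips(search(scenes, [1.0, 0.0], top_k=3))
print(clips)
```

Merging adjacent one-second scenes is one way to avoid a timeline that repeats the same moment several times; the resulting `(video, start, end)` tuples are what a timeline-building step would consume.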
Just because they don't refuse it doesn't mean they are useful.
I found a few pornographic pictures on the web to hand to Abliterated Gemma4 12B (literally just to test this), and it needs pushing just to accept that people can be naked.
It didn't refuse but it also didn't provide useful descriptions such as "this is a pornographic picture of a woman".
> G4: There is a person lying down in a scientific context, if I had to guess they are a biologist in a classroom
> me: Is she wearing any clothes?
> G4: No.
Also, it is obsessed with penises —seeing them in compositions where there is only a female. I suppose it's been trained to ban dick pics or something.
Prompting may help some but 12B seems to be a bit worse than E4B with the vision/audio model at voice and text reading so maybe that one would do better.
For GP's purpose, can face recognition techniques be repurposed for, um, other body parts recognition? Sometimes the actresses are facing away from camera. There are exposed lips, if that helps.
Last time I tried whisper, it hallucinated an elaborate conversation from sounds of slapping and moaning and it took minutes to spit every single line of it.
If I remember correctly, the Whisper documentation actually recommends trimming non-speech portions, as the models hallucinate heavily during those portions.
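The trimming idea can be sketched as a pre-processing pass that keeps only the loud spans of the audio before transcription. The function below is a naive RMS energy gate written for illustration; the name `speech_spans` and all thresholds are assumptions. It only removes silence: loud non-speech sounds still pass through, so rejecting those would need a real voice activity detection model.

```python
def speech_spans(samples, rate, frame_ms=30, threshold=0.02, min_gap_s=0.3):
    """Return (start_s, end_s) spans whose RMS energy is at or above `threshold`.

    `samples` is a sequence of floats in [-1, 1]; `rate` is samples per second.
    Spans separated by at most `min_gap_s` seconds are merged.
    """
    frame_len = max(1, int(rate * frame_ms / 1000))
    spans = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms < threshold:
            continue  # treat this frame as silence
        start, end = i / rate, (i + len(frame)) / rate
        if spans and start - spans[-1][1] <= min_gap_s:
            spans[-1] = (spans[-1][0], end)  # extend the previous span
        else:
            spans.append((start, end))
    return spans


# Synthetic signal at 1 kHz: silence, a loud burst, a long silence, another burst.
signal = [0.0] * 300 + [0.5] * 300 + [0.0] * 600 + [0.5] * 300
spans = speech_spans(signal, rate=1000)
print(spans)
```

Only the returned spans would then be cut out and passed to the transcription model, with the span start times added back to the resulting timestamps.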
Just because it is local does not mean it won't reject explicit content. You can definitely try to find abliterated models, and you can attempt to use Unsloth or something similar to tune one properly.
Is abliteration even necessary? While “playing around” I have noticed that most models are very strict only in the first prompt. The moment you get past that with a good turn, from the next turn on you can get them to do _anything_.
I don't disagree with your conclusion but the comparison of max bandwidth between the two SoCs is not enough. Neither of them will use all of that bandwidth doing AI work because the GPU will be compute limited. That's why dedicated GPUs perform so significantly better without having significantly higher bandwidth.
“Comparable” is maybe true if we are talking about single core performance, but for memory bandwidth, the M1 Max is about 8 times faster. Wider bus, lower latency, not even close.
I was looking for a solution to this issue of running Docker containers over MPS and utilizing their GPU power. I think this project will be the solution for it. I’ll try it very soon and add support for it. Thank you, much appreciated.
Cool build but the example videos you provide at the end are . . . not what I would hope for when thinking about the highlights of 2000+ videos of biking? For example the dog barking video only has one scene repeated two or three times and it's five seconds long?
I don't have any preconceptions about specific content I want to see. I'd just think that so many hours of such cool adventures would have greater variety. It made me wonder if your AI really did such a good job of indexing it. It made me think maybe the tech isn't quite ready yet?
Sure, I'm using Qwen2.5-VL (https://huggingface.co/collections/Qwen/qwen25-vl), which can help me understand actions like falling down, because I can provide, for example, 5 frames (downscaled to 720p) so the model can understand what is happening in that part of the video.
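How those 5 frames are chosen is not stated in the thread. One simple option, shown here purely as an assumption, is to take evenly spaced timestamps across the scene so the model sees the beginning, middle, and end of the action. `sample_timestamps` is a hypothetical helper name.

```python
def sample_timestamps(start_s, end_s, count=5):
    """Return `count` evenly spaced timestamps inside [start_s, end_s].

    Each timestamp sits at the midpoint of one of `count` equal slices,
    so no sample lands exactly on a scene boundary.
    """
    if count <= 0 or end_s <= start_s:
        return []
    step = (end_s - start_s) / count
    return [start_s + step * (i + 0.5) for i in range(count)]


# A 5-second window starting at 10 s yields one frame per second, mid-slice.
print(sample_timestamps(10, 15, count=5))
```

The frames at these timestamps would then be extracted, downscaled, and passed together to the vision-language model in a single request.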
I would love your feedback and suggestions for improvements or new features you'd like to have, whether in the source-available version, the desktop app, or the blog post itself.
Not really. Grab frames, lower the resolution, classify, combine metadata, transcribe the audio, convert that data (text, visual, and audio) to embeddings, and save them to a vector DB and a SQL DB. That lets me do semantic search, RAG, search using a screenshot of the video to find the exact moment in the video, and search using an audio file as well. Other features are unlocked by the vector DB too.
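The first and last steps of that list can be sketched as below. This is an illustration under stated assumptions, not the project's code: the helper names, the table schema, and the file names are invented, the ffmpeg command is only built (not run), and the classification, transcription, embedding, and vector DB steps are omitted. It relies on ffmpeg's `fps` and `scale` filters to sample one frame per second at 720p, and uses SQLite for per-scene metadata.

```python
import sqlite3


def frame_sampling_cmd(video_path, out_pattern, fps=1, height=720):
    """Build an ffmpeg command that samples `fps` frames per second at `height` px.

    scale=-2:HEIGHT keeps the aspect ratio and rounds the width to an even number.
    """
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps},scale=-2:{height}",
        out_pattern,
    ]


def open_index(path=":memory:"):
    """Open (or create) the SQL side of the index; the schema is hypothetical."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS scenes ("
        " video TEXT, start_s REAL, end_s REAL,"
        " description TEXT, transcript TEXT)"
    )
    return db


def add_scene(db, video, start_s, end_s, description, transcript=""):
    """Store the combined metadata for one scene."""
    db.execute("INSERT INTO scenes VALUES (?, ?, ?, ?, ?)",
               (video, start_s, end_s, description, transcript))


cmd = frame_sampling_cmd("GX010001.MP4", "frames/%06d.jpg")
db = open_index()
add_scene(db, "GX010001.MP4", 10.0, 11.0, "dog barking next to a bike")
add_scene(db, "GX010001.MP4", 11.0, 12.0, "empty road, mountains ahead")
rows = db.execute(
    "SELECT start_s, end_s FROM scenes WHERE description LIKE ?", ("%dog%",)
).fetchall()
print(" ".join(cmd))
print(rows)
```

The SQL table handles exact filters such as file name or time range, while similarity search over embeddings would live in the separate vector store.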
I agree with that, thank you for your feedback. Also, maybe you're not a video editor and you just wanna search your videos. The video editing integrations are optional and you have full control. You can switch between Adobe Premiere Pro, Final Cut Pro, or DaVinci Resolve.
> Many of the videos I captured amazing moments, and sometimes it's kind of hard to watch the full videos to get those moments.
Yep. I had the same problem.
> Then, run the frame analysis pipeline [...] I have a face recognition plugin using my custom faces data, object detection, on-screen text, shot type, and scene description [...] we will have three vector DB collections that have all the information about our videos, like video location metadata, camera name, faces recognized, objects detected, on-screen text, transcription, description of each scene, and many more [...] we can get better indexed data if you use the advanced mode indexing to use the Qwen2.5-VL-7B-Instruct model to understand and describe your video much better, but at a slower indexing speed
Yeah, uhm... ok :)
If anyone else has a similar problem, the real solution is as follows:
1. When recording, if you witness an interesting moment worth saving later, press the power button — this will mark the current moment in the video as a chapter.
2. Find the chapters later when editing and cut them into clips.
3. You're done :)
This has two main benefits over the insanity above:
1. It's trivially simple instead of insanely complex and inefficient.
2. It will reliably catch all the stuff you find interesting, since you're the one doing the marking.
The downsides:
1. Doesn't work retroactively.
2. It may miss interesting stuff if you miss it at the time as well.
3. Only works for this use case.
4. Nerds won't salivate over your usage of cutting edge tech.
https://news.ycombinator.com/item?id=48222733 https://blog.simbastack.com/indexed-a-year-of-video-locally/
I wasn't familiar with your project though, interesting stuff.
I'm trying to add more photography related features to Framedex but yeah there's so much we can do locally, exciting times.
Good job on the article and the project. That's great; yes, local models are getting better and better.
> I'm really bullish on taking more video of my kids, with the thought that it will become easier and easier for AI to put them into little compilations I can enjoy later.
Years from now they'll be getting "hey look at BIKE BRANDS' NEWEST CHEAP BIKE REMEMBER WHEN YOU USED TO RIDE BIKE BRAND BIKES"
https://www.blackmagicdesign.com/products/davinciresolve/wha...
You might want to add something like a YOLO fine-tune to detect scenes, plus face recognition too.
ref: https://www.cpubenchmark.net/compare/4585vs4245/Apple-M1-Max...
- "unified" RAM makes all the system RAM available as VRAM
- dedicated AI coaccelerator thingy
Both of these reasons allow the Apple silicon chips to crush conventional CPUs in these kinds of AI model workloads.
No idea about what the windows arm stuff is capable of. I know they use Qualcomm snapdragon chips though.
When trying to read this article, the main website was throwing errors to CloudFlare unfortunately
comes with some nifty features like NLE integrations, people search, MCP, API etc
Disclaimer: one of the co-founders
Other comments mention davinci resolve has this built in. How would you compare the two?
For the dog barking videos, those are only the video scenes that have a dog barking sound in them.
I'll keep adding more prompts and example videos, so keep an eye out for that.
Did you ever visit crazyguyonabike.com? A long time ago I had the pleasure of following the journey of a friend of a friend of a friend on that site:
https://www.crazyguyonabike.com/doc/?doc_id=2405
Stuff like that I guess?
Frame-level embedding covers a lot, but it can miss a lot of action-related searches.
Her client was recording while committing the abhorrent crime. The criminal would otherwise have got off.
From my perspective, the GoPro camera produced a good outcome. Still, one has to wonder why anyone would record their criminal actions.
She would rather have done corporate law but did not have the academic credentials or the networks needed for a job at the likes of Latham Watkins or White and Case.
Still it is good for society that criminals get the worst lawyers to defend them.