The rsync algorithm (1996) [pdf]

(andrew.cmu.edu)

148 points | by vortex_ape 19 hours ago

7 comments

ssl-3 20 minutes ago
The first time I got paid to use rsync was nearly 25 years ago. It provided for reasonably space-efficient, remote, versioned backups of a mail server, using hard links.
That mail server used maildir, which...for those who are not familiar: With maildir, each email message is a separate file on the disk. Thus, there were a lot of folders that had many thousands of files in them. Plus hardlinks for daily/weekly/whatever versions of each of those files.
At the time there were those who were very vocal about their opinion of using maildir in this kind of capacity, likening it to abuse of the filesystem. And if that was stupid, then my use of hard links certainly multiplied that stupidity.
Perhaps I was simply not very smart at that time.
But it was actually fun to fit that together, and it was kind of amazing to watch rsync perform this job both automatically and without complaint between a pair of particularly not-fast (256kbps?) DOCSIS connections from Roadrunner.
It worked fine. Whenever I needed to go back in time for some reason, the information was reliably present at the other end with adequate granularity -- with just a couple of cron jobs, rsync, and maybe a little bit of bash script to automate it all.
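For anyone who hasn't seen the hard-link trick, here is a rough Python sketch of the general idea, not the actual setup described above. Each snapshot is a full directory tree, but files that are unchanged since the previous snapshot are hard-linked rather than copied, so they take no extra space. The names `snapshot` and `demo` and the directory layout are made up for illustration; rsync itself can do something similar with its `--link-dest` option.

```python
import filecmp
import os
import shutil
import tempfile
from typing import Optional


def snapshot(source: str, previous: Optional[str], target: str) -> None:
    """Copy `source` into `target`, hard-linking files unchanged since `previous`."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        dst_dir = os.path.normpath(os.path.join(target, rel))
        os.makedirs(dst_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(dst_dir, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)        # unchanged: share the same inode
            else:
                shutil.copy2(src, dst)   # new or changed: store a fresh copy


def demo():
    """Take two snapshots of a tiny 'maildir' and report what got shared."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "maildir")
        day1, day2 = os.path.join(d, "day1"), os.path.join(d, "day2")
        os.makedirs(src)
        with open(os.path.join(src, "a"), "wb") as f:
            f.write(b"unchanged message")
        with open(os.path.join(src, "b"), "wb") as f:
            f.write(b"old")
        snapshot(src, None, day1)
        with open(os.path.join(src, "b"), "wb") as f:
            f.write(b"new contents")     # only this file changes between runs
        snapshot(src, day1, day2)
        shared_a = os.path.samefile(os.path.join(day1, "a"), os.path.join(day2, "a"))
        shared_b = os.path.samefile(os.path.join(day1, "b"), os.path.join(day2, "b"))
        with open(os.path.join(day1, "b"), "rb") as f:
            old_b = f.read()
        return shared_a, shared_b, old_b


if __name__ == "__main__":
    print(demo())
```

The unchanged file ends up as one inode visible in both snapshots, while the changed file gets its own copy and the older version stays intact in the earlier snapshot.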
doodlesdev 16 hours ago
Well-written, succinct.
This small document shows what computer science looked like to me when I was just getting started: a way to make computers more efficient and smarter, to solve real problems. I wish more people who claim to be "computer scientists" or "engineers" would actually work on real problems like this (efficient file sync) instead of having to spend time learning how to use the new React API or patching the f-up NextJS CVE that's affecting a multitude of services.
- cobertos 3 hours ago
  If only those who claim to be "managers" enabled those "engineers" to do such work, but it's not in the interest of their product, their bottom line, or their performance review. At least in their minds.
  - UqWBcuFx6NV4r 45 minutes ago
    …what? IC developers are a huge, huge contributor to the sort of over-complicated engineering and stack churn that’s at the heart of what’s being described here. Take an iota of responsibility for yourself.
- PunchyHamster 13 hours ago
  To be fair, the level of security of systems back then was pretty fucking bad.
  - observationist 12 hours ago
    Passwords of 6 characters or fewer, if there were passwords at all. Phreaking still worked into the 90s, and all sorts of really stupid things were done without really thinking about security at all. They'd print out receipts with the entire credit or debit card number and information on them, or carbon copy the card with an impression, and you'd see these receipts blowing around parking lots, or find entire bags or dumpsters full of them. Knowing an IP address might be sufficient information to gain access to systems that should have been secured. It's pretty amazing that things functioned as well as they did, that society was as trusting and trustworthy as it was, and that we were able to build as much as we did with as relatively tiny a level of exploitation as there was.
    If the same level of vulnerability was as prevalent today as it was back then, civilization might collapse overnight.
    - gritzko 1 hour ago
      Just read AWS or Cloudflare outage postmortems and you will see: we are still there, in the happy land.
    - mjevans 9 hours ago
      To be fair, back then it was relatively easy for anyone intelligent enough to abuse any of that to have a well-paying 'white collar' job with things like full health benefits, a pension, and more than sufficient income to support an entire family SOLO. They even owned houses!
      When your life is set like that, why risk trying to defraud someone at the cost of a nice suit when that's something that can be done legally and written off as a business expense on taxes?
  - axiolite 9 hours ago
    In 1996? OpenBSD and Apache had been around for a year. PGP had been around for several years. HTTPS was used where needed. SecurID tokens were common for organizations that cared about security.
    Admittedly SSH wasn't around, but Kerberos+rlogin and SSL+telnet were available. Organizations that cared about security would have SecurID tokens issued to their employees and required for login.
    Dial-in over phone lines, and requiring a password, was much less discoverable or exploitable than services exposed to the internet, today.
    - wmf 6 hours ago
      And every machine had 100 RCEs that you could discover with a few hours of effort.
teleforce 9 hours ago
Fun fact: the author of rsync, Andrew Tridgell, is also the one who reverse-engineered Microsoft's SMB, which laid the foundation for Samba [1].
How he managed to avoid lawsuits from Microsoft is beyond me.
[1] Server Message Block:
https://en.wikipedia.org/wiki/Server_Message_Block
- webdevver 4 minutes ago
  >How he managed to avoid lawsuits from Microsoft is beyond me.
  MS probably chose not to shut down that effort on the basis that it was enabling the MS stack in Linux.
  I wish I could dig up an internal presentation that was prepared in the 90s for Bill Gates at the time, which evaluated the threat posed by Linux to Microsoft. I think they were probably happy that Linux now had a reason to talk to Windows machines.
- js2 7 hours ago
  He also wrote a free BitKeeper client, antagonizing Larry McVoy, which is largely why we have git.
  https://blog.brachiosoft.com/en/posts/git/
- oska 6 hours ago
  Australians might like to know he worked on rsync and Samba while a PhD student at the ANU
- kvemkon 8 hours ago
  A protocol is not software; it is needed for interoperability.
  The same goes for header files. Issues arise only if they are "misused" to derive a competing solution rather than a compatible one.
craftkiller 10 hours ago
I've been using this extensively recently. I was setting up remote virtual machines that boot a live ISO containing all the software for the machine. Sometimes I need to change a small config file, which leads to generating a new 1.7GiB ISO, but 99.9% of that ISO is identical to the previous one. So I used rsync. Blew my mind when, after a day of working on these images, uploading 1.7GiB ISO after 1.7GiB ISO, WireGuard showed that I had only sent 600MiB.
Fun surprise: rsync first uses file size and modified time to decide whether the files are identical. I build these ISOs with nix. Nix sets the time to Jan 1st 1970 for reproducible builds, and I suspect the ISOs are padded out to the next sector. So rsync was not noticing the new ISO images when I made small changes to config files until I added the --checksum flag.
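To make that pitfall concrete, here is a minimal Python sketch. It is not rsync's code: `quick_check_same`, `checksum_same`, and `demo` are hypothetical helpers, SHA-256 is only a stand-in for whatever checksum rsync actually uses, and the file contents are invented. Two files with the same padded size and the same epoch timestamp look identical to a size-plus-mtime check, and only a content checksum tells them apart.

```python
import hashlib
import os
import tempfile


def quick_check_same(a: str, b: str) -> bool:
    """Size + modification-time comparison, in the spirit of rsync's default check."""
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)


def file_digest(path: str) -> bytes:
    """Hash the whole file in 1 MiB chunks (SHA-256 is an illustrative choice)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()


def checksum_same(a: str, b: str) -> bool:
    """Content comparison, roughly what asking for --checksum buys you."""
    return file_digest(a) == file_digest(b)


def demo():
    """Build two padded 'images' that differ by one byte but share size and mtime."""
    with tempfile.TemporaryDirectory() as d:
        old = os.path.join(d, "old.iso")
        new = os.path.join(d, "new.iso")
        with open(old, "wb") as f:
            f.write(b"config=A" + b"\0" * 2040)   # padded to 2048 bytes
        with open(new, "wb") as f:
            f.write(b"config=B" + b"\0" * 2040)   # same size, different content
        for path in (old, new):
            os.utime(path, (0, 0))                # reproducible-build style epoch mtime
        return quick_check_same(old, new), checksum_same(old, new)


if __name__ == "__main__":
    print(demo())
```

The trade-off is that the checksum path has to read every byte of both files, which is why the cheap check is the default.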
- seb1204 43 minutes ago
  In the past I downloaded daily diffs of an ISO, which were only a few MB. I then applied the diff to my ISO from yesterday. I forget the name of the tool, though. I did this on my own machine; if the parent wants to update a remote machine, I'm not sure it works the same way.
imiric 3 hours ago
Rsync is one of my favorite programs. I use it daily. The CLI is a bit quirky (e.g. trailing slashes), but once you get used to it, it makes sense. And I really always use the same flags: `-avmLP`, with `-n` for dry runs.
One alternative I'd like to try is Google's abandoned CDC[1], which claims to be up to 30x faster than rsync in certain scenarios. Does anyone know if there is a maintained fork with full Linux support?
[1]: https://github.com/google/cdc-file-transfer
bix6 9 hours ago
Funny timing, I just used this today while setting up my NAS
snvzz 3 hours ago
Besides Tridgell's venerable rsync, there is a permissively licensed implementation[0] from the OpenBSD project.
0. https://www.openrsync.org/