I have an hourly cron job that copies about 40GB of data from a source folder into a new folder with the hour appended on the end.
When it’s done, the job prunes anything older than 24 hours. The data changes very often during work hours and lives on a Samba file share. Here’s how the folder structure looks:
The contents of each new folder usually don’t change very much compared to the last one, since this is an hourly job.
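The hourly job described above might look roughly like the following sketch (the paths, the `Version.<hour>` naming, and the `find`-based pruning are my assumptions; the question doesn’t show the actual script):

```shell
#!/bin/sh
# Hypothetical reconstruction of the hourly snapshot job.
# SRC and DEST are made-up paths for illustration only.
SRC=/tmp/share/source
DEST=/tmp/share/backups
HOUR=$(date +%H)

mkdir -p "$SRC" "$DEST"              # for this sketch only

# Snapshot the source into a new folder with the hour appended
cp -a "$SRC" "$DEST/Version.$HOUR"

# Prune snapshots older than 24 hours (1440 minutes)
find "$DEST" -mindepth 1 -maxdepth 1 -type d -mmin +1440 -exec rm -rf {} +

ls "$DEST"
```

This keeps at most 24 rolling copies, which is where the 960GB total below comes from (24 × 40GB).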
Now you might be thinking I’m an idiot for dreaming this up. Truth is, I just found out about it. It’s actually been in use for years, and it’s so incredibly simple that anyone could delete the ENTIRE 40GB share (imagine that dialog spooling up, deleting thousands and thousands of files), and restoring it by moving the latest copy back to the source would actually take less time than the delete did.
Now, to top this off, I need to efficiently replicate this 960GB of “mostly similar” data to a remote server over a WAN link, with the replication happening as close to real-time as possible (think hot spare, disaster recovery, etc.).
My first thought was rsync.
Rsync sees a deletion of the folder that is 24 hours old and the addition of a new folder with 40GB of data to sync! I also looked at rdiff-backup and unison; they both appear to use similar algorithms and don’t keep enough metadata to do this intelligently.
The best thing I can find “out of the box” to do this is Windows Server’s Distributed File System Replication, which uses “Remote Differential Compression”. After reading the background information on how it works, it actually looks like exactly what I need.
Problem: Both servers are running Linux. D’oh! One approach I’m looking at goes like this. Say it’s 5AM and the cron job has just finished:
- The new Version.5 folder arrives on the local server
- SSH to the remote server and copy Version.4 to Version.5
- Run rsync on the local server, pushing changes to the remote server. Rsync now only has to do a differential copy between Version.4 and Version.5
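The three steps above can be sketched as follows. To keep the sketch self-contained, the remote server is simulated with a local directory (in production, `REMOTE` would be something like `user@drserver:/backups` and the `cp` would run over `ssh`, at which point rsync’s delta algorithm only transfers the changed blocks over the WAN):

```shell
#!/bin/sh
# Sketch of the copy-then-rsync approach; all paths are made up.
LOCAL=/tmp/repl/local
REMOTE=/tmp/repl/remote        # local stand-in for the remote server

# Seed state: Version.4 exists on both sides; Version.5 just arrived locally.
mkdir -p "$LOCAL/Version.5" "$REMOTE/Version.4"
echo unchanged > "$LOCAL/Version.5/a.txt"
echo unchanged > "$REMOTE/Version.4/a.txt"
echo new       > "$LOCAL/Version.5/b.txt"

# Step 2: on the remote side, clone Version.4 to Version.5
cp -a "$REMOTE/Version.4" "$REMOTE/Version.5"

# Step 3: rsync now only has to reconcile Version.5 against a
# nearly-identical copy, instead of transferring 40GB from scratch.
rsync -a --delete "$LOCAL/Version.5/" "$REMOTE/Version.5/"
```

On a real remote server, `cp -al` (hard links) instead of `cp -a` would also avoid duplicating unchanged files on disk.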
Is there a smarter way to replicate Samba shares as close to real-time as possible?
Anything out there that does “Remote Differential Compression” on Linux?
You should seriously consider DRBD. DRBD is essentially RAID 1 over TCP/IP: it replicates a block device over any link in real time. Every time a disk block is modified, DRBD replicates that block to the peer.
It is filesystem-agnostic, so you can put any filesystem you want on top of it. It also works hand-in-hand with Heartbeat, which lets you bring up the hot spare as soon as the original node dies.
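For reference, a minimal DRBD resource definition looks roughly like this (hostnames, backing devices, and addresses are placeholders; protocol A is DRBD’s asynchronous mode, which is the usual choice over a high-latency WAN link):

```
# /etc/drbd.d/r0.res -- illustrative only; hosts, disks, and IPs are made up
resource r0 {
  protocol A;             # asynchronous replication, suited to WAN links
  device    /dev/drbd0;
  disk      /dev/sdb1;    # backing block device holding the share
  meta-disk internal;

  on primary-host {       # must match `uname -n` on each node
    address 10.0.0.1:7789;
  }
  on secondary-host {
    address 10.0.0.2:7789;
  }
}
```

The Samba share would then live on a filesystem created on `/dev/drbd0` on the primary node.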