Copyright (C) 2008 Renaissance Technologies Corp. main developer: HP Wei Copyright (C) 2006 Renaissance Technologies Corp. main developer: HP Wei Copyright (C) 2005 Renaissance Technologies Corp. Copyright (C) 2001 Renaissance Technologies Corp. main developer: HP Wei This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; see the file COPYING. If not, write to the Free Software Foundation, 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. +------------+ | CHANGE LOG | +------------+ Jul - Oct 2008 -- version 4.0 -- previously, missing page report is sent back one page at a time. I change it to ~60K pages at a time. This drastically reduces the network traffic. -- clean up scheme in choosing a monitor machine. Feb - May 2006 -- version 3.0 verision 3.0.0 major update -- large file support -- platform independence (between linux, unix) -- backup feature (as in rsync) -- removing meta-file-info We transmit the file_stat info right before the file is being synced. And the file-stat info is transfered by ascii format to avoid byte-ordering translation. The orignal meta_data is not necessay under this scheme. -- catching slow machine as the feedback monitor [ the signal_handler scheme in version 2.0 is removed. ] -- mcast options version 3.0.[1-9] bug fixes -- logic flaw which under certain condition caused premature dropout due to unsuccessful EOF, CLOSE_FILE and caused unwanranted SIT-OUT cases. -- tested on Debian 64-bit arch by Nicolas Marot in France Nicolas also translated mrsync.py to mrsync.sh (nicolas.marot@gmail.com) version 3.1.0 -- codes for IPv6 are ready (but not tested) IPv4 is tested ok. June 9, 2005 -- version 2.1 Clint Byrum at clint@careercast.com asks about the exit status of multicaster, which is changed so that it exits whenever there are bad_machines or files_sat_out. May 2005 -- Second version The major changes are: (1) implementing congestion control so that mrsync is more net-traffic-friendly. (2) using python script as the glue to put everything together. i.e. the previous mrsync.c is replaced by mrsync.py Other changes collected over the years through interactions with users include: (1) adding options to change the default IP address for multicast, and the default port number for flow control. This was suggested by Robert Dack (2) replacing memory mapped file IO with the usual seek() and write() sequence. This change was echoed by Clint Byrum (3) adding verbose control so that by default mrsync prints only essential info instead of detailed status report. This was suggested by Clint Byrum Jan 2002 -- First version uploaded to http://freshmeat.net/projects/mrsync/ The tar file is in ftp://felder.org/pub/mrsync.tar +-----------------+ | MRSYNC vs RSYNC | +-----------------+ mrsync is a package that consists of utilities for transfering a bunch of files from a master machine to multiple target machines simultaneously by using the multicasting capability in the UNIX system. The name 'mrsync' is inspired by the popular utility 'rsync' for synchronizing files between two machines. However, beyond this similarity in the functionality, mrsync is fundamentally different from rsync in three areas. (1) rsync uses TCP while mrsync needs UDP in order to use the multicasting part of UNIX's socket communication. The former limits the data commuinication to one-to-one-machine whereas the latter allows one-to-many. UDP has, however, no built in flow control. As a result, the major part of mrsync (more precisely, the multicaster and multicatcher), is devoted to synchronizing the data flow. (2) For a given file, rsync transfers optionally only those parts in the file that are different in the two versions on the master and the target machine. This saves time, esp on a slow network. mrsync, in contrast, transfers the whole content of a file to all targets at one time. The time gain of mrsync comes from its concurrency as described in (3). (3) Replicated data servers have become poplular to serve thousands of clients. Using rsync to replicate the data to hundreds of disks is very time consuming. The total time it takes is proportional to the number of target disks. In contrast, mrsync scales much better because it puts the contents of the file on the network only one time. e.g. we have used mrsync to transfer 140GB of data to 100 machines in a 1Gbit LAN in about 4-6 hours. This project has started in 2001. We have been using the tool every day. Recently, I have upgraded this tool to handle large-files, to make it platform independent, and to make it 64bit ready. The current version is 4.0.0 which is stable. It has been used by many people in the UNIX community. Lately, people in France has helped me test the code on 64bit arch. They are proposing to make it a standard package in the Debian distribution. mrsync was originally posted on Freshmeat.net. But the link there points to one of our company's disks. We would like to find a place on SourceForge.net for this package. +-------------------+ | HISTORY OF MRSYNC | + ------------------+ The project of mrsync stemmed from the necessity to transfer many files to hundreds of machines running Linux at Renaissance Technologies Corp. Looking into the Open Source Community, we found a preliminary utility codes of multicasting written by Aaron Hillegass. Many unsuccessful test-runs on a huge amount of data files, however, led us to embark on an overhaul of the code. Most of the following items were inherited and bug-fixed from the original codes. * The low level functions that interact with UNIX's multicasting sockets. * Meta_data -- the essential info about a file which the master machine will first transmit to the target machines. [ removed in 2006 (see above change log)]. * Division of a file into many 'pages'. * The idea of maintaining a missing page flag. * The idea of a multicaster and multicatcher loop. In this mrsync, we develop two new critical elements: flow-control message communication conducted by the multicaster, and a four-state page reader (processor) in the multicatcher. The former is to synchronize the task each machine is performing. For example, the master will not start sending the pages of a file unless all machines have acknowledged the completion of openning the disk i/o for the file. In order to accomodate these elements, the codes have been changed significantly from the original version. For example, the multicatcher now never asks for slowing down. And multicaster sends data on a file-by-file basis. The file integrity is achieved by orchestrating the data flow which is closely monitored and conducted by the master machine. [200505] we add congestion control into the code. After the master sends one page for a file, it will not send the next page until it receives the acknowledgement (ack) message sent by a monitor target (a feedbacker if you will). This simply-minded congestion control prevents sending pages at a pace with which a busy network cannot digest. The feedbacker is chosen at the initialization stage of multicaster. It may be changed by multicaster whenever it fails to send back ack message within a certain duration. [2006] The monitor machine will send back ack message only after it has written the page to disk. This way the 'rtt' calculated by the master machine includes the disk IO time, instead of just the network-traffic time. The master also selects the slowest target machine as the monitor machine. These two ingredients provide more precise picture of how busy the whole system is, and thus tend to leave no target machine behind. As of today, mrsync has been in full use at Renaissance on a regular basis. +-------------------+ | TEST RUNNING TIME | +-------------------+ [200810] using version 4.0.0. The performance improves (if the target machines are not busy). Number of targets = 32. All linux machines on 1Gbit network. Total number of files = 283 Pages w/o ack = 0 ( 0.00%) Total number of pages = 100129 Pages re-sent = 6852 ( 6.84%) Total number of bytes = 6449407171 Bytes re-sent = 442030792 ( 6.85%) Total time spent = 2.55 (min) ~ 0.37 (min/GB) Send pages time = 1.78 (min) ~ 0.26 (min/GB) rtt histogram msec counts ---- -------- 0 103415 1 1985 2 162 3 73 4 145 5 68 6 253 7 203 8 305 9 184 10 156 [200604] using version 3.x.x, here is an example of the final statistics in one of our routine syncing jobs. Number of targets = 32. All linux machines on 1Gbit network. total number of files = 4076 Total number of pages = 480719 Pages re-sent = 39994 ( 8.32%) Total number of bytes = 30875638130 Bytes re-sent = 2573924662 ( 8.34%) Total time spent = 39.50 (min) ~ 1.18 (min/GB) rtt histogram msec counts ---- -------- 0 371023 1 118874 2 21977 3 3779 4 2226 5 760 6 584 7 580 8 288 9 162 10 92 ... [200201] 25 minutes for a group of files whose total size amounts to 4.6GB. (This data is obtained from running on 5 SUN machines with Solaris 8 on an Ethernet LAN whose bandwidth is 100Mbits/sec.) +-------------------------------------+ | MAIN ALGORITHM FOR SENDING ONE FILE | +-------------------------------------+ In the multicaster code running on the master machine: In the initialization stage, the code selects one target machine as a monitor machine (feedbacker), which sends back acknowledgement message when one page is received. The main loop for multicaster is as follows. (1) Send OPEN_FILE command (message). (2) Wait for all machine to send back acknowledgement (ack). (3) If any machine does not ack back within a certain time period, that machine is considered bad and is dropped out of the monitoring list. During that waiting time period, the master sends the OPEN_FILE message to those potentially bad machines regularly. (4) Start sending pages, (all pages, in the first time round; those missing pages reported back by target machines, in other rounds). During this period, all target machines except the feedbacker, if activated, just receive and process whatever pages that arrive at the receive buffer. There are two modes of operation for the master to send these pages. (a) When the congestion control is turned off (-x option in mrsync.py) the master sends out one page and wait for a duration (interpage interval), which is specified as DT_PERPAGE in main.h, before it sends the next one. (b) If the congestion control is turned on (the default behavior of mrsync.py), the master will not send the next page until it receives the ack. On the feedbacker, once a SIGIO is triggered indicating the arrival of a page, it sends back ack message to the master. (5) Send EOF message to signal the end of transmission for this file. (6) Wait for all machines to send back ack. They either report ok or report a list of page numbers that are missing. (7) If any machine does not ack back within a certain time period, that machine is considered bad and is dropped out of the monitoring list. During that waiting time period, the master sends the EOF message to those potentially bad machines regularly. (8) If there are missing pages, go back to (4). (9) Send CLOSE_FILE message. (10) Wait for all machine to send back ack. (11) If any machine does not ack back within a certain time period, that machine is considered bad and is dropped out of the monitoring list. During that waiting time period, the master sends the CLOSE_FILE message to those potentially bad machines regularly. (12) If there are more files to transfer, go back to (1). In multicatcher running on the target machine: the main loop performs the following steps, (1) Start with IDLE_STATE. (2) Upon receiving OPEN_FILE and being in IDLE_STATE, set up a momory mapped temporary file whose size equals that specified in the meta_data, which has been received in prior time. if things are not successful, don't send back ack. else send back ack and enter GET_DATA_STATE. (3) Upon receiving one UDP and being in GET_DATA_STATE, store it into the right place in the temporary file. Expect to receive more UDP data and go back to (3). (4) Upon receiving EOF and being in GET_DATA_STATE, if there are some missing pages, report them and expect to go back to (3). else there are no missing page, send back ack and enter DATA_READY_STATE. if there is sick conditions, send back ack and enter SICK_STATE. (5) Upon receiving CLOSE_FILE and being in DATA_READY_STATE, if in DATA_READY_STATE, rename(temporary_file, the_real_filename), clean up the memory mapped area. if things go well, send back the ack, enter IDLE_STATE and expect to go back to (1), else don't send back ack. else if in SICK_STATE, send back sit_out ack, enter IDLE_STATE and expect to go back to (1). In addition to the above main loop that runs on all target machines, the selected monitor machine (feedbacker) needs to send back an received_ack message for every page received and processed. -------------------------------------------------------------- Note: The original version dealt with three types of 'files': directory, softlink and regular file. mrsync includes one more: hardlink file. HP Wei hp@rentec.com Renaissance Technologies Corp. Nov 15, 2001 (first version) May 20, 2005 (second version) Apr 16, 2006 (third version) Oct 28, 2008 (fourth version)