Mercurial (Hg), like most other source control systems, was made to handle text files, not binary files of any kind. As game developers we have found DVCSs like Hg and Git useful for working with code, but the problem of handling large files remains. In Hg we are discouraged from editing history, and every file committed even once stays in the repository forever.

What Hg does when committing a binary file is load it into memory and try to work out the differences from the last commit, then store those differences in an efficient way. For an in-depth description, look here. However, most binary files are ill-suited to this: if a file format uses compression, most of the file will be different even after a small change. The best example is the OpenOffice (which, by the way, you should never use) file format, which is just zipped-up XML. Even if you change one word in the document, the whole saved file is different. Let's say you have 10 revisions and each revision takes 5MB. That quickly amounts to 50MB, because Hg has to save the whole file as a practically new entry in the repo every time you commit it. That's why in our shop we avoid excess commits of binary files.
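
If you want to see this effect for yourself, here is a rough experiment (a sketch that assumes a Linux shell with gzip and GNU sed; the file names and sizes are made up for the demo):

```
hg init ziptest && cd ziptest
seq 1 500000 > doc.txt                # a compressible "document"
gzip -kf doc.txt                      # doc.txt.gz stands in for any compressed format
hg add doc.txt.gz && hg commit -m "v1"
du -sh .hg                            # note the store size

sed -i '1s/.*/changed/' doc.txt       # change one "word" near the start
gzip -kf doc.txt                      # re-save the "document"
hg commit -m "v2"
du -sh .hg                            # should grow by roughly the whole .gz again
```

The two .gz files barely share any bytes, so the delta Hg stores ends up being close to a full copy.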

I was recently looking for ways to work around this problem. Some studios use a DVCS for code and a CVCS for assets, but this gets a bit complicated because you don't have a unique rollback point - the commits of assets and code are not in sync. Then I found this thread, where largefiles is mentioned as the solution.

It seems people have read the intro to the docs of these extensions and concluded that any of the popular ones - bigfiles, largefiles, bfiles, snap - gives you something conceptually different and better for handling large files. It's simply not true. Let's take a look at what they do, starting with an article from the Kiln platform documentation:

Should I use LargeFiles at all? Probably not.

The benefit of LargeFiles is to save some bandwidth on push/pull/clone commands and save a few CPU cycles when making a commit. This comes at a cost though! LargeFiles is considered a "last resort" extension for Mercurial because it breaks the D in DVCS ("Distributed Version Control System").

There is a very specific problem that LargeFiles is designed to solve: in the case where a repository has binary files which change frequently, the changesets can get very large because binary files are not very easy to diff or to compress. LargeFiles basically solves this problem by allowing certain files to be treated in the same way that Source Control Systems like Subversion treated them: as snapshots of the file at certain times. Then when you do a pull, clone, or update command, you only get the snapshots of the LargeFiles that correspond with the relevant revision. This means you do not have a full copy of the repository on your local machine. You rely on the central repository (on Kiln) to store all the revisions of the LargeFiles and only pull them over the network when you update to a revision that needs them. This saves bandwidth and some CPU cycles.

Okay, now that we've covered the background, the most important thing to know is: this does not in any way change how Hg handles files in memory. For anything larger than 10MB, commits will take long (I'm talking even about commits and pushes to a remote on the same hard drive). Hg has to take the file and consume many times more memory than the file's size during the commit, to figure out what the differences are. In fact, SourceTree will warn you that the file is larger than 10MB before the commit, as will Hg if you use the command line. Thus, there is a limit to the file size you can handle in Mercurial. If you have a huge cinematic as a video file, there may be problems: if the file is too large, the commit will simply fail, and you will have to exclude the giant file and distribute it some other way. What are the limits? In some testing on my fast machine (with "only" 8GB RAM), Hg would refuse to obey somewhere between 500MB and 1GB. I could commit files 250+MB large, but it does take a while. So committing a 5-10MB texture asset is not an issue, it's just a bit slower.
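
If you want to find the ceiling on your own machine, the kind of test I did is easy to reproduce (a sketch for a Linux shell; the file name is a placeholder - bump the count and watch Hg's memory use while the commit runs):

```
hg init bigtest && cd bigtest
dd if=/dev/urandom of=cinematic.bin bs=1M count=250   # try 250, 500, 1000...
hg add cinematic.bin
time hg commit -m "250MB binary blob"
```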

The next problem is that everyone collaborating on the project has to take a huge pull for every version of a large file they don't yet have, which can take a while. This is where (and only where) the extensions above come to the rescue. They have you pull only the latest version of the large files, and "secretly download" older ones if you want to look at a revision in between. However, if you want to go back to a revision you haven't pulled yet and the server is not up, you're out of luck. That means you should get all the commits at some point anyway (because you want all the code versions at your side), so what's the point?
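
For reference, this is roughly what that looks like with the largefiles extension, which ships with Mercurial (the file name and revision number below are placeholders):

```
# enable the extension for this repo
cat >> .hg/hgrc <<'EOF'
[extensions]
largefiles =
EOF

hg add --large Assets/Cinematic.mp4   # tracked as a largefile; only a standin hash goes into history
hg commit -m "add cinematic"
hg push

# a teammate's clone/pull only brings the hashes; the real file is fetched
# from the server when they update to a revision that needs it. To prefetch
# the large files of an older revision while the server is still reachable:
hg lfpull --rev 42
```

That lfpull is the escape hatch for the offline case: grab the revisions you think you'll need before you disconnect.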

But in practice this is not really a problem. If we're talking about a large 20MB asset file, everyone pulls it once and has it forever. The only problem would be if the asset file changed faster than your rate of pushing - and therefore pulling - in which case you would have to pull every revision of the large file to your machine (taking up drive space and network bandwidth). But if we work under the presumption that these files will not be changed more than once or twice - e.g. they get 2 revisions after the first one - you would not gain much from being able to skip downloading revision #2. This is reasonable for 3D assets: you commit them once, after you have 1) imported them from your 3D tool, 2) eyeballed them in Unity, 3) reworked any problems you've seen, and 4) once you're happy, committed for the rest of the team to pull and see. Then, if there are any voices that want the asset reworked (practical question: how many bosses can you have in an indie studio who can send your asset back for rework many times?), you'd rework it at most once or twice. In the meantime the coding team has committed a bunch of times anyway, and in order to push their changes in between yours they have to pull; so as long as most people commit faster than the problematic asset changes, you'd actually not gain anything by using any of the extensions above. The problem we're having is that we have too many binary files, not necessarily too many extra copies of binary files.

However, users can manually clean up the local store to get rid of all the accumulated versions of the files, at the cost that even the most recent one has to be re-downloaded afterwards.
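
If you want to do that cleanup, the downloaded copies live in a per-repo store and a per-user cache; if I remember the defaults right they are the ones below, but check hg help largefiles for your platform:

```
# per-repo store:  .hg/largefiles/
# user cache:      ~/.cache/largefiles           (Linux)
#                  ~/Library/Caches/largefiles   (macOS)
#                  %LOCALAPPDATA%\largefiles     (Windows)

du -sh .hg/largefiles ~/.cache/largefiles   # see how much has piled up
rm -rf ~/.cache/largefiles                  # brute force; hg re-downloads what it needs on the next update
```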

Here is how you'd deal with repos that have grown over time: simply 'migrate' the repo and archive the old one - like starting from scratch. You wouldn't have history going all the way back, but you'd get rid of the revisions that have accumulated by leaving them behind in the backup repo.
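
One way to do that migration, sketched for a Unix shell (the repo names are placeholders; the old clone stays around as a read-only archive):

```
cd ~/projects
hg clone game-repo game-repo-archive     # the full history lives on here
cd game-repo
hg archive ../game-repo-fresh            # export the current snapshot without .hg
cd ../game-repo-fresh
rm -f .hg_archival.txt                   # hg archive adds this metadata file; no need to track it
hg init .
hg addremove
hg commit -m "fresh start; old history archived in game-repo-archive"
```

If you'd rather keep some history instead of none, the convert extension with a filemap that excludes the heavy directories is the more surgical option, but the clean break above is the simple one.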

So, what would you lose by using the extensions above? Well, they handle files by placing them outside the repo and storing only a hash of each file in the repo itself (in the case of bigfiles, all the hashes live in one file). This has the unfortunate effect that you can't tell which exact bigfile/largefile was actually modified when looking at the history - all you'd see changing in the repo is the cumulative file that holds the hashes of all the bigfiles. For that you'd have to drop to the command line, with something like hg bstat (applicable to the bigfiles extension).
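
With largefiles specifically you do get a bit more visibility, because the standins live under .hglf/ inside the repo, one per tracked file, so the command line can still tell you which one changed (the revision number and path below are placeholders):

```
hg status --change 42 .hglf                 # which standins - i.e. which large files - changed in rev 42
hg cat -r 42 .hglf/Assets/Cinematic.mp4     # the hash recorded for that file at that revision
```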

The second thing you'd most definitely lose is the D in DVCS, as mentioned in the article above. This simply means that not everyone has a complete repo on their own machine, even if Hg 'thinks' they do. If you did benefit from largefiles (which, as we have seen already, is tough for game development), then once you got disconnected from the central repo (e.g. taking a work laptop home), returning to any revision that contains the problematic asset but isn't in your local cache would fail.

Lastly, you would not be able to use Bitbucket - which is a big deal for small projects - because the remote has to have the extension enabled too. Presumably they will do this like GitHub did, starting only with paid plans.

All the extensions work on the same principle, and as far as I know no one in the DVCS world has yet found a way to solve the problem systemically.

Update 10.03.2017: I am confident this problem can be completely worked around with some semi-complicated pre- and post-hooks on the server and clients, plus another SVN repository on the same server where Hg is hosted. In an intranet-type environment this could provide disk space savings as well as workflow improvements, provided the client-side caches are cleaned up automatically.
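
To give an idea of what I mean, the sketch below is the shape of the client-side hgrc I have in mind - completely untested, and the sync-assets.sh script that would talk to the SVN repository is hypothetical:

```
cat >> .hg/hgrc <<'EOF'
[hooks]
# after hg updates the working directory, check out the matching asset
# snapshot from the SVN side
update = ./sync-assets.sh checkout

# before a commit starts, push the working-copy assets into SVN and record
# the resulting SVN revision so the two histories stay in step
precommit = ./sync-assets.sh commit
EOF
```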

Also, Facebook seems to be working on their own implementation of LFS for Hg that is similar to Git's LFS and will use the same protocol.


If you're finding this article helpful, consider our asset Dialogical on the Unity Asset Store for your game dialogues.