Sunday, October 18, 2009

Git filter-branch with GitSharp

I'm currently working on migrating the Castle project Subversion repository to git. There are already a couple of svn mirrors on github, but this migration is intended to eventually replace svn as the official repository. With over 6000 commits in 5 years of history, it's not a trivial migration.

One of the issues is that all subprojects are currently being split from the main trunk to make them more independent. I'll leave that for another post.

Another issue is the committer mapping. Each svn username needs to be mapped to a github account (name + email). Roelof Blom kindly provided this map, so I was set to import with git-svn. Two days later git-svn finished and I pushed the repository to github.
Much to my dismay, I found that some committers on my repository weren't matching their github accounts. The Ken Egozi on my repo wasn't the same Ken Egozi on github!

I had two options: either fix the user mappings and re-run git-svn or change the committers with git filter-branch. I went with filter-branch as described in the Pro Git book. It was processing about 1 commit/s so I left it working and went to sleep.

The next morning I went to see and it had segfaulted halfway through. Now that did it. I updated my GitSharp fork and wrote this kind of specific filter-branch:

internal class Program {
    private static void Main(string[] args) {
        var committerMap = new Dictionary<string, string> {
            {"Ken Egozi", "a@b.com"},
            {"Krzysztof Ko┼║mic", "c@d.com"},
        };
        var repo = Repository.Open(args[0]);
        var refs = repo.getAllRefs()
            .Where(x => x.Key.StartsWith("refs/heads") || x.Key.StartsWith("refs/tags"))
            .ToDictionary(x => x.Key, x => x.Value);

        var commitMap = new Dictionary<string, string>();

        foreach (var r in refs) {
            Console.WriteLine("Processing ref {0}", r.Key);
            var startCommit = repo.MapCommit(r.Value.ObjectId);
            var newHead = commitMap.ContainsKey(r.Value.ObjectId.Name) ? 
                repo.MapCommit(commitMap[r.Value.ObjectId.Name]) : 
                Rewrite(repo, startCommit, commitMap, committerMap);
            var newRef = repo.UpdateRef(r.Value.Name);
            newRef.NewObjectId = newHead.CommitId;
            newRef.IsForceUpdate = true;
            newRef.Update();
        }
    }

    public static Commit Rewrite(Repository repo, Commit startCommit, Dictionary<string, string> commitMap, Dictionary<string, string> committerMap) {
        Commit lastCommit = null;
        var walker = new RevWalk(repo);
        walker.sort(RevSort.Strategy.REVERSE);
        walker.markStart(walker.parseCommit(startCommit.CommitId));
        foreach (var rcommit in walker.iterator().AsEnumerable()) {
            var commit = rcommit.AsCommit(walker);
            if (commitMap.ContainsKey(commit.CommitId.Name)) {
                lastCommit = repo.MapCommit(commitMap[commit.CommitId.Name]);
                Console.WriteLine("{0} already visited, skipping", commit.CommitId.Name);
                continue;
            }
            if (committerMap.ContainsKey(commit.Author.Name))
                commit.Author = new PersonIdent(commit.Author.Name, committerMap[commit.Author.Name], commit.Author.When, commit.Author.TimeZoneOffset);
            var newCommit = new Commit(repo) {
                TreeId = commit.TreeId,
                Author = commit.Author,
                Committer = commit.Author,
                Message = commit.Message,
                ParentIds = commit.ParentIds.Select(x => repo.MapCommit(commitMap[x.Name]).CommitId).ToArray(),
            };
            newCommit.Save();
            commitMap[commit.CommitId.Name] = newCommit.CommitId.Name;
            lastCommit = newCommit;
        }
        return lastCommit;
    }
}

It took this little code 3 minutes to rewrite the whole repository. That's 34 commits/s !

When time allows, I'll try and clean this up, then merge it into the new GitSharp.CLI project. Instead of using a bash script to define transformations like the original filter-branch, this could use an embedded Boo or IronPython script!

DISCLAIMER: The code shown here is completely throwaway quality. It does not intend to be a reference GitSharp app or anything like that. It does not intend to be a general filter-branch replacement. It works on my machine, etc. Do not run this code on your repositories unless you know what you're doing. It will rewrite your whole repository! You have been warned.

2 comments:

henon said...

hey mauricio,
very nice. how did you get rid of the old commit objects?

mausch said...

I just ran git gc --prune=now