In this post I am sharing a code kata we did recently at Sinbadsoft: sorting large files with limited memory. By large files I mean files that do not fit in the available amount of RAM. A regular in-memory sort will not do in this case, since the whole data set cannot be loaded into memory at once. The solution to this classic problem is external sorting. It consists of two steps: first, split the file into small chunks that fit in memory, load each chunk, sort it, and write it back to disk; second, perform a k-way merge on all the sorted chunks to produce the final result.

In order to simplify the problem, we limited ourselves to the following conditions:

  • We only deal with text files as input. We sort file lines using string comparison.
  • Lines have a “reasonable” maximum length and consistent line endings.

We will be using C# here but it wouldn’t be much different in Java.

Split the big file into small sorted chunks

The first step is quite easy: we open a stream reader on the big file and keep reading until we have enough data in our buffer to fill a file chunk. When the buffer is full, we sort the data in memory and write it to disk as a temporary file. We continue until the whole input file is processed. As a result, we have K chunk files on disk. The SplitInSortedChunks method below implements this algorithm and returns the collection of chunk file paths.

IEnumerable<string> SplitInSortedChunks(string filepath, long chunkSize)
{
    var buffer = new List<string>();
    var size = 0L;

    using (var reader = new StreamReader(filepath))
    for (string line = reader.ReadLine(); line != null; line = reader.ReadLine())
    {
        // Flush the buffer before adding a line that would push it past the chunk size.
        if (size + line.Length + 2 >= chunkSize)
        {
            size = 0L;
            yield return FlushBuffer(buffer);
        }

        size += line.Length + 2;
        buffer.Add(line);
    }

    // Whatever is left in the buffer becomes the last, possibly smaller, chunk.
    if (buffer.Any())
    {
        yield return FlushBuffer(buffer);
    }
}

The magic number 2 used to increment the current size is the assumed length of the line ending used in the input file. It would probably be more reliable to use Environment.NewLine.Length here, or to pass this value as a parameter.
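
For instance, a variant that takes the line-ending length as a parameter could look like the following sketch (lineEndingLength is a name introduced here; callers would typically pass Environment.NewLine.Length, or 1 for files ending lines with "\n" only):

// Sketch only: same splitting logic, with the hard-coded 2 replaced by a caller-supplied value.
IEnumerable<string> SplitInSortedChunks(string filepath, long chunkSize, int lineEndingLength)
{
    var buffer = new List<string>();
    var size = 0L;

    using (var reader = new StreamReader(filepath))
    for (string line = reader.ReadLine(); line != null; line = reader.ReadLine())
    {
        if (size + line.Length + lineEndingLength >= chunkSize)
        {
            size = 0L;
            yield return FlushBuffer(buffer);
        }

        size += line.Length + lineEndingLength;
        buffer.Add(line);
    }

    if (buffer.Any())
    {
        yield return FlushBuffer(buffer);
    }
}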

The FlushBuffer method implementation is quite simple:

string FlushBuffer(List<string> buffer)
{
    buffer.Sort(StringComparer.Ordinal);
    var chunkFilePath = Path.GetTempFileName();
    File.WriteAllLines(chunkFilePath, buffer);
    buffer.Clear();
    return chunkFilePath;
}
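
One thing to keep in mind: SplitInSortedChunks is a lazy iterator, so no chunk file is written until the returned sequence is enumerated. If you want all the chunk files on disk up front, force the enumeration (the file name and chunk size below are just placeholders):

// Force eager evaluation: every chunk is sorted and written to disk at this point.
var chunkFilePaths = SplitInSortedChunks("input.txt", 64L * 1024 * 1024).ToList();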

K-way merge

Now that we have K small files filled with sorted data, we are going to merge them to build the final result file. To perform the merge, we first open K file streams, one per chunk file. Then we read one line from each file and repeatedly take the smallest of the current lines. We repeat this operation until all the chunks are fully read. To avoid doing a lot of string comparisons at each iteration, we keep the current line of each chunk in an ordered structure.

The algorithm is quite straightforward. Still, we made an interesting mistake in our first implementation, which I am going to share with you here.

A buggy implementation

Our K-way merge function takes the chunk file paths and the desired result file path as parameters. To save disk space, we can reuse the input file path as the output path, but this is not mandatory.

void KwayMerge(IEnumerable<string> chunkFilePaths, string resultFilePath)
{
    var chunkReaders = chunkFilePaths
        .Select(path => new StreamReader(path))
        .Where(chunkReader => !chunkReader.EndOfStream)
        .ToList();

    // BUG: SortedDictionary won't cut it here
    var sortedDict = new SortedDictionary<string, TextReader>();
    chunkReaders.ForEach(chunkReader => sortedDict.Add(chunkReader.ReadLine(), chunkReader));

    using (var resultWriter = new StreamWriter(resultFilePath, false))       
    while (sortedDict.Any())
    {
        var line = sortedDict.Keys.First();
        var chunkReader = sortedDict[line];
        sortedDict.Remove(line);

        resultWriter.WriteLine(line);

        var nextLine = chunkReader.ReadLine();
        if (nextLine != null)
        {
            sortedDict.Add(nextLine, chunkReader);
        }
        else
        {
            chunkReader.Dispose();
        }
    }
}

This implementation worked fine until we tested it with files containing duplicate lines. It crashed, complaining about duplicate keys in the SortedDictionary. Ouch! Notice that SortedList would fail in the same way, and SortedSet would silently drop the duplicate lines instead. We need a sorted multiset structure here to fix the bug.
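
The failure is easy to reproduce in isolation; SortedDictionary simply rejects duplicate keys:

// Minimal repro of the duplicate key issue, not part of the kata code itself.
var sortedDict = new SortedDictionary<string, int>();
sortedDict.Add("same line", 1);
sortedDict.Add("same line", 2); // throws ArgumentException because the key already exists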

A better implementation

To fix the problem we have with duplicate lines, we are going to use a priority queue (based on a max heap) as a sorted multiset. We have already discussed the implementation of the priority queue and heap data structures in a previous post (the implementation is also on GitHub and available as a NuGet package).
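
For readers who haven't seen that library, here is the shape of the API the merge code below relies on, inferred purely from its call sites; it is an assumption, not the actual Sinbadsoft implementation:

// Assumed contract of the PriorityQueue<TKey, TValue> used below, inferred from its call sites.
// The real Sinbadsoft implementation may differ; shown only to make the merge code easier to follow.
public interface IPriorityQueue<TKey, TValue>
{
    int Count { get; }
    void Enqueue(TKey key, TValue value);          // duplicate keys are allowed (sorted multiset behavior)
    KeyValuePair<TKey, TValue> Dequeue();          // removes and returns the entry with the highest priority
}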

void KwayMerge(IEnumerable<string> chunkFilePaths, string resultFilePath)
{
    var chunkReaders = chunkFilePaths
        .Select(path => new StreamReader(path))
        .Where(chunkReader => !chunkReader.EndOfStream)
        .ToList();

    // The queue is built on a max heap, so the ordinal comparison is negated to pop the smallest line first.
    var queue = new PriorityQueue<string, TextReader>((x, y) => -string.CompareOrdinal(x, y));
    chunkReaders.ForEach(chunkReader => queue.Enqueue(chunkReader.ReadLine(), chunkReader));
    
    using (var resultWriter = new StreamWriter(resultFilePath, false))
    while (queue.Count > 0)
    {
        var smallest = queue.Dequeue();
        var line = smallest.Key;
        var chunkReader = smallest.Value;

        resultWriter.WriteLine(line);

        var nextLine = chunkReader.ReadLine();
        if (nextLine != null)
        {
            queue.Enqueue(nextLine, chunkReader);
        }
        else
        {
            chunkReader.Dispose();
        }
    }
}
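
If you would rather avoid an external dependency and are on .NET 6 or later, the built-in System.Collections.Generic.PriorityQueue<TElement, TPriority> can serve as the sorted multiset as well: it is a min-heap and accepts duplicate priorities, so no negated comparison is needed. Here is a sketch of the merge loop using it, reusing the chunkReaders and resultFilePath variables from the method above:

// Alternative merge loop sketch using the built-in min-heap PriorityQueue (available since .NET 6).
var queue = new PriorityQueue<TextReader, string>(StringComparer.Ordinal);
foreach (var chunkReader in chunkReaders)
{
    queue.Enqueue(chunkReader, chunkReader.ReadLine());
}

using (var resultWriter = new StreamWriter(resultFilePath, false))
while (queue.TryDequeue(out var chunkReader, out var line))
{
    resultWriter.WriteLine(line);

    var nextLine = chunkReader.ReadLine();
    if (nextLine != null)
    {
        queue.Enqueue(chunkReader, nextLine);
    }
    else
    {
        chunkReader.Dispose();
    }
}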

Conclusion

Now that we have our functions to split the input file into chunks and to do the k-way merge, we simply need to chain them to sort the whole file.

void SortFile(string filepath, string resultFilePath, long chunkSize)
{
    var chunkFilePaths = SplitInSortedChunks(filepath, chunkSize);
    KwayMerge(chunkFilePaths, resultFilePath);
}
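
For example, sorting a big log file with roughly 64 MB chunks would look like this (the paths and chunk size are placeholders, not values from the kata):

// Hypothetical usage: sort huge.log into huge.sorted.log using ~64 MB sorted chunks.
SortFile("huge.log", "huge.sorted.log", 64L * 1024 * 1024);

Note that the temporary chunk files created by Path.GetTempFileName are never deleted here; cleaning them up after the merge would be a reasonable improvement.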

That’s it for this code kata! If you have comments or suggestions on how to improve this algorithm, don’t hesitate to comment below.