[Mcl-users] large graphs and mcl
Stijn van Dongen
stijn at ebi.ac.uk
Thu Jul 8 11:21:48 BST 2010
Recent experiences in Ensembl Compara indicate that the latest version of MCL
(10-148) has improved with respect to memory management.
Ensembl family graphs are now around 2.5M nodes and 500M edges.
MCL exits with an out-of-memory error, while memory does not
seem exhausted. Giving MCL more memory (with LSF) does not help.
As a result of changes made to mcl-10-148 mcl now completes for such a graph
in about 4 hours using 4 CPUs. We are not yet clear about memory usage; it
seems somewhere inbetween 20 and 30G. The input graph size is about 8G.
Of course, time usage depends on the hardware used.
MCL uses a memory model where malloc, free and realloc are used many times,
and many C objects come into existence and disappear during an MCL run.
This should all be fine. MCL is tested with valgrind a lot. It releases
all memory that it should, has no leaks, and no errors. This does not
preclude all MCL-side coding errors as valgrind is necessarily specific
to code-paths taken and data put in, but it does give confidence.
It is not possible to run the failing cases under valgrind, as the size
of the data and the time/space overhead incurred by valgrind preclude this.
An MCL-specific pattern is that large stretches of memory are first
malloced, than later realloced to a much smaller chunk. Say from 100K to
1K, happening more than 500K times.
The following change was made in 10-148. If the realloc target amount
of memory is less than half of the originally malloced (or realloced)
amount, it now frees the large chunk and mallocs an entirely new chunk. All
this is done with wrappers around C malloc and realloc of course.
It is unclear why this change helps. Code paths should not change as far as
I am aware, and the data we input is the same. A possibility is that the
malloc implementation on our system has issues. It is extremely unusual for
system libaries to have bugs in them however, as they are widely used and
bugs tend to be noticed. Conceivably the large-memory characteristics of
these computations combined with MCL's rapid churn of memory chunks make
this a corner of compute-space not often visited.
Debugging this is hard because of the size of data involved. The current
status is that a single patch (wrapping realloc as described above, only
when invoked for one specific user-defined type, a vector) solves the
issue without changing code-paths, so for now I am leaving it at this.
If you have very large graphs, better use mcl-10-148.
Stijn van Dongen >8< -o) O< forename pronunciation: [Stan]
EMBL-EBI /\\ Tel: +44-(0)1223-492675
Hinxton, Cambridge, CB10 1SD, UK _\_/ http://micans.org/stijn
More information about the Mcl-users