Marshal Dumps Faster, cPickle Loads Faster
Solution 1:
cPickle has a smarter algorithm than marshal and can use tricks to reduce the space taken by large objects. That makes it slower to encode but faster to decode, because the resulting output is smaller. marshal is simplistic and serializes the object as-is, without analyzing it any further. That also explains why marshal loading is so inefficient: it simply has to do more work, as in reading more data from disk, to achieve the same thing as cPickle.
marshal and cPickle are really different things in the end. You can't get both fast saving and fast loading, since fast saving implies analyzing the data structures less, which in turn implies saving a lot of data to disk.
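A rough way to see this trade-off on your own data is to time both modules on the same object. The sketch below is only an illustration: the `measure` helper and the sample list are made up for this example, and it falls back to `pickle` where `cPickle` is not available (on Python 3 the C implementation is picked up by `pickle` automatically). Timings vary a lot between machines and Python versions, so no expected numbers are shown.

```python
import marshal
import timeit

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3: the C accelerator is used automatically


def measure(dumps, loads, data):
    """Time one dumps/loads round trip; return (dump_s, load_s, size, restored)."""
    start = timeit.default_timer()
    blob = dumps(data)
    dump_s = timeit.default_timer() - start

    start = timeit.default_timer()
    restored = loads(blob)
    load_s = timeit.default_timer() - start
    return dump_s, load_s, len(blob), restored


# Hypothetical sample data: a list of small dictionaries.
data = [{"id": i, "name": "item %d" % i} for i in range(10000)]

results = {
    "pickle": measure(
        lambda obj: pickle.dumps(obj, pickle.HIGHEST_PROTOCOL), pickle.loads, data
    ),
    "marshal": measure(marshal.dumps, marshal.loads, data),
}

for name, (dump_s, load_s, size, _) in sorted(results.items()):
    print("%-8s dumps %.3f s loads %.3f s %d bytes" % (name, dump_s, load_s, size))
```

Running it on a sample of your real data is more informative than any synthetic list, because the result depends heavily on how many objects are shared or repeated.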
Regarding the fact that marshal might be incompatible with other versions of Python, you should generally use cPickle:
"This is not a general “persistence” module. For general persistence and transfer of Python objects through RPC calls, see the modules pickle and shelve. The marshal module exists mainly to support reading and writing the “pseudo-compiled” code for Python modules of .pyc files. Therefore, the Python maintainers reserve the right to modify the marshal format in backward incompatible ways should the need arise. If you’re serializing and de-serializing Python objects, use the pickle module instead – the performance is comparable, version independence is guaranteed, and pickle supports a substantially wider range of objects than marshal." (the python docs about marshal)
Solution 2:
Some people might think this is too much of a hack, but I've had great success by simply wrapping the pickle dump calls with gc.disable() and gc.enable(). For example, with the snippet below, writing a ~50MB list of dictionaries goes from 78 seconds to 4.
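# Not a complete example: params and fout are assumed to exist already.
import cPickle
import gc
gc.disable()  # pause the cyclic garbage collector while pickling
try:
    cPickle.dump(params, fout, cPickle.HIGHEST_PROTOCOL)
    fout.close()
finally:
    gc.enable()  # re-enable the collector even if the dump fails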
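If you use this trick in more than one place, it can be wrapped in a small context manager. This is just a sketch: `gc_paused` is a name invented here, the sample `params` list is made up, and an in-memory buffer stands in for the output file. It also restores the collector only if it was enabled to begin with.

```python
import contextlib
import gc
import io

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3


@contextlib.contextmanager
def gc_paused():
    """Disable the cyclic garbage collector inside the with-block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Only turn the collector back on if it was on before.
        if was_enabled:
            gc.enable()


# Hypothetical data and output target for illustration.
params = [{"index": i, "value": str(i)} for i in range(1000)]
gc_state_before = gc.isenabled()
fout = io.BytesIO()  # stands in for a real file opened with mode 'wb'

with gc_paused():
    pickle.dump(params, fout, pickle.HIGHEST_PROTOCOL)
```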
Solution 3:
The difference between these benchmarks gives one idea for speeding up cPickle:
Input: ["This is a string of 33 characters" for _ in xrange(1000000)]
cPickle  dumps 0.199 s  loads 0.099 s   2002041 bytes
marshal  dumps 0.368 s  loads 0.138 s  38000005 bytes
Input: ["This is a string of 33 " + "characters" for _ in xrange(1000000)]
cPickle  dumps 1.374 s  loads 0.550 s  40001244 bytes
marshal  dumps 0.361 s  loads 0.141 s  38000005 bytes
In the first case, the list repeats the same string. The second list is equivalent, but each string is a separate object, because it is the result of an expression. Now, if you are originally reading your data in from an external source, you could consider some kind of string deduplication.
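One possible form of that deduplication is to map every string to the first equal string seen, so that equal values end up as a single shared object before pickling. The helper name `dedupe_strings` and the sample data are assumptions made for this sketch, and it only handles a flat list of strings.

```python
try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3


def dedupe_strings(items):
    """Return a list in which equal strings share one object."""
    seen = {}
    # setdefault returns the first object stored for an equal key.
    return [seen.setdefault(s, s) for s in items]


# Build equal but distinct string objects, as if read from an external source.
suffix = "characters"
rows = ["This is a string of 33 " + suffix for _ in range(1000)]

deduped = dedupe_strings(rows)

raw_size = len(pickle.dumps(rows, pickle.HIGHEST_PROTOCOL))
dedup_size = len(pickle.dumps(deduped, pickle.HIGHEST_PROTOCOL))
print("raw: %d bytes, deduplicated: %d bytes" % (raw_size, dedup_size))
```

The pickler recognizes repeated objects by identity, not by value, which is why the deduplicated list produces a smaller pickle even though both lists compare equal.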
Solution 4:
You can make cPickle about 50x (!) faster by creating an instance of cPickle.Pickler and then setting the undocumented option 'fast' to 1:
import cPickle
outfile = open('outfile.pickle', 'wb')  # binary write mode; the default mode is read-only
fastPickler = cPickle.Pickler(outfile, cPickle.HIGHEST_PROTOCOL)
fastPickler.fast = 1  # fast mode: skip the memo of already-pickled objects
fastPickler.dump(myHugeObject)
outfile.close()
But if your myHugeObject has cyclic references, the dump method will never end.
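The reason is that fast mode switches off the pickler's memo of objects it has already written, so it cannot notice that it is visiting the same object again. A side effect is that shared objects are written out in full every time they appear. The sketch below illustrates that side effect on acyclic data only; `dump_bytes` and the sample list are made up for this example, and it uses `pickle` where `cPickle` is not available.

```python
import io

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3


def dump_bytes(obj, fast=False):
    """Pickle obj into a bytes string, optionally in fast mode."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, pickle.HIGHEST_PROTOCOL)
    pickler.fast = fast  # True disables the memo of already-seen objects
    pickler.dump(obj)
    return buf.getvalue()


# One string object referenced 1000 times (no cycles, so fast mode is safe).
shared = "x" * 100
data = [shared] * 1000

normal_blob = dump_bytes(data)
fast_blob = dump_bytes(data, fast=True)
print("normal: %d bytes, fast: %d bytes" % (len(normal_blob), len(fast_blob)))
```

So fast mode trades output size for speed, and it is only worth trying when the data has no cycles and little sharing.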
Solution 5:
As you can see, the output produced by cPickle.dump is about 1/4 the length of the output produced by marshal.dump. This means that cPickle must use a more complicated algorithm to dump the data, since unneeded things are removed. When loading the dumped list, marshal has to work through much more data, while cPickle can process its data quickly because there is less data to analyze.