forum.alglib.net
http://forum.alglib.net/

kdtreeserialize fails on very large trees in .Net
http://forum.alglib.net/viewtopic.php?f=2&t=3780
Page 1 of 1

Author:  ddurschlag [ Fri Dec 02, 2016 3:28 pm ]
Post subject:  kdtreeserialize fails on very large trees in .Net

With this setting:

<runtime>
<gcAllowVeryLargeObjects enabled="true" />
</runtime>

.Net can allocate very large arrays/matrices. While kdtreebuildtagged will successfully build a tree for these matrices, kdtreeserialize throws when attempting to serialize them. I'm using a 3,000,000 x 300 matrix in my personal case.

Author:  Sergey.Bochkanov [ Mon Dec 05, 2016 1:18 pm ]
Post subject:  Re: kdtreeserialize fails on very large trees in .Net

I can not be 100% sure, but I suspect that enture tree ensemble is bigger than 2.000.000.000 double precision numbers. Although NET supports large objects, there still exist limit on 1D array size (no more than 2G elements).

Author:  ddurschlag [ Mon Dec 05, 2016 2:30 pm ]
Post subject:  Re: kdtreeserialize fails on very large trees in .Net

Sergey.Bochkanov wrote:
I can not be 100% sure, but I suspect that enture tree ensemble is bigger than 2.000.000.000 double precision numbers. Although NET supports large objects, there still exist limit on 1D array size (no more than 2G elements).


I don't think this is the issue -- the array allocates fine, I fill it in fine, the KD tree builds fine, and the approximate search runs fine and returns good results. The only part that fails is serialization.

Author:  Sergey.Bochkanov [ Tue Dec 06, 2016 11:46 am ]
Post subject:  Re: kdtreeserialize fails on very large trees in .Net

My current version is still that the data can fit into multidimensional xy array, but do not fit into 1D structure used to store them. Can you tell me sizes of tree.innerobj.xy and tree.innerobj.splits fields? These should be the largest arrays in the structure.

P.S. If you want, you can submit request for trial of commercial version. ALGLIB for C# with native computational core should not be prone to such 32-bit index limitations - on 64-bit systems it internally uses 64-bit indexes.

Author:  ddurschlag [ Tue Dec 06, 2016 3:44 pm ]
Post subject:  Re: kdtreeserialize fails on very large trees in .Net

Sergey.Bochkanov wrote:
My current version is still that the data can fit into multidimensional xy array, but do not fit into 1D structure used to store them. Can you tell me sizes of tree.innerobj.xy and tree.innerobj.splits fields? These should be the largest arrays in the structure.

P.S. If you want, you can submit request for trial of commercial version. ALGLIB for C# with native computational core should not be prone to such 32-bit index limitations - on 64-bit systems it internally uses 64-bit indexes.


Here is the debugger readout of my tree:

_Tree.innerobj {alglib.nearestneighbor.kdtree} alglib.nearestneighbor.kdtree
approxf 0 double
boxmax {double[300]} double[]
boxmin {double[300]} double[]
buf {double[3000000]} double[]
curboxmax {double[300]} double[]
curboxmin {double[300]} double[]
curdist 0 double
debugcounter 0 int
idx {int[3000000]} int[]
kcur 0 int
kneeded 0 int
n 3000000 int
nodes {int[36000000]} int[]
normtype 2 int
nx 300 int
ny 0 int
r {double[3000000]} double[]
rneeded 0 double
selfmatch false bool
splits {double[6000000]} double[]
tags {int[3000000]} int[]
x {double[300]} double[]
xy {double[3000000, 600]} double[,]

xy appears to be 300,000 x 600, while splits is a 6,000,000 element single-dimensional array. My employer already holds a full commercial license to alglib -- I simply hadn't acquired the library because the free downloadable version is more readily available. I'll have to give the commercial edition a try.

As a note, the maximum size of a single dimension of an array of doubles appears to be 2,146,435,071 per the documentation of gcAllowVeryLargeObjects at https://msdn.microsoft.com/en-us/library/hh285054%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396

Update: Tried commercial version (alglib64_hpc.dll):

System.AccessViolationException was unhandled
HResult=-2147467261
Message=Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

The stacktrace is useless (only goes into my code for some reason), but the exception occus on line 2334 of alglib_hpc.cs:

int _error_code = _i_x_kdtreeserialize(&_error_msg, &_x, &_out);

In context:

public static unsafe void kdtreeserialize(kdtree obj, out string s_out)
{
byte *_error_msg = null;
byte *_out = null;
void *_x = obj.ptr;
int _error_code = _i_x_kdtreeserialize(&_error_msg, &_x, &_out);

So I'm not sure exactly how to debug/fix this. Potentially interesting is that the process currently has nearly 28GB of memory allocated.

Author:  Sergey.Bochkanov [ Wed Dec 07, 2016 9:49 am ]
Post subject:  Re: kdtreeserialize fails on very large trees in .Net

it gets more and more interesting :) I will perform several experiments at my side, will keep you posted on the issue.

Author:  Sergey.Bochkanov [ Thu Dec 08, 2016 2:15 pm ]
Post subject:  Re: kdtreeserialize fails on very large trees in .Net

Hi!

You were wrong, xy appears to be 3.000.000x600, not 300.000. It gives about 1.8G elements, with each of them occupying 8 bytes in memory when stored in binary format and about 12 bytes when stored as text. So, original array requires 14GB, and serialized array should take about 21GB. However, this memory is allocated in the unmanaged address space; if you want to transfer data to managed environment, you will need one more allocation of 21GB. It gives us memory requirement of ~56GB.

As for the 100% managed version, it will require "just" 35GB (14+21), but it will have to allocate array with 35G elements, which is definitely out of NET capabilities (32-bit int index).

Arrays are big, and no wonder that ALGLIB has problems with them. Nevertheless, this message about access violation looks suspicious - it should either allocate data correctly, or fail to allocate at all. It should not access incorrect memory locations. I suspect that somewhere deep in the ALGLIB we have one "incorrect" cast of 64-bit index to 32-bit int...

I will continue investigation, just wanted to keep you informed about current progress.

Author:  ddurschlag [ Thu Dec 08, 2016 2:29 pm ]
Post subject:  Re: kdtreeserialize fails on very large trees in .Net

Thanks for the update.

One of the things I'd most like to see in alglib is a move to a more stream-oriented set of data types. For example, while double[,] makes sense for genuine matrices, I get the impression that in many cases it's really just a bunch of double[] that are looped over. Using IEnumerable<double[]> for this rather than double[,] would potentially save a ton of memory. In this particular case, if the kdtree serialization method could take a StreamWriter instead of returning a string, the same benefit should be seen.

Whether this is plausible with alglib's unusual translated-pascal architecture I don't know, but it is something I've wished for repeatedly :)

Author:  Sergey.Bochkanov [ Mon Dec 12, 2016 9:38 am ]
Post subject:  Re: kdtreeserialize fails on very large trees in .Net

Hi!

I think that it is possible to implement "serialize to file" and "serialize to stream" functionality which will completely avoid allocation of additional memory. It would be good if you become our first beta-tester of this functionality :) I think that first prototype will be implemented in a few days.

As for the stream-oriented interface - you were right, it is impossible to integrate it directly into our automatically generated framework. The best thing we can do is to implement interface which silently creates temporary double[,] matrix to store the stream. Not sure whether it makes sense or not...

Author:  ddurschlag [ Mon Dec 12, 2016 1:26 pm ]
Post subject:  Re: kdtreeserialize fails on very large trees in .Net

The "silent conversion" is something I typically implement in my adapter layer on top of alglib. The problem, of course, is that this loads everything into memory. While it might make the interface look more like other C# code, the main benefit of limiting allocations is lost.

Even if I can't have streaming everything, I'd certainly be happy to beta-test streaming (de)serialization! Let me know when/what I can do.

Page 1 of 1 All times are UTC
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group
http://www.phpbb.com/