forum.alglib.net

ALGLIB forum
It is currently Thu Mar 28, 2024 2:07 pm

All times are UTC


Forum rules


1. This forum can be used for discussion of both ALGLIB-related and general numerical analysis questions
2. This forum is English-only - postings in other languages will be removed.



Post new topic Reply to topic  [ 10 posts ] 
Author Message
 Post subject: kdtreeserialize fails on very large trees in .Net
PostPosted: Fri Dec 02, 2016 3:28 pm 
Offline

Joined: Fri Dec 02, 2016 3:18 pm
Posts: 6
With this setting:

<runtime>
<gcAllowVeryLargeObjects enabled="true" />
</runtime>

.Net can allocate very large arrays/matrices. While kdtreebuildtagged will successfully build a tree for these matrices, kdtreeserialize throws when attempting to serialize them. I'm using a 3,000,000 x 300 matrix in my personal case.


Top
 Profile  
 
 Post subject: Re: kdtreeserialize fails on very large trees in .Net
PostPosted: Mon Dec 05, 2016 1:18 pm 
Offline
Site Admin

Joined: Fri May 07, 2010 7:06 am
Posts: 903
I can not be 100% sure, but I suspect that enture tree ensemble is bigger than 2.000.000.000 double precision numbers. Although NET supports large objects, there still exist limit on 1D array size (no more than 2G elements).


Top
 Profile  
 
 Post subject: Re: kdtreeserialize fails on very large trees in .Net
PostPosted: Mon Dec 05, 2016 2:30 pm 
Offline

Joined: Fri Dec 02, 2016 3:18 pm
Posts: 6
Sergey.Bochkanov wrote:
I can not be 100% sure, but I suspect that enture tree ensemble is bigger than 2.000.000.000 double precision numbers. Although NET supports large objects, there still exist limit on 1D array size (no more than 2G elements).


I don't think this is the issue -- the array allocates fine, I fill it in fine, the KD tree builds fine, and the approximate search runs fine and returns good results. The only part that fails is serialization.


Top
 Profile  
 
 Post subject: Re: kdtreeserialize fails on very large trees in .Net
PostPosted: Tue Dec 06, 2016 11:46 am 
Offline
Site Admin

Joined: Fri May 07, 2010 7:06 am
Posts: 903
My current version is still that the data can fit into multidimensional xy array, but do not fit into 1D structure used to store them. Can you tell me sizes of tree.innerobj.xy and tree.innerobj.splits fields? These should be the largest arrays in the structure.

P.S. If you want, you can submit request for trial of commercial version. ALGLIB for C# with native computational core should not be prone to such 32-bit index limitations - on 64-bit systems it internally uses 64-bit indexes.


Top
 Profile  
 
 Post subject: Re: kdtreeserialize fails on very large trees in .Net
PostPosted: Tue Dec 06, 2016 3:44 pm 
Offline

Joined: Fri Dec 02, 2016 3:18 pm
Posts: 6
Sergey.Bochkanov wrote:
My current version is still that the data can fit into multidimensional xy array, but do not fit into 1D structure used to store them. Can you tell me sizes of tree.innerobj.xy and tree.innerobj.splits fields? These should be the largest arrays in the structure.

P.S. If you want, you can submit request for trial of commercial version. ALGLIB for C# with native computational core should not be prone to such 32-bit index limitations - on 64-bit systems it internally uses 64-bit indexes.


Here is the debugger readout of my tree:

_Tree.innerobj {alglib.nearestneighbor.kdtree} alglib.nearestneighbor.kdtree
approxf 0 double
boxmax {double[300]} double[]
boxmin {double[300]} double[]
buf {double[3000000]} double[]
curboxmax {double[300]} double[]
curboxmin {double[300]} double[]
curdist 0 double
debugcounter 0 int
idx {int[3000000]} int[]
kcur 0 int
kneeded 0 int
n 3000000 int
nodes {int[36000000]} int[]
normtype 2 int
nx 300 int
ny 0 int
r {double[3000000]} double[]
rneeded 0 double
selfmatch false bool
splits {double[6000000]} double[]
tags {int[3000000]} int[]
x {double[300]} double[]
xy {double[3000000, 600]} double[,]

xy appears to be 300,000 x 600, while splits is a 6,000,000 element single-dimensional array. My employer already holds a full commercial license to alglib -- I simply hadn't acquired the library because the free downloadable version is more readily available. I'll have to give the commercial edition a try.

As a note, the maximum size of a single dimension of an array of doubles appears to be 2,146,435,071 per the documentation of gcAllowVeryLargeObjects at https://msdn.microsoft.com/en-us/library/hh285054%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396

Update: Tried commercial version (alglib64_hpc.dll):

System.AccessViolationException was unhandled
HResult=-2147467261
Message=Attempted to read or write protected memory. This is often an indication that other memory is corrupt.

The stacktrace is useless (only goes into my code for some reason), but the exception occus on line 2334 of alglib_hpc.cs:

int _error_code = _i_x_kdtreeserialize(&_error_msg, &_x, &_out);

In context:

public static unsafe void kdtreeserialize(kdtree obj, out string s_out)
{
byte *_error_msg = null;
byte *_out = null;
void *_x = obj.ptr;
int _error_code = _i_x_kdtreeserialize(&_error_msg, &_x, &_out);

So I'm not sure exactly how to debug/fix this. Potentially interesting is that the process currently has nearly 28GB of memory allocated.


Top
 Profile  
 
 Post subject: Re: kdtreeserialize fails on very large trees in .Net
PostPosted: Wed Dec 07, 2016 9:49 am 
Offline
Site Admin

Joined: Fri May 07, 2010 7:06 am
Posts: 903
it gets more and more interesting :) I will perform several experiments at my side, will keep you posted on the issue.


Top
 Profile  
 
 Post subject: Re: kdtreeserialize fails on very large trees in .Net
PostPosted: Thu Dec 08, 2016 2:15 pm 
Offline
Site Admin

Joined: Fri May 07, 2010 7:06 am
Posts: 903
Hi!

You were wrong, xy appears to be 3.000.000x600, not 300.000. It gives about 1.8G elements, with each of them occupying 8 bytes in memory when stored in binary format and about 12 bytes when stored as text. So, original array requires 14GB, and serialized array should take about 21GB. However, this memory is allocated in the unmanaged address space; if you want to transfer data to managed environment, you will need one more allocation of 21GB. It gives us memory requirement of ~56GB.

As for the 100% managed version, it will require "just" 35GB (14+21), but it will have to allocate array with 35G elements, which is definitely out of NET capabilities (32-bit int index).

Arrays are big, and no wonder that ALGLIB has problems with them. Nevertheless, this message about access violation looks suspicious - it should either allocate data correctly, or fail to allocate at all. It should not access incorrect memory locations. I suspect that somewhere deep in the ALGLIB we have one "incorrect" cast of 64-bit index to 32-bit int...

I will continue investigation, just wanted to keep you informed about current progress.


Top
 Profile  
 
 Post subject: Re: kdtreeserialize fails on very large trees in .Net
PostPosted: Thu Dec 08, 2016 2:29 pm 
Offline

Joined: Fri Dec 02, 2016 3:18 pm
Posts: 6
Thanks for the update.

One of the things I'd most like to see in alglib is a move to a more stream-oriented set of data types. For example, while double[,] makes sense for genuine matrices, I get the impression that in many cases it's really just a bunch of double[] that are looped over. Using IEnumerable<double[]> for this rather than double[,] would potentially save a ton of memory. In this particular case, if the kdtree serialization method could take a StreamWriter instead of returning a string, the same benefit should be seen.

Whether this is plausible with alglib's unusual translated-pascal architecture I don't know, but it is something I've wished for repeatedly :)


Top
 Profile  
 
 Post subject: Re: kdtreeserialize fails on very large trees in .Net
PostPosted: Mon Dec 12, 2016 9:38 am 
Offline
Site Admin

Joined: Fri May 07, 2010 7:06 am
Posts: 903
Hi!

I think that it is possible to implement "serialize to file" and "serialize to stream" functionality which will completely avoid allocation of additional memory. It would be good if you become our first beta-tester of this functionality :) I think that first prototype will be implemented in a few days.

As for the stream-oriented interface - you were right, it is impossible to integrate it directly into our automatically generated framework. The best thing we can do is to implement interface which silently creates temporary double[,] matrix to store the stream. Not sure whether it makes sense or not...


Top
 Profile  
 
 Post subject: Re: kdtreeserialize fails on very large trees in .Net
PostPosted: Mon Dec 12, 2016 1:26 pm 
Offline

Joined: Fri Dec 02, 2016 3:18 pm
Posts: 6
The "silent conversion" is something I typically implement in my adapter layer on top of alglib. The problem, of course, is that this loads everything into memory. While it might make the interface look more like other C# code, the main benefit of limiting allocations is lost.

Even if I can't have streaming everything, I'd certainly be happy to beta-test streaming (de)serialization! Let me know when/what I can do.


Top
 Profile  
 
Display posts from previous:  Sort by  
Post new topic Reply to topic  [ 10 posts ] 

All times are UTC


Who is online

Users browsing this forum: No registered users and 55 guests


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
Jump to:  
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group