I have many similar datasets. Currently I run PCA on each one separately and store the basis vectors and variance values per dataset. Since the datasets are similar, I would like to exploit that similarity to reduce how much I store: run PCA once on all of the data combined, producing a small number of basis vectors that are shared across all datasets.
My thought on how to implement this is as follows:
- Create one large matrix containing the sample points from all the datasets.
- Run truncated PCA on it to generate N shared basis vectors (see the sketch after this list).
- No need to retain the variance values at this stage; just the N shared basis vectors.
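To make this concrete, here is a minimal NumPy sketch of that global step, assuming features are stored as rows and samples as columns (so every dataset matrix has the same number of rows). The `datasets` list, the dimensions, and `N = 10` are placeholder values, not my real data:

```python
import numpy as np

# Toy stand-ins for the real datasets: 5 datasets, each with 50 features
# (rows) and 200 samples (columns). All names here are placeholders.
rng = np.random.default_rng(0)
datasets = [rng.standard_normal((50, 200)) for _ in range(5)]

# One large matrix of all sample points, stacked side by side (50 x 1000).
X_all = np.hstack(datasets)

# Center each feature (row) across all samples.
mean_all = X_all.mean(axis=1, keepdims=True)
Xc = X_all - mean_all

# Truncated PCA via SVD: keep only the top-N left singular vectors
# as the shared basis (50 x N). N = 10 is an arbitrary example value.
N = 10
U, _, _ = np.linalg.svd(Xc, full_matrices=False)
shared_basis = U[:, :N]  # columns are the shared basis vectors
```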
Then, for each smaller dataset:

- Create a matrix of that dataset's sample points (same number of rows, i.e. the same features, as the combined matrix).
- Project the dataset onto the shared basis vectors, which accounts for some of its dimensionality. (How could I calculate the variance values from this projection? The sketch after this list shows my guess.)
- Run PCA on the residual (the dataset minus its shared-basis component) to generate additional basis vectors and variance values.
- Store the additional basis vectors and the full set of variance values for each smaller dataset.
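Here is a sketch of that per-dataset step, reusing `shared_basis` and `mean_all` from the snippet above. The `shared_variances` line is my best guess at where the variance values would come from (the variance of each dataset's coordinates along the shared basis vectors), and `compress_dataset` and `n_extra` are made-up names:

```python
def compress_dataset(X, shared_basis, mean_all, n_extra):
    """Per-dataset step: project onto the shared basis, then PCA the residual.

    X is d x n (features as rows). n_extra is how many additional,
    dataset-specific basis vectors to keep.
    """
    Xc = X - mean_all             # center with the global mean (the dataset's
                                  # own mean is another possible choice)
    scores = shared_basis.T @ Xc  # N x n coordinates in the shared basis

    # My guess at the variance values: the variance of this dataset's
    # coordinates along each shared basis vector.
    shared_variances = scores.var(axis=1, ddof=1)

    # Remove the shared-basis component; anything PCA finds in this residual
    # is orthogonal to the shared basis vectors (up to numerical precision).
    # This is uncentered PCA on the residual; re-centering is a refinement.
    residual = Xc - shared_basis @ scores
    U_r, S_r, _ = np.linalg.svd(residual, full_matrices=False)
    extra_basis = U_r[:, :n_extra]
    extra_variances = S_r[:n_extra] ** 2 / (X.shape[1] - 1)

    return extra_basis, shared_variances, extra_variances

# Per dataset, store only extra_basis plus both sets of variances.
per_dataset = [compress_dataset(X, shared_basis, mean_all, 4) for X in datasets]
```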
Does this seem like a reasonable way to approach the problem? I think the per-dataset basis vectors will come out orthogonal to the shared basis vectors, but that is not a requirement for my application. Thanks!