Concatenating sparse matrices in Python using SciPy / NumPy (sparse matrix computation in Python)

25-01-27 27

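This article covers concatenating sparse matrices in Python using SciPy / NumPy and related sparse matrix computation. It also collects questions on reading a CSV into a sparse matrix in Python, further optimizing sparse matrix code for stochastic gradient descent in Scipy, computing distance matrices for Scipy sparse input (Scikit or Scipy), and adding a very repetitive matrix to a sparse one in numpy / scipy.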

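Contents:

Concatenating sparse matrices in Python using SciPy / NumPy (sparse matrix computation in Python)
CSV to sparse matrix in Python
python – Scipy – How to further optimize sparse matrix code for stochastic gradient descent
python – Scipy sparse – distance matrix (Scikit or Scipy)
python – Adding a very repetitive matrix to a sparse one in numpy / scipy?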

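Concatenating sparse matrices in Python using SciPy / NumPy (sparse matrix computation in Python)

What is the most efficient way to concatenate sparse matrices in Python using SciPy / NumPy?

Here is what I am using at the moment: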

>>> np.hstack((X,X2))
array([ <49998x70000 sparse matrix of type '<class 'numpy.float64'>'
        with 1135520 stored elements in Compressed Sparse Row format>,<49998x70000 sparse matrix of type '<class 'numpy.int64'>'
        with 1135520 stored elements in Compressed Sparse Row format>],dtype=object)

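I would like to use both predictors in a regression, but the current format is obviously not what I am looking for. Would it be possible to get the following:

    <49998x140000 sparse matrix of type '<class 'numpy.float64'>'
     with 2271040 stored elements in Compressed Sparse Row format>

It is too large to be converted to a dense format.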
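The thread as collected here carries no answer to this question. A minimal sketch of one common approach is `scipy.sparse.hstack`, which joins sparse matrices column-wise without densifying them. The two small matrices below are made-up stand-ins for `X` and `X2`, not the asker's data:

```python
import numpy as np
from scipy import sparse

# Hypothetical stand-ins for the two predictor matrices X and X2.
X = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 3.0, 0.0]]))
X2 = sparse.csr_matrix(np.array([[0, 4, 0],
                                 [5, 0, 6]]))

# Column-wise concatenation that stays sparse; format="csr" asks for a
# CSR result instead of the default COO.
combined = sparse.hstack([X, X2], format="csr")

print(combined.shape)  # rows are kept, column counts add up
print(combined.nnz)    # stored elements of both inputs
```

`np.hstack` treats each sparse matrix as an opaque object, which is why the original call returned an object array rather than one matrix. `scipy.sparse.vstack` is the row-wise counterpart.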

CSV to sparse matrix in Python

I have a large csv file that lists connections between nodes in a graph. Example:

0001,95784
0001,98743
0002,00082
0002,00091

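So this means that node id 0001 is connected to nodes 95784 and 98743, and so on. I need to read this into a sparse matrix in numpy. How can I do this? I am new to Python, so tutorials on this would also help.

Answer 1

An example using scipy's lil_matrix (list of lists matrix).

Row-based linked list sparse matrix.
It contains a list of rows (self.rows), each of which is a sorted list of the column indices of the non-zero elements. It also contains a list (self.data) of lists of these elements.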

$ cat 1938894-simplified.csv
0,32
1,21
1,23
1,32
2,23
2,53
2,82
3,82
4,46
5,75
7,86
8,28

Code:

#!/usr/bin/env python
import csv
from scipy import sparse

rows, columns = 10, 100
matrix = sparse.lil_matrix((rows, columns))

csvreader = csv.reader(open('1938894-simplified.csv'))
for line in csvreader:
    row, column = map(int, line)
    # Appends the column index straight to the internal per-row list.
    matrix.data[row].append(column)

print(matrix.data)

Output:

[[32] [21, 23, 32] [23, 53, 82] [82] [46] [75] [] [86] [28] []]
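The answer above writes column indices directly into `matrix.data`. As an alternative sketch, not the answerer's method, the edge list can be collected into row and column arrays and handed to `scipy.sparse.coo_matrix`, which yields an adjacency matrix holding a 1 for every edge. The helper name `edges_to_sparse` and the square shape inferred from the largest node id are assumptions made for illustration:

```python
import csv
import io

import numpy as np
from scipy import sparse


def edges_to_sparse(lines):
    """Build a square CSR adjacency matrix from 'source,target' lines."""
    rows, cols = [], []
    for src, dst in csv.reader(lines):
        rows.append(int(src))  # int() drops leading zeros: "0001" -> 1
        cols.append(int(dst))
    n = max(max(rows), max(cols)) + 1  # assumption: ids are 0-based integers
    data = np.ones(len(rows), dtype=np.int8)
    return sparse.coo_matrix((data, (rows, cols)), shape=(n, n)).tocsr()


# The four sample lines from the question, read from memory instead of a file.
sample = io.StringIO("0001,95784\n0001,98743\n0002,00082\n0002,00091\n")
adjacency = edges_to_sparse(sample)
print(adjacency.shape, adjacency.nnz)
```

For a real file, passing `open(path, newline="")` in place of the `StringIO` object should work the same way, since `csv.reader` accepts any iterable of lines.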

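python – Scipy – How to further optimize sparse matrix code for stochastic gradient descent

I am implementing a stochastic gradient descent algorithm for a recommender system using Scipy's sparse matrices.

This is what a first basic implementation looks like: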

N = self.model.shape[0]  # number of users
M = self.model.shape[1]  # number of items
self.p = np.random.rand(N,K)
self.q = np.random.rand(M,K)
rows,cols = self.model.nonzero()
for step in xrange(steps):
    for u,i in zip(rows,cols):
        e=self.model-np.dot(self.p,self.q.T) #calculate error for gradient
        p_temp = learning_rate * ( e[u,i] * self.q[i,:] - regularization * self.p[u,:])
        self.q[i,:]+= learning_rate * ( e[u,i] * self.p[u,:] - regularization * self.q[i,:])
        self.p[u,:] += p_temp

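Unfortunately, my code is still pretty slow, even for a small 4×5 ratings matrix. I suspect this is due to the for loop over the sparse matrix. I have tried expressing the updates to q and p with fancy indexing, but since I am still fairly new to scipy and numpy, I could not figure out a better way to do it.

Do you have any pointers on how to avoid iterating explicitly over the rows and columns of the sparse matrix?

Solution

I have forgotten almost everything about recommender systems, so I may be translating your code incorrectly, but you re-evaluate self.model-np.dot(self.p,self.q.T) inside every iteration of the inner loop, while I am almost certain it should be evaluated once per step.

It also looks like you are doing the matrix multiplication by hand, which can probably be sped up with a direct matrix multiplication (numpy or scipy will do it faster than you can by hand), something like this: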

for step in xrange(steps):
    e = self.model - np.dot(self.p,self.q.T)
    p_temp = learning_rate * np.dot(e,self.q)
    self.q *= (1-regularization)
    self.q += learning_rate*(np.dot(e.T,self.p))
    self.p *= (1-regularization)
    self.p += p_temp
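A self-contained sketch of the same idea follows, under stated assumptions: it is a full-batch gradient step rather than true per-sample SGD, it computes the error only on the observed (non-zero) ratings so nothing dense of size users × items is built, and the function name `factorize`, its defaults, and the sample ratings are all made up for illustration.

```python
import numpy as np
from scipy import sparse


def factorize(ratings, k=2, steps=200, learning_rate=0.01,
              regularization=0.02, seed=0):
    """Batch-gradient matrix factorization on the observed entries only."""
    ratings = sparse.coo_matrix(ratings, dtype=np.float64)
    rows, cols, vals = ratings.row, ratings.col, ratings.data
    rng = np.random.default_rng(seed)
    p = rng.random((ratings.shape[0], k))  # user factors
    q = rng.random((ratings.shape[1], k))  # item factors
    for _ in range(steps):
        # Prediction for each observed (user, item) pair, computed once per step.
        pred = np.einsum("ij,ij->i", p[rows], q[cols])
        # Sparse error matrix with the same sparsity pattern as the ratings.
        err = sparse.csr_matrix((vals - pred, (rows, cols)), shape=ratings.shape)
        p_step = learning_rate * (err @ q - regularization * p)
        q += learning_rate * (err.T @ p - regularization * q)
        p += p_step
    return p, q


# Hypothetical 4x5 ratings matrix; zeros mean "not rated".
ratings = np.array([[5, 3, 0, 1, 0],
                    [4, 0, 0, 1, 0],
                    [1, 1, 0, 5, 0],
                    [0, 1, 5, 4, 0]])
p, q = factorize(ratings)
print(np.round(p @ q.T, 2))
```

Keeping the error sparse means each step costs time proportional to the number of stored ratings times `k`, which is the main reason to prefer it over subtracting a dense `np.dot(p, q.T)` from the model.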

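python – Scipy sparse – distance matrix (Scikit or Scipy)

I am trying to compute nearest neighbor clustering on a Scipy sparse matrix returned by scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn, I get an error message using the 'euclidean' distance through both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could compute these distance matrices.

My matrix is highly sparse, with shape: <364402x223209 sparse matrix of type '<class 'numpy.float64'>' with 728804 stored elements in Compressed Sparse Row format>.

I have also tried methods such as pdist and kdtree in Scipy, but received other errors about not being able to process the result.

Can anyone please point me to a solution that would effectively allow me to compute the distance matrix and/or the nearest neighbor results?

Some example code: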

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial

file = 'FileLocation'
data = []
FILE = open(file,'r')
for line in FILE:
    templine = line.strip().split(',')
    data.append({'user':str(int(templine[0])),str(int(templine[1])):int(templine[2])})
FILE.close()

vec = DictVectorizer()
X = vec.fit_transform(data)

result = scipy.spatial.KDTree(X)

Error:

Traceback (most recent call last):
  File "__init__
    self.n,self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack

Similarly, if I run:

scipy.spatial.distance.pdist(X,'euclidean')

I get the following:

Traceback (most recent call last):
  File "distance.py",line 1169,in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py",line 113,in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.

Finally, running NearestNeighbors in scikit-learn results in a memory error, using:

nbrs = NearestNeighbors(n_neighbors=10,algorithm='brute')

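Best answer

First, you cannot use KDTree and pdist with a sparse matrix; you have to convert it to dense (whether that is an option is your call):

>>> X
<2x3 sparse matrix of type '...' ... in Compressed Sparse Row format>

>>> scipy.spatial.KDTree(X.todense())
...
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])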

Second, from the docs:

Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.

You may want to try the 'ball_tree' algorithm and see if it can handle your data.
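When converting to dense is not an option, Euclidean distances can still be computed from sparse rows through the identity ||a − b||² = ||a||² + ||b||² − 2·a·b, one block of rows at a time. The sketch below is an illustration using only NumPy and SciPy, not the answerer's method; the helper name `sparse_euclidean_block` and the sample matrix are made up.

```python
import numpy as np
from scipy import sparse


def sparse_euclidean_block(X, start, stop):
    """Distances from rows X[start:stop] to every row of X (dense block)."""
    X = sparse.csr_matrix(X, dtype=np.float64)
    # Squared norm of every row, computed without densifying X.
    sq_norms = np.asarray(X.multiply(X).sum(axis=1)).ravel()
    block = X[start:stop]
    # Only this (stop - start) x n_rows block of dot products becomes dense.
    cross = (block @ X.T).toarray()
    d2 = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * cross
    np.maximum(d2, 0.0, out=d2)  # clip tiny negatives caused by rounding
    return np.sqrt(d2)


X_demo = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                     [0.0, 3.0, 0.0]]))
D = sparse_euclidean_block(X_demo, 0, 2)
print(D)
```

For a matrix with hundreds of thousands of rows the full distance matrix will not fit in memory, so in practice one would process a small block at a time and keep only the nearest few neighbors from each block.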

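python – Adding a very repetitive matrix to a sparse one in numpy / scipy?

I am trying to implement a function in NumPy / Scipy that computes the Jensen-Shannon divergence between a single (training) vector and a large number of other (observation) vectors. The observation vectors are stored in a very large (500,000×65536) Scipy sparse matrix (a dense matrix will not fit into memory).

As part of the algorithm, I need to compute T + Oi for each observation vector Oi, where T is the training vector. I could not find a way to do this with NumPy's usual broadcasting rules, since sparse matrices do not seem to support them (if T is left as a dense array, Scipy tries to make the sparse matrix dense first, which runs out of memory; if I make T a sparse matrix, T + Oi fails because the shapes are inconsistent).

Currently I am taking the grossly inefficient step of tiling the training vector into a 500,000×65536 sparse matrix: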

training = sp.csr_matrix(training.astype(np.float32))
tindptr = np.arange(0,len(training.indices)*observations.shape[0]+1,len(training.indices),dtype=np.int32)
tindices = np.tile(training.indices,observations.shape[0])
tdata = np.tile(training.data,observations.shape[0])
mtraining = sp.csr_matrix((tdata,tindices,tindptr),shape=observations.shape)

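But this takes up a huge amount of memory (around 6GB) when it only stores ~1500 "real" elements. It is also pretty slow to construct.

I tried to be clever by using stride_tricks to build a CSR matrix whose indptr and data members do not use extra memory for the repeated data.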

training = sp.csr_matrix(training)
mtraining = sp.csr_matrix(observations.shape,dtype=np.int32)
tdata = training.data
vdata = np.lib.stride_tricks.as_strided(tdata,(mtraining.shape[0],tdata.size),(0,tdata.itemsize))
indices = training.indices
vindices = np.lib.stride_tricks.as_strided(indices,(mtraining.shape[0],indices.size),(0,indices.itemsize))
mtraining.indptr = np.arange(0,len(indices)*mtraining.shape[0]+1,len(indices),dtype=np.int32)
mtraining.data = vdata
mtraining.indices = vindices

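But this does not work, because the strided views mtraining.data and mtraining.indices have the wrong shape (and according to this answer there is no way to make them the right shape). Trying to make them look flat with the .flat iterator fails because it does not look enough like an array (for example, it has no dtype member), and using the flatten() method ends up making a copy.

Is there any way to get this done?

Solution

Another option, which you have not even considered, is to implement the sum yourself in sparse format, so that you can take full advantage of the periodic nature of your array. This can be very easy to do if you abuse this peculiar behavior of scipy's sparse matrices: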

>>> a = sps.csr_matrix([1,2,3,4])
>>> a.data
array([1, 2, 3, 4])
>>> a.indices
array([0, 1, 2, 3])
>>> a.indptr
array([0, 4])

>>> b = sps.csr_matrix((np.array([1, 2, 3, 4, 5]),
...                     np.array([0, 1, 2, 3, 0]),
...                     np.array([0, 5])), shape=(1, 4))
>>> b
<1x4 sparse matrix of type '<type 'numpy.int32'>'
    with 5 stored elements in Compressed Sparse Row format>
>>> b.todense()
matrix([[6, 2, 3, 4]])

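So you do not even have to look for coincidences between the training vector and each row of the observation matrix in order to add them up: just cram all the data in with the right pointers, and whatever needs to be summed will get summed when the data is accessed.

Edit

Given how slow the first code was, you can trade memory for speed as follows: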

import numpy as np
import scipy.sparse as sps

def csr_add_sparse_vec(sps_mat, sps_vec):
    """Adds a sparse vector to every row of a sparse matrix"""
    # No checks done,but both arguments should be sparse matrices in CSR
    # format,both should have the same number of columns,and the vector
    # should be a vector and have only one row.

    rows,cols = sps_mat.shape
    nnz_vec = len(sps_vec.data)
    nnz_per_row = np.diff(sps_mat.indptr)
    longest_row = np.max(nnz_per_row)

    old_data = np.zeros((rows * longest_row,),dtype=sps_mat.data.dtype)
    old_cols = np.zeros((rows * longest_row,),dtype=sps_mat.indices.dtype)

    data_idx = np.arange(longest_row) < nnz_per_row[:,None]
    data_idx = data_idx.reshape(-1)
    old_data[data_idx] = sps_mat.data
    old_cols[data_idx] = sps_mat.indices
    old_data = old_data.reshape(rows,-1)
    old_cols = old_cols.reshape(rows,-1)

    new_data = np.zeros((rows,longest_row + nnz_vec),dtype=sps_mat.data.dtype)
    new_data[:,:longest_row] = old_data
    del old_data
    new_cols = np.zeros((rows,longest_row + nnz_vec),dtype=sps_mat.indices.dtype)
    new_cols[:,:longest_row] = old_cols
    del old_cols
    new_data[:,longest_row:] = sps_vec.data
    new_cols[:,longest_row:] = sps_vec.indices
    new_data = new_data.reshape(-1)
    new_cols = new_cols.reshape(-1)
    new_pointer = np.arange(0,(rows + 1) * (longest_row + nnz_vec),longest_row + nnz_vec)

    ret = sps.csr_matrix((new_data,new_cols,new_pointer),shape=sps_mat.shape)
    ret.eliminate_zeros()

    return ret

It is not that fast, but it can do 10,000 rows in about 1 second:

In [2]: a
Out[2]: 
<10000x65536 sparse matrix of type '<type 'numpy.float64'>'
    with 15000000 stored elements in Compressed Sparse Row format>

In [3]: b
Out[3]: 
<1x65536 sparse matrix of type '<type 'numpy.float64'>'
    with 1500 stored elements in Compressed Sparse Row format>

In [4]: csr_add_sparse_vec(a,b)
Out[4]: 
<10000x65536 sparse matrix of type '<type 'numpy.float64'>'
    with 30000000 stored elements in Compressed Sparse Row format>

In [5]: %timeit csr_add_sparse_vec(a,b)
1 loops,best of 3: 956 ms per loop
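For checking results on small inputs, a shorter alternative can serve as a reference. The sketch below is not the answerer's method: it tiles the vector by multiplying a sparse column of ones with the sparse row vector, then adds the result to the matrix. The name `add_vec_to_rows` and the sample data are made up, and the tiled intermediate still stores one copy of the vector's non-zeros per row, so it does not solve the memory problem from the question.

```python
import numpy as np
import scipy.sparse as sps


def add_vec_to_rows(mat, vec):
    """Add the 1 x n sparse row `vec` to every row of the sparse matrix `mat`."""
    mat = sps.csr_matrix(mat)
    vec = sps.csr_matrix(vec)
    # (rows x 1) column of ones times (1 x n) row vector = vec repeated per row.
    ones = sps.csr_matrix(np.ones((mat.shape[0], 1), dtype=vec.dtype))
    tiled = ones @ vec  # stays sparse
    return (mat + tiled).tocsr()


mat_demo = sps.csr_matrix(np.array([[1, 0, 2],
                                    [0, 0, 3]]))
vec_demo = sps.csr_matrix(np.array([[0, 5, 1]]))
result = add_vec_to_rows(mat_demo, vec_demo)
print(result.toarray())
```

Unlike the duplicate-index trick, the sparse addition here merges overlapping entries, so the result holds one stored value per occupied position.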

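Edit: this code is very, very slow

def csr_add_sparse_vec(sps_mat, sps_vec):
    """Adds a sparse vector to every row of a sparse matrix"""
    # No checks done, but both arguments should be sparse matrices in CSR
    # format, both should have the same number of columns, and the vector
    # should be a vector and have only one row.

    rows, cols = sps_mat.shape

    new_data = sps_mat.data
    new_pointer = sps_mat.indptr.copy()
    new_cols = sps_mat.indices

    aux_idx = np.arange(rows + 1)

    for value, col in zip(sps_vec.data, sps_vec.indices):
        new_data = np.insert(new_data, new_pointer[1:], [value] * rows)
        new_cols = np.insert(new_cols, new_pointer[1:], [col] * rows)
        new_pointer += aux_idx

    return sps.csr_matrix((new_data, new_cols, new_pointer), shape=sps_mat.shape)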
