
以numpy或pandas处理巨大数字(numpy 将大于某个值设置为0)



This article revolves around handling huge numbers with numpy or pandas and setting values greater than some threshold to 0 in numpy, and aims to give you a detailed reference. We cover the pros and cons of handling huge numbers with numpy or pandas, answer the related questions about zeroing out values above a threshold, and also bring you practical material on 16 - numpy notes - Mofan (莫烦) pandas - 4, Numpy Pandas, numpy – specifying dtype float32 with pandas.read_csv on pandas 0.10.1, and numpy.random.random & numpy.ndarray.astype & numpy.arange.

Contents of this article:

1. Handling huge numbers with numpy or pandas (setting values greater than a threshold to 0 in numpy)
2. 16 - numpy notes - Mofan (莫烦) pandas - 4
3. Numpy Pandas
4. numpy – specifying dtype float32 with pandas.read_csv on pandas 0.10.1
5. numpy.random.random & numpy.ndarray.astype & numpy.arange

Handling huge numbers with numpy or pandas (setting values greater than a threshold to 0 in numpy)

I'm competing in a contest and have been given anonymized data. Quite a few of the columns have HUGE values; the largest is 40 digits long! I used pd.read_csv, but those columns were read in as objects.

My original plan was to scale the data down, but since the columns are treated as objects I can't do arithmetic on them.

Does anyone have a suggestion for how to handle very large numbers in Pandas or Numpy?

Note that I tried converting the values to uint64 with no luck. I get the error message "too long to convert".
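A minimal sketch of the problem (my own illustration, not from the question; the exact exception text varies by pandas/numpy version): a 40-digit value simply does not fit in uint64, whose maximum has only 20 digits, whereas plain Python ints have arbitrary precision.

import pandas as pd

s = pd.Series(['1234567890123456789012345678901234567890'])  # 40-digit value read in as object/string
try:
    s.astype('uint64')          # uint64 tops out at 18446744073709551615, so this overflows
except (OverflowError, ValueError) as e:
    print(type(e).__name__, e)

print(int(s.iloc[0]) + 1)       # arbitrary-precision Python int arithmetic works fine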

Answer 1


You can use Pandas converters to call int on the strings at import time, or use another custom converter function:

import pandas as pd
from io import StringIO

txt = '''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,1,"Tiny"
4,-9999999999999999999999999999999999999999,"Really negative"'''

df = pd.read_csv(StringIO(txt), converters={'Big_Num': int})
print(df)

This prints:

   line                                    Big_Num                           text
0     1   1234567890123456789012345678901234567890      That sure is a big number
1     2   9999999999999999999999999999999999999999  That is an even BIGGER number
2     3                                          1                           Tiny
3     4  -9999999999999999999999999999999999999999                Really negative

Now test arithmetic:

n = df["Big_Num"][1]
print(n, n + 1)

This prints:

9999999999999999999999999999999999999999 10000000000000000000000000000000000000000

If the column contains any values that would make int choke, you can do:

txt = '''\
line,Big_Num,text
1,1234567890123456789012345678901234567890,"That sure is a big number"
2,9999999999999999999999999999999999999999,"That is an even BIGGER number"
3,0.000000000000000001,"Tiny"
4,"a string","Use 0 for strings"'''

def conv(s):
    # fall back from int to float, and to 0 for anything non-numeric
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return 0

df = pd.read_csv(StringIO(txt), converters={'Big_Num': conv})
print(df)

This prints:

   line                                   Big_Num                           text
0     1  1234567890123456789012345678901234567890      That sure is a big number
1     2  9999999999999999999999999999999999999999  That is an even BIGGER number
2     3                                     1e-18                           Tiny
3     4                                         0              Use 0 for strings

Every value in the column will then be a Python int or float and will support arithmetic.
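As for the "set values greater than some threshold to 0" part of the title, which the answer above doesn't show: on a numpy array (or a numeric pandas column) this is plain boolean masking. A minimal sketch with made-up values and a hypothetical threshold:

import numpy as np

arr = np.array([1.5e17, 2.3e19, 7.0e16, 9.9e20])   # made-up values for illustration
threshold = 1e18                                    # hypothetical cut-off

clipped = np.where(arr > threshold, 0, arr)         # copy with the huge entries replaced by 0
arr[arr > threshold] = 0                            # or zero them in place with a boolean mask
print(clipped)
print(arr)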

16 - numpy notes - Mofan (莫烦) pandas - 4

Code

import pandas as pd
import numpy as np

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D'])

# assign NaN by (row, column) position
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan

# drop by row
print('-1-')
print(df.dropna(axis=0))

# drop a row if it contains any NaN (the default)
print('-2-')
print(df.dropna(axis=0, how='any'))

# drop a row only if all of its values are NaN
print('-3-')
print(df.dropna(axis=0, how='all'))

# fill NaN values
print('-4-')
print(df.fillna(value=0))

# element-wise NaN check
print('-5-')
print(df.isnull())

# is there any null anywhere in the frame?
print('-6-')
print(np.any(df.isnull()) == True)

# reading and saving data: read_csv / to_csv
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d'])

print('-7-')
print(df1)
print(df2)
print(df3)

# axis=0: concatenate vertically (stack rows)
res = pd.concat([df1,df2,df3], axis=0)
print('-8-')
print(res)

res = pd.concat([df1,df2,df3], axis=0, ignore_index=True)
print('-9-')
print(res)


df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])
print('-10-')
print(df1)
print(df2)

# join modes
res = pd.concat([df1,df2])
print('-11-')
print(res)
# default: union of the columns (outer join)
res = pd.concat([df1,df2], join='outer')
print('-12-')
print(res)
# intersection of the columns (inner join)
res = pd.concat([df1,df2], join='inner')
print('-13-')
print(res)

res = pd.concat([df1,df2], join='inner', ignore_index=True)
print('-14-')
print(res)

# axis=1: concatenate side by side, keeping only df1's index
res = pd.concat([df1,df2], axis=1, join_axes=[df1.index])
print('-15-')
print(res)

# axis=1: concatenate side by side
res = pd.concat([df1,df2], axis=1)
print('-16-')
print(res)

df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2, columns=['b','c','d','e'], index=[2,3,4])

res = df1.append(df2, ignore_index=True)
print('-17-')
print(res)

res = df1.append([df2, df3], ignore_index=True)
print('-18-')
print(res)

s1 = pd.Series([1,2,3,4], index=['a','b','c','d'])
res = df1.append(s1, ignore_index=True)

print('-19-')
print(res)

  

Output

-1-
             A     B     C   D
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
-2-
             A     B     C   D
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
-3-
             A     B     C   D
2013-01-01   0   NaN   2.0   3
2013-01-02   4   5.0   NaN   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
-4-
             A     B     C   D
2013-01-01   0   0.0   2.0   3
2013-01-02   4   5.0   0.0   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
-5-
                A      B      C      D
2013-01-01  False   True  False  False
2013-01-02  False  False   True  False
2013-01-03  False  False  False  False
2013-01-04  False  False  False  False
2013-01-05  False  False  False  False
2013-01-06  False  False  False  False
-6-
True
-7-
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
     a    b    c    d
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
     a    b    c    d
0  2.0  2.0  2.0  2.0
1  2.0  2.0  2.0  2.0
2  2.0  2.0  2.0  2.0
-8-
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
0  1.0  1.0  1.0  1.0
1  1.0  1.0  1.0  1.0
2  1.0  1.0  1.0  1.0
0  2.0  2.0  2.0  2.0
1  2.0  2.0  2.0  2.0
2  2.0  2.0  2.0  2.0
-9-
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0
6  2.0  2.0  2.0  2.0
7  2.0  2.0  2.0  2.0
8  2.0  2.0  2.0  2.0
-10-
     a    b    c    d
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0
     b    c    d    e
2  1.0  1.0  1.0  1.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
d:\Alex\WorkLog\34-deeplearning\myWorks\TransferLearningExample\mofangTransferLearning\1.py:62: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  res = pd.concat([df1,df2])
-11-
     a    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN
2  0.0  0.0  0.0  0.0  NaN
3  0.0  0.0  0.0  0.0  NaN
2  NaN  1.0  1.0  1.0  1.0
3  NaN  1.0  1.0  1.0  1.0
4  NaN  1.0  1.0  1.0  1.0
d:\Alex\WorkLog\34-deeplearning\myWorks\TransferLearningExample\mofangTransferLearning\1.py:66: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  res = pd.concat([df1,df2], join=''outer'')
-12-
     a    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN
2  0.0  0.0  0.0  0.0  NaN
3  0.0  0.0  0.0  0.0  NaN
2  NaN  1.0  1.0  1.0  1.0
3  NaN  1.0  1.0  1.0  1.0
4  NaN  1.0  1.0  1.0  1.0
-13-
     b    c    d
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  0.0  0.0
2  1.0  1.0  1.0
3  1.0  1.0  1.0
4  1.0  1.0  1.0
-14-
     b    c    d
0  0.0  0.0  0.0
1  0.0  0.0  0.0
2  0.0  0.0  0.0
3  1.0  1.0  1.0
4  1.0  1.0  1.0
5  1.0  1.0  1.0
-15-
     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
-16-
     a    b    c    d    b    c    d    e
1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
4  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
-17-
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  1.0  1.0  1.0
4  1.0  1.0  1.0  1.0
5  1.0  1.0  1.0  1.0
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py:6201: FutureWarning: Sorting because non-concatenation axis
is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  sort=sort)
-18-
     a    b    c    d    e
0  0.0  0.0  0.0  0.0  NaN
1  0.0  0.0  0.0  0.0  NaN
2  0.0  0.0  0.0  0.0  NaN
3  1.0  1.0  1.0  1.0  NaN
4  1.0  1.0  1.0  1.0  NaN
5  1.0  1.0  1.0  1.0  NaN
6  NaN  2.0  2.0  2.0  2.0
7  NaN  2.0  2.0  2.0  2.0
8  NaN  2.0  2.0  2.0  2.0
-19-
     a    b    c    d
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  1.0  2.0  3.0  4.0
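A note beyond the original notes: the join_axes argument and DataFrame.append used above were later removed from pandas (join_axes in 1.0, append in 2.0). On recent versions, the -15- and -17- results can be reproduced roughly as follows, assuming df1 and df2 as defined in the code above:

# modern replacement for pd.concat([df1, df2], axis=1, join_axes=[df1.index])
res = pd.concat([df1, df2], axis=1).reindex(df1.index)

# modern replacement for df1.append(df2, ignore_index=True)
res = pd.concat([df1, df2], ignore_index=True)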

  

 

Numpy Pandas

Numpy and Pandas are modules whose computations are faster than Python's built-in dict for data analysis. They are written in C and operate on whole arrays/matrices, which avoids expensive per-element computation and speeds things up many times over. They are used throughout TensorFlow, machine learning, and related fields.

Installing Numpy: search Google for www.numpy.org/, go to Getting Numpy, find the SourceForge link, and download numpy.

Alternatively, install the full Anaconda bundle and get everything in one go (or choose Miniconda).

In a Windows terminal, run:

pip3 install numpy
pip3 install pandas
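A quick sanity check after installing (a minimal sketch; the printed versions depend on what pip actually installed):

import numpy as np
import pandas as pd

print(np.__version__)   # prints the installed numpy version
print(pd.__version__)   # prints the installed pandas version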

numpy – specifying dtype float32 with pandas.read_csv on pandas 0.10.1

I'm trying to read a simple whitespace-delimited file with pandas' read_csv method. However, pandas does not seem to honor my dtype argument. Maybe I'm specifying it incorrectly?

I've boiled down my somewhat complicated calls to read_csv into this simple test case. I'm actually using the converters argument in my "real" scenario, but I removed it for simplicity.

Here is my IPython session:

>>> cat test.out
a b
0.76398 0.81394
0.32136 0.91063
>>> import pandas
>>> import numpy
>>> x = pandas.read_csv('test.out',dtype={'a': numpy.float32},delim_whitespace=True)
>>> x
         a        b
0  0.76398  0.81394
1  0.32136  0.91063
>>> x.a.dtype
dtype('float64')

I also tried this with a dtype of numpy.int32 or numpy.int64. Those choices raise an exception:

AttributeError: 'NoneType' object has no attribute 'dtype'

I assume the AttributeError is raised because pandas won't automatically try to convert/truncate the float values to integers?

I'm running a 32-bit build of Python on a 32-bit machine.

>>> !uname -a
Linux ubuntu 3.0.0-13-generic #22-Ubuntu SMP Wed Nov 2 13:25:36 UTC 2011 i686 i686 i386 GNU/Linux
>>> import platform
>>> platform.architecture()
('32bit','ELF')
>>> pandas.__version__
'0.10.1'

Solution

0.10.1 doesn't really support float32.

See http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#dtype-specification

You can do this in 0.11:

# don't use dtype converters explicitly for the columns you care about;
# they will be converted to float64 if possible, or object if they cannot
df = pd.read_csv('test.csv'.....)

#### this is optional and related to the issue you posted ####
# force anything that is not numeric to nan
# `columns` is the list of columns that you are interested in
df[columns] = df[columns].convert_objects(convert_numeric=True)

# astype
df[columns] = df[columns].astype('float32')

see http://pandas.pydata.org/pandas-docs/dev/basics.html#object-conversion

It's not as efficient as doing it directly in read_csv (but that requires some low-level changes).

I've confirmed that with 0.11-dev this DOES work (on both 32-bit and 64-bit; the results are the same):

In [5]: x = pd.read_csv(StringIO.StringIO(data),dtype={'a': np.float32},delim_whitespace=True)

In [6]: x
Out[6]: 
         a        b
0  0.76398  0.81394
1  0.32136  0.91063

In [7]: x.dtypes
Out[7]: 
a    float32
b    float64
dtype: object

In [8]: pd.__version__
Out[8]: '0.11.0.dev-385ff82'

In [9]: quit()
vagrant@precise32:~/pandas$ uname -a
Linux precise32 3.2.0-23-generic-pae #36-Ubuntu SMP Tue Apr 10 22:19:09 UTC 2012 i686 i686 i386 GNU/Linux
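A side note beyond the original answer: convert_objects has since been deprecated and removed, and modern pandas honors dtype={'a': np.float32} in read_csv directly. On a current version, the coercion-plus-astype step from the answer would be written roughly as follows, assuming df and columns as in the answer:

# modern equivalent of convert_objects(convert_numeric=True) followed by astype('float32')
df[columns] = df[columns].apply(pd.to_numeric, errors='coerce').astype('float32')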


numpy.random.random & numpy.ndarray.astype & numpy.arange

Today I came across this code:

xb = np.random.random((nb, d)).astype('float32')  # create an nb-by-d matrix of random numbers
xb[:, 0] += np.arange(nb) / 1000.                 # add an increasing offset to each entry of the first column

Understanding these two lines requires understanding three functions.

1. Generating random numbers

numpy.random.random(size=None) 

When size is None, a single float is returned.

When size is not None, a numpy.ndarray is returned. For example, numpy.random.random((1,2)) returns a numpy array with 1 row and 2 columns.

 

2. Converting the type of every element in a numpy array

numpy.ndarray.astype(dtype)

Returns a numpy.ndarray. For example, numpy.array([1, 2, 2.5]).astype(int) returns the numpy array [1, 2, 2].

 

3. Generating an evenly spaced sequence

numpy.arange([start, ]stop, [step, ]dtype=None)

It behaves much like Python's built-in range() and numpy's numpy.linspace.

Returns a numpy array. For example, numpy.arange(3) returns the numpy array [0, 1, 2].
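Putting the three together, here is a minimal runnable sketch of the original two lines, assuming hypothetical sizes nb = 5 and d = 3 (the random values will differ on every run):

import numpy as np

nb, d = 5, 3                                      # hypothetical sizes for illustration
xb = np.random.random((nb, d)).astype('float32')  # nb-by-d matrix of random float32 values in [0, 1)
xb[:, 0] += np.arange(nb) / 1000.                 # adds [0.000, 0.001, 0.002, 0.003, 0.004] to column 0

print(xb.shape, xb.dtype)                         # (5, 3) float32
print(xb[:, 0])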

