Binning a DataFrame in Python with pandas

25-01-24 20


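This article looks in detail at binning a DataFrame in Python with pandas. It also covers four related questions: Python TypeError: cannot convert the series when trying to do math on a DataFrame; python - Sampling two pandas DataFrames in the same way; Python - Find all files in a directory with the extension .txt in Python; and Python - LogReturn over an entire DataFrame.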

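Contents:

Binning a DataFrame in Python with pandas
Python TypeError: cannot convert the series when trying to do math on a DataFrame
python - Sampling two pandas DataFrames in the same way
Python - Find all files in a directory with the extension .txt in Python
Python - LogReturn over an entire DataFrame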

Binning a DataFrame in Python with pandas

Given the following pandas DataFrame:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100), "b": np.random.random(100), "id": np.arange(100)})

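where id is an identifier for each point consisting of an a and a b value, how can I bin a and b into a specified set of bins (so that I can then take the median/mean of a and b in each bin)?
df may have NaN values for a or b (or both) in any given row. Thanks.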

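Here is a better example that uses Joe Kington's solution with a more realistic df. What I'm unsure about is how to access the df.b elements for each of the df.a groups below: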

a = np.random.random(20)
df = pandas.DataFrame({"a": a, "b": a + 10})

# bins for df.a
bins = np.linspace(0, 1, 10)

# bin df according to a
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of a in each group
print(groups.mean())

## But how to get the mean of b for each group of a?
# ...

Answer 1

There may be a more efficient way (I have a feeling pandas.crosstab would be useful here), but here is how I'd do it:

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100),
                       "id": np.arange(100)})

# Bin the data frame by "a" using 10 evenly spaced bin edges...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(np.digitize(df.a, bins))

# Get the mean of each bin:
print(groups.mean())  # Also could do "groups.aggregate(np.mean)"

# Similarly, the median:
print(groups.median())

# Apply some arbitrary function to aggregate binned data
print(groups.aggregate(lambda x: np.mean(x[x > 0.5])))

Edit: Since the OP was asking specifically for the means of b binned by the values in a, just do

groups.mean().b
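To make the grouping key less opaque, here is a small sketch of what np.digitize hands to groupby. The toy values and bin edges are made up for illustration and are not from the question; the point is only that each row gets an integer bin label i such that bins[i-1] <= x < bins[i].

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so bin membership is easy to check by hand
df = pd.DataFrame({"a": [0.05, 0.15, 0.25, 0.35, 0.95],
                   "b": [1.0, 2.0, 3.0, 4.0, 5.0]})
bins = np.array([0.0, 0.2, 0.4, 1.0])

# One integer label per row: i such that bins[i-1] <= a < bins[i]
labels = np.digitize(df["a"].to_numpy(), bins)

# Mean of b within each bin of a
mean_b = df.groupby(labels)["b"].mean()
print(labels)
print(mean_b)
```

Selecting the column before aggregating (["b"].mean()) gives the same numbers as groups.mean().b but skips computing the means of the other columns.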

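Also, if you want the index to look nicer (e.g. display the intervals as the index), as in @bdiamante's example, use pandas.cut instead of numpy.digitize. (Kudos to bdiamante. I didn't realize pandas.cut existed.)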

import numpy as np
import pandas

df = pandas.DataFrame({"a": np.random.random(100),
                       "b": np.random.random(100) + 10})

# Bin the data frame by "a" using 10 evenly spaced bin edges...
bins = np.linspace(df.a.min(), df.a.max(), 10)
groups = df.groupby(pandas.cut(df.a, bins))

# Get the mean of b, binned by the values in a
print(groups.mean().b)

The result is:

a
(0.00186, 0.111]    10.421839
(0.111, 0.22]       10.427540
(0.22, 0.33]        10.538932
(0.33, 0.439]       10.445085
(0.439, 0.548]      10.313612
(0.548, 0.658]      10.319387
(0.658, 0.767]      10.367444
(0.767, 0.876]      10.469655
(0.876, 0.986]      10.571008
Name: b
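The question also mentions NaN values and wanting both the mean and the median. The following is a minimal sketch of one way to handle that with pandas.cut, not part of the original answer. The data and bin edges are made up; include_lowest=True is used so that a value equal to the first edge is not left out of the first interval.

```python
import numpy as np
import pandas as pd

# Hypothetical data with NaN in both columns
df = pd.DataFrame({
    "a": [0.0, 0.1, 0.4, 0.6, 0.9, np.nan],
    "b": [10.0, np.nan, 12.0, 13.0, 14.0, 15.0],
})
bins = [0.0, 0.5, 1.0]

# include_lowest=True keeps a == 0.0 inside the first interval
a_bins = pd.cut(df["a"], bins, include_lowest=True)

# Rows where a is NaN get no bin and are dropped by groupby;
# NaN values in b are skipped by mean and median.
summary = df.groupby(a_bins, observed=False)[["a", "b"]].agg(["mean", "median"])
print(summary)
```

The result has one row per interval and a two-level column index, so summary[("b", "mean")] picks out the mean of b per bin of a.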

Python TypeError: cannot convert the series when trying to do math on a DataFrame

I have a dict of DataFrames that looks like this:

defaultdict(<class 'list'>, {'XYF':
      TimeUS          GyrX          GyrY           GyrZ        AccX  \
0  207146570   0.000832914   0.001351716  -0.0004189798   -0.651183
1  207186671   0.001962787   0.001242457  -0.0001859666  -0.6423497
2  207226791  9.520243E-05   0.001076498  -0.0005664826  -0.6360412
3  207246474  0.0001093059   0.001616917   0.0003615251  -0.6342875
4  207286244   0.001412051  0.0007565815  -0.0003780428   -0.637755

[103556 rows x 12 columns], 'DAR':
      TimeUS  RSSI  RemRSSI  TxBuf  Noise  RemNoise  RxErrors  Fixed
0  208046965   159      161     79     25        29         0      0
1  208047074   159      161     79     25        29         0      0
2  208927455   159      159     91     28        28         0      0
3  208927557   159      159     91     28        28         0      0

[4136 rows x 8 columns], 'NK2':
      TimeUS    IVN   IVE   IVD    IPN   IPE    IPD  IMX  IMY  IMZ  IYAW  \
0  207147350  -0.02  0.02  0.00  -0.02  0.01   0.20    0    0    0  1.94
1  207187259  -0.02  0.02  0.00  -0.02  0.01   0.20    0    0    0  1.94
2  207227559  -0.02  0.02  0.00  -0.02  0.01   0.14    0    0    0  1.77
3  207308304   0.02  0.02  0.00  -0.01  0.01  -0.05    0    0    0  1.77
4  207347766   0.02  0.02  0.00  -0.01  0.01  -0.05    0    0    0  0.82

I first isolated the column I want to do math on:

new_time = dfs['XYF']['TimeUS']

Then I tried a few things to do some math on it, without any luck. First I just treated it like a list, so

new_time_F = new_time / 1000000

That didn't work and gave me this error:

TypeError: unsupported operand type(s) for /: 'str' and 'int'

So I did this:

new_time_F = float(new_time) / 1000000

which gave me this error:

TypeError: cannot convert the series to <class 'float'>

I don't know where to go from here.

Answer 1

What if you do this (as suggested earlier):

new_time = dfs['XYF']['TimeUS'].astype(float)
new_time_F = new_time / 1000000
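astype(float) raises if any entry in the column cannot be parsed as a number. As an alternative sketch (not from the original answer), pandas.to_numeric with errors="coerce" turns such entries into NaN instead. The Series below is a made-up stand-in for the TimeUS column, with one deliberately bad value.

```python
import pandas as pd

# Hypothetical column that was read in as strings, with one unparseable entry
time_us = pd.Series(["207146570", "207186671", "bad"], name="TimeUS")

# errors="coerce" converts unparseable strings to NaN instead of raising
time_s = pd.to_numeric(time_us, errors="coerce") / 1_000_000
print(time_s)
```

The division is then applied element by element, which is what the float(new_time) attempt could not do: float() accepts a single value, not a whole Series.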

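python - Sampling two pandas DataFrames in the same way

I'm doing a machine learning computation with two DataFrames, one for the factors and one for the target values. I have to split both into training and test parts. I think I've found a way, but I'm looking for a more elegant solution. Here is my code: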

import pandas as pd
import numpy as np
import random

df_source = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))
df_target = pd.DataFrame(np.random.randn(5, 2), index=range(0, 10, 2), columns=list('CD'))

rows = np.asarray(random.sample(range(0,len(df_source)),2))

df_source_train = df_source.iloc[rows]
df_source_test = df_source[~df_source.index.isin(df_source_train.index)]
df_target_train = df_target.iloc[rows]
df_target_test = df_target[~df_target.index.isin(df_target_train.index)]

print('rows')
print(rows)
print('source')
print(df_source)
print('source train')
print(df_source_train)
print('source_test')
print(df_source_test)

--- Edit: unutbu's solution (modified) ---

np.random.seed(2013)
percentile = .6
rows = np.random.binomial(1,percentile,size=len(df_source)).astype(bool)

df_source_train = df_source[rows]
df_source_test = df_source[~rows]
df_target_train = df_target[rows]
df_target_test = df_target[~rows]

Solution

If you make rows a boolean array of length len(df), you can get the True rows with df[rows] and the False rows with df[~rows]:

import pandas as pd
import numpy as np
import random
np.random.seed(2013)

df_source = pd.DataFrame(
    np.random.randn(5, 2), index=range(0, 10, 2), columns=list('AB'))

rows = np.random.randint(2,size=len(df_source)).astype('bool')

df_source_train = df_source[rows]
df_source_test = df_source[~rows]

print(rows)
# [ True  True False  True False]

# if for some reason you need the index values of where `rows` is True
print(np.where(rows))  
# (array([0,1,3]),)

print(df_source)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 4 -1.320541  0.679631
# 6  0.833612  0.492572
# 8  1.555721  1.741279

print(df_source_train)
#           A         B
# 0  0.279545  0.107474
# 2  0.651458 -1.516999
# 6  0.833612  0.492572

print(df_source_test)
#           A         B
# 4 -1.320541  0.679631
# 8  1.555721  1.741279
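A random boolean mask does not fix the size of the training set. If a fixed fraction is wanted, one possible sketch (not from the original answer) is to draw the training rows once with DataFrame.sample and reuse their index labels for the second DataFrame. The seed 2013 and the 60% fraction are carried over from the edit above; the frames themselves are random placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2013)
idx = range(0, 10, 2)
df_source = pd.DataFrame(rng.standard_normal((5, 2)), index=idx, columns=list("AB"))
df_target = pd.DataFrame(rng.standard_normal((5, 2)), index=idx, columns=list("CD"))

# Draw the training rows once, then reuse the index labels for both frames
df_source_train = df_source.sample(frac=0.6, random_state=2013)
train_idx = df_source_train.index
df_target_train = df_target.loc[train_idx]

# Everything not drawn for training becomes the test set
df_source_test = df_source.drop(train_idx)
df_target_test = df_target.drop(train_idx)

print(sorted(train_idx))
```

This relies on both DataFrames sharing the same index, as they do in the question.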

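Python - Find all files in a directory with the extension .txt in Python

How can I find all the files in a directory that have the extension .txt in Python?

Answer 1

You can use glob:

import glob, os

os.chdir("/mydir")
for file in glob.glob("*.txt"):
    print(file)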

Or simply os.listdir:

import os

for file in os.listdir("/mydir"):
    if file.endswith(".txt"):
        print(os.path.join("/mydir", file))

Or, if you want to traverse the directory tree, use os.walk:

import os

for root, dirs, files in os.walk("/mydir"):
    for file in files:
        if file.endswith(".txt"):
            print(os.path.join(root, file))
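For comparison, the same two cases (one directory, or the whole tree) can be sketched with the standard library's pathlib. The helper name find_txt_files and its recursive flag are made up for this example; Path.glob and Path.rglob are the actual pathlib methods doing the work.

```python
from pathlib import Path


def find_txt_files(directory, recursive=False):
    """Return sorted .txt file paths under `directory` (hypothetical helper)."""
    base = Path(directory)
    # rglob descends into subdirectories; glob stays in `directory` itself
    matches = base.rglob("*.txt") if recursive else base.glob("*.txt")
    return sorted(p for p in matches if p.is_file())


if __name__ == "__main__":
    for path in find_txt_files(".", recursive=True):
        print(path)
```

Unlike the os.chdir variant, this does not change the process's working directory.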

Python - LogReturn over an entire DataFrame

Another way, using diff:

new_df = np.log(df).diff()
print(new_df)

Output

               AAPL      TSLA      NESN        FB      ROCH       TOT  \
Date                                                                    
2/1/2019        NaN       NaN       NaN       NaN       NaN       NaN   
3/1/2019  -0.104924 -0.031978  0.025128 -0.029469  0.022162  0.002271   
4/1/2019   0.041803  0.056094  0.016647  0.046061  0.010114  0.028874   
7/1/2019  -0.002228  0.052935 -0.010583  0.000725 -0.008844 -0.001838   
8/1/2019   0.018884  0.001164  0.003139  0.031937  0.025992 -0.003132   
9/1/2019   0.016839  0.009438  0.009238  0.011857  0.000927  0.021722   
10/1/2019  0.003191  0.018845  0.007732 -0.000208  0.006771 -0.007613   
11/1/2019 -0.009866  0.006616  0.001421 -0.002778 -0.005845 -0.019661   
14/1/2019 -0.015151 -0.037736 -0.000947  0.010996 -0.002471  0.007579   
15/1/2019  0.020260  0.029553  0.003075  0.024191  0.004937 -0.009065   
16/1/2019  0.012143  0.004692 -0.008062 -0.009511 -0.001540 -0.003910   
17/1/2019  0.005920  0.003634  0.006052  0.005138 -0.000617  0.002981   
18/1/2019  0.006140 -0.138930  0.001301  0.011665  0.005843  0.014771   
22/1/2019 -0.022702 -0.011112 -0.005450 -0.016599 -0.012342 -0.019993   
23/1/2019  0.004036 -0.038640  0.005687 -0.022408  0.008348 -0.009960   
24/1/2019 -0.007958  0.013538  0.005654  0.010547 -0.012704  0.007338   
25/1/2019  0.032600  0.018792 -0.006955  0.021572 -0.000312  0.016179   
28/1/2019 -0.009298 -0.002224  0.008950 -0.010389  0.002181 -0.011503   
29/1/2019 -0.010419  0.003637  0.016856 -0.022493  0.004348  0.008917   
30/1/2019  0.066101  0.037317  0.003567  0.042300  0.000310  0.001848   

               VISA       JPM  
Date                           
2/1/2019        NaN       NaN  
3/1/2019  -0.036702 -0.022402  
4/1/2019   0.042179  0.036202  
7/1/2019   0.017872  0.000695  
8/1/2019   0.005424 -0.001887  
9/1/2019   0.011700 -0.001692  
10/1/2019  0.001877 -0.000100  
11/1/2019 -0.004409 -0.004793  
14/1/2019 -0.006978  0.010257  
15/1/2019  0.001749  0.007304  
16/1/2019  0.000000  0.008032  
17/1/2019 -0.000437  0.004089  
18/1/2019  0.008848  0.016096  
22/1/2019 -0.003254 -0.015902  
23/1/2019 -0.007562 -0.002529  
24/1/2019  0.005023  0.000584  
25/1/2019  0.007020  0.006307  
28/1/2019 -0.019516  0.004728  
29/1/2019 -0.007307  0.002788  
30/1/2019  0.019076  0.002301

Sure, just drop the column name so that the operation applies to the whole DataFrame:

df1 = np.log(df/df.shift(1))
#alternative for lower pandas versions
#df1 = pd.DataFrame(np.log(df/df.shift(1)),index=df.index,columns=df.columns)

Another idea, using DataFrame.pct_change:

df1 = np.log(df.pct_change().add(1))

print (df1)
               AAPL      TSLA      NESN        FB      ROCH       TOT  \
Date                                                                    
2/1/2019        NaN       NaN       NaN       NaN       NaN       NaN   
3/1/2019  -0.104924 -0.031978  0.025128 -0.029469  0.022162  0.002271   
4/1/2019   0.041803  0.056094  0.016647  0.046061  0.010114  0.028874   
7/1/2019  -0.002228  0.052935 -0.010583  0.000725 -0.008844 -0.001838   
8/1/2019   0.018884  0.001164  0.003139  0.031937  0.025992 -0.003132   
9/1/2019   0.016839  0.009438  0.009238  0.011857  0.000927  0.021722   
10/1/2019  0.003191  0.018845  0.007732 -0.000208  0.006771 -0.007613   
11/1/2019 -0.009866  0.006616  0.001421 -0.002778 -0.005845 -0.019661   
14/1/2019 -0.015151 -0.037736 -0.000947  0.010996 -0.002471  0.007579   
15/1/2019  0.020260  0.029553  0.003075  0.024191  0.004937 -0.009065   
16/1/2019  0.012143  0.004692 -0.008062 -0.009511 -0.001540 -0.003910   
17/1/2019  0.005920  0.003634  0.006052  0.005138 -0.000617  0.002981   
18/1/2019  0.006140 -0.138930  0.001301  0.011665  0.005843  0.014771   
22/1/2019 -0.022702 -0.011112 -0.005450 -0.016599 -0.012342 -0.019993   
23/1/2019  0.004036 -0.038640  0.005687 -0.022408  0.008348 -0.009960   
24/1/2019 -0.007958  0.013538  0.005654  0.010547 -0.012704  0.007338   
25/1/2019  0.032600  0.018792 -0.006955  0.021572 -0.000312  0.016179   
28/1/2019 -0.009298 -0.002224  0.008950 -0.010389  0.002181 -0.011503   
29/1/2019 -0.010419  0.003637  0.016856 -0.022493  0.004348  0.008917   
30/1/2019  0.066101  0.037317  0.003567  0.042300  0.000310  0.001848   

               VISA       JPM  
Date                           
2/1/2019        NaN       NaN  
3/1/2019  -0.036702 -0.022402  
4/1/2019   0.042179  0.036202  
7/1/2019   0.017872  0.000695  
8/1/2019   0.005424 -0.001887  
9/1/2019   0.011700 -0.001692  
10/1/2019  0.001877 -0.000100  
11/1/2019 -0.004409 -0.004793  
14/1/2019 -0.006978  0.010257  
15/1/2019  0.001749  0.007304  
16/1/2019  0.000000  0.008032  
17/1/2019 -0.000437  0.004089  
18/1/2019  0.008848  0.016096  
22/1/2019 -0.003254 -0.015902  
23/1/2019 -0.007562 -0.002529  
24/1/2019  0.005023  0.000584  
25/1/2019  0.007020  0.006307  
28/1/2019 -0.019516  0.004728  
29/1/2019 -0.007307  0.002788  
30/1/2019  0.019076  0.002301
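The three expressions above are the same quantity written three ways, since log(p_t / p_{t-1}) = log(p_t) - log(p_{t-1}) and pct_change() + 1 = p_t / p_{t-1}. A quick sketch to check this numerically on a tiny price table (the column names and prices are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical prices for two made-up tickers
prices = pd.DataFrame({"X": [100.0, 110.0, 99.0], "Y": [50.0, 50.0, 55.0]})

r_diff = np.log(prices).diff()                # log(p_t) - log(p_{t-1})
r_shift = np.log(prices / prices.shift(1))    # log(p_t / p_{t-1})
r_pct = np.log(prices.pct_change().add(1))    # log(1 + simple return)

# All three agree up to floating-point rounding; the first row is NaN in each
same = (np.allclose(r_diff, r_shift, equal_nan=True)
        and np.allclose(r_diff, r_pct, equal_nan=True))
print(same)
```

The first row is NaN in every variant because there is no previous price to compare against.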
