How to convert a string to an ArrayType of dictionaries (JSON) in PySpark

25-03-04 12

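For readers interested in converting a string to a dictionary in PySpark, this article covers casting a string column to an ArrayType of JSON objects. It also collects related material on Java's toCharArray() for converting a string to a character array, converting a JSON string to a DataFrame in PySpark, identifying ArrayType columns in a struct and calling a UDF to convert arrays to strings, and converting a string to a dictionary in Python.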

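Contents:

How to convert a string to an ArrayType of dictionaries (JSON) in PySpark
Java: converting a string to a character array with toCharArray()
PySpark: converting a JSON string to a DataFrame
PySpark: identifying ArrayType columns in a struct and calling a UDF to convert arrays to strings
Python: converting a string to a dictionary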

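How to convert a string to an ArrayType of dictionaries (JSON) in PySpark

I am trying to cast a StringType column to an ArrayType of JSON objects, for a DataFrame generated from a CSV file.

I am using pyspark on Spark 2.

The CSV file I am working with looks like this: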

date,attribute2,count,attribute3
2017-09-03,'attribute1_value1',2,'[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]'
2017-09-04,'attribute1_value2',2,'[{"key":"value","key2":20},{"key":"value","key2":25},{"key":"value","key2":27}]'

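As shown above, the file contains an attribute "attribute3" stored as a literal string, which is technically a list of dictionaries (JSON) with an exact length of 2. (This is the output of a function.)

Excerpt from printSchema():

attribute3: string (nullable = true)

I am trying to cast "attribute3" to ArrayType as follows: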

temp = dataframe.withColumn(
    "attribute3_modified",
    dataframe["attribute3"].cast(ArrayType()))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() takes at least 2 arguments (1 given)

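Indeed, ArrayType expects a data type as an argument. I tried passing "json", but that did not work.

Desired output: in the end, I need attribute3 converted to an ArrayType() or to a plain Python list. (I am trying to avoid using eval.)

How do I convert it to ArrayType so that it is treated as a list of JSON objects?

Am I missing something here?

(The documentation does not address this problem in a straightforward way.)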

Answer 1

Use from_json with a schema that matches the actual data in the attribute3 column to convert the JSON string to an ArrayType.

Original DataFrame:

df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: string (nullable = true)

from pyspark.sql.functions import from_json
from pyspark.sql.types import *

Create the schema:

schema = ArrayType(
    StructType([StructField("key", StringType()),
                StructField("key2", IntegerType())]))

Apply from_json:

df = df.withColumn("attribute3", from_json(df.attribute3, schema))

df.printSchema()
#root
# |-- date: string (nullable = true)
# |-- attribute2: string (nullable = true)
# |-- count: long (nullable = true)
# |-- attribute3: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- key: string (nullable = true)
# |    |    |-- key2: integer (nullable = true)

df.show(1, False)
#+----------+----------+-----+------------------------------------+
#|date      |attribute2|count|attribute3                          |
#+----------+----------+-----+------------------------------------+
#|2017-09-03|attribute1|2    |[[value, 2], [value, 2], [value, 2]]|
#+----------+----------+-----+------------------------------------+
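For the "plain Python list" half of the question, the same string can also be parsed outside Spark without eval, using the standard json module. The following is only a minimal sketch: raw_cell and parse_attribute3 are hypothetical names, and it assumes each cell is wrapped in single quotes exactly as in the sample CSV.

```python
import json

# One attribute3 cell as it appears in the sample CSV, wrapping quotes included.
raw_cell = "'" + '[{"key":"value","key2":2},{"key":"value","key2":2},{"key":"value","key2":2}]' + "'"


def parse_attribute3(cell):
    """Parse a JSON array string into a list of dicts without using eval."""
    # Drop the single quotes that wrap the field in the CSV, then parse as JSON.
    return json.loads(cell.strip("'"))


parsed = parse_attribute3(raw_cell)
print(type(parsed).__name__, len(parsed))
print(parsed[0]["key2"])
```

This only helps once a value has been brought back to the driver as a Python string; inside a DataFrame, from_json with an explicit schema remains the approach given in the answer.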

Java: converting a string to a character array with toCharArray()

Java manual

toCharArray

public char[] toCharArray()

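Converts this string to a new character array.

Returns: a newly allocated character array whose length is the length of this string and whose contents are initialized to contain the character sequence represented by this string.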

public class toCharArray {
    public static void main(String[] args) {
        
        String str = "abcdefgh";
        char[] c = str.toCharArray();
        System.out.println(c);
    }
}

Output:

abcdefgh

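The output looks like a string, but c is a char[] (character array), not a String. System.out.println has an overload for char[] that prints the array's characters.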

PySpark: converting a JSON string to a DataFrame

I have a test2.json file that contains simple JSON:

{  "Name": "something","Url": "https://stackoverflow.com","Author": "jangcy","BlogEntries": 100,"Caller": "jangcy"}

I uploaded the file to Blob storage and created a DataFrame from it:

df = spark.read.json("/example/data/test2.json")

I can then display it without any problems:

df.show()
+------+-----------+------+---------+--------------------+
|Author|BlogEntries|Caller|     Name|                 Url|
+------+-----------+------+---------+--------------------+
|jangcy|        100|jangcy|something|https://stackover...|
+------+-----------+------+---------+--------------------+

Second scenario: I declared the same JSON string in my notebook:

newJson = '{  "Name": "something","Caller": "jangcy"}'

I can print it. But now, if I try to create a DataFrame from it:

df = spark.read.json(newJson)

I get a "Relative path in absolute URI" error:

'java.net.URISyntaxException: Relative path in absolute URI: {  "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'
Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py",line 249,in json
    return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",line 1133,in __call__
    answer,self.gateway_client,self.target_id,self.name)
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py",line 79,in deco
    raise IllegalArgumentException(s.split(': ',1)[1],stackTrace)
pyspark.sql.utils.IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI: {  "Name":%20%22something%22,%20%20%22Caller%22:%20%22jangcy%22%7D'

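Should I apply some additional transformation to the newJson string? If so, what should it be? Forgive me if this is too trivial; I am new to Python and Spark.

I am using a Jupyter notebook with the PySpark3 kernel.

Thanks in advance.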
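The traceback above is consistent with spark.read.json treating a plain str argument as a file path rather than as JSON content. A minimal sketch of a first step, assuming the goal is simply to get the string's content into a structured form: parse it locally with the standard json module to confirm it is valid JSON. The name record is illustrative and not from the original post.

```python
import json

newJson = '{  "Name": "something","Caller": "jangcy"}'

# Parse the string locally; this confirms it is valid JSON and yields a dict.
record = json.loads(newJson)

print(record["Name"])
print(sorted(record))
```

In classic PySpark, spark.read.json also accepts an RDD of JSON strings, so passing spark.sparkContext.parallelize([newJson]) instead of the bare string is a commonly used alternative. That call is stated here from the PySpark API in general and was not run against the poster's environment.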

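PySpark: identifying ArrayType columns in a struct and calling a UDF to convert arrays to strings

I am building an accelerator for migrating data from a source to a target. For example, I pick up data from an API and migrate it to CSV. While converting the data to CSV, I ran into a problem handling array types. I used withColumn and concat_ws for this conversion (that is, df1 = df.withColumn('films', F.concat_ws(':', F.col("films"))), where films is an ArrayType column), and it worked. Now I want this to happen dynamically. In other words, without specifying column names, is there a way to pick out the columns of array type from the struct and then call the UDF?

Thank you for your time!

Solution

You can use df.schema to get the type of each column. Depending on the column's type, you either apply concat_ws or leave the column unchanged: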

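from pyspark.sql import functions as F, types as T

# Assumes an active SparkSession named `spark`.
data = [["test1","test2",[1,2,3],["a","b","c"]]]
schema = ["col1","col2","arr1","arr2"]
df = spark.createDataFrame(data, schema)

# Join every ArrayType column into a ":"-separated string.
array_cols = [F.concat_ws(":", c.name).alias(c.name)
    for c in df.schema if isinstance(c.dataType, T.ArrayType)]
# Pass all other columns through unchanged.
other_cols = [F.col(c.name)
    for c in df.schema if not isinstance(c.dataType, T.ArrayType)]

df = df.select(other_cols + array_cols)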

Result:

+-----+-----+-----+-----+
| col1| col2| arr1| arr2|
+-----+-----+-----+-----+
|test1|test2|1:2:3|a:b:c|
+-----+-----+-----+-----+

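Python: converting a string to a dictionary

I know this looks like a silly question, but here it is anyway.

I am trying to convert the string representation of a dictionary back into a dictionary.

My workflow is as follows: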

d = {1:2}
s = str(d)

When I do:

dict(s)

I get:

ValueError: dictionary update sequence element #0 has length 1; 2 is required

When I do:

json.loads(s)

I get:

ValueError: Expecting property name: line 1 column 1 (char 1)

How do I convert it back to a dictionary?

Update:

I should mention that the actual data looks like this:

{'cell_num': u'','home_num': u'16047207276','registration_country':
u'US','registration_ip': u'71.102.221.29','last_updated':
datetime.datetime(2010,9,27,15,41,59),'address': {'country':
u'US','state': u'CA','zip': u'','city': u'Santa Barbara','street':
u'','confirmed': False,'created': datetime.datetime(2010,6,24,10,
23),'updated': datetime.datetime(2010,23)},
'old_home_num': u'16047207276','old_cell_num': u''}

In this case, neither json.loads nor ast.literal_eval() is a suitable option.
So I went further and tried to deserialize it with pickle from the Python standard library.

import pickle

pickle.loads(data)

But then I get:

KeyError: '{'

Solution

If you want a portable string representation, use s = json.dumps(d), which can then be reloaded with json.loads(s).
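A minimal sketch of that round trip, using the question's own d = {1: 2}. It also shows why json.loads(str(d)) fails (the output of str(d) is not valid JSON) and one caveat of the JSON route: object keys are always strings, so the integer key comes back as "1".

```python
import json

d = {1: 2}

# str(d) produces Python syntax, not JSON, so json.loads rejects it.
try:
    json.loads(str(d))
    str_is_json = True
except ValueError:
    str_is_json = False

s = json.dumps(d)          # JSON text; the integer key is written as a string
restored = json.loads(s)

print(str_is_json)
print(s)
print(restored)
```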

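However, this is limited to JSON-compatible types. If you only need to use the data within Python, the most powerful option is pickle (careful: never unpickle untrusted data!).

To get a string that pickle.loads() can load, you need to create it from the original object with pickle.dumps() (that is, exactly as you would with json, but using pickle).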
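A minimal sketch of the pickle round trip, using a reduced, illustrative version of the posted data (the name record and its two fields are chosen for the example). Unlike JSON, pickle preserves the datetime value. Note that pickle.dumps returns bytes; the text produced by str(d) is not a pickle stream, which is why pickle.loads cannot read it.

```python
import datetime
import pickle

record = {
    "registration_country": "US",
    "last_updated": datetime.datetime(2010, 9, 27, 15, 41, 59),
}

blob = pickle.dumps(record)     # serialized bytes, created from the object itself
restored = pickle.loads(blob)   # only do this with data you trust

print(type(blob).__name__)
print(restored == record)
```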

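However, if all you have is the string as posted, you can evaluate it as a Python expression with eval(s). This is generally a bad idea, and relying on repr only works for objects whose repr is actually valid Python code.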
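For strings that contain only Python literals, ast.literal_eval from the standard library is a safer alternative to eval. The sketch below shows it recovering the simple dictionary from the question, and also why the question's update rules it out for the real data: a datetime.datetime(...) call is not a literal, so it is rejected. The string with_datetime is an illustrative reduction of the posted data.

```python
import ast

s = str({1: 2})
restored = ast.literal_eval(s)   # evaluates literals only, never arbitrary code
print(restored)

with_datetime = "{'last_updated': datetime.datetime(2010, 9, 27, 15, 41, 59)}"
try:
    ast.literal_eval(with_datetime)
    rejected = False
except ValueError:
    rejected = True              # function calls are not literals
print(rejected)
```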
