Parsing A Json Streaming With Pyspark

May 24, 2023

I'm very new to Spark Streaming and I'm trying to read and parse a JSON stream from Kafka using pyspark. Reading the stream works, and I can also pprint() the RDDs. {'Address':

Solution 1:

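Why not just parse the value of each record? Assuming kvs is your Kafka DStream of (key, value) pairs:

import json

# x[1] is the message payload; x[0] is the Kafka key
dstream = kvs.map(lambda x: json.loads(x[1]))

dstream.pprint()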
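One caveat with the one-liner above: a single malformed message makes json.loads raise and fails the batch. A minimal sketch of a more forgiving variant follows, assuming each record is a (key, value) pair as in the snippet above. The helper name parse_record and the sample records are made up for illustration; the commented-out lines show how it would plug into the kvs DStream with flatMap.

```python
import json

def parse_record(record):
    """Parse one (key, value) record; return a list holding 0 or 1 parsed objects."""
    _key, value = record
    try:
        return [json.loads(value)]
    except (TypeError, ValueError):
        # Missing or malformed payload: drop it instead of failing the batch.
        return []

# Hypothetical usage with the kvs DStream from the question:
# parsed = kvs.flatMap(parse_record)
# parsed.pprint()

# Local check with plain Python lists, no Spark needed (made-up sample data).
sample = [(None, '{"id": 1}'), (None, "not json"), (None, None)]
parsed = [obj for rec in sample for obj in parse_record(rec)]
print(parsed)
```

Returning a list and using flatMap instead of map lets bad records vanish from the stream rather than showing up as None values that every later step would have to check for.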

Solution 2:

You will need to invoke one or more of the DStream transformations listed in the Spark Streaming programming guide:

https://spark.apache.org/docs/1.6.0/streaming-programming-guide.html

Transformation              Meaning
map(func)                   Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func)               Similar to map, but each input item can be mapped to 0 or more output items.
filter(func)                Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions)  Changes the level of parallelism in this DStream by creating more or fewer partitions.
count()                     Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func)                Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue()              When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.

... etc.

One or more of these will need to be invoked on your dstream.
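To get a feel for what these transformations compute, here is a rough sketch that applies the same operations to a single micro-batch, using plain Python lists as a stand-in for the batch's RDD. The sample messages and field names (city, n) are made up, and the DStream calls in the comments assume a parsed stream named parsed; this only illustrates the semantics and is not Spark code.

```python
import json
from collections import Counter
from functools import reduce

# One micro-batch of raw message payloads (made-up sample data).
batch = [
    '{"city": "Oslo", "n": 1}',
    '{"city": "Rome", "n": 2}',
    '{"city": "Oslo", "n": 3}',
]

# map(func): one output per input.
# DStream form: parsed = kvs.map(lambda x: json.loads(x[1]))
records = [json.loads(value) for value in batch]

# filter(func): keep only records for which func returns true.
# DStream form: parsed.filter(lambda r: r["n"] >= 2)
big = [r for r in records if r["n"] >= 2]

# countByValue(): frequency of each distinct element in the batch.
# DStream form: parsed.map(lambda r: r["city"]).countByValue()
city_counts = Counter(r["city"] for r in records)

# reduce(func): aggregate the batch with an associative two-argument function.
# DStream form: parsed.map(lambda r: r["n"]).reduce(lambda a, b: a + b)
total = reduce(lambda a, b: a + b, (r["n"] for r in records))

# count(): number of elements in the batch.
# DStream form: parsed.count()
n_records = len(records)

print(big, dict(city_counts), total, n_records)
```

On a real DStream each of these runs once per batch interval, and nothing is computed until an output operation such as pprint() is attached to the result.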
