Showing posts with the label Apache Spark

Add Jar To PySpark When Using Notebook

I'm trying the MongoDB Hadoop integration with Spark but can't figure out how to make the j…

PySpark Merge Multiple Columns Into A JSON Column

I asked this question a while back for Python, but now I need to do the same thing in PySpark. I hav…

How To Merge Multiple Rows Into Single Cell Based On Id And Then Count?

How to merge multiple rows into a single cell based on id using PySpark? I have a DataFrame with ids …

Improve Speed Of Spark App

This is part of my Python Spark code, parts of which run too slowly for my needs. Especially this p…

Apply UDF To Multiple Columns And Use NumPy Operations

I have a DataFrame named result in PySpark and I want to apply a UDF to create a new column as belo…

Error PythonUDFRunner: Python Worker Exited Unexpectedly (crashed)

I am running a PySpark job that calls UDFs. I know UDFs are memory-hungry and slow due to seriali…