Skip to content Skip to sidebar Skip to footer

Can't Instantiate Spark Context In Ipython

I'm trying to set up a stand alone instance of spark locally on a mac and use the Python 3 API. To do this I've done the following, 1. I've downloaded and installed Scala and Spark

Solution 1:

Well, as I have argued elsewhere, setting PYSPARK_DRIVER_PYTHON to jupyter (or ipython) is a really bad and plain wrong practice, which can lead to unforeseen outcomes downstream, such as when you try to use spark-submit with the above settings...

There is one and only one proper way to customize a Jupyter notebook in order to work with other languages (PySpark here), and this is the use of Jupyter kernels.

The first thing to do is run a jupyter kernelspec list command, to get the list of any already available kernels in your machine; here is the result in my case (Ubuntu):

$ jupyter kernelspec list
Available kernels:
  python2       /usr/lib/python2.7/site-packages/ipykernel/resources
  caffe         /usr/local/share/jupyter/kernels/caffe
  ir            /usr/local/share/jupyter/kernels/ir
  pyspark       /usr/local/share/jupyter/kernels/pyspark
  pyspark2      /usr/local/share/jupyter/kernels/pyspark2
  tensorflow    /usr/local/share/jupyter/kernels/tensorflow

The first kernel, python2, is the "default" one coming with IPython (there is a great chance of this being the only one present in your system); as for the rest, I have 2 more Python kernels (caffe & tensorflow), an R one (ir), and two PySpark kernels for use with Spark 1.6 and Spark 2.0 respectively.

The entries of the list above are directories, and each one contains one single file, named kernel.json. Let's see the contents of this file for my pyspark2 kernel:

{
 "display_name": "PySpark (Spark 2.0)",
 "language": "python",
 "argv": [
  "/opt/intel/intelpython27/bin/python2",
  "-m",
  "ipykernel",
  "-f",
  "{connection_file}"
 ],
 "env": {
  "SPARK_HOME": "/home/ctsats/spark-2.0.0-bin-hadoop2.6",
  "PYTHONPATH": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python:/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/lib/py4j-0.10.1-src.zip",
  "PYTHONSTARTUP": "/home/ctsats/spark-2.0.0-bin-hadoop2.6/python/pyspark/shell.py",
  "PYSPARK_PYTHON": "/opt/intel/intelpython27/bin/python2"
 }
}

Now, the easiest way for you would be to manually do the necessary changes (paths only) to my above shown kernel and save it in a new subfolder of the .../jupyter/kernels directory (that way, it should be visible if you run again a jupyter kernelspec list command). And if you think this approach is also a hack, well, I would agree with you, but it is the one recommended in the Jupyter documentation (page 12):

However, there isn’t a great way to modify the kernelspecs. One approach uses jupyter kernelspec list to find the kernel.json file and then modifies it, e.g. kernels/python3/kernel.json, by hand.

If you don't have already a .../jupyter/kernels folder, you can still install a new kernel using jupyter kernelspec install - haven't tried it, but have a look at this SO answer.

If you want to pass command-line arguments to PySpark, you should add the PYSPARK_SUBMIT_ARGS setting under env; for example, here is the last line of my respective kernel file for Spark 1.6.0, where we still had to use the external spark-csv package for reading CSV files:

"PYSPARK_SUBMIT_ARGS": "--master local --packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"

Finally, don't forget to remove all the PySpark/Jupyter-related environment variables from your bash profile (leaving only SPARK_HOME and PYSPARK_PYTHON should be OK).

Another possibility could be to use Apache Toree, but I haven't tried it myself yet.

Solution 2:

Documentation seams to say that environment variables are read from a certain file and not as shell environment variables.

Certain Spark settings can be configured through environment variables, which are read from the conf/spark-env.sh script in the directory where Spark is installed

Post a Comment for "Can't Instantiate Spark Context In Ipython"