Set Up Spark on Windows 10

Installing Spark on Windows 10 can be tricky, but this tutorial will get you up and running in minutes.

 

  1. Install Java
    1. Download the Java 8 JDK from Oracle.
    2. Set the JAVA_HOME environment variable and add the JDK's bin folder to the PATH.
      • JAVA_HOME = C:\Progra~1\Java\jdk1.8.0_201
      • Add to the PATH = C:\Progra~1\Java\jdk1.8.0_201\bin
      • _JAVA_OPTIONS = -Xmx512M -Xms512M

Notice that the paths use the short name Progra~1 instead of Program Files, which avoids the space in the folder name.
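
If you want to double-check the Java variables before moving on, a small Python helper along these lines can do it. This is only a sketch: the function name check_java_home is made up for this post, and the check for spaces is my assumption, based on the Progra~1 note above, that the path should not contain any.

```python
import os
from pathlib import Path


def check_java_home(env):
    """Return a list of problems found with the Java variables in env.

    check_java_home is a hypothetical helper, not part of Java or Spark.
    """
    problems = []
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
        return problems
    # Assumption: a space in the path is a problem, hence the Progra~1 form.
    if " " in java_home:
        problems.append("JAVA_HOME contains a space; use the short Progra~1 form")
    java_bin = Path(java_home) / "bin"
    # PATH entries are separated by os.pathsep (';' on Windows).
    entries = [p.strip().lower().rstrip("\\/")
               for p in env.get("PATH", "").split(os.pathsep)]
    if str(java_bin).lower() not in entries:
        problems.append("PATH does not include %s" % java_bin)
    return problems


if __name__ == "__main__":
    # Open a new command prompt first so the updated variables are visible.
    for line in check_java_home(os.environ) or ["Java variables look fine"]:
        print(line)
```

An empty list means the helper found nothing to complain about; it does not prove that Java itself runs, so running java -version in a new command prompt is still worthwhile.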

  2. Download and Install Spark
    1. Download Spark from the official site.
      Get the latest Spark version that is pre-built for Apache Hadoop 2.7 and later, then click the Download Spark link.
    2. Extract the .tgz file to a folder called C:\Spark. You can use a free program like 7-Zip to extract the .tgz file.
    3. Set the environment variables:
      • SPARK_HOME = C:\Spark\spark-2.4.0-bin-hadoop2.7
      • HADOOP_HOME = C:\Spark\spark-2.4.0-bin-hadoop2.7
      • Add to the PATH = C:\Spark\spark-2.4.0-bin-hadoop2.7\bin
  3. Install Winutils
    1. Download Winutils from the repository below. Pick the version that matches your Hadoop version, in this case 2.7.1.
      https://github.com/steveloughran/winutils
    2. Place the file winutils.exe in the C:\Spark\spark-2.4.0-bin-hadoop2.7\bin folder.
  4. Install Python
    1. Download the latest version of Python (3.7+).
    2. Install Python and check the “Add Python 3.7 to PATH” box.
  5. Final Steps
    1. Add Hive permissions and local tmp permissions
      • Run Command Prompt as administrator
      • winutils.exe chmod -R 777 C:\tmp\hive
      • winutils.exe chmod -R 777 C:\Users\Jim\AppData\Local\Temp
        Note: Replace Jim with your Windows username.
    2. Download the updated worker.py file. Use it to replace the worker.py file inside the zip archive located at C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip
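
Swapping a file inside pyspark.zip by hand is fiddly, so here is a rough Python sketch of that last step. Treat it as an illustration rather than the official way to do it: replace_zip_member is a name I made up, and I am assuming the file sits inside the archive under the entry name pyspark/worker.py, so check the archive listing before relying on that. Python's zipfile module cannot overwrite an entry in place, so the sketch keeps a backup and rewrites the archive.

```python
import shutil
import zipfile
from pathlib import Path


def replace_zip_member(zip_path, member, new_file):
    """Rewrite zip_path so that entry `member` holds the contents of new_file.

    Hypothetical helper. A copy of the original archive is kept next to it
    with a .bak suffix, and the path of that backup is returned.
    """
    zip_path = Path(zip_path)
    new_bytes = Path(new_file).read_bytes()
    backup = zip_path.with_name(zip_path.name + ".bak")
    shutil.copy2(zip_path, backup)

    with zipfile.ZipFile(backup) as src:
        if member not in src.namelist():
            # Nothing has been modified yet, so it is safe to stop here.
            raise KeyError("%s not found in %s" % (member, zip_path))
        # Copy every entry into a fresh archive, swapping in the new file.
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                if info.filename == member:
                    data = new_bytes
                else:
                    data = src.read(info.filename)
                dst.writestr(info, data)
    return backup


# Example call (paths and entry name are assumptions, adjust to your setup):
# replace_zip_member(r"C:\Spark\spark-2.4.0-bin-hadoop2.7\python\lib\pyspark.zip",
#                    "pyspark/worker.py",
#                    r"C:\Users\Jim\Downloads\worker.py")
```

If anything looks wrong afterwards, delete the rewritten archive and rename the .bak copy back to pyspark.zip.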

Finally, let's test a short Python script.

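import random
from pyspark import SparkContext

# Reuse the pyspark shell's SparkContext if there is one, otherwise create it.
sc = SparkContext.getOrCreate()

NUM_SAMPLES = 100

def inside(p):
    # Pick a random point in the unit square and test whether it lies inside the quarter circle.
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, NUM_SAMPLES)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))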
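
In case the formula looks like magic: a random point in the unit square lands inside the quarter circle of radius 1 with probability pi/4, so four times the fraction of hits approximates pi. The plain-Python sketch below does the same calculation without Spark, which makes it easy to see how the estimate tightens as the sample count grows. The name estimate_pi and the seed parameter are my own additions for illustration.

```python
import math
import random


def estimate_pi(num_samples, seed=None):
    """Monte Carlo estimate of pi from num_samples random points (no Spark)."""
    rng = random.Random(seed)  # a seed makes the run repeatable
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:  # the point lies inside the quarter circle
            hits += 1
    return 4.0 * hits / num_samples


if __name__ == "__main__":
    for n in (100, 10000, 100000):
        print("%7d samples: %f (math.pi is %f)" % (n, estimate_pi(n, seed=0), math.pi))
```

With only 100 samples, as in the Spark script, the typical error is around 0.16, so do not be surprised if the result is a fair bit off from 3.14. Raising NUM_SAMPLES gives a closer estimate.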

Copyright Jim's Blog 2019