Useful Spark Commands
You can run Spark jobs from the command line. The examples below cover starting and configuring a session, querying a Hive table, and writing a Spark DataFrame back to a Hive table.
Some useful hints below:
Start a Spark Session
$>spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.3-8
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Load/Run Scala Script in Session
scala> :load test_script.scala
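If you don't already have a script to hand, a minimal test_script.scala might look like this (the contents here are hypothetical; substitute your own logic):

```scala
// test_script.scala -- hypothetical contents for illustration only.
// `spark` is the SparkSession that spark-shell provides automatically.
import org.apache.spark.sql.functions._

val nums = spark.range(1, 11)                      // Dataset of the numbers 1..10
val total = nums.agg(sum("id")).first.getLong(0)   // sum the "id" column
println(s"Sum of 1..10 = $total")
```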
Start a Spark Session – configure UI Port
$>spark-shell --conf "spark.ui.port=1081"
This command starts a Scala session and lets you browse to the Spark UI at http://localhost:1081
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.3-8
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Example 1 : Query and Create a Hive Table
<spark_shell_example.sh>
spark-shell --conf "spark.ui.port=1081" -i spark_example.scala
<spark_example.scala>
import java.io.File
import java.util.Calendar

// Record the start time
val start = Calendar.getInstance().getTime()

// Define a variable that contains the SQL to execute
val sql_text = "select * from rms.class"

// View the output
// sql(sql_text).show()

// Create a DataFrame from the SQL above
val sqlDF_output = sql(sql_text)

// Create a temporary view
sqlDF_output.createOrReplaceTempView("df_output")

// Persist the DataFrame as a Hive table
sqlDF_output.write.format("orc").mode("overwrite").saveAsTable("rms.df_output")

// Record the end time (note: renamed from a second `val now`, which would clash)
val finish = Calendar.getInstance().getTime()
Example 2 : Create a Hive Table from Spark DataFrames
This solution was tested in a Zeppelin notebook on an Oracle Big Compute environment.
Create 2 DataFrames
These two DataFrames are registered against an existing Hive table, which we want to join to itself; that join took a long time using HiveQL alone.
val person_dedupe_temp = spark.sql("select * from mr2_stg.person_dedupe_temp")
person_dedupe_temp.toDF().registerTempTable("person_dedupe_temp")
person_dedupe_temp.toDF().registerTempTable("person_dedupe_temp_a")
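Note that registerTempTable is deprecated from Spark 2.0 onwards; createOrReplaceTempView is the drop-in replacement (and toDF() is unnecessary here, since spark.sql already returns a DataFrame). The same two views could be registered like this:

```scala
// Spark 2.x equivalent of the deprecated registerTempTable calls above
val person_dedupe_temp = spark.sql("select * from mr2_stg.person_dedupe_temp")
person_dedupe_temp.createOrReplaceTempView("person_dedupe_temp")
person_dedupe_temp.createOrReplaceTempView("person_dedupe_temp_a")
```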
Create Hive Table
Now create a Hive table using the Spark SQL interpreter, built from the two DataFrames. This generates the table within a minute or so.
%sql
create table mr2_stg.person_dedupe_temp2_np as
select
case when
case when pdt.account_create_date = 'Account' then pdt.created_date
when pdt.account_create_date = 'Order' then pdt.order_date
else pdt.time_now end
>
case when pdt2.account_create_date = 'Account' then pdt2.created_date
when pdt2.account_create_date = 'Order' then pdt2.order_date
else pdt2.time_now end
then pdt2.urn
else pdt.urn
end as Person_id,
pdt.urn,
pdt2.urn as urn2
from
person_dedupe_temp pdt
left join person_dedupe_temp_a pdt2 on pdt.email = pdt2.email
where pdt.urn <> pdt2.urn
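For reference, the same self-join can also be expressed with the DataFrame API instead of the %sql interpreter. This is only a sketch, assuming the views and columns registered above; it is not a tested alternative:

```scala
import org.apache.spark.sql.functions._

// Helper mirroring the nested CASE expressions above: pick the effective
// timestamp for a row based on account_create_date
def createTs(prefix: String) =
  when(col(s"$prefix.account_create_date") === "Account", col(s"$prefix.created_date"))
    .when(col(s"$prefix.account_create_date") === "Order", col(s"$prefix.order_date"))
    .otherwise(col(s"$prefix.time_now"))

val pdt  = spark.table("person_dedupe_temp").as("pdt")
val pdt2 = spark.table("person_dedupe_temp_a").as("pdt2")

val result = pdt
  .join(pdt2, col("pdt.email") === col("pdt2.email"), "left")
  .where(col("pdt.urn") =!= col("pdt2.urn"))
  .select(
    // Keep the urn of whichever side has the earlier effective timestamp
    when(createTs("pdt") > createTs("pdt2"), col("pdt2.urn"))
      .otherwise(col("pdt.urn")).as("Person_id"),
    col("pdt.urn"),
    col("pdt2.urn").as("urn2"))

result.write.mode("overwrite").saveAsTable("mr2_stg.person_dedupe_temp2_np")
```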