Useful Spark Commands
You can run Spark jobs from the command line. The examples below cover starting and configuring a session, querying a Hive table, and writing a Spark DataFrame back to a Hive table.
Some useful hints below:
Start a Spark Session
$>spark-shell
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.3-8
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Load/Run Scala Script in Session
scala> :load test_script.scala
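If you don't already have a script to hand, a minimal test_script.scala might look like this (the contents here are hypothetical; substitute your own logic):

```scala
// test_script.scala -- hypothetical contents for illustration only.
// `spark` is the SparkSession that spark-shell provides automatically.
import org.apache.spark.sql.functions._

val nums = spark.range(1, 11)                      // Dataset of the numbers 1..10
val total = nums.agg(sum("id")).first.getLong(0)   // sum the "id" column
println(s"Sum of 1..10 = $total")
```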
Start a Spark Session – configure UI Port
$>spark-shell --conf "spark.ui.port=1081"
This command starts a Scala session and lets you browse to the Spark UI at http://localhost:1081
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0.2.6.0.3-8
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_141)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Example 1 : Query and Create a Hive Table
<spark_shell_example.sh>
spark-shell --conf "spark.ui.port=1081" -i spark_example.scala
<spark_example.scala>
import java.io.File
import java.util.Calendar

// Record the start time
val start = Calendar.getInstance().getTime()

// Define a variable that contains the SQL to execute
val sql_text = "select * from rms.class"

// View the output
// sql(sql_text).show()

// Create a DataFrame from the SQL above
val sqlDF_output = sql(sql_text)

// Create a temporary view
sqlDF_output.createOrReplaceTempView("df_output")

// Persist the DataFrame as a Hive table
sqlDF_output.write.format("orc").mode("overwrite").saveAsTable("rms.df_output")

// Record the end time (note: renamed from a second `val now`, which would clash)
val finish = Calendar.getInstance().getTime()
Example 2 : Create a Hive Table from Spark DataFrames
This solution was tested in a Zeppelin notebook on an Oracle Big Compute environment.
Create 2 DataFrames
These two DataFrames are registered against an existing Hive table, which we want to join to itself; that join took a long time using HiveQL alone.
val person_dedupe_temp = spark.sql("select * from mr2_stg.person_dedupe_temp")
person_dedupe_temp.toDF().registerTempTable("person_dedupe_temp")
person_dedupe_temp.toDF().registerTempTable("person_dedupe_temp_a")
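Note that registerTempTable is deprecated from Spark 2.0 onwards; createOrReplaceTempView is the drop-in replacement (and toDF() is unnecessary here, since spark.sql already returns a DataFrame). The same two views could be registered like this:

```scala
// Spark 2.x equivalent of the deprecated registerTempTable calls above
val person_dedupe_temp = spark.sql("select * from mr2_stg.person_dedupe_temp")
person_dedupe_temp.createOrReplaceTempView("person_dedupe_temp")
person_dedupe_temp.createOrReplaceTempView("person_dedupe_temp_a")
```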
Create Hive Table
Now create a Hive table using the Spark SQL interpreter, built from the two DataFrames. This generates the table within a minute or so.
%sql
create table mr2_stg.person_dedupe_temp2_np as
select
case when
case when pdt.account_create_date = 'Account' then pdt.created_date
when pdt.account_create_date = 'Order' then pdt.order_date
else pdt.time_now end
>
case when pdt2.account_create_date = 'Account' then pdt2.created_date
when pdt2.account_create_date = 'Order' then pdt2.order_date
else pdt2.time_now end
then pdt2.urn
else pdt.urn
end as Person_id,
pdt.urn,
pdt2.urn as urn2
from
person_dedupe_temp pdt
left join person_dedupe_temp_a pdt2 on pdt.email = pdt2.email
where pdt.urn <> pdt2.urn
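For reference, the same self-join can also be expressed with the DataFrame API instead of the %sql interpreter. This is only a sketch, assuming the views and columns registered above; it is not a tested alternative:

```scala
import org.apache.spark.sql.functions._

// Helper mirroring the nested CASE expressions above: pick the effective
// timestamp for a row based on account_create_date
def createTs(prefix: String) =
  when(col(s"$prefix.account_create_date") === "Account", col(s"$prefix.created_date"))
    .when(col(s"$prefix.account_create_date") === "Order", col(s"$prefix.order_date"))
    .otherwise(col(s"$prefix.time_now"))

val pdt  = spark.table("person_dedupe_temp").as("pdt")
val pdt2 = spark.table("person_dedupe_temp_a").as("pdt2")

val result = pdt
  .join(pdt2, col("pdt.email") === col("pdt2.email"), "left")
  .where(col("pdt.urn") =!= col("pdt2.urn"))
  .select(
    // Keep the urn of whichever side has the earlier effective timestamp
    when(createTs("pdt") > createTs("pdt2"), col("pdt2.urn"))
      .otherwise(col("pdt.urn")).as("Person_id"),
    col("pdt.urn"),
    col("pdt2.urn").as("urn2"))

result.write.mode("overwrite").saveAsTable("mr2_stg.person_dedupe_temp2_np")
```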