MDFT Academy Quizzes

Analyze Spark Code for Data Processing

You work for MDFT Academy, a well-known training agency. You have a Fabric workspace that uses the default Spark starter pool and runtime version 1.2.

You plan to read a CSV file named students_raw.csv in a lakehouse, select columns, and save the data as a Delta table to the managed area of the lakehouse. students_raw.csv contains 12 columns.

You have the following code:

from pyspark.sql.functions import year
(spark
.read
.format("csv")
.option("header", true)
.load("Files/students_raw.csv")
.select("SalesOrderNumber", "OrderDate", "StudentName", "UnitPrice")
.withColumn("Year", year("OrderDate"))
.write
.partitionBy("Year")
.saveAsTable("students")
)

Which statements are true?

Choose all correct answers from the options below.

Explanations for each answer:

The Spark engine will read only the SalesOrderNumber, OrderDate, StudentName and UnitPrice columns from students_raw.csv is correct.
Removing the partition will reduce the execution time of the query is incorrect. Partitioning data by year can actually improve query performance for time-based queries, as it allows Spark to skip irrelevant partitions during data processing
Adding inferSchema="true" to the options will increase the execution time of the query is correct. When inferSchema is set to true, Spark needs to perform an additional pass over the data to determine the data types of each column, which increases the overall execution time
The UnitPrice column will be saved to the students table in numeric format is incorrect. Without specifying a schema or setting inferSchema=true, Spark reads all CSV columns as strings by default. The UnitPrice column will be saved as a string in the students table

Exam Preparation Quiz

Analyze Spark Code for Data Processing