Practice Exam

Question 45 of 75

Analyze PySpark DataFrame Transformation Code

You work for MDFT Pro, a well-known training agency that processes student enrollment and purchase data. Mark, a Data Engineer at MDFT Pro, is developing a notebook to clean and transform order data from the student management system. The source data contains sales orders with customer names, email addresses, and order dates. Mark has written PySpark code to handle data quality issues including null or empty customer names, extract domain names from email addresses for marketing segmentation, remove duplicate orders on the same date, and extract year information for annual reporting. The table below shows sample data from the source system:

SalesOrderNumber  OrderDate   CustomerName    Email
SO49172           2021-01-01  Brian Howard    brian@adventureworks.com
SO49173           2021-01-01  Linda Alvarez   linda@adventureworks.com
SO49174           2021-01-01  Gina Hernandez  gina@adventureworks.com
SO49178           2021-01-01  Beth Ruiz       beth@adventureworks.com
SO49179           2021-01-01  Evan Ward       evan@adventureworks.com

Mark has written the following code in his notebook:

01: df = df.withColumn("CustomerName", when((col("CustomerName").isNull() | (col("CustomerName")=="")),lit("Unknown")).otherwise(col("CustomerName")))
02: df = df.withColumn("Username", split(col("Email"), "@").getItem(1))
03: df = df.dropDuplicates(["OrderDate"]).select(col("OrderDate"), year("OrderDate").alias("year"), col("CustomerName"), col("Username"))
04: display(df.head(10))

Which of the following statements is true about this code?

Choose the correct answer from the options below.

Explanations for each answer:

Learn more about the PySpark Column getItem method:
PySpark Column getItem
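
To make the behaviour of getItem and dropDuplicates concrete, here is a minimal plain-Python sketch of what lines 02 and 03 of Mark's code do to the sample rows. It is an analogy only, not PySpark: the helper names item_after_split and drop_duplicates_by are hypothetical, and the sketch keeps the first row it sees for each date, whereas Spark does not promise which duplicate row survives.

```python
# Plain-Python analogue of the Spark expressions (illustrative only, not PySpark).

def item_after_split(email, index):
    """Mimic split(col("Email"), "@").getItem(index) for a single value."""
    return email.split("@")[index]


def drop_duplicates_by(rows, key):
    """Keep one row per distinct key value, similar to dropDuplicates([key]).

    Assumption: this sketch keeps the first row seen for each key.
    Spark makes no such ordering promise.
    """
    kept = {}
    for row in rows:
        kept.setdefault(row[key], row)
    return list(kept.values())


# Sample rows taken from the table in the question.
orders = [
    {"SalesOrderNumber": "SO49172", "OrderDate": "2021-01-01", "Email": "brian@adventureworks.com"},
    {"SalesOrderNumber": "SO49173", "OrderDate": "2021-01-01", "Email": "linda@adventureworks.com"},
    {"SalesOrderNumber": "SO49174", "OrderDate": "2021-01-01", "Email": "gina@adventureworks.com"},
    {"SalesOrderNumber": "SO49178", "OrderDate": "2021-01-01", "Email": "beth@adventureworks.com"},
    {"SalesOrderNumber": "SO49179", "OrderDate": "2021-01-01", "Email": "evan@adventureworks.com"},
]

local_part = item_after_split(orders[0]["Email"], 0)   # text before the "@"
domain_part = item_after_split(orders[0]["Email"], 1)  # text after the "@"
deduplicated = drop_duplicates_by(orders, "OrderDate")

print(local_part, domain_part, len(deduplicated))
# prints: brian adventureworks.com 1
```

Index 1 selects the part after the "@", so the column that line 02 names "Username" would hold the domain. Because all five sample rows share the same OrderDate, deduplicating on that column alone leaves a single row.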