BenchMark Quizzes by MDFT Pro

Analyze PySpark DataFrame Transformation Code

You work for MDFT Pro, a well-known training agency that processes student enrollment and purchase data. Mark, a Data Engineer at MDFT Pro, is developing a notebook to clean and transform order data from the student management system. The source data contains sales orders with customer names, email addresses, and order dates. Mark has written PySpark code to handle data quality issues including null or empty customer names, extract domain names from email addresses for marketing segmentation, remove duplicate orders on the same date, and extract year information for annual reporting. The table below shows sample data from the source system:

SalesOrderNumber	OrderDate	CustomerName	Email
SO49172	2021-01-01	Brian Howard	brian@adventureworks.com
SO49173	2021-01-01	Linda Alvarez	linda@adventureworks.com
SO49174	2021-01-01	Gina Hernandez	gina@adventureworks.com
SO49178	2021-01-01	Beth Ruiz	beth@adventureworks.com
SO49179	2021-01-01	Evan Ward	evan@adventureworks.com

Mark has written the following code in his notebook (it assumes col, lit, split, when, and year have been imported from pyspark.sql.functions):

01: df = df.withColumn("CustomerName", when((col("CustomerName").isNull() | (col("CustomerName")=="")),lit("Unknown")).otherwise(col("CustomerName")))
02: df = df.withColumn("Username", split(col("Email"), "@").getItem(1))
03: df = df.dropDuplicates(["OrderDate"]).select(col("OrderDate"), year("OrderDate").alias("year"), col("CustomerName"), col("Username"))
04: display(df.head(10))

Which of the following statements is true about this code?

Choose the correct answer from the options below.

Explanations for each answer:

"Line 01 will replace all the null and empty values in the CustomerName column with the Unknown value" is correct. Line 01 uses the when() function with a condition that is true if CustomerName is null or an empty string (col("CustomerName").isNull() | (col("CustomerName")=="")). Where the condition is true, the value becomes lit("Unknown"); the otherwise() clause keeps the original CustomerName for every other row. Both the null and the empty-string cases are therefore replaced with "Unknown" while all other values are preserved. Note that a name consisting only of whitespace does not match either check and is left as-is.
"Line 02 will extract the value before the @ character and generate a new column named Username" is incorrect. Line 02 uses split(col("Email"), "@").getItem(1), which splits the email at the @ character and retrieves the item at index 1 (the second element). For "brian@adventureworks.com", this returns "adventureworks.com" (the domain after @), not "brian" (the username before @). The new column is named Username but actually holds the domain; to extract the username, use getItem(0) instead of getItem(1).
"Line 03 will extract the year value from the OrderDate column and keep only the first occurrence for each year" is incorrect. Line 03 does extract the year with year("OrderDate").alias("year"), but dropDuplicates(["OrderDate"]) removes duplicates based on OrderDate, not on year. It keeps one row for each distinct OrderDate value, and Spark does not guarantee which of the duplicates that is. Orders placed on different dates in the same year each keep a row, so a single year can still appear many times. In the sample data all five orders share the date 2021-01-01, so only one of them would remain.
"Line 04 will display all rows in the DataFrame" is incorrect. Line 04 uses df.head(10), which returns only the first 10 rows (as a list of Row objects), not the whole DataFrame. The display() function then shows at most those 10 rows. To pass the whole DataFrame to display(), call display(df) without the head() method.
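
The reasoning in these explanations can be checked without a Spark session. The sketch below is a plain-Python analogue, not PySpark: clean_name, email_part, and one_per_date are hypothetical helpers that mimic lines 01 to 03, and the last two rows are made-up test data added to exercise the null and empty-name cases and a second order date. It assumes a literal "@" separator and keeps the first row per date, whereas Spark does not promise which duplicate survives.

```python
from datetime import date

# First three rows come from the sample table; the last two are hypothetical
# and exist only to cover null/empty names and a second order date.
rows = [
    ("SO49172", "2021-01-01", "Brian Howard", "brian@adventureworks.com"),
    ("SO49173", "2021-01-01", "Linda Alvarez", "linda@adventureworks.com"),
    ("SO49174", "2021-01-01", "Gina Hernandez", "gina@adventureworks.com"),
    ("SO90001", "2021-03-15", None, "anon@example.com"),   # hypothetical
    ("SO90002", "2021-03-15", "", "blank@example.com"),    # hypothetical
]


def clean_name(name):
    # Mimics line 01: null or empty becomes "Unknown", anything else is kept.
    return "Unknown" if name is None or name == "" else name


def email_part(email, index):
    # Mimics split(col("Email"), "@").getItem(index) for a literal "@".
    return email.split("@")[index]


def one_per_date(records):
    # Mimics dropDuplicates(["OrderDate"]): one row per distinct date.
    # This sketch keeps the first row seen; Spark makes no such promise.
    kept = {}
    for record in records:
        kept.setdefault(record[1], record)
    return list(kept.values())


names = [clean_name(r[2]) for r in rows]
username = email_part(rows[0][3], 0)   # part before the @
domain = email_part(rows[0][3], 1)     # part after the @ (what line 02 stores)
deduped = one_per_date(rows)
years = [date.fromisoformat(r[1]).year for r in deduped]

print(names)
print(username, domain)
print(len(deduped), years)
```

The two hypothetical orders from 2021-03-15 collapse into one row, yet 2021 still appears twice in the result, which is why the statement about line 03 is incorrect.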

Learn more about the PySpark Column getItem method:

PySpark Column getItem
