You work for MDFT Pro, a well-known training agency that processes student enrollment and purchase data. Mark, a Data Engineer at MDFT Pro, is developing a notebook to clean and transform order data from the student management system. The source data contains sales orders with customer names, email addresses, and order dates. Mark has written PySpark code to handle data quality issues including null or empty customer names, extract domain names from email addresses for marketing segmentation, remove duplicate orders on the same date, and extract year information for annual reporting. The table below shows sample data from the source system:
| SalesOrderNumber | OrderDate | CustomerName | Email |
|---|---|---|---|
| SO49172 | 2021-01-01 | Brian Howard | brian@adventureworks.com |
| SO49173 | 2021-01-01 | Linda Alvarez | linda@adventureworks.com |
| SO49174 | 2021-01-01 | Gina Hernandez | gina@adventureworks.com |
| SO49178 | 2021-01-01 | Beth Ruiz | beth@adventureworks.com |
| SO49179 | 2021-01-01 | Evan Ward | evan@adventureworks.com |
Mark has written the following code in his notebook:
```
01: df = df.withColumn("CustomerName", when((col("CustomerName").isNull() | (col("CustomerName")=="")),lit("Unknown")).otherwise(col("CustomerName")))
02: df = df.withColumn("Username", split(col("Email"), "@").getItem(1))
03: df = df.dropDuplicates(["OrderDate"]).select(col("OrderDate"), year("OrderDate").alias("year"), col("CustomerName"), col("Username"))
04: display(df.head(10))
```
Which of the following statements is true about this code?
Choose the correct answer from the options below.
Explanations for each answer:
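To reason about the statements, it can help to trace what each transformation does to the sample rows. The following is a minimal plain-Python sketch, not Mark's notebook code: it imitates the semantics of the `when`/`otherwise`, `split(...).getItem(1)`, `dropDuplicates(["OrderDate"])`, and `year(...)` steps without a Spark session. The helper names (`clean_name`, `item_after_at`, `drop_duplicates_by`, `transform`) are hypothetical, and keeping the first row per date is an assumption made for the illustration, since Spark does not promise which duplicate survives.

```python
# Plain-Python stand-in for the sample table (no Spark session required).
rows = [
    {"SalesOrderNumber": "SO49172", "OrderDate": "2021-01-01",
     "CustomerName": "Brian Howard", "Email": "brian@adventureworks.com"},
    {"SalesOrderNumber": "SO49173", "OrderDate": "2021-01-01",
     "CustomerName": "Linda Alvarez", "Email": "linda@adventureworks.com"},
    {"SalesOrderNumber": "SO49174", "OrderDate": "2021-01-01",
     "CustomerName": "Gina Hernandez", "Email": "gina@adventureworks.com"},
    {"SalesOrderNumber": "SO49178", "OrderDate": "2021-01-01",
     "CustomerName": "Beth Ruiz", "Email": "beth@adventureworks.com"},
    {"SalesOrderNumber": "SO49179", "OrderDate": "2021-01-01",
     "CustomerName": "Evan Ward", "Email": "evan@adventureworks.com"},
]


def clean_name(name):
    """Imitates the when/otherwise step: null or empty becomes "Unknown"."""
    return "Unknown" if name is None or name == "" else name


def item_after_at(email):
    """Imitates split(col("Email"), "@").getItem(1): index 1 is the part AFTER "@"."""
    return email.split("@")[1]


def drop_duplicates_by(records, key):
    """Imitates dropDuplicates([key]): one row survives per distinct key value.

    Assumption for this sketch: the first row seen is kept.
    """
    seen = set()
    kept = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            kept.append(record)
    return kept


def transform(records):
    """Applies the three steps in the same order as the notebook listing."""
    staged = [
        {**r,
         "CustomerName": clean_name(r["CustomerName"]),
         "Username": item_after_at(r["Email"])}
        for r in records
    ]
    deduped = drop_duplicates_by(staged, "OrderDate")
    return [
        {"OrderDate": r["OrderDate"],
         "year": int(r["OrderDate"][:4]),  # stand-in for year("OrderDate")
         "CustomerName": r["CustomerName"],
         "Username": r["Username"]}
        for r in deduped
    ]


result = transform(rows)
print(len(result))            # → 1
print(result[0]["Username"])  # → adventureworks.com
```

Under these assumptions, the column named `Username` ends up holding the email domain, and deduplicating on `OrderDate` alone collapses all five sample orders into a single row because they share one date. Separately, PySpark's `DataFrame.head(n)` returns a list of `Row` objects rather than a DataFrame, which is worth keeping in mind when evaluating the final `display` call.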