# Quiz 2

**Repository Path**: arsalabangash/quiz2 (created 2025-04-27, last updated 2025-06-22)

**Workflow & Collaboration:**

* This assignment is to be completed in groups.
* Fork the assignment repository.
* All code contributions must be submitted via Pull Requests (PRs) within your group's forked repository.
* Divide the work based on group size:
  * If your group has 3 members: each member must take responsibility for one script: `clean2025.py`, `clean2024.py`, or `clean2020.py`. Ensure you have the corresponding input data for each year.
  * If your group has 2 members: one member should implement the cleaning logic in `clean2020.py`, and the other member should implement the logic in `clean2024.py`.

---

**Objective:** Clean and process the provided school data CSV files (`baoshan-schools-2025.csv`, `baoshan-schools-2024.csv` and `baoshan-schools-2020.csv`) independently. The goal is to create one separate, clean dataset per input file, focused on school population metrics for public schools.

**Input:**

* `baoshan-schools-2025.csv`
* `baoshan-schools-2024.csv`
* `baoshan-schools-2020.csv`

**Output:**

* `clean-baoshan-schools-2025.csv`
* `clean-baoshan-schools-2024.csv`
* `clean-baoshan-schools-2020.csv`

**Requirements (apply these steps to *each* input file separately):**

1. **Load Data:** Load the data from the input CSV file. Address potential structural issues such as repeated headers or excess blank rows.
2. **Standardize Headers:** Rename columns using the following English snake_case names. Focus on the columns needed for the final output:
   * `序号` -> `serial_no`
   * `学校名称` / `学校` -> `school_name`
   * `学校性质` -> `school_type` (needed for filtering)
   * `教职工人数` -> `total_staff_count`
   * `专职教师人数` -> `full_time_teacher_count`
   * `师生比` -> `student_faculty_ratio_raw` (temporary name for the raw ratio column before processing)
3. **Filter Data:** Keep only data for public schools (rows where `school_type` is '公办').
4. **Process Ratio:**
   * Clean the raw student-faculty ratio column (`student_faculty_ratio_raw`). Retain only rows with the standard colon format (e.g., `1:XX`).
   * Extract the numeric ratio value (the 'XX' part). Handle potential errors during extraction.
5. **Calculate Population Metrics:**
   * Calculate `estimated_student_count` using the extracted numeric ratio value and the `total_staff_count`.
   * Calculate the number of students per full-time teacher using the `estimated_student_count` and the `full_time_teacher_count`. Handle potential division by zero errors.
6. **Ensure Numeric Data:** Convert `total_staff_count`, `full_time_teacher_count`, and the calculated `estimated_student_count` and students-per-full-time-teacher columns to a numeric data type.
7. **Handle Missing Data:** After processing, remove any rows that contain missing values in any of the final required columns.
8. **Select & Order Final Columns:** Create the final dataset containing only the columns listed below, in the specified order. Discard all other intermediate or unneeded columns (such as `school_type` and `student_faculty_ratio_raw`).
9. **Save Output:** Save the cleaned data to the corresponding output CSV file name.

**Final Required Columns & Order (for every output file):**

1. `serial_no`
2. `school_name`
3. `total_staff_count`
4. `full_time_teacher_count`
5. `estimated_student_count`
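
The ratio and population steps (requirements 4 and 5) can be sketched in plain Python as follows. This is only an illustrative outline, not the expected solution: the helper names `parse_ratio` and `population_metrics` are hypothetical, the pattern assumes the 'XX' part may be an integer or a decimal, and the formula `total_staff_count * ratio` is one plausible reading of "using the extracted numeric ratio value and the `total_staff_count`".

```python
import re

# Matches the standard "1:XX" form only.
# Assumption: XX may be an integer or a decimal number.
RATIO_PATTERN = re.compile(r"^\s*1\s*:\s*(\d+(?:\.\d+)?)\s*$")


def parse_ratio(raw):
    """Return the numeric 'XX' part of a '1:XX' ratio, or None if malformed."""
    if not isinstance(raw, str):
        return None  # missing or non-text cell
    match = RATIO_PATTERN.match(raw)
    if match is None:
        return None  # not in the standard colon format
    return float(match.group(1))


def population_metrics(total_staff_count, full_time_teacher_count, ratio):
    """Return (estimated_student_count, students_per_full_time_teacher).

    Assumption: estimated students = total staff * ratio value.
    The second value is None when there are no full-time teachers,
    which avoids a division by zero.
    """
    estimated_student_count = total_staff_count * ratio
    if full_time_teacher_count == 0:
        return estimated_student_count, None
    per_teacher = estimated_student_count / full_time_teacher_count
    return estimated_student_count, per_teacher


# Hypothetical values, not taken from the real CSV files.
ratio = parse_ratio("1:12")
if ratio is not None:
    print(population_metrics(100, 80, ratio))
```

Returning `None` for malformed ratios and for a zero teacher count lets the later "Handle Missing Data" step drop those rows in one place, rather than raising an exception in the middle of processing.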