# Quiz 2

**Repository Path**: arsalabangash/quiz2

## Basic Information

- **Project Name**: Quiz 2
- **Description**: No description available
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 59
- **Created**: 2025-04-27
- **Last Updated**: 2025-06-22

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# Quiz 2

**Workflow & Collaboration:**

*   This assignment is to be completed in groups.

*   Fork the assignment repository for your group.

*   All code contributions must be submitted via Pull Requests (PRs) within your group's forked repository.

*   Divide the work based on group size:

    If your group has 3 members: Each member must take responsibility for one of the scripts `clean2025.py`, `clean2024.py`, and `clean2020.py`. Ensure you have the corresponding input data for each year.

    If your group has 2 members: One member should implement the cleaning logic in `clean2020.py`, and the other member should implement the logic in `clean2024.py`.

---

**Objective:** Clean and process the provided school data CSV files (`baoshan-schools-2025.csv`, `baoshan-schools-2024.csv`, and `baoshan-schools-2020.csv`) independently. The goal is to create one separate, clean dataset per input file, focused on school population metrics for public schools.

**Input:**
*   `baoshan-schools-2025.csv`
*   `baoshan-schools-2024.csv`
*   `baoshan-schools-2020.csv`

**Output:**
*   `clean-baoshan-schools-2025.csv`
*   `clean-baoshan-schools-2024.csv`
*   `clean-baoshan-schools-2020.csv`

**Requirements (apply these steps to *each* input file separately):**

1.  **Load Data:** Load the data from the input CSV file. Address potential structural issues such as repeated headers or excess blank rows.

2.  **Standardize Headers:** Rename columns using the following English snake\_case names. Focus on the columns needed for the final output:

    *   `序号` -> `serial_no`
    *   `学校名称` / `学校` -> `school_name`
    *   `学校性质` -> `school_type` (Needed for filtering)
    *   `教职工人数` -> `total_staff_count`
    *   `专职教师人数` -> `full_time_teacher_count`
    *   `师生比` -> `student_faculty_ratio_raw` (Temporary name for the raw ratio column before processing)

3.  **Filter Data:** Keep only rows for public schools (where `school_type` is '公办').

4.  **Process Ratio:**
    *   Clean the raw student-faculty ratio column (`student_faculty_ratio_raw`). Retain only rows with the standard colon format (e.g., `1:XX`).

    *   Extract the numeric ratio value (the 'XX' part). Handle potential errors during extraction.

5.  **Calculate Population Metrics:**
    *   Calculate `estimated_student_count` using the extracted numeric ratio value and the `total_staff_count`.

    *   Calculate the number of students per full-time teacher using the `estimated_student_count` and the `full_time_teacher_count`. Handle potential division-by-zero errors.

6.  **Ensure Numeric Data:** Convert `total_staff_count`, `full_time_teacher_count`, and the calculated `estimated_student_count` and students-per-full-time-teacher columns to a numeric data type.

7.  **Handle Missing Data:** After processing, remove any rows that contain missing values in any of the final required columns.

8.  **Select & Order Final Columns:** Create the final dataset containing only the columns listed below, in the specified order. Discard all other intermediate or unneeded columns (such as `school_type` and `student_faculty_ratio_raw`).

9.  **Save Output:** Save the cleaned data to the corresponding output CSV file.
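
Steps 1 to 4 could be sketched roughly as follows, using only the Python standard library. This is a minimal illustration, not a reference solution. The helper names (`load_rows`, `rename_columns`, `parse_ratio`, `load_public_schools`), the `ratio_value` key, and the sample rows are all hypothetical. It also assumes that the first non-blank row is the header and that the "standard colon format" means exactly `1:XX` with a plain number for `XX`.

```python
import csv
import io
import re

# Column mapping taken from step 2 of the requirements.
HEADER_MAP = {
    "序号": "serial_no",
    "学校名称": "school_name",
    "学校": "school_name",
    "学校性质": "school_type",
    "教职工人数": "total_staff_count",
    "专职教师人数": "full_time_teacher_count",
    "师生比": "student_faculty_ratio_raw",
}

# Assumption: only "1:XX" with a half-width colon counts as standard.
RATIO_PATTERN = re.compile(r"^1:(\d+(?:\.\d+)?)$")


def load_rows(handle):
    """Read CSV rows, skipping blank rows and repeated header rows."""
    header = None
    rows = []
    for raw in csv.reader(handle):
        cells = [cell.strip() for cell in raw]
        if not any(cells):
            continue  # blank row, or a row of empty cells
        if header is None:
            header = cells  # assumption: first non-blank row is the header
            continue
        if cells == header:
            continue  # header repeated further down the file
        rows.append(dict(zip(header, cells)))
    return rows


def rename_columns(row):
    """Rename known Chinese headers to snake_case; keep other keys as-is."""
    return {HEADER_MAP.get(key, key): value for key, value in row.items()}


def parse_ratio(text):
    """Return XX from '1:XX' as a float, or None if the format differs."""
    match = RATIO_PATTERN.match(text.strip())
    return float(match.group(1)) if match else None


def load_public_schools(handle):
    """Steps 1-4: load, rename, keep public schools with a valid ratio."""
    result = []
    for row in map(rename_columns, load_rows(handle)):
        if row.get("school_type") != "公办":
            continue
        ratio = parse_ratio(row.get("student_faculty_ratio_raw") or "")
        if ratio is None:
            continue
        row["ratio_value"] = ratio  # hypothetical intermediate key
        result.append(row)
    return result


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "序号,学校名称,学校性质,教职工人数,专职教师人数,师生比\n"
    ",,,,,\n"
    "1,School A,公办,100,80,1:12\n"
    "序号,学校名称,学校性质,教职工人数,专职教师人数,师生比\n"
    "2,School B,民办,50,40,1:10\n"
    "3,School C,公办,30,20,约12\n"
)
public_rows = load_public_schools(sample)
print(public_rows)
```

With a real input file, the `io.StringIO` object would be replaced by something like `open("baoshan-schools-2024.csv", newline="", encoding="utf-8-sig")`; the encoding is a guess and may need adjusting to match the actual files.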

**Final Required Columns & Order (for every output file):**

1.  `serial_no`
2.  `school_name`
3.  `total_staff_count`
4.  `full_time_teacher_count`
5.  `estimated_student_count`
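
One way steps 5 to 9 might look in code is sketched below, again with the standard library only. Several things here are assumptions rather than part of the assignment text: the formula `estimated_student_count = total_staff_count * XX`, the input key `ratio_value` (a hypothetical name for the number extracted from `1:XX`), and the helper names `to_number`, `safe_divide`, `finalize`, and `write_csv`. The students-per-full-time-teacher value is shown only as a helper, because the column list above does not name a column for it.

```python
import csv
import io
import math

FINAL_COLUMNS = [
    "serial_no",
    "school_name",
    "total_staff_count",
    "full_time_teacher_count",
    "estimated_student_count",
]


def to_number(value):
    """Return value as a finite float, or None if it cannot be converted."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def safe_divide(numerator, denominator):
    """Divide; return None if a value is missing or the divisor is zero."""
    if numerator is None or not denominator:
        return None
    return numerator / denominator


def finalize(rows):
    """Steps 5-8 for rows that already use the snake_case column names."""
    cleaned = []
    for row in rows:
        staff = to_number(row.get("total_staff_count"))
        ratio = to_number(row.get("ratio_value"))
        record = {
            "serial_no": row.get("serial_no") or None,
            "school_name": row.get("school_name") or None,
            "total_staff_count": staff,
            "full_time_teacher_count": to_number(
                row.get("full_time_teacher_count")
            ),
            # Assumption: students = staff * XX; the source gives no formula.
            "estimated_student_count": (
                staff * ratio
                if staff is not None and ratio is not None
                else None
            ),
        }
        # Step 7: skip the row if any final column is missing.
        if any(value is None for value in record.values()):
            continue
        cleaned.append(record)
    return cleaned


def write_csv(rows, handle):
    """Step 9: write rows with the final columns in the required order."""
    writer = csv.DictWriter(handle, fieldnames=FINAL_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample rows, for illustration only.
sample_rows = [
    {"serial_no": "1", "school_name": "School A", "total_staff_count": "100",
     "full_time_teacher_count": "80", "ratio_value": 12.0},
    {"serial_no": "2", "school_name": "School B", "total_staff_count": "",
     "full_time_teacher_count": "40", "ratio_value": 10.0},
]
cleaned_rows = finalize(sample_rows)
buffer = io.StringIO()
write_csv(cleaned_rows, buffer)
print(buffer.getvalue())
```

To write a real file, `buffer` could be replaced by `open("clean-baoshan-schools-2024.csv", "w", newline="", encoding="utf-8")`. Returning `None` from `safe_divide` instead of raising keeps the division-by-zero case in line with the missing-data rule in step 7.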