# py-70069-web-scraper

**Repository Path**: Tony36051/py-70069-web-scraper

## Basic Information

- **Project Name**: py-70069-web-scraper
- **Description**: 70069,400
WeChat: 大燕燕🐝🐝🐝
Private order for Bloomberg, completed for 300
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-09
- **Last Updated**: 2025-08-02

## Categories & Tags

**Categories**: Uncategorized

**Tags**: archived, US female PhD

## README

# Data Collection from Specific Target Web Pages

Only the aflcio and bloomberg code in this repository is still maintained; all other folders are obsolete.

## aflcio

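Target site: https://aflcio.org/paywatch/company-pay-ratios

Description:
I need the information for every company in the Russell 3000 copied down. Each company also has a year, which is only visible after clicking through to that company's page. The year differs from company to company, but it always appears in the same place, under the same attribute as far as I can see.

Deliverable: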

|Ticker|Company|Median_Worker_Pay|Pay_ratio|FS_Year|
|----|----|----|----|----|
|ANF|Abercrombie & Fitch Co.|$1,954|4293|2020|


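### Technical notes
1. `requests` issues the HTTP requests.
2. BeautifulSoup parses the HTML document (the table content).
3. A regular expression extracts the year.
4. Because the field values contain commas, the CSV file is written with the tab character (`\t`) as the delimiter. To open it, start from a blank Excel workbook, go to Data > From File, and select tab as the delimiter.
5. Install the dependencies before use: `pip install -r requirements.txt`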
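
Steps 3 and 4 can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual code: the sample detail-page text, the function names `extract_year` and `write_tsv`, and the year pattern are all assumptions, and the fetching and HTML parsing from steps 1 and 2 are left out.

```python
import csv
import io
import re

# Hypothetical text taken from a company detail page; the real markup is not
# shown in this README.
SAMPLE_DETAIL_TEXT = "Abercrombie & Fitch Co. Fiscal year: 2020"

# Assumption: the year is a standalone four-digit number between 2000 and 2099.
YEAR_PATTERN = re.compile(r"\b(20\d{2})\b")


def extract_year(text):
    """Return the first four-digit year found in text, or None if absent."""
    match = YEAR_PATTERN.search(text)
    return match.group(1) if match else None


def write_tsv(rows, header, stream):
    """Write rows tab-separated, so commas inside fields need no quoting."""
    writer = csv.writer(stream, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


# Illustrative header and row based on the deliverable table above.
header = ["Ticker", "Company", "Median_Worker_Pay", "Pay_ratio", "FS_Year"]
row = ["ANF", "Abercrombie & Fitch Co.", "$1,954", "4293",
       extract_year(SAMPLE_DETAIL_TEXT)]

buffer = io.StringIO()  # stands in for an open output file
write_tsv([row], header, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

Writing to a real file would only require replacing the `io.StringIO` buffer with `open(path, "w", newline="", encoding="utf-8")`.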

## bloomberg

Target site: https://www.bloomberg.com/graphics/ceo-pay-ratio/

Description: Scrape the data shown in the table below for every company and every year.

Deliverable:

|Company Name|Ticker|Fiscal Year|CEO compensation|Median employee compensation| CEO pay ratio|
|----|----|----|----|----|----|
|Walmart Inc|WMT|2020|22,105,350|22,484|983|

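### Technical notes
1. `requests` issues the HTTP requests.
2. The response is read as JSON.
3. The results are written to a CSV file.
4. Because the field values contain no commas, the CSV file is written with a comma as the delimiter. To open it, start from a blank Excel workbook, go to Data > From File, and select comma as the delimiter.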
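
The JSON-to-CSV steps above might look roughly like the following sketch. The JSON schema of the Bloomberg page is not documented in this README, so the payload and its key names (`company`, `ticker`, `fiscal_year`, `ceo_comp`, `median_comp`, `ratio`) are hypothetical, as is the helper name `records_to_csv`; the HTTP request itself is omitted.

```python
import csv
import io
import json

# Hypothetical payload; the real response structure may differ.
SAMPLE_JSON = """
[
  {"company": "Walmart Inc", "ticker": "WMT", "fiscal_year": 2020,
   "ceo_comp": 22105350, "median_comp": 22484, "ratio": 983}
]
"""

# Assumed JSON keys, in the column order of the deliverable table.
FIELDS = ["company", "ticker", "fiscal_year", "ceo_comp", "median_comp", "ratio"]
HEADER = ["Company Name", "Ticker", "Fiscal Year", "CEO compensation",
          "Median employee compensation", "CEO pay ratio"]


def records_to_csv(records, stream):
    """Write one comma-separated row per record, preceded by the header."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(HEADER)
    for record in records:
        # Missing keys become empty cells instead of raising KeyError.
        writer.writerow([record.get(field, "") for field in FIELDS])


records = json.loads(SAMPLE_JSON)
buffer = io.StringIO()  # stands in for an open output file
records_to_csv(records, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Python's `csv.writer` quotes any field that happens to contain a comma, so the output stays valid even if a value breaks the no-comma assumption.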

## payratio

https://www.sec.gov/edgar/search/#

Filing type: DEF 14A

1) The sentence that states the pay ratio, identified by the keyword "pay ratio"

2) Pay Ratio-

3) CEO Compensation - If mentioned in the pay ratio sentence or within 2 sentences before the pay ratio sentence (that is, the CEO compensation counts only when a sentence containing "pay ratio" is within two sentences of it).

4) Median Compensation - If mentioned in the pay ratio sentence or within 2 sentences before the pay ratio sentence.

5) Ratio check - A check to validate the ratio extracted from the above three values
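
The five items above could be approximated as follows. This is a rough sketch under several assumptions that do not come from the repository: sentences are split naively on end punctuation, the ratio is written as "N to 1" or "N:1", the largest dollar amount in the window is taken as the CEO compensation and the smallest as the median compensation, and the ratio check allows a 1% tolerance. The sample text is made up from the Walmart figures in the bloomberg section, and all function names are hypothetical.

```python
import re

# Assumed ratio formats: "983 to 1" or "983:1".
RATIO_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(?:to|:)\s*1\b", re.IGNORECASE)
DOLLAR_PATTERN = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")

SAMPLE_TEXT = (
    "The median annual total compensation of our employees was $22,484. "
    "The annual total compensation of our CEO was $22,105,350. "
    "Based on this information, the pay ratio was 983 to 1."
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def to_number(raw):
    """Convert a string such as '22,105,350' to a float."""
    return float(raw.replace(",", ""))


def extract_pay_ratio(text, lookback=2, tolerance=0.01):
    """Return the extracted values for the first usable pay ratio sentence."""
    sentences = split_sentences(text)
    for index, sentence in enumerate(sentences):
        if "pay ratio" not in sentence.lower():
            continue
        ratio_match = RATIO_PATTERN.search(sentence)
        if ratio_match is None:
            continue
        # The pay ratio sentence plus up to `lookback` sentences before it.
        window = " ".join(sentences[max(0, index - lookback): index + 1])
        amounts = sorted(to_number(m) for m in DOLLAR_PATTERN.findall(window))
        if len(amounts) < 2:
            continue
        ratio = to_number(ratio_match.group(1))
        median_comp, ceo_comp = amounts[0], amounts[-1]  # heuristic assignment
        computed = ceo_comp / median_comp
        return {
            "sentence": sentence,
            "ratio": ratio,
            "ceo_compensation": ceo_comp,
            "median_compensation": median_comp,
            "ratio_check": abs(computed - ratio) / ratio <= tolerance,
        }
    return None


result = extract_pay_ratio(SAMPLE_TEXT)
print(result)
```

Real DEF 14A filings vary widely in wording, so both the sentence splitter and the amount heuristic would likely need tuning against actual documents.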

## git practise