To start using the package, we import the first function, which scrapes players' hitting statistics from the Baseball-Reference site.
Next, the second function cleans the scraped data. Again, this covers only the statistical data, not the salary data.
import pandas as pd
from bbanalysis.gathering_stats import clean_batting_data

# `test` is assumed to hold the list of dicts produced by the scraping step above.
# Convert list of dicts to DataFrame
test_df = pd.DataFrame(test)

# Pass the DataFrame directly (NOT as a list)
cleaned_data = clean_batting_data(test_df)
cleaned_data.head()
   Year   Rk         Player                                        Player_Link  \
0  2018  367     A.J. Ellis  https://www.baseball-reference.com/players/e/e...
1  2018  174     AJ Pollock  https://www.baseball-reference.com/players/p/p...
2  2018  287  Aaron Altherr  https://www.baseball-reference.com/players/a/a...
3  2018   88    Aaron Hicks  https://www.baseball-reference.com/players/h/h...
4  2018  145    Aaron Judge  https://www.baseball-reference.com/players/j/j...

    Age Team  Lg   WAR      G     PA  ...  rOBA  Rbat+     TB  \
0  37.0  SDP  NL   0.2   66.0  183.0  ...  .326  105.0   52.0
1  30.0  ARI  NL   2.2  113.0  460.0  ...  .346  103.0  200.0
2  27.0  PHI  NL  -0.9  105.0  285.0  ...  .285   66.0   81.0
3  28.0  NYY  AL   4.3  137.0  581.0  ...  .366  128.0  224.0
4  26.0  NYY  AL   6.0  112.0  498.0  ...  .396  152.0  218.0

   GIDP  HBP   SH   SF  IBB    Pos batting_hand
0   2.0  1.0  3.0  2.0  1.0  2H/7D        right
1   6.0  8.0  1.0  7.0  2.0   *8/H        right
2  13.0  4.0  0.0  2.0  0.0  9H8/7        right
3   1.0  3.0  2.0  6.0  1.0  *8/HD         left
4  10.0  4.0  0.0  5.0  3.0  9D/H8        right

5 rows × 36 columns
The next set of functions in the package scrapes the salary data and organizes it into a tidy dataset.
The first one creates HTTP headers, which let you request the more than 1,000 individual player pages. Those pages are the only place the salary data is available.
from bbanalysis.gathering_salaries import create_http_headers

create_http_headers()
At this point, you should have scraped all the statistics data and set up what you need to scrape the salary data. The following functions further prepare and run the scraping of the actual player pages.
import pandas as pd
import json
from bbanalysis.gathering_salaries import extract_unique_links

test_csv_path = "test_players.csv"
test_data = pd.DataFrame({
    "Player": ["Francisco Lindor"],
    "Player_Link": ["https://www.baseball-reference.com/players/l/lindofr01.shtml"]
})
test_data.to_csv(test_csv_path, index=False)

output_json_path = "test_players.json"
extract_unique_links(test_csv_path, output_json_path)

with open(output_json_path) as f:
    data = json.load(f)
print(json.dumps(data, indent=2))
Extracted 1 unique links and saved to test_players.json
[
  {
    "id": 1,
    "url": "https://www.baseball-reference.com/players/l/lindofr01.shtml",
    "player": "Francisco Lindor"
  }
]
This function is important because it collects all the unique player links in one place. Many players appear in the data multiple times because they played more than one season over the six-year span, so the function filters out the repeated links.
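To make the deduplication idea concrete, here is a rough pandas sketch. It is not the package's actual implementation: the helper name `unique_links` is made up for this illustration, while the column names `Player` and `Player_Link` and the output keys `id`, `url`, and `player` are taken from the example above.

```python
import pandas as pd

def unique_links(stats: pd.DataFrame) -> list[dict]:
    """Return one record per distinct player link (hypothetical helper)."""
    # Keep the first row seen for each link; rows from later seasons are repeats.
    firsts = stats.drop_duplicates(subset="Player_Link").reset_index(drop=True)
    return [
        {"id": i + 1, "url": row.Player_Link, "player": row.Player}
        for i, row in enumerate(firsts.itertuples(index=False))
    ]

# Toy data: the same player appears in two seasons.
stats = pd.DataFrame({
    "Year": [2018, 2019],
    "Player": ["Francisco Lindor", "Francisco Lindor"],
    "Player_Link": ["https://www.baseball-reference.com/players/l/lindofr01.shtml"] * 2,
})
records = unique_links(stats)
print(records)
```

Two season rows go in and a single link record comes out, which is what keeps the later scraping step from requesting the same page more than once.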
import json
import cloudscraper
from bbanalysis.gathering_salaries import scrape_with_cloudscraper

scraper = cloudscraper.create_scraper()
url = "https://www.baseball-reference.com/players/l/lindofr01.shtml"
result = scrape_with_cloudscraper(url, scraper)

# Transform to required format
formatted_data = {
    "id": 1,
    "player": "Francisco Lindor",
    "salaries": result  # Already in the right format!
}

# Save as JSON
with open('lindor_data.json', 'w', encoding='utf-8') as f:
    json.dump(formatted_data, f, indent=2)
With the scrape_salary_from_url function, you may already be able to scrape the salary data. However, rate limits can make scraping difficult, and you might get errors. Using cloudscraper lets you run the scraper while still maintaining speed and efficiency, so the scrape_with_cloudscraper function scrapes a single player page through a cloudscraper session.
Next you will run this function:
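If rate-limit errors still show up, one common way to cope is to retry with an increasing pause. The sketch below is only an illustration of that idea and is not part of the package: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and in practice `fetch` could be something like `lambda u: scrape_with_cloudscraper(u, scraper)`.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), waiting longer after each failure (hypothetical helper)."""
    for attempt in range(max_attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # wait 2s, then 4s, then 8s, ...

# Stand-in for a real scraper call: fails twice, then succeeds.
calls = {"n": 0}

def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return {"url": url, "ok": True}

# Record the waits instead of actually sleeping so the demo runs instantly.
waits = []
result = fetch_with_retries(flaky_fetch, "https://example.com/player", sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter is a design choice for this sketch: it keeps the example fast and makes the backoff schedule easy to inspect.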
# If we were to run this, it would take an hour.
# The function below loops scrape_with_cloudscraper over all links.
# from bbanalysis.gathering_salaries import churn_with_cloudscraper
# churn_with_cloudscraper()
This function creates a cloudscraper, then loops through the unique-links dataset and scrapes all the salary data from 2018-2025 using scrape_with_cloudscraper. It then writes the results to a .json file. Once this finishes, you have all the salary data from 2018-2025!
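The loop described above might look roughly like the following. This is a simplified sketch under stated assumptions, not the code inside churn_with_cloudscraper: the helper `scrape_all`, the stand-in `fake_scrape`, and the shape of each salary entry (a dict with `year` and `salary`) are all assumptions made for illustration.

```python
import json
import os
import tempfile
import time

def scrape_all(links, scrape_one, out_path, delay=1.0):
    """Scrape every link record and save the combined results as JSON."""
    results = []
    for rec in links:
        try:
            salaries = scrape_one(rec["url"])
        except Exception as exc:
            # Note the failure and keep going rather than losing the whole run.
            salaries = None
            print(f"Failed for {rec['player']}: {exc}")
        results.append({"id": rec["id"], "player": rec["player"], "salaries": salaries})
        time.sleep(delay)  # pause between requests to stay under rate limits
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    return results

# Demo with a fake scraper so nothing is requested over the network.
links = [{"id": 1, "url": "https://example.com/a", "player": "Player A"}]

def fake_scrape(url):
    return [{"year": 2018, "salary": 1000000}]  # assumed record shape

out_path = os.path.join(tempfile.mkdtemp(), "salaries.json")
all_salaries = scrape_all(links, fake_scrape, out_path, delay=0)
```

In a real run, `scrape_one` would wrap scrape_with_cloudscraper with a shared scraper, and `delay` would be set to something the site tolerates.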
Now you just need to convert and organize the salary data properly to combine with the other statistical data into one big dataset. This function will organize it as necessary:
from bbanalysis.gathering_salaries import salaries_json_to_csv

salaries_json_to_csv("lindor_data.json", "lindor_data.csv")
Saved long-format salaries to lindor_data.csv
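For readers curious what "long format" means here, this is a small sketch of the reshaping using `pandas.json_normalize`. It is not the implementation of salaries_json_to_csv: the helper name `salaries_to_long` is made up, and the assumption that each entry under `salaries` is a dict with `year` and `salary` keys is ours, since the scraped structure is not shown above.

```python
import pandas as pd

def salaries_to_long(records):
    """Flatten nested salary records into one row per player-season."""
    return pd.json_normalize(records, record_path="salaries", meta=["id", "player"])

# Toy record in the same outer shape as lindor_data.json (inner shape assumed).
records = [{
    "id": 1,
    "player": "Player A",
    "salaries": [
        {"year": 2018, "salary": 1000000},
        {"year": 2019, "salary": 1200000},
    ],
}]
long_df = salaries_to_long(records)
print(long_df)
```

Each season becomes its own row with the player's `id` and name repeated, which is the shape needed to join against per-season statistics.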
Now we’ll combine both the datasets using this function:
Loading and merging data...
Merged data saved to test_merged.csv
player year hr rbi salary
0 Mike Trout 2021 45 104 5000000
1 Aaron Judge 2021 39 98 3000000
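The call that produced this output is not shown, so here is a plain-pandas sketch of the same kind of merge. It uses the demo values from the output above and assumes the two tables are joined on `player` and `year`; the package's own function may work differently.

```python
import pandas as pd

# Demo values copied from the printed output above.
stats = pd.DataFrame({
    "player": ["Mike Trout", "Aaron Judge"],
    "year": [2021, 2021],
    "hr": [45, 39],
    "rbi": [104, 98],
})
salaries = pd.DataFrame({
    "player": ["Mike Trout", "Aaron Judge"],
    "year": [2021, 2021],
    "salary": [5000000, 3000000],
})

# An inner join keeps only player-seasons present in both tables;
# validate raises an error if either side has duplicate player-year keys.
merged = stats.merge(salaries, on=["player", "year"], how="inner", validate="one_to_one")
print(merged)
```

The `validate="one_to_one"` check is worth keeping in a real pipeline, because a duplicated player-season on either side would silently multiply rows.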
And now we’re ready to do our analysis!
The first thing we will do is filter out players who have fewer than five seasons in the dataset, because we want to compare how salary affects play and it is hard to draw meaningful conclusions from shorter careers. (The small demo below uses a threshold of three seasons.)
Kept 2 players with at least 3 seasons.
player year HR
0 A 2018 10
1 A 2019 12
2 A 2020 15
5 C 2016 20
6 C 2017 22
7 C 2018 25
8 C 2019 18
9 C 2020 19
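A filter like this can be sketched in a couple of lines of pandas. The helper name `keep_min_seasons` is hypothetical, and the rows for player B are invented to stand in for the player that was dropped; the rows for A and C match the demo output above.

```python
import pandas as pd

def keep_min_seasons(df, min_seasons=3):
    """Keep only players with at least `min_seasons` distinct years."""
    seasons = df.groupby("player")["year"].transform("nunique")
    return df[seasons >= min_seasons]

df = pd.DataFrame({
    "player": ["A"] * 3 + ["B"] * 2 + ["C"] * 5,
    "year": [2018, 2019, 2020, 2019, 2020, 2016, 2017, 2018, 2019, 2020],
    "HR": [10, 12, 15, 5, 7, 20, 22, 25, 18, 19],  # B's values are made up
})
kept = keep_min_seasons(df, min_seasons=3)
print(f"Kept {kept['player'].nunique()} players with at least 3 seasons.")
print(kept)
```

Using `transform("nunique")` returns a per-row season count, so the original row index is preserved after filtering.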
The function above removes those players. Next, we create indicators that mark when a player signs a big contract that leads to a large boost in pay. We define a big contract as a salary increase of $5,000,000.
Creating contract indicators...
Found 1 big contract events.
Columns before groupby: ['player', 'year', 'salary', 'salary_change', 'pct_salary_change', 'big_contract_year']
DataFrame shape before groupby: (8, 6)
Number of players: 3
Columns after groupby: ['player', 'year', 'salary', 'salary_change', 'pct_salary_change', 'big_contract_year', 'years_from_contract']
DataFrame shape after groupby: (8, 7)
'years_from_contract' in columns: True
player year salary salary_change pct_salary_change big_contract_year \
0 A 2018 1000000 NaN NaN False
1 A 2019 1200000 200000.0 0.200000 False
2 A 2020 2500000 1300000.0 1.083333 True
3 B 2019 500000 NaN NaN False
4 B 2020 700000 200000.0 0.400000 False
5 C 2018 2000000 NaN NaN False
6 C 2019 2100000 100000.0 0.050000 False
7 C 2020 2300000 200000.0 0.095238 False
years_from_contract post_contract
0 -2.0 0.0
1 -1.0 0.0
2 0.0 1.0
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
7 NaN NaN
FutureWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
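The indicator columns in the output above can be reproduced with a short pandas sketch. This is an illustration rather than the package's code: the function name `add_contract_indicators` is made up, and the demo call uses a threshold of $1,000,000, which is an assumption chosen so the toy salaries contain one flagged year. The sketch uses `groupby(...).transform` instead of `groupby(...).apply`, which happens to sidestep the deprecated behavior the warning refers to.

```python
import pandas as pd

def add_contract_indicators(df, threshold=5_000_000):
    """Add salary-change and contract-timing columns (hypothetical helper)."""
    df = df.sort_values(["player", "year"]).copy()
    by_player = df.groupby("player")["salary"]
    df["salary_change"] = by_player.diff()
    df["pct_salary_change"] = df["salary_change"] / by_player.shift()
    df["big_contract_year"] = df["salary_change"] >= threshold
    # First flagged year per player; NaN for players with no big contract.
    contract_year = (
        df["year"].where(df["big_contract_year"]).groupby(df["player"]).transform("min")
    )
    df["years_from_contract"] = df["year"] - contract_year
    df["post_contract"] = (
        (df["years_from_contract"] >= 0).astype(float).where(contract_year.notna())
    )
    return df

# Salaries copied from the demo output above.
demo = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "year": [2018, 2019, 2020, 2019, 2020, 2018, 2019, 2020],
    "salary": [1000000, 1200000, 2500000, 500000, 700000, 2000000, 2100000, 2300000],
})
out = add_contract_indicators(demo, threshold=1_000_000)  # demo threshold is assumed
print(out)
```

Players who never cross the threshold get NaN in `years_from_contract` and `post_contract`, which makes it easy to restrict a later model to players with a contract event.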
With these indicators in place, we can compare a player's statistical performance before and after signing a large contract. One effective way to do this is to fit mixed effects models. We have a function for this:
Running Model 1: Salary → OPS (all players)
Mixed Linear Model Regression Results
=====================================================
Model: MixedLM Dependent Variable: ops
No. Observations: 9 Method: ML
No. Groups: 3 Scale: 0.0002
Min. group size: 3 Log-Likelihood: 26.3591
Max. group size: 3 Converged: Yes
Mean group size: 3.0
-----------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-----------------------------------------------------
Intercept 11.122 11.416 0.974 0.330 -11.253 33.497
salary -0.000 0.000 -2.246 0.025 -0.000 -0.000
war 0.045 0.004 12.693 0.000 0.038 0.052
year -0.005 0.006 -0.922 0.356 -0.016 0.006
Group Var 0.000
=====================================================
Running Model 2: Pre vs Post contract performance
Using 6 observations from 2 players.
Mixed Linear Model Regression Results
=============================================================
Model: MixedLM Dependent Variable: ops
No. Observations: 6 Method: ML
No. Groups: 2 Scale: 0.0003
Min. group size: 3 Log-Likelihood: 16.0031
Max. group size: 3 Converged: No
Mean group size: 3.0
-------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-------------------------------------------------------------
Intercept -104.585 33.739 -3.100 0.002 -170.712 -38.458
post_contract 0.000 0.028 0.000 1.000 -0.055 0.055
age -0.022 0.004 -6.194 0.000 -0.030 -0.015
year 0.052 0.017 3.138 0.002 0.020 0.085
Group Var 0.000
=============================================================
Post-contract effect interpretation:
Coefficient: 0.0000 | p-value: 1.0000 | 95% CI: [-0.0554, 0.0554]
No significant post-contract effect.
Restricted df for Model 2:
player year salary ops war years_from_contract post_contract age
0 A 2018 1000000 0.80 5 -1.0 0.0 24
1 A 2019 1200000 0.85 6 0.0 1.0 25
2 A 2020 2500000 0.90 7 1.0 1.0 26
6 C 2018 2000000 0.75 4 -1.0 0.0 28
7 C 2019 2100000 0.76 4 0.0 1.0 29
8 C 2020 2300000 0.77 5 1.0 1.0 30
UserWarning: Random effects covariance is singular
ConvergenceWarning: The MLE may be on the boundary of the parameter space.
ConvergenceWarning: The Hessian matrix at the estimated parameter values is not positive definite.
ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
ConvergenceWarning: Retrying MixedLM optimization with lbfgs
ConvergenceWarning: Retrying MixedLM optimization with cg
ConvergenceWarning: MixedLM optimization failed, trying a different optimizer may help.
ConvergenceWarning: Gradient optimization failed, |grad| = 1.263158
From these mixed effects models we can glean valuable information, such as whether performance rises or falls as salary changes over time, and whether a steep increase in salary has a noticeable effect on a player's performance.
The next step is to visualize the data using this function:
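For reference, a model of this kind can be fit directly with statsmodels. The sketch below is not the package's function: it uses synthetic data generated on the spot, expresses salary in millions (`salary_m`), and centers the year (`year_c`), both of which are our own choices to keep the numbers well scaled. The formula mirrors the terms shown in the Model 1 output, with a random intercept per player.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic panel: 40 players, 5 seasons each (all values are simulated).
rng = np.random.default_rng(0)
n_players, n_seasons = 40, 5
player = np.repeat(np.arange(n_players), n_seasons)
year = np.tile(np.arange(2018, 2018 + n_seasons), n_players)
talent = rng.normal(0, 0.05, n_players)[player]   # per-player random intercept
salary_m = rng.uniform(0.5, 30, player.size)      # salary in millions of dollars
war = rng.normal(2, 1.5, player.size)
ops = 0.70 + talent + 0.02 * war + rng.normal(0, 0.03, player.size)

data = pd.DataFrame({
    "player": player, "year": year, "salary_m": salary_m, "war": war, "ops": ops,
})
data["year_c"] = data["year"] - data["year"].mean()  # centered year

# Random-intercept model fit by maximum likelihood (reml=False), as in the output above.
model = smf.mixedlm("ops ~ salary_m + war + year_c", data, groups=data["player"])
fit = model.fit(reml=False)
print(fit.summary())
```

With only a handful of observations, as in the demo output, the random-effect variance can be estimated at zero and the optimizer may not converge, so the coefficients from such a small fit should be read with caution.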
Generating visualizations → saved to 'test_plots/'
All visualizations saved.
Files in output directory: ['contract_boxplots.png', 'contract_trajectory.png']
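As a rough idea of what plots like these could contain, here is a matplotlib sketch that draws a pre/post-contract boxplot and a per-player trajectory. It is not the package's plotting code: the data are the six rows from the restricted Model 2 table above, the file names are borrowed from the output listing, and the figure layout is our own guess.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render to files without needing a display
import matplotlib.pyplot as plt
import pandas as pd

# Rows copied from the "Restricted df for Model 2" output above.
df = pd.DataFrame({
    "player": ["A", "A", "A", "C", "C", "C"],
    "ops": [0.80, 0.85, 0.90, 0.75, 0.76, 0.77],
    "years_from_contract": [-1.0, 0.0, 1.0, -1.0, 0.0, 1.0],
    "post_contract": [0.0, 1.0, 1.0, 0.0, 1.0, 1.0],
})
out_dir = tempfile.mkdtemp()

# Boxplot: OPS before versus after the contract year.
fig, ax = plt.subplots()
pre = df.loc[df["post_contract"] == 0, "ops"].to_numpy()
post = df.loc[df["post_contract"] == 1, "ops"].to_numpy()
ax.boxplot([pre, post])
ax.set_xticklabels(["Pre-contract", "Post-contract"])
ax.set_ylabel("OPS")
box_path = os.path.join(out_dir, "contract_boxplots.png")
fig.savefig(box_path)
plt.close(fig)

# Trajectory: one line per player, aligned on the contract year.
fig, ax = plt.subplots()
for name, grp in df.groupby("player"):
    ax.plot(grp["years_from_contract"], grp["ops"], marker="o", label=name)
ax.axvline(0, linestyle="--", color="gray")  # contract year
ax.set_xlabel("Years from contract")
ax.set_ylabel("OPS")
ax.legend(title="Player")
traj_path = os.path.join(out_dir, "contract_trajectory.png")
fig.savefig(traj_path)
plt.close(fig)

print(sorted(os.listdir(out_dir)))
```

Aligning the x-axis on `years_from_contract` rather than calendar year is what lets players who signed in different seasons be compared on the same plot.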