Getting Started

Here is how to use the package.

First, we import the function that scrapes batting statistics. It pulls players' season statistics from the baseball-reference.com site.

import sys
import os
sys.path.append(os.path.abspath("src"))
from bbanalysis.gathering_stats import scrape_batting_data

URLS = ['https://www.baseball-reference.com/leagues/majors/2018-standard-batting.shtml']
test = scrape_batting_data(URLS)

Scraping 2018...
  1272 players scraped for 2018

Next we use the second function, which cleans the data. Note that this is still just the statistical data, not the salary data.

import pandas as pd
from bbanalysis.gathering_stats import clean_batting_data

# Convert list of dicts to DataFrame
test_df = pd.DataFrame(test)
# Pass the DataFrame directly (NOT as a list)
cleaned_data = clean_batting_data(test_df)
cleaned_data.head()
Year Rk Player Player_Link Age Team Lg WAR G PA ... rOBA Rbat+ TB GIDP HBP SH SF IBB Pos batting_hand
0 2018 367 A.J. Ellis https://www.baseball-reference.com/players/e/e... 37.0 SDP NL 0.2 66.0 183.0 ... .326 105.0 52.0 2.0 1.0 3.0 2.0 1.0 2H/7D right
1 2018 174 AJ Pollock https://www.baseball-reference.com/players/p/p... 30.0 ARI NL 2.2 113.0 460.0 ... .346 103.0 200.0 6.0 8.0 1.0 7.0 2.0 *8/H right
2 2018 287 Aaron Altherr https://www.baseball-reference.com/players/a/a... 27.0 PHI NL -0.9 105.0 285.0 ... .285 66.0 81.0 13.0 4.0 0.0 2.0 0.0 9H8/7 right
3 2018 88 Aaron Hicks https://www.baseball-reference.com/players/h/h... 28.0 NYY AL 4.3 137.0 581.0 ... .366 128.0 224.0 1.0 3.0 2.0 6.0 1.0 *8/HD left
4 2018 145 Aaron Judge https://www.baseball-reference.com/players/j/j... 26.0 NYY AL 6.0 112.0 498.0 ... .396 152.0 218.0 10.0 4.0 0.0 5.0 3.0 9D/H8 right

5 rows × 36 columns

The next set of functions in the package scrapes and organizes the salary data into a tidy dataset.

The first one creates HTTP headers, which let you access the data from over 1,000 individual player pages, since that is the only way to reach the salary data.

from bbanalysis.gathering_salaries import create_http_headers

create_http_headers()
{'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
 'Accept-Language': 'en-US,en;q=0.5',
 'Accept-Encoding': 'gzip, deflate',
 'Connection': 'keep-alive',
 'Upgrade-Insecure-Requests': '1'}

The next function parses the salary table out of a player page's HTML.

from bs4 import BeautifulSoup
from bbanalysis.gathering_salaries import parse_salary_table_from_soup

html = """
<html>
<body>
<table id="br-salaries">
    <tr>
        <th>Year</th>
        <th>Salary</th>
    </tr>
    <tr>
        <td>2021</td>
        <td>$5,000,000</td>
    </tr>
    <tr>
        <td>2022</td>
        <td>$7,500,000</td>
    </tr>
</table>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

result = parse_salary_table_from_soup(soup)
print(result)
{2021: 5000000, 2022: 7500000}

At this point you have scraped all the statistics data and set up the salary scraping. The following functions further prepare and execute the scraping of the actual player pages.

import pandas as pd
import json
from bbanalysis.gathering_salaries import extract_unique_links 

test_csv_path = "test_players.csv"
test_data = pd.DataFrame({
    "Player": ["Francisco Lindor"],
    "Player_Link": ["https://www.baseball-reference.com/players/l/lindofr01.shtml"]
})
test_data.to_csv(test_csv_path, index=False)

output_json_path = "test_players.json"

extract_unique_links(test_csv_path, output_json_path)

with open(output_json_path) as f:
    data = json.load(f)

print(json.dumps(data, indent=2))
Extracted 1 unique links and saved to test_players.json
[
  {
    "id": 1,
    "url": "https://www.baseball-reference.com/players/l/lindofr01.shtml",
    "player": "Francisco Lindor"
  }
]

This function is important because it organizes all the unique player links in one place. Many players show up in the data multiple times because they played more than one season over the six-year span, so this filters out repeat links.
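
The deduplication itself comes down to dropping repeated URLs. Here is a minimal standalone pandas sketch of the idea (not the package function; the toy data is made up for illustration):

```python
import pandas as pd

# Toy roster: the same player appears in two different seasons,
# so the same player page URL shows up twice.
rows = pd.DataFrame({
    "Player": ["Francisco Lindor", "Francisco Lindor", "Aaron Judge"],
    "Player_Link": [
        "https://www.baseball-reference.com/players/l/lindofr01.shtml",
        "https://www.baseball-reference.com/players/l/lindofr01.shtml",
        "https://www.baseball-reference.com/players/j/judgeaa01.shtml",
    ],
})

# Keep one row per unique player page (first occurrence wins).
unique_links = rows.drop_duplicates(subset="Player_Link").reset_index(drop=True)
print(len(unique_links))  # 2
```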

import json
import cloudscraper
from bbanalysis.gathering_salaries import scrape_with_cloudscraper

scraper = cloudscraper.create_scraper()
url = "https://www.baseball-reference.com/players/l/lindofr01.shtml"

result = scrape_with_cloudscraper(url, scraper)

# Transform to required format
formatted_data = {
    "id": 1,
    "player": "Francisco Lindor",
    "salaries": result  # Already in the right format!
}

# Save as JSON
with open('lindor_data.json', 'w', encoding='utf-8') as f:
    json.dump(formatted_data, f, indent=2)
Scraping https://www.baseball-reference.com/players/l/lindofr01.shtml

With the scrape_salary_from_url function you may already be able to scrape the salary data, but rate limits can make that difficult and error-prone. Using cloudscraper lets you run the scraper while maintaining speed and efficiency, so the scrape_with_cloudscraper function prepares us to use a cloudscraper.
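
When you do need to manage rate limits yourself, the usual pattern is to space requests out and retry on failure. Here is a small standalone sketch of that pattern (fetch_with_retry and the flaky fetcher are hypothetical helpers, not part of the package):

```python
import time

def fetch_with_retry(fetch_page, url, retries=3, delay=2.0):
    """Call fetch_page(url), backing off between failed attempts.

    fetch_page is any callable that returns data or raises on failure.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return fetch_page(url)
        except Exception as exc:  # e.g. an HTTP 429 from the site
            last_error = exc
            time.sleep(delay * (attempt + 1))  # wait a little longer each time
    raise last_error

# Demo with a fake fetcher that fails once, then succeeds.
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("rate limited")
    return {2021: 5_000_000}

result = fetch_with_retry(flaky, "https://example.com", delay=0.01)
print(result)  # {2021: 5000000}
```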

Next you will run this function:

# Running this would take about an hour: the function below loops scrape_with_cloudscraper over all links.

# from bbanalysis.gathering_salaries import churn_with_cloudscraper
# churn_with_cloudscraper()

This function creates a cloudscraper, loops through the unique-links dataset, and scrapes all the salary data from 2018-2025 using scrape_with_cloudscraper, storing everything in a .json file. Once it finishes, you have all the salary data from 2018-2025!
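
Conceptually, the loop has this shape. The sketch below is an assumed outline, not the package's actual implementation (churn and the fake scraper are made up for illustration; the real function also manages delays and a cloudscraper session):

```python
import json

def churn(links, scrape, out_path):
    """Scrape salaries for each link and write one combined JSON file.

    links: list of {"id", "url", "player"} dicts (the unique-links file).
    scrape: callable(url) -> {year: salary} mapping.
    """
    results = []
    for link in links:
        results.append({
            "id": link["id"],
            "player": link["player"],
            "salaries": scrape(link["url"]),
        })
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    return results

# Demo with a fake scraper instead of real network calls.
links = [{"id": 1, "url": "https://example.com/p1", "player": "Player One"}]
data = churn(links, lambda url: {2021: 5_000_000}, "demo_salaries.json")
print(data[0]["player"])  # Player One
```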

Now you just need to convert and organize the salary data so it can be combined with the statistical data into one big dataset. This function organizes it as needed:

from bbanalysis.gathering_salaries import salaries_json_to_csv
salaries_json_to_csv("lindor_data.json", "lindor_data.csv")
Saved long-format salaries to lindor_data.csv

Now we’ll combine both the datasets using this function:

import pandas as pd
from bbanalysis.analysis import load_and_merge_data

stats_test = pd.DataFrame({
    "Player": ["Mike Trout", "Aaron Judge"],
    "Year": [2021, 2021],
    "HR": [45, 39],
    "RBI": [104, 98]
})
stats_csv_path = "test_stats.csv"
stats_test.to_csv(stats_csv_path, index=False)

salaries_test = pd.DataFrame({
    "player": ["Mike Trout", "Aaron Judge"],
    "year": [2021, 2021],
    "salary": [5000000, 3000000]
})
salaries_csv_path = "test_salaries.csv"
salaries_test.to_csv(salaries_csv_path, index=False)

merged_df = load_and_merge_data(stats_csv_path, salaries_csv_path, "test_merged.csv")

print(merged_df)
Loading and merging data...
Merged data saved to test_merged.csv
        player  year  hr  rbi   salary
0   Mike Trout  2021  45  104  5000000
1  Aaron Judge  2021  39   98  3000000

And now we’re ready to do our analysis!

The first thing we will do is filter out players who have fewer than five seasons in the dataset; we want to compare how salary affects play, and it is hard to reach significant findings with shorter careers.

import pandas as pd
from bbanalysis.analysis import filter_players_with_multiple_seasons

test_data = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "year": [2018, 2019, 2020, 2019, 2020, 2016, 2017, 2018, 2019, 2020],
    "HR": [10, 12, 15, 5, 6, 20, 22, 25, 18, 19]
})

filtered_df = filter_players_with_multiple_seasons(test_data, min_seasons=3)

print(filtered_df)
Kept 2 players with at least 3 seasons.
  player  year  HR
0      A  2018  10
1      A  2019  12
2      A  2020  15
5      C  2016  20
6      C  2017  22
7      C  2018  25
8      C  2019  18
9      C  2020  19

Next we create indicators that mark when a player signs a big contract, producing a large boost in pay. In the full analysis we treat a salary increase of $5,000,000 as a big contract; the toy example below uses smaller thresholds (a 50% jump or $1,000,000) so the small test data triggers an event.

import pandas as pd
import numpy as np
from bbanalysis.analysis import create_contract_indicators

test_data = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "C", "C", "C"],
    "year": [2018, 2019, 2020, 2019, 2020, 2018, 2019, 2020],
    "salary": [1_000_000, 1_200_000, 2_500_000, 500_000, 700_000, 2_000_000, 2_100_000, 2_300_000]
})

df_with_contracts = create_contract_indicators(test_data, pct_threshold=0.5, abs_threshold=1_000_000)

print(df_with_contracts)
Creating contract indicators...
Found 1 big contract events.
Columns before groupby: ['player', 'year', 'salary', 'salary_change', 'pct_salary_change', 'big_contract_year']
DataFrame shape before groupby: (8, 6)
Number of players: 3
Columns after groupby: ['player', 'year', 'salary', 'salary_change', 'pct_salary_change', 'big_contract_year', 'years_from_contract']
DataFrame shape after groupby: (8, 7)
'years_from_contract' in columns: True
  player  year   salary  salary_change  pct_salary_change  big_contract_year  \
0      A  2018  1000000            NaN                NaN              False   
1      A  2019  1200000       200000.0           0.200000              False   
2      A  2020  2500000      1300000.0           1.083333               True   
3      B  2019   500000            NaN                NaN              False   
4      B  2020   700000       200000.0           0.400000              False   
5      C  2018  2000000            NaN                NaN              False   
6      C  2019  2100000       100000.0           0.050000              False   
7      C  2020  2300000       200000.0           0.095238              False   

   years_from_contract  post_contract  
0                 -2.0            0.0  
1                 -1.0            0.0  
2                  0.0            1.0  
3                  NaN            NaN  
4                  NaN            NaN  
5                  NaN            NaN  
6                  NaN            NaN  
7                  NaN            NaN  

Now that we have this created, we can effectively compare statistical performance with before and after a large contract signing. One way to effectively do this is to run mixed effects models. We have a function for this:

import pandas as pd
import numpy as np
from bbanalysis.analysis import run_mixed_effects_models

test_data = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019, 2020],
    "salary": [1_000_000, 1_200_000, 2_500_000, 500_000, 700_000, 700_000, 2_000_000, 2_100_000, 2_300_000],
    "ops": [0.800, 0.850, 0.900, 0.700, 0.720, 0.710, 0.750, 0.760, 0.770],
    "war": [5, 6, 7, 2, 3, 3, 4, 4, 5],
    "years_from_contract": [-1, 0, 1, np.nan, np.nan, np.nan, -1, 0, 1],
    "post_contract": [0, 1, 1, np.nan, np.nan, np.nan, 0, 1, 1],
    "age": [24, 25, 26, 22, 23, 24, 28, 29, 30]
})

results = run_mixed_effects_models(test_data, window_years=1)

print("\nModel 1 summary:")
if 'model1' in results:
    print(results['model1'].summary())

print("\nModel 2 summary:")
if 'model2' in results:
    print(results['model2'].summary())

print("\nRestricted df for Model 2:")
print(results.get('df_model2'))

Running Model 1: Salary → OPS (all players)
        Mixed Linear Model Regression Results
=====================================================
Model:            MixedLM Dependent Variable: ops    
No. Observations: 9       Method:             ML     
No. Groups:       3       Scale:              0.0002 
Min. group size:  3       Log-Likelihood:     26.3591
Max. group size:  3       Converged:          Yes    
Mean group size:  3.0                                
-----------------------------------------------------
          Coef.  Std.Err.   z    P>|z|  [0.025 0.975]
-----------------------------------------------------
Intercept 11.122   11.416  0.974 0.330 -11.253 33.497
salary    -0.000    0.000 -2.246 0.025  -0.000 -0.000
war        0.045    0.004 12.693 0.000   0.038  0.052
year      -0.005    0.006 -0.922 0.356  -0.016  0.006
Group Var  0.000                                     
=====================================================


Running Model 2: Pre vs Post contract performance
Using 6 observations from 2 players.
            Mixed Linear Model Regression Results
=============================================================
Model:                MixedLM   Dependent Variable:   ops    
No. Observations:     6         Method:               ML     
No. Groups:           2         Scale:                0.0003 
Min. group size:      3         Log-Likelihood:       16.0031
Max. group size:      3         Converged:            No     
Mean group size:      3.0                                    
-------------------------------------------------------------
               Coef.   Std.Err.   z    P>|z|  [0.025   0.975]
-------------------------------------------------------------
Intercept     -104.585   33.739 -3.100 0.002 -170.712 -38.458
post_contract    0.000    0.028  0.000 1.000   -0.055   0.055
age             -0.022    0.004 -6.194 0.000   -0.030  -0.015
year             0.052    0.017  3.138 0.002    0.020   0.085
Group Var        0.000                                       
=============================================================


Post-contract effect interpretation:
Coefficient: 0.0000 | p-value: 1.0000 | 95% CI: [-0.0554, 0.0554]
No significant post-contract effect.

Model 1 summary:
        Mixed Linear Model Regression Results
=====================================================
Model:            MixedLM Dependent Variable: ops    
No. Observations: 9       Method:             ML     
No. Groups:       3       Scale:              0.0002 
Min. group size:  3       Log-Likelihood:     26.3591
Max. group size:  3       Converged:          Yes    
Mean group size:  3.0                                
-----------------------------------------------------
          Coef.  Std.Err.   z    P>|z|  [0.025 0.975]
-----------------------------------------------------
Intercept 11.122   11.416  0.974 0.330 -11.253 33.497
salary    -0.000    0.000 -2.246 0.025  -0.000 -0.000
war        0.045    0.004 12.693 0.000   0.038  0.052
year      -0.005    0.006 -0.922 0.356  -0.016  0.006
Group Var  0.000                                     
=====================================================


Model 2 summary:
            Mixed Linear Model Regression Results
=============================================================
Model:                MixedLM   Dependent Variable:   ops    
No. Observations:     6         Method:               ML     
No. Groups:           2         Scale:                0.0003 
Min. group size:      3         Log-Likelihood:       16.0031
Max. group size:      3         Converged:            No     
Mean group size:      3.0                                    
-------------------------------------------------------------
               Coef.   Std.Err.   z    P>|z|  [0.025   0.975]
-------------------------------------------------------------
Intercept     -104.585   33.739 -3.100 0.002 -170.712 -38.458
post_contract    0.000    0.028  0.000 1.000   -0.055   0.055
age             -0.022    0.004 -6.194 0.000   -0.030  -0.015
year             0.052    0.017  3.138 0.002    0.020   0.085
Group Var        0.000                                       
=============================================================


Restricted df for Model 2:
  player  year   salary   ops  war  years_from_contract  post_contract  age
0      A  2018  1000000  0.80    5                 -1.0            0.0   24
1      A  2019  1200000  0.85    6                  0.0            1.0   25
2      A  2020  2500000  0.90    7                  1.0            1.0   26
6      C  2018  2000000  0.75    4                 -1.0            0.0   28
7      C  2019  2100000  0.76    4                  0.0            1.0   29
8      C  2020  2300000  0.77    5                  1.0            1.0   30
(With this tiny toy dataset, statsmodels also emits a string of warnings, e.g. "Random effects covariance is singular" and several ConvergenceWarnings; they are expected for this example.)

From these mixed effects models we can glean valuable information, such as whether performance rises or falls as salary changes over time, and whether a steep salary increase has a big impact on a player's performance.
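
As a quick sanity check alongside the models, you can also compare average performance before and after the contract year directly. A minimal pandas sketch using the years_from_contract column created earlier (toy numbers, not real data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "player": ["A", "A", "A", "C", "C", "C"],
    "ops": [0.800, 0.850, 0.900, 0.750, 0.760, 0.770],
    "years_from_contract": [-1, 0, 1, -1, 0, 1],
})

# Seasons at or after the contract year count as "post",
# matching the post_contract indicator above.
df["period"] = np.where(df["years_from_contract"] < 0, "pre", "post")
means = df.groupby("period")["ops"].mean()
print(means.round(3))
```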

The next step is visualizing the data using this function:

import pandas as pd
import numpy as np
import os
from bbanalysis.analysis import generate_visualizations

test_data = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year": [2018, 2019, 2020, 2018, 2019, 2020, 2018, 2019, 2020],
    "ops": [0.800, 0.850, 0.900, 0.700, 0.720, 0.710, 0.750, 0.760, 0.770],
    "war": [5, 6, 7, 2, 3, 3, 4, 4, 5],
    "years_from_contract": [-1, 0, 1, np.nan, np.nan, np.nan, -1, 0, 1],
    "post_contract": [0, 1, 1, np.nan, np.nan, np.nan, 0, 1, 1]
})

output_dir = "test_plots"
if os.path.exists(output_dir):
    import shutil
    shutil.rmtree(output_dir)

generate_visualizations(test_data, output_dir=output_dir)

print("Files in output directory:", os.listdir(output_dir))

Generating visualizations → saved to 'test_plots/'
All visualizations saved.
Files in output directory: ['contract_boxplots.png', 'contract_trajectory.png']