Tuesday, February 21, 2017

Descriptive Statistics and Mean Centers

Introduction

Fig 2.0: Racer Times
  One goal of this lab is to become familiar with the concept of standard deviation, and other statistics. The second goal is to understand the difference between mean centers and weighted mean centers. This lab is broken up into two separate parts.
  Part one consists of defining and calculating a variety of statistics: range, median, mode, kurtosis, skewness, and standard deviation. The statistics are derived from the data set shown in figure 2.0. All statistics are calculated in Excel except for the standard deviation, which is done by hand. The scenario for this lab is a cycling race, the Tour de Geographia, which has both a team and an individual component. The individual who wins the race wins $300,000, with 25% going to the team owner, and the team that wins is awarded $400,000, with 35% going to the team owner. After calculating the statistics from the previous race results, either team Astana or team Tobler will be chosen, depending on which team is most likely to make its owner the most money at the race.
  Part two consists of calculating mean centers and weighted mean centers. Population data from 2000 and 2015 for Wisconsin counties will be used to determine the weighted mean center of population for both 2000 and 2015. Also, the geographic mean center of Wisconsin counties will be shown on the map. The difference between geographic mean center and weighted mean center will be discussed. Then, there will be some discussion about the patterns displayed in the map.

Part I: Defining and Calculating Statistics

Definitions

Range: The difference between the highest and lowest values in the data set.

Mean: The sum of all the values divided by the number of items in the data set.

Median: The exact middle value of the sorted data set. If the data set has an even number of values, the median is the mean of the two middle values.

Mode: The most common value in the data set.
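
The four definitions above can be sketched in a few lines of Python. This is only an illustration: the `times` list is made-up finishing times in minutes, not the data from figure 2.0, and the lab itself used Excel for these values.

```python
from statistics import mean, median, multimode

# Hypothetical finishing times in minutes (NOT the lab's data set).
times = [2250, 2262, 2262, 2271, 2280, 2295]

data_range = max(times) - min(times)  # highest value minus lowest value
data_mean = mean(times)               # sum of values divided by their count
data_median = median(times)           # mean of the two middle values (even count)
data_modes = multimode(times)         # returned as a list because ties are possible

print(data_range, data_mean, data_median, data_modes)
```

`multimode` is used instead of `mode` so that a data set with two equally common values reports both of them.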

Skewness: Describes how asymmetric the data distribution is, and can be positive or negative. Positive skewness occurs when the skew is calculated to be greater than 1, while negative skewness occurs when the skew is calculated to be less than -1. When the skew is between -1 and 1, the distribution is generally treated as approximately normal. Visually, a positively skewed distribution has a long tail to the right because of the large outliers in that direction, and a negatively skewed distribution has a long tail to the left because of the small outliers in that direction.
Fig 2.1: Different Types of Kurtosis


Kurtosis: Describes how peaked a data set is. Kurtosis doesn't have units; it is given as a plain number. Positive kurtosis (leptokurtic) is when the peak is very steep and the number is greater than 4, and negative kurtosis (platykurtic) is when the peak is spread out and the number is less than 2. When the number is between 2 and 4, the peak follows a normal distribution and is called mesokurtic. Figure 2.1 does a nice job of showing the different types of kurtosis. When kurtosis is calculated in Excel, 3 is subtracted from the original value, so a normal distribution gives a value near 0 rather than 3.
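
As a rough sketch of how skewness and kurtosis come out of the data, the function below uses the plain moment-based formulas and subtracts 3 from the kurtosis, as described above. The function name and the sample lists are hypothetical. Excel's SKEW and KURT functions apply small-sample corrections, so their output will differ somewhat from this simplified version.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis (simplified, no sample correction)."""
    n = len(values)
    mu = sum(values) / n
    # Second, third, and fourth central moments.
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # 3 is subtracted so a normal curve is near 0
    return skew, excess_kurtosis

# Hypothetical data: one symmetric list and one with a large outlier to the right.
symmetric_skew, symmetric_kurt = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])
right_tail_skew, _ = skewness_and_excess_kurtosis([1, 1, 1, 1, 10])

print(symmetric_skew, symmetric_kurt, right_tail_skew)
```

The symmetric list has a skew of zero, while the list with one large value has a positive skew, matching the long right tail described in the skewness definition.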

Fig 2.2: Sample vs Population Standard Deviation
Standard Deviation: Measures how closely the observations are clustered around the mean of a data set. In a normally distributed data set, 68.3% of the values fall within one standard deviation of the mean, while 95.4% fall within two standard deviations of the mean. Different equations are used to calculate the standard deviation for samples and for populations; both are shown in figure 2.2. The population standard deviation, shown on top, is calculated by summing the squared differences between each value and the mean, multiplying that sum by one over the number of values in the data set, and taking the square root of the result. The sample standard deviation is calculated the same way, except that the sum is multiplied by one over the number of values minus one. The higher the standard deviation, the more spread out the data set. The lower the standard deviation, the more tightly the data is clustered around the mean.
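
The two equations can also be written as a short Python sketch. The function names and the `values` list are hypothetical and are not the racer times; the sample version is cross-checked against `statistics.stdev` from the standard library.

```python
from math import sqrt
from statistics import stdev

def population_sd(values):
    """Square root of the mean squared deviation (divide by n)."""
    n = len(values)
    mu = sum(values) / n
    return sqrt(sum((v - mu) ** 2 for v in values) / n)

def sample_sd(values):
    """Same as above, but divide by n - 1 instead of n."""
    n = len(values)
    mu = sum(values) / n
    return sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))

# Hypothetical values with a mean of 5 (NOT the lab's data set).
values = [2, 4, 4, 4, 5, 5, 7, 9]

pop_result = population_sd(values)
sample_result = sample_sd(values)
print(pop_result, sample_result)
```

The sample result is always a little larger than the population result because the same sum is divided by a smaller number.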

Calculating Standard Deviation
  Below, in figures 2.3 and 2.4, are the standard deviations calculated by hand for team Astana and team Tobler based on the data set shown in figure 2.0. The mean wasn't calculated by hand, but it is used in the calculations below. The mean of team Astana's times was 2276.66 minutes and the mean of team Tobler's times was 2285.46 minutes. All the values in the calculations are in minutes to make the math easier. The variables are defined in the upper right corner of team Astana's calculation. Team Astana's standard deviation calculation is shown in figure 2.3, and team Tobler's standard deviation calculation is shown in figure 2.4.

Team Astana
Fig 2.3: Team Astana's Standard Deviation Calculation

Team Tobler
Fig 2.4:  Team Tobler's Standard Deviation Calculation
All Statistics
  The range, mean, median, mode, kurtosis, and skewness were all calculated in Excel. The results are shown below in figure 2.5. All values are rounded to the nearest minute. Kurtosis and skewness don't have units, so they are left just as they were calculated. Notice that the standard deviation is displayed as well.
Fig 2.5: All Statistics for both team Astana and team Tobler
  Looking at the statistics, team Astana is clearly the team to beat. During the last Tour de Geographia, team Astana averaged 8 minutes fewer than team Tobler. Team Astana's cycling times are more spread out than team Tobler's, as its standard deviation is 9 minutes larger. The range supports this as well, as team Astana's range is over twice as large as team Tobler's. The median still gives the edge to team Astana because it is 9 minutes lower than team Tobler's. The mode also gives the edge to team Astana. One interesting thing about the data is that team Tobler's distribution is left skewed and strongly peaked, whereas team Astana's distribution is close to normal and has near-normal kurtosis.
  Before choosing the team that would let me collect the most money, I made a graph. It is shown in figure 2.6 and pairs each racer with the racer of the same rank on the other team. For both teams, the best racer is at position one, the second best racer at two, the third best at three, and so on.
Fig 2.6: Tour de Geographia Cycling Times in Minutes
   This graph shows that team Astana holds the edge with every racer except one, who also happens to be the slowest racer in the race. Because of this graph, the mean finishing time, and the median finishing time, I would choose team Astana. During the previous Tour de Geographia, the best racer on team Astana finished the race first, with a time of 37 hours and 25 minutes. This was a whole 19 minutes faster than the fastest racer from team Tobler, who finished fourth overall. On average, team Astana was 8 minutes faster. When every racer is paired by team rank with a racer on the other team, team Astana has the faster time in each pair except for the slowest racers. This graph and the mean are the most important things which led to my decision. I feel like this is a pretty solid bet, and that team Astana should easily win both the individual race and the team competition (by having the lower mean racing time) based on the previous race results. This would give me the best chance to win up to $215,000.
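
The $215,000 figure is the owner's share of both prizes added together. The short check below assumes the individual prize is $300,000, which is the amount consistent with that total; the variable names are only for illustration.

```python
# Prize amounts in dollars and the owner's share in percent, from the scenario.
INDIVIDUAL_PRIZE = 300_000  # assumed value, consistent with the $215,000 total
TEAM_PRIZE = 400_000
OWNER_PERCENT_INDIVIDUAL = 25
OWNER_PERCENT_TEAM = 35

# Integer arithmetic avoids floating point rounding on the percentages.
owner_from_individual = INDIVIDUAL_PRIZE * OWNER_PERCENT_INDIVIDUAL // 100
owner_from_team = TEAM_PRIZE * OWNER_PERCENT_TEAM // 100
owner_total = owner_from_individual + owner_from_team

print(owner_from_individual, owner_from_team, owner_total)
```

The owner only collects the full amount if the same team wins both the individual race and the team competition.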

Part II: Calculating Mean Centers and Weighted Mean Centers

Definitions

Geographic Mean Center: A measure of central tendency which is calculated by averaging the x values and averaging the y values of a set of points. It is the exact center of the set of points.

Weighted Mean Center: The geographic mean center of a set of points adjusted for the values associated with each point. Each point is given a weight depending on the value which it holds. For example, in the Wisconsin population map below (Figure 2.7), each county is stored as a point, and each of these points holds the population of the county. The weighted mean center is found by multiplying each county's center x and y coordinates by the county's population, summing these weighted coordinates across all counties, and dividing each sum by the total population.
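
The difference between the two centers can be shown with a small Python sketch. The four points and their populations are made up for illustration and are not Wisconsin counties, and the function names are hypothetical; the lab itself used the Mean Center tool rather than code like this.

```python
def mean_center(points):
    """Average of the x values and average of the y values."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def weighted_mean_center(points, weights):
    """Each coordinate is multiplied by its weight, summed, then divided by the total weight."""
    total = sum(weights)
    x = sum(w * px for (px, _), w in zip(points, weights)) / total
    y = sum(w * py for (_, py), w in zip(points, weights)) / total
    return (x, y)

# Hypothetical county center points and populations (NOT real data).
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
population = [100, 100, 100, 700]

geographic = mean_center(points)
weighted = weighted_mean_center(points, population)
print(geographic, weighted)
```

The geographic center sits in the middle of the four points, while the weighted center is pulled toward the point that holds most of the population.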

Map
  Below, in Figure 2.7, is a map which displays the geographic mean center and the weighted mean centers of Wisconsin's population at the county level. To find the geographic mean center, the Mean Center tool was used with the Wisconsin counties feature class as the input. To find the weighted mean centers of population for 2000 and 2015, the Mean Center tool was also used, but this time the respective population field was added to the input weight field box.

Fig 2.7: Geographic Center and Weighted Mean Center of Wisconsin by County
Discussion
  The map above shows that the geographic mean center is a little more than 50 miles northwest of the weighted mean centers of population for 2000 and 2015. The geographic center of Wisconsin is located in Wood county, while both the 2000 and 2015 mean centers of population are located in Green Lake county. This is because the large cities of Green Bay, Milwaukee, and Madison are given considerable weight when finding the weighted mean center based on population. From the map, it is clear that the weighted mean center of population has moved ever so slightly to the north and west between 2000 and 2015. This means that, overall, a larger percentage of the population now resides west of the 2000 population mean center. Going back into the Excel file which the population data came from, both Sawyer county and St. Croix county experienced large increases in population over the last 15 years. Both of these counties are located in the western third of the state, which helps explain the westward movement of the population mean center. The St. Croix county population has grown by 52,071, and Sawyer county has grown by 46,796. Conversely, Sheboygan county, which borders Lake Michigan, has experienced a population loss of 71,083 over the last 15 years. Growth in the west and loss in the east together help to show that the Wisconsin population is moving west. People may be moving west because of fast-growing cities such as Hudson or Eau Claire, where the demand for jobs is on the increase. Another reason could be that there isn't as much space for the population to expand in the southeast as there is across the rest of the state.

Conclusion

  In conclusion, statistics and measures of central tendency can be used for different kinds of analysis. The range, median, mode, mean, skewness, kurtosis, and standard deviation help one understand the general shape of a data distribution without having to graph the data. Although looking at the statistics themselves is sometimes enough to answer a question, many times one will have to look through the data set and see what the values are. This helps to identify outliers and to eyeball the general distribution. This only works, though, if the data set is fairly small, like the one for the Tour de Geographia.
