Greg Geis 03-28-2014 12:59 PM

I think it would be interesting to play around with the statistical distributions comparing each region for 14.1-14.5. I would like to see how the 50th, 90th, and 99th percentiles compare for each region. It would also be cool to see if there are "stronger" regions (14.3) or "motor" regions (14.5), or if they get pretty even once you get a large enough dataset of Crossfitters.

The analysis is really easy to put together. You could pull out most of the information with a cumulative probability distribution comparing the different regions. Example: [url][/url] (WFS).
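
As a rough illustration of the cumulative-distribution comparison described here, the sketch below builds an empirical CDF per region using only the Python standard library. The region names and scores are made-up placeholders, not real Open results, and it assumes a workout scored in reps (higher is better).

```python
from bisect import bisect_right

def ecdf(scores):
    """Return a function F where F(x) is the fraction of scores <= x."""
    ordered = sorted(scores)
    n = len(ordered)

    def F(x):
        return bisect_right(ordered, x) / n

    return F

# Hypothetical rep counts for one workout; placeholders, not real Open data.
scores_by_region = {
    "Region A": [120, 150, 180, 200, 240],
    "Region B": [100, 150, 160, 210, 300],
}

cdfs = {region: ecdf(scores) for region, scores in scores_by_region.items()}

# Compare the regions at a few score thresholds.
for reps in (150, 200, 250):
    row = ", ".join(f"{region}: {cdfs[region](reps):.2f}" for region in cdfs)
    print(f"share scoring <= {reps} reps -> {row}")
```

Plotting each region's curve on the same axes would show where one region pulls ahead of another: at the elite end, in the middle, or nowhere.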

You could also pull the P(10), P(50), P(90), and P(99) values, which would compare the percentiles of each region to see whether they are consistent or vary from one to another. The Northeast is stacked at the highest level; does that carry over to the top 10%, or to the average?
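
A minimal sketch of that percentile table, assuming a nearest-rank definition of percentile (other definitions interpolate and give slightly different values). The `percentile` helper, the region names, and the scores are all hypothetical placeholders.

```python
def percentile(scores, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(scores)
    # Ceiling of p * n / 100 using integer arithmetic to avoid float rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]

# Hypothetical scores per region; placeholders, not real Open results.
scores_by_region = {
    "Region A": list(range(50, 350, 3)),   # 100 made-up scores
    "Region B": list(range(40, 440, 4)),   # 100 made-up scores
}

levels = (10, 50, 90, 99)
table = {
    region: {p: percentile(scores, p) for p in levels}
    for region, scores in scores_by_region.items()
}

# Print one row per region so the percentiles can be compared side by side.
print("region    " + "".join(f"P{p:<6}" for p in levels))
for region, row in table.items():
    print(f"{region:<10}" + "".join(f"{row[p]:<7}" for p in levels))
```

If the regions were equally fit at every level, the rows would be nearly identical; a region that is only strong at the top would stand out in the P(99) column alone.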

The biggest hurdle to getting this is the data size. If it were a small data set I could do it in an hour in Excel. But 200,000 competitors x 5 workouts = 1,000,000 data points, which is starting to be a fairly large dataset. Also, you can only pull up 50 competitors at a time on the Games site, so you'd need to be able to download the data all at once; copy and paste isn't going to cut it.
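
The arithmetic behind that estimate, as a quick sketch. The figures are the round numbers from this post; whether one leaderboard page shows all five workout scores at once is an assumption here.

```python
competitors = 200_000   # rough figure quoted in the post
workouts = 5            # 14.1 through 14.5
page_size = 50          # leaderboard rows visible per page, per the post

data_points = competitors * workouts

# Ceiling division: number of page loads needed to see every competitor once,
# assuming (hypothetically) that each page lists all five workout scores.
pages = -(-competitors // page_size)

print(f"{data_points:,} data points across {pages:,} leaderboard pages")
```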

If I had the data, Excel should technically be able to handle it (just over 1,000,000 rows available per sheet), but I don't know how well it would handle graphing datasets that large without crashing. It may be able to handle it, may not.

So in summary:
Is there an interest in seeing this kind of analysis?
Does anyone know a better way to get it?
If previous answers are yes and no, is there a way I can get a copy of the data?

Chuck Golden 03-29-2014 03:48 PM

I'd love to see this kind of data. I've wanted to do some similar analysis, but pulling the data from the Games site is so tedious. I wish I knew of a faster way.

Christopher Morris 03-29-2014 10:46 PM

We have an idea of which regions are fittest when considering the winners (e.g. the men in Central East). I take it you want to look at the data considering all athletes in each region, or at least all athletes that submitted scores for every Open workout? You want to know which region has the fittest average.

I would definitely find that analysis interesting.

Greg Geis 03-30-2014 08:16 AM

My title was more of an attention grabber than my actual intent. The average is one of the easy primary statistics you might look at, but I think there is more to the story than just the average would tell you. That's why I want to build and compare the distributions. Some regions may have a higher elite segment. You might see a more ballsy region with clearly a larger "beginner" segment that signed up for the Open. I really don't know what you'll see, but I bet you could pull out some interesting trends.

It would also be cool to compare the 11.1 distribution to 14.1. Crossfit has an awesome data set to play with, and Castro has mentioned on several occasions how they like to gather that kind of data and play around with it. I've just never seen the data analyzed at the next level yet. I am hoping someone at HQ has an interest.

Chuck Golden 03-30-2014 11:33 AM

I agree it would be a very interesting breakdown based on the huge amount of data available. I just have no idea how you'd gather it without a ton of manual copying and pasting. Pulling the top 60 from each region (basically the group that would go to Regionals) wouldn't be too terribly difficult. I've thought about doing that once the leaderboards are finalized. That would at least give some indication as to which regions are the hardest to get to Regionals/Games. Having that middle group too, say 100-800, would be pretty good though.

Greg Geis 03-31-2014 06:29 AM

It would take someone with access to the database to create an export of the relevant information and send it to me via a file transfer site. I don't know if HQ views the entire data set as proprietary or would have issues sending it out. It would be very easy to do for whoever manages their database. Everything is accessible, just not usable as is. Or HQ could do it themselves pretty easily if anyone with access to the database has a basic statistics background and a curiosity to know what to look for.

Mark E. Wallace 03-31-2014 07:14 AM

HQ almost certainly has -- or could easily acquire -- the resources to do this analysis if they care to have it done. Pretty unlikely that they're going to just turn the raw data over to some random dude on the Internet.

- Mark

Luke Sirakos 03-31-2014 09:18 AM

If all you wanted to do was see the fittest region I think a simple median of all those who submitted a score for each workout would be your best bet. In my head, the idea of fittest region would be, if you had a huge dartboard for each region with every participant on it with an equal amount of space, if you threw a dart randomly at each region, which would have the highest probability of winning.
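
The dartboard idea can be put into numbers as the probability that a randomly drawn athlete from one region outscores a randomly drawn athlete from another. The sketch below computes that alongside the medians, assuming higher scores are better and counting ties as half a win; the `win_probability` helper and the scores are hypothetical placeholders, not real Open data.

```python
from bisect import bisect_left, bisect_right
from statistics import median

def win_probability(a, b):
    """P(random score from a beats random score from b), ties count as half."""
    b_sorted = sorted(b)
    wins = 0.0
    for x in a:
        below = bisect_left(b_sorted, x)            # scores in b strictly below x
        ties = bisect_right(b_sorted, x) - below    # scores in b equal to x
        wins += below + 0.5 * ties
    return wins / (len(a) * len(b))

# Hypothetical rep counts; placeholders, not real Open results.
region_a = [120, 150, 180, 200, 240]
region_b = [100, 150, 160, 210, 300]

print("median A:", median(region_a))
print("median B:", median(region_b))
print("P(A beats B):", win_probability(region_a, region_b))
```

With these made-up numbers Region A has the higher median, yet the head-to-head probability comes out even, so the two measures do not always agree on which region is "fittest".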

I think the analysis could be far more interesting if you could get additional data on the participants such as age, weight, benchmark lifts/wod times, level of competitive seriousness, etc. That would make for some really interesting data.

What could be more feasible is to take the Games athletes from prior years, get their basic stats at the time of the Games, and build a model to predict what place they would finish. Then take this year's pool once it is determined and run it against the model. It could tell you some pretty interesting information, such as what attributes are the strongest predictors of success in the Games. I think that could be pretty fascinating.

Greg Geis 03-31-2014 08:36 PM

Don't get me wrong, I am not holding my breath for a response. I figured it couldn't hurt to ask. At the very least it would get the conversation going and possibly spark an interest in someone who would want to write a journal article somewhat along those lines. The raw data isn't really proprietary or anything, but not many people are going to spend 6 hours or so copying all the data manually.

It's only worth it to do the work if there is a big enough interest to see it. I wanted to throw the idea out there right at the end of the open when the interest would be highest. On the off chance someone in the right position found it intriguing I'd be happy to play with the numbers and throw together some graphics. It would be even better if someone with better access to the data (the entire data set like Luke mentioned) wanted to tinker with it. Anyway, if they don't have enough interest, it should probably die right here and in all likelihood will.

