I'm trying a new way of handling data questions, the Insights website. The water-project tag identifies questions specific to this project, but there are also questions on the site that are not part of the project. Feel free to comment on or answer any of them.
I'll be promoting this site to journalists, nonprofits and others who are interested in getting data answers, and I hope that you'll follow the questions and answer some of them. You can get immediate notification of new questions and answers from the Slack channel.
After the next meeting, I'll probably move the water quality project into this format; rather than having meetings dedicated to just water quality, we'll have Data Library meetings that are focused on answering a few of the open questions.
Please visit the site and review a few of the questions with the water-project tag, and if you have analyses, link them in as answers. If you are working on an analysis that is different from the current questions, you can create and answer your own question. Just be sure to apply the
Maybe it isn't just that bacteria counts are higher after a rain; maybe earlier rains have more pollutants to wash into the ocean than later rains. Ticket #156 asks you to explore the seasonality of the bacteria counts with the new dataset.
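One possible starting point for the seasonality question, as a minimal sketch: group the results by calendar month and compare the medians. The column names (`sampledate`, `result`) and the values below are made up for illustration and are not taken from the Beachwatch data.

```python
import pandas as pd

# Made-up sample rows; column names are assumptions, not the dataset's schema.
df = pd.DataFrame({
    "sampledate": pd.to_datetime([
        "2016-10-05", "2016-10-20",
        "2017-01-10", "2017-01-25",
        "2017-07-04", "2017-07-18",
    ]),
    "result": [900.0, 700.0, 300.0, 100.0, 20.0, 40.0],
})

# Median result per calendar month (1 = January, ..., 12 = December).
monthly_median = df.groupby(df["sampledate"].dt.month)["result"].median()
print(monthly_median)
```

With real data, it would probably make sense to do this separately per station and per measurement type, since raw counts are not comparable across them.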
I've just released a new version of the Beachwatch data, which includes "features", new columns specifically designed for analysis:
These features include a "measure code" that groups together each unique combination of analyte/methodname/unit, so you can easily select a specific group for analysis. As shown in the example notebook, this will usually be 24, the Enterolert measurements of Enterococcus.
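To make the idea of a measure code concrete, here is a rough sketch that builds one from scratch with pandas. The column names (`analyte`, `methodname`, `unit`, `result`), the second method name, the units, and the sample rows are all assumptions for illustration. The released dataset already carries its own codes, and the numbers generated here will not match them.

```python
import pandas as pd

# Hypothetical sample rows; names and values are illustrative only.
df = pd.DataFrame({
    "analyte": ["Enterococcus", "Coliform, Total", "Enterococcus", "Coliform, Total"],
    "methodname": ["Enterolert", "method_b", "Enterolert", "method_b"],
    "unit": ["unit_a", "unit_b", "unit_a", "unit_b"],
    "result": [10.0, 200.0, 52.0, 180.0],
})

# Give every unique analyte/methodname/unit combination its own integer code.
df["measure_code"] = df.groupby(["analyte", "methodname", "unit"]).ngroup()

# Look up the code for one combination, then select that group for analysis.
code = df.loc[df["analyte"] == "Enterococcus", "measure_code"].iloc[0]
enterococcus = df[df["measure_code"] == code]
print(enterococcus)
```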
The dataset also has mean, median and quantile groups for each of the combinations of station and method code, so rather than analyzing the raw result measurements, you can work with "high" and "low" values, scaled to a particular measurement and station. Some examples of this sort of analysis, using logistic regression, are in the example notebook.
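As a minimal sketch of what "high" and "low" relative to a station and measurement can mean, the code below flags each result as high when it is above the median of its own group. The column names (`stationcode`, `measure_code`, `result`) and the values are assumptions for illustration; the dataset ships its own precomputed columns, which may be defined differently.

```python
import pandas as pd

# Made-up results for two hypothetical stations sharing one measure code.
df = pd.DataFrame({
    "stationcode": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "measure_code": [24] * 8,
    "result": [10.0, 20.0, 30.0, 400.0, 100.0, 200.0, 300.0, 4000.0],
})

# Median of each station/measure group, broadcast back onto every row.
group = df.groupby(["stationcode", "measure_code"])["result"]
df["group_median"] = group.transform("median")

# A result is "high" when it exceeds the median for its own group.
df["is_high"] = df["result"] > df["group_median"]
print(df)
```

In this toy data, a result of 200 at station B counts as low while a result of 30 at station A counts as high, which is the point of scaling to the station. A binary column like `is_high` is the kind of target a logistic regression can model.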
Here is a new notebook I created that looks at the geographic distribution of measurement stations in the Beachwatch bacteria counts and how station results are correlated with each other over time:
For the next meeting (August 1), I'd like to have a few other analysts present extensions to this analysis, looking at:
- Compare the correlations between stations to the distance between them. Is the correlation a function of distance?
- Look at the correlations over time. Are they getting stronger or weaker?
- This analysis looks at only one analyte, 'Coliform, Total'. Are there similar correlations for the others?
- With additional work, you may be able to find the creeks, streams, or drainage pipes that are close to these stations. Are there relationships between the station results and the locations of these water outlets?
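For the first question, one way to get started is to pair each station-to-station correlation with the great-circle distance between the two stations. This is only a sketch under stated assumptions: the station names, coordinates, and result values below are hypothetical, and the notebook's own approach may differ.

```python
import math
from itertools import combinations

import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Hypothetical station coordinates as (latitude, longitude).
stations = {"A": (32.70, -117.25), "B": (32.75, -117.25), "C": (33.10, -117.32)}

# Hypothetical results: one row per sample date, one column per station.
results = pd.DataFrame({
    "A": [10.0, 50.0, 20.0, 80.0, 30.0],
    "B": [12.0, 55.0, 18.0, 90.0, 25.0],
    "C": [40.0, 10.0, 60.0, 20.0, 15.0],
})

# Pairwise Pearson correlation between the station columns.
corr = results.corr()

# One row per station pair: distance and correlation side by side.
pairs = pd.DataFrame([
    {
        "pair": f"{a}-{b}",
        "distance_km": haversine_km(*stations[a], *stations[b]),
        "correlation": corr.loc[a, b],
    }
    for a, b in combinations(stations, 2)
])
print(pairs.sort_values("distance_km"))
```

A scatter plot of `distance_km` against `correlation` over all real station pairs would then show whether correlation falls off with distance.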
I think I've invited everyone into the GitHub organization. If so, you should either already be a member, or have an invitation waiting for you, which you can accept at:
I added everyone using the "github account" entry in your Redmine account profile. So, if you didn't get a GitHub invite and don't have a Redmine account, please create one. If you didn't get a GitHub invite and you do have a Redmine account, email me with your GitHub user id and I'll add you.
We have three new data packages in the data repository, under the 'water-project' tag. See them all at:
To use these packages, click into the resource for the .csv or .zip package file, and you'll see that the resource documentation includes some Python code. For instance, for the CSV package for the Beachwatch data you'll see:
import metapack as mp
pkg = mp.open_package('http://library.metatab.org/ceden.waterboards.ca.gov-beachwatch-sandiego-1.csv')
Then, you can use pkg to get a Pandas dataframe:
df = pkg.read_csv()
Later today or tomorrow I'll be posting some examples and challenges using these data packages.