Answer Some Questions on the New Insights Site

Added by Eric Busboom over 4 years ago

I'm trying a new way of handling data questions, the Insights website. The water-project tag identifies questions specific to this project, but there are also questions on the site that are not part of the project. Feel free to comment on or answer any of them.

I'll be promoting this site to journalists, nonprofits and others who are interested in getting data answers, and I hope that you'll follow the questions and answer some of them. You can get immediate notification of new questions and answers from the Slack channel.

After the next meeting, I'll probably move the Water Quality project into this format; rather than having meetings dedicated to just water quality, we'll have Data Library meetings that are focused on answering a few of the open questions.

Please visit the site and review a few of the questions with the water-project tag, and if you have analyses, link them in as answers. If you are working on an analysis that is different from the current questions, you can create and answer your own question. Just be sure to apply the water-project tag.

Interactive Station Map, Creek Monitoring Data

Added by Eric Busboom over 4 years ago

We've got a new dataset for the San Diego Coastkeeper water monitoring program. This data comes from CEDEN, so it has the same format as the Beachwatch bacteria counts, but also includes measurements for temperature, pH, turbidity and nutrients.

I've extracted all of the measurement stations from this dataset and added them to the Beachwatch stations, to produce this map:

You can get the link to the map from the main webpage for the project:

Full Featured Beachwatch Data

Added by Eric Busboom over 4 years ago

I've just released a new version of the Beachwatch data which includes "features", new columns specifically designed for analysis:

These features include a "measure code" that groups together each unique combination of analyte/methodname/unit, so you can easily select a specific group for analysis. As shown in the example notebook, this will usually be 24, the Enterolert measurements of Enterococcus.

The dataset also has mean, median and quantile groups for each of the combinations of station and method code, so rather than analyzing the raw result measurements, you can work with "high" and "low" values, scaled to a particular measurement and station. Some examples of this sort of analysis, using logistic regression, are in this example notebook:
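Separately from that notebook, here is a minimal sketch of the "high"/"low" idea in pandas. It is not the code used to build the released features: the column names (`stationcode`, `measure_code`, `result`) and all values are made up for illustration, and the threshold is simply each station's own median.

```python
import pandas as pd

# Made-up rows; the column names are assumptions, not the dataset's actual schema.
df = pd.DataFrame({
    "stationcode": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "measure_code": [24, 24, 24, 24, 24, 24, 24, 7],
    "result": [10.0, 20.0, 400.0, 30.0, 100.0, 300.0, 200.0, 5.0],
})

# Keep one measure code so that all results share an analyte, method and unit.
entero = df[df["measure_code"] == 24].copy()

# Median of each station's results, broadcast back to every row of that station.
entero["station_median"] = entero.groupby("stationcode")["result"].transform("median")

# Flag results above their own station's median as "high".
entero["is_high"] = entero["result"] > entero["station_median"]

print(entero)
```

A boolean column like `is_high` is the kind of binary target that a logistic regression can be fit against.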

New Dataset for Tides, River Flow and Rain

Added by Eric Busboom over 4 years ago

Here is a new data set to make analysis a bit easier:

It combines rain, tides and San Diego river flow, re-sampled to days. It is also a good example of how to combine datasets. See the notebook used to build the package for examples of resampling and joining:
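Independently of that notebook, the general pattern of resampling to days and then joining can be sketched roughly as follows. The series names, timestamps and values are invented, and the choice of aggregation (sum for rain, mean for flow) is an assumption, not necessarily what the package uses.

```python
import pandas as pd

# Made-up series at irregular times; names and values are illustrative only.
rain = pd.Series(
    [0.0, 0.2, 0.1, 0.0],
    index=pd.to_datetime(["2018-01-01 00:00", "2018-01-01 12:00",
                          "2018-01-02 06:00", "2018-01-03 00:00"]),
    name="rain",
)
flow = pd.Series(
    [5.0, 7.0, 9.0],
    index=pd.to_datetime(["2018-01-02 01:00", "2018-01-02 13:00",
                          "2018-01-03 01:00"]),
    name="flow",
)

# Resample each series to one row per day: total rainfall, mean flow.
rain_daily = rain.resample("D").sum()
flow_daily = flow.resample("D").mean()

# An outer join on the date index keeps days that only one series covers;
# the missing side shows up as NaN.
daily = pd.concat([rain_daily, flow_daily], axis=1, join="outer")
print(daily)
```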

The data packaging system I created, Metapack, now has a new feature to automatically create Exploratory Data Analysis notebooks. The notebook for this package is an example of basic EDA, and it is also important for the analysis of null values -- see the Nulls section of the notebook -- because the three data series in the dataset all have different time ranges for valid values.
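To see why the differing time ranges matter, here is a small sketch of a null-coverage check in pandas. The column names and values are invented and this is not the code from the generated EDA notebook.

```python
import numpy as np
import pandas as pd

# Made-up daily table in which each series covers a different date range.
days = pd.date_range("2018-01-01", periods=6, freq="D")
daily = pd.DataFrame(
    {
        "rain": [0.0, 0.1, 0.0, 0.3, np.nan, np.nan],
        "tide": [np.nan, 1.2, 1.1, 1.3, 1.0, 0.9],
        "flow": [np.nan, np.nan, 4.0, 5.0, 6.0, np.nan],
    },
    index=days,
)

# Null count and first/last valid date for each series.
coverage = pd.DataFrame({
    "nulls": daily.isna().sum(),
    "first_valid": daily.apply(lambda s: s.first_valid_index()),
    "last_valid": daily.apply(lambda s: s.last_valid_index()),
})
print(coverage)

# Rows where every series has a value: the window that is safe to compare.
complete = daily.dropna()
print(complete.index.min(), complete.index.max())
```

Restricting an analysis to `complete` avoids comparing a series against dates where another series has no valid values at all.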

If you'd like to do more exploration, combine this dataset with the Beachwatch data to see how bacteria counts at various stations correlate with rain and river flow.
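One possible starting point, sketched with invented data: merge one station's bacteria results with the daily table on date, then compute a rank correlation. The column names (`date`, `result`, `rain`) are assumptions, and the rank-based (Spearman) correlation is my choice here because bacteria counts tend to be heavily skewed.

```python
import pandas as pd

dates = pd.date_range("2018-01-01", periods=6, freq="D")

# Made-up daily environment table; column names are assumptions.
env = pd.DataFrame({"date": dates, "rain": [0.0, 0.5, 0.1, 0.0, 0.8, 0.0]})

# Made-up bacteria counts for one station, sampled on only some days.
counts = pd.DataFrame({
    "date": dates[[1, 2, 4, 5]],
    "result": [300.0, 80.0, 500.0, 20.0],
})

# An inner merge keeps only the days present in both tables.
merged = counts.merge(env, on="date", how="inner")

# Spearman correlation, computed as the Pearson correlation of the ranks.
rho = merged["result"].rank().corr(merged["rain"].rank())
print(merged)
print("rank correlation:", rho)
```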

What's Happening at Dog Beach?

Added by Eric Busboom over 4 years ago

I've posted an issue to look at possible correlations with bacteria counts at Dog Beach.

If you'd like a challenge, work on the ticket and present your work at the next meeting.

The issue references a map of water quality stations:

This map was created with a Jupyter notebook, saved to HTML in the notebook, and checked into GitHub. Then I turned on web publishing in GitHub for the repository, and I've got an interactive map!

Here is the notebook:
Here is the repo, showing the map file, pl_stations.html:

Using Folium like this is one of the easiest ways to get an interactive map, with a good basemap and markers.

Correlations Challenge

Added by Eric Busboom over 4 years ago

Here is a new notebook I created that looks at the geographic distribution of measurement stations in the Beachwatch bacteria counts and how station results are correlated with each other over time:

For the next meeting (August 1) I'd like to have a few other analysts present extensions to this analysis, looking at:

  • Compare the correlations between stations to the distance between them. Is the correlation a function of distance?
  • Look at the correlations over time. Are they getting stronger or weaker?
  • This analysis looks at only one analyte, 'Coliform, Total'. Are there similar correlations for the others?
  • With additional work, you may be able to find the creeks, streams, or drainage pipes that are close to these stations. Are there relationships between the station results and the locations of these water outlets?
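As a possible starting point for the first question, here is a sketch that pairs each station-to-station correlation with the great-circle distance between the two stations. The coordinates and results are invented, the helper `haversine_km` is my own illustration, and the results are assumed to already be in a wide table with one column per station.

```python
import math
from itertools import combinations

import pandas as pd


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Made-up station coordinates (lat, lon) and weekly results, one column per station.
coords = {"A": (32.75, -117.25), "B": (32.76, -117.25), "C": (32.90, -117.26)}
results = pd.DataFrame(
    {"A": [10, 50, 20, 80, 30], "B": [12, 45, 25, 70, 35], "C": [40, 10, 60, 20, 15]},
    index=pd.date_range("2018-01-01", periods=5, freq="W"),
)

# Pairwise Pearson correlation between station columns.
corr = results.corr()

# One row per station pair: distance and correlation side by side.
pairs = pd.DataFrame(
    [
        {
            "pair": f"{a}-{b}",
            "distance_km": haversine_km(*coords[a], *coords[b]),
            "correlation": corr.loc[a, b],
        }
        for a, b in combinations(results.columns, 2)
    ]
)
print(pairs.sort_values("distance_km"))
```

Plotting `correlation` against `distance_km` for the real stations would show whether correlation falls off with distance.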

Did you Get the GitHub Invite?

Added by Eric Busboom over 4 years ago

I think I've invited everyone into the GitHub organization. If so, you should either already be a member, or have an invitation waiting for you, which you can accept at:

I added everyone from the "github account" entry in your Redmine account profile. So, if you didn't get a GitHub invite and don't have a Redmine account, please create one. If you didn't get a GitHub invite and you do have a Redmine account, email me with your GitHub user id and I'll add you.


New Packages

Added by Eric Busboom over 4 years ago

We have three new data packages in the data repository, under the 'water-project' tag. See them all at:

To use these packages, click into the resource for the .csv or .zip package file, and you'll see the resource documentation includes some Python code. For instance, for the CSV package for the Beachwatch data you'll see:

import metapack as mp
pkg = mp.open_package('')

Then, you can use the pkg to get a Pandas dataframe:

df = pkg.read_csv()

Later today or tomorrow I'll be posting some examples and challenges using these data packages.

