I've just released a new version of the Beachwatch data, which includes "features", new columns designed specifically for analysis:
These features include a "measure code" that groups together each unique combination of analyte/methodname/unit, so you can easily select a specific group for analysis. As shown in the example notebook, this will usually be measure code 24, the Enterolert measurements of Enterococcus.
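As a minimal sketch of selecting one group, assuming a `measure_code` column and a `result` column (the column names here are guesses; check the package schema for the real ones):

```python
import pandas as pd

# Hypothetical stand-in for the Beachwatch dataframe; the column names are
# assumptions, not necessarily the package's real schema.
df = pd.DataFrame({
    'measure_code': [24, 24, 7, 24, 7],
    'result': [10.0, 120.0, 3.1, 45.0, 2.7],
})

# Select the Enterolert / Enterococcus group (measure code 24)
entero = df[df['measure_code'] == 24]
print(entero)
```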
The dataset also has mean, median and quantile groups for each combination of station and method code, so rather than analyzing the raw result measurements, you can work with "high" and "low" values, scaled to a particular measurement and station. Some examples of this sort of analysis, using Logistic Regression, are in this example notebook:
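A minimal sketch of that style of analysis, fitting scikit-learn's LogisticRegression to synthetic data (the real notebook works on the package's quantile-group columns; the rain predictor and labels below are invented stand-ins):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: a rain predictor and a binary "high"/"low" label,
# standing in for the package's scaled quantile-group values.
rng = np.random.default_rng(0)
rain = rng.uniform(0, 2, size=200)
high = (rain + rng.normal(0, 0.5, size=200)) > 1.0

model = LogisticRegression().fit(rain.reshape(-1, 1), high)
# A positive coefficient means more rain raises the odds of a "high" reading.
print('rain coefficient:', model.coef_[0][0])
```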
Here is a new data set to make analysis a bit easier:
It combines rain, tides and San Diego river flow, re-sampled to days. It is also a good example of how to combine datasets. See the notebook used to build the package for examples of resampling and joining:
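The resample-then-join pattern looks roughly like this sketch, using synthetic hourly rain and daily flow series in place of the real source data:

```python
import numpy as np
import pandas as pd

# Synthetic hourly rain and daily river flow, stand-ins for the real sources.
hours = pd.date_range('2017-01-01', periods=72, freq='h')
rain = pd.Series(np.random.default_rng(1).uniform(0, 0.1, size=72),
                 index=hours, name='rain')

daily_rain = rain.resample('D').sum()  # total rainfall per day
daily_flow = pd.Series([110.0, 95.0, 102.0],
                       index=pd.date_range('2017-01-01', periods=3, freq='D'),
                       name='flow')

# Align both series on the daily index in one dataframe.
combined = pd.concat([daily_rain, daily_flow], axis=1)
print(combined)
```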
The data packaging system I created, Metapack, now has a new feature that automatically creates Exploratory Data Analysis notebooks. The notebook for this package is an example of basic EDA, and it is also important for the analysis of null values -- see the Nulls section at the end of the notebook -- because the three data series in the dataset all have different time ranges for valid values.
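The null-value issue can be checked with a couple of lines of pandas. This sketch uses three invented series with deliberately different valid ranges, mimicking the rain/tide/flow situation:

```python
import numpy as np
import pandas as pd

# Three series with different valid time ranges, standing in for the
# rain, tide, and river-flow columns in the package.
idx = pd.date_range('2017-01-01', periods=6, freq='D')
df = pd.DataFrame({
    'rain': [0.1, 0.0, 0.2, np.nan, np.nan, np.nan],
    'tide': [np.nan, 1.2, 1.4, 1.1, 1.3, np.nan],
    'flow': [np.nan, np.nan, 90.0, 85.0, 88.0, 91.0],
}, index=idx)

# Count nulls per column, then find the window where all three are valid.
print(df.isnull().sum())
valid = df.dropna()
print(valid.index.min(), valid.index.max())
```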
If you'd like to do more exploration, combine this dataset with the Beachwatch data to see how bacteria counts at various stations correlate with rain and river flow.
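The basic move is to join the two packages' dataframes on their date indexes and look at correlations. A sketch with invented daily values (the real frames come from the two data packages):

```python
import pandas as pd

# Stand-in daily frames; in practice these come from the two data packages.
days = pd.date_range('2017-01-01', periods=4, freq='D')
env = pd.DataFrame({'rain': [0.0, 0.5, 1.2, 0.1],
                    'flow': [80.0, 120.0, 200.0, 90.0]}, index=days)
bact = pd.DataFrame({'result': [10.0, 40.0, 300.0, 15.0]}, index=days)

# Inner join keeps only days present in both frames.
joined = env.join(bact, how='inner')
print(joined.corr()['result'])  # correlation of rain and flow with counts
```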
I've posted an issue to look at possible correlations with bacteria counts at Dog Beach.
If you'd like a challenge, work on the ticket and present your work at the next meeting.
The Issue references a map of water quality stations:
This map was created with a Jupyter notebook, saved to HTML in the notebook, and checked into Github. Then I turned on the web publishing in Github for the repository, and I've got an interactive map!
Here is the notebook: https://github.com/san-diego-water-quality/ericbusboom/blob/master/Stream%20Flow.ipynb
Here is the repo, showing the map file, pl_stations.html: https://github.com/san-diego-water-quality/ericbusboom
Using Folium like this is one of the easiest ways to get an interactive map, with a good basemap and markers.
Here is a new notebook I created that looks at the geographic distribution of measurement stations in the Beachwatch bacteria counts and how station results are correlated with each other over time:
For the next meeting ( August 1 ) I'd like to have a few other analysts present extensions to this analysis, looking at:
- Compare the correlations between stations to the distance between them. Is the correlation a function of distance?
- Look at the correlations over time. Are they getting stronger or weaker?
- This analysis looks at only one analyte, 'Coliform, Total'. Are there similar correlations for the others?
- With additional work, you may be able to find the creeks, streams, or drainage pipes that are close to these stations. Are there relationships between the station results and the locations of these water outlets?
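For the first item, the comparison of correlation to distance can be sketched like this, with synthetic station coordinates and time series (the real ones come from the Beachwatch package):

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in km."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

# Invented stations: A and B are near each other, C is farther away.
stations = {'A': (32.75, -117.25), 'B': (32.76, -117.24), 'C': (32.85, -117.28)}
rng = np.random.default_rng(2)
base = rng.normal(size=50)
series = {'A': base + rng.normal(0, 0.1, size=50),   # tracks the shared signal
          'B': base + rng.normal(0, 0.2, size=50),
          'C': rng.normal(size=50)}                  # independent of it

for s1, s2 in [('A', 'B'), ('A', 'C')]:
    d = haversine_km(*stations[s1], *stations[s2])
    r = np.corrcoef(series[s1], series[s2])[0, 1]
    print(f'{s1}-{s2}: {d:.1f} km, r={r:.2f}')
```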
I think I've invited everyone into the GitHub organization. If so, you should either already be a member or have an invitation waiting for you, which you can accept at:
I added everyone from the "github account" entry in your Redmine account profile. So, if you didn't get a GitHub invite and don't have a Redmine account, please create one. If you didn't get a GitHub invite and you do have a Redmine account, email me with your GitHub user id and I'll add you.
We have three new data packages in the data repository, under the 'water-project' tag. See them all at:
To use these packages, click into the resource for the .csv or .zip package file, and you'll see that the resource documentation includes some Python code. For instance, for the CSV package for the Beachwatch data you'll see:
import metapack as mp
pkg = mp.open_package('http://library.metatab.org/ceden.waterboards.ca.gov-beachwatch-sandiego-1.csv')
Then, you can use the pkg object to get a Pandas dataframe:
df = pkg.read_csv()
Later today or tomorrow I'll be posting some examples and challenges using these data packages.
We've installed a JupyterHub/JupyterLab system at https://jupyter.civicknowledge.com/. You can log in with your Redmine account credentials, and do data analysis without installing anything.
For more details, see the blog post at:
Cheng Li created a graph of the precipitation and river flow rate in a Jupyter Notebook, similar to
Neal Jander's flow rate Shiny App.
Additionally, I've created a Jupyter Notebook for exploring the TMDL measurements from CEDEN. This notebook will give you some insight into the structure of the dataset and how to use it, since I have not found good documentation.
For Cheng's notebook, you'll need to download the data first; look for the links for "Precipitation" and "Flow Rate". For the TMDL notebook, the data is loaded from our data repository, so you can download the notebook and run it immediately.
Shiny is a web application platform for R that makes it really easy to create web-based data visualizations. Neal Jander used the data for flow rate through Mission Valley in this basic Shiny app.
Suggested next step: add a trace for the total amount of rainfall in the previous 24 hours, or add markers for coliform counts at a downstream test location.
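The trailing 24-hour rainfall trace is a one-liner with a pandas time-based rolling window. A sketch on a synthetic hourly series, standing in for the downloaded precipitation data:

```python
import numpy as np
import pandas as pd

# Synthetic hourly rainfall, a stand-in for the downloaded data.
hours = pd.date_range('2017-02-01', periods=48, freq='h')
rain = pd.Series(np.random.default_rng(3).uniform(0, 0.05, size=48),
                 index=hours)

# Total rainfall over the previous 24 hours, at each hour.
rain_24h = rain.rolling('24h').sum()
print(rain_24h.tail())
```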