Recently Gabor Szabo wrote an article to answer a question he’s often asked, “How many people are in the Perl community?” In his response, he mentions and analyzes a number of proxies including conference participation, commits, open source contributions, mailing list posts, etc. John Napiorkowski agreed on the importance of analyzing this data: “Understanding this I think is going to become more important as we try to do a better job of promoting and discussing Perl.” Given the importance of this topic and my interest in analytics, I wanted to see if I could help by developing some community analytics for the mailing lists archived by Grokbase.
Grokbase already generates some metadata for lists including posts per month and users per month on a per-list basis so the basic structure was there, albeit not yet aggregated into reports or charts for easy analysis. The first step was to create an aggregate report from the existing data. Then I added some additional data I thought would be useful to come up with this initial list:
- posts per month
- users per month
- new users per month
- user domains per month
- TLD distribution
The new statistics were added because they might be useful for the community. Specifically, the new users per month figure provides an indication of community growth over time. I also added user domains per month, which can be used as a proxy for the number of organizations using a technology. Finally, I added email address top-level domain (TLD) distribution to get an idea of geographic distribution for people using email addresses with country-code TLDs (ccTLDs). While many people use corporate and free email services with generic TLDs such as .com, a number of people continue to use country-code TLDs, which might be a useful indicator for conference or workshop potential. Once the reports were done, I added some charts for easy trend analysis and visualization.
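To make the five statistics concrete, here is a minimal Python sketch of how they could be derived from a list of posts. It is not Grokbase's actual code: the function name `monthly_stats`, the `(date, email)` input shape, and the choice to count each distinct address once in the TLD distribution are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def monthly_stats(posts):
    """Compute per-month list statistics.

    posts: iterable of (iso_date, email) tuples, e.g. ("2012-07-03", "a@example.com").
    This input shape is an assumption for the sketch.
    """
    per_month = defaultdict(lambda: {"posts": 0, "users": set(), "domains": set()})
    first_seen = {}  # email -> month of first post

    # Sorting by date ensures first_seen records the earliest month per user.
    for date, email in sorted(posts):
        month = date[:7]  # "YYYY-MM"
        email = email.lower()
        entry = per_month[month]
        entry["posts"] += 1
        entry["users"].add(email)
        entry["domains"].add(email.rsplit("@", 1)[-1])
        first_seen.setdefault(email, month)

    new_users = Counter(first_seen.values())

    # TLD distribution: each distinct address counted once (an assumption).
    tlds = Counter(email.rsplit(".", 1)[-1] for email in first_seen)

    report = {
        month: {
            "posts": e["posts"],
            "users": len(e["users"]),
            "new_users": new_users[month],
            "domains": len(e["domains"]),
        }
        for month, e in sorted(per_month.items())
    }
    return report, dict(tlds)


sample_posts = [
    ("2012-07-03", "a@example.com"),
    ("2012-07-09", "b@example.de"),
    ("2012-08-01", "a@example.com"),
    ("2012-08-02", "c@example.com"),
]
report, tlds = monthly_stats(sample_posts)
```

In this sample, `b@example.de` is the only ccTLD address, so it is the only one that would contribute to a geographic signal.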
Check out the pages for perl5-porters, python-list, android-developers, and mongodb-user. To get to any of these pages, navigate to the group page and click the “Analytics” tab in the top right corner.
To get an idea of the larger community, I also aggregated the data across groups by list host. This is useful for seeing the activity for an overall community like perl.org, python.org, php.net or an Apache project like Cassandra or Hadoop. You can see these pages by navigating to the groups by host page, clicking a host and then clicking the Analytics tab.
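One subtlety when aggregating by host is that user counts cannot simply be summed across groups, because the same person may post to several lists. The sketch below shows one way to handle that by merging user sets before counting. The function name, the input dictionaries, and the list names `list-a` and `list-b` are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def aggregate_users_by_host(group_users, group_host):
    """Count distinct users per month for each list host.

    group_users: {group: {month: set of emails}}  (assumed shape)
    group_host:  {group: host}                    (assumed shape)
    """
    merged = defaultdict(lambda: defaultdict(set))
    for group, months in group_users.items():
        host = group_host[group]
        for month, users in months.items():
            # Union rather than sum, so cross-posting users count once.
            merged[host][month] |= users
    return {
        host: {month: len(users) for month, users in sorted(months.items())}
        for host, months in merged.items()
    }


group_users = {
    "list-a": {"2012-07": {"a@example.com", "b@example.com"}},
    "list-b": {"2012-07": {"b@example.com", "c@example.com"}},
}
group_host = {"list-a": "perl.org", "list-b": "perl.org"}
by_host = aggregate_users_by_host(group_users, group_host)
```

Post counts, by contrast, can be added directly since each post belongs to exactly one group.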
There’s obviously more that can be done but it was quick to get this out and I hope it will prove useful. Please let me know what you think and what could make it more useful.
This project was developed and deployed in ~1.5 days, with ~55-60% of that time spent on refactoring the front-end to ensure good code maintainability. The analytics are built off of a normalized PostgreSQL data store and denormalized into a JSON-document store. From there, web pages were developed to access the denormalized data for groups and list hosts. This initially resulted in further code duplication (four copies at that point, due to earlier technical debt), so some refactoring was done to address that. A number of services are already deployed this way so it was relatively fast and easy to do. The charts are provided by Google’s Chart API using a custom wrapper to automate rendering multiple charts on a page.
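As a rough illustration of the normalized-to-denormalized step, the sketch below aggregates rows from relational tables into one JSON document per group. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs anywhere, and the table names, column names, and document layout are assumptions rather than the actual Grokbase schema.

```python
import json
import sqlite3


def build_documents(conn):
    """Return {group_name: json_string} with per-month aggregates."""
    rows = conn.execute(
        """
        SELECT g.name,
               substr(p.posted_on, 1, 7) AS month,
               COUNT(*) AS posts,
               COUNT(DISTINCT p.author) AS users
        FROM posts p
        JOIN list_groups g ON g.id = p.group_id
        GROUP BY g.name, month
        ORDER BY g.name, month
        """
    )
    docs = {}
    for name, month, posts, users in rows:
        doc = docs.setdefault(name, {"group": name, "months": []})
        doc["months"].append({"month": month, "posts": posts, "users": users})
    # Each serialized document could be written to a document store as-is.
    return {name: json.dumps(doc) for name, doc in docs.items()}


# Hypothetical schema and sample data for the sketch.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE list_groups (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (
        id INTEGER PRIMARY KEY,
        group_id INTEGER,
        author TEXT,
        posted_on TEXT
    );
    INSERT INTO list_groups VALUES (1, 'example-list');
    INSERT INTO posts (group_id, author, posted_on) VALUES
        (1, 'a@example.com', '2012-07-03'),
        (1, 'a@example.com', '2012-07-10'),
        (1, 'b@example.org', '2012-08-01');
    """
)
docs = build_documents(conn)
```

Because the aggregation happens once at build time, a web page only needs to fetch and render a single document per group, which keeps the page code simple.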
This is just the start and I would like to add additional analytics that are of use to the community. It’s only been a couple of days since I read Gabor’s blog post but I’ve already come up with some possible enhancements:
- Extensions on existing analytics
  - Aggregate by subject (done, see below): default group navigation is done by subject (vs. host), however aggregate analytics are currently only provided by host. Support for subject-based aggregations would help for Google-hosted groups like Android, MongoDB, Scala, etc.
  - Aggregate groups report (done, see below): the ability to list groups by subject or list host to enable comparison of group activity by posts, users or new users. For example, if we see a lot of new users for perl.org this month but do not know which lists they are participating in, a quick report would let us zero in on those lists and see what is happening.
  - Compare groups/hosts: compare arbitrary groups or sets of groups so we can more easily determine how one technology is faring against another.
  - API access: more generic data availability to support custom analytics
- Additional data
  - New users list: a list of new users can be useful for the community to ensure those users are getting their issues addressed and to see how they are engaging with the community.
I’ve already found the analytics to be very useful for myself so my thanks go to Gabor and John for inspiring me to put this together. Please let me know what you think and how it could be made more useful for you and others.
After receiving positive initial feedback, I decided to see if any of the additional features could be implemented quickly and ended up adding the following two features (online as of Aug 20):
- Aggregate by subject: it’s now possible to view aggregate list reports by subject, which is useful when the subject matter doesn’t have a dedicated subject-based list server, e.g. Android, MongoDB, and Scala. For these lists specifically, Grokbase may not yet have the full list archives, so take that into account when viewing the charts. If you have data that does not appear to be archived, please consider sending it to us.
- Aggregate groups report: in addition to viewing aggregate reports by month and TLD, it’s now possible to view the reports by group for both the life of the archived data as well as on a per-month basis. This enables comparison of group activity by posts, users and new users. This can be useful for seeing which groups have the most posts vs. the most users and the most new users.
These features took an additional 1 day with the primary difference being that they were built to consider and take advantage of the now-existing denormalized data.
I’d like to thank Gabor for mentioning the analytics feature in the latest Perl Weekly. As noted, automated emails are currently included in the aggregate reports. By segregating automated emails from the reports, visibility into real user activity could be increased. Work has begun on identifying automated email lists and users so stay tuned.
Update (Aug 92): segregating automated posts is now completed.