I recently added mailing list analytics to Grokbase in response to Gabor Szabo’s post on estimating the size of the Perl community. This provides some useful insight into community participation. However, both Gabor and I noticed that a significant number of posts are generated automatically by computer programs, and that it would be useful to separate these out to get a better look at actual user participation. In light of this, I added some reports that break out highly automated email lists and users.
The posts are segregated by identifying ‘Automated posts’ using:
- a set of mailing lists
- a set of email sender addresses
Everything that is not currently flagged as an automated list or email address is classified as a ‘User post.’
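As a rough illustration of the rule above, here is a minimal Python sketch. It is not Grokbase's actual code: the `Post` type, the `classify` function, and the example addresses are all hypothetical, and the real flagged sets are not listed in this post.

```python
from dataclasses import dataclass

# Hypothetical flagged sets, for illustration only.
AUTOMATED_LISTS = {"commits@example.org"}
AUTOMATED_SENDERS = {"buildbot@example.org"}


@dataclass(frozen=True)
class Post:
    list_address: str  # mailing list the post was sent to
    sender: str        # email address of the sender


def classify(post, automated_lists=AUTOMATED_LISTS,
             automated_senders=AUTOMATED_SENDERS):
    """Return 'Automated post' if the list or the sender is flagged,
    otherwise 'User post'."""
    if (post.list_address.lower() in automated_lists
            or post.sender.lower() in automated_senders):
        return "Automated post"
    return "User post"


if __name__ == "__main__":
    print(classify(Post("commits@example.org", "alice@example.org")))
    print(classify(Post("dev@example.org", "buildbot@example.org")))
    print(classify(Post("dev@example.org", "alice@example.org")))
```

Because the check works at the list and sender level, a flagged sender's hand-written emails are still counted as automated, which is the kind of imprecision a per-email check would remove.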
This is a first approximation and can still use some refinement because (a) some email lists carry both automated and non-automated emails and (b) some sender email addresses are used to send both automated and non-automated emails. In the future, it would be nice to identify automated emails at the per-email level. For example, there are quite a few automated git emails on the perl5-porters list, which contains both user and automated emails. Further, the git emails can be sent from email addresses that are also used to send non-automated emails. For now, emails from these users to non-flagged lists are classified as User posts, though some should be flagged as automated in the future.
An Example Comparison
To get a feel for this feature, check out the analytics for Perl and Python, both of which have a number of automated lists and users. The numbers show that both the Perl.org and Python.org communities are making greater use of automated systems to communicate, so separating out these posts gives better insight into user participation.
By mailing list domain:
By subject across domains:
One advantage of looking at reports by subject is that they can (a) aggregate lists across domains, e.g. perl.org and Catalyst lists, and (b) filter lists within a domain, e.g. Django and MongoDB, both hosted by Google Groups.
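To make the difference between the two groupings concrete, here is a small Python sketch. The list names, the `catalyst.example` domain, and the post counts are made up for illustration, and `totals_by` is a hypothetical helper, not part of Grokbase.

```python
from collections import Counter

# Hypothetical list metadata: each list has a hosting domain and a subject.
LISTS = {
    "perl5-porters": {"domain": "perl.org", "subject": "Perl"},
    "catalyst": {"domain": "catalyst.example", "subject": "Perl"},
    "django-users": {"domain": "googlegroups.com", "subject": "Python"},
    "mongodb-user": {"domain": "googlegroups.com", "subject": "MongoDB"},
}

# Made-up post counts per list.
POST_COUNTS = {
    "perl5-porters": 120,
    "catalyst": 30,
    "django-users": 80,
    "mongodb-user": 50,
}


def totals_by(key, lists=LISTS, counts=POST_COUNTS):
    """Sum post counts grouped by 'domain' or 'subject'."""
    totals = Counter()
    for name, meta in lists.items():
        totals[meta[key]] += counts.get(name, 0)
    return dict(totals)


if __name__ == "__main__":
    print(totals_by("domain"))
    print(totals_by("subject"))
```

Grouping by subject merges the two Perl lists even though they live on different domains, while the two Google Groups lists end up under separate subjects.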
When looking at the Python subject stats, you will see a spike in activity around May 2012. This does not reflect actual new users, only newly archived ones: the Django archive starts around that time and is aggregated into the Python subject.
There was very little actual development for this, as the feature consisted of combining the analytics system with the automated user/list classification system. Specifically, more automated lists were identified and the analytics system was refactored to be more flexible.
The automated user/list classification system was initially developed to identify automated users, aka agents, which are segregated in the user list and badge lists. After the automated user/list system was updated, some additional APIs were added and the analytics system was refactored to easily accommodate these filters. Generic interfaces are used to enable fast addition of new features, when applicable.
Please check it out and let me know what you think and how it could be of more use.