Estimating Gender Diversity in your Organization with the Azure AD Graph
Your directory data holds a treasure trove of insights, which are now exceptionally easy to access thanks to the Graph API layer on top of it.
A few weeks ago I was wondering what question I could answer about my own organization with the Directory Graph alone, and I quickly landed on a great candidate: gender ratios.
I *love* working in our industry, but if there’s one thing I loathe, it is the dramatic gender imbalance that is almost an absolute invariant everywhere I look. In my past as a consultant I often worked for non-IT shops, where the ratio wasn’t as skewed, and loved the atmosphere – I can’t quite put my finger on it, but activities and collaboration seemed to take on a more balanced quality… healthier is the adjective that comes to mind.
I know that the industry leaders are hard at work to correct this and many other imbalances, and the growing awareness of the problem gives me hope for the future. Still, I thought it would be fun to play with data and verify whether some conjectures of mine were actually backed by numbers: are female managers more likely to lead orgs with more balanced gender ratios? Is it true that non-technical disciplines have better ratios? Are there specific business functions where the ratios are reversed? And so on, and so forth.
It goes without saying that I will NOT be sharing any data whatsoever about Microsoft here. Besides the fact that the estimates might be wildly inaccurate, it is absolutely not my place to talk about the company. Microsoft does an excellent job in maintaining transparency on workforce demographics.
What I am going to share, however, is the source code and the methodology I used for running my little experiments. If you have Office 365 or any other Microsoft cloud service, you can run the code in your organization and get back an estimate of the gender mix of the reports of any user of your choice. You just need to download the code, compile and run – that works even if you are not an administrator. For the time being you need Windows, but if there’s interest I might port the app to .NET Core – which would allow you to run on Mac and Linux as well.
Let me stress that I make no guarantees about the precision of the resulting estimates, nor do I guarantee that my code is bug-free. Please consider this simply as the chronicle of an afternoon spent geeking out with Azure AD, and my modest contribution to raising awareness around gender imbalance in tech.
Ready? Let’s dive!
The methodology
Where to begin? Let’s see. With the Directory Graph, I know I can crawl through the entire report structure of anybody. I can get the User object of the manager of the org I want to analyze, for example via his/her userPrincipalName: then I can recursively analyze all the User objects in the /directReports property of all subtrees. So I can get all the users in the entire sub-org – that part is easy.
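To make the recursion concrete, here is a minimal Python sketch of the crawl. It is not the app's actual code: `crawl_org` and `fetch_reports` are hypothetical names, and a toy in-memory directory stands in for the real Graph query on the /directReports navigation property.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical stand-in for the Graph query that returns the direct
# reports of a user, identified here by userPrincipalName.
FetchReports = Callable[[str], List[dict]]


def crawl_org(manager_upn: str, fetch_reports: FetchReports) -> Iterator[dict]:
    """Yield every user below manager_upn, depth first."""
    for report in fetch_reports(manager_upn):
        yield report
        # Recurse into the subtree rooted at this report.
        yield from crawl_org(report["userPrincipalName"], fetch_reports)


# Toy directory (made-up users) used instead of a real tenant.
FAKE_DIRECTORY: Dict[str, List[dict]] = {
    "boss@contoso.com": [
        {"userPrincipalName": "ann@contoso.com", "givenName": "Ann"},
        {"userPrincipalName": "bob@contoso.com", "givenName": "Bob"},
    ],
    "ann@contoso.com": [
        {"userPrincipalName": "cy@contoso.com", "givenName": "Cy"},
    ],
}


def fake_fetch(upn: str) -> List[dict]:
    # Users with no entry in the toy directory have no reports.
    return FAKE_DIRECTORY.get(upn, [])


org = list(crawl_org("boss@contoso.com", fake_fetch))
print([user["givenName"] for user in org])  # ['Ann', 'Cy', 'Bob']
```

Passing the fetch function in as a parameter keeps the traversal independent of how the directory is actually queried.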
How to tell the gender of a User? There is no Gender property readily available in the default schema. The obvious alternative is… the user’s first name, naturally. We do have that, under the /GivenName property.
But wait, it’s not that simple! Certain names are male in some countries and female in others. For example, Andrea is THE most common name for a boy in Italy. At the same time, it is a super common female name in Germany – in the 60s it was in the top 10. Hence, we had better include country information in our estimate of gender from first name. The /country property exists in the Directory Graph, but it is often not populated. There is, however, another country-dependent property that is way more likely to be populated: telephoneNumber. Never mind that Americans often tend to omit the country code!
Let’s take a step back and see how far we’ve got. Our estimate of whether a User is male or female is going to be based on the User’s GivenName and Country(telephoneNumber). Sounds promising, but this is by no means perfect.
Even if we manage to obtain the country information from the User’s telephone number, all we are getting is the country from which the user is operating. An Andrea working in the USA might be a male migrant from Italy, or a female from a second-generation German migrant family. Any estimate should take into account which cases are most prevalent for each name and given country, which brings us straight into the realm of frequencies (and saddles us with the task of finding a source of info for those figures).
Compound this with the fact that there are some names which stubbornly defy classification: Robin, Kim, Casey, Yi, Rama, Jamie and so on. Those cases, too, point to the need to base our estimates on probabilities rather than static classifications.
Here is the idea that made me feel very clever, at least for a few minutes: what if I just crawled Facebook’s public data and built a database of first names per country, tracking the frequency with which users self-declare their gender? That would by no means carry any guarantee of significance, but it would definitely be an improvement over static analysis!
Hitting the internet for inspiration, I soon stumbled on https://genderize.io/ – an awesome public service that already did the crawling for us, across major social networks, and helpfully exposes its database through a super convenient API. The API offers a very generous allowance of 1000 free name queries per day. I wanted to do some heavy duty work (and a lot of debugging), hence I bought the PLUS package (and I am now super worried about looking silly by leaking the key on GitHub!) but I am sure you can do a lot of experimentation with the free tier.
The API also offers a helpful probability associated with each gender estimate, plus the number of entries from which the estimate was extracted. That allows you to place confidence thresholds in your own evaluations if you so choose. Here is an example. Say that we want the estimate for Andrea in Italy. The call is very simple:
[sourcecode language='text' ] GET https://api.genderize.io/?name=andrea&country_id=it [/sourcecode]
The response looks like the following:
[sourcecode language='javascript' padlinenumbers='true'] {"name":"andrea","gender":"male","probability":"0.99","count":1070,"country_id":"it"} [/sourcecode]
That’s a pretty high estimate! It’s not a 1.0 probability because there are various Andreas from Germany or other countries living in Italy – I know a few myself. Let’s check Andrea in the US, though:
[sourcecode language='javascript' ] {"name":"andrea","gender":"female","probability":"0.97","count":2308,"country_id":"us"} [/sourcecode]
Yep. It looks like the “Andrea” in the US are mostly from countries where the name is female… or at least, that’s what people self-declare on social networks. Of course, to be precise we should also take into account any gender differences in social network usage… and compare the count to the total sample per country. But let’s not go too far: what we have now seems good enough for playing.
Before I move on to describe the (simple) app that implements the above, I want to call out one last detail. Parsing phone numbers is a surprisingly complicated task, given that every country has its own formatting rules. Luckily there are a number of libraries that can help, the most famous probably being Google’s libphonenumber. Patrick Mézard and Aidan Bebbington nicely ported it to C# and made it available as a NuGet package, which I promptly used in the project. This NuGet package is the main reason I did not write this directly in .NET Core – out of laziness, really. But if there’s interest we can always course-correct and port it!
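For reference, here is a small Python sketch of how the call and the response shown above could be handled. It makes no network request: it only builds the URL and parses the sample body. The helper names `build_query` and `parse_estimate` are hypothetical, and the handling of missing fields is an assumption about responses for unknown names.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://api.genderize.io/"


def build_query(name: str, country_id: str) -> str:
    """Build the GET URL for a name/country pair."""
    params = {"name": name.lower(), "country_id": country_id.lower()}
    return API_ROOT + "?" + urlencode(params)


def parse_estimate(body: str) -> dict:
    """Normalize a response body into typed fields."""
    raw = json.loads(body)
    return {
        "name": raw["name"],
        # Assumption: gender may be missing or null for unknown names.
        "gender": raw.get("gender"),
        # In the sample above the probability arrives as a string.
        "probability": float(raw.get("probability") or 0.0),
        "count": int(raw.get("count") or 0),
    }


sample = ('{"name":"andrea","gender":"male","probability":"0.99",'
          '"count":1070,"country_id":"it"}')

print(build_query("Andrea", "IT"))
# https://api.genderize.io/?name=andrea&country_id=it
print(parse_estimate(sample)["gender"])  # male
```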
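To give a feel for the problem, here is a deliberately naive Python sketch that maps a telephoneNumber to a country code. It is only an illustration, not what libphonenumber does: the calling-code table is a tiny hand-picked subset, "+1" is shared by several countries and is mapped to "us" purely for the example, and `country_from_phone` is a hypothetical helper.

```python
import re
from typing import Optional

# Tiny illustrative subset of calling codes. A real table is far larger,
# and "+1" alone covers the US, Canada and others.
CALLING_CODES = {"1": "us", "39": "it", "44": "gb", "49": "de"}


def country_from_phone(phone: Optional[str],
                       default: Optional[str] = None) -> Optional[str]:
    """Guess a country from the international prefix of a phone number."""
    if not phone:
        return default
    stripped = phone.strip()
    if not stripped.startswith("+"):
        # No country code (the American habit): fall back to the default.
        return default
    digits = re.sub(r"\D", "", stripped)
    # Calling codes are 1 to 3 digits long and none is a prefix of another.
    for length in (1, 2, 3):
        code = digits[:length]
        if code in CALLING_CODES:
            return CALLING_CODES[code]
    return default


print(country_from_phone("+39 06 1234 5678"))               # it
print(country_from_phone("+1 (425) 555-0100"))              # us
print(country_from_phone("(425) 555-0100", default="us"))   # us
```

Anything beyond this (trunk prefixes, extensions, shared codes) is exactly why a dedicated library is the better choice.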
The app
The application is a simple console app, which can be found in this repo. At launch, it takes as input the userPrincipalName (which can be different from the email, beware) of the manager whose org you want to analyze. If it’s the first time you launch the app (or it’s been a while since you last ran it) you’ll get prompted for credentials – make sure you use an account that belongs to the directory you want to work with. I chose the console app format for 2 reasons:
– It can be modeled as a native client, which is automatically multitenant and can access the directory as the signed in user. That means that the app requires no setup in your directory, and does NOT require an admin to function. That’s pretty much equivalent to a user running an LDAP read query in a classic onprem AD. Modeling the app as a multitenant web app would have required admin consent for gaining directory read rights, which would have greatly limited the number of people that can run this analysis.
– It has no UX requirement, which makes it runnable on headless boxes – and above all, makes it easy to port to Linux and Mac via .NET Core. You are going to do your analysis in Excel anyway, hence even if I had thrown in a couple of pie charts I would not have added much real value to the insights you can get from the textual output.
Once it gets a valid token, the app caches it for future uses (in a file, token.dat, that can be decrypted only on the machine it’s been generated on – but be careful with it anyway). Then it passes it to a factory for the class Account, a wrapper that uses the Graph API to retrieve the first name and telephone number (hence, the country) of the manager.
That done, the app calls the Account class method GetGenderMix, which:
- Assesses the gender of the Account, by passing the first name and the country to genderize.io. This is all done via a proxy class, GenderizeProxy, which handles indefinite cases and, above all, caches results so that subsequent estimates of a known name-country couple won’t result in a network hit. Before doing that I verified that the ToS of genderize.io does not prevent that, and those guys are super chill about any use. Of course I would not do this if the cache were shared across multiple users, as would be the case in a web app, but here every machine running the console app would build its own cache – hence it still seems fair.
- Retrieves all the reports of the current Account via Graph API and the navigation property /directReports. Then, calls itself on each report.
- Once the recursive calls have exhausted their run, aggregates the gender figures in the Males/Females/Undefined/Contacts accumulators. “Undefined” is anything that didn’t lead to an estimate, whether because genderize.io didn’t have any record of the name or because the estimate quality didn’t meet the bar (see below about thresholds). “Contacts” is a special case in which a Contact entity (as opposed to a User) is returned as a report – given that the properties there are different and the case isn’t all that frequent, I don’t use it in the final male/female tallies.
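The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a port of the app: the `GenderMix` class, the `fetch_reports` and `estimate` callables, the `id` key and the use of an `objectType` field to tell Users from Contacts are all choices made for the example, and the sketch counts the account itself in its own tally.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class GenderMix:
    males: int = 0
    females: int = 0
    undefined: int = 0
    contacts: int = 0

    def add(self, other: "GenderMix") -> None:
        self.males += other.males
        self.females += other.females
        self.undefined += other.undefined
        self.contacts += other.contacts


def get_gender_mix(entry: dict,
                   fetch_reports: Callable[[dict], List[dict]],
                   estimate: Callable[[str], Optional[str]]) -> GenderMix:
    mix = GenderMix()
    if entry.get("objectType") == "Contact":
        # Contacts are tallied separately and not classified.
        mix.contacts = 1
        return mix
    # Step 1: assess the gender of the account itself.
    gender = estimate(entry.get("givenName", ""))
    if gender == "male":
        mix.males = 1
    elif gender == "female":
        mix.females = 1
    else:
        mix.undefined = 1
    # Steps 2 and 3: recurse into each report and aggregate the results.
    for report in fetch_reports(entry):
        mix.add(get_gender_mix(report, fetch_reports, estimate))
    return mix


# Toy org and toy estimates (made up for the example).
TREE: Dict[str, List[dict]] = {
    "maria": [{"id": "luca", "objectType": "User", "givenName": "Luca"},
              {"id": "vendor", "objectType": "Contact"}],
    "luca": [{"id": "robin", "objectType": "User", "givenName": "Robin"}],
}
GUESSES = {"Maria": "female", "Luca": "male"}

root = {"id": "maria", "objectType": "User", "givenName": "Maria"}
mix = get_gender_mix(root,
                     lambda e: TREE.get(e["id"], []),
                     lambda name: GUESSES.get(name))
print(mix)  # GenderMix(males=1, females=1, undefined=1, contacts=1)
```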
Note that because of the way the classes are structured today (an access token is passed in, instead of being obtained via AcquireToken* in Account) the execution time of GetGenderMix cannot exceed 1 hour, the validity window of an Azure AD issued access token. Again, an easy fix that I didn’t put in out of sheer laziness and a desire to see things working ASAP.
One thing to notice about GenderizeProxy is that it allows you to specify confidence thresholds, like ProbabilityThreshold (the minimum probability an estimate must reach; anything below it is considered indefinite instead of being accepted as the proposed gender guess) and CountThreshold (the minimum number of entries from social networks an estimate must be based on to be considered reliable). Playing with those thresholds can change the numbers pretty significantly, although in my experience the ratios often remain surprisingly stable.
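Here is a rough Python sketch of what a caching proxy with thresholds could look like. It is not GenderizeProxy itself: the class name, the default threshold values and the injected `lookup` function are assumptions, and the "zyx" entry in the fake data is invented purely to exercise the thresholds (the two "andrea" entries reuse the figures from the sample responses earlier).

```python
from typing import Callable, Dict, Optional, Tuple


class CachingEstimator:
    """Caches raw estimates per (name, country) and applies thresholds."""

    def __init__(self, lookup: Callable[[str, str], dict],
                 probability_threshold: float = 0.8,
                 count_threshold: int = 10) -> None:
        self._lookup = lookup
        self._cache: Dict[Tuple[str, str], dict] = {}
        self.probability_threshold = probability_threshold
        self.count_threshold = count_threshold
        self.lookups = 0  # number of cache misses, for illustration

    def estimate(self, name: str, country: str) -> Optional[str]:
        key = (name.lower(), country.lower())
        if key not in self._cache:
            # Only unseen name-country pairs cost a (network) lookup.
            self._cache[key] = self._lookup(*key)
            self.lookups += 1
        raw = self._cache[key]
        if raw.get("gender") is None:
            return None  # the service has no record of the name
        if raw["probability"] < self.probability_threshold:
            return None  # not confident enough
        if raw["count"] < self.count_threshold:
            return None  # based on too few entries
        return raw["gender"]


FAKE_DATA = {
    ("andrea", "it"): {"gender": "male", "probability": 0.99, "count": 1070},
    ("andrea", "us"): {"gender": "female", "probability": 0.97, "count": 2308},
    ("zyx", "us"): {"gender": "male", "probability": 0.6, "count": 3},
}


def fake_lookup(name: str, country: str) -> dict:
    return FAKE_DATA.get((name, country),
                         {"gender": None, "probability": 0.0, "count": 0})


proxy = CachingEstimator(fake_lookup)
print(proxy.estimate("Andrea", "IT"))  # male
print(proxy.estimate("andrea", "it"))  # male, served from the cache
print(proxy.estimate("Zyx", "US"))     # None, below both thresholds
print(proxy.lookups)                   # 2
```

Caching the raw estimate rather than the thresholded verdict means the thresholds can be changed later without querying the service again.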
Once GetGenderMix has finished, the app spits out some generic results on the console, and saves a CSV file with all the gender mixes for all the managers in the org you analyzed. Just double click on it, and have a ball in Excel to find correlations and interesting numbers (hint: I found AVERAGEIF super useful).
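As an illustration of the kind of file this produces, here is a short Python sketch that writes one row per manager. The column names, the extra `female_share` column and the numbers are all made up for the example; the app's actual CSV layout may differ.

```python
import csv
import io

FIELDS = ["manager", "males", "females", "undefined", "contacts", "female_share"]

# Made-up tallies, one row per manager.
rows = [
    {"manager": "boss@contoso.com", "males": 6, "females": 4,
     "undefined": 2, "contacts": 1},
    {"manager": "ann@contoso.com", "males": 0, "females": 0,
     "undefined": 1, "contacts": 0},
]


def write_mixes(rows, out) -> None:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        classified = row["males"] + row["females"]
        # Leave the share empty when nobody could be classified.
        share = row["females"] / classified if classified else ""
        writer.writerow({**row, "female_share": share})


buffer = io.StringIO()  # swap for open("mixes.csv", "w", newline="") to save
write_mixes(rows, buffer)
print(buffer.getvalue())
```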
Give it a Spin
As should be abundantly clear at this point, this little app is by no means guaranteed to offer a precise assessment of the gender mix in your organization. It certainly doesn’t hold a candle to what your HR already knows with zero uncertainty margin, and always keeping that in mind is certainly a healthy thing.
That said, I *love* how empowering this thing is. As mentioned, I personally used it for verifying some theories I had – in some cases they did pan out, in some others I was surprised to be proven wrong, but the metapoint here is that they got me to think about the problem. I hope this will get you to think more deeply about gender imbalance too – and if you learn how to play with Azure AD in the process, I can’t say I’ll be disappointed.