Introduction
A well-known best practice when writing commit messages in Git is to use the imperative mood. This can be traced back to Git's documentation. To summarize it here:
Describe your changes in imperative mood, e.g. "make xyzzy do frotz" instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy to do frotz", as if you are giving orders to the codebase to change its behavior.
Some examples of commit messages written in the imperative mood are:
- Bump version to 1.0
- Add .gitignore
- Refactor product repository for functional isolation and clarity
- Merge branch 'master'
- Remove unneeded tests
- Fix bug preventing menu from sliding out on mobile
Notice how each commit message starts with a verb in the present tense. This helps describe the purpose of each commit in a clear and concise way. It also helps standardize the format of commit messages in general.
In this article, we'll explore how frequently developers adhere to this rule by estimating the percentage of commit messages that use the imperative mood.
We will do this by combining the forces of two powerful public datasets from Google BigQuery. The first is the GitHub Activity Data dataset that contains data from almost 3 million Git repositories. The second is the GDELT Web Part of Speech dataset, which contains more than 101 billion language tokens extracted, analyzed, and tagged from global web activity using Google's Natural Language API. We will link these two datasets to roughly estimate the percentage of Git commits that use the imperative mood.
For a primer on using Google BigQuery to analyze a simpler problem, check out my previous article What is the most popular initial commit message in Git? before reading this one.
Dataset #1: GitHub Activity Data
In my previous article, I used the GitHub Activity Data dataset to find the most popular initial commit messages in Git. This was quite simple because all of the required data lives in a single table (the commits table) in a single database (the bigquery-public-data.github_repos database).
As a refresher, the public bigquery-public-data.github_repos database contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. In this article, we will again make use of commit message data from the commits table for our analysis. Our goal is to extract the commit messages from the message field of the commits table and determine what percentage of them use the imperative mood.
To get things started, we can get the total number of non-empty commit messages in the dataset (authored between January 1st 2000 and April 2nd 2020, the range covered by the two Unix timestamps in the query) by running the following query:
SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800
  AND author.date.seconds <= 1585800000
  AND LENGTH(TRIM(LOWER(message))) > 0;
This yields a result of 237,447,598 total commits.
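The two numeric bounds in the query are Unix timestamps (seconds since 1970-01-01 UTC). The following is a minimal Python sketch, not part of the original analysis, for checking which calendar dates they correspond to. The helper name to_utc is made up for illustration.

```python
from datetime import datetime, timezone

# Bounds copied from the WHERE clause of the query above.
LOWER_BOUND = 946684800
UPPER_BOUND = 1585800000


def to_utc(seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(to_utc(LOWER_BOUND).isoformat())  # 2000-01-01T00:00:00+00:00
    print(to_utc(UPPER_BOUND).isoformat())  # 2020-04-02T04:00:00+00:00
```

Because the comparison is on author.date.seconds, the range is applied to the author timestamp recorded in each commit, which may differ from the time the commit was pushed.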
Dataset #2: The GDELT Web Part of Speech Dataset
At this point we need a way to identify whether or not each commit message in the commits table uses the imperative mood. This is where the GDELT Web Part of Speech dataset comes in. This dataset includes a table called web_pos, in which each record represents a language token extracted from an online source between 2016 and 2020. The records come from sources in dozens of languages. For our purposes, a language token is a single word such as a noun, verb, or adjective.
Here are a few of the most useful fields in the web_pos table, many of which we will make use of:
- The date that the source of the token was published
- The token text itself (in our case a single word)
- The language of the token
- A tag representing the token type (VERB, NOUN, ADJ, NUM, PUNCT, etc)
- The tense of the token (PAST, PRESENT, FUTURE, PLUPERFECT)
- The mood of the token (INDICATIVE, IMPERATIVE, SUBJUNCTIVE, INTERROGATIVE)
- The URL of the token's source
Assumptions and Method
We will make the imperfect assumption that for a commit message to be of the imperative mood, the first word in the commit message must be a present tense, imperative verb. Luckily, Google BigQuery allows the joining of data from multiple unrelated datasets in a single SQL query. This allows us to write the following query which accesses both datasets and returns a count of commit messages that have a present tense, imperative verb as the first word:
SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800
  AND author.date.seconds <= 1585800000
  AND LENGTH(TRIM(LOWER(message))) > 0
  -- Regular expression to match the first word of each commit message
  AND LOWER(REGEXP_EXTRACT(message, r'\w+')) IN (
    SELECT LOWER(token)
    FROM `gdelt-bq.gdeltv2.web_pos`
    WHERE lang = 'en'              -- Only match English tokens
      AND posTag = 'VERB'          -- Only match VERBs
      AND posMood = 'IMPERATIVE'   -- Only match IMPERATIVE mood
      AND posTense = 'PRESENT'     -- Only match PRESENT tense
      -- Filter out plural tokens, unless they end in a double S
      AND (LOWER(SUBSTR(token, -1)) != 's' OR LOWER(SUBSTR(token, -2)) = 'ss')
    GROUP BY LOWER(token)
  );
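To make the matching rule concrete, here is a small local sketch of the same logic in Python rather than BigQuery SQL. It is an illustration under stated assumptions, not the code used for the analysis: the token list is a tiny hand-picked stand-in for the result of the web_pos subquery, and the function names are hypothetical.

```python
import re
from typing import Optional, Set

# Hypothetical stand-in for the tokens returned by the web_pos subquery.
SAMPLE_IMPERATIVE_TOKENS = ["Add", "fix", "Merge", "remove", "bump", "pass", "fixes"]

# re.ASCII keeps \w ASCII-only, which approximates BigQuery's RE2 engine.
FIRST_WORD = re.compile(r"\w+", re.ASCII)


def keep_token(token: str) -> bool:
    """Mirror the query's filter: drop tokens ending in 's' unless they end in 'ss'."""
    lowered = token.lower()
    return not lowered.endswith("s") or lowered.endswith("ss")


def first_word(message: str) -> Optional[str]:
    """Return the first run of word characters, lowercased, or None if there is none."""
    match = FIRST_WORD.search(message)
    return match.group(0).lower() if match else None


def starts_with_imperative(message: str, verbs: Set[str]) -> bool:
    """True when the first word of the message is in the set of accepted verbs."""
    return first_word(message) in verbs


# Equivalent of the subquery: lowercase, filter, and de-duplicate the tokens.
VERBS = {t.lower() for t in SAMPLE_IMPERATIVE_TOKENS if keep_token(t)}

if __name__ == "__main__":
    for msg in ["Bump version to 1.0", "Fixes typo", "Updated README"]:
        print(msg, "->", starts_with_imperative(msg, VERBS))
```

With this sample list, "fixes" is dropped by the trailing-s filter while "pass" survives because it ends in a double s, which is the distinction the last condition of the subquery is making.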
Results
The resulting output of the above query is 104,057,902 commits. Dividing this by 237,447,598 (the total number of commits we calculated above) yields 43.8%. Therefore, we can estimate that approximately 44% of commit messages in the GitHub dataset use the imperative mood.
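The division can be reproduced in a few lines of Python. The two counts are the ones reported in this article; the function name percentage is just an illustrative helper.

```python
# Counts reported by the two BigQuery queries in this article.
MATCHING_COMMITS = 104_057_902
TOTAL_COMMITS = 237_447_598


def percentage(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


if __name__ == "__main__":
    print(f"{percentage(MATCHING_COMMITS, TOTAL_COMMITS):.1f}%")  # 43.8%
```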
Keep in mind, there are several aspects of this method that introduce error in the calculation. Oftentimes, the beginning of a commit message contains noise such as a ticket number, story ID, build tool stamp, or some other arbitrary tag data. In these cases, the REGEXP_EXTRACT(message, r'\w+') function will pick out the first word it comes across in that tag, even if the intended starting point of an imperative mood verb appears later in the commit message. I suspect this will lead to a noticeable under-counting of the actual number of imperative mood commit messages in the dataset.
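The effect of leading tags can be seen in a short Python sketch. The first function mimics the \w+ extraction; the second applies a hypothetical prefix-stripping heuristic that is not part of the query in this article and has not been evaluated against the dataset. The tag patterns (bracketed tags and IDs shaped like ABC-42) and the sample messages are assumptions chosen for illustration.

```python
import re
from typing import Optional

# re.ASCII keeps \w ASCII-only, approximating BigQuery's RE2 behaviour.
FIRST_WORD = re.compile(r"\w+", re.ASCII)

# Hypothetical heuristic: skip leading "[...]" tags and ticket IDs like "ABC-42:".
LEADING_TAG = re.compile(r"^\s*(?:\[[^\]]*\]\s*|[A-Z]+-\d+[:\s]+)+")


def first_word(message: str) -> Optional[str]:
    """First run of word characters, lowercased (what the query compares)."""
    match = FIRST_WORD.search(message)
    return match.group(0).lower() if match else None


def first_word_after_tags(message: str) -> Optional[str]:
    """Same extraction, after removing any leading tag prefix."""
    return first_word(LEADING_TAG.sub("", message, count=1))


if __name__ == "__main__":
    samples = [
        "[JIRA-123] Fix bug preventing menu from sliding out on mobile",
        "ABC-42: Add .gitignore",
        "Remove unneeded tests",
    ]
    for msg in samples:
        print(first_word(msg), "|", first_word_after_tags(msg))
```

For the first two sample messages the plain extraction returns the tag text ("jira", "abc") rather than the verb, which is the kind of miss that would lead to under-counting.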
Furthermore, the natural language database has about 4,000 unique present tense verbs labelled as imperative. After a quick Google search, I believe there are significantly more verbs that can be used in the imperative mood, so it's possible that a longer list would produce more matches against the commit message data. However, I have a feeling that the imperative verbs programmers typically use in commit messages (like fix, merge, bump, add, modify, etc) are relatively common ones that are well represented by the current set of 4,000.
If you have any thoughts on how to make my query more accurate, feel free to shoot me an email.
Conclusion
In this article, we used Google BigQuery to access two public datasets that enabled us to estimate the percentage of Git commits that use the imperative mood.