Introduction
A well-known best practice when writing commit messages in Git is to use the imperative mood. This can be traced back to Git's documentation. To summarize it here:
Describe your changes in imperative mood, e.g. "make xyzzy do frotz" instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy to do frotz", as if you are giving orders to the codebase to change its behavior.
Some examples of commit messages written in the imperative mood are:
- Bump version to 1.0
- Add .gitignore
- Refactor product repository for functional isolation and clarity
- Merge branch 'master'
- Remove unneeded tests
- Fix bug preventing menu from sliding out on mobile
Notice how each commit message starts with a verb in the present tense. This helps describe the purpose of each commit in a clear and concise way. It also helps standardize the format of commit messages in general.
In this article, we'll explore how frequently developers adhere to this rule by estimating the percentage of commit messages that use the imperative mood.
We will do this by combining the forces of two powerful public datasets from Google BigQuery. The first is the GitHub Activity Data dataset that contains data from almost 3 million Git repositories. The second is the GDELT Web Part of Speech dataset, which contains more than 101 billion language tokens extracted, analyzed, and tagged from global web activity using Google's Natural Language API. We will link these two datasets to roughly estimate the percentage of Git commits that use the imperative mood.
For a primer on using Google BigQuery to analyze a simpler problem, check out my previous article What is the most popular initial commit message in Git? before reading this one.
Dataset #1: GitHub Activity Data
In my previous article, I used the GitHub Activity Data dataset to find the most popular initial commit messages in Git. This was quite simple because all of the required data lives in a single table (the commits table) in a single database (the bigquery-public-data.github_repos database).
As a refresher, the public bigquery-public-data.github_repos database contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. In this article, we will again make use of commit message data from the commits table for our analysis. Our goal will be to extract the commit messages from the message field of the commits table, and try to determine what percentage of the commits use the imperative mood.
To get things started, we can easily get the total number of non-empty commit messages in the dataset (between January 1st 2000 and April 2nd 2020, the dates that correspond to the Unix timestamps in the query) by running the following query:
SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800
AND author.date.seconds <= 1585800000
AND LENGTH(TRIM(LOWER(message))) > 0;
This yields a result of 237,447,598 total commits.
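The date bounds in the query are expressed as Unix epoch seconds, which are hard to read at a glance. As a quick sanity check, here is a minimal Python sketch (not part of the original analysis; the constant and function names are my own) that converts the two bounds to UTC dates:

```python
from datetime import datetime, timezone

# Bounds copied from the WHERE clause of the query above.
START_SECONDS = 946684800
END_SECONDS = 1585800000


def to_utc(seconds: int) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


start = to_utc(START_SECONDS)
end = to_utc(END_SECONDS)

print(start.isoformat())  # 2000-01-01T00:00:00+00:00
print(end.isoformat())    # 2020-04-02T04:00:00+00:00
```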
Dataset #2: The GDELT Web Part of Speech Dataset
At this point we need a way to identify whether or not each commit message in the commits table uses the imperative mood. This is where the GDELT Web Part of Speech dataset comes in. This dataset includes a table called web_pos, in which each record represents a language token extracted from an online source between 2016 and 2020. The records come from sources in dozens of languages. For our purposes, a language token is a single word such as a noun, verb, or adjective.
Here are a few of the most useful fields in the web_pos table, many of which we will make use of:
- The date that the source of the token was published
- The token text itself (in our case a single word)
- The language of the token
- A tag representing the token type (VERB, NOUN, ADJ, NUM, PUNCT, etc.)
- The tense of the token (PAST, PRESENT, FUTURE, PLUPERFECT)
- The mood of the token (INDICATIVE, IMPERATIVE, SUBJUNCTIVE, INTERROGATIVE)
- The URL of the token's source
Assumptions and Method
We will make the imperfect assumption that for a commit message to be of the imperative mood, the first word in the commit message must be a present tense, imperative verb. Luckily, Google BigQuery allows the joining of data from multiple unrelated datasets in a single SQL query. This allows us to write the following query which accesses both datasets and returns a count of commit messages that have a present tense, imperative verb as the first word:
SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800
AND author.date.seconds <= 1585800000
AND LENGTH(TRIM(LOWER(message))) > 0
-- Regular expression to match the first word of each commit message
AND LOWER(REGEXP_EXTRACT(message, r'\w+')) IN (
  SELECT LOWER(token)
  FROM `gdelt-bq.gdeltv2.web_pos`
  WHERE lang = 'en' -- Only match English tokens
  AND posTag = 'VERB' -- Only match VERBs
  AND posMood = 'IMPERATIVE' -- Only match IMPERATIVE mood
  AND posTense = 'PRESENT' -- Only match PRESENT tense
  -- Filter out tokens ending in "s" (e.g. "fixes"), unless they end in a double "s" (e.g. "pass")
  AND (LOWER(SUBSTR(token, -1)) != 's' OR LOWER(SUBSTR(token, -2)) = 'ss')
  GROUP BY LOWER(token) -- Deduplicate the verb list
);
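To make the matching logic easier to follow without BigQuery access, here is a rough Python sketch of the same heuristic. It is only an approximation under stated assumptions: the small IMPERATIVE_VERBS set is a hypothetical stand-in for the much larger list the subquery returns from web_pos, and the helper names are my own.

```python
import re
from typing import Optional, Set

# Hypothetical stand-in for the verb list produced by the subquery.
# "fixes" is included on purpose to show the trailing-"s" filter at work.
IMPERATIVE_VERBS = {"add", "bump", "fix", "fixes", "merge", "pass", "refactor", "remove"}


def keep_token(token: str) -> bool:
    """Mirror the SQL filter: drop tokens ending in 's' unless they end in 'ss'."""
    lowered = token.lower()
    return not lowered.endswith("s") or lowered.endswith("ss")


def first_word(message: str) -> Optional[str]:
    """Return the first run of word characters, lowercased, or None if there is none.

    re.ASCII keeps \\w limited to [A-Za-z0-9_], which is closer to how
    BigQuery's RE2-based REGEXP_EXTRACT treats \\w.
    """
    match = re.search(r"\w+", message, flags=re.ASCII)
    return match.group(0).lower() if match else None


def looks_imperative(message: str, verbs: Set[str]) -> bool:
    """Apply the first-word heuristic to a single commit message."""
    allowed = {verb.lower() for verb in verbs if keep_token(verb)}
    word = first_word(message)
    return word is not None and word in allowed


messages = [
    "Bump version to 1.0",
    "Pass context to handler",
    "Fixes crash on startup",
    "Fixed typo",
]
for message in messages:
    print(f"{message!r}: {looks_imperative(message, IMPERATIVE_VERBS)}")
```

With this toy verb list, the first two messages match, while "Fixes ..." is rejected by the trailing-"s" filter and "Fixed ..." is rejected because "fixed" is not in the list.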
Results
The resulting output of the above query is 104,057,902 commits. Dividing this by 237,447,598 (the total number of commits we calculated above) yields 43.8%. Therefore, we can estimate that approximately 44% of commit messages in the GitHub dataset use the imperative mood.
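For anyone who wants to reproduce the division, the arithmetic can be checked in a few lines of Python. The two counts are the query results reported in this article; the variable names are my own.

```python
# Counts reported by the two BigQuery queries in this article.
IMPERATIVE_COMMITS = 104_057_902
TOTAL_COMMITS = 237_447_598

share = IMPERATIVE_COMMITS / TOTAL_COMMITS
print(f"{share:.1%}")  # 43.8%
```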
Keep in mind, there are several aspects of this method that introduce error in the calculation. Oftentimes, the beginning of a commit message contains noise such as a ticket number, story ID, build tool stamp, or some other arbitrary tag data. In these cases, the REGEXP_EXTRACT(message, r'\w+') function will pick out the first word it comes across in that tag, even if the intended starting point of an imperative mood verb appears later in the commit message. I suspect this will lead to a noticeable under-counting of the actual number of imperative mood commit messages in the dataset.
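The prefix problem is easy to see with a small Python sketch that mimics the first-word extraction. The ticket ID and tag below are made-up examples, and first_word is my own helper rather than anything from the query itself:

```python
import re


def first_word(message):
    """Approximate LOWER(REGEXP_EXTRACT(message, r'\\w+')) for one message."""
    match = re.search(r"\w+", message, flags=re.ASCII)
    return match.group(0).lower() if match else None


examples = [
    "Fix bug preventing menu from sliding out on mobile",
    # Hypothetical ticket prefix: the extracted word is the project key.
    "PROJ-123: Fix bug preventing menu from sliding out on mobile",
    # Hypothetical build tag: the extracted word comes from inside the tag.
    "[ci skip] Remove unneeded tests",
]

for message in examples:
    print(f"{first_word(message)!r} <- {message!r}")
```

Only the first message yields a verb ("fix"). The other two yield "proj" and "ci", so they would not be counted even though the text after the prefix is written in the imperative mood.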
Furthermore, the natural language database has about 4,000 unique present tense verbs labelled as imperative. After doing a quick Google search, I believe there are significantly more verbs that can be used in the imperative mood, so it's possible that with more words in that list, more matches would occur with the commit message data. However, I have a feeling that the imperative verbs typically used by programmers in commit messages (like fix, merge, bump, add, modify, etc.) are relatively common ones that are well represented by the current set of 4,000.
If you have any thoughts on how to make my query more accurate, feel free to shoot me an email.
Conclusion
In this article, we used Google BigQuery to access two public datasets that enabled us to estimate the percentage of Git commits that use the imperative mood.