What % Of Git Commit Messages Use The Imperative Mood?

Category: IT Technology · Published: 4 years ago

Introduction

A well-known best practice when writing commit messages in Git is to use the imperative mood. This can be traced back to Git's documentation, which puts it this way:

Describe your changes in imperative mood, e.g. "make xyzzy do frotz" instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy to do frotz", as if you are giving orders to the codebase to change its behavior.

Some examples of commit messages written in the imperative mood are:

  • Bump version to 1.0
  • Add .gitignore
  • Refactor product repository for functional isolation and clarity
  • Merge branch 'master'
  • Remove unneeded tests
  • Fix bug preventing menu from sliding out on mobile

Notice how each commit message starts with a verb in its base, present tense form (Add, Fix, Remove) rather than a past tense or third-person form (Added, Fixes). This helps describe the purpose of each commit in a clear and concise way. It also helps standardize the format of commit messages in general.

In this article, we'll explore how frequently developers adhere to this rule by estimating the percentage of commit messages that use the imperative mood.

We will do this by combining the forces of two powerful public datasets from Google BigQuery. The first is the GitHub Activity Data dataset that contains data from almost 3 million Git repositories. The second is the GDELT Web Part of Speech dataset, which contains more than 101 billion language tokens extracted, analyzed, and tagged from global web activity using Google's Natural Language API. We will link these two datasets to roughly estimate the percentage of Git commits that use the imperative mood.

For a primer on using Google BigQuery to analyze a simpler problem, check out my previous article, "What is the most popular initial commit message in Git?", before reading this one.

Dataset #1: GitHub Activity Data

In my previous article, I used the GitHub Activity Data dataset to find the most popular initial commit messages in Git. This was quite simple because all of the required data lives in a single table (the commits table) in a single database (the bigquery-public-data.github_repos database).

As a refresher, the public bigquery-public-data.github_repos database contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. In this article, we will again make use of commit message data from the commits table for our analysis. Our goal will be to extract the commit messages from the message field of the commits table, and try to determine what percentage of the commits use the imperative mood.

To get things started, we can easily get the total number of non-empty commit messages in the dataset (between January 1st 2000 and April 2nd 2020, the range encoded by the Unix timestamps in the query) by running the following query:

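-- Count non-empty commit messages authored within the date range
SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800    -- 2000-01-01 00:00:00 UTC
  AND author.date.seconds <= 1585800000   -- 2020-04-02 04:00:00 UTC
  AND LENGTH(TRIM(message)) > 0;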

This yields a result of 237,447,598 total commits.
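
The two numeric bounds in the WHERE clause are Unix timestamps (seconds since January 1st 1970, UTC). Decoding them is a quick way to confirm the date range the query actually covers. The following is a minimal Python sketch; the helper name to_utc is my own and is not part of BigQuery or the dataset.

```python
from datetime import datetime, timezone

# Bounds copied from the WHERE clause of the query above.
LOWER_BOUND = 946684800
UPPER_BOUND = 1585800000


def to_utc(seconds: int) -> str:
    """Format a Unix timestamp (seconds since the epoch) as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()


print(to_utc(LOWER_BOUND))  # 2000-01-01T00:00:00+00:00
print(to_utc(UPPER_BOUND))  # 2020-04-02T04:00:00+00:00
```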

Dataset #2: The GDELT Web Part of Speech Dataset

At this point we need a way to identify whether or not each commit message in the commits table uses the imperative mood. This is where the GDELT Web Part of Speech dataset comes in. This dataset includes a table called web_pos, in which each record represents a language token extracted from an online source between 2016 and 2020. The records come from sources in dozens of languages. For our purposes, a language token is a single word such as a noun, verb, or adjective.

Here are a few of the most useful fields in the web_pos table, many of which we will make use of:

  • The date that the source of the token was published
  • The token text itself (token; in our case a single word)
  • The language of the token (lang)
  • A tag representing the token type (posTag: VERB, NOUN, ADJ, NUM, PUNCT, etc.)
  • The tense of the token (posTense: PAST, PRESENT, FUTURE, PLUPERFECT)
  • The mood of the token (posMood: INDICATIVE, IMPERATIVE, SUBJUNCTIVE, INTERROGATIVE)
  • The URL of the token's source

Assumptions and Method

We will make the imperfect assumption that for a commit message to be of the imperative mood, the first word in the commit message must be a present tense, imperative verb. Luckily, Google BigQuery allows the joining of data from multiple unrelated datasets in a single SQL query. This allows us to write the following query which accesses both datasets and returns a count of commit messages that have a present tense, imperative verb as the first word:

SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800
    AND author.date.seconds <= 1585800000
    AND LENGTH(TRIM(message)) > 0

    -- Extract the first word of each commit message and check it against the verb list
    AND LOWER(REGEXP_EXTRACT(message, r'\w+')) IN (

        -- Build the list of distinct English, present tense, imperative verbs
        SELECT DISTINCT LOWER(token)
        FROM `gdelt-bq.gdeltv2.web_pos`
        WHERE lang = 'en'  -- Only match English tokens
            AND posTag = 'VERB'  -- Only match VERBs
            AND posMood = 'IMPERATIVE'  -- Only match IMPERATIVE mood
            AND posTense = 'PRESENT'  -- Only match PRESENT tense

            -- Filter out tokens ending in "s" (mostly third-person forms such as "fixes"),
            -- unless they end in a double "s" (such as "pass")
            AND (LOWER(SUBSTR(token, -1)) != 's' OR LOWER(SUBSTR(token, -2)) = 'ss')

    );
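
To make the logic of the query easier to follow, here is a rough Python sketch of the same idea on a handful of made-up rows. Everything in it is an assumption for illustration: the TOKENS rows are invented stand-ins for web_pos records, the sample messages are hypothetical, and the function names are my own. It is not how BigQuery executes the query, only a small model of the filtering and matching steps.

```python
import re

# Hypothetical stand-ins for web_pos rows: (token, lang, posTag, posMood, posTense).
TOKENS = [
    ("Fix", "en", "VERB", "IMPERATIVE", "PRESENT"),
    ("add", "en", "VERB", "IMPERATIVE", "PRESENT"),
    ("Pass", "en", "VERB", "IMPERATIVE", "PRESENT"),      # kept: ends in "ss"
    ("fixes", "en", "VERB", "IMPERATIVE", "PRESENT"),     # dropped: ends in a single "s"
    ("fixed", "en", "VERB", "INDICATIVE", "PAST"),        # dropped: wrong mood and tense
    ("ajouter", "fr", "VERB", "IMPERATIVE", "PRESENT"),   # dropped: not English
]

# ASCII-only \w, to stay close to the behaviour of the regex in the query.
FIRST_WORD = re.compile(r"\w+", re.ASCII)


def imperative_verbs(rows):
    """Mirror the subquery: distinct lowercase English present tense imperative verbs."""
    verbs = set()
    for token, lang, tag, mood, tense in rows:
        if (lang, tag, mood, tense) != ("en", "VERB", "IMPERATIVE", "PRESENT"):
            continue
        word = token.lower()
        # Same rule as the SUBSTR filter: drop "...s" unless it is "...ss".
        if word.endswith("s") and not word.endswith("ss"):
            continue
        verbs.add(word)
    return verbs


def starts_with_imperative(message, verbs):
    """Mirror the outer condition: is the first word of the message in the verb list?"""
    match = FIRST_WORD.search(message)
    return match is not None and match.group(0).lower() in verbs


verbs = imperative_verbs(TOKENS)
messages = [
    "Fix bug preventing menu from sliding out",
    "Fixed typo",
    "add .gitignore",
    "   ",  # whitespace only, excluded like the LENGTH(TRIM(...)) > 0 check
    "Pass context to handler",
]
non_empty = [m for m in messages if m.strip()]
matched = [m for m in non_empty if starts_with_imperative(m, verbs)]

print(sorted(verbs))                       # ['add', 'fix', 'pass']
print(len(matched), "of", len(non_empty))  # 3 of 4
```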

Results

The resulting output of the above query is 104,057,902 commits. Dividing this by 237,447,598 (the total number of commits we calculated above) yields 43.8%. Therefore, we can estimate that approximately 44% of commit messages in the GitHub dataset use the imperative mood.
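
The division is easy to double-check. This short Python snippet simply reuses the two counts reported above; the variable names are my own.

```python
total_commits = 237_447_598       # non-empty commit messages in the date range
imperative_commits = 104_057_902  # messages whose first word matched an imperative verb

share = imperative_commits / total_commits
print(f"{share:.1%}")  # 43.8%
```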

Keep in mind, there are several aspects of this method that introduce error in the calculation. Oftentimes, the beginning of a commit message contains noise such as a ticket number, story ID, build tool stamp, or some other arbitrary tag data. In these cases, the REGEXP_EXTRACT(message, r'\w+') function will pick out the first word it comes across in that tag, even if the intended starting point of an imperative mood verb appears later in the commit message. I suspect this will lead to a noticeable under-counting of the actual number of imperative mood commit messages in the dataset.
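
The effect is easy to reproduce with the same regular expression in Python. The commit messages below are invented examples of common prefix styles, not rows from the dataset, and first_word is a hypothetical helper; I use re.ASCII as an assumption so that \w stays close to what the query's regex matches.

```python
import re

# Same pattern as in the query; re.ASCII keeps \w to letters, digits, and underscore.
FIRST_WORD = re.compile(r"\w+", re.ASCII)


def first_word(message):
    """Return the lowercase first run of word characters, or None if there is none."""
    match = FIRST_WORD.search(message)
    return match.group(0).lower() if match else None


# Hypothetical commit messages with and without leading tag noise.
examples = [
    "Fix login redirect",
    "[JIRA-123] Fix login redirect",
    "#42: Remove unneeded tests",
    "fix(auth): handle expired tokens",
]

for message in examples:
    print(f"{first_word(message)!r:10} <- {message}")
```

Only the first and last messages yield a verb ("fix"). The second yields "jira" and the third yields "42", so neither would be counted even though both are written in the imperative mood.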

Furthermore, the natural language dataset has about 4,000 unique present tense verbs labelled as imperative. After a quick Google search, I believe there are significantly more verbs that can be used in the imperative mood, so it's possible that a longer list would produce more matches against the commit message data. However, I suspect that the imperative verbs programmers typically use in commit messages (like fix, merge, bump, add, modify, etc.) are relatively common ones that are well represented in the current set of 4,000.

If you have any thoughts on how to make my query more accurate, feel free to shoot me an email.

Conclusion

In this article, we used Google BigQuery to join two public datasets and estimate that roughly 44% of the commit messages in the GitHub dataset use the imperative mood.

