What % Of Git Commit Messages Use The Imperative Mood?

Category: IT Technology · Published: 5 years ago


Introduction

A well-known best practice when writing commit messages in Git is to use the imperative mood. This can be traced back to Git's documentation. To summarize it here:

Describe your changes in imperative mood, e.g. "make xyzzy do frotz" instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy to do frotz", as if you are giving orders to the codebase to change its behavior.

Some examples of commit messages written in the imperative mood are:

  • Bump version to 1.0
  • Add .gitignore
  • Refactor product repository for functional isolation and clarity
  • Merge branch 'master'
  • Remove unneeded tests
  • Fix bug preventing menu from sliding out on mobile

Notice how each commit message starts with a verb in the present tense. This helps describe the purpose of each commit in a clear and concise way. It also helps standardize the format of commit messages in general.

In this article, we'll explore how frequently developers adhere to this rule by estimating the percentage of commit messages that use the imperative mood.

We will do this by combining the forces of two powerful public datasets from Google BigQuery. The first is the GitHub Activity Data dataset that contains data from almost 3 million Git repositories. The second is the GDELT Web Part of Speech dataset, which contains more than 101 billion language tokens extracted, analyzed, and tagged from global web activity using Google's Natural Language API. We will link these two datasets to roughly estimate the percentage of Git commits that use the imperative mood.

For a primer on using Google BigQuery to analyze a simpler problem, check out my previous article What is the most popular initial commit message in Git? before reading this one.

Dataset #1: GitHub Activity Data

In my previous article, I used the GitHub Activity Data dataset to find the most popular initial commit messages in Git. This was quite simple because all of the required data lives in a single table (the commits table) in a single database (the bigquery-public-data.github_repos database).

As a refresher, the public bigquery-public-data.github_repos database contains data from millions of public GitHub repositories. This data includes repository names, committed file names, commit messages, author names, timestamps, and more. In this article, we will again make use of commit message data from the commits table for our analysis. Our goal will be to extract the commit messages from the message field of the commits table, and try to determine what percentage of the commits use the imperative mood.

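To get things started, we can get the total number of non-empty commit messages in the dataset (authored between January 1st 2000 and April 2nd 2020, the dates that the two Unix timestamps below correspond to) by running the following query:

-- Count commits with a non-empty message authored inside the date window
SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800   -- 2000-01-01 00:00:00 UTC
  AND author.date.seconds <= 1585800000  -- 2020-04-02 04:00:00 UTC
  AND LENGTH(TRIM(message)) > 0;         -- skip empty or whitespace-only messages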

This yields a result of 237,447,598 total commits.
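To see exactly which window the query covers, the two Unix timestamps in the WHERE clause can be decoded into calendar dates. The following is a small Python sketch and not part of the original analysis; the helper name `to_utc` is made up for illustration.

```python
from datetime import datetime, timezone

# Bounds copied from the WHERE clause of the query above.
START_SECONDS = 946684800
END_SECONDS = 1585800000


def to_utc(seconds: int) -> datetime:
    """Convert a Unix timestamp (seconds since the epoch) to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(to_utc(START_SECONDS).isoformat())  # 2000-01-01T00:00:00+00:00
print(to_utc(END_SECONDS).isoformat())    # 2020-04-02T04:00:00+00:00
```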

Dataset #2: The GDELT Web Part of Speech Dataset

At this point we need a way to identify whether or not each commit message in the commits table uses the imperative mood. This is where the GDELT Web Part of Speech dataset comes in. This dataset includes a table called web_pos, in which each record represents a language token extracted from an online source between 2016 and 2020. The records come from sources in dozens of languages. For our purposes, a language token is a single word such as a noun, verb, or adjective.

Here are a few of the most useful fields in the web_pos table, many of which we will make use of:

  • The date that the source of the token was published
  • The token text itself (in our case a single word)
  • The language of the token
  • A tag representing the token type (VERB, NOUN, ADJ, NUM, PUNCT, etc.)
  • The tense of the token (PAST, PRESENT, FUTURE, PLUPERFECT)
  • The mood of the token (INDICATIVE, IMPERATIVE, SUBJUNCTIVE, INTERROGATIVE)
  • The URL of the token's source

Assumptions and Method

We will make the imperfect assumption that for a commit message to be in the imperative mood, its first word must be a present tense, imperative verb. Luckily, Google BigQuery allows the joining of data from multiple unrelated datasets in a single SQL query. This allows us to write the following query, which accesses both datasets and returns a count of commit messages that have a present tense, imperative verb as the first word:

SELECT COUNT(*)
FROM `bigquery-public-data.github_repos.commits`
WHERE author.date.seconds >= 946684800
    AND author.date.seconds <= 1585800000
    AND LENGTH(TRIM(message)) > 0

    -- Regular expression to match the first word of each commit message
    AND LOWER(REGEXP_EXTRACT(message, r'\w+')) IN (

        SELECT LOWER(token)
        FROM `gdelt-bq.gdeltv2.web_pos`
        WHERE lang = 'en'  -- Only match English tokens
            AND posTag = 'VERB'  -- Only match VERBs
            AND posMood = 'IMPERATIVE'  -- Only match IMPERATIVE mood
            AND posTense = 'PRESENT'  -- Only match PRESENT tense

            -- Filter out tokens ending in "s" (e.g. third-person forms such as "fixes"),
            -- unless they end in a double "s"
            AND (LOWER(SUBSTR(token, -1)) != 's' OR LOWER(SUBSTR(token, -2)) = 'ss')

        GROUP BY LOWER(token)

    );
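The matching logic of the query can be tried out locally on a handful of messages. The sketch below is an approximation in Python, not the query itself: `SAMPLE_TOKENS` is a hypothetical stand-in for the verb list the subquery would return, and the function names are invented for illustration. It assumes BigQuery's `\w` matches only ASCII word characters, which is why `re.ASCII` is passed.

```python
import re

# Hypothetical sample of tokens; the real list comes from the web_pos subquery.
SAMPLE_TOKENS = ["Add", "fix", "fixes", "merge", "bump", "remove", "pass"]


def keep_token(token: str) -> bool:
    """Mirror the SQL filter: drop tokens ending in 's' unless they end in 'ss'."""
    t = token.lower()
    return not t.endswith("s") or t.endswith("ss")


# Lowercased, de-duplicated verb list, like SELECT LOWER(token) ... GROUP BY.
ALLOWED_VERBS = {t.lower() for t in SAMPLE_TOKENS if keep_token(t)}


def first_word(message: str):
    """Return the first run of word characters, lowercased, or None if absent."""
    match = re.search(r"\w+", message, flags=re.ASCII)
    return match.group(0).lower() if match else None


def looks_imperative(message: str) -> bool:
    """True when the first word of the message is in the allowed verb list."""
    return first_word(message) in ALLOWED_VERBS


for msg in ["Merge branch 'master'", "Fixes typo", "Pass options through", "Updated README"]:
    print(msg, "->", looks_imperative(msg))
```

With this sample list, "fixes" is dropped by the trailing-s filter while "pass" survives it, so only the first and third messages are counted.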

Results

The resulting output of the above query is 104,057,902 commits. Dividing this by 237,447,598 (the total number of commits we calculated above) yields 43.8%. Therefore, we can estimate that approximately 44% of commit messages in the GitHub dataset use the imperative mood.
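The division can be reproduced in a few lines of Python using the two counts reported above (the variable names are only for illustration):

```python
# Counts reported by the two BigQuery queries in this article.
imperative_commits = 104_057_902
total_commits = 237_447_598

share = imperative_commits / total_commits
print(f"{share:.1%}")  # 43.8%
```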

Keep in mind, there are several aspects of this method that introduce error in the calculation. Oftentimes, the beginning of a commit message contains noise such as a ticket number, story ID, build tool stamp, or some other arbitrary tag data. In these cases, the REGEXP_EXTRACT(message, r'\w+') function will pick out the first word it comes across in that tag, even if the intended starting point of an imperative mood verb appears later in the commit message. I suspect this will lead to a noticeable under-counting of the actual number of imperative mood commit messages in the dataset.
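The effect of leading tags is easy to reproduce. The Python sketch below first shows what a plain `\w+` match picks up, then tries one possible refinement that skips leading tags before looking for the first word. The refinement was not used in the analysis above, and `TAG_PREFIX` is only a guess at what such tags look like (bracketed labels and ticket IDs like "ABC-123"); real commit histories will contain many other shapes.

```python
import re


def first_word(message: str):
    """First run of ASCII word characters, lowercased (same idea as the query)."""
    match = re.search(r"\w+", message, flags=re.ASCII)
    return match.group(0).lower() if match else None


# Hypothetical tag shapes: "[anything]" or an upper-case ticket ID such as
# "PROJ-42", optionally followed by a colon.
TAG_PREFIX = re.compile(r"^\s*(?:\[[^\]]*\]|[A-Z][A-Z0-9]*-\d+)\s*:?\s*")


def first_word_after_tags(message: str):
    """Strip any number of leading tags, then take the first word."""
    stripped = message
    while True:
        shorter = TAG_PREFIX.sub("", stripped, count=1)
        if shorter == stripped:
            break
        stripped = shorter
    return first_word(stripped)


print(first_word("[JIRA-123] Fix login redirect"))             # jira
print(first_word_after_tags("[JIRA-123] Fix login redirect"))  # fix
print(first_word("PROJ-42: Add retry logic"))                  # proj
print(first_word_after_tags("PROJ-42: Add retry logic"))       # add
```

Messages without a tag, such as "Add .gitignore", give the same first word either way.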

Furthermore, the natural language database has about 4,000 unique present tense verbs labelled as imperative. After doing a quick Google search, I believe there are significantly more verbs that can be used in the imperative mood, so it's possible that with more words in that list, more matches would occur with the commit message data. However, I have a feeling that imperative verbs typically used by programmers in commit messages (like fix, merge, bump, add, modify, etc.) are relatively common ones that are well represented by the current set of 4,000.

If you have any thoughts on how to make my query more accurate, feel free to shoot me an email.

Conclusion

In this article, we used Google BigQuery to access two public datasets that enabled us to estimate the percentage of Git commits that use the imperative mood.

