Hacking on Clang is surprisingly easy

栏目: IT技术 · 发布时间: 4年前

内容简介：Date: 2020-01-27Git:I happen to think that the current lambda syntax for C++ is kind of verbose. I'm not the only one to have thought that, and there has already been a paper discussing a possible abbreviated lambda syntax (though it

Date: 2020-01-27

Git: https://gitlab.com/mort96/blog/blob/published/content/00000-home/00012-clang-compiler-hacking.md

I happen to think that the current lambda syntax for C++ is kind of verbose. I'm not the only one to have thought that, and there has already been a paper discussing a possible abbreviated lambda syntax (though it was rejected ).

In this blog post, I will detail my attempt to implement a sort of simplest possible version of an abbreviated lambda syntax. Basically, this:

[](auto &&a, auto &&b) => a.id() < b.id();

should mean precisely :

[](auto &&a, auto &&b) { return a.id() < b.id(); };

I will leave a discussion about whether that change is worth it or not to the end. Most of this article will just assume that we want that new syntax, and discuss how to actually implement it in Clang.

If you want to read more discussion on the topic, I wrote a somewhat controversial post on Reddit discussing why I think it might be a good idea.

Here's the implementation I will go through in this post: https://github.com/mortie/llvm/commit/e4726dc9d9d966978714fc3d85c6e9c335a38ab8 - 28 additions, including comments and whitespace, across 3 files.

Getting the Clang code

This wasn't my first time compiling Clang, but it was my first time downloading the source code with the intent to change it.

LLVM has a nice page which details getting and building Clang , but the tl;dr is:

git clone https://github.com/llvm/llvm-project.git
cd llvm-project && mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=`pwd`/inst -DLLVM_ENABLE_PROJECTS=clang -DCMAKE_BUILD_TYPE=Release ../llvm
make -j 8
make install

A few points to note:

The build will take a long time. Clang is big.
I prefer -DCMAKE_BUILD_TYPE=Release because it's way faster to build. Linking Clang with debug symbols and everything takes ages and will OOM your machine.
This will install your built clang to inst (short for "install"). The clang binary itself will be in inst/bin/clang .

Now that we have a clang setup, we can have a look at how the project is laid out, and play with it.

Changing the Clang code

The feature I want to add is very simple: Basically, I want [] => 10 to mean the exact same thing as [] { return 10; } . In order to understand how one would achieve that, an extremely short introduction to how compilers work is necessary:

Our code is just a sequence of bytes, like [] => 10 + 20 . In order for Clang to make sense of that, it will go through many steps. We can basically divide a compiler into two parts: the "front-end", which goes through many steps to build a thorough understanding of the code as a graph structure , and the "back-end" which goes through many steps to remove information, eventually ending up with a simple series of bytes again, but this time as machine code instead of ASCII.

We'll ignore the back-end for now. The front-end basically works like this:

Split the stream of bytes into a stream of tokens. This step turns [] => 10 + 20 into something like (open-brace) (close-brace) (fat-arrow) (number: 10) (plus) (number: 20) .
Go through those tokens and construct a tree. This step turns the sequence of tokens into a tree: (lambda-expression (body (return-statement (add-expression (number 10) (number 20))))) (Yeah, this looks a lot like Lisp. There's a reason people say Lisp basically has no syntax; you're just writing out the syntax tree by hand.)
Add semantic information, such as types.

The first phase is usually called lexical analysis, or tokenization, or scanning. The second phase is what we call parsing. The third phase is usually called semantic analysis or type checking.

Well, the change I want to make involves adding a new token, the "fat arrow" token => . That means we'll have to find out how the lexer (or tokenizer) is implemented; where it keeps its list of valid tokens types, and where it turns the input text into tokens. After some grepping, I found the file clang/include/clang/Basic/TokenKinds.def , which includes a bunch of token descriptions, such as PUNCTUATOR(arrow, "->") . This file seems to be a "supermacro" ; a file which exists to be included by another file as a form of macro expansion.

I added PUNCTUATOR(fatarrow, "=>") right below the PUNCTUATOR(arow, "->") line.

Now that we have defined our token, we need to get the lexer to actually generate it.

After some more grepping, I found clang/lib/Lex/Lexer.cpp , where the Lexer::LexTokenInternal function is what's actually looking at characters and deciding what tokens they represent. It has a case statement to deal with tokens which start with an = character:

case '=':
	Char = getCharAndSize(CurPtr, SizeTmp);
	if (Char == '=') {
		// If this is '====' and we're in a conflict marker, ignore it.
		if (CurPtr[1] == '=' && HandleEndOfConflictMarker(CurPtr-1))
			goto LexNextToken;

		Kind = tok::equalequal;
		CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);
	} else {
		Kind = tok::equal;
	}
	break;

Given that, the change to support my fatarrow token is really simple:

case '=':
	Char = getCharAndSize(CurPtr, SizeTmp);
	if (Char == '=') {
		// If this is '====' and we're in a conflict marker, ignore it.
		if (CurPtr[1] == '=' && HandleEndOfConflictMarker(CurPtr-1))
		goto LexNextToken;

		Kind = tok::equalequal;
		CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);

	// If the first character is a '=', and it's followed by a '>', it's a fat arrow
	} else if (Char == '>') {
		Kind = tok::fatarrow;
		CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);

	} else {
		Kind = tok::equal;
	}
	break;

Now that we have a lexer which generates a tok::fatarrow any time it encounters a => in our code, we can start changing the parser to make use of it.

Since I want to change lambda parsing, the code which parses a lamba seems like a good place to start (duh). I found that in a file called clang/lib/Parse/ParseExprCXX.cpp , in the function ParseLambdaExpressionAfterIntroducer . Most of the function deals with things like the template parameter list and trailing return type, which I don't want to change, but the very end of the function contains this gem:

// Parse compound-statement.
if (!Tok.is(tok::l_brace)) {
	Diag(Tok, diag::err_expected_lambda_body);
	Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
	return ExprError();
}

StmtResult Stmt(ParseCompoundStatementBody());
BodyScope.Exit();
TemplateParamScope.Exit();

if (!Stmt.isInvalid() && !TrailingReturnType.isInvalid())
	return Actions.ActOnLambdaExpr(LambdaBeginLoc, Stmt.get(), getCurScope());

Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
return ExprError();

If the next token isn't an opening brace, error.
Parse a compound statement body (i.e consume a { , read statements until the } ).
After some housekeeping, act on the now fully parsed lambda expression.

In principle, what we want to do is to check if the next token is a => instead of a { ; if it is, we want to parse an expression instead of a compound statement, and then somehow pretend that the expression is a return statement. Through some trial, error and careful copy/pasting, I came up with this block of code which I put right before the if (!Tok.is(tok::l_brace)) :

// If this is an arrow lambda, we just need to parse an expression.
// We parse the expression, then put that expression in a return statement,
// and use that return statement as our body.
if (Tok.is(tok::fatarrow)) {
	SourceLocation ReturnLoc(ConsumeToken());

	ExprResult Expr(ParseExpression());
	if (Expr.isInvalid()) {
		Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
		return ExprError();
	}

	StmtResult Stmt = Actions.ActOnReturnStmt(ReturnLoc, Expr.get(), getCurScope());

	BodyScope.Exit();
	TemplateParamScope.Exit();

	if (!Stmt.isInvalid() && !TrailingReturnType.isInvalid())
		return Actions.ActOnLambdaExpr(LambdaBeginLoc, Stmt.get(), getCurScope());

	Actions.ActOnLambdaError(LambdaBeginLoc, getCurScope());
	return ExprError();
}

// Otherwise, just parse a compound statement as usual.
if (!Tok.is(tok::l_brace)) ...

This is really basic; if the token is a => instead of a { , parse an expression, then put that expression into a return statement, and then use that return statement as our lambda's body.

And it works! Lambda expressions with fat arrows are now successfully parsed as if they were regular lambdas whose body is a single return statement:

Hacking on Clang is surprisingly easy

Was it worth it?

Implementing this feature into Clang was definitely worth it just to get more familiar with how the code base works. However, is the feature itself a good idea at all?

I think the best way to decide if a new syntax is better or not is to look at old code which could've made use of the new syntax, and decide if the new syntax makes a big difference. Therefore, and now that I have a working compiler, I have gone through all the single-expression lambdas and replaced them with my fancy new arrow lambdas in some projects I'm working on.

Before:

std::erase_if(active_chunks_, [](Chunk *chunk) { return !chunk->isActive(); });

After:

std::erase_if(active_chunks_, [](Chunk *chunk) => !chunk->isActive());

This code deletes a chunk from a game world if the chunk isn't currently active. In my opinion, the version with the arrow function is a bit clearer, but a better solution could be C++17's std::invoke . If I understand std::invoke correctly, if C++ was to adopt std::invoke for algorithms, this code could be written like this:

std::erase_if(active_chunks_, &Chunk::isInactive);

This looks nicer, but has the disadvantage that you need to add an extra method to the class. Having both isActive and its negation isInactive as member functions just because someone might want to use it as a predicate in an algorithm sounds unfortunate. I prefer lambdas' fleixibilty.

Before:

return map(begin(worldgens_), end(worldgens_), [](auto &ptr) { return ptr.get(); });

After:

return map(begin(worldgens_), end(worldgens_), [](auto &ptr) => ptr.get());

This code maps a vector of unique pointers to raw pointers. This is yet another case where I think the arrow syntax is slightly nicer than the C++11 alternative, but this time, we could actually use the member function invocation if I changed my map function to use std::invoke:

return map(begin(worldgens_), end(worldgens_), &std::unique_ptr<Swan::WorldGen::Factory>::get);

Well, this illustrates that invoking a member function doesn't really work with overly complex types. Imagine if the type was instead something more elaborate:

return map(begin(worldgens_), end(worldgens_),
	std::unique_ptr<Swan::WorldGen<int>::Factory, Swan::WorldGen<int>::Factory::Deleter>::get);

This also happens to be unspecified behavior, because taking the address of a function in the standard library is generally not legal. From https://en.cppreference.com/w/cpp/language/extending_std :

The behavior of a C++ program is unspecified (possibly ill-formed) if it explicitly or implicitly attempts to form a pointer, reference (for free functions and static member functions) or pointer-to-member (for non-static member functions) to a standard library function or an instantiation of a standard library function template, unless it is designated an addressable function.

Before:

bool needRender = dirty || std::any_of(widgets.begin(), widgets.end(),
	[](auto &w) { return w.needRender(); });

After:

bool needRender = dirty || std::any_of(widgets.begin(), widgets.end(),
	[](auto &w) => w.needRender());

Again, the short lambda version looks a bit better to me. However, here again, we could replace the lambda with a member reference if algorithms were changed to use std::invoke :

bool needRender = dirty || std::any_of(widgets.begin(), widgets.end(),
	&WidgetContainer::needRender);

Overall, I see a short lambda syntax as a modest improvement. The biggest readability win mostly stems from the lack of that awkward ; }); at the end of an expression; foo([] => bar()) instead of foo([] { return bar(); }); . It certainly breaks down a bit when the argument list is long; neither of these two lines are particularly short:

foo([](auto const &&a, auto const &&b, auto const &&c) => a + b + c);
foo([](auto const &&a, auto const &&b, auto const &&c) { return a + b + c; });

I think, considering the minimal cost of implementing this short lambda syntax, the modest improvements outweigh the added complexity. However, there's also an opportunity cost associated with my version of the short lambda syntax: it makes a better, future short lambda syntax either impossible or more challenging. For example, accepting my syntax would mean we couldn't really adopt a ratified version of P0573R2 's short lambda syntax in the future, even if the issues with it were otherwise fixed.

Therefore, I will argue strongly that my syntax makes code easier to read, but I can't say anything about whether it should be standardized or not.

Aside: Corner cases

If we were to standardize this syntax, we would have to consider all kinds of corner cases, not just accept what clang with my changes happens to to do. However, I'm still curious about what exactly clang happens to do with my changes.

How does this interact with the comma operator?

The comma operator in C++ (and most C-based languages) is kind of strange. For example, what does foo(a, b) do? We know it calls foo with the arguments a and b , but you could technically decide to parse it as foo(a.operator,(b)) .

My arrow syntax parses foo([] => 10, 20) as calling foo with one argument; a function with the body 10, 20 (where the comma operator means the 10 does nothing, and 20 is returned). I would probably want that to be changed, so that foo is called with two arguments; a lambda and an int.

This turns out to be fairly easy to fix, because Clang already has ways of dealing with expressions which can't include top-level commas. After all, there's precedence here; since clang parses foo(10, 20) without interpreting the , a top-level comma as a comma operator, we can use the same infrastructure for arrow lambdas.

In clang/lib/Parser/ParseExpr.cpp , Clang defines a function ParseAssignmentExpression , which has this comment:

Parse an expr that doesn't include (top-level) commas.

Calling ParseAssignmentExpression is also the first thing the ParseExpression function does. It seems like it's just the general function for parsing an expression without a top-level comma operator, even though the name is somewhat misleading. This patch changes arrow lambdas to use ParseAssignmentExpression instead of ParseExpression : https://github.com/mortie/llvm/commit/c653318c0056d06a512dfce0799b66032edbed4c

How do immediately invoked lambdas work?

With C++11 lambdas, you can write an immediately invoked lambda in the obvious way; just do [] { return 10; }() , and that expression will return 10. With my arrow lambda syntax, it's not quite as obvious. Would [] => foo() be interpreted as immediately invoking the lambda [] => foo , or would it be interpreted as creating a lambda whose body is foo() ?

In my opinion, the only sane way for arrow lambdas to work would be that [] => foo() creates a lambda with the body foo() , and that creating an immediately invoked lambda would require extra parens; ([] => foo())() . That's also how my implementation happens to work.

How does this interact with with explicit return types, specifiers, etc?

Since literally all the code before the arrow/opening brace is shared between arrow lambdas and C++11 lambdas, everything should work exactly the same. That means that all of these statements should work:

auto l1 = []() -> long => 10;
auto l2 = [foo]() mutable => foo++;
auto l3 = [](auto a, auto b) noexcept(noexcept(a + b)) => a + b;

And so would any other combination of captures, template params, params, specifiers, attributes, constraints, etc, except that the body has to be a single expression.

以上所述就是小编给大家介绍的《Hacking on Clang is surprisingly easy》，希望对大家有所帮助，如果大家有任何疑问请给我留言，小编会及时回复大家的。在此也非常感谢大家对码农网的支持！

查看所有标签

猜你喜欢:

Hacking on Clang is surprisingly easy

本站部分资源来源于网络，本站转载出于传递更多信息之目的，版权归原作者或者来源机构所有，如转载稿涉及版权问题，请联系我们。

码农书籍

实用语义网

2009-2 / 59.00元

《实用语义网RDFS与OWL高效建模(英文版)》是语义网的入门教程，详细讲述语义网的核心内容的语言，包括语义网的概念、语义建模等。语义网的发展孕育着万维网及其应用的一场革命，作为语义网核心内容的语言：RDF和OWL，逐渐得到广泛的重视和应用。《实用语义网RDFS与OWL高效建模(英文版)》对于任何对语义网感兴趣的专业技术人员都是十分难得的参考书。一起来看看《实用语义网》这本书的介绍吧!

码农工具

Hacking on Clang is surprisingly easy

Getting the Clang code

Changing the Clang code

Was it worth it?

Before:

After:

Before:

After:

Before:

After:

Aside: Corner cases

How does this interact with the comma operator?

How do immediately invoked lambdas work?

How does this interact with with explicit return types, specifiers, etc?

实用语义网

随机密码生成器

SHA 加密

Markdown 在线编辑器