How does it work in Mono’s C# compiler?

Introduction

Mono is a free, open source software project. It is an implementation
of Microsoft’s .NET Framework based on the European association for
standardizing information and communication systems (ECMA) standards
for the C# language and the Common Language Runtime (CLR). The Mono C#
compiler was started by Miguel de Icaza. Table 1 lists the main
components of Mono with a brief description of what each one does.

Table 1: Mono source code components

Component
Description

C# Compiler
Mono’s C# compiler is an implementation of the C# language based on
the ECMA specification. It now supports C# 1.0, 2.0, 3.0, and 4.0.

Mono Runtime
The runtime implements the ECMA Common Language Infrastructure (CLI).
The runtime provides a Just-in-Time (JIT) compiler, an Ahead-of-Time
compiler (AOT), a library loader, the garbage collector, a threading
system, and interoperability functionality.

Base Class Library
The Mono platform provides a comprehensive set of classes that provide a
solid foundation to build applications on. These classes are compatible
with Microsoft’s .NET Framework classes.

Mono Class Library
Mono also provides many classes that go above and beyond the Base Class
Library provided by Microsoft. These provide additional functionality
that is useful, especially when building Linux applications. Examples
include classes for Gtk+, Zip files, LDAP, OpenGL, Cairo, and POSIX.

Note: The information shown in the above table has been retrieved
from .

There are many versions of the Mono compiler. Table 2 shows the
different versions and the framework each one targets.

Table 2: Mono compiler version and related frameworks

Compiler Version
Target Framework

mcs
1.1

gmcs
2.0

smcs
2.1

dmcs
4.0

Note: The information shown in the table has been retrieved from
.

The compiler mcs now defaults to the 3.x language specification,
starting with Mono 2.8.

Translator: 史宁宁 (snsn1984)

Getting the Mono source code

Mono is a freely available open source project, and the source code of
its C# compiler can be downloaded from several places. One option is
GitHub; the URL for the Mono source code on GitHub is
. It can also be downloaded from other
places such as . I
downloaded the Mono source code from the GitHub site (Figure 4.1). As
Figure 4.1 shows, Mono has several branches, for example mono-2-10,
mono-2-10-8, mono-2-6, and mono-2-8. Table 3 lists the directories of
the Mono source tree with a short description of each.

Table 3: Mono source code directory

docs
Technical documents about the Mono runtime.

data
Configuration files installed as part of the Mono runtime.

mono
The core of the Mono Runtime.

metadata
The object system and metadata reader.

mini
The Just in Time Compiler.

dis
CIL executable Disassembler

cli
Common code for the JIT and the interpreter.

io-layer
The I/O layer and system abstraction for emulating the .NET IO model.

cil
Common Intermediate Representation, XML definition of the CIL byte
codes.

interp
Interpreter for CLI executable (obsolete).

arch
Architecture specific portions.

mcs
The core of the Mono Compiler code

mcs

mcs
Compiler source code

jay
Parser generator

man
Manual pages for the various Mono commands and programs.

samples
Some simple sample programs on uses of the Mono runtime as an embedded
library.

scripts
Scripts used to invoke Mono and the corresponding program.

runtime
A directory that contains the Makefiles that link the mono/ and mcs/
build systems.

../olive
If the directory ../olive is present (as an independent checkout) from
the Mono module, that directory is automatically configured to share the
same prefix as this module.

Note: The above Mono source code directory has been retrieved from
.

The Mono compiler source code resides inside the mcs folder of
/mono/mcs.

 

Jay

Jay is an open source compiler-compiler tool derived from Berkeley Yacc.
The Mono project uses it to generate the parser of the Mono C# compiler.
Jay reads the grammar specification from a grammar file and generates an
LR parser for it. Jay turns the cs-parser.jay file into the
cs-parser.cs file, which the Mono C# compiler uses as its parser.

                                                                                         
“Clang” C Language Front-End Internals Manual

cs-parser.jay to cs-parser.cs conversion

Cygwin is a set of open source tools that provide a Linux-like
environment on Windows, so that Linux applications, for example a
shell, can be used on Windows. We assume a working Cygwin environment
on the desktop. Open the Cygwin terminal by clicking
Start > Program Files > Cygwin > Cygwin Terminal. Once open, the
terminal looks like Figure 1.

Figure 1: Cygwin Open mode

Copy the Mono source code into the /usr/src directory of the Cygwin
installation. Then open the Cygwin terminal and run the commands listed
in Code-Listing 1.

Code-Listing 1: Bash Command to convert cs-parser.jay to cs-parser.cs

$ cd /usr/src/Mono/mcs
$ cd jay
$ make
$ cd ..
$ cd mcs
$ ../jay/jay.exe -ctv < ../jay/skeleton.cs cs-parser.jay > cs-parser.cs

Please see the following figure:

Figure 2: Bash Command output

Run with these arguments, jay.exe converts the cs-parser.jay file into
cs-parser.cs, which is the parser for Mono.

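Introduction

This document describes some of the more important APIs and internal
design decisions made in the Clang C front-end. Its purpose is both to
capture some of this high-level information and to describe some of the
design decisions behind it. It is meant for people interested in hacking
on Clang, not for end users of Clang. The description below is
categorized by library and does not describe any of the clients of the
libraries.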

Mono source code relationship

According to the documentation supplied with the Mono source code
(mcs\mcs\compiler.txt or
), the
source code files of the Mono C# compiler are divided into five
categories: Infrastructure; Parsing; Expressions; Statements; and
Declarations, Classes, Structs, Enumerations. Table 4 shows this
classification, which makes it easier to understand the types used in
the construction of the compiler.

Table 4: Mono source code classification

Infrastructure
driver.cs, codegen.cs, attribute.cs, rootcontext.cs, typemanager.cs,
report.cs, support.cs

Parsing
cs-tokenizer.cs, cs-parser.jay, cs-parser.cs, location.cs

Expressions
ecore.cs, expression.cs, assign.cs, constant.cs, literal.cs, cfold.cs

Statements
statement.cs, iterators.cs

Declarations, Classes, Structs, Enumerations
decl.cs, class.cs, delegate.cs, enum.cs, interface.cs, parameter.cs,
pending.cs

Note: The above Mono source code classification has been retrieved
from .

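LLVM Support Library

The LLVM libSupport library provides many underlying libraries and data
structures, including command line option processing, various
containers, and a system abstraction layer that is used for file system
access.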

Mono compilation in depth

The Mono C# compiler starts compilation in the driver.cs file, when the
public bool Compile () method is called. Compile () first initializes
the ToplevelTypes variable of the RootContext class and then calls the
Parse() method of the driver class. The Parse() method then calls
void Parse (CompilationUnit file) to start reading from the source
code file. After reading the source code file, the driver starts the
parsing process by calling the:

void Parse (SeekableStreamReader reader, CompilationUnit file)

method. It creates an instance of the Mono parser, a CSharpParser
object, by calling the

public CSharpParser(SeekableStreamReader reader, CompilationUnit file, CompilerContext ctx) 

constructor. The partial code of the Compile method from the driver.cs
file, listed in Code-Listing 1, shows the main flow of the compilation
process.

Code-Listing 1: Partial source code of the Compile method

public bool Compile()
{
   // TODO: Should be passed to parser as an argument
   RootContext.ToplevelTypes = new ModuleContainer(ctx, RootContext.Unsafe);
   Parse();
   ProcessDefaultConfig();
   GlobalRootNamespace.Instance.AddModuleReference(RootContext.ToplevelTypes.Builder);
   //
   // Load assemblies required
   //
   LoadReferences();
   TypeManager.InitOptionalCoreTypes(ctx);
   //
   // The second pass of the compiler
   //
   RootContext.ResolveTree();
   if (!RootContext.StdLib)
     RootContext.BootCorlib_PopulateCoreTypes();
   RootContext.PopulateTypes();
   //
   // Verify using aliases now
   //
   NamespaceEntry.VerifyAllUsing();
   if (Report.Errors > 0)
   {
     return false;
   }
   CodeGen.Assembly.Resolve();
   if (RootContext.VerifyClsCompliance)
   {
     //......
   }
   RootContext.EmitCode();
   RootContext.CloseTypes();
   CodeGen.Save(output_file, want_debugging_support, Report);
}

From Code-Listing 1, we can see that the Mono compiler starts parsing by
calling public void Parse() of the driver class, which calls the Parse
method listed in Code-Listing 2 to continue parsing.

Code-Listing 2: Source code of Parse method

void Parse (SeekableStreamReader reader, CompilationUnit file)
{
   CSharpParser parser = new CSharpParser (reader, file, ctx);
   try {
     parser.parse ();
   } catch (Exception ex) {
     Report.Error(589, parser.Lexer.Location,
        "Compilation aborted in file `{0}', {1}", file.Name, ex);
   }
}

This Parse method will create an instance of the CSharpParser class
by calling the constructor listed in Code-Listing 3. This will make an
instance of the Mono parser generated by Compiler-Compiler tool Jay as
we discussed earlier.

Code-Listing 3: Constructor of the CSharpParser class

public CSharpParser (SeekableStreamReader reader, CompilationUnit file, CompilerContext ctx)

The constructor listed in Code-Listing 3 takes the file reader stream,
the compilation unit, and the read-only compiler context defined in the
driver class.

Code-Listing 4: CSharpParser declaration in the cs-parser.jay file

%{
using System.Text;
using System.IO;
using System;
namespace Mono.CSharp
{
using System.Collections;
/// <summary>
///    The C# Parser
/// </summary>
    public class CSharpParser
    {

The parser object created from CSharpParser will call the internal
parse() method of the CSharpParser class which will call the
Compiler-Compiler generated yyparse method. To call yyparse, it
needs to pass the lexer or the tokenizer in it as a parameter. The code
listed in Code-Listing 5 shows the method signature of the yyparse
method which takes the lexer as a parameter.

Code-Listing 5: Signature of the yyparse method

internal Object yyparse (yyParser.yyInput yyLex)

The parsing will take place in this yyparse method. This yyparse
method will parse each of the tokens generated by the lexer.

Code-Listing 6: Lexer initialization in the CSharpParser constructor

public CSharpParser (SeekableStreamReader reader, CompilationUnit file, CompilerContext ctx)
{
  // Code has been removed
  lexer = new Tokenizer (reader, file, ctx);
}

The yyparse method calls the xtoken() method of the lexer to obtain
tokens. The lexer produces each token by performing lexical analysis on
the source code of the program. Before we move ahead, we need to
understand this token generation process. The Tokenizer class reads the
source code (in this instance, ClassToParse.cs listed in Code-Listing
13) one character at a time using the get_char() method, and matches
what it reads against the keywords stored in the Tokenizer class to find
out whether the text is a keyword, an identifier, or a literal. To do
so, the lexer passes each character to the
is_identifier_start_character(int c) method to check whether it can
start an identifier. If it can, the lexer calls the
consume_identifier(int s) method to consume the whole identifier. The
logic is shown in Code-Listing 7.

Code-Listing 7: Identifier check in cs-tokenizer.cs

if (is_identifier_start_character (c)) {
   tokens_seen = true;
   return consume_identifier (c);
}

While the lexer tries to consume the identifier, it will try to find out
if there is any keyword match as shown in Code-Listing 8 by calling the
GetKeyword method.

Code-Listing 8: consume_identifier of cs-tokenizer.cs

private int consume_identifier (int c, bool quoted)
{
    while ((c = get_char ()) != -1) {
        // code has been removed here for simplicity
    }
    if (id_builder [0] >= '_' && !quoted) {
        int keyword = GetKeyword (id_builder, pos);
        if (keyword != -1) {
            // TODO: No need to store location for keyword, required location cleanup
            val = loc;
            return keyword;
        }
    }
    //......
    CharArrayHashtable identifiers_group = identifiers [pos];
    if (identifiers_group != null) {
        val = identifiers_group [id_builder];
        if (val != null) {
            val = new LocatedToken (loc, (string) val);
            if (quoted)
                AddEscapedIdentifier ((LocatedToken) val);
            return Token.IDENTIFIER;
        }
    }
    //.................
    val = new String (id_builder, 0, pos);
    identifiers_group.Add (chars, val);
    //................
    val = new LocatedToken (loc, (string) val);
    if (quoted)
        AddEscapedIdentifier ((LocatedToken) val);
    return Token.IDENTIFIER;
}

If the token is not a keyword, the lexer marks it as an identifier and
returns IDENTIFIER as the token type. The word it consumed from the
stream is stored in the val field of the lexer, i.e., the Tokenizer
class (cs-tokenizer.cs). val is a private variable of type object, and
it is exposed through the value() method of the Tokenizer class listed
in Code-Listing 9.

Code-Listing 9: value() method of cs-tokenizer.cs

public Object value ()
{
  return val;
}
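
The identifier and keyword logic described above can be sketched in a few lines of C#. This is only an illustration, not the Mono implementation: the MiniTokenizer class, its helper methods, and the token codes are hypothetical names and values chosen for the example, and only two keywords are registered.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical, heavily simplified stand-in for Mono's Tokenizer class.
public class MiniTokenizer
{
    // Illustrative token codes; the real values are assigned by Jay.
    public const int EOF = 0;
    public const int USING = 1;
    public const int NAMESPACE = 2;
    public const int IDENTIFIER = 3;
    public const int OTHER = 4;

    static readonly Dictionary<string, int> keywords = new Dictionary<string, int> {
        { "using", USING },
        { "namespace", NAMESPACE }
    };

    readonly string source;
    int pos;
    object val;   // payload of the most recent token

    public MiniTokenizer (string source)
    {
        this.source = source;
    }

    static bool IsIdentifierStart (int c)
    {
        return c == '_' || (c != -1 && char.IsLetter ((char) c));
    }

    static bool IsIdentifierPart (int c)
    {
        return c == '_' || (c != -1 && char.IsLetterOrDigit ((char) c));
    }

    // Returns the current character code without consuming it, or -1 at end.
    int PeekChar ()
    {
        return pos < source.Length ? source [pos] : -1;
    }

    // Returns the next token code and stores its text in val.
    public int Token ()
    {
        while (PeekChar () != -1 && char.IsWhiteSpace ((char) PeekChar ()))
            pos++;

        int c = PeekChar ();
        if (c == -1) {
            val = null;
            return EOF;
        }

        if (IsIdentifierStart (c))
            return ConsumeIdentifier ();

        // Anything else is returned as a single-character token here.
        val = ((char) c).ToString ();
        pos++;
        return OTHER;
    }

    int ConsumeIdentifier ()
    {
        var builder = new StringBuilder ();
        while (IsIdentifierPart (PeekChar ()))
            builder.Append (source [pos++]);

        string text = builder.ToString ();
        val = text;

        int keyword;
        if (keywords.TryGetValue (text, out keyword))
            return keyword;       // a keyword match wins
        return IDENTIFIER;        // otherwise it is a plain identifier
    }

    public object Value ()
    {
        return val;
    }
}
```

Calling Token() repeatedly on new MiniTokenizer ("using System;") yields the keyword code for using, then IDENTIFIER for System, and Value() returns the matching text after each call.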

After finishing these checks, the lexer returns the token to the
parser, which continues parsing. When the parser receives a token value
from the lexer (for example, 418 for IDENTIFIER for the example listed
in Code-Listing 5.18; see appendix for details of the Mono tokens), it
treats the token as an identifier and tries to access the val
associated with it. To do so, the parser calls value() on the lexer and
assigns the result to yyVal (a local variable of type object defined in
the yyparse method). Each yyVal is stored in the yyVals array inside
the yyparse method of the CSharpParser class. The values stored in
yyVals are used later as substitution values for the grammar.
Code-Listing 10 shows how yyVal is stored in the yyVals array. Note:
Substitution is how the parser and the grammar communicate, i.e., how a
value is passed into a grammar action. For example, to substitute the
variables $1, $2, or $3 used in a grammar action, the parser has to
supply the corresponding values. Here, yyVals holds all of those
substitution values, one per token.

Code-Listing 10: Source code of the yyparse method of cs-parser.cs

internal Object yyparse(yyParser.yyInput yyLex)
{
   /*……….*/
   for (int yyTop = 0; ; ++yyTop)
   {
      /*……….*/
      yyVals[yyTop] = yyVal;
      if (debug != null) debug.push(yyState, yyVal);
      /*……….*/
   }
}

The parser reads the value from the Tokenizer using the
yyVal = yyLex.value() statement and then stores yyVal in yyVals, as
shown in Code-Listing 10. The communication between the lexer and the
parser happens on demand: whenever the parser requires a token, it asks
for one by calling the xtoken() method of the lexer, and the lexer runs
xtoken() to produce it. When the parser gets a token from the lexer, it
calculates the yyN value. yyN is used to select the appropriate grammar
action block. It is one of the most important variables in the parser
because it maps the token value that the lexer returns for the source
code file to the grammar (the language specification, for example,
cs-parser.jay). Using the token value, the parser looks for the
matching grammar action block in the yyparse method (the switch case
statements generated by the compiler-compiler Jay). If a case statement
matches, the parser executes the code block defined under that case.
This code block creates the related abstract syntax tree node, for
example, a Statement object or an Expression object, and adds it to the
TypeContainer.
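
To make the on-demand hand-off concrete, here is a minimal sketch of a parser loop that pulls one token at a time and runs a matching action block. It is a simplification under stated assumptions: ITokenSource, ArrayTokenSource, and PullParser are hypothetical names, the token codes are made up, and the switch dispatches directly on the token code, whereas the generated Mono parser first derives yyN from its tables.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interface, loosely modeled on the lexer contract described above.
public interface ITokenSource
{
    bool Advance ();   // move to the next token; false at end of input
    int Token ();      // code of the current token
    object Value ();   // payload of the current token
}

// A token source backed by arrays, so the example needs no real lexer.
public class ArrayTokenSource : ITokenSource
{
    readonly int[] codes;
    readonly object[] values;
    int index = -1;

    // Counts how often the parser asked for a token.
    public int Requests { get; private set; }

    public ArrayTokenSource (int[] codes, object[] values)
    {
        this.codes = codes;
        this.values = values;
    }

    public bool Advance ()
    {
        Requests++;
        return ++index < codes.Length;
    }

    public int Token () { return codes [index]; }
    public object Value () { return values [index]; }
}

public static class PullParser
{
    public const int KEYWORD = 1;      // illustrative codes only
    public const int IDENTIFIER = 2;

    // Pulls one token at a time and runs the matching "action block".
    public static List<string> Run (ITokenSource lexer)
    {
        var nodes = new List<string> ();
        while (lexer.Advance ()) {
            object yyVal = lexer.Value ();
            switch (lexer.Token ()) {
            case KEYWORD:
                nodes.Add ("Keyword(" + yyVal + ")");
                break;
            case IDENTIFIER:
                nodes.Add ("Identifier(" + yyVal + ")");
                break;
            default:
                nodes.Add ("Unknown(" + yyVal + ")");
                break;
            }
        }
        return nodes;
    }
}
```

The Requests counter shows that the lexer does no work ahead of time: for two tokens the parser asks three times, the last request reporting the end of input.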

In short, the parser executes a grammar action block whenever the yyN
value derived from the token matches the corresponding case. For
example, the grammar file cs-parser.jay contains the grammar shown in
Code-Listing 11, at line 1265, for the method declaration.

Code-Listing 11: Grammar declaration of the Method in cs-parser.jay

method_declaration
    : method_header {
        if (RootContext.Documentation != null)
            Lexer.doc_state = XmlCommentState.NotAllowed;
      }
      method_body
      {
        Method method = (Method) $1;
        method.Block = (ToplevelBlock) $3;
        current_container.AddMethod (method);
        if (current_container.Kind == Kind.Interface && method.Block != null) {
            Report.Error (531, method.Location, "`{0}': interface members cannot have a definition", method.GetSignatureForError ());
        }
        current_generic_method = null;
        current_local_parameters = null;
        if (RootContext.Documentation != null)
            Lexer.doc_state = XmlCommentState.Allowed;
      };

The action for the method grammar specified in the grammar file, in
this case cs-parser.jay, also appears in the cs-parser.cs file inside a
case statement. Here the case value is 159 (a number assigned by the
compiler-compiler Jay while converting cs-parser.jay into
cs-parser.cs), as shown in Code-Listing 12. The code block defined for
the method in the grammar executes whenever the lexer returns a token
that leads to a yyN value of 159 (the yyN value is derived from the
token). This code block adds an instance of the Method class to the
type container.

Code-Listing 12: Partial code block from the yyparse method

switch (yyN)
{
case 159:
#line 1265 "cs-parser.jay"
    {
        Method method = (Method)yyVals[-2 + yyTop];
        method.Block = (ToplevelBlock)yyVals[0 + yyTop];
        current_container.AddMethod(method);
        if (current_container.Kind == Kind.Interface && method.Block != null)
        {
            Report.Error(531, method.Location,
                "`{0}': interface members cannot have a definition",
                method.GetSignatureForError());
        }
        current_generic_method = null;
        current_local_parameters = null;
        if (RootContext.Documentation != null)
            Lexer.doc_state = XmlCommentState.Allowed;
    }
    break;
}

From Code-Listing 12, we can see how the generated parser and the
cs-parser.jay grammar communicate through substitution variables.
Code-Listing 11 uses two substitution variables: $1, which is replaced
by (Method)yyVals[-2 + yyTop], and $3, which is replaced by
(ToplevelBlock)yyVals[0 + yyTop] in Code-Listing 12. This is how the
grammar is matched against the tokens returned by the lexer and how the
action blocks are executed according to the grammar. The same process
continues until the lexer reaches the end of the source code file and
has no more tokens to return.
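
The index arithmetic behind this substitution can be sketched as follows. The rule has three right-hand-side positions (method_header, the embedded action, and method_body), so $1 sits two slots below the top of the value stack and $3 sits at the top. The Method class and the ValueStackDemo helpers below are hypothetical stand-ins written for this illustration; they are not Mono types.

```csharp
using System;

// Hypothetical stand-in for the AST node built by the action block.
public class Method
{
    public string Name;
    public string Block;
}

public static class ValueStackDemo
{
    // For a rule with ruleLength right-hand-side positions, $position is
    // found at this offset from the top of the value stack.
    public static int OffsetFor (int position, int ruleLength)
    {
        return position - ruleLength;
    }

    // Reads the value that $position refers to.
    public static object Dollar (object[] yyVals, int yyTop, int position, int ruleLength)
    {
        return yyVals [OffsetFor (position, ruleLength) + yyTop];
    }

    // Mimics the action block: method = $1; method.Block = $3.
    public static Method Reduce (object[] yyVals, int yyTop)
    {
        Method method = (Method) Dollar (yyVals, yyTop, 1, 3);
        method.Block = (string) Dollar (yyVals, yyTop, 3, 3);
        return method;
    }
}
```

With a stack holding a Method at index 1, the embedded action's value at index 2, and the body at index 3 (yyTop = 3), OffsetFor (1, 3) is -2 and OffsetFor (3, 3) is 0, matching the indices in the generated code.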

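The Clang “Basic” Library

This library certainly needs a better name. The “basic” library
contains a number of low-level utilities for tracking and manipulating
source buffers, locations within the source buffers, diagnostics,
tokens, target abstraction, and information about the subset of the
language being compiled.

Part of this library is specific to C (such as the TargetInfo class);
other parts could be reused for other non-C-based languages
(SourceLocation, SourceManager, Diagnostics, FileManager). If there is
future demand, we can figure out whether it makes sense to introduce a
new library, move the general classes somewhere else, or introduce some
other solution.

We describe the roles of these classes in order of their dependencies.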

Debug Mono compilation

We will experiment with the source code in Code-Listing 13 and try to
understand the following two basic things by debugging the Mono compiler
using Visual Studio 2010 as the IDE:

  • How does Mono retrieve tokens and parse source code?
  • How does it build the AST?

The following ClassToParse class listed in Code-Listing 13 is written
using C# and will be used as the source code for this experiment.
ClassToParse is a simple program which has a using statement and a
namespace declaration. It also defines a class and inside of the class,
it has a Main method which is the entry point.

Code-Listing 13: Source code to display Hello! world on the Console.

using System;
namespace gmcs
{
    public class ClassToParse
    {
        public static int Main (string[] args)
        {
            Console.WriteLine("Hello! World.");
            return 1;
        }
    }
}

The above ClassToParse program will be used for this experiment. To
start debugging, we need to do a bit of groundwork: modify the
Main (string[] args) method in the driver.cs file of the Mono source
code as shown in Code-Listing 14.

Code-Listing 14: Source code Main method of the driver.cs

public static int Main(string[] args)
{
    Location.InEmacs = Environment.GetEnvironmentVariable("EMACS") == "t";
    args = new string[] { @"C:\Temp\ClassToParse.cs", @"-out:C:\Temp\Out.exe" };

    Driver d = Driver.Create(args, true, new ConsoleReportPrinter());
    if (d == null)
        return 1;
    if (d.Compile() && d.Report.Errors == 0)
    {
        if (d.Report.Warnings > 0)
        {
            Console.WriteLine("Compilation succeeded - {0} warning(s)", d.Report.Warnings);
        }
        Environment.Exit(0);
        return 0;
    }
    Console.WriteLine("Compilation failed: {0} error(s), {1} warnings",
        d.Report.Errors, d.Report.Warnings);
    Environment.Exit(1);
    return 1;
}

In Code-Listing 14, I hard-coded the args[] array with the path of the
ClassToParse.cs file (C:\Temp\ClassToParse.cs) and the output filename
option, in this case Out.exe. Put a breakpoint on the
if (d.Compile() && d.Report.Errors == 0) line and start debugging. The
Mono compiler then calls the Compile() method of the Driver object,
i.e., d.Compile(), which in turn calls other methods to carry out the
compilation. If we look into the Compile() method of driver.cs, we can
see that its major functionality is as below:

Code-Listing 15: Compile method of the driver.cs

public bool Compile ()
{
    RootContext.ToplevelTypes = new ModuleContainer (ctx, RootContext.Unsafe);
    Parse ();
    //....
    ProcessDefaultConfig ();
    //
    // Load assemblies required
    //
    LoadReferences ();
    // The second pass of the compiler
    RootContext.ResolveTree ();
    //...
    RootContext.PopulateTypes ();
    //
    // Verify using aliases now
    //
    NamespaceEntry.VerifyAllUsing ();
    //....
    CodeGen.Assembly.Resolve ();
    //
    // The code generator
    //
    RootContext.CloseTypes ();
    //....
    CodeGen.Save (output_file, want_debugging_support, Report);
    //....
}

Depending on the tokenize flag inside the Parse() method, the driver
either only tokenizes each file by calling tokenize_file or goes on to
parse it.

Code-Listing 16: Parse method of the driver.cs

public void Parse()
{
    Location.Initialize();

    ArrayList cu = Location.SourceFiles;
    for (int i = 0; i < cu.Count; ++i)
    {
        if (tokenize)
        {
            // MoRe: Step 3
            tokenize_file((CompilationUnit)cu[i], ctx);
        }
        else
        {
            Parse((CompilationUnit)cu[i]);
        }
    }
}

Finally, the driver calls parse() of the CSharpParser class, which has
been generated by Jay.

Code-Listing 17: Parse method of the Mono driver

void Parse(SeekableStreamReader reader, CompilationUnit file)
{
    CSharpParser parser = new CSharpParser(reader, file, ctx);
    try
    {
        parser.parse();
    }
    catch (Exception ex)
    {
        Report.Error(589, parser.Lexer.Location,
            "Compilation aborted in file `{0}', {1}", file.Name, ex);
    }
}

Every rule in the grammar specified in cs-parser.jay has an action
block, and the yyparse method maps each rule to its associated action
(please see the Appendix for a full listing of the grammar for the Mono
C# compiler) as a case of the switch statement. Depending on the yyN
value (derived from the token value), the related action is executed to
build the abstract syntax tree. Figure 3 shows how Mono consumes a
token: Compile() calls Parse() of driver.cs and then yyparse of the
CSharpParser class. yyparse consumes the token from the input stream by
calling the xtoken() method of the Tokenizer class.

Figure 3: Stack trace of the Compile() method

Before we go ahead, we will have a look at the process that takes place
inside the xtoken() method. In the first phase of file reading, the
Tokenizer reads the first character from the stream, which is 117.
Here, 117 is the ASCII code of u (please see the Appendix for the
complete list of ASCII and decimal value tables). Figure 4 shows that
the current value of c (the character being tokenized) is 117, which is
u, the first character of the using statement in the ClassToParse
class.

Figure 4: Tokenizing the ClassToParse class
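
The character codes seen in the debugger can be reproduced with a short sketch. CharCodes is a hypothetical helper written for this illustration; it simply casts each character of a string to its integer code, which is the form the tokenizer works with.

```csharp
using System;
using System.Collections.Generic;

public static class CharCodes
{
    // Returns the integer code of every character in text, in order.
    public static List<int> CodesOf (string text)
    {
        var codes = new List<int> ();
        foreach (char ch in text)
            codes.Add ((int) ch);
        return codes;
    }
}
```

For the word using, CodesOf returns 117, 115, 105, 110, 103, so the first value the tokenizer sees for ClassToParse.cs is 117.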

As 117 is not a standard token, it is validated as the start of an
identifier and the tokenizer starts consuming the identifier, as shown
in Figure 5.

Figure 5: Tokenizing the using statement

After consuming the identifier, the tokenizer compares it with the
keywords stored in the Tokenizer class to find out whether it is a
keyword, as shown in Figure 6.

Figure 6: Keyword matching

It is identified as a keyword, because Mono has a keyword with the
value 335 (please see the Appendix for a full list of tokens). Finally,
the lexer returns the token value 335, which represents the using
keyword. Figure 7 shows the return statement of the token method of the
lexer, which returns 335 as the current token value.

Figure 7: Current Token value from the token() of the cs-tokenizer.cs

The parser now tries to find a case that corresponds to this token
value; if there is one, it executes the related action block defined as
part of the grammar.

Before the parser can execute the action block, it has to calculate the
yyN value; as we saw earlier, yyN is the mapping between the token
value and the grammar. The code the parser uses to calculate yyN is
listed in Code-Listing 18.

Code-Listing 18: yyN calculation based on yyTable

if ((yyN = yyRindex[yyState]) != 0 && (yyN += yyToken) >= 0
     && yyN < yyTable.Length && yyCheck[yyN] == yyToken)
          yyN = yyTable[yyN];

Figure 8 shows the watch window while debugging the compiler with the
token value 374. In this calculation, the parser retrieves the yyN
value from the yyTable array (yyTable was created while generating the
parser using Jay).

Figure 8: yyN value in the watch list

The yyN calculation is another interesting piece of work in the Mono
compiler. Given yyState = 33 and yyToken = 374, we first read
yyRindex[33], which is 450. yyN is now 450, and the second part of the
condition, (yyN += yyToken) >= 0, adds the yyToken value to it.

The resulting yyN value is 824 (the previous yyN value, 450, plus the
current token, 374). This 824 is used as the index to retrieve the
value stored at that position in yyTable (created by the
compiler-compiler Jay while converting cs-parser.jay to cs-parser.cs).
The retrieved value becomes the new value of yyN, which is used as the
switch case selector to execute the action block. Code-Listing 19 lists
the arrays involved. All of them were created by the compiler-compiler
Jay while converting the grammar file into the parser.

Note: These arrays are generated by Jay for the Mono parser.

Code-Listing 19: Arrays generated by Jay

static short[] yyLhs
static short[] yyLen
static short[] yyDefRed
protected static short[] yyDgoto
protected static short[] yySindex
protected static short[] yyRindex
protected static short[] yyGindex
protected static short[] yyTable
protected static short[] yyCheck
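
The table lookup from Code-Listing 18 can be tried out in isolation with a small sketch. ActionLookup is a hypothetical helper; its tables are tiny hand-filled arrays, not the ones Jay generates, and the value 7 stored in yyTable is an arbitrary placeholder rather than a real case number. Returning 0 when no entry applies is a simplification made for this example.

```csharp
using System;

public static class ActionLookup
{
    // Mirrors the condition in Code-Listing 18; returns 0 when no entry applies.
    public static int Lookup (short[] yyRindex, short[] yyTable, short[] yyCheck,
                              int yyState, int yyToken)
    {
        int yyN;
        if ((yyN = yyRindex [yyState]) != 0 && (yyN += yyToken) >= 0
            && yyN < yyTable.Length && yyCheck [yyN] == yyToken)
            return yyTable [yyN];
        return 0;
    }

    // Builds tiny tables that reproduce the numbers from the walkthrough.
    public static int Demo ()
    {
        var yyRindex = new short [34];
        var yyTable = new short [825];
        var yyCheck = new short [825];

        yyRindex [33] = 450;   // base offset for state 33
        yyCheck [824] = 374;   // slot 824 (450 + 374) belongs to token 374
        yyTable [824] = 7;     // arbitrary placeholder, not the real entry

        return Lookup (yyRindex, yyTable, yyCheck, 33, 374);
    }
}
```

The yyCheck comparison is what makes the packed table safe to share between states: slot 824 only counts as a hit because it was recorded for token 374.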

Figure 9 shows how yyN selects the case statement defined in the
yyparse method to execute the related code block defined in the
grammar, i.e., the cs-parser.jay file.

Figure 9: Token Mapping

We can see from Figure 10 how Mono constructs the abstract syntax tree
while parsing the source code of a program. Each time the parser finds a
valid token and a yyN value, it will match with the condition to run
the related action block which adds the related type (based on the
grammar specification, please see the Appendix for a full listing of
grammar for the Mono C# compiler) into the TypeContainer which is
later on used to resolve the Types.

Figure 10: Watching

The Compile() method then calls the ResolveTree method of the
RootContext type from the rootcontext.cs file.

Code-Listing 20: ResolveTree method of RootContext

RootContext.ResolveTree ();

The ResolveTree method will generate the hierarchy tree or parse tree.
And later on, the Compile() method calls PopulateTypes of the
RootContext class. So far we have seen how Mono tokenizes the source
code, parses the source code, and based on it how it constructs the
Abstract Syntax Tree. In the next section, we will see how Mono
generates Intermediate Language (IL) code to generate the assembly.
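
As a recap, the order of the phases and the early exit on errors can be sketched as below. MiniDriver is a hypothetical class that only records phase names; the names follow the calls in the Compile() listings, and the failInParse flag is an invented switch that simulates a reported error.

```csharp
using System;
using System.Collections.Generic;

public class MiniDriver
{
    public int Errors;                                    // stand-in for Report.Errors
    public readonly List<string> Log = new List<string> ();

    void Phase (string name)
    {
        Log.Add (name);
    }

    // failInParse simulates a syntax error so the early exit can be observed.
    public bool Compile (bool failInParse)
    {
        Phase ("Parse");
        if (failInParse)
            Errors++;
        Phase ("ResolveTree");
        Phase ("PopulateTypes");
        if (Errors > 0)
            return false;          // nothing is emitted when errors were reported
        Phase ("EmitCode");
        Phase ("CloseTypes");
        Phase ("Save");
        return Errors == 0;
    }
}
```

A clean run logs six phases ending with Save, while a run with a simulated error stops after PopulateTypes and returns false.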

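The Diagnostics Subsystem

The Clang diagnostics subsystem is an important part of how the
compiler communicates with the human. Diagnostics are the warnings and
errors produced when the code is incorrect or dubious. In Clang, each
diagnostic has (at the very least) a unique ID, an English translation
associated with it, a SourceLocation to “put the caret”, and a severity
(e.g., WARNING or ERROR). Diagnostics can also optionally include a
number of arguments (which fill in the “%0” markers in the string) as
well as a number of source ranges related to the diagnostic.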

In this section, we’ll be giving examples produced by the Clang command
line driver, but diagnostics can be rendered in many different ways
depending on how the DiagnosticClient interface is implemented. A
representative example of a diagnostic is:

t.c:38:15: error: invalid operands to binary expression ('int *' and '_Complex float')
  P = (P-42) + Gamma*4;
      ~~~~~~ ^ ~~~~~~~

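In this example, you can see the English translation, the severity
(error), the source location (the caret “^” and the file/line/column
information), the source ranges “~~~~”, and the arguments to the
diagnostic (“int *” and “_Complex float”). You’ll have to believe me
that there is a unique ID backing the diagnostic.

Getting all of this to happen takes several steps and involves many
moving pieces. This section describes them and discusses best practices
when adding a new diagnostic.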

References

  • CodeProject – Introduction to Mono – Your first Mono app
  • CodeProject – MONO: an alternative for the .NET framework
  • CodeProject – Hacking the Mono C# Compiler
  • CodeProject – Dynamic Type Using Reflection.Emit
  • CodeProject – Introduction to IL Assembly Language
  • Gough J. (2002). Compiling for the .NET Common Language Runtime
    (CLR). Prentice Hall PTR.
  • man jay (Commandes) – an LALR(1) parser generator for Java and C#
  • Yacc: Yet Another Compiler-Compiler
  • Rahman M. (2012). Understanding Mono C# Compiler.

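The Diagnostic*Kinds.td files

Diagnostics are created by adding an entry to one of the
clang/Basic/Diagnostic*Kinds.td files, depending on which library will
be using it. From this file, tblgen generates the unique ID of the
diagnostic, the severity of the diagnostic, and the English translation
plus format string.

There is little sanity in the naming of the unique IDs right now. Some
start with err_, warn_, or ext_ to encode the severity into the name.
Since the enum is referenced in the C++ code that produces the
diagnostic, it is somewhat useful for it to be reasonably short.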

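The severity of the diagnostic comes from the set {NOTE, WARNING,
EXTENSION, EXTWARN, ERROR}. The ERROR severity is used for diagnostics
indicating that the program is never acceptable under any circumstances.
When an error is emitted, the AST for the input code may not be fully
built. The EXTENSION and EXTWARN severities are used for extensions to
the language that Clang accepts. This means that Clang fully understands
them and can represent them in the AST, but we produce diagnostics to
tell the user that their code is non-portable. The difference is that
the former are ignored by default, while the latter warn by default. The
WARNING severity is used for constructs that are valid in the currently
selected source language but are dubious in some way. The NOTE level is
used to staple more information onto previous diagnostics.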

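These severities are mapped into a smaller set of output levels
(the Diagnostic::Level enum, {Ignored, Note, Warning, Error, Fatal}) by
the diagnostics subsystem, based on various configuration options. Clang
internally supports a fully fine-grained mapping mechanism that allows
you to map almost any diagnostic to the output level that you want. The
only diagnostics that cannot be mapped are NOTEs, which always follow
the severity of the previously emitted diagnostic, and ERRORs, which can
only be mapped to Fatal (it is not possible to turn an error into a
warning, for example).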

Diagnostic mappings are used in many ways. For example, if the user
specifies -pedantic, EXTENSION maps to Warning; if they
specify -pedantic-errors, it turns into Error. This mechanism is used to
implement options like -Wunused_macros, -Wundef, etc.

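Mapping to Fatal should only be used for diagnostics that are considered
so severe that error recovery won’t be able to recover sensibly from
them (and would otherwise spew a ton of bogus errors). One example of
this class of error is a failure to #include a file.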
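The mapping rules above can be sketched in a few lines of C++. This is a simplified illustration and not Clang’s implementation: the two enums mirror the sets named in the text, while Options and mapSeverity are hypothetical names. Two points are assumptions of the sketch: that EXTWARN also becomes an Error under -pedantic-errors, and that a NOTE attached to an ignored diagnostic is itself ignored.

```cpp
#include <cassert>

// Severity classes and output levels as listed in the text.
enum class Severity { Note, Warning, Extension, ExtWarn, Error };
enum class Level { Ignored, Note, Warning, Error, Fatal };

// Hypothetical option set; only the two flags mentioned above are modeled.
struct Options {
  bool pedantic;        // -pedantic
  bool pedanticErrors;  // -pedantic-errors
};

// Maps a severity to an output level. `primary` is the level of the
// diagnostic that a NOTE is attached to; it is unused for other severities.
Level mapSeverity(Severity severity, Options opts, Level primary) {
  switch (severity) {
  case Severity::Note:
    assert(primary != Level::Note && "a note attaches to a non-note diagnostic");
    // Assumption: a note is dropped together with the diagnostic it follows.
    return primary == Level::Ignored ? Level::Ignored : Level::Note;
  case Severity::Warning:
    return Level::Warning;
  case Severity::Extension:
    if (opts.pedanticErrors)
      return Level::Error;
    return opts.pedantic ? Level::Warning : Level::Ignored;
  case Severity::ExtWarn:
    // Assumption: -pedantic-errors upgrades EXTWARN as well.
    return opts.pedanticErrors ? Level::Error : Level::Warning;
  case Severity::Error:
    // An error can be raised to Fatal elsewhere, but never lowered.
    return Level::Error;
  }
  return Level::Error;  // not reached; keeps compilers quiet
}
```

With no flags set, mapSeverity(Severity::Extension, Options{false, false}, Level::Ignored) yields Level::Ignored, and the same call with Options{true, false} yields Level::Warning.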

License

This article, along with any associated source code and files, is
licensed under The Code Project Open License (CPOL).

The Format String

The format string for the diagnostic is very simple, but it has some
power. It takes the form of a string in English with markers that
indicate where and how arguments to the diagnostic are inserted and
formatted. For example, here are some simple format strings:

    "binary integer literals are an extension"
    "format string contains '\\0' within the string body"
    "more '%%' conversions than data arguments"
    "invalid operands to binary expression (%0 and %1)"
    "overloaded '%0' must be a %select{unary|binary|unary or binary}2 operator"
         " (has %1 parameter%s1)"

These examples show some important points of format strings. You can use
any plain ASCII character in the diagnostic string except “%” without
a problem, but these are C strings, so you have to use and be aware of
all the C escape sequences (as in the second example). If you want to
produce a “%” in the output, use the “%%” escape sequence, like the
third diagnostic. Finally, Clang uses the “%...[digit]” sequences to
specify where and how arguments to the diagnostic are formatted.

Arguments to the diagnostic are numbered according to how they are
specified by the C++ code that produces them, and are referenced
by %0 .. %9. If you have more than 10 arguments to your diagnostic,
you are doing something wrong :). Unlike printf, there is no
requirement that arguments to the diagnostic end up in the output in the
same order as they are specified; you could have a format string with
“%1 %0” that swaps them, for example. The text between the percent sign
and the digit holds the formatting instructions. If there are no
instructions, the argument is just turned into a string and substituted
in.
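To make the numbering rule concrete, here is a minimal sketch of positional substitution. It is not Clang’s formatter: expandPositional is a hypothetical helper that understands only “%%” and “%0” .. “%9”, takes arguments that are already strings, and asserts on anything else.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Expands "%%" to "%" and "%N" (N = 0..9) to args[N]. Formatting
// instructions between '%' and the digit are deliberately not handled.
std::string expandPositional(const std::string &format,
                             const std::vector<std::string> &args) {
  std::string out;
  for (std::size_t i = 0; i < format.size(); ++i) {
    char c = format[i];
    if (c != '%') {
      out += c;
      continue;
    }
    // A '%' must be followed by another '%' or by a digit.
    assert(i + 1 < format.size() && "dangling '%' in format string");
    char next = format[++i];
    if (next == '%') {
      out += '%';
      continue;
    }
    assert(std::isdigit(static_cast<unsigned char>(next)) &&
           "only %% and %0..%9 are supported in this sketch");
    std::size_t index = static_cast<std::size_t>(next - '0');
    assert(index < args.size() && "format refers to a missing argument");
    out += args[index];
  }
  return out;
}
```

Because arguments are looked up by index, expandPositional("%1 %0", {"a", "b"}) produces "b a", which is the swap described above.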

Here are some “best practices” for writing the English format string:

  • Keep the string short. It should ideally fit in the 80 column limit
    of the DiagnosticKinds.td file. This avoids the diagnostic
    wrapping when printed, and forces you to think about the important
    point you are conveying with the diagnostic.
  • Take advantage of location information. The user will be able to see
    the line and location of the caret, so you don’t need to tell them
    that the problem is with the 4th argument to the function: just
    point to it.
  • Do not capitalize the diagnostic string, and do not end it with a
    period.
  • If you need to quote something in the diagnostic string, use single
    quotes.

Diagnostics should never take random English strings as arguments: you
shouldn’t use “you have a problem with %0” and pass in things like
“your argument” or “your return value” as arguments. Doing this
prevents translating the Clang diagnostics to other languages (because
they’ll get random English words in their otherwise localized
diagnostic). The exceptions to this are C/C++ language keywords
(e.g., auto, const, mutable, etc.) and C/C++ operators (/=). Note
that things like “pointer” and “reference” are not keywords. On the
other hand, you can include anything that comes from the user’s source
code, including variable names, types, labels, etc. The “select”
format can be used to achieve this sort of thing in a localizable way;
see below.

Formatting a Diagnostic Argument

Arguments to diagnostics are fully typed internally, and come from a
couple of different classes: integers, types, names, and random strings.
Depending on the class of the argument, it can be optionally formatted
in different ways. This gives the DiagnosticClient information about
what the argument means without requiring it to use a specific
presentation (consider this MVC for Clang :).

Here are the different diagnostic argument formats currently supported
by Clang:

“s” format

Example:
"requires %1 parameter%s1"

Class:
Integers

Description:
This is a simple formatter for integers that is useful when producing
English diagnostics. When the integer is 1, it prints as nothing. When
the integer is not 1, it prints as “s”. This allows some simple
grammatical forms to be handled correctly, and eliminates the need
to use gross things like "requires %1 parameter(s)".

“select” format

Example:
"must be a %select{unary|binary|unary or binary}2 operator"

Class:
Integers

Description:
This format specifier is used to merge multiple related diagnostics
together into one common one, without requiring the difference to be
specified as an English string argument. Instead of specifying the
string, the diagnostic gets an integer argument and the format string
selects the numbered option. In this case, the “%2” value must be an
integer in the range [0..2]. If it is 0, it prints “unary”; if it is 1,
it prints “binary”; if it is 2, it prints “unary or binary”. This allows
other language translations to substitute reasonable words (or entire
phrases) based on the semantics of the diagnostic instead of having to
do things textually. The selected string does undergo formatting.
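The selection step itself is small enough to sketch. The helpers below are hypothetical (splitOptions and selectOption are not Clang names); they take the text between the braces and the integer argument, and they do not handle nested braces or the formatting of the selected string that the real implementation performs.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits the text between the braces of %select{...} on '|'.
std::vector<std::string> splitOptions(const std::string &body) {
  std::vector<std::string> options;
  std::string current;
  for (char c : body) {
    if (c == '|') {
      options.push_back(current);
      current.clear();
    } else {
      current += c;
    }
  }
  options.push_back(current);  // the last option has no trailing '|'
  return options;
}

// Returns option number `value` (0-based), as %select does for its
// integer argument. An out-of-range value is treated as a programming error.
std::string selectOption(const std::string &body, unsigned value) {
  std::vector<std::string> options = splitOptions(body);
  assert(value < options.size() && "select argument out of range");
  return options[value];
}
```

For the example above, selectOption("unary|binary|unary or binary", 2) returns "unary or binary".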

“plural” format

Example:
"you have %1 %plural{1:mouse|:mice}1 connected to your computer"

Class:
Integers

Description:
This is a formatter for complex plural forms. It is designed to handle
even the requirements of languages with very complex plural forms, as
many Baltic languages have. The argument consists of a series of
expression/form pairs separated by “|”. Within each pair, the expression
and the form are separated by “:”, and the first form whose expression
evaluates to true is the result of the modifier.

An expression can be empty, in which case it is always true. See the
example at the top. Otherwise, it is a series of one or more numeric
conditions, separated by “,”. If any condition matches, the expression
matches. Each numeric condition can take one of three forms.

  • number: A simple decimal number matches if the argument is the same
    as the number. Example: "%plural{1:mouse|:mice}4"
  • range: A range in square brackets matches if the argument is within
    the range. The range is inclusive on both ends.
    Example: "%plural{0:none|1:one|[2,5]:some|:many}2"
  • modulo: A modulo operator is followed by a number, an equals sign,
    and either a number or a range. The tests are the same as for plain
    numbers and ranges, but the argument is taken modulo the number
    first.
    Example: "%plural{%100=0:even hundred|%100=[1,50]:lowerhalf|:everything else}1"

The parser is very unforgiving. Any syntax error, even stray whitespace,
causes an abort, as does a failure to match the argument against any
expression.
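The grammar described above is compact enough to evaluate with a small hand-written parser. The following is a sketch under stated assumptions, not the Clang implementation: Cursor, parseNumber, matchExpression and formatPlural are hypothetical names, the input is the text between the braces, and every syntax error trips an assert, mirroring the “unforgiving” behavior.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Cursor over one expression; the helpers below advance `pos` as they parse.
struct Cursor {
  const std::string &text;
  std::size_t pos;
  bool atEnd() const { return pos >= text.size(); }
  char peek() const {
    assert(!atEnd() && "unexpected end of plural expression");
    return text[pos];
  }
  void expect(char c) {
    assert(peek() == c && "syntax error in plural expression");
    ++pos;
  }
};

bool atDigit(const Cursor &cur) {
  return !cur.atEnd() &&
         std::isdigit(static_cast<unsigned char>(cur.text[cur.pos]));
}

unsigned parseNumber(Cursor &cur) {
  assert(atDigit(cur) && "expected a decimal number");
  unsigned value = 0;
  while (atDigit(cur))
    value = value * 10 + static_cast<unsigned>(cur.text[cur.pos++] - '0');
  return value;
}

// Matches `value` against a plain number or an inclusive [low,high] range.
bool matchNumberOrRange(Cursor &cur, unsigned value) {
  if (cur.peek() != '[')
    return parseNumber(cur) == value;
  cur.expect('[');
  unsigned low = parseNumber(cur);
  cur.expect(',');
  unsigned high = parseNumber(cur);
  cur.expect(']');
  return low <= value && value <= high;
}

// An empty expression is always true; otherwise any condition may match.
bool matchExpression(const std::string &expr, unsigned value) {
  if (expr.empty())
    return true;
  Cursor cur{expr, 0};
  bool matched = false;
  while (true) {
    unsigned operand = value;
    if (cur.peek() == '%') {  // modulo form: %N=number or %N=[low,high]
      cur.expect('%');
      unsigned modulus = parseNumber(cur);
      assert(modulus != 0 && "modulo by zero");
      cur.expect('=');
      operand = value % modulus;
    }
    if (matchNumberOrRange(cur, operand))
      matched = true;
    if (cur.atEnd())
      break;
    cur.expect(',');  // conditions are separated by ','
  }
  return matched;
}

// `body` is the text between the braces, e.g. "1:mouse|:mice".
std::string formatPlural(const std::string &body, unsigned value) {
  std::size_t start = 0;
  while (true) {
    std::size_t bar = body.find('|', start);
    std::string pair = body.substr(
        start, bar == std::string::npos ? std::string::npos : bar - start);
    std::size_t colon = pair.find(':');
    assert(colon != std::string::npos && "pair is missing ':'");
    if (matchExpression(pair.substr(0, colon), value))
      return pair.substr(colon + 1);
    assert(bar != std::string::npos && "no plural form matched");
    start = bar + 1;
  }
}
```

With the examples from the list, formatPlural("0:none|1:one|[2,5]:some|:many", 3) selects "some", and formatPlural("%100=0:even hundred|%100=[1,50]:lowerhalf|:everything else", 125) selects "lowerhalf" because 125 modulo 100 is 25.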

“ordinal” format

Example:
"ambiguity in %ordinal0 argument"

Class:
Integers

Description:
This is a formatter which represents the argument number as an ordinal:
the value 1 becomes 1st, 3 becomes 3rd, and so on.
Values less than 1 are not supported. This formatter is currently
hard-coded to use English ordinals.
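As an illustration of what “hard-coded English ordinals” means, the rule can be written as follows. formatOrdinal is a hypothetical helper, not the function Clang uses; the suffix rule is ordinary English usage, including the 11th, 12th and 13th exceptions.

```cpp
#include <cassert>
#include <string>

// Renders an integer as an English ordinal: 1st, 2nd, 3rd, 4th, 11th, 22nd...
std::string formatOrdinal(unsigned value) {
  assert(value >= 1 && "values less than 1 are not supported");
  const char *suffix = "th";
  unsigned lastTwo = value % 100;
  if (lastTwo < 11 || lastTwo > 13) {  // 11th, 12th and 13th keep "th"
    switch (value % 10) {
    case 1: suffix = "st"; break;
    case 2: suffix = "nd"; break;
    case 3: suffix = "rd"; break;
    default: break;
    }
  }
  return std::to_string(value) + suffix;
}
```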

“objcclass” format

Example:
"method %objcclass0 not found"

Class:
DeclarationName

Description:
This is a simple formatter that indicates the DeclarationName
corresponds to an Objective-C class method selector. As such, it prints
the selector with a leading “+”.

“objcinstance” format

Example:
"method %objcinstance0 not found"

Class:
DeclarationName

Description:
This is a simple formatter that indicates the DeclarationName
corresponds to an Objective-C instance method selector. As such, it
prints the selector with a leading “-”.

“q” format

Example:
"candidate found by name lookup is %q0"

Class:
NamedDecl *

Description:
This formatter indicates that the fully-qualified name of the
declaration should be printed, e.g., “std::vector” rather than
“vector”.

“diff” format

Example:
"no known conversion %diff{from $ to $|from argument type to parameter type}1,2"

Class:
QualType

Description:
This formatter takes two QualTypes and attempts to print a template
difference between the two. If tree printing is off, the text inside the
braces before the pipe is printed, with the formatted text replacing the
$. If tree printing is on, the text after the pipe is printed and a type
tree is printed after the diagnostic message.

It is really easy to add format specifiers to the Clang diagnostics
system, but they should be discussed before they are added. If you are
creating a lot of repetitive diagnostics and/or have an idea for a
useful formatter, please bring it up on the cfe-dev mailing list.

Producing the Diagnostic

Now that you’ve created the diagnostic in
the Diagnostic*Kinds.td file, you need to write the code that detects
the condition in question and emits the new diagnostic. Various
components of Clang (e.g., the preprocessor, Sema, etc.) provide a
helper function named “Diag”. It creates a diagnostic and accepts the
arguments, ranges, and other information that goes along with it.

For example, the binary expression error comes from code like this:

    if (various things that are bad)
      Diag(Loc, diag::err_typecheck_invalid_operands)
        << lex->getType() << rex->getType()
        << lex->getSourceRange() << rex->getSourceRange();

This shows the use of the Diag method: it takes a location
(a SourceLocation object) and a diagnostic enum value (which matches
the name from Diagnostic*Kinds.td). If the diagnostic takes arguments,
they are specified with the << operator: the first argument
becomes %0, the second becomes %1, etc. The diagnostic interface
allows you to specify arguments of many different types,
including int and unsigned for integer
arguments, const char* and std::string for string
arguments, DeclarationName and const IdentifierInfo * for
names, QualType for types, etc. SourceRanges are also specified with
the << operator, but do not have a specific ordering requirement.
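The streaming interface can be modeled with a small builder class. Everything in this sketch is a stand-in invented for illustration (PendingDiag, Range, plain strings for arguments, an integer for the location); Clang’s real builder carries typed arguments rather than strings. It shows only the two rules stated above: arguments are numbered in the order they are streamed, and ranges do not take part in that numbering.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a source range (byte offsets).
struct Range {
  unsigned begin, end;
};

// Collects what is streamed into a diagnostic before it is emitted.
class PendingDiag {
public:
  PendingDiag(unsigned location, std::string id)
      : location(location), id(std::move(id)) {}

  // The first streamed argument becomes %0, the second %1, and so on.
  PendingDiag &operator<<(const std::string &arg) {
    assert(args.size() < 10 && "only %0..%9 can be referenced");
    args.push_back(arg);
    return *this;
  }
  PendingDiag &operator<<(int arg) { return *this << std::to_string(arg); }

  // Ranges are kept separately, so they may appear anywhere in the chain.
  PendingDiag &operator<<(Range range) {
    ranges.push_back(range);
    return *this;
  }

  unsigned location;
  std::string id;
  std::vector<std::string> args;
  std::vector<Range> ranges;
};
```

A chain such as PendingDiag(15, "err_typecheck_invalid_operands") << "int *" << Range{4, 10} << "_Complex float" ends up with two arguments and one range, regardless of where the range appears.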

As you can see, adding and producing a diagnostic is pretty
straightforward. The hard part is deciding exactly what you need to say
to help the user, picking a suitable wording, and providing the
information needed to format it correctly. The good news is that the
call site that issues a diagnostic should be completely independent of
how the diagnostic is formatted and in what language it is rendered.

Fix-It Hints

In some cases, the front end emits diagnostics when it is clear that
some small change to the source code would fix the problem. For example,
a missing semicolon at the end of a statement or a use of deprecated
syntax that is easily rewritten into a more modern form. Clang tries
very hard to emit the diagnostic and recover gracefully in these and
other cases.

However, for these cases where the fix is obvious, the diagnostic can be
annotated with a hint (referred to as a “fix-it hint”) that describes
how to change the code referenced by the diagnostic to fix the problem.
For example, it might add the missing semicolon at the end of the
statement or rewrite the use of a deprecated construct into something
more palatable. Here is one such example from the C++ front end, where
we warn about the right-shift operator changing meaning from C++98 to
C++11:

    test.cpp:3:7: warning: use of right-shift operator ('>>') in template argument
                           will require parentheses in C++11
    A<100 >> 2> *a;
          ^
      (       )

Here, the fix-it hint is suggesting that parentheses be added, and
showing exactly where those parentheses would be inserted into the
source code. The fix-it hints themselves describe what changes to make
to the source code in an abstract manner, which the text diagnostic
printer renders as a line of “insertions” below the caret line. Other
diagnostic clients might choose to render the code differently (e.g.,
as markup inline) or even give the user the ability to automatically fix
the problem.

Fix-it hints on errors and warnings need to obey these rules:

  • Since they are automatically applied if -Xclang -fixit is passed
    to the driver, they should only be used when it’s very likely they
    match the user’s intent.
  • Clang must recover from errors as if the fix-it had been applied.

If a fix-it can’t obey these rules, put the fix-it on a note. Fix-its on
notes are not applied automatically.

All fix-it hints are described by the FixItHint class, instances of
which should be attached to the diagnostic using the << operator in
the same way that highlighted source ranges and arguments are passed to
the diagnostic. Fix-it hints can be created with one of three
constructors:

  • FixItHint::CreateInsertion(Loc, Code)

    Specifies that the given Code (a string) should be inserted
    before the source location Loc.

  • FixItHint::CreateRemoval(Range)

    Specifies that the code in the given source Range should be
    removed.

  • FixItHint::CreateReplacement(Range, Code)

    Specifies that the code in the given source Range should be
    removed, and replaced with the given Code string.
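To show how a client could apply such hints, here is a small sketch that models the three kinds of hint as edits on a string. Hint and applyHints are hypothetical names, byte offsets stand in for SourceLocation and source ranges, and overlapping hints are not handled; the real FixItHint class works on Clang’s own location types.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Offset-based stand-in for a fix-it hint: replace the half-open byte
// range [begin, end) with `code`.
struct Hint {
  std::size_t begin, end;
  std::string code;

  // Insert `code` before offset `loc`.
  static Hint insertion(std::size_t loc, std::string code) {
    return {loc, loc, std::move(code)};
  }
  // Remove the bytes in [begin, end).
  static Hint removal(std::size_t begin, std::size_t end) {
    return {begin, end, ""};
  }
  // Remove the bytes in [begin, end) and put `code` in their place.
  static Hint replacement(std::size_t begin, std::size_t end,
                          std::string code) {
    return {begin, end, std::move(code)};
  }
};

// Applies the hints from the back of the buffer to the front, so that
// offsets of hints not yet applied remain valid.
std::string applyHints(std::string source, std::vector<Hint> hints) {
  std::sort(hints.begin(), hints.end(),
            [](const Hint &a, const Hint &b) { return a.begin > b.begin; });
  for (const Hint &hint : hints) {
    assert(hint.begin <= hint.end && hint.end <= source.size());
    source.replace(hint.begin, hint.end - hint.begin, hint.code);
  }
  return source;
}
```

For the right-shift example, inserting "(" before offset 2 and ")" before offset 10 of "A<100 >> 2> *a;" gives "A<(100 >> 2)> *a;".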

The DiagnosticClient Interface

Once code generates a diagnostic with all of the arguments and the rest
of the relevant information, Clang needs to know what to do with it. As
previously mentioned, the diagnostic machinery goes through some
filtering to map a severity onto a diagnostic level, then (assuming the
diagnostic is not mapped to “Ignore”) it invokes an object that
implements the DiagnosticClient interface with the information.

It is possible to implement this interface in many different ways. For
example, the normal
Clang DiagnosticClient (named TextDiagnosticPrinter) turns the
arguments into strings (according to the various formatting rules),
prints out the file/line/column information and the string, then prints
out the line of code, the source ranges, and the caret. However, this
behavior isn’t required.

Another implementation of the DiagnosticClient interface is
the TextDiagnosticBuffer class, which is used when Clang is
in -verify mode. Instead of formatting and printing out the
diagnostics, this implementation just captures and remembers the
diagnostics as they fly by. Then -verify compares the list of produced
diagnostics to the list of expected ones. If they disagree, it prints
out its own output. Full documentation for the -verify mode can be
found in the Clang API documentation for VerifyDiagnosticConsumer.

There are many other possible implementations of this interface, and
this is why we prefer diagnostics to pass down rich structured
information in arguments. For example, an HTML output might want
declaration names to be linkified to where they come from in the source.
Another example is that a GUI might let you click on typedefs to expand
them. This application would want to pass significantly more information
about types through to the GUI than a simple flat string. The interface
allows this to happen.
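The split between producing a diagnostic and rendering it can be illustrated with a reduced model of the interface. All names here (Client, TextPrinter, BufferingClient, report) are invented for the sketch, and the Diagnostic struct carries a pre-formatted message, which is exactly the simplification the real interface avoids by passing structured arguments.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

enum class Level { Ignored, Note, Warning, Error, Fatal };

// Reduced stand-in for the information handed to a client.
struct Diagnostic {
  std::string file;
  unsigned line, column;
  Level level;
  std::string message;
};

// Simplified stand-in for the DiagnosticClient interface.
class Client {
public:
  virtual ~Client() = default;
  virtual void handle(const Diagnostic &diag) = 0;
};

// Prints "file:line:column: level: message", like the command line driver.
class TextPrinter : public Client {
public:
  explicit TextPrinter(std::ostream &os) : os_(os) {}
  void handle(const Diagnostic &diag) override {
    os_ << diag.file << ':' << diag.line << ':' << diag.column << ": "
        << levelName(diag.level) << ": " << diag.message << '\n';
  }

private:
  static const char *levelName(Level level) {
    switch (level) {
    case Level::Note: return "note";
    case Level::Warning: return "warning";
    case Level::Error: return "error";
    case Level::Fatal: return "fatal error";
    case Level::Ignored: break;
    }
    assert(false && "ignored diagnostics never reach a client");
    return "";
  }
  std::ostream &os_;
};

// Records diagnostics instead of printing them, in the spirit of -verify.
class BufferingClient : public Client {
public:
  void handle(const Diagnostic &diag) override { seen.push_back(diag.message); }
  std::vector<std::string> seen;
};

// Diagnostics mapped to Ignored are filtered out before the client runs.
void report(Client &client, const Diagnostic &diag) {
  if (diag.level != Level::Ignored)
    client.handle(diag);
}
```

The call site only ever calls report; swapping TextPrinter for BufferingClient changes how the diagnostic is consumed without touching the code that detected the problem.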

Adding Translations to Clang

Not possible yet! Diagnostic strings should be written in UTF-8; the
client can translate to the relevant code page if needed. Each
translation completely replaces the format string for the diagnostic.

The SourceLocation and SourceManager classes

Strangely enough, the SourceLocation class represents a location
within the source code of the program. Important design points include:

  1. sizeof(SourceLocation) must be extremely small, as these are
    embedded into many AST nodes and are passed around often. Currently
    it is 32 bits.
  2. SourceLocation must be a simple value object that can be
    efficiently copied.
  3. We should be able to represent a source location for any byte of any
    input file. This includes in the middle of tokens, in whitespace, in
    trigraphs, etc.
  4. SourceLocation must encode the current #include stack that was
    active when the location was processed. For example, if the location
    corresponds to a token, it should contain the set of #includes
    active when the token was lexed. This allows us to print
    the #include stack for a diagnostic.
  5. SourceLocation must be able to describe macro expansions,
    capturing both the ultimate instantiation point and the source of
    the original character data.

In practice, the SourceLocation works together with
the SourceManager class to encode two pieces of information about a
location: its spelling location and its instantiation location. For most
tokens, these will be the same. However, for a macro expansion (or
tokens that came from a _Pragma directive) these will describe the
location of the characters corresponding to the token and the location
where the token was used (i.e., the macro instantiation point or the
location of the _Pragma itself).

The Clang front-end inherently depends on the location of a token being
tracked correctly. If it is ever incorrect, the front-end may get
confused and die. The reason for this is that the notion of the
“spelling” of a Token in Clang depends on being able to find the
original input characters for the token. This concept maps directly to
the “spelling location” for the token.
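One way to satisfy the size constraint while still describing macro expansions is to keep a compact handle in the location and the details in a side table. The sketch below is only an illustration of that idea and makes no claim about Clang’s actual encoding: Loc, Manager, the use of the top bit as a macro flag, and offsets as plain integers are all assumptions, and nested expansions and the #include stack are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical 32-bit location. The top bit marks a macro location; the
// remaining bits hold a file offset or an index into the manager's table.
class Loc {
public:
  static Loc file(std::uint32_t offset) {
    assert(offset < kMacroBit && "offset does not fit in 31 bits");
    return Loc(offset);
  }
  bool isMacro() const { return (raw_ & kMacroBit) != 0; }
  std::uint32_t payload() const { return raw_ & ~kMacroBit; }

private:
  friend class Manager;
  explicit Loc(std::uint32_t raw) : raw_(raw) {}
  static constexpr std::uint32_t kMacroBit = 1u << 31;
  std::uint32_t raw_;
};
static_assert(sizeof(Loc) == 4, "locations must stay 32 bits wide");

// Owns the side table that gives macro locations their meaning.
class Manager {
public:
  // Registers a token produced by a macro and returns its compact handle.
  Loc addMacroToken(Loc spelling, Loc instantiation) {
    table_.push_back({spelling, instantiation});
    return Loc(Loc::kMacroBit | static_cast<std::uint32_t>(table_.size() - 1));
  }
  // Where the characters of the token were written.
  Loc spelling(Loc loc) const {
    return loc.isMacro() ? table_[loc.payload()].spelling : loc;
  }
  // Where the token was used; identical to `loc` for ordinary tokens.
  Loc instantiation(Loc loc) const {
    return loc.isMacro() ? table_[loc.payload()].instantiation : loc;
  }

private:
  struct Entry {
    Loc spelling, instantiation;
  };
  std::vector<Entry> table_;
};
```

For an ordinary token both queries return the location itself, which matches the statement that the spelling and instantiation locations are usually the same.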
