Contributing to the native compiler backend is a great way to help MethodScript achieve its goals as
a general purpose programming language. Natively compiled programs help code break out of the confines
of the JVM, in principal are faster, and can integrate with code in other languages.

Working on the native compiler backend is unfortunately more complex than contributing to the interpreter
however, and so it is recommended that you have a complete understanding of how interpreted MethodScript
works before embarking on the native framework, even if you're already familiar with LLVM or other assembly
programming. This page is intended to help jumpstart someone who is otherwise familiar with the general
concepts of contributing to MethodScript, and has some pointers to resources that can help with learning
the basics of LLVM.

Before covering technical details, it's important to understand some underlying goals of the project.

* The Java Interpreter is the reference implementation of MethodScript.
* While it's possible to diverge from \
the Java Interpreter when absolutely necessary, "the implementation is hard" is not a good enough reason \
to do so. Differences in implementation will be strongly challenged before they are allowed. If the implementation \
must vary for technical reasons, it should be made exceedingly clear in the documentation, and mechanisms must be \
added to allow user code to account for these differences at compile time.
* If a new feature is added to the native \
backend, it must first be added to the Java Intepreter, unless this is also technically impossible.
* Native code may have different implementations per platform, but the effect of each should be the same.
* The Java Interpreter does the compilation process first, so features that exist in the compiler will \
inherently take precedence.
* Functions and other features are allowed to be missing from the native compiler while parity is reached, \
but this should result in compiler errors. Once parity is reached, it will be required to implement on all \
platforms at once, but this is a future concern.
* Core functions (marked by the @core annotation in Java) are particularly sensitive to these requirements.
* Foundational concepts cannot be changed under any circumstances. For instance, variables that are untyped \
(implied or explicit auto) must continue to be supported.
* The C Runtime Library is the primary runtime library, though others can be added as needed.
* Windows, Ubuntu, and MacOS are the supported OSes. More may be added later if sponsorship is obtained. In general, \
Debian is expected to work, but it is not a supported OS, and no bugs filed against Debian will currently be accepted.

Native MethodScript uses LLVM (which stands for Low Level Virtual Machine) to create the native binaries.
Thus, it's important to eventually understand various concepts within LLVM in order to successfully understand
the native compiler. In theory, LLVM should be able to be swapped out for any given assembly language, by simply
creating a new platform resolver, however, it's unlikely that for native binaries LLVM would ever be swapped out.
Nonetheless, the fact that LLVM is used should never be exposed to the end user, as this is not a guarantee from
a user code or tooling perspective, hence why, for instance, the command to compile to native is "asm", not "llvm".

== Architecture ==

In general, the user code goes through a series of passes, moving closer to outputting native code.
Each of these passes contains a lot of steps, which are detailed below.

=== Java Compilation ===
First, the existing Java frameworks compile the user code into an AST. This is the same compilation that happens when
using MethodScript in Java. The standard optimizations are run on this
code, thus the optimizations made by the Java Interpreter (JI) take precedence. In the future, mechanisms will be added
to allow the LLVM platform functions to preempt this optimization, but in general this should be avoided, to prevent
divergence of functionality. Most of the JI optimizations are fairly generic anyways, and deal more with eliminating
dead code, and running various logic that is platform agnostic anyways. Once the AST is compiled, it is traversed,
at which point the LLVM functions are run.

=== LLVM conversion ===
While the JI is focused on providing the runtime actions for the code, the LLVM code is only run at compile time, and
instead generates the LLVM IR (Intermediate Representation), which can be thought of as a slightly higher level
assembly language. Because of this, some concepts, which are easier to grasp in a runtime environment (types and
values are known for sure at runtime, and may not be known at compile time), are more difficult to comprehend.
Additionally, programming happens in a linear/flattened way. Jump statements are used instead of traditional high
level branches, and code to evaluate parameters of a function is written before the evaluation of the function call
itself. This paradigm shift takes some getting used to for high level developers, but it can help to view the output
of already written programs, as well as output from simple C programs compiled to LLVM with clang.

Learning LLVM or assembly concepts is beyond the scope of this article, as are things like using the C runtime library,
though resources for doing so are provided at the bottom.

The functions are responsible for outputting the necessary IR. Input parameters are passed in as meta structures, TODO

Once the IR is generated, it is written out to .ll files. This file contains the human readable LLVM IR, including
comments indicating where in the original MethodScript the code came from.

<%NOTE|In general, it is possible to output either IR or bitcode, which is a 1:1 mapping of IR. When using the LLVM library
bindings, it is possible to output the bitcode directly, however a conscious decision was made to output IR instead of
bitcode. When using clang, you may see references to .bc files, which contain the bitcode. This can be converted to
IR using the llvm-dis tool, which is included in the toolchain if installed through MethodScript. For a discussion on
why IR was chosen instead of bitcode, see the section below.%>

=== IR to O(bj) ===

The next step is to compile the IR into object files. Object files are basically the native assembly, which has not
been linked into a final binary yet. Starting from this stage, the process is handled entirely with the LLVM toolset.
"llc" is the program responsible for compiling the IR into the object files.

=== O(bj) to Executable ===
TODO: Linking is more complex than this, go into more detail. "lld" converts the object file(s) into the executable.

== Toolchain ==
"llc" and "lld" are the core tools that are used when compiling code, however, additional tools are installed by default
when using the MethodScript toolchain installer. llvm-as, llvm-dis, and opt are included. These tools are useful for
developers, and may be used as part of the core process if certain flags (such as debug flags) are set by the user, in
the future.

== Testing and Debugging ==
When adding additional functions, it's important to make sure that everything works as intended. Unfortunately, since
some functions have very platform specific code, it makes it more difficult to test. Luckily, it's easy to get Virtual
Machines for both Ubuntu and Windows. (Newer Mac OS is difficult, and thus support for Macs is currently limited.)
See the section below for information and steps on setting up these VMs.

The IR is written out to .ll files, and this can be read to see the code in full context.

To reverse engineer executable binaries, it might be helpful to use [https://ghidra-sre.org/ Ghidra], which is the premier open
source reverse engineering toolchain, though reading the .ll files is probably easier, and should be attempted first.

One technique that is very useful is to reverse engineer equivalent C programs that have been compiled with clang. To
do this, write an equivalent program in C, then compile it with <code>clang -Weverything -S -emit-llvm code.c</code>.
This will emit the .ll file instead of compiling a binary. Clang is included with the LLVM installation.
This is also useful for helping to map common high level concepts into a lower level format, so that you can
interactively compare the two. Note that -Weverything is specified, this is generally important to have, so that all
warnings are emitted. You may choose to ignore certain errors if they are not relevant (many aren't, and some simply
aren't useful), but you should make sure to resolve all relevant and useful warnings before copying code, to make it
less likely that you've written code that isn't a good example.

Another technique is hand coding IR, then compiling that. This can result in faster development cycles as you figure
out various concepts. To do this, you can hand write an IR file (you can use an empty MethodScript program to set up
the boilerplate), then simply run <code>mscript -- asm handwritten.ll</code>. Because this is a common use case, the
compiler can instead of taking MethodScript, directly accept a LLVM IR file. If this is done, the "output directory"
becomes the same as the .ll file, and the object file and executable will be placed beside it.

== IR vs Bitcode ==

To compile LLVM, one can choose to either write human readable LLVM IR, or LLVM bitcode, which is the binary format of
the IR. Certainly, the LLVM compiler has to do more work to compile the IR compared to the bitcode, so generating the
bitcode directly should in theory make compilation faster. However, a number of pros come from using the IR instead.
First of all, the libraries that exist for generating bitcode are written in C/C++. Wrappers for Java do exist, but
then this has to be installed on top of the normal toolchain. Development of the IR should be easier for new
contributors. It also gives us more flexibility in generating the IR, which makes it easier to debug.

It may be that eventually, we find a good reason to switch to bitcode, but for now, the pros of using the IR seem to
outweigh the cons.

== Development Environment ==

Due to the inherently complex nature of the task, setting up a dev environment is also somewhat complicated. To a
large degree, most features can be done on whatever computer you're currently using, but it's possible that you
may need additional OSes to test on. In all cases, OSes are available for test purposes as virtual machines, using
any virtualization software you want, but in general, VirtualBox is the preferred solution.

In all cases, setting up the VMs is free, though for non-free OSes, there are limitations that make them unusable
for daily use, however, for our purposes, they are good enough. For Windows installations in particular, the VMs can
be rather annoying, so it is recommended to use an actual copy of Windows, rather than the development VMs.

If you aren't willing to install other development environments when adding new features, don't assume that new features
that you're adding will work on other OSes. If you have reason to believe they will (for instance, only uses LLVM
specific constructs and makes no system calls) then it's fine to make this assumption, but it's better to
add a compile error when used in an untested system than it is to assume your feature will work. PRs that do OS specific
tasks will not be accepted without proof of testing on all OSes that are supported in the code.

=== Installation ===

First, download the latest version of VirtualBox. https://www.virtualbox.org/ We will then download and configure
each guest OS one by one. On Windows hosts, for installing the MacOS guest, you additionally need to download Cygwin,
from https://cygwin.com/install.html, see the MacOS instructions for more details and installation options.

Once the guest OS is installed, it is useful to set up a shared folder, from where you install MethodScript, allowing
you to continue development on your standard computer, yet also not have to constantly copy files over. For Windows and
Linux guests, this is straightforward. Add the target/ folder of the build output as a shared folder in VirtualBox.
Open the Settings for the VM, go to Shared Folders, select the target/ folder in the host, and set it to auto mount,
then start the VM. From inside the VM, this should show up as a shared network folder. This requires the VirtualBox
guest additions to be installed in the guest.

For MacOS Big Sur, Guest Additions doesn't currently work. Once it works, the above instructions should be relevant, but
in the meantime, a decent enough workaround is discussed below in the MacOS instructions.

==== Windows Guest ====
Download a Windows 10 or higher VM from here: https://developer.microsoft.com/en-us/microsoft-edge/tools/vms/
Select VirtualBox for the VM platform. Unzip the file, and place the .vdi file in a reasonable location. Open
VirtualBox, and select File->Import Appliance. Select the .vdi file, and follow the prompts.

IMPORTANT! *BEFORE* booting the VM for the first time, it's important to take a snapshot that can be reset easily once
the Windows trial expires. Select the additional options button in the right side of the VM entry, and select Snapshots.
Click "Take" in the top, and name the snapshot something reasonable, like "Pristine". You can then boot the VM like
normal, and once the trial expires, or for whatever reason if you want to return to a brand new OS state, restore the
snapshot.

==== Ubuntu Guest ====

Download Ubuntu 20.04 LTS from here https://ubuntu.com/download/desktop. This is the ISO file, which is like a disk
image, and will need to be manually installed and imported. Go to Machine -> New, and follow the prompts. Select the
disk drive, and insert the ISO file into the drive, and follow the normal Ubuntu installation procedures. Like the
Windows installation, you may wish to create a pristine snapshot, but Ubuntu is free, so will not require resets due to
expired trials.

==== Mac OS Guest ====

These instructions use https://github.com/myspaghetti/macos-virtualbox to download a MacOS guest. For Windows hosts,
this requires Cygwin from https://cygwin.com/install.html. When installing Cygwin, install the latest versions
of the following packages which are not included in the default installation, but are required for this script:
* wget
* unzip
* xxd

Download the script, and run it:
<%SYNTAX|bash|
curl https://raw.githubusercontent.com/myspaghetti/macos-virtualbox/master/macos-guest-virtualbox.sh > macos-guest-virtualbox.sh
sh macos-guest-virtualbox.sh
%>

This will run for a while, and requires a bit of interaction, but the instructions are obvious and easy to follow.

Once Catalina is installed, you can use Software Update to install Big Sur.

You will also need to allow unsigned code to run. From a terminal, run `sudo spctl --master-disable` to allow any
software to run.

Since Guest Additions don't currently work, in order to use shared file systems with the host, this requires a different
approach. Follow the directions below on how to set up the host only adapter in the "Debugging the compiler on a VM"
section first. Once the host only adapter is configured, use your host OS's standard mechanism for creating a shared
folder. From the Mac guest, open Finder, Go, Connect to Server, and configure the shared folder through the IP address.

Once successfully mounting the shared folder, it should show up in /Volumes/target or whatever you named the shared
folder, and can be used as normal from there. On reboot, you may need to reconnect to the share.

=== Debugging the compiler on a VM ===

In general, the mscript wrapper supports debugging the JVM out of the box. Simply set the environment variable
"DEBUG_MSCRIPT" to 1, (bash: <code>export DEBUG_MSCRIPT=1</code>, cmd.exe: <code>set DEBUG_MSCRIPT=1</code>, powershell:
<code>$env:DEBUG_MSCRIPT = 1</code>) and run the mscript command. This will start the java process with the debugger
enabled, listening on all addresses, on port 9001. However, to access the network on the guest, you need to first
configure a new network adapter on both the host and the guest.

Shut down the guest, and open the VM's settings in VirtualBox, and go to the Network tab. Go to adapter 2,
and enable it, and change the attached to: to Host-only Adapter. This also requires properly configuring the host
system. Go to File -> Host Network Manager. Make sure that the Adapter is configured automatically, and the DHCP Server
is enabled. (For brand new installations of VirtualBox, this is the default.) In your host, open the network connections
and find the network adapter labeled something like "VirtualBox Host-Only Ethernet Adapter". Run ipconfig/ifconfig from
the host, and note the IP address of this adapter. This is the IP address that you can access the host from the guest.
Vice versa, run ipconfig/ifconfig from the guest, and note the IP address of the second ethernet adapter. This is the
IP you can connect to the guest from the host. In your host system's IDE, start a remote debugging session, and enter
the IP from the guest, and port 9001 here to connect.

== Key Concepts ==

It's important to understand a few key concepts about native code, which may not be immediately obvious to a Java
developer.

Intermediate files have different typical extensions on different platforms:
<%PRE|
UNIX    Mac OS X  Windows
*.o     *.obj     *.obj    object file
*.a     *.a       *.lib    static library
*.so    *.dylib   *.dll    shared object (ELF targets)
*       *         *.exe    binary (no suffix on UNIX)
%>

Largely, each type has similar functionality though. TODO: Discuss static vs dynamic linking, and what object files
contain.

== Research Scenarios ==

While developing, you may need to research various scenarios.

=== Find symbols defined in a binary ===

For Windows, you can use dumpbin.exe to find details about an object file, such as a dll or lib file. This must
be run from a visual studio developer prompt.

* dumpbin.exe /headers "C:\Program Files (x86)\Windows Kits\10\Lib\10.0.19041.0\ucrt\x64\ucrt.lib"

For linux/mac, you can use nm or objdump

* nm /usr/lib/x86_64-linux-gnu/libc.a
* objdump -x /usr/lib/x86_64-linux-gnu/libc.a

=== Adjusting Linker Options ===

At the early stages of development, it's possible that the linker options are very wrong, and will need to be adjusted.
Certainly, as new native code is integrated with, we will need to add more files to the linker. In order to get a good
template for this, it is easiest to run clang against an equivalent C program, and view the linker options it uses.
Use <code>clang -v test.c</code> to see the verbose output for clang, including the linker options, and use that as
your template.

Note that the linker option format varies fairly widely from OS to OS, so any change to these options will almost
certainly require testing on all supported OSes, unfortunately.

== Common Errors ==

* error: instruction expected to be numbered %X

This can happen when an local variable was expected, but wasn't provided, so it was automatically inserted by the
LLVM compiler. This usually means the previous instruction was wrong, rather than the instruction it's pointing to.
There can be different reasons for this, for instance if a "call i32 @function" return value is being ignored. Since
call returns a value here, even if it isn't used, it should still be assigned to a value, to make it explicit, and
to increment the counter. In fact, this probably needs to be used anyways, many C functions return -1 or something
to indicate failure, which actually should be converted into a MethodScript exception, and thrown, rather than being
silently ignored.

Another one is implied blocks. This mostly occurs with function definitions, which is why main looks like:

<%PRE|
define dso_local i32 @main(i32 %0, i8** %1) {
	%3 = ...
}
%>

Because there's an implied block named %2.

<%PRE|
define dso_local i32 @main(i32 %0, i8** %1) {
%2:
	%3 = ...
}
%>
For functions, this is accounted for automatically, but other block
creation can cause this to happen in unexpected places, for instance when using a block terminating statement, such
as ret, br, unreachable, etc. Not accounting for these implied variables is often a sign of a bug, so it's good that
this error is caught, though unfortunate that the error message isn't terribly helpful.

* error: '%X' defined with type 'Y*' but expected 'Y'
* When %X was defined with alloca

When doing %x = alloca i64, then %x is actually of type i64*. Say you have the following code:

<%PRE|
%1 = alloca i64 ; %1 is i64*
store i64 0, i64* %1
%>

Now %1 is a pointer to memory, which contains the value 0. To get 0 back, we have to load the value.

<%PRE|
%2 = load i64, i64* %1
%>

== Resources ==

* [https://llvm.org/docs/LangRef.html LLVM IR Documentation] - This is the primary source of "official" documentation \
for the various IR commands. Note that this is not necessarily the authoritative source though, reverse engineering \
clang output is.
* [https://borretti.me/article/compiling-llvm-ir-binary Compiling hand written LLVM, for test purposes.]
* [https://mapping-high-level-constructs-to-llvm-ir.readthedocs.io/en/latest/README.html Guide for mapping high level language concepts to LLVM.]