Power PC Assembly Tutorial
In
Prerequisites
You must know either C or C++ to be able to follow this tutorial at all. If you know Java or C#, that’s not good enough, you need to know C or C++. Along with C/C++ you need to have a decent understanding of memory. By this I mean that you need to understand how the stack and heap work, along with a very good understanding of pointers. If you don’t understand the output of the following code, then you really need to learn more. Open Me
Some good places to learn are as follows, in order of difficulty:
- My C Tutorials 99. If you have no experience in C or C++ then this is where you need to start. If you already know C#, then you can skip to the first video about pointers.
- Introduction to the stack and heap. 104 Good place to start if you already know some C/C++ but don’t know that much about memory.
- Programming Paradigms from Stanford. 38 Watch the first 15 or so lectures, once they get into threading, you can stop.
- More in-depth on the stack and heap. 43
If after reading/watching all the above stuff you still don’t understand, then you’re on your own.
Introduction
PowerPC 32 is an assembly language. Basically what that means is that each instruction that you write, is actually an operation that the processor is performing, unlike a high level language like C or C++ where it has to be compiled to assembly or machine code. So with PowerPC you, the programmer, are directly communicating with the processor. Because PowerPC is an assembly language, you cannot run it on just any computer, you have to run it on a machine that uses a PowerPC processor, *cough the Xbox360 *cough cough. The only way that I know how to run PPC is using inline assembly in C++, and then running it on a DevKit. However, if you don’t have a DevKit and you still want to learn PPC, I encourage you to. You’ll still be able to read Xbox360 executables (.xex files), in IDA 21. Chapter OneAnyways, this instruction takes 3 operands. The first is the destination register, so when the multiplying is said and done, the product will be stored here. The second operand is a register that you want to multiply with something, and the third register is that something, it needs to be an immediate value. So let’s take a look at an example.
li r3, 5
mulli r4, r3, 7
The above code is loading r3 with the value 5, then, using the mulli instruction, it multiplies the value in r3 by 7. Finally it stores the product of the operation in r4. Now let’s take a look at multiplying two registers together. The instruction to do that is mullw, which stands for multiply low word. If you’re wondering what a word is, it’s a 32 bit signed integer. This instruction, like mulli, takes 3 operands. The first is the destination register, so the register where the product will be stored. The last two operands are just the registers that you want to multiply together. Let’s look at an example.
li r3, 2
li r4, 9
mullw r5, r4, r3
This code snippet will load r3 with 2, load r4 with 9, and then
multiply the two registers together. So after this code runs, r5 will have a value of 18.
Now we are going to look at dividing. There is only one instruction that we’re going to be looking at for dividing. Sadly, there isn’t a divide instruction that takes an immediate value as an operand, so that’s why we’re only looking at the one instruction. The instruction we’re going to look at is divw, which stands for divide word. The divw instruction takes 3 operands. The first, is the destination register, the location that the quotient will be stored. The second operand is the dividend, and the third operand is the divisor. So let’s try this stuff out.
li r3, 10
li r4, 5
divw r5, r3, r4
So here, we are just loading r3 with 10, and r4 with 5. Then we divide r3 by r4 and store the quotient in r5 . So after this code does it’s thing, r5 will be 2.
So this is the end of Chapter One. Click here 80 for the Chapter One quiz.
[/details] Chapter Two
Summary
In this chapter we are going to be looking at comparing registers with other registers/immediate values. Also we are going to be looking at branching, referred to as jumping in some other assembly languages, and labels. Once you learn all that stuff you will be able to make decisions in your assembly code.
Condition Registers
The condition registers are used to store information about a comparison, such as whether or not two things are equal. There are eight condition registers, cr0-cr7. We are not going to be using cr1 because that is for floating point registers, which is a whole different discussion. So for right now, only use cr0 and cr2-cr7. Each condition register is 4 bits in length, and each bit corresponds to a certain condition. The format of each condition register is as follows:
All the condition registers are right next to one another to form a 32 bit register called CR, which stands for condition register. All eight of the four bit condition registers make up this register. So it looks something like this:
Compare Instructions
Now that we know where comparison data is stored, we now need to take a look at how to compare registers with other registers/immediate values. The first instruction that we are going to be looking at is cmpw, and this stands for compare word. And remember from the first chapter that a word is a signed, 32 bit integer. This instruction takes two operands, two registers. All this instruction will do, is compare the two registers, and store the information about the comparison in cr0. If the first operand’s value is less than the second operand’s value, then the LT, less than, bit will be set in cr0. If the first operand’s value is greater than the second operand’s value, then the GT, greater than, bit will be set in cr0. Otherwise, if both operands’ values are the same, then the EQ, equal, bit will be set in cr0. So let’s take a look at a couple examples.
li r3, 4
li r4, 5
cmpw r3, r4
Condition Register 0 will look like…
The code above will simply load r3 with 4, load r4 with 5, and then compare r3 and r4. Since r3’s value is less than r4’s, the LT, less than, bit is set in cr0. Now let’s look at an example where the GT bit is set.
li r3, 11
li r4, 7
cmpw r3, r4
Condition Register 0 will look like…
This code will load r3 with 11, and load r4 with 7. Then it’s simply going to compare the values in the registers. So after this code runs, the GT, greater than, bit will be set in cr0 because the value in r3 is greater than the value in r4. Finally, we’re going to take a look at an example where both registers have the same value.
li r3, 5
li r4, 5
cmpw r3, r4
Condition Register 0 will look like…
Here, we’re loading both r3 and r4 with the value 5. Then we compare those registers using the cmpw instruction. Since both of the registers have the same value, the EQ, equal, bit is set in cr0.
Up until now, all the information about comparisons has been stored in cr0. With the compare instructions, you can specify what condition register you want the information to be stored in. To do so, you simply make the first operand of the cmpw instruction the condition register you want the information to be put in. Then, you make the second and third operands the registers that you want to compare. So let’s look at an example.
li r3, 6
li r4, 9
cmpw cr6, r3, r4
Now, instead of the LT bit being set in cr0, it will be set in cr6 because we explicitly told it to. We made the first operand of the cmpw instruction the condition register that we wanted the information in.
Now let’s take a look at comparing registers with immediate values. It turns out that it really isn’t that much different from comparing two registers. The instruction to compare a register with an immediate value is cmpwi, which stands for compare word immediate. Just like the cmpw instruction we have the option of specifying the conditional register to store the information about the comparison in. So for this instruction you can provide either two or three operands. If you provide two, the first needs to be a register, and the second needs to be an immediate value, so a signed 16 bit integer. If you provide three operands, then the first needs to be the condition register to store the information in, the second needs to be a register, and the third needs to be an immediate value. First, let’s take a look at the two operand version.
li r3, 15
cmpwi r3, 10
Condition Register 0 will look like…
In this code, we are simply loading r3 with 15, and then we compare the value in r3 with 10. Since 15 is greater than 10, the GT bit is set in cr0. Now let’s look at the three operand version.
li r3, 6
cmpwi cr6, r3, 10
Condition Register 6 will look like…
Here, we are loading r3 with 6, and then we compare the value in r3 with the immediate value 10. Since 6 is less than 10, we set the LT bit in cr6 because we specified that as our first operand.
Labels
Labels in PowerPC serve the same purpose as labels in C++ and C#. Basically a label is something you can use to mark off a place in your code that you want to branch, or jump, to. I think it’d be easier if I just showed you an example, rather than trying to come up with a definition for a label.
addNums:
add r5, r3, r4
To create a label, all you need to do, is write a identifier, of your choosing, followed by a colon. This means that you can call your label anything, as long as it doesn’t start with a number, doesn’t contain spaces, dashes, or symbols. So my_Label1 is a valid label name whereas my Label%$1 isn’t a valid name.So as you can see in the code above, the label, is literally labeling some code. You can have 1, or a million lines of code after your label, it doesn’t really matter. Also, there isn’t really a limit to how many labels you can have. Why are labels useful you may ask, well, look at the next section.
Branching
Branching, referred to as jumping in some other assembly languages, will simply allow you to “branch” to a location in your code. Branching will allow you to go to labels. In order to branch we must use a branch instruction. The most basic branch instruction there is, is b, which just stands for branch. Most branch instructions take one operand, which is the location to branch to, in most cases it’s a label. So let’s take a look at an example of branching.
li r3, 6
li r4, 10
b addStuff
li r4, 2
addStuff:
add r5, r3, r4
The above code is going to load r3 with 6, and then load r4 with 10. Right after that, it’s going to jump to the addStuff label, skipping over the li r4, 2. This means that the value in r4 when the add instruction is being executed will be 10, since we branched directly to the addStuff label, skipping any and all code in between the branch instruction and the label. After the add instruction is done, the next code that will be executed will be whatever, if any, code that follows the add instruction. It never goes back to where it previously left off. So the
li r4, 2 line of code will never be executed.
One other very important form of the branch instruction is blr, which stands for branch to link register. In later chapters we’ll talk about what the link register actually is, but for right now, this instruction is equivalent to the return keyword in C++ and C#. However, we cannot return values with this instruction, all that this instruction will do, is halt execution, and go back to wherever this function was called from. Let’s look at a quick example.
blr
li r3, 4
li r4, 6
This code is pretty much worthless, because all it’s going to do is return. Therefore, r3 and r4s’ values will never be changed, they will never be loaded with 4 or 6, respectively.
Now let’s take a look at something more interesting. The branch instructions will actually allow us to branch depending on the bits set in a condition register. So we can branch if the EQ bit is set, or if it’s not set. We could branch if the GT bit is set, or if it’s not set, and of course you could also branch if the LT bit is set, or if it’s not set. There are corresponding branch instructions for each bit in the condition register. They are as follows.
By default, the condition register that it will look at is cr0, but we’ll look at changing that later. So for now, let’s just look at a few examples.
li r3, 5
li r4, 10
cmpw r3, r4
blt addR3andR4
blr
addR3andR4:
add r3, r3, r4
blr
The first thing this code does, is load r3 and r4 with 5 and 10, respectively. Then, using the cmpw instruction, we compare the values in r3 and r4. The LT bit in cr0 will be set since r3’s value is less than r4’s. The next instruction says to branch to the addR3andR4 label if the LT bit is set in cr0, which it is. So then we branch to that label, next it adds the contents of r3 and r4 together, finally storing the sum in r3. Lastly, it returns from the function call. Let’s take a look at another example.
addi r3, r3, 10
cmpwi r3, 100
bgt sub100
blr
sub100:
subi r3, r3, 100
The first thing that this code does, is add 10 to whatever value is in r3. Then we compare the sum of that operation with 100. If the value in r3 is greater than 100, then it will branch to the sub100 label, and subtract 100 from r3. If r3 was less than 100, then the function call returns. Let’s say that we didn’t have the blr instruction there. So it’d look like this:
addi r3, r3, 10
cmpwi r3, 100
bgt sub100
sub100:
subi r3, r3, 100
So after we add 10 to r3, and compare the sum with 100, it will branch to the sub100 label if the value in r3 is greater than 100. Let’s say it wasn’t, we’ll say that r3 was 62, that means that it wouldn’t branch to the sub100 label. However, since we removed the blr instruction, it’ll just fall through. This means that no matter what, it’ll subtract 100 from r3. So if it doesn’t branch to the sub100 label, execution will fall through. This means that labels have no effect on the flow of execution. The only way to skip over instructions is through a branch. So let’s look at another example.
label1:
li r3, 4
label2:
li r4, 5
label3:
add r5, r3, r4
The above code is the exact same as the code below, it does the same thing.
li r3, 4
li r4, 5
add r5, r3, r4
Since the processor will just ignore labels, the labels in the first example will just be ignored, so it will do the same thing as the code in the second example. However, you should only use labels if you plan on branching to them.
Now let’s tell the branch instructions what condition register to check. All that we need to do is make the first operand the condition register to look at, and then the second operand is the location to branch to, so a label in our case. Let’s take a look at an example of this.
li r3, 11
cmpwi cr6, r3, 11
beq cr6, someLabel
blr
someLabel:
li r3, 0
blr
Here we are loading r3 with the value 11. Then we compare the value in r3 to 11, and store the results of the comparison in cr6. When we go and do a conditional branch, we need to make sure we are checking cr6. To do that, we just made the first operand of the beq instruction cr6, and the second operand the label to jump to.
Alright, so this is the end of chapter two, you can take the quiz here. 29 Chapter Three
Summary
In this chapter we are going to be looking at creating our own functions, and accessing the parameters that are passed to theses functions that we create. Also, we’ll be looking at pointers and structs.
Creating our own Functions
We are going to be using inline assembly for everything, so to create a function, just do what you would do if you were creating one in C or C++. So just do something like what’s shown below. If you don’t have a DevKit, don’t worry. You’ll still be able to easily follow along with this tutorial, you just won’t be able to test anything out .
Okay, so now you’re probably wondering how we can write PowerPC in Visual Studio, well it’s really easy, you can just do inline assembly. So just do this:
So from now on, just write all your code inside of that _asm block. Now that you know how to write PowerPC in Visual Studio, let’s take a look at making a function that does something. The most important thing to be able to do in a function is to be able to use the parameters. The first eight parameters of a function are passed through r3-r10. If you had eight integer parameters, the first would be in r3, the second in r4 and so on, all the way up to r10. We’ll talk more about this later, but for right now let’s get back to that function I had you create, sillyFunction. That function takes one integer parameter. This means that the value passed in as a the parameter is stored in r3. So if I did something like this:
Then, the value in r4 after this code will have ran, would be 15. This is because when I called sillyFunction, I passed 5 in is as the first parameter, loading r3 with 5. Then, I added 10 to the value in r3, which gives us 15.
Now let’s talk about the return value. The return value of a function is stored in r3. So let’s say that we wanted to return the value that we calculated in the previous example. Well, it’s pretty simple. We could do one of two things. We could move the value in r4 to r3, or we could just initially store the sum of r3 and 10 in r3. I’m going to go with the later of the two, because it requires less assembly instructions. So if we were to do this:
The value in someNumber after this code ran, will be 15. Now let’s create a function that is slightly more complex. Let’s make a function that will return the largest of two numbers. It’d look something like this:
The first line of the function compares the second parameter to the first parameter. If the second parameter is greater than the first, then we need to move r4, the second parameter, to r3, the return value. If the second parameter is less than or equal to the first parameter, then we need to return the first parameter. However, we don’t need to do anything, since r3 is already holding the value of the first parameter, so we can just return. In the example above, I pass through 7 and 11. Since 11 is greater than 7, someNumber will holding 11, after the code has run.
Now let’s take a look at a function that doesn’t take an int as a parameter. Let’s look at one that takes a character as a parameter. This function will return true if the character is a number character, so ‘0’, ‘1’, ‘2’ etc., and it’ll return false if it is anything else, for instance, ‘a’, ‘&’, ‘P’ etc. This is the function that I came up with:
Chars, just like ints are stored in registers when passed as parameters. So that means that in the above function, the character, passed as a parameter, would be stored in r3. The first thing that we do is compare the value of the character with 48, which is the value of the ‘0’ character. If the character’s value is less than 48, then we know that it can’t be a number character, so we return false. We load r3, the return value, with 0, which is the value of a bool that is false. If however, the character’s value is greater than 48, then we compare the value in r3 with 57, which the value of the ‘9’ character. If the character’s value is greater than 57, then we know that it cannot be a number character so we return false. If the character’s value is between 48 and 57, inclusive, then we return false. So if I were to call the function, like this:
Then the value of the bool b, would be false, since ‘a’ isn’t a number character.
Now let’s take a look at pointers. So let’s say that we made a function that took a pointer as a parameter. The pointer itself will be stored in r3, so we somehow need to go to the address that is in r3 and read the 4 bytes at that location, assuming the pointer is an integer pointer. The instruction to read memory at a certain address is lwz which stands for load word and zero. This instruction takes 2 operands, the first is the destination register, so the register that the value will be read into, and the second operand is the address of the word that you want to load the first operand with. So let’s assume that r3 is an integer pointer, in the following example.
lwz r4, 0(r3)
So that’s how we’d load r4 with the word, aka the 4 byte integer, at the address in r3. The parenthesis around r3 tell the processor that r3 is a pointer and that we need to go to that address. The number out in front of the parenthesis is the byte offset. So the address of the word we are loading r4 with is r3 + the byte offset. Since we have the byte offset as 0, we will be loading r4 with the word is at the address in r3. Maybe it would help a little bit more if you saw a picture.543×564
Now let’s look at a short pointer. So we’ll say that r3 is a short pointer. Again, the pointer itself will stored in r3. So if we want to load the short into r4, we’d want to use the lhz instruction, which stands for load half word and zero. A word in PowerPC is 32 bits, so that means a half word would be 16 bits, or a short. So the lhz instruction is very similar to the lwz instruction, the only difference is that the lhz will only read 2 bytes, whereas the lwz will read 4 bytes. Let’s get to an example.
As you can see in the above example, the doStuff function takes a short pointer as a parameter. This means that when the function is called, r3 will be loaded with the address of a short. In order to access the short that is at that address we use the lhz instruction. Remember from the lwz examples that we need to put the parenthesis around r3 to tell the processor that it’s a pointer. So that line of PPC, will go the address in r3, and read 2 bytes. It’ll then store those two bytes in r4.
Finally, we’ll look at an example of a character pointer. The instruction to read a byte a certain address is lbz which stands for load byte and zero, this is exactly the same as the lwz and lhz instructions. The only difference is that the lbz will read a single byte. Now we’ll look at this bad boy in action.
I’m not really sure if this needs much of explanation since it’s so similar to the other two instructions, but here we go anyways. Since this function takes a character pointer as it’s only parameter, when the function is called, the pointer itself, so a character’s address, will be put into r3. If we want to get the byte at that address we’d just use the lbz instruction. Like the other two loading instructions, we need to put parenthesis around r3 to tell the processor that it’s a pointer. We also need the 0 out in front to say how many bytes from that address we need to go, since it’s a zero, we read the byte at the address stored in r3.
The next thing that we are going to be looking at is writing data to a certain location in memory. So let’s say that we have a function that takes an integer pointer, and we want to set the integer at the address in the pointer equal to 0. To do such a thing, we’d want to use the stw instruction, which stands for store word. The first operand is the value that you want to store at a certain address. This operand must be a register, it cannot be an immediate value. The second operand is the address at which you want to store the value in the register. You have to set up this operand the same way that you have to in the load instructions. So we need parenthesis around the register holding the address, and in front of the parenthesis we need an immediate value, which serves as the byte offset. To set the integer at the address in r3 equal to 0, we’d need to do the following.
Here, we load r4 with the value 0, which is what we want to set our integer equal to. The next thing that we do, is use the stw instruction to set the integer at the address in r3 equal to zero. If I were to call this function, like this:
After the above code runs, i will be holding the value 0, because the doStuff function changed it’s value.
There are two other storing instructions that we’ll be looking at in this chapter, and they are sth, which stands for store half word, and the other one we’ll look at is stb which is store byte. I’m not going to be making an example for sth just because it’s so similar to the other two. So, let’s look at using the stb instruction.
As you can see, this instruction is very similar to the stw instruction. The above code will overwrite the byte at the address in r3 with 0, pretty simple I hope.
The last thing that we’re going to be looking at in this chapter, is passing the address of a struct as a parameter. So let’s say that we have a fraction struct, with two integer fields, numerator and denominator. Then we have a function that takes the address of a fraction as a parameter. We’ll be looking at accessing the different fields inside of that struct. This is what the prototype and struct look like.
So now, inside of our do stuff function let’s change the fraction’s values around, to find it’s reciprocal. First, we need to solve the simpler problem of loading it’s fields into registers. First let’s load the fraction’s numerator value into r4. This isn’t too hard because the address of the fraction is the address of the first field in the struct, and the first field in the struct is the numerator. So to load the numerator into r4, we could just do…
Now we need to think about getting to the second field, the denominator. Well we know that r3 holds the address of the first field, and we know that the second field comes right after the first in memory. We also know that the size of an integer is 4 bytes. So in memory, the struct looks like this:
This means, that the address of the denominator is the address of the numerator plus 4, this size of an int. So to read the denominator into r5, we could do something like this:
Here we are reading the integer into r5 at the address in r3 plus 4, which is the size of an integer. So now, to finish our function, to get the reciprocal we need to write them back to memory in opposite places, so we’d do this…
Pretty simple really. We just write the denominator value to the address of the numerator, and write the numerator to the address of the denominator.
Now let’s take a look at something that doesn’t work out so nicely. Let’s say we had a struct like this:
This isn’t work out so cleanly because pointers have to be at an address that is divisible by 4, same with integers. So all primitive types in C and C++ need to be at an address that is divisible by it’s size in bytes. So a bool and a character can be at any address since they are only one byte. Shorts must be at an even address since they are 2 bytes, pointers, integers, and longs (on most computers) need to be at an address divisible by 4 since all those types are 4 bytes in size. Lastly, long longs need to be at an address divisible by 8 since they are eight bytes long. So back to our struct. Let’s say that the address of the first field is 0x7004fd20. The first field is a bool, so it only takes up one byte. The next byte in memory that can be written over is 0x7004fd21, and we need to put a char pointer there. Well we can’t because the address 0x7004fd21 isn’t divisible by 4, the size of a character pointer. So what it does, is skip over 3 bytes, so it goes to the next address that is divisible by 4,which would be 0x7004fd24. Then when it needs to put down the next character pointer, the last name pointer, it has no problem because 0x7004fd28 is divisible by 4. Following the last name char* is a single character, which screws things up. It puts that character down at the address 0x7004fd2C, making the next available address 0x7004fd2D, which isn’t divisible by 4. We need to put down an integer, so we need something divisible by 4. Just like before, it’ll skip over 3 bytes so that it can put down the integer at an address divisible by 4. So if we were to create an instance of our struct, it would look like this:
Now let’s just go ahead and write a function that takes a person pointer. This function will return a bool indication whether or not the person can legally buy/consume alcohol in the US. So in other words, if the person’s age is less than 21, then this function will return false, otherwise it’ll return true. So we know from looking at the diagram above that the person’s age is at the address of the struct plus 16 bytes. Creating this function is really simple now that we have this diagram. So the function would look something like this:
This function is really easy to write now that you know how to get to the age in memory. Once we load the age from memory, we compare it to 21. Then we branch to the returnTrue label if the value in r4, the person’s age, is greater than or equal to 21. Otherwise we load r3 with false, and return.
So this is the end of chapter three, you can take the quiz here. 6 Chapter Four
Summary
In this chapter we are going to be looking at looping. I probably should have integrated this into a previous chapter, but whatever. The point is you should learn about looping before we move on to anything too difficult. Most of the stuff associated with looping you have already learned, such as branching, and labels, but there are a few more things that we can add to it.
The Count Register
The count register is 32 bit register that is most commonly used to hold the counter for a loop, although you can use it for anything you want. Using the counter register, or CTR, is easier than using a GPR to hold a loop count, and you’ll see why this is as we move along. You cannot load the CTR, aka the count register, with a value like you would a GPR. You cannot use the li instruction to place a value in the CTR, instead you need to use an instruction called mtctr, which stands for move to CTR. This instruction takes one operand, and it is the value that you want to put in the CTR, and it must be a register, it cannot be an immediate value. So if we wanted to load the CTR with the value 10, then we’d have to do something like this:
li r3, 10
mtctr r3
Since we cannot give the mtctr instruction an immediate value, we must first load the immediate value into a register, and then give that register to the mtctr instruction.
If you ever wanted to perform an operation on the CTR, like adding, subtracting etc., you’d have to do it indirectly. You’d have to read the value from the CTR into a GPR, perform the operation, and then put the value back into the CTR. To get the value in the CTR, we can simply use the mfctr instruction which stands for move from CTR. This instruction takes one operand, it is the destination register, so the register that the CTR’s value will be copied to. If we wanted to add 7, to the CTR, we’d have to do something like this:
mfctr r3
addi r3, r3, 7
mtctr r3
First, we copy the value in the CTR to r3, then we simply add 7 to r3, then we copy r3’s value back into the CTR.
Looping with the CTR
Looping with the count register is really easy thanks to a little instruction called bdnz which stands for branch decrement not zero. This instruction will decrement the count register, and then it will only branch to the address supplied as the first operand if the value of the count register, after being decremented, isn’t 0. Like I mentioned above, the bdnz instruction takes one operand, it is the address that it will potentially branch to. So let’s look at a loop.
li r3, 10
mtctr r3
loopTop:
addi r4, r4, 5
bdnz loopTop
The first thing that we do, is put 10 into the CTR, then it adds 5 to whatever value is in r4, then it comes down to the bdnz instruction, here, it decrements the CTR, so the value of the CTR is now 9, then it will only branch to the label loopTop, if the value in the CTR isn’t 0. Since the value in the CTR is 9. So it will loop through that addi instruction 10 times. Although that code is pointless, I hope that helped you understand looping in PPC.
Let’s take a look at a more practical use of looping in PowerPC. We’ll make a function that will generate the first n numbers in the Fibonacci sequence, and store those numbers in an array, that is passed in as a parameter. So the prototype of the function would look like this:
Now let’s look at the function itself.405×900
The first thing that it does is load r5 with the first value in the sequence, which is 0, and then it loads r6 with the second value in the sequence, which is 1. After this instruction, it does different things depending on the value of n. So I’m going to go through each value of n individually.
If n is 1
The next instruction it has to execute is the cmpwi instruction. Here, we are comparing n with 1. If n is 1, then we only need to put the first number in the sequence in the destination array. The instruction after the comparison will branch if n equals 1, which it does. So it goes down to the store1 label, and it simply stores the first number in the sequence in the array. Since there is nothing else for it to do, it returns.
If n is 2
Just like above we first compare n with 1. Since n isn’t equal to 1, it won’t branch to that label, it’ll just continue on. The next couple instruction will store the first two numbers in the sequence in the array called dest. After storing the first two numbers in the sequence, we check and see if there is anymore work to be done. If n is 2, then there is nothing else for us to do, we already “generated” the first two numbers in the sequence. So we will use the cmpwi instruction to compare n with 2. If they are equal, then we will return, well we’ll branch to a label that will return.
If n is greater than 2
We will say that n is 6 in this example. First, we must compare n with 1, to see if we only need to generate one number, since n is greater than 1, it won’t branch to that label. So just like above, it’ll store the first two numbers in the sequence in the array. Right after that it’ll compare n with 2 to see if anything else needs to be done. Since n is greater than 2, it’ll continue on, because it needs to generate the rest of the numbers. After the comparison, we subtract 2 from n since we already generated and stored the first two numbers in the sequence. After that, we put n into the CTR so that we loop the correct number of times later on. After we set up the count register, we add 8 to r4, which is the destination array. We do this so that r4 now stores the address of the next element in the array, since the first two are filled up with the first two Fibonacci numbers. If this is confusing, take a look at the following diagram.
As you can see from the diagram, all we’re doing is having r4 store the address of the next element in the array. Once we fix the array, we will generate the next number in the sequence, and this is really easy to do. To get the next number, we simply calculate the sum of the previous two. So we just add r5 and r6, to get the next number in the sequence. Then, we make the old first value the new first value, and we make the old first value, the new calculated value. So we are basically just moving along the sequence, generating the next number as we go. Maybe it’d help to look at a picture.
After we generate the next value, we store it in the array. We just store at 0 of r4, since we changed the value of it to point to the next element before we entered the loop. After we store the value, we add 4 to r4 so that it once again will point to the next element in the array. This will just set it up for the next iteration. The only thing left to do is loop back to the top. So it will decrement the CTR, and then branch as long as it’s not 0, which it isn’t. So it will just keep looping and looping through that code until it has filled the array with the first n numbers in the Fibonacci sequence. Here is a video of me stepping through the code.
You don’t have to use the count register if you want loop, however it is the easiest way. Instead, you could keep the counter in a GPR, and increment/decrement it each iteration. So we could do something like this if we wanted to.
li r3, 0
loopStart:
//some code to loop through
addi r3, r3, 1
cmpwi r3, 10
blt loopStart
So first we load r3 with 0. This means that instead of decrementing our counter each time, we are incrementing it. The we enter the loop, and it’ll execute whatever code you want it to run through. After it completes that, we increment our counter, and then we compare it with 10. We will loop back to the top as long as the counter is less than 10. This means that it’ll loop 10 times. You can loop through things this way if you prefer it, but using the CTR is easier in my opinion.
This is the end of chapter four. You can take the quiz here 7.
I’ll add more chapters later. If you find any grammatical/programmatical errors, tell me so I can fix them.
More::
Simplified Mnemonics for PowerPC Assembly
(Click here to view as PDF – one page printable version)
Simplified Mnemonics for PowerPC 555 Assembly | |||
Instruction | Description | Other Registers Altered | Explanation of Operation |
add rD, rA, rB | Add | CR0 (LT, GT, EQ, SO) | rD ¬ rA + rB |
addi rD, rA, value | Add immediate | None | rD ¬ rA + value |
addis rD, rA, value | Add immediate shifted left by 16 bits | None | rD ¬ rA + (value << 16) |
and rA, rS, rB | AND | None | rA ¬ rS & rB |
andi. rA, rS, value | AND Immediate | CR0 (LT, GT, EQ, SO) | rA ¬ rS & value |
andis. rA, rS, value | AND Immediate shifted left by 16 bits | CR0 (LT, GT, EQ, SO) | rA ¬ rS & (value << 16) |
b target_addr | Branch Always | None | Branch to target_addr |
ble target_addr | Branch if less than or equal to (LT or EQ flags of CR0 set) | None | Branch to target_addr if LT = 1 or EQ = 1 |
blt target_addr | Branch if less than (LT of CR0 set) | None | Branch to target_addr if LT = 1 |
beq target_addr | Branch if equal (EQ of CR0 set) | None | Branch to target_addr if EQ = 1 |
bge target_addr | Branch if greater than or equal to (GT or EQ of CR0 set) | None | Branch to target_addr if GT = 1 or EQ = 1 |
bgt target_addr | Branch if greater than (GT of CR0 set) | None | Branch to target_addr if GT = 1 |
blr target_addr | Branch to LR (Link Register) | None | Branch and link to target_addr |
bne target_addr | Branch if not equal (EQ of CR0 not set) | None | Branch to target_addr if EQ = 0 |
cmpw rA, rB | Compare Word | CR0 (LT, GT, EQ, SO) | rA – rB |
cmpwi rA, value | Compare Word Immediate | CR0 (LT, GT, EQ, SO) | rA – value |
la rD, label | Load Address based upon offset value | None | rD ¬ label |
lbz rD, d(rA) | Load Byte and Zero | None | rD ¬ m[rA + d] |
lbzx rD, rA, rB | Load Byte and Zero Indexed | None | rD ¬ m[rA + rB] |
lhz rD, d(rA) | Load Half Word and Zero | None | rD ¬ M[rA +d]15..0 |
lhzx rD, rA, rB | Load Half Word and Zero Indexed | None | rD ¬ M[rA +rB]15..0 |
li rA, value | Load immediate | None | rA ¬ value |
lis rA, value | Load immediate shifted left by 16 bits | None | rA ¬ (value << 16) |
lwz rD, d(rA) | Load Word and Zero | None | rD ¬ M[rA + d] |
lwzx rD, rA, rB | Load Word and Zero Indexed | None | rD ¬ M[rA + rB] |
mr rA, rS | Move Register | None | rA ¬ rS |
not rA, rS | Complement Register (invert) | None | rA ¬ ~rS |
ori rA, rS, value | OR Immediate | None | rA ¬ rS | value |
oris rA, rS, value | OR Immediate shifted left by 16 bits | None | rA ¬ rS | (value << 16) |
slwi rA, rS, value | Shift Left Immediate | None | rA ¬ (rS << value) |
srwi rA, rS, value | Shift Right Immediate | None | rA ¬ (rS >> value) |
stb rS, d(rA) | Store Byte | None | m[rA + d] ¬ rS7..0 |
stbx rS, rA, rB | Store Byte Indexed | None | m[rA + rB] ¬ rS7..0 |
sth rS, d(rA) | Store Half Word | None | M[rA + d]15..0 ¬ rS15..0 |
sthx rS, rA, rB | Store Half Word Indexed | None | M[rA + rB]15..0 ¬ rS15..0 |
stw rS, d(rA) | Store Word | None | M[rA + d] ¬ rS |
stwx rS, rA, rB | Store Word Indexed | None | M[rA + rB] ¬ rS |
sub rD, rA, rB | Subtract | None | rD ¬ rA – rB |
subi rD, rA, value | Subtract Immediate | None | rD ¬ rA – value |
subis rD, rA, value | Subtract Immediate shifted left by 16 bits | None | rD ¬ rA – (value << 16) |
Notes:
M[i] – refers to the contents of the word of memory beginning at location i.
m[i] – refers to the contents of the byte of memory beginning at location i.
7..0 – subscripts indicate bit positions (with 0 being the least significant bit)
Given the following memory dump:
0x00001000 01 23 45 67 89 AB CD EF
0x00001008 FF FF FF FF FF FF FF FF
Given these values:
r11 = 0x00001000
r12 = 0x00001002
IO_DIGITAL_INPUT_DIP_1 = 0x4000000B ; variable defined in code, address of DIP Switch 1
lis r3, IO_DIGITAL_INPUT_DIP_1@h ; r3 = High-order 16 bits of address of DIP switch 1
r3 ¬ (IO_DIGITAL_INPUT_DIP1@h << 16) ; use @h to get the higher 16 bits
r3 ¬ (0x4000 << 16)
r3 = 0x40000000
ori r3, r3, IO_DIGITAL_INPUT_DIP_1@l ; r3 = r3 | Low-order 16 bits of address of DIP switch 1
r3 ¬ r3 | IO_DIGITAL_INPUT_DIP1@l ; use @l (lower case L) to get the lower 16 bits
r3 ¬ r3 | 0x000B
r3 ¬ 0x40000000 | 0x000B
r3 = 0x4000000B
li r7, 0 ; Load r7 with immediate value of 0 (signed 16-bit number)
r7 ¬ value
r7 = 0x0000
lbz r5, 0(r11) ; Load r5 with byte value at memory location (r11 + 0)
r5 ¬ m[d + rA]
r5 ¬ m[0 + r11]
r5 ¬ m[0 + 0x00001000]
r5 = 0x01
lwz r6, 0(r11) ; Load r6 with word value at memory location (r11 + 0)
r6 ¬ M[d + rA]
r6 ¬ M[0 + r11]
r6 ¬ M[0 + 0x00001000]
r6 = 0x0123
lbz r7, 2(r11) ; Load r7 with byte value at memory location (r11 + 2)
r7 ¬ m[d + rA]
r6 ¬ m[2 + r11]
r7 ¬ m[2 + 0x00001000]
r7 = 0x45
stb r5, 8(r11) ; Store byte from r5 into memory at (r11 + 8)
m[d + rA] ¬ r5
m[8 + 0x00001000] ¬ 0x01
m[0x00001008] = 0x01
Before:
0x1008 FF FF FF FF FF FF FF FF
After:
0x1008 01 FF FF FF FF FF FF FF
- Assembly file extensions can be either .s or .asm
- To include another file in your assembly file:
.include „Defines.h“ ; include „Defines.h“ file for use in assembly file
- To label your code so CodeWarrior can debug it:
- First, for CodeWarrior to be able to access your function you need the following line:
.function „MyFile“, PPC_Start_Asm, PPC_End_Asm – PPC_Start_Asm
Where „MyFile“ is the name of your file (without the extension), PPC_Start_Asm is a label at the beginning of your code, and PPC_End_Asm is a label at the end of your code. The „PPC_End_Asm – PPC_Start_Asm“ portion tells CodeWarrior the size of your function in bytes (difference between end and beginning addresses).
- Next, tell the compiler that the following section is your assembly code and not something else. Put .text at the start of your assembly function:
.text
- Finally, put a label for the beginning of your program, following the label for PPC_Start_Asm:
PPC_Start_Asm:
MyFunction:
Both PPC_Start_Asm and MyFunction will contain the same address.
Here is a complete example of a simple program:
.include „defines.h“
.function „MyFile“, PPC_Start_Asm, PPC_End_Asm – PPC_Start_Asm
.text
PPC_Start_Asm:MyFunction:
; To Do: place your assembly code here
PPC_End_Asm:
- To include other Assembly files for use and debugging:
Do Not use .include to include other assembly files that you want toe use and debug, because then you will not be able to run through the file with the step-by-step debugger. The actual program may work but you can not go through the included code line by line. You can include subroutines or any other label (i.e. variables) from one assembly file to be used in another file. Do remember though to add the .function and .text and other required assembly definitions.
Place the .import in the assembly file (at the top) from which you want to call the subroutine. Basically can look at it like placing the prototype of the function at the top of the assembly file.
.import My_Function
Then place the .export in the assembly file (at the top) that actually contains the subroutine that you want to call.
.export My_Function
Example:
In “MainSource.asm”
.import My_Function
; your other code
; call to My_Function subroutine
bl My_Function
; more code
In “Other_File.asm”
.export My_Function
My_Function:
;your My_Function code
blr ; return to MainSource file
CALLEE | ||
REGISTER | USAGE | SAVE |
Integer Registers | ||
r01 | Used in prolog/epilog | NO |
r1 | Stack pointer | YES |
r2 | TOC pointer (reserved) | YES |
r3 | 1st para/return val | NO |
r4-r10 | 3-8th para | NO |
r11 | Environment pointer | NO |
r12 | Used by global linkage | NO |
r13 | reserved system thread ID | N/A |
r14-31 | Global int registers | YES |
Floating Point Registers | ||
f0 | Scratch reg | NO |
f1-13 | 1-13th fp para | NO |
f14-f31 | Global fp regs | YES |
Special Registers | ||
LR | Link register | YES |
CTR | Count register | NO |
XER | Fixed pt exception | NO |
FPSCR | fp status & ctrl | NO |
CR0-CR7 | Condition reg fields, each 4 bits wide | 2, 3, 4 : YES |
Vector Registers | ||
v0-v1 | scratch regs | NO |
v2-v13 | vec para regs | NO |
v14-v19 | scratch regs | NO |
v20-v31 | global vregs | YES |
vrsave | (32 bits) | YES |
- r3 is the 1st argument in functions, r4 is 2nd and so on
- r0 is ram
- r1 is the stack
- r2 holds toc
more:
Unlike x86, there are 32 PowerPC registers, but the registers have refreshingly uninteresting names—they’re called r0 through r31.
- r0, register 0, is a scratch register. It’s often used for a temporary.
- r1 is the stack pointer.
- r2 is reserved for some multithreaded global variable magic.
- r3 is used to return values (like eax), and as the first argument (like rdi).
- r4 is the second argument
- r5-r10 are also argument registers (or generic scratch)
- r13-r31 are saved registers
The names of the instructions are different; „li“ in PowerPC (Load Immediate) is about like a „mov“ in x86; „blr“ (Branch to Link Register) serves the same purpose as „ret“ in x86. So you can return the integer 7 like this:
li r3, 7
blr
(Try this in NetRun now!)Add, like most arithmetic on PowerPC, takes *three* registers: two sources, and a destination.
li r8, 8
li r9, 1000
add r3,r8,r9
blr
(Try this in NetRun now!)There’s a separate instruction named „addi“ (add immediate) to add a constant; plain „add“ only works on registers.
li r8, 8
addi r3,r8,1000
blr
(Try this in NetRun now!)PowerPC machine code always uses four bytes for every instruction (it’s RISC), while x86 uses from one to a dozen bytes per instruction (it’s CISC). Here’s a good but long retrospective article on the RISC-vs-CISC war, which got pretty intense during the 1990’s. Nowadays, RISC machines compress their instructions (like CISC), while CISC machines decode their instructions into fixed-size blocks (like RISC), so the war ended in the best possible way–both sides have basically joined forces!
One effect of fixed-size instructions is you can’t load a 32-bit constant in a single instruction:
li r3, 0xabcdef ; ERROR! out of range!
blr
(Try this in NetRun now!)Instead, you break the 32-bit constant into two 16-bit pieces. They have a dedicated load-and-shift instruction „lis“:
lis r3, 0xab ; "load immediate shifted" (the high half)
ori r3,r3, 0xcdef ; "or immediate" (the low half)
blr
Accessing Memory
Memory is accessed with the „lwz“ (load word) and „stw“ (store word) instructions. Unlike x86, these are the *only* instructions that access memory; you can’t do an „add“ with one operand in memory!
lwz r3, 0(r1) ; load register r3 from the stack
blr
Here I’m writing an integer out to the stack, then reading it in again.
li r7, 123
stw r7, 0(r1) ; store register r7 to the stack
lwz r3, 0(r1) ; load register r3 from the stack
blr
There are „updating“ variants of load and store called „lwzu“ and „stwu“. These actually change the value of the pointer used as an address. For example,this code does two things:
stwu r7, -4(r1)
- Store r7 into memory at address (r1-4).
- Modify r1 = r1-4.
Together, this forms the PowerPC equivalent of a „push“: it stores to memory, and updates the stack pointer.
Here’s an example:
li r7, 123
stwu r7, -16(r1) ; store register r7 to the stack (with push)
lwzu r3, 0(r1) ; load register r3 from the stack
addi r1,r1,16 ; clean up the stack
blr
Array indexing mostly has to be done manually. If r5 is the start of the array, and r6 is the index, you have to do something like this:
ori r5,r1,0 ; array pointer==stack pointer
li r6,2 ; array index
mulli r8,r6,4; array index*4
add r8,r8,r5; add base pointer
lwz r3,0(r8); access memory there
blr
You can combine the add and lwz with a „lwzx“:
ori r5,r1,0 ; array pointer==stack pointer
li r6,2 ; array index
mulli r8,r6,4; array index*4
lwzx r3,r5,r8; access memory at base + index
blr
Calling Functions
You can get into a function pretty easily, with a „b“ (branch, like „jmp“) instruction:
li r3,99
b _print_int
blr
(Try this in NetRun now!)Here, _print_int will end with its own „blr“, which will jump straight back to main, skipping us. Getting control back from a function is much trickier. The problem is a function will end with „blr“ (Branch to the Link Register); the Link Register can only hold one value at a time. So if you just overwrite the Link Register with your own value, you can’t return to main!
So this „bl“ (Branch and Link) will return control back to you, but then *keep* returning control back to you, in an infinite loop:
li r3,99
bl _print_int
blr ; Oops! We trashed LR with the "bl" above!
The sequence of events here is:
- Main calls us with „bl foo“. „bl“ will overwrite LR to point back to main.
- We call „bl print_int“. „bl“ will overwrite LR to point back to us.
- Print_int returns with „blr“. That transfers control back to LR, which is us.
- We return with „blr“, but that just transfers control back to us again!
- … repeat forever …
The fix is to save the link register before calling any functions, and restore main’s value before returning. Let’s try using a preserved register, just to see if it’ll work:
mflr r28 ; save main's link register
li r3,99
bl _print_int ; "bl" will overwrite LR, so print_int can return here
mtlr r28 ; restore main's link register
blr ; now this works... sorta
(Try this in NetRun now!)OK! Everybody returns correctly now, but main complains we overwrote its preserved data (r28 is preserved).
So now we save the old link register onto the stack:
mflr r0 ; save main's link register...
stwu r0,-32(r1); ... onto the stack
li r3,99
bl _print_int ; "bl" will overwrite LR, so print_int can return here
lwz r0,0(r1); grab main's link register from the stack
addi r1,r1,32 ; restore the stack
mtlr r0 ; restore main's link register
blr ; finally, this works correctly!
(Try this in NetRun now!)Whew! The x86 „call“ and „ret“ are looking a lot better now!
More Info
The IBM 32-Bit PowerPC Programming Environment gives all the instructions in chapter 8.1 and a good overview in chapter 4. The IBM Compiler Writer’s Guide gives the calling conventions.
more:
The integer exception register xer contains a bunch of stuff, but the ones that are relevant to us are
Bit | Name |
---|---|
0 | Summary overflow |
1 | Overflow |
2 | Carry |
Some instructions update the overflow and summary overflow bits in the xer register. When those instructions are executed, the overflow bit is updated to represent whether the operation resulted in a signed overflow. The summary overflow bit accumulates all the overflow bits since it was last explicitly reset. This lets you perform a series of arithmetic operations and then test a single bit at the end to see if an overflow occurred anywhere along the way.
Some instructions consume and/or target the carry bit in xer. We’ll discuss how carry works when we get to integer arithmetic.
Each of the cr# condition registers consists of four bits, numbered from most signficant to least significant.
Bit | Name | Mnemonic |
---|---|---|
0 | Less than | lt |
1 | Greater than | gt |
2 | Equal to | eq |
3 | Summary overflow | so |
For convenience, the assembler predefines the constants lt, gt, eq, and so to represent their respective bit numbers.
The cmp
family of instructions compare two values and write the result to a condition register.
cmpw crd, ra, rb ; crd = compare ( int32_t)ra with ( int32_t)rb cmpwi crd, ra, imm16 ; crd = compare ( int32_t)ra with ( int16_t)imm16 cmplw crd, ra, rb ; crd = compare (uint32_t)ra with (uint32_t)rb cmplwi crd, ra, imm16 ; crd = compare (uint32_t)ra with (uint16_t)imm16
You can compare two registers, or you can compare a register with an immediate, and you can choose whether the comparison is signed or unsigned. (Recall that the l
stands for logical.) For example:
cmpw cr3, r0, r1 ; cr3 = compare r0 with r1 as signed values
The lt, gt, and eq bits are set according to the result of the comparison, and the so bit receives a copy of the current summary overflow bit in xer.
If you do not specify a destination comparison register, it defaults to cr0:
cmpw ra, rb ; cr0 = compare ( int32_t)ra with ( int32_t)rb cmpwi ra, imm16 ; cr0 = compare ( int32_t)ra with ( int16_t)imm16 cmplw ra, rb ; cr0 = compare (uint32_t)ra with (uint32_t)rb cmplwi ra, imm16 ; cr0 = compare (uint32_t)ra with (uint16_t)imm16
As we’ll see later, some arithmetic instructions implicitly update cr0 by comparing the computed result against zero. (Similarly, some floating point operations implicitly update cr1.) When performed as part of an arithmetic instruction, the comparison is always performed as a signed comparison, even if the instruction’s underlying operation was unsigned.
If you combine an update of cr0 with an arithmetic operation, the so bit is a copy of the summary overflow bit in the xer register at the end of the instruction. That means that if an arithmetic operation requests both cr0 and xer to be updated, the xer register is updated first, and then the summary overflow bit from xer is copied to the so bit in cr0. That means that the so bit in cr0 captures whether a signed overflow occurred in any overflow-detecting operation up to and including the current one.
The Microsoft compiler tends to prefer to target cr6 and cr7 in its comparison instructions. It doesn’t make much difference to the processor, but I suspect the compiler tries to avoid cr0 so that it doesn’t conflict with the use of cr0 by the arithmetic instructions.
mcrxr crd ; crd = first four bits of xer
The “move to condition register from xer” instruction copies the summary overflow, overflow, and carry bits from the xer register to the specified condition register, and then it clears the bits from xer.
No, I don’t know why they left the “e” out of the opcode.
This is how you reset the summary overflow.¹
mtxer ra ; xer = ra mfxer rd ; rd = xer
These instructions² move to/from the xer register. They are another way to clear the xer register, or to set it to a particular initial state.
There are a good number of bitwise operations that combine two condition register bits and store the result into a third condition register bit. These let you build boolean expressions out of condition registers.
crand bd, ba, bb ; cr[bd] = cr[ba] & cr[bb] cror bd, ba, bb ; cr[bd] = cr[ba] | cr[bb] crxor bd, ba, bb ; cr[bd] = cr[ba] ^ cr[bb] crnand bd, ba, bb ; cr[bd] = !(cr[ba] & cr[bb]) crnor bd, ba, bb ; cr[bd] = !(cr[ba] | cr[bb]) creqv bd, ba, bb ; cr[bd] = !(cr[ba] ^ cr[bb]) crandc bd, ba, bb ; cr[bd] = cr[ba] & !cr[bb] "and complement" crorc bd, ba, bb ; cr[bd] = cr[ba] | !cr[bb] "or complement"
Remember that the PowerPC numbers bits from most significant to least significant, so bit zero is the high-order bit.
To save you from having to memorize all the bit numbers, the assembler lets you write cr0 to mean 0, cr1 to mean 1, and so through cr7 which means 7. Combined with the constants for the four bits in the condition register, this lets you write
crand 4*cr3+eq, 4*cr2+lt, 4*cr6+gt ; cr3[eq] = cr2[lt] & cr6[gt]
instead of the instruction only a processor’s mother could love:
crand 14, 8, 25 ; cr3[eq] = cr2[lt] & cr6[gt]
There are also special instruction for transferring between cr and a general-purpose register.
mfcr rt ; rt = cr mtcrf mask, ra ; cr = ra (selected by mask)
The mask is an 8-bit immediate. If a bit is set, then the corresponding cr# is copied from the corresponding bits of ra. For example, 128 means “Copy the top four bits of ra into cr0, and leave all the other condition registers alone.” Recall that the PowerPC counts bits from most significant to least significant, so cr0 is stored in the highest-order four bits.
The assembler provides synthetic instructions for various special cases of the above operations:
creqv bd, bd, bd ; crset bd ; cr[bd] = 1 crxor bd, bd, bd ; crclr bd ; cr[bd] = 0 cror bd, ba, ba ; crmove bd, ba ; cr[bd] = cr[ba] crnor bd, ba, ba ; crnot bd, ba ; cr[bd] = !cr[ba] mtcr ra ; mtcrf 255, ra ; cr = ra
Here’s an example of how these boolean operations could be used:
cmpw cr2, r4, r5 ; compare r4 with r5, put result in cr2 cmpw cr3, r6, r7 ; compare r6 with r7, put result in cr3 crandc 4*cr0+eq, 4*cr2+gt, 4*cr4+eq ; cr0[eq] = cr2[gt] & !cr4[eq] beq destination ; jump if r4 > r5 && r6 != r7
We perform two comparison operations and put the results into cr2 and cr3. We then perform a boolean “and not” operation that calculates
cr0[eq] = (r4 > r5) & !(r6 == r7) = (r4 > r5) & (r6 != r7)
The result is placed into the eq position of cr0, which makes it a perfect place to be the branch condition of the beq
instruction.
The traditional way of doing this on processors that don’t have these fancy condition register operations is to perform a test and a conditional branch, then another test and another conditional branch. Combining the results of the test and performing a single branch means that the entire sequence consumes only one slot in the branch predictor. This leaves more slots free to predict other branches, and the single slot this sequence does consume can predict the final result, which might be easier to predict than the individual pieces. (For example, the test might be validating that a parameter is one of two valid values. The parameter is almost always valid, even though one might not be able to predict which of the two valid values it is at any particular time.)
Fabian Giesen notes that in practice, you don’t get to perform this optimization as often as you’d like because of short-circuiting rules in many programming languages. Under those rules, this optimization works only if the second term can be evaluated without any risk of taking any exceptions (or if the language permits you to take an exception anyway, say, because any exception would be the result of undefined behavior).
I have yet to see the Microsoft C compiler for PowerPC perform this optimization. It just does things the conventional way. But that may just be because I haven’t encountered a situation where the optimization is even possible. (Also, because I’m studying code from Windows NT 3.51, and compiler technology was not as advanced back then.)
Okay, next time we’ll start doing some arithmetic.
¹ You might have noticed that there are only three interesting bits in xer but room for four bits in a condition register. The last bit is undefined. Usually, you don’t care much about the bits that got transferred; the main purpose of the instruction is its side effect of clearing the summary overflow.
² These instructions are actually special cases of the mtspr
and mfspr
instructions which move to/from a special register. The xer register is formally register spr1, so the mtxer
and mfxer
instructions are technically synthetic instructions.
mtspr 1, ra ; spr1 = ra mfspr 1, rd ; rd = spr1