Wed, 18 Feb 2009:

The really hard part of APC is the internal locking code it has - it's not that hard to do, just hard to figure out if you've done it wrong. And I'm just about to really mess around with the assembly spin locks and pthread mutex locks to make them cross-process locks which live in shared memory (remember that "volatile" keyword in C?). The other couple of lock modes are already cross-process and slow (because of the syscall). If these work right, I won't really have to cripple the fast part of the code to implement the features I have in mind.

But before I start to go MIA into the locking code, I'd like to get my testing in place. So I've written a small test app called lockhammer - read the makefile and please run it on every platform you want APC to work on. (make APC_DIR=~/apc link; make)

The code in lockhammer.c should be easily understood - basically it allocates some shared memory, creates a lock in it, forks, and re-attaches the memory in each process. Every process runs a loop of lock, write PID into shm, sleep, check the PID. If someone has a better idea of how to test locks, or thinks there's some corner case I missed, I'd like modifications to it as well (yes, random sleep & random fork-order are also on my list of TODOs).
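For readers who want the shape of that loop without opening lockhammer.c, here is a minimal sketch of the same idea in Python. It is not the actual lockhammer code: the names hammer and run_lockhammer are made up for illustration, the lock is a multiprocessing lock rather than one of APC's lock types, and it assumes a POSIX system where the "fork" start method is available.

```python
import multiprocessing as mp
import os
import time


def hammer(lock, owner, errors, rounds):
    """Lock, write our PID into shared memory, sleep, check the PID."""
    me = os.getpid()
    for _ in range(rounds):
        with lock:
            owner.value = me
            time.sleep(0.001)          # give other processes a chance to barge in
            if owner.value != me:      # somebody else wrote while we held the lock
                errors.value += 1


def run_lockhammer(workers=4, rounds=50):
    """Return the number of lock violations seen (0 for a working lock)."""
    ctx = mp.get_context("fork")       # assumption: POSIX fork() semantics
    lock = ctx.Lock()
    owner = ctx.Value("i", 0, lock=False)   # raw int living in shared memory
    errors = ctx.Value("i", 0, lock=False)
    procs = [ctx.Process(target=hammer, args=(lock, owner, errors, rounds))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return errors.value
```

A broken lock shows up as a non-zero return value; a correct one should always return 0, however many workers are hammering on it.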

Fundamentally, the information about locks is privately held within the lock type code in APC. The information needs to be moved into a shared mode (or at least, transparent) for multiple un-related processes to be able to share the cache without collisions. Eventually, you should be able to use APC in a standard FastCGI deployment without allocating a cache per-process.

And if you're a user, I'd like to read something other than a bug report, occasionally.

--
They're gonna lock me up and throw away the key!

posted at: 19:07 | path: /php | permalink | Tags: , ,

Tue, 16 Dec 2008:

Finally, after nearly a year of work, it's into a release. Some new stuff has sneaked into it undocumented, that people might find interesting - apc.preload_path would be one of them. The backend memory allocation has been re-done - the api part by me and the internals by shire. There's a hell of a lot of new code in there, both rewritten and added. Tons of php4 cruft removed, php5 stuff optimized, made more stable, then less stable, made faster, then applied brakes. Made leak-proof, quake-proof and in general, idiot-proof. So, on & so forth.


 apc/ $ cvs diff -u -N -r HEAD -r RELEASE_3_0_19 | diffstat /dev/stdin 
 
 68 files changed, 3255 insertions(+), 5545 deletions(-)

Sorry about the b0rked 3.1.1 release, so please test this one! :)

--
Each new user of a new system uncovers a new class of bugs.
                -- Kernighan

posted at: 15:27 | path: /php | permalink | Tags: , ,

Sat, 18 Oct 2008:

In the development of things, there comes a point when it escapes the vision and control of one man/one mind. PHP frameworks are such ... beasts. But the simplicity a machine took away can be made to return. And such an attempt at zooming out of the complex file structure bureaucracy of most php projects was inclued.

When I hacked up that extension, nearly a year back, I wished that it would shame at least some php programmers into writing better code. And slowly, thanks to a few slides from Rasmus, people are actually realizing how messy their include hierarchy really is. And here's an example of what I'm talking about.

That was the Zend Framework 1.5.2, as blogged by phpimpact - download the big one and look at it. The joomla CMS has also got its very own pretty picture elsewhere. Rasmus has a bunch of inclued traces from various frameworks - CakePHP, Symfony, Drupal, and perhaps the cleanest of them all, CodeIgniter.

Now, all that remains is a php-graphviz + svg mode which renders these in-browser as an iframe - or maybe someone can help me with the graph reduction to take a collection of the inclued dumps & create a "package". There's no end to the bells and whistles I want to tack onto this.

But as long as people are scrambling head over heels to reduce the number of includes & include_onces, I think I've done my part here.

--
Must I hold a candle to my shames?
                -- William Shakespeare, "The Merchant of Venice"

posted at: 01:27 | path: /php | permalink | Tags: , ,

Mon, 15 Sep 2008:

It's a protest. A protest against all the gags the establishment has put on php functions - big, small and useless alike. No more shall they remain ignored and voiceless. Hear me now, as the day has come for them to shake off their silence and SCREAM!

@Error: PHP uses the @ operator to silence errors from functions, so that they fail silently. But while tracing through code which uses it, it becomes nearly impossible to properly figure out what is going wrong. The band-aid that is '@' makes it a complete pain to debug code.

Introducing the SCREAM 0.0.1, which has come out of someone else's frustrations with some pear modules which are liberally peppered with such gag instructions. Essentially, it uses the user opcode functionality to override the silence functions into NOPs (literally).

php -dscream.enabled=1 -r '@foo();'

  Fatal error: Call to undefined function foo() in Command line code on line 1

Dump it into php and hopefully debugging sloppily coded libraries should become much easier. This message was brought to you by the dread of Mondays. It's all over now - End Transmission.

--
You have not convinced a man because you have silenced him.

posted at: 21:09 | path: /php | permalink | Tags: , ,

Mon, 26 May 2008:

That annoying file descriptor leak that snuck into 3.0.17 has finally been laid to rest. A few double free issues were fixed as I spent quite a long time staring at the same code, till enlightenment hit me like a clue bat. Along with that, there are a bunch of quickfixes for 5.3 quirks. I'm not happy with those, but this is the 3_0 stable branch and by the time 5.3 is popular enough the HEAD should be taking care of those problems. The build is broken in VC++ in this release, but excluding apc_pool.c from the build should work.

Expect more changes... as soon as I get back home.

--
Delay always breeds danger and to protract a great design is often to ruin it.
              -- Miguel De Cervantes

posted at: 04:27 | path: /php | permalink | Tags: , ,

Tue, 06 May 2008:

There's a certain cultural bankruptcy which shows itself in sequels. It indicates, that you're reduced to imitating yourself. But this isn't that kind of a sequel. No, not the kind where there are T Rexes in the city, trying to make a living drawing cartoons or Arnie switching from ammo boxes to ballots. This is the kind which gives a New Hope.

Yesterday, I had an outpouring of hate against the linux capability model. But the problem turned out to be that setuid resets all the capabilities. In hindsight that makes a lot of sense, but it didn't even strike me until the kernel people (y! has those too) got involved (and I didn't RTFM).

Enter Prctl: The solution was to use the prctl() call with PR_SET_KEEPCAPS to ensure that the capabilities are not discarded when the effective user-id of a process is changed. But, even then, only the CAP_PERMITTED flags are retained and the CAP_EFFECTIVE are masked to zeros.

So, with the prctl call and another cap_set_proc to reset CAP_EFFECTIVE, it was on a roll. Here's the patch on top of unnice.c.

 #include <sys/resource.h>
+#include <sys/prctl.h>
@@ -26,12 +27,14 @@

    if(!fork())
    {
+       prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0);

        /* child */
        if(setuid(nobody_uid) < 0)
        {
            perror("setuid");
        }
+       cap_set_proc(lcap);

        if(setpriority(PRIO_PROCESS, 0, getpriority(PRIO_PROCESS, 0) - 1) < 0)

Thus concludes this adventure, and I hope that this blog entry serves as a warning of things to come. Watch this space for more Tales! Of! INTEREST!.

--
Only great masters of style can succeed in being obtuse.

posted at: 18:34 | path: /php | permalink | Tags: , ,

Mon, 05 May 2008:

Running infinite loops is a tricky challenge. What happens to a process when a programmer writes an infinite loop should be familiar to all. But the challenge is to not let that affect the *other* processes. There seemed to be a perfect solution to it - setrlimit().

The function lets you set soft and hard limits on CPU, so that if a process does exceed the soft limit CPU usage, a SIGXCPU is raised. The process can catch the signal and do something sensible. Basically, all that was required was for the process to call setpriority and let the linux process scheduler slow it down to a trickle.
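The soft-limit mechanism is easy to try out in isolation. The sketch below is an illustration in Python, not the code discussed in this post: the function name demote_on_cpu_limit and the nice step of 5 are made up, and it assumes a POSIX system. A forked child sets a one second soft CPU limit, spins, and lowers its own priority when SIGXCPU arrives.

```python
import os
import resource
import signal


def demote_on_cpu_limit(soft_seconds=1, nice_step=5):
    """Fork a spinning child; return True if it caught SIGXCPU and reniced itself."""
    pid = os.fork()
    if pid == 0:
        # Child: everything here must end in os._exit().
        try:
            def on_xcpu(signum, frame):
                os.nice(nice_step)    # lowering our own priority needs no privileges
                os._exit(0)           # a real worker would carry on, just slower

            signal.signal(signal.SIGXCPU, on_xcpu)
            _, hard = resource.getrlimit(resource.RLIMIT_CPU)
            soft = soft_seconds
            if hard != resource.RLIM_INFINITY:
                soft = min(soft_seconds, hard)   # soft limit may not exceed hard
            resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))
            while True:               # the accidental infinite loop
                pass
        finally:
            os._exit(1)               # only reached if setup failed
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
```

Going down in priority like this always works. Coming back up afterwards is the part that needs CAP_SYS_NICE, which is where the trouble starts.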

A non-privileged process can lower its priority, but not raise it. Linux capabilities, however, allow you to grant CAP_SYS_NICE to the process, which essentially lets a non-privileged process muck around with priority - down and up.

To begin with, /proc/sys/kernel/cap-bound is unbelievably confusing to use. It is a 32-bit wide bit-mask in which the 23rd bit apparently seems to be the CAP_SYS_NICE value. After much mucking around, I came to the conclusion that "-257" would be 0xFFFFFEFF which only disables CAP_SETPCAP. But even then the setpriority call kept failing. Here's my test code.

cap_t lcap;
const unsigned cap_size = 1;
cap_value_t cap_list[] = {CAP_SYS_NICE};

lcap=cap_get_proc();
cap_set_flag(lcap, CAP_EFFECTIVE, cap_size, cap_list, CAP_SET);
cap_set_flag(lcap, CAP_PERMITTED, cap_size, cap_list, CAP_SET);

cap_set_proc(lcap);

if(setuid(nobody_uid) < 0) 
	perror("setuid");

if(setpriority(PRIO_PROCESS, 0, getpriority(PRIO_PROCESS, 0) - 1) < 0) 
	perror("setpriority");

Here's a link to the test case in a more compilable condition. Build it with gcc -lcap and run with sudo to test it. Right now, my ubuntu (2.6.22) errors out with this message.

bash$ gcc -lcap -o unnice unnice.c
bash$ sudo ./unnice 
0: =ep cap_setpcap-ep
setpriority: Permission denied
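Going back to the cap-bound mask for a moment, the "-257" arithmetic can be checked in a few lines of Python. The capability numbers are the ones from linux/capability.h (CAP_SETPCAP is 8, CAP_SYS_NICE is 23); the helper names are made up for this sketch.

```python
CAP_SETPCAP = 8     # from linux/capability.h
CAP_SYS_NICE = 23   # from linux/capability.h


def to_signed32(mask):
    """Interpret a 32-bit mask as the signed integer the sysctl file shows."""
    return mask - (1 << 32) if mask & 0x80000000 else mask


def has_cap(mask, cap):
    """True if the bit for capability number `cap` is set in the mask."""
    return bool((mask >> cap) & 1)


# All 32 bits set, except the one for CAP_SETPCAP.
mask = 0xFFFFFFFF & ~(1 << CAP_SETPCAP)

print(hex(mask))                      # 0xfffffeff
print(to_signed32(mask))              # -257
print(has_cap(mask, CAP_SYS_NICE))    # True
print(has_cap(mask, CAP_SETPCAP))     # False
```

So the bounding set really does allow CAP_SYS_NICE, which means the mask is not what makes setpriority fail.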

The core issue has to do with apache child-process lifetimes. The only recourse for me is to kill the errant process after the bad infinite loop and have the parent process spawn a new process with a normal priority. But that means blowing away nearly all the local process cache, causing memory churn and, more than that, the annoyance of a documented feature not working.

This story currently has no ending, but if any kernel hackers are reading this and should happen to know an answer, please email gopalv shift+2 php noshift+> net. And thus we prepare for a sequel (hopefully).

--
I use technology in order to hate it more properly.
                -- Nam June Paik

posted at: 22:03 | path: /php | permalink | Tags: , ,

Wed, 26 Mar 2008:

In response to CVE-2008-1488, APC 3.0.17 has just been pushed out with the requisite security fixes. But in the process of producing a php4 compatible release, a significant amount of code has been reverted in the merge into an APC_3_0 branch for future bugfixes.

I've spent a couple of hours unmerging my "bye bye php4" cleanups with the help of Kompare. And my sanity is simply due to the fact that I can "cvs diff -u | kompare -" to look at the resulting huge patch. But it is not unpossible that the new code merged from HEAD has regressions, so you could also apply the unofficial patch onto 3.0.16.

--
I took your advice and did my own thing. Now I've got to undo it.

posted at: 22:03 | path: /php | permalink | Tags: , ,

Sun, 30 Dec 2007:

I got a nice little present for Christmas.

It had +4 lines, a huge mail explaining why and made me feel happy & stupid at the same time (there's some correlation, I think).

The patch fixes an off-by-one error in the APC shm allocator (read my mail for a shorter paraphrasing) and has triggered the 3.0.16 release. Now, APC should be stable even when running under cache full/heavy load conditions. And I've been barking up the wrong tree of race conditions for months & months. But the important thing is that this is fixed now.

Merry XMas and a happy New Year! [ citation needed ]

--
Patience is a minor form of despair, disguised as virtue.
                -- Ambrose Bierce

posted at: 06:27 | path: /php | permalink | Tags: , ,

Fri, 19 Oct 2007:

APC 3.0.15 has been released - read the release announcement. Not too many changes since 3.0.14, but there's a reason it took this long to make so few changes.

To begin with, I've just been lazy. Just kidding! This release was actually delayed to make sure this could be the very last PHP4 release of APC. And with the amount of major changes coming in, the next release is definitely going to be 3.1.0 rather than a 3.0.16 release. Now I can start working on making some fairly big changes.

Bye Bye PHP4, it was a nice ride while it lasted.

--
While most peoples' opinions change, the conviction of their correctness never does.

posted at: 03:11 | path: /php | permalink | Tags: , ,

Mon, 24 Sep 2007:

PHP programmers don't really understand PHP.

They know how to use PHP - but they hardly know how it works, mainly because it Just Works most of the time. But such wilful ignorance (otherwise known as abstraction) often runs them aground on some issues when their code meets the stupidity that is APC. Bear with me while I explain something very simple about PHP - how includes work.

Every single include that you do in PHP is evaluated at runtime. This is necessary so that you could technically write an include inside an if condition or a while loop and have it behave as you would expect. But executing PHP in Zend is actually a two step process - compile *and* execute, of which APC gets to handle only the first.

Compilation: Compiling a php file gives a single opcode stream, a list of functions & yet another list of classes. The includes in that file are only processed when you actually execute the code compiled. To simplify things a bit, take a look at how the following code would be executed.

<?php
return;

include_once "a.php";
?>

The PHP compiler does generate an instruction to include file "a.php", but since the engine never executed it, no error is thrown for the absence of a.php. With that picture of how includes work, it is easier to see why classes & OOP face a unique problem during compilation.

<?php
include_once "parent.php";

class Child extends Parent
{
}
?>

Even though the class Child is created at compile time, its parent class is not available in the class table until the include instruction is actually executed & the parent.php compiled up. So, php generates a runtime class declaration which is an actual pair of opcodes.

ZEND_FETCH_CLASS              :1, 'Parent'
ZEND_DECLARE_INHERITED_CLASS  null, '<mangled>', 'child'

But what if the class parent was already in the class table when the file was being compiled? Like in the following index.php:

include_once "parent.php";
include_once "child.php";

$a = new Child();

Since obviously the parent class is already compiled & ready, Zend does something intelligent by removing the two instructions and replacing them by NOPs. That makes for fewer opcodes and therefore faster execution.

Here's the kicker of the problem. Which of these versions should APC cache? Obviously, the dynamically inherited version is valid for both cases - but APC caches whatever it encounters initially. The static version is obviously incompatible in a dynamic scenario.

So whenever APC detects that it has cached a static version, but this case actually requires a dynamic version, it decides to not cache that file *at* all from that point onwards. That's what the APC autofiltering does.

Now, you ask - how could it appear in perfectly normal code?

Assume child1 and child2 inherit from the parent class. Here is what the first hit on index.php looks like from an inclusion perspective. Now, it is obvious that the child2 in this case is actually compiled with the faster static inheritance (marked in orange) while child1 suffers the performance hit of not having Parent available till execution time.

Then we have a profile.php which only requires the child2 class. But while executing this file, APC fetches the copy of child2.php which was in cache - which is the statically inherited one.

As you could've guessed, the cached version is not usable for this case - and APC drops it out of cache. And for all requests henceforth, even for the index.php case, APC actually ignores the cached version and insists on compiling the file with Zend. If you enable apc.report_autofilter, this information will be printed out into the server error log.
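The sequence of events is easier to follow as a toy model. The following Python sketch is only a simulation of the behaviour described above, not APC's implementation: ToyOpcodeCache, its load method and the "static"/"dynamic" labels are all invented for illustration.

```python
class ToyOpcodeCache:
    """Caches whichever variant it sees first; filters a file on a mismatch."""

    def __init__(self):
        self.cached = {}       # filename -> "static" or "dynamic"
        self.filtered = set()  # files that will never be cached again

    @staticmethod
    def compile(parent_loaded):
        # Zend binds the class at compile time only if the parent is known.
        return "static" if parent_loaded else "dynamic"

    def load(self, filename, parent_loaded):
        """Return (variant, where it came from)."""
        if filename in self.filtered:
            return self.compile(parent_loaded), "compiled"
        variant = self.cached.get(filename)
        if variant is None:
            variant = self.cached[filename] = self.compile(parent_loaded)
            return variant, "compiled"
        if variant == "static" and not parent_loaded:
            # Cached copy assumes the parent exists: unusable, so autofilter.
            del self.cached[filename]
            self.filtered.add(filename)
            return self.compile(parent_loaded), "compiled"
        return variant, "cache"


cache = ToyOpcodeCache()
# index.php: child1 is compiled before Parent exists, child2 after.
first_index = [cache.load("child1.php", False), cache.load("child2.php", True)]
# profile.php: wants child2 while Parent is not in the class table yet.
profile = cache.load("child2.php", False)
# index.php again: child2 is now recompiled on every request.
second_index = [cache.load("child1.php", False), cache.load("child2.php", True)]
```

After the profile.php hit, child2.php is never served from the cache again, which is the performance cliff the autofilter report is warning about.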

Part of the culprit here is the conditional inclusion using include_once. With mere includes, you get an error whenever parent.php is included multiple times - but that can be annoying too. Where include_once/require_once can be debugged with Inclued, userspace hacks like the rinclude_once or !class_exists() checks make it really hard for me to figure out what's going wrong.

So, if you write One File per Class PHP and use such methods of inclusion, be prepared to sacrifice a certain amount of performance by doing so.

--
Doubt is not a pleasant condition, but certainty is absurd.
              -- Voltaire

posted at: 13:55 | path: /php | permalink | Tags: , ,

Fri, 07 Sep 2007:

After procrastinating for nearly two weeks with the code nearly done, I've managed to find the energy (and some caramel coffee) required to fix it up for the public to use - and here it is. In the process, I also threw out all the ZendEngine2 hacks and started to use zend_user_opcode_set_handler, which should let people use this with the faster CGOTO vm core, though I would advise against using that just yet.

The new & improved inclued can dump out class inheritance dependencies (though not the interfaces, as of now). This gives a slightly bigger picture view of what files depend on what other files and provides a tree of the classes clustered into their own files. For example, this is the graph pulled out from the relatively minimal PEAR::HTML_QuickForm2 library.

The usage is as before, the gengraph.php script now has a -t option which will accept either "classes" or "includes". At the very least, it should help people documenting OOP php code. Next up are interface implementations, the data is already in the dumped files, but not output in any human readable format.

--
It is a very sad thing that nowadays there is so little useless information.
        -- Oscar Wilde

posted at: 06:07 | path: /php | permalink | Tags: , ,

Sat, 11 Aug 2007:

Yak Shaving: So you start out with that simple problem. But half-way through fixing it, it explodes into this whole exercise in pointless dependencies. It is a rather recent wordification (never heard of that word ? it's a perfectly cromulent word). But considering the fate of the "pre-shaved yaks" guy, who ended up saying "It's a band.", I'd say it is not quite popular enough ... yet.

Now, before I start on the real topic - let me first say that the next release of APC will be the last release compatible with PHP 4.x. Now, what is wrong with just letting the #ifdefs stay ? That's where this snippet of code comes into play.

<?php 
  apc_store("a", array(new stdclass()));
  print_r(apc_fetch("a"));
?>

It doesn't work. Now, the problem is very simple - the original patch by Marcus only checks for objects in a very shallow way. It will detect & serialize objects which are passed to apc_store - but the check does not extend deeper into the recursive copy functions.

Symmetry: But the zval* copy functions were written to be beautifully symmetric. A copy into cache is nearly the same as copying out of it. And when I say "nearly", I actually mean that until the *_copy_for_execution() optimisations were thrown in, they were actually symmetric - in & out. But objects don't play nicely with that - because they are much more than just data.

In & Out: Objects require asymmetric caching. Storing into cache is a serialize operation, while retrieving from storage is a deserialize. This ensures that they end up with the right kind of pointers, class object initialization and that the resources they hold in their opaque boxes are properly handled. The objects have to implement their appropriate magic persistence methods.
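To make the asymmetry concrete, here is a small Python sketch of a copy-in/copy-out pair that recurses into arrays and treats objects differently in each direction. It is an analogy only: copy_in, copy_out, the persistent decorator and the sleep/wakeup method names are invented here and merely echo PHP's magic methods.

```python
REGISTRY = {}   # class name -> class, so copy_out can rebuild objects


def persistent(cls):
    """Register a class whose instances know how to sleep and wake up."""
    REGISTRY[cls.__name__] = cls
    return cls


@persistent
class Counter:
    def __init__(self, n=0):
        self.n = n

    def sleep(self):                 # what gets written into the cache
        return {"n": self.n}

    @classmethod
    def wakeup(cls, state):          # how a live object is rebuilt
        return cls(state["n"])


def copy_in(value):
    """Copy into cache: plain data is copied, objects are serialized."""
    if isinstance(value, list):
        return [copy_in(v) for v in value]
    if isinstance(value, dict):
        return {k: copy_in(v) for k, v in value.items()}
    if type(value) in REGISTRY.values():
        return {"__class__": type(value).__name__, "state": value.sleep()}
    return value


def copy_out(value):
    """Copy out of cache: the mirror image, except objects are deserialized."""
    if isinstance(value, list):
        return [copy_out(v) for v in value]
    if isinstance(value, dict):
        if "__class__" in value:
            return REGISTRY[value["__class__"]].wakeup(value["state"])
        return {k: copy_out(v) for k, v in value.items()}
    return value


stored = copy_in([Counter(3), {"k": Counter(7)}])
fetched = copy_out(stored)
```

The point is that the object check lives inside the recursion. A check that only looks at the top-level value would miss the Counter nested in the list, which is exactly the failure in the apc_store snippet above.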

And thus begins the Yak Shaving. I need to rewrite most of the cache copy-in and copy-out functions to handle the basic asymmetry. But consider this: most of the code in there has been limited for months because of the fact that I cannot optimize on PHP data structures without breaking the symmetry.

A couple of years ago, I sat through a full hour talk by Rusty Russell about talloc(). Built on top of the trusty old malloc() calls, it simplifies memory management a lot for Samba4. So bear with me as I take a brain dump of my idea - for my very intelligent reader to poke holes in (gopalv shift+2 php net).

APC's allocation strategy is a little brain dead. To allocate 4 bytes of data, it actually requires 24 bytes of space. But much more than the space wastage, I'm more concerned about the number of lock() calls required to cache a single php file - a hello world program takes about 22 lock operations (11 locks, 11 unlocks). Yes, that's actually 22 syscalls just to cache echo "hello world";.

I've previously tried to fix it with partitioned locks. The problem with that was actually cleaning up the locks, because the extension code would have to have special cases for every SAPI - because of some bugs in PHP 5.x. So, the "if you don't succeed, destroy all evidence" principle made me throw out that idea. But the cache-copy, zend-copy separation should help me revive another approach to this.

Pools: So, now that I'm officially b0rking up APC, I could as well slap on a new pool allocator, right on top of sma_allocate - a la talloc(). The allocation speed would skyrocket, because the in-pool allocs are sequential and do not have any fragmentation issues due to blocks in the middle being free'd. As important as the allocations are, the real advantage of this would be that I could basically speed up cache expunges by an order of magnitude or more. The 22 syscall cache expunge for hello world would be reduced to a potential pair of syscalls - because it would be a single free of the entire pool space.

Right now the pool is actually built up to be of the following structure.

struct apc_pool_t {
	int capacity;
	int avail;
	void *head;
	apc_pool_t *overflow;
	unsigned char data[0];
};

I've yet to run this through an x86_64 build, but an even multiple of int/void* should align the data area right onto a word boundary. And I think nearly every pool should be around 4k (i.e. 4096 - sizeof(apc_pool_t)) for opcode cache and 1k for data cache. I might make the latter a runtime tuneable, just to pad the APC manual up into an entire book (just in case someone asks me to write one .. *heh*).
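The effect on lock traffic can be sketched without any C. The Python below is a rough model of the idea, not the apc_pool_t code: BackingAllocator stands in for the shared memory allocator and simply counts lock/unlock pairs, while Pool mirrors the capacity/avail/head/overflow fields of the struct above. Both class names and the destroy method are made up for illustration.

```python
class BackingAllocator:
    """Stand-in for the shm allocator; every call costs a lock/unlock pair."""

    def __init__(self):
        self.lock_pairs = 0

    def allocate(self, size):
        self.lock_pairs += 1
        return bytearray(size)

    def free(self, block):
        self.lock_pairs += 1


class Pool:
    """Bump allocator: sequential allocs inside one big block, no per-alloc lock."""

    def __init__(self, backing, capacity=4096):
        self.backing = backing
        self.capacity = capacity
        self.avail = capacity
        self.head = 0                          # offset of the next free byte
        self.data = backing.allocate(capacity)
        self.overflow = None                   # next pool, once this one fills up

    def alloc(self, size):
        if size > self.capacity:
            raise ValueError("request larger than one pool block")
        if size <= self.avail:
            offset = self.head
            self.head += size
            self.avail -= size
            return self, offset
        if self.overflow is None:
            self.overflow = Pool(self.backing, self.capacity)
        return self.overflow.alloc(size)

    def destroy(self):
        """Release the whole chain: one free per pool block, not per alloc."""
        pool = self
        while pool is not None:
            self.backing.free(pool.data)
            pool = pool.overflow


backing = BackingAllocator()
pool = Pool(backing)
for _ in range(11):            # the 11 allocations of a hello world script
    pool.alloc(64)
after_allocs = backing.lock_pairs
pool.destroy()
after_destroy = backing.lock_pairs
```

Eleven allocations cost a single trip through the locked allocator, and tearing everything down costs one more, instead of one lock/unlock pair per allocation in each direction.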

None of this is included in APC 3.0.15, which will exit out of the gates as soon as I'm sure I'm happy with its stability. The new code will probably be an APC 3.1 release, marking the end of php4 compat & opening up the door for php6 compat.

A two line bug report which exploded into nearly two thousand lines of C code - that's just classic yak shaving.

--
10 If it ain't broke, break it;
20 Fix it.
30 Goto 10

posted at: 09:27 | path: /php | permalink | Tags: , ,

Wed, 08 Aug 2007:

Finally, I got bored enough to update my inclued extension (as promised at OSCON). The extension now comes with a nearly completely non-intrusive data dumping mode. The new inclued.dumpdir can be used to dump the inclued data onto a temporary file without ever modifying any of your php scripts. Also included is some php code to transform the dump data into graphviz formatted .dot files.

Pick up your free & complimentary copy of the source code on your way out. And stay clued-in about your includes.

--
This quote intentionally not included.

posted at: 04:27 | path: /php | permalink | Tags: , ,

Mon, 09 Jul 2007:

Recently, one of the php lists I'm on was asked how to implement a stable sort using php's sort functions. But since all of php's sort functions eventually seem to land up in zend_qsort, the default sort is not stable. The query on list had this simple example which illustrates the problem clearly.

bash$ php -r  '$a= array(1,1,1); asort($a); print_r($a);'

Array
(
    [2] => 1
    [1] => 1
    [0] => 1
)

The basic problem here is to produce a stable sort which still operates in quicksort O(n*lg(n)) time. Essentially, falling back onto a bubble sort is ... well ... giving up :)

Schwartzian transform: The programming idiom, named after Randal L. Schwartz, is a carry-over of the decorate-sort-undecorate lisp memoization into perl. The real problem here however was putting it into a php syntactic form which was clean and as similar to the original as possible. For example, here's what the python version of the code would look like (despite the fact that the current python sort is stable).

a = [1,1,1]
b = list(zip(a, range(len(a))))  # decorate
b.sort()                         # sort
a = [x[0] for x in b]            # undecorate

array_walk() magic: Coming from a world of constant iterators, I had read the array_walk documentation and sort of read the "Users may not change array itself ..." as boilerplate. But as it turns out, the callback function is allowed to change the current value *in place* and for that purpose it gets a reference to the value. With that in mind, array_walk becomes a faux in-place map/transform function.

$a = array(1,1,1);

function dec(&$v, $k) { $v = array($v, $k);}
function undec(&$v, $k) { $v = $v[0]; }

array_walk($a, 'dec');   // decorate
asort($a);               // sort
array_walk($a, 'undec'); // undecorate

And there you have it, a lispism made famous by perl, implemented nearly exactly in php.
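Since python's own sort is already stable, the trick is easier to appreciate on top of a sort that is not. The sketch below uses a deliberately unstable selection sort as a stand-in for zend_qsort; selection_sort and stable_via_decoration are names invented for this example.

```python
def selection_sort(items, key):
    """A deliberately unstable sort: swapping can reorder equal keys."""
    out = list(items)
    for i in range(len(out)):
        smallest = i
        for j in range(i + 1, len(out)):
            if key(out[j]) < key(out[smallest]):
                smallest = j
        out[i], out[smallest] = out[smallest], out[i]
    return out


def stable_via_decoration(items, key):
    """Decorate with the original position, sort, then undecorate."""
    decorated = [(key(v), i, v) for i, v in enumerate(items)]           # decorate
    decorated = selection_sort(decorated, key=lambda t: (t[0], t[1]))   # sort
    return [v for _, _, v in decorated]                                 # undecorate


rows = [(1, "a"), (1, "b"), (0, "c")]

unstable = selection_sort(rows, key=lambda r: r[0])
stable = stable_via_decoration(rows, key=lambda r: r[0])

print(unstable)   # [(0, 'c'), (1, 'b'), (1, 'a')]  - "a" and "b" swapped places
print(stable)     # [(0, 'c'), (1, 'a'), (1, 'b')]  - original order kept
```

The original index acts as a tie-breaker, so no two decorated keys ever compare equal and the underlying sort has no freedom left to reorder them.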

--
Whoever knows he is deep, strives for clarity;
Whoever would like to appear deep to the crowd, strives for obscurity.
                        -- Nietzsche

posted at: 08:06 | path: /php | permalink | Tags: ,

Sat, 26 May 2007:

Brian Shire has put up his slides (835k PDF) of his php|tek talk.

They follow quite interesting procedures to prevent the very obvious cache slam issues, such as firewalling the apache while restarting it, as well as the priming sub-system they use. The cross-server variables (aka site-vars) seem like a good idea as well - a basic curl POST request moving around json data could potentially serve as a half-reliable cross-server config propagator.

Seeing my code used makes me happy ... very happy, indeed.

--
I'm willing to make the mistakes if someone else is willing to learn from them.

posted at: 22:27 | path: /php | permalink | Tags: , ,

Mon, 21 May 2007:

Previously while talking about inclusion checks I had included a few helpful digraphs of php includes. Those were drawn with the help of a gdb macro and a bit of sed/awk. But that becomes a real hassle to actually use very quickly while inspecting fairly large php applications.

The solution to repeatable include graphs from php came from the way the include_once override hack was implemented. By overriding the INCLUDE_OR_EVAL opcode in Zend, I could insert something which could potentially log all the various calls being made to it.

That is exactly what inclued does. The extension provides a single function which provides a copy of the collected data, aptly named inclued_get_data().

<?php 
  include("x.php");
  print_r(inclued_get_data());

The above piece of code prints an array with too much information.

Array
(
  [includes] => Array
    (
      [0] => Array
        (
          [operation] => include
          [op_type] => 2
          [filename] => x.php
          [opened_path] => /tmp/x.php
          [fromfile] => /tmp/z.php
          [fromline] => 2
        )
    )
)

Overriding the opcode implies that more information can be collected, like whether the include parameter was a constant (i.e APC can optimize it to a full path), the function the include was in (__autoload *cough*) and the stream wrapper (phar: for instance). The information also includes info about whether the include_once was ignored because of duplication.

Only Data: The extension however does not make any judgements on the code. I have resisted the temptation to put an error log saying "If eval() is the answer, you're almost certainly asking the wrong question" . But there are interesting ways to massage this data into something useful. For example, I just put the following lines into the Wordpress index.php.

$fp = fopen("/tmp/wp.json", "w");
fwrite($fp, json_encode(inclued_get_data()));
fclose($fp);

Loading the JSON data and pushing out a graphviz layout was almost too easy. Here's a clipped section of how my graph ended up looking.

The graph shows about 40-odd includes being hit for a single request, but isn't messy at all. Hopefully, someone finds an actual piece of spaghetti includes which shows off the beauty of this ext (remember, this shows include_once misses & hits).
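The JSON-to-graphviz step could look roughly like the following Python sketch. It is not the script used for the graph here: the function name inclued_to_dot is made up, and it assumes the JSON file has the same layout as the print_r output shown earlier (an includes list whose entries carry operation, fromfile and opened_path).

```python
import json


def inclued_to_dot(data):
    """Turn inclued-style data into a graphviz digraph, one edge per include."""
    lines = ["digraph includes {"]
    for inc in data.get("includes", []):
        source = inc["fromfile"]
        # Fall back to the raw filename if the path was never resolved.
        target = inc.get("opened_path") or inc["filename"]
        lines.append('  "%s" -> "%s" [label="%s"];'
                     % (source, target, inc["operation"]))
    lines.append("}")
    return "\n".join(lines)


# Sample data copied from the print_r output earlier in this post.
sample = json.loads("""
{"includes": [{"operation": "include", "op_type": 2,
               "filename": "x.php", "opened_path": "/tmp/x.php",
               "fromfile": "/tmp/z.php", "fromline": 2}]}
""")

dot = inclued_to_dot(sample)
print(dot)
```

Feeding the resulting text to the graphviz dot tool is then enough to get a picture.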

Hope this helps people debug the "Where is this include coming from ?" question I've run into so many times :)

--
It was the kind of mental picture you tried to forget. Unsuccessfully.
            -- Terry Pratchett, "The Light Fantastic"

posted at: 19:43 | path: /php | permalink | Tags: , ,

Thu, 10 May 2007:

APC user cache is cool. It provides an easy way to cache data, in the convenient form of hashes & more hashes within them, to share the data across processes. But it does lend itself to some abuses of shared memory which will leave your pager batteries dead, disk full of cores and your users unhappy. Eventually, the blame trickles down to splatter onto APC land. Maybe people using it didn't quite understand how the system works - but this blog entry is me washing my hands clean of this particular eff-up.

apc_fetch/_store: The source of the problem is how apc_fetch and apc_store are used in combination. For example, take a look at countries.inc from php.net. The simplified version of code looks like this.

if(!($data = apc_fetch('data'))) {
  $data = array( .... );
  apc_store('data', $data);
}

There is something slightly wrong with the above code, but the window for the race condition is too small to be even relevant. But all that changes the moment a TTL is associated with the user cache entry. For a system under heavy load, all hell breaks loose when a user cache entry expires due to TTL. Let me explain that with some pretty pictures and furious hand-waving.

Each of the horizontal lines represents an individual apache process. The whole chain reaction is kicked off by a cache entry disappearing off the user cache. Every single process which hits the apc_fetch line (as above) now falls back into the corresponding apc_store. The apc_store operation does not lock the cache while copying data into shared memory. So all processes are actually allowed to proceed with the copy into shared memory (yellow block) in parallel. The actual insertion into the cache, however, is locked. The lock is hit nearly simultaneously by all processes and sort of cascades into blocking the next process waiting on the lock.

Lock, lock, b0rk !: The cascade effect of waiting on the same lock eventually results in one process locking for so long that it hits the PHP execution timeout, or the user gets bored and just presses disconnect from the browser. In apache prefork land, these generate a SIGPROF or SIGPIPE respectively. If for some reason that happens to be while code inside the locked section is being executed, apache might kill PHP before the corresponding unlock is called. And that's when it all goes south into a lockup.

So, when I ran into this for the first time, I did what every engineer should - damage limitation. The signals were by-passed by installing dummy signal handlers and deferring the signals while in locked sections. Somebody needs to rewrite that completely clean-room, before it is going to show up in the pecl CVS. The corresponding cache slam in the opcode cache is controlled by checking for cache busy and falling back to zend_compile - but the user cache has no such fallbacks.

I wish that was the only thing that was wrong with this. But I was slightly misleading when I said the copy into shared memory was parallel. The shared memory allocator still has locks and the actual allocation looks somewhat like this for 3 processes.

The allocations are interleaved both in time and in actual layout in memory (the bottom bar). So adjacent blocks actually belong to different processes, which is not a bad thing in itself. But as the previous picture illustrates, every single apc_store() call removes the previous entry and frees the space it occupies. Assuming there are only three processes, the free operation happens as follows.

The process results in very heavy fragmentation, due to the large amount of overlap between the shared memory copy (apc_cache_store_zval) across processes. The problem neatly dove-tails into the previous one as the allocate & deallocate cost/time increases with fragmentation. Sooner or later, your apache, php and everything just gives up and starts dying.

There are a couple of solutions to this. Since APC 3.0.12, there is a new function apc_add which reduces the window for race conditions - after the first successful insertion of the entry, the execution time of the locked section is significantly reduced. But this still does not fix the allocation churn that happens before entering the locked section. The only safe solution is to never call apc_store() from a user request context. A cron-job which hits a local URL to refresh cache data out-of-band is perfectly safe from such race conditions and the associated memory churn.

But who's going to do all that ?

--
In the Beginning there was nothing, which exploded.
                -- Terry Pratchett, Lords and Ladies

posted at: 13:27 | path: /php

Thu, 26 Apr 2007:

For any php performance freak, include_once has been a pain in the neck for a long time. In a previous post I had talked about how the common workarounds affect something like APC. With the release of APC 3.0.14, there is a decent workaround which doesn't require any changes to the php code. But first, let me drag out another almost-workaround to the whole include_once problem.

rinclude_once: So we create a new function for the purpose. Let me just call it rinclude_once and use that everywhere. The function takes a filename and records it in a hash before including it.

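function rinclude_once($file)
{
  global $rinclude_files;

  // Relative paths fall back to the engine's own include_once.
  // ($file[0] replaces the old $file{0} syntax, which newer PHP rejects.)
  if($file[0] != '/') return include_once($file);

  // Already included: skip it without touching the filesystem.
  if(isset($rinclude_files[$file])) return true;

  $rinclude_files[$file] = true;

  return include($file);
}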

This bit of code works - but only for absolute filenames. I put it through its paces without apc against include_once. Surprisingly, without APC, this code is slower than the php engine's include_once checks. That somewhat makes sense because the extra include for the rinc.php makes a slight dent into the compile and execution time, overshadowing the cycles wasted in the include_once syscall land.

Excluding the slight blip in performance, the two bits of code are nearly neck and neck. The real cost of include_once is only evident when you throw in APC. For every file which was included, include_once opens the file before checking for multiple inclusions. The extra system call shows up in the graph below. But the rinclude_once does not work at all for relative path includes (the second pair of bars) and therefore trails badly in this race for performance.

include_once_override: The solution to this problem is freakish. APC meddles with the brains of the Zend interpreter and inserts its own version of ZEND_INCLUDE_OR_EVAL opcode handler. The new handler does not indulge in the gratuitous fopen idiocy present in the default handler (but it does fopen once) and checks for the filename in EG(included_files) hash before doing a normal include() (thank pollita for that). And it should come as no surprise that the C module outperforms the php land equivalent.

But all of these only work for absolute paths, as the pathetic numbers on the relative path includes show. That is where the path canonicalization kicks in. In APC's nostat mode, the filenames have to be absolute paths for the mode to be useful. But rather than force everyone to modify their include lines, APC rewrites the constant strings into the corresponding path names, after lookup. Essentially it converts all relative path includes, such as those from the pear paths, into absolute path includes. This works well with the stat=0 mode because file modifications are ignored after caching.

At last, we see both the relative and absolute includes touching the same level of performance - because they are no different from each other in opcode land. But as you can clearly see, that does not improve the performance of the relative include for the rinclude_once because it does dynamic includes. The opcode cache cannot determine what the value of $file will be for the include_once($file); line and cannot optimise that. The performance actually takes a dive because the relative path name passed in has to be fully resolved for every request.

But having said all this, the same benchmark with plain includes is faster than any of these. I think there is a fair bit of optimisation left in this beast, but what is needed for that is the rest of the world to disappear while I code. Deep hack is hard to achieve when you're a ....

--
You can't second-guess ineffability, I always say.
              -- Good Omens

posted at: 02:27 | path: /php

Thu, 05 Apr 2007:

APC 3.0.14 (code named "A bigger boy made me do it, sir") went out a couple of days ago - read the release announcements. The major thing in the release is a fair bit of performance improvement for those who don't use threads. Also I've figured out a quick way to limit memory fragmentation when the APC user cache (apc_fetch/_store) is heavily used - the new fraglimit fixes should solve all the small fragment issues with 3.0.13. And following my recent obsession with drawing pretty graphs for everything, here's how the old version looks compared to the latest code (requests per second for an include_once benchmark).

To get to such levels of performance, the code has some configuration parameters that can be set. The apc.localcache setting creates a process (yes, not thread) specific lockless cache which is basically a layered shadow cache on top of the same shm data. The apc.include_once_override setting is also now usable because of the appropriate checks put in to reduce the overhead of include_once. And now, when you turn off stat checks (apc.stat=0), there's a bit of code which pre-computes the path of the included file so that the mode can be effective for includes with relative paths or from include_path dirs.

The release is hopefully stable enough to provide someone with enough ramp-up time to get started, if I stop working full-time on APC. I've spent a fair bit of time stabilizing basic functionality and have kept most of these optimisations optional, to be able to look at other work for a while.

--
Periods of productive stability, interrupted by bursts of test-bed change is much less disruptive than constant ripples of change.
              -- Fred Brooks Jr, "The Mythical Man Month"

posted at: 04:11 | path: /php

Fri, 30 Mar 2007:

So, there I was debugging what looked like a memory leak in APC - a perfectly straightforward bug at first glance. APC's internal allocator was leaving around a bunch of 40 byte fragments all over the place. The fragments were literally killing APC allocate and deallocate performance - with nearly 85k fragments lying around in the 128Mb cache that www.php.net uses. Even though the allocator is a first-fit based system, it still has to traverse a large number of blocks to locate the previous free block to free any particular allocated block.

Basically, it was having serious issues with memory performance. This had something to do with one of the changes I'd put into APC-3.0.13 - canary checks. The canary essentially increased the memory header size by one size_t exactly. This broke the default word alignment on x86, but I thought I had all bases covered when I put in the appropriate word aligns.

24 Bytes: Now, the smallest allocation in APC is 24 bytes on x86. The block header is 12 bytes (3 x sizeof(size_t)), padded up to a multiple of 8, which comes to 16 bytes. Then put in the data area (say, 1 byte), which is padded up to 8 bytes. Add all of it together and the smallest block APC can allocate is 24 bytes.
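The arithmetic above can be sketched in a few lines of C. This is only an illustration of the rounding, not APC's allocator code: the header is hard-coded to 12 bytes (three 4-byte size_t fields, as on 32-bit x86) so the numbers come out the same on any machine, and the names align8 and apc_block_bytes are invented here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values from the text: 12-byte header, 8-byte alignment. */
enum { SKETCH_HEADER_BYTES = 12, SKETCH_ALIGNMENT = 8 };

/* Round n up to the next multiple of 8. */
size_t align8(size_t n)
{
    return (n + SKETCH_ALIGNMENT - 1) & ~(size_t)(SKETCH_ALIGNMENT - 1);
}

/* Total block size for a payload: padded header plus padded data. */
size_t apc_block_bytes(size_t data_bytes)
{
    assert(data_bytes > 0);
    return align8(SKETCH_HEADER_BYTES) + align8(data_bytes);
}
```

Under these assumptions a 1 byte payload costs 16 + 8 = 24 bytes, and any payload of 17 to 24 bytes lands in a 16 + 24 = 40 byte block.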

Due to some strange quirk of code, 40 bytes seems to be a very unpopular size to allocate. Allocations for 17-24 bytes of data go into the 40 byte block, and for some strange reason those seem really rare. I ran through a bunch of the standard tests I run with APC to get some sane statistics out of it. After running through a hundred-odd random tests from the standard phpt files, I got a pile of data. Twenty minutes later I had pulled that data together into a rough histogram (which is nearly the same thing as a bar chart for discrete data, I suppose) by printing out SVG and applying styling in inkscape.

Maybe I'm just hooked on drawing pretty graphs. But it clearly indicates what is going wrong. There are not enough 40 byte allocations to consume all the spare chunks being created. But, is this not true for the 80 or 96 byte blocks, you might ask. Unlike the 40 byte block, both 80 (32 + 48) and 96 (48 + 48) byte blocks are easily consumed by requirements for smaller blocks. The 40 byte block on the other hand cannot be split into any smaller blocks because it is smaller than 24x2.
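The split rule can be written down as a one-line check. This sketch only models the condition described above (a free block can be carved up only if the leftover is itself a valid block of at least 24 bytes); it says nothing about what else the real allocator does, and the names can_split_block and SKETCH_MIN_BLOCK are hypothetical.

```c
#include <stddef.h>

/* Smallest block the allocator can hand out, per the text above. */
enum { SKETCH_MIN_BLOCK = 24 };

/*
 * Returns 1 if a free block of free_bytes can serve request_bytes and
 * still leave a remainder big enough to be a block of its own,
 * 0 otherwise.
 */
int can_split_block(size_t free_bytes, size_t request_bytes)
{
    if (request_bytes < SKETCH_MIN_BLOCK || free_bytes < request_bytes)
        return 0;
    return free_bytes - request_bytes >= SKETCH_MIN_BLOCK;
}
```

Even the smallest request (24) leaves only 16 bytes out of a 40 byte block, so can_split_block(40, 24) is 0, while can_split_block(80, 32) and can_split_block(96, 48) are both 1.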

Thus due to the lack of demand and the inability to compromise (*heh*), the 40 byte blocks remain unwilling to accept any commitments. Until a memory block nearby is freed, the block will sit around waiting for someone to allocate 40 bytes - which, as the pretty graph shows, is not a popular choice.

Now to sleep on this problem and hope I wake up with a solution - clear and perfect.

--
If it breaks then you get to keep both pieces.
          -- Warranty disclaimer for the chat program.

posted at: 07:27 | path: /php

Thu, 15 Mar 2007:

For every other php programmer who reads Rasmus's no framework mvc, these following lines are what they often finally remember.

3. Fast
    * Avoid include_once and require_once
    * Use APC and apc_store/apc_fetch for caching data that rarely changes

Even though include_once has its performance hit, some people avoid it with some rather simple code borrowed from their C experience. Here is how the code looks in general.

<?php
if(defined(__FILE__)) return;

define(__FILE__, true);

This is nearly identical to the include guard you would use in a C header to prevent multiple inclusion. But as the emergence of precompiled headers shows, even those folks are trying to reduce the expenditure of including & pre-processing the same file multiple times.

I do not deny that the include check above works. But it checks for double inclusion during execution, which is exactly what was wrong with include_once as well. Even worse, it hits APC really badly. But the situation takes a bit of context to understand - let us pick a 'real' library fubar (name changed to keep my job) which has been avoiding include_once. Here is what the logical dependency graph looks like:

In a moment of madness, you decide to make all includes properly for design coherence, especially for that doxygen output to look purty. But instead of using include_once as a sane man should, you remember the wisdom of elders and proceed to do includes. And then kick it up a notch with inclusion checks as illustrated above. But this is what Zend (and by design APC, too) actually compiles up.

The nodes marked in red are actually never used because of the inclusion checks, but they are compiled and installed. Zend pollutes the function table and class table for such with a bunch of mangled names for each function - APC serves up a local copy of the same cached file for multiple inclusions, which all have the same mangled name - by ignoring redeclaration errors.

If you were using include_once, these files would have never been compiled. But the above solution *seems* to work in APC land, but in reality does not play very nice at high cache loads. And while debugging *cough* fubar, I ran into a very corner case mismatch issue.

During a cache slam or expunge - when the cache is being written to by one process, other processes do not hit the cache and fall back to zend compile calls. Now imagine such a cache fail happening mid-way in one request.

Now the executor has two types of opcode streams to deal with, one which is Zend fresh ! and one which is from the APC (Opcodes in a Can) freezer. Even though only a couple of opcodes in the normal opcode stream are executed, the pre-execution phase of installing classes and functions in their respective tables runs into unknown issues thanks to early binding and late binding combinations, which was behind that bug from hell in class inheritance. But more annoyingly, I cannot reproduce them in ideal testing conditions - wasting about two nights of my sleep in the process.

So I implore, beg and plead - please do not write code like this to avoid include_once, it just makes it slower, heavier on your memory footprint and increases cache lock contention. At least don't do it in the name of performance - I wrote this blog entry just because the guy who wrote fubar said "I didn't know it worked this way". There are a bunch of other such gotchas, which is currently my talk proposal for OSCON '07.

And just out of curiosity, I'm wondering whether an apc.always_include_once might help such code. But on the other hand I hate optimising for bad code, much cleaner to drop such files from cache - after all "they don't deserve the performance".

So, trust me when I say this ... leave include checks to the experts !

--
Too much is more than enough by definition.

posted at: 18:03 | path: /php

Tue, 27 Feb 2007:

After a roller coaster career, my first php extension has hit php cvs, with all the memory leaks settled, nearly all the features done and all promises kept. Hidef, as it was originally known, acquires a new and improved slogan "Constants for real" as well as a place in the PECL packages. If you want to install the module, just run this simple command and have the binary built & dropped into your php install - except for the extension=hidef.so line, which you still have to add yourself.

bash# pecl install "channel://pecl.php.net/hidef-0.1.0"

If this actually makes a significant difference in your code's performance, I'd say that you've done a wonderful job otherwise. Most of the code is stolen off a template Rasmus had and packaging was thanks to Pierre ... which leaves me with a distinct feeling of having put a few legos together - but to my credit, at least the pieces fit.

--
We demand rigidly defined areas of doubt and uncertainty!
                -- Vroomfondel, H2G2

posted at: 01:27 | path: /php

Mon, 26 Feb 2007:

APC released version 3.0.13. The last couple of months haven't produced too much code from me, so most of the changes in there are due to the efforts of shire & rasmus. But I've left a couple of booby traps in there for invalid free() calls, which should reduce a decent number of those random memory corruptions into a more decent error report.

I've been unsuccessful in making the Real World go away for long enough to actually rewrite the shm allocator - not even a patch job with a linked list, rather than the mythical lockless allocator I've been promising for three months.

I feel guilty, but there's so much to be done and I'm only ... *counts* ... one man.

--
After months of careful refrigeration, Debian 2.0 is finally cool enough to release.
          -- topic on #debian

posted at: 02:27 | path: /php

Sat, 03 Feb 2007:

After nearly a year of messing around with php extensions, I've finally sat down and written a full extension from scratch. I've used all the skeletons and ext_skel scripts, in the proper way to end up with a half-decent extension. It took me around 4 hours from an empty directory to end up with an extension which basically did what I wanted.

hidef: The define() call in php is slow. Previously the workaround to define a large chunk of constants was to use apc_load_constants, which pulled out stuff from the cache, but still had to define all constants for every one of the requests. Even beyond that the value replacement is at runtime, nearly as expensive as a $global['X']. A quick look with vld indicates the problem very clearly.

<?php
define('ANSWER', 42);
echo "The answer is ".ANSWER;
?>

line     #  op              operands
----------------------------------------------------
   2     0  SEND_VAL        'ANSWER'
         1  SEND_VAL        42
         2  DO_FCALL        'define', 0
   3     3  FETCH_CONSTANT  ~1, 'ANSWER'
         4  CONCAT          ~2, 'The+answer+is+', ~1
         5  ECHO            ~2

For a lot of code with a lot of defines(), this is a hell of a lot of CPU wasted just putting data in & reading it out, where a substitution would be much better. But first things first, I got a basic extension which would parse a .ini file and define the constant with some magic flags - this is what you'd put into the ini file.

[hidef]
float PIE = 3.14159;
int ANSWER = 42;

The extension reads this once when apache starts up and puts it into php's constants table. The constant is pushed in with the CONST_PERSISTENT flag which means that the constant lives across requests. Recently, Dmitry had put in a new bit into this mix - CONST_CT_SUBST which marks constants as candidates for compile time substitution.

After adding compile-time substitution into the extension code, the code generator replaces constants as & when it runs into them. And here's what the bytecode looks like.

<?php
echo "The answer is ".ANSWER;
?>

line     #  op      operands
--------------------------------------------
   2     0  CONCAT  ~0, 'The+answer+is+', 42
         1  ECHO    ~0

You don't need to be a genius to figure out which one would be faster. But the other gopal had done some benchmarks which didn't seem to show enough difference between constants and literals. So, I wrote a quick & dirty benchmark with 320 defines and adding them all up in the next line. Here are the before and after numbers.

                             Before     After
fetches/sec                  380.785    930.783
mean msecs/first-response    14.2647    6.30279

But the true significance of these few hundred lines of code fades a bit when you pull in APC into the mix. With APC enabled I was still expecting a significant difference in performance and here it is.

                             Before     After
fetches/sec                  976.29     1519.38
mean msecs/first-response    4.95603    3.15688

The numbers are seriously biased, because for most code the major bottleneck is their DB and therefore I/O bound. But if this small bit of code helps shave off a few microseconds of CPU time, for a few hours of my hacking time, it is pretty good when you consider the scale factor.

So, without further ado - here's hidef 0.0.1 - should build fine for both php5 and php4. And if you feel the urge to fix something in there or write documentation, go for it ! :)

--
If you don't know what procrastination is just look up the definition tomorrow.

posted at: 02:45 | path: /php

Wed, 25 Oct 2006:

Weak symbols are a poor man's version of linker land polymorphism. A weak symbol can be overridden by a strong symbol when a linker loads an executable or shared object. But in the absence of a strong sym, the weak version is used without any errors. And that's how it has worked for a long long time in ELF binary land.

But then dlopen() went and changed the rules. When you load a shared library with RTLD_GLOBAL, its symbols become available to all the other shared objects. And the libc rtld.c had the runtime magic required to make this happen (and the unloading was even harder).

Then one fine day, Weak Symbols were empowered. In dynamic shared objects (DSO), there was no difference between weak and strong. It has been so since the glibc-2.1.91 release.

Now let me backtrack to my original problem. Once upon a time, there was a php extension which used an apache function, ap_table_set() to be precise. But for the same php extension to be loadable (though not necessarily useful) inside a php command line executable - the external symbol had to be resolved. That's where a weak symbol proves itself invaluable - a libapstubs.a could be created with a weak ap_table_set, so that as long as the extension is run outside apache, the stub function will get called.
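A minimal sketch of such a stub, assuming GCC or Clang (the weak attribute is a compiler extension, not ISO C). The function host_table_set is a hypothetical stand-in with an invented signature, not apache's real ap_table_set(), and extension_set_header is likewise made up to play the part of the extension code.

```c
/*
 * Weak fallback definition. If the host program (or another object in
 * the same static link) provides a strong host_table_set, that one
 * wins and this stub is never used.
 */
__attribute__((weak)) int host_table_set(const char *key, const char *value)
{
    (void)key;
    (void)value;
    return -1;  /* stub: tells the caller that no host is present */
}

/* Extension code calls the symbol without caring who defined it. */
int extension_set_header(const char *key, const char *value)
{
    return host_table_set(key, value);
}
```

Built on its own, extension_set_header() returns -1 from the stub. Link the same object against a file with a strong host_table_set and the call goes there instead, which is the behaviour the static-link case relies on.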

But it wasn't working on linux (works on FreeBSD). And except for the extension, we weren't able to write a single test case which would show the problem. And then I ran into a neat little env variable - LD_DYNAMIC_WEAK. Just set it to 1 and the rtld relegates the weak symbols back into the shadows of the strong ones. But that raised a few other problems elsewhere and I personally was lost.

But now I know where exactly this went wrong. PHP was using a glibc dl flag called RTLD_DEEPBIND (introduced in zend.h,1.270). This seems to be a glibc feature and the flag bit is ignored by the FreeBSD libc - which was running the php module happily. As you can read in that mail, it looks up the local shared object before it starts looking in the global scope. Since libapstubs was a static .a object, the local scope of the ext .so did contain the dummy ap_table_set() and since glibc rtld was ignoring the weak flags, that function was called instead of the real apache one.

I'm perfectly aware that RTLD_DEEPBIND can save a large amount of lookup time for shared objects built with -fPIC, because of the PLT (we've met before). But if you are trying to use it to load random binaries (like extension modules), here's a gotcha you need to remember.

Now to get back to doing some real work :)

--
A wise person makes his own decisions, a weak one obeys public opinion.

posted at: 17:12 | path: /php

Tue, 05 Sep 2006:

APC released version 3.0.12. Because I've been sitting at home, close to a well stocked refrigerator and with no pool table in walkable distance, I've got a fair bit of work done on APC [1].

And some out of this world patches in the pipeline too.

--
Is it better to abide by the rules until they're changed or help speed the change by breaking them ?

posted at: 10:11 | path: /php

Tue, 22 Aug 2006:

A lot of people have been complaining about APC's stability issues. In fact, they get angrier when I mention that it works for Yahoo!. During the FIFA slams on the servers, we put in a few extra things in APC to make it withstand the hammering. Now, a couple of these protections were borrowed from code that Y! already had lying around and a few more of them were BSD specific. But the short story is that I can never push those changes to the open source version. Nor can I even rewrite the same features after reading Y! © code which does the same, at least not while I'm here.

*But*, one feature that we borrowed was discussed quite a while back on the php-internals mailing list. If someone among you thinks that they know enough to understand what this means and implement it under the php license, maybe it might be accepted as a patch to php.

All it needs is some elbow grease and a bit of unix magic :)

--
signal(i, SIG_DFL); /* crunch, crunch, crunch */
               -- Larry Wall in doarg.c from the perl source code

posted at: 05:27 | path: /php

Thu, 17 Aug 2006:

APC released version 3.0.11. I've been hunting the entire codebase for memory issues and even laid to rest the bug from hell.

Now, all that remains is for all new bug reports to come in.

--
Your parity check is overdrawn and you're out of cache.

posted at: 18:27 | path: /php

Fri, 09 Jun 2006:

After many trials and tribulations, Wikipedia is finally using APC. They've been playing around with Turck MMCache and other accelerators for a while. But currently APC is the only one with the ball as far as caching is concerned. Recently, somebody did a benchmark on the common accelerators used in php land - read it here. At that point APC just won hands down. My commit last night has probably pushed APC below eAccelerator, but that change is required to run properly on a multi-CPU apache under high load.

              hw.php  deserialize.php  include-pma.php
eAccelerator    1093              160               86
apc             1100              163               83
PHP alone        886              157               28

Now, the next heavy user of PHP around is sourceforge.net, which is apparently still using eAccelerator. APC still has a few chinks in its armour, but it is still *my* work. And much more importantly it, for once, doesn't appear doomed :)

--
Real programs don't eat cache.

posted at: 22:14 | path: /php

Thu, 08 Jun 2006:

I recently discovered an easy way to inspect php files. The xdebug folks have a module in their cvs called vld, which prints out the bytecode generated for php code. This lets me actually look at the bytecode generated for a particular php data or control structure. This extension is shining a bright light into the otherwise dark world of ZendEngine2.

Let me pick on my favourite example of mis-optimisation that people use in php land - the HereDoc. People use heredocs very heavily, in preference to the more mundane ways of putting strings in a file, like the double quoted world of the common man. Some php programmers even take special pride in the fact that they use heredocs rather than quoted strings. Most C programmers use it to represent multi-line strings, not realizing that php quoted strings can span lines.

<?php

echo <<<EOF
	Hello World
EOF;

?>

This generates some really ugly, under-optimised bytecode. Don't believe me? Just look at what the vld dump looks like.

line     #  op            ext  operands
-------------------------------------------------
   3     0  INIT_STRING        ~0
         1  ADD_STRING         ~0, ~0, '%09'
         2  ADD_STRING         ~0, ~0, 'Hello'
         3  ADD_STRING         ~0, ~0, '+'
         4  ADD_STRING         ~0, ~0, 'World'
   4     5  ADD_STRING         ~0, ~0, ''
         6  ECHO               ~0

That's right, every single word is separately appended to a new string and after all the appends with their corresponding reallocs, the string is echoed and thrown away. A really wasteful operation, right ? Well, it is unless you run it through APC's peephole add_string optimizer.
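To show the gist of what folding those opcodes buys, here is a rough sketch in C. It is not APC's optimizer, just the underlying idea: when every piece is a constant known up front, the whole run of ADD_STRINGs can be replaced by one pre-built string with a single allocation. The function name fold_add_strings is invented for this example.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Concatenates count constant pieces into one freshly allocated string.
 * One malloc replaces the chain of appends and reallocs.
 * Returns NULL if allocation fails; the caller frees the result.
 */
char *fold_add_strings(const char *const *parts, size_t count)
{
    size_t total = 0, i;
    char *out, *cursor;

    for (i = 0; i < count; i++)
        total += strlen(parts[i]);

    out = malloc(total + 1);
    if (out == NULL)
        return NULL;

    cursor = out;
    for (i = 0; i < count; i++) {
        size_t len = strlen(parts[i]);
        memcpy(cursor, parts[i], len);
        cursor += len;
    }
    *cursor = '\0';
    return out;
}
```

Feeding it the five operands from the dump above (a tab, 'Hello', a space, 'World' and the empty string) gives back the single constant that a folded ECHO could use directly.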

Or the other misleading item in the arsenal, constant arrays. I see hundreds of people use php arrays in include files to speed up the code, which does indeed work. But a closer look at the array code shows a few chinks which can actually be fixed in APC land.

<?php

$a = array("x" => "z",
           "a" => "b",
           "b" => "c",
           "c" => "d");
?>

This generates the following highly obvious result. Though it must be said that these are hardly different from what most other VMs store in bytecode, those VMs are limited by the fact that they have to actually write the code (minus pointers) to a file. But Zend is completely in memory and could've had a memory organization for these arrays (which would've segv'd apc months before I ran into the default array args issue).

line     #  op                      ext  operands
-----------------------------------------------------------
   2     0  INIT_ARRAY                   ~0, 'z', 'x'
   3     1  ADD_ARRAY_ELEMENT            ~0, 'b', 'a'
   4     2  ADD_ARRAY_ELEMENT            ~0, 'c', 'b'
   5     3  ADD_ARRAY_ELEMENT            ~0, 'd', 'c'
         4  ASSIGN                       !0, ~0
   7     5  RETURN                       1
         6  ZEND_HANDLE_EXCEPTION

This still isn't optimized by APC and I think I'll do it sometime soon. After all, I just need to virtually execute the array additions and cache the resulting hash as the operand of the assign instead of going through this stupidity every time it is executed.

Like rhysw said, "Make it work, then make it work better".

--
Organizations can grow faster than their brains can manage them.
                    -- The Brontosaurus Principle

posted at: 16:22 | path: /php

Thu, 01 Jun 2006:

Some time in late 2002, I got to see a clear picture of what interpreter optimisation is all about. While I only wrote a small paragraph of the Design of the Portable.net Interpreter, I got a good look at some of the design decisions that went into pnet. The history of the CVM engine aside, more recently I started looking into the PHP5 engine's core interpreter loop. And believe me, it wasn't written with raw performance in mind.

The VM doesn't go either the register VM or stack VM way, thereby throwing away years of optimisations which have gone into either. The opcode parameters are passed between opcodes in the ->result entry of each opcode, which is then used as the op1 or op2 of the next opcode. You can literally see the tree of operations in this data structure. As much as it is good for data clarity, it means that every time I add two numbers, I write to a memory location somewhere. For example, I cannot persist some data in a register and have it picked up by the latter opcode - which is pretty easy to do with a stack VM.

Neither did I see any concept called verifiability, which means that I cannot predict output types or make any assumptions about them either. For example, the following is code for the add operation.

ZEND_ADD_SPEC_VAR_VAR_HANDLER:
{
    zend_op *opline = EX(opline);
    zend_free_op free_op1, free_op2;

    add_function(&EX_T(opline->result.u.var).tmp_var,
        _get_zval_ptr_var(&opline->op1, EX(Ts), &free_op1 TSRMLS_CC),
        _get_zval_ptr_var(&opline->op2, EX(Ts), &free_op2 TSRMLS_CC) TSRMLS_CC);
    if (free_op1.var) {zval_ptr_dtor(&free_op1.var);};
    if (free_op2.var) {zval_ptr_dtor(&free_op2.var);};
    ZEND_VM_NEXT_OPCODE();
}

Since we have no idea what type of zval is contained in the operands, the code has to do a set of conversions to numbers. All these operations basically involve a conditional jump somewhere (aka if), which is what we're supposed to be avoiding to speed things up.
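To make the cost concrete, here is a toy tagged value and an add over it, in C. This is a sketch of the general problem, not Zend's add_function(): the val type and every val_* helper are hypothetical, and only two types are modelled where the real engine has to cope with many more.

```c
/* A toy dynamically typed value: a type tag plus a union payload. */
typedef enum { VAL_LONG, VAL_DOUBLE } val_type;

typedef struct {
    val_type type;
    union { long l; double d; } u;
} val;

val val_long(long l)     { val v; v.type = VAL_LONG;   v.u.l = l; return v; }
val val_double(double d) { val v; v.type = VAL_DOUBLE; v.u.d = d; return v; }

/* Conversion to a number: already one branch per operand. */
double val_as_double(const val *v)
{
    return v->type == VAL_DOUBLE ? v->u.d : (double)v->u.l;
}

/*
 * Untyped add: the handler cannot know the operand types in advance,
 * so it has to test the tags on every single execution.
 */
val val_add(const val *a, const val *b)
{
    if (a->type == VAL_LONG && b->type == VAL_LONG)
        return val_long(a->u.l + b->u.l);  /* overflow ignored in this sketch */
    return val_double(val_as_double(a) + val_as_double(b));
}
```

If the types were verifiable ahead of time, the compiler could emit a long-only add with no tag tests at all. Without that, even this two-type toy pays for several conditional jumps per addition.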

Neither could I registerify variables easily, because there was a stupid CALL based VM (which is flexible enough to do some weird hacks by replacing opcodes) which throws away all variables in every scope. That's some serious stack space churn, which I can't force anyone to re-do. At least, not yet. So in spite of having a CGOTO core, there was hardly anything I could do without breaking the CALL core codebase.
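For anyone who hasn't met the term, a CGOTO core dispatches with computed gotos: each handler ends by jumping straight to the next handler's label, inside one big function. A minimal sketch follows, assuming GCC or Clang, since labels-as-values is a compiler extension and not ISO C. The three toy opcodes and run_threaded are made up for illustration and have nothing to do with Zend's real opcode set.

```c
/* Toy instruction set for the sketch. */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/*
 * Runs a byte-coded program over a single accumulator and returns it.
 * The program must end with OP_HALT.
 */
int run_threaded(const unsigned char *pc)
{
    /* One label address per opcode, indexed by opcode number. */
    static void *const handlers[] = { &&op_inc, &&op_double, &&op_halt };
    int acc = 0;  /* lives in this one frame, so it can stay in a register */

/* Fetch the next opcode and jump directly to its handler. */
#define DISPATCH() goto *handlers[*pc++]

    DISPATCH();

op_inc:
    acc += 1;
    DISPATCH();

op_double:
    acc *= 2;
    DISPATCH();

op_halt:
#undef DISPATCH
    return acc;
}
```

Because every handler shares one stack frame, locals like acc survive across opcodes. A CALL style core runs each handler as a separate function, so that kind of state has to go back to memory between opcodes.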

Basically, after I'd exhausted all my usual bag of tricks I looked a little closer at the assembly thrown out by the compiler. Maybe there was something that wasn't quite obvious happening in the engine.

.L1031:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194
.L203:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194
....
.L1194:
    jmp *-2748(%ebp)

As you can clearly see, the jump target is yet another jump instruction. For a pipelined CPU that's really bad news, especially when the jump is so far off. So I wrote up some assembly to remove the double jump and convert it into a single one.

#ifdef __i386__
#define ZEND_VM_CONTINUE() do { __asm__ __volatile__ (\
        "jmp *%0" \
        :: "r" (EX(opline)->handler) ); \
    /* just to fool the compiler */ \
    goto * ((void **)(EX(opline)->handler)); } while(0)
#else
#define ZEND_VM_CONTINUE() goto *(void**)(EX(opline)->handler)
#endif

So in i386 land the jump is assembly code and marked volatile so that it will not be optimised or rearranged to be more "efficient".

.L1031:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
#APP
    jmp *%eax
#NO_APP
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194
.L203:
    leal    -72(%ebp), %eax
    addl    $76, (%eax)
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
#APP
    jmp *%eax
#NO_APP
    movl    -72(%ebp), %eax
    movl    (%eax), %eax
    movl    %eax, -2748(%ebp)
    jmp .L1194

The compiler requires a goto to actually realize it has to flush all the stack params inside the scope. I learnt that fact a long time ago while trying to do the same for dotgnu's amd64 unroller. Anyway, let's look at the numbers.

           
Before:                     After:
simple             0.579    simple             0.482
simplecall         0.759    simplecall         0.692
simpleucall        1.193    simpleucall        1.111
simpleudcall       1.409    simpleudcall       1.320
mandel             2.034    mandel             1.830
mandel2            2.551    mandel2            2.227
ackermann(7)       1.438    ackermann(7)       1.638
ary(50000)         0.100    ary(50000)         0.097
ary2(50000)        0.080    ary2(50000)        0.080
ary3(2000)         1.051    ary3(2000)         1.024
fibo(30)           3.914    fibo(30)           3.383
hash1(50000)       0.185    hash1(50000)       0.182
hash2(500)         0.209    hash2(500)         0.198
heapsort(20000)    0.616    heapsort(20000)    0.580
matrix(20)         0.500    matrix(20)         0.481
nestedloop(12)     0.953    nestedloop(12)     0.855
sieve(30)          0.499    sieve(30)          0.494
strcat(200000)     0.079    strcat(200000)     0.074
------------------------    ------------------------
Total             18.149    Total             16.750

This is in comparison to the default php5 core which takes a pathetic 23.583 to complete the tests. But there's more to the story. If you look carefully, you'll notice that there's a register indirection just before the jump. But x86 does support an indirect indexed jump with a zero index.

   __asm__ __volatile__ ("jmp *(%0)" :: "r" (&(EX(opline)->handler)));

That generates a nice jmp *(%eax), which is perfect for my purpose. Except that, as far as I can see in the assembly, the above fix didn't really do much for performance. For example, look at the following code :-

    leal    -72(%ebp), %eax
    addl    $76, (%eax)
#APP
    nop
#NO_APP
    movl    -72(%ebp), %eax
#APP
    jmp *(%eax)
#NO_APP

The EAX loader between the two custom asm statements is what I was trying to avoid. But the variable is re-loaded again from stack because there is no register variable cache for the handler pointer. One way around that is to do what pnet did: keep your PC (equiv of the handler var) in a register, preferably EBX, and use it directly. The separation between operands (stack) and operator (handler) makes it hard to optimize both in one go. The opline contains both together, making it really really hard to properly speed up.

But there's this thing about me - I'm lazy.

--
Captain, we have lost entire left hamster section.
Now, pedal faster.

posted at: 17:55 | path: /php

Thu, 18 May 2006:

In php5, static variables in functions behave a little differently when the functions are member functions of a class. The problems start when inheritance starts copying functions out into the child class's function table hashes. For instance, consider the following bit of code :

<?php
    class A {
        function f()
        {
            static $a = 10;
            $a++;
            echo __CLASS__.": $a\n";
        }
    }
    class B extends A { }
    
    $a = new A();
    $b = new B();
    $a->f();
    $b->f();
?>

Now, I'd assumed that it would obviously produce 11 12 as the output. I sort of ran into this while messing around with the zend_op_array reference counting code. The static members are killed off before the reference is checked.

gopal@knockturn:/tmp$ /opt/php5/bin/php -q test.php 
A: 11
A: 11

I definitely was slightly freaked and wrote up an almost identical bit of C++ code to just test out my preconceptions about static variables.

#include <stdio.h>

class A 
{
public:
    void f() {
        static int a = 10;
        a++;
        printf("A: %d\n", a);
    }
};

class B : public A {};

int main()
{
    A a;
    B b;
    a.f();
    b.f();
}

But static variables inside a function in C++ behave identically whether the function is declared inside a class or outside. There seems to be no special handling for member functions, unlike what php shows.

gopal@knockturn:/tmp$ ./static-test 
A: 11
A: 12

I am not enough of a dynamic language geek to decide which one's the right way to do it. Actually if I really had my way, there wouldn't be any static variables in functions at all. They're actually too much like global variables in terms of life-time.

Anyway, using a class static should make it behave like C++.

--
In theory, there is no difference between theory and practice.
In practice, there is.

posted at: 02:01 | path: /php

Thu, 30 Mar 2006:

I've come to love php. No, not really - but some of their fan generated advertisements literally kill me. To quote Rasmus about what's happening out there :-

Being an open source project, we don't have an HR nor a marketing
department to bug us about political correctness. Nobody can get 
fired. Anybody can do whatever they want.

There are many such pics that are floating around the web with PHP splashed all over. There are a lot of other projects like firefox which have outrageous posters. Amidst all that, here's one that quite caught my attention - feel free to guess why.

Projects aren't really about code, they are about communities. Technical issues get solved in commercial companies as well, but FOSS communities take the rest of the being together, building together thing to the next level. Basically, writing code isn't the only way to have fun.

Do what's fun.

--
If God had meant for us to be naked, we would have been born that way.

posted at: 07:03 | path: /fun

Sun, 12 Mar 2006:

So there, I go from nobody to being lead developer of APC - it's official, there's no escape. As the commit message clearly says - the folks I have tricked into helping out. By the way, APC 3.0.10 was just released a few hours ago. Imagine two releases in the space of barely 7 days - 3.0.9 was on the 4th.

Anyway, I tried to commit something this morning. Basically, it is a fix to the default arg array problems that I've run into twice. I still don't have a reliable test case, but a long-drawn chase with gdb showed which data structure was actually segv'ing. The zval in the constant array was being pulled around and modified in php engine land. Somewhere, the multiple modifications of the shared memory with no locks were ending up in an inconsistent state and the whole shebang went for a toss. The fix was to just chuck the dangerous bits into local memory and let the engine do what it wants.

**** Access denied: insufficient karma (gopalv|pecl/apc)
cvs commit: Pre-commit check failed

Anyway, that was quickly resolved on irc and I got some karma (whatever that really means) and I was able to push in a huge merged patch (+832, -316) from HEAD into INH_FIX branch of apc. Thus, my first commit rolls into CVS - #5423. And hopefully that should break a few things here and there - can't make an omelette without breaking eggs.

--
"But the important thing is persistence."
       -- Calvin trying to juggle eggs

posted at: 11:22 | path: /php

Fri, 10 Mar 2006:

Last night was an all-nighter. I stayed up to hack out some javascript code for yahoo!. In the middle of all that, something new came up - bug #7070. You can read the bug report or you could see what happened on IRC. All this is leading up to something very important, at least to me, so read on.

<edink>   Rasmus: commenting out my_fetch_global_vars() and having auto_globals_jit = off 
          makes apc work on windows
<Rasmus>  edink: could you add that to the test case?
<Rasmus>  I'm busy breaking apc further
<edink>   I'll add comment to #7070
<Rasmus>  thanks

<Rasmus>  g0pz: edink updated bug 7070
<edink>   g0pz: seems that calling zend_is_auto_global() with any value from apc_copy_function_for_execution() 
          crashes the thing on windows

Ok, so I had a good long long look at the code and started guessing what went wrong. There's one thing I still don't understand about Zend engine - how the TSRM stuff works. So following the path of ancestors, who relied on the dark and mysterious and of course, mostly unknown powers of evil to explain bad things happening to good people, I too blamed the unknown.

<g0pz>    edink: reall weird
<edink>   g0pz: yeah
<g0pz>    has something to do with tsrm ?
<edink>   g0pz: i cannot tell if its tsrm related
<g0pz>    because that looks like a bad address there in the tsrm ptr ?
<g0pz>    0x00d5b3f6 seems to be a little on the low side 
<g0pz>    sort of makes sense
<g0pz>    as the   apc_copy_function_for_execution_ex is passed as ht_copy_fun_t to copy_hashtable
<g0pz>    which just calls the apc_copy_function_for_execution_ex with 4 args
<g0pz>    apc_compile.c:926 needs to be fixed to pass the thread safety macros ?
<g0pz>    *but* I cannot test anything I fix and neither do I have any idea what any TSRM macro means
<g0pz>    so help !!! :)
<Rasmus>  TSRM just wraps all the globals in a struct
<edink>   g0pz: its just passing void ***tsrm_ls around
<g0pz>    so just a TSRMLS_FETCH() in scope is enough ?
<SaraMG>  g0pz: You don't need to understand TSRM.... TSRM understands you
<g0pz>    SaraMG: in soviet russia ...
<SaraMG>  >exactly<
<SaraMG>  Now you're getting it
<g0pz>    edink: as much as I'd like to help you, this thing needs a professional :)
<g0pz>    ok, here's how you fix it :)
<Rasmus>  ctrl-alt-del <insert Ubuntu cd>
<g0pz>    remove the TSRMLS_DC in apc_copy_function_for_execution_ex 
<edink>   so your func arglist should have TSRMLS_D (no other args) or TSRMLS_DC (other args) in function definition and
          TSRMLS_C or TSRMLS_CC when calling it
<g0pz>    and add a TSRMLS_FETCH(); as the first statment in that function
<g0pz>    now rebuild and hope it works
<SaraMG>  *ick*
<edink>   g0pz: TSRMLS_FETCH(); cannot be used if you have TSRMLS_D(C) in function declaration
<g0pz>    according to a significant proportion of my brain cells, that is how that could be fixed :)
<g0pz>    remove the declaration 
<g0pz>    you're anyway passing stack junk there
<g0pz>    the pointer you got was the apc_php_malloc in place of tsrm_ls
<SaraMG>  Ah, yes
<SaraMG>  Didn't realize that proto had to conform to a callback definition
<g0pz>    SaraMG: the shocking part is that it doesn't
<SaraMG>  The callback typedef being (Bucket*,va_list)
<g0pz>    that's the check, if I'm not wrong ?
<SaraMG>  (void*, void*, apc_malloc_t, apc_free_t)
<Rasmus>  hey hey, no peeking under the skirts unless you are going to dig in and fix stuff
<SaraMG>  apc_copy_function_for_execution_ex looks NOTHING like the callback's typdef
<SaraMG>  Like, not even close
<Rasmus>  details ;)
<edink>   :)

So, finally I still need to get the other guy to build and test it. Of course, the correctness of the patch has been verified in theory - it was still up to someone to figure out whether that was the only problem in the mix.

<g0pz>    edink: don't just stand there, make the changes and rebuild :)
<edink>   g0pz: made too many changes to my sources :)
<g0pz>    this is just one more :)
<SaraMG>  g0pz: So yeah, nix the _DC, use _FETCH, but also add some dummies to that declaration
          so it fits the calling semantics
<g0pz>    SaraMG: I haven't got commits
<SaraMG>  oh.... who are you again?
<edink>   Rasmus: just make him an accout :)
<Rasmus>  gah, just fill in your username and password
<Rasmus>  and garbage in the description field
<Rasmus>  those warnings don't apply to people who understand the guts of the engine
<g0pz>    tinker with != understand 
<Rasmus>  close enough

As usual, we get into optimisations and all that... before it is actually tested.

<edink>   g0pz: are you not checking every var if it's an autoglobal?
<g0pz>    edink: if there's a better way, I'd love to know about it :)
<g0pz>    because that's what the fetch_simple_variable_ex in zend_compile.c does
<g0pz>    of course, I could optimize easily
<g0pz>    with an if(name[0] == '_') :)
<edink>   g0pz: is the sole purpose of it to load superglobals when jit is on?
<g0pz>    yes

Finally, the bug is closed - in less than a couple of hours after I saw the bug report.

<g0pz>    edink: with that fix, does APC work ?
<edink>   g0pz: yeah, like a charm
<edink>   let's see what Rasmus broke :)
<edink>   well, it compiles :)
<Rasmus>  everything most likely

End result was that I got commit access to PHP CVS. I am now gopalv of the php - Resistance is Futile. I am yet to be marked as the maintainer of anything, so I'm still in the zone where everything's convenient but nothing really bugs you. Haven't checked in anything yet. That's for a day when I am actually sane and not hopped up on coffee.

Look ma, I'm a php dev :)

--
It is not doing the thing we like to do, but liking the thing we have to do, that makes life blessed.
                   -- Goethe

posted at: 11:22 | path: /php

Fri, 24 Feb 2006:

I was reading through the php5 parser yesterday and something just hit me. The grammar author had created a static_scalar non-terminal which incidentally expands into a T_ARRAY '(' static_array_pair_list ')', which is wrong (IMHO, of course). The real problem was that whoever wrote the rest of the code assumed that a static_scalar would be a real scalar, and it ended up with a parser which would parse the following bad code and even execute it.

<?php

function f( $a = array( 
                     array('x') => array('y')
          ))
{
    print_r($a);
}

f();
?>

Apparently andrei already tried to fix it last night (1.168) - but I don't think that really fixes it. The right way to fix it would probably be to change the static_scalar expansion to exclude arrays and add a static_variable which has scalars and vectors. Then walk the whole codebase and replace as appropriate. Lot of work and lot of validation needed - which is exactly why I didn't submit a patch ;)

Still, php's the best web templating language around.

--
I never made a mistake in my life. I thought I did once, but I was wrong.

posted at: 15:23 | path: /hacks