Waystination: YAPC 2005 Day 3

As before, these are my raw notes on the talks I attended at YAPC, day three. Perhaps tomorrow I'll summarize YAPC in a readable, pithy little journal entry.

Where did My Memory Go?

A useful tool for profiling memory use is Devel::Size, which offers both size() and total_size() methods. One does, and one does not, chase references to include their size in the total. If you hand it a reference, it knows to start "one level down" in reporting. It also knows enough not to follow circular references forever. Presenter Dan Sugalski proceeded to demonstrate the toolkit by finding out how large some scalars, arrays and hashes are. If you want to know, this is left as an exercise for the reader.

One note: "foreach (<FILE>)" uses more memory than "while (<FILE>)". Note that assigning a filehandle read to a scalar causes the entire file to be slurped into memory temporarily, so don't do that. (Double check that I caught the scenario correctly.)

Concerning garbage collection, recall that file-scoped lexicals are basically never cleaned up. The GC clears variables when they pass out of scope. If you have something very large, you should undef it using the undef function, not by assigning undef to the variable. If you assign an empty string to a scalar, for example, the storage used by that string is not cleared, in case it's needed. On the other hand, if your variable is going to end up big again, you might want to keep it around without deallocating it--profile, don't speculate. Generally, it's a pain to undef things, but you really should use it when your structure is hogging lots of memory.

Watch out for circular references, since things are only garbage collected when the reference count drops to zero. Either break the reference chain manually, or use weak references. Ise Scalar::Utils to make weak references with the weaken() method. Do note that perl will immediately garbage collect a structure if every reference to it is weak.

Also note that every version of Perl leaks memory. At least, you should use the latest version. Before 5.8, closures leaked. Parameters passed to new threads used to leak. Modifying @_ resulted in leaks before 5.6. Before 5.8.6, lots of ithread shared variables leaked.

The following will determine how much memory is being used by Perl in total.

        $Devel::Size::warn = 0;
        $size = total_size(*::);

You can also use Devel::LeakTrace to find some cases of unfreed variables. It tends to be whiny about globals, which makes it tricky to find the real leaks. It also slows runtime down considerably, so you should only use it in debugging.

Lazy Test Development

Joe McMahon promises to tell us about not only lazy test development, but also the necessary evils incurred in doing that. A common situation is stepping through the debugger to locate a problem, which is also a good time to think about creating a test. Joe created a module, Devel::TestEmbed, that pulls Test::More into the debugger, so you can create tests while debugging. An additional method allows you to save the tests when you're happy.

Building the module involved fiddling with the debugger, which isn't easy to change. It's about nine KLOC. Patching the debugger implies ongoing maintenance. Fortunately, the debugger offers an external interface that lets us write extensions without touching the debugger itself. There are some resources:

        * A .perldb file, included with a "do" into the debugger.
        * afterinit() is called right before the debugger prompt is first printed.
        * watchfunction() lives right inside the debugger's command loop, and is called before each prompt is printed.
        * @DB::typeahead allows you to stuff commands into the buffer
        * @DB::hist lets you look at prior commands.
        * Debugger's eval behavior can also be exploited: unrecognized commands are evaled.

Putting these together, a .perldb module is written that defines watchfunction() and afterinit(), and sets a magical $DB::trace to enable the watchfunction. The afterinit() stacks the "use Test::More qw(no_plan)" into the command buffer, so you don't have to type it. This was necessary to get the Test::More methods into the current namespace--if it were in the .perldb, it would import test methods into its namespace. The watchfunction dynamically imports tdump() into the current namespace in the program being debugged (so it follows you no matter where you are in the program). That's all watchfunction can do, because it runs outside the debugger's command loop.

Portable Perl - how to write code that will work everywhere

Ivor Williams started by mentioning some common misconceptions: although Perl is portable, a perl app may not be. Even if your app doesn't use XS, it may not be portable. At minimum, a motivation for writing portable Perl is that CPAN modules should be written portably. Exceptions usually have their own namespace, such as Win32::.

        * Be lazy--use existing portable modules whenever possible
        * Modularize--make plugins that wrap OS-specific stuff you can't avoid
        * Follow the rules in perldoc and perlport (on which this talk is based)

Filesystem Issues

The obvious portability issue in Perl is filenames. Luckily, you can completely ignore this, because POSIX paths work on Windows and VMS. The only problem with this is that there's no provision for a "volume" or "device" specifier in a path. There are also variations in allowed character sets. There is also the issue of case sensitivity.

The alternative to POSIX is to use native syntax. $^0 will tell you the OS name, so you can do what you must. But you can use the File::Spec module that handles this for you. This is OO, but you can also use File::Spec::Functions to import functions into the namespace.

Other issues include file permissions, which vary per OS, and symbolic/hard links, which are supported differently or not at all on some platforms.

Specifically in VMS, files are stored according to their version. That's why some modules say "while (unlink 'Foo')" for this reason.

Specifically on the Mac, files have resource forks. Ivor is too chicken to talk about them any further.

Invoking the Shell

Just don't do it. Commands vary between platforms, so invoking shell commands won't work portably. Shell globs will also be handled differently per OS. Environment variables can't be relied on either, such as HOME, TERM, SHELL, USER, etc. Even PATH isn't always set.

User Interaction

A script might be started with file descriptors redirected. If you need to interact with the user, you can't count on STDIN, STDOUT and STDERR. Reading from "/dev/tty" is very not portable. There's a better way specified in perlfaq 8. You can use Term::ReadKey for this purpose, though it doesn't successfully disable echo on Windows. A combination of Term::ReadKey and Term::ReadLine can be used to do the trick. Note that Term::ReadLine is a wrapper around either Term::ReadLine::Gnu or Term::ReadLine::Perl. The latter is included in the "CPAN bundle" that the cpan command installs for you.

Communications

Sharing files between machines with possibly different architectures, or communicating over the network, present portability challenges. Complying with some standard is helpful. Line-ending conventions are one example of this. Perl translates "\n" to be correct for the platform on which the script is running, which may not be the convention of the other endpoint.

Sadly you should use binmode for anything that isn't known to be an ASCII file, for portability. It matters on some platforms (though not on UNIX). This will also affect character counts depending on line conventions and applicable character set.

Endianness is an issue as well. Pack and unpack have a "network standard" format, specified with 'n' and 'N', which should be used.

Multitasking and Job Control

Beware of forks and threads, non-blocking I/O, etc. A portable multitasking package like POE should be used instead.

Perl Blue Magic - Creating Perl Modules and Using the CPAN

Famed comedian José Castro returns to the limelight for this talk. CPAN has over 5,000 modules, and over 2,000 active developers. There are ~200 developers with more than 5 modules.

First tip: PICK GOOD NAMES FOR YOUR MODULES! Nobody will use it if they can't find it! Here José gave a few humorous examples of useless and/or strange module names.

A module has lots of junk inside, but you don't have to make it yourself. You can use h2xs and other modules for this purpose. Just do one of these here:

        h2xs -XAn My::New::Module

It creates most of what you need, excluding a License or TODO file. There are other issues with h2xs modules. That's why José recommends Module::Starter instead:

        module-starter --module=My::New::Module --author="Me, Myself" --email="me@me.com"

You can also use ExtUtils::ModuleMaker. It prompts you through the creation process. You can also get help on module-authors@perl.org and modules@perl.org. But whatever you use, you should:

        * Have an idea - make sure it isn't already done
        * Document it - make sure you know how you plan to do it
        * Write tests
        * Code
        * Test
        * Ship
        * Maintain

Documentation should contain the following stuff: name; synopsis; other things generally provided in the template by the above-mentioned utilities. Make darn sure you include acknowledgements! Note that if the version number contains an underscore, CPAN marks it as a developers' version.

If you ask for a PAUSE ID, and are rejected, resubmit your application. The guy that handles the apps sometimes forgets.

Modules are "registered" when someone associated with CPAN decides they're "good". Given a choice, pick the registered modules.

Don't submit more modules than you can maintain.

Lightning Talks

This is a series of five-minute talks. The pace is supposed to be fast, so my notes will be pretty skeletal.

Five Development Tools I Can't Live Without
Casey West

        * SQL::Translator can convert schemas from one DB format to another. It can also create a diagram
        * HTTP::Server::Simple::Static provides a tiny web server without Apache
        * Devel::Cover for coverage analysis
        * podwebserver an index of the stuff you have installed
        * Perl::Tidy to neaten perl code
        * Test::Pod::Coverage
        * DBI::Shell
        * Module::Refresh refreshes changed modules in a running script
        * CPAN::Mini to provide a local copy of the latest CPAN modules

Refactoring Web Applications

        * What? Refactoring
        * When? Before adding a feature
        * Why? To simplify feature additions
        * How? WWW::Mechanize and AT

DBD::SQLLite Intro
Zachary Zebrowski

SQLLite is a small OSS DB in a single executable. It is ACID compliant and supports up to two terabytes of data. It has bindings for multiple languages and stores the entire DB in a single file. The file can be moved across platforms and still work. It's handy for rapid prototyping, local tests, etc. On the flip side, it's a single-user DB and isn't networked.

Thirty Seconds or Less
Larry something-or-other

Couldn't get co-workers to learn Wiki markup. He responded to this situation by adding even more markup to his preferred Wiki.

OpenGuides

An application for developing collaborative City guides in a wiki-like way.

Credit Cards

There are new security rules that apply to all merchants. Printed, they're an inch and a half thick. Compliance can be costly. Fines for non-compliance are also heavy.

CPAN Envy
Casey West(?)

Schwern tried to convince people to create a Javascript repository. CW decided to go ahead and create it.

Regexp::Compost
Paul Shields

A perl module that takes as its input a text and produces a list of regexes that match that (and similar) texts. It exhibits an interesting heuristic for trying to create a regex that "fuzzily" matches a sample corpus of texts.

What Has Meng Been Up To Lately?
Meng Weng Wong

Two years ago SPF was born at YAPC in FL. Microsoft decided to "embrace it and extend it to Sender ID, which will roll out in Hotmail and Outlook. On the subject of DK, Meng tried to throw FUD in the air by pointing out that PGP and S/MIME "didn't work". He's also fooling with methods of implementing "collaborative blacklists" called "Karma". From there he went on to describe something that boils down to IM2000.

Perlcast

A podcast dedicated to Perl. Go listen if you're interested.

Annotating CPAN

The idea is to permit users to annotate packages, particularly where they thing there's a gap in the documentation.

A Mail Server in Perl
Matt Sergeant

Matt maintains QPSMTPD. It reputedly handles ~1M messages per day on some hosts.The author bills it as mod_perl for email.

Waystination

Thursday, June 30, 2005

YAPC 2005 Day 3

No comments: