automated installations (3)

I got a reply from the debian-boot mailing list confirming that partman-auto indeed has a bug. So I thought about it and decided to try to find this bug instead of
going ahead and writing my own partitioner, for 2 reasons. First, if the bug is fixed, I don't need to write custom software, and the current software will be
maintained without me having to put effort into it. Second, if this bug is solved, maybe others will benefit from it too.

I must say, while I'm trying to understand this code, I'm learning a few things about shell programming. I'm also noticing the hard way why global variables
are a bad thing.

I started today's journey by looking at "decode_recipe()". What it does is read a recipe file, parse it and return everything in (global) shell variables. More
precisely:


decode_recipe () {
local ram line word min factor max fs -
unnamed=$(($unnamed + 1))
ram=$(grep ^Mem: /proc/meminfo | { read x y z; echo $y; }) # in bytes
if [ -z "$ram" ]; then
ram=$(grep ^MemTotal: /proc/meminfo | { read x y z; echo $y; })000
fi
ram=$(expr 0000000"$ram" : '0*\(..*\)......$') # convert to megabytes
name="Unnamed.${unnamed}"
scheme=''
line=''


In this initialisation, it fetches the amount of RAM you have from /proc/meminfo. After the "expr" line, "$ram" contains the number of megabytes of RAM you have (counted as 1000000 bytes each, since the last six digits are simply cut off).
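To see what the "expr" line does in isolation, here is a minimal sketch. The function name "to_megabytes" is made up for this example and is not part of partman-auto; only the "expr" pattern is taken from the code above.

```shell
# Hypothetical helper: drop the last six digits of a byte count,
# the same way decode_recipe() does.
to_megabytes () {
    # The seven leading zeros guarantee there are always six digits to cut;
    # the 0* in the pattern strips any unused zeros off again.
    expr 0000000"$1" : '0*\(..*\)......$'
}

to_megabytes 5368709120   # prints 5368
to_megabytes 536870912    # prints 536
```

Note that this truncates instead of rounding, and that a "megabyte" here is 1000000 bytes.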


for word in $(cat $1); do


Now we start processing each "word" in the file passed as the first argument.
This thing is a state machine. "$word" will contain the last word read, while "$line" will contain the line built up so far.


case $word in
:)
name=$line
line=''
;;


Everything accumulated in "$line" so far is stored in "$name" when a ":" is encountered. For example "my little recipe : ..." will store "my little recipe" in "$name".


::)
db_metaget $line description
if [ "$RET" ]; then
name=$RET
else
name="Unnamed.${unnamed}"
fi
line=''
;;


If a "::" is encountered, the name is fetched from the debconf database. The "db_metaget" command fetches the description of the entry named in "$line".


.)


Now we get to the interesting part. When a "." is encountered, an entire partition definition has been read and is stored in "$line".


# we correct errors in order not to crash parted_server
set -- $line


This is an interesting new thing I found :) The "set --" command splits the contents of "$line" into words and makes its parts available as "$1", "$2", "$3", etc.
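A small sketch of that splitting trick, outside of partman. The function name "fs_of" and the recipe line are made up for the example.

```shell
# Hypothetical helper: print the 4th word of a string.
fs_of () {
    set -- $1     # unquoted on purpose: the shell splits the string on $IFS
                  # and replaces the function's own $1, $2, ... with the words
    echo "$4"
}

fs_of '64 512 300% linux-swap method{ swap } format{ }'   # prints linux-swap
```

Because "$line" is left unquoted, the splitting follows whatever "$IFS" is at that moment, which is why the IFS handling further down matters.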


if expr "$1" : '[0-9][0-9]*$' >/dev/null; then
min=$1
elif expr "$1" : '[0-9][0-9]*%$' >/dev/null; then
min=$(($ram * ${1%?} / 100))
else # error
min=2200000000 # there is no so big storage device jet
fi


The first argument is the minimum size. If it's not a number and not a percentage of RAM, it's set to 2 200 000 000 MB, which is 2200 TB (2.2 PB), large enough if you
ask me.
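The number-or-percentage test can be tried on its own. This is a sketch under assumptions: "parse_size" is a made-up name, and "$ram" is simply set to 512 instead of being read from /proc/meminfo.

```shell
ram=512   # assumed amount of RAM in MB, just for this example

# Hypothetical helper mirroring the "min" logic of decode_recipe().
parse_size () {
    if expr "$1" : '[0-9][0-9]*$' >/dev/null; then
        echo "$1"                        # plain number: taken as-is
    elif expr "$1" : '[0-9][0-9]*%$' >/dev/null; then
        echo $(($ram * ${1%?} / 100))    # ${1%?} strips the trailing "%"
    else
        echo 2200000000                  # invalid input
    fi
}

parse_size 64      # prints 64
parse_size 300%    # prints 1536
parse_size x       # prints 2200000000
```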


if expr "$2" : '[0-9][0-9]*%$' >/dev/null; then
factor=$(($ram * ${2%?} / 100))
elif expr "$2" : '[0-9][0-9]*$' >/dev/null; then
factor=$2
else # error
factor=$min # do not enlarge the partition
fi


The second argument is the so-called "priority", here stored in "$factor". The same rules as for the minimum apply, except that if the priority is invalid, it
is set to the minimum. This should allow me to just enter "x" as a priority and it should still work.


if [ "$factor" -lt "$min" ]; then
factor="$min"
fi


Sanity check: make sure "$factor" is at least as big as "$min".


if expr "$3" : '[0-9][0-9]*$' >/dev/null; then
max=$3
elif expr "$3" : '[0-9][0-9]*%$' >/dev/null; then
max=$(($ram * ${3%?} / 100))
else # error
max=$min # do not enlarge the partition
fi


Argument 3 is the maximum size. Same rules apply. If invalid, it's set to the minimum.


if [ "$max" -lt "$min" ]; then
max="$min"
fi


Another sanity check: the maximum should be at least as large as the minimum.


case "$4" in # allow only valid file systems
ext2|ext3|xfs|reiserfs|jfs|linux-swap|fat16|fat32|hfs)
fs="$4"
;;
*)
fs=ext2
;;
esac


The 4th argument is the file system type of the partition. If it's not in the list, it's set to ext2.


shift; shift; shift; shift
line="$min $factor $max $fs $*"
if [ "$scheme" ]; then
scheme="${scheme}${NL}${line}"
else
scheme="$line"
fi
line=''
;;


"shift" removes an argument from the input. Four shifts remove "$1", "$2", "$3" and "$4", leaving everything after "$4" in "$*". Now a "cleaned up" "$line"
is created with the sanitized values from above and the rest of the line ($*). It is stored on a new line in "$scheme". This is important: "$scheme" contains
a multiline version of the recipe. (I've peeked ahead.) Later on, the recipe will be read line by line by setting $IFS to "\n".
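The shift-and-rebuild step, reduced to a sketch. "add_line" is a made-up helper and the two recipe lines are invented; "$NL" is assumed to hold a single newline, as the partman scripts appear to use it.

```shell
NL='
'
scheme=''

# Hypothetical helper: append one sanitized partition line to $scheme.
add_line () {
    min=$1; factor=$2; max=$3; fs=$4
    shift; shift; shift; shift        # drop the four values just read
    line="$min $factor $max $fs $*"   # "$*" is whatever came after them
    if [ "$scheme" ]; then
        scheme="${scheme}${NL}${line}"
    else
        scheme="$line"
    fi
}

add_line 2048 2048 2048 linux-swap 'method{ swap }'
add_line 500 10000 1000000000 ext3 'method{ format }' 'mountpoint{ / }'
printf '%s\n' "$scheme"
```

This prints one partition definition per line, which is exactly the shape that gets split on newlines later.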


*)
if [ "$line" ]; then
line="$line $word"
else
line="$word"
fi
esac


For any other word (not ":" "::" or "."), the word is glued at the end of the current "$line".
This completes the circle :)


done
}


And that was "decode_recipe()".
As far as I can tell, this piece of code is correct. It does what is expected and does some error correction in the process.
After this function, the following variables will have been set:


$name

The name of the scheme

$scheme

The parsed version of the recipe. It's a multiline string (separated by \n) with a partition definition on each line



Contrary to what I expected, "$ram" is not kept outside of this function (notice the "local ram ..." at the start of the function)
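The effect of "local" is easy to check in a few lines. The function names are made up; note that "local" is not in POSIX, but dash, busybox ash and bash all support it.

```shell
with_local () {
    local ram       # ram only exists while this function runs
    ram=512
}
without_local () {
    other=512       # no "local": this leaks into the caller
}

ram=''
other=''
with_local
without_local
echo "ram=[$ram] other=[$other]"   # prints ram=[] other=[512]
```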


The next function I encountered (a lot) is the "foreach_partition()" function. It's pretty clear what this function does, but I decided to look at it anyway
because it could contain weirdness.


foreach_partition () {
local - doing IFS partition former last
doing=$1
IFS="$NL"
former=''
for partition in $scheme; do
restore_ifs
if [ "$former" ]; then
set -- $former
last=no
eval "$doing"
fi
former="$partition"
done
if [ "$former" ]; then
set -- $former
last=yes
eval "$doing"
fi
}


This looks pretty innocent. Like I said, the partitions in "$scheme" are separated by "\n". In this function, the separate partitions are loaded into
"$partition" one by one. The variable "$last" is set to "no" for all except the last partition, where it is set to "yes". Then "$doing" is eval'ed.

So, the first argument of "foreach_partition()" is the code (a string that gets eval'ed) you want to run on each partition. The partition details (min, factor,
max, fs and the rest) are passed as "$1", "$2", "$3", etc.
Furthermore, there is a variable "$last" which indicates whether the current partition is the last one.
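To play with this pattern outside the installer, here is a simplified stand-in. It is a sketch, not partman's code: "each_partition" is a made-up name, it uses "unset IFS" where the original calls "restore_ifs", and the scheme is invented.

```shell
NL='
'
scheme="2048 0 2048 linux-swap${NL}64 0 64 ext3${NL}500 100 1000000000 ext3"

# Simplified stand-in for foreach_partition().
each_partition () {
    doing=$1
    former=''
    IFS="$NL"                  # split $scheme on newlines only
    for partition in $scheme; do
        unset IFS              # back to default word splitting
        if [ "$former" ]; then
            set -- $former
            last=no
            eval "$doing"
        fi
        former="$partition"
    done
    unset IFS                  # also covers the case of an empty $scheme
    if [ "$former" ]; then
        set -- $former
        last=yes
        eval "$doing"
    fi
}

each_partition 'echo "fs=$4 min=$1 last=$last"'
```

The one-step delay through "$former" is what lets the code know, at eval time, whether another partition follows.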

Interesting to note is that the IFS is not restored for the last partition. I'm not sure whether this is by design or if it's a bug. (Notice the "restore_ifs()"
call in the for loop.)

[after checking]
It's not a bug :) The code works as expected: "restore_ifs()" has already run inside the loop by the time the last partition is handled. The only time it would
not work is if there was nothing in "$scheme", but then "$former" would not be set either.

Both "open_dialog()" and "close_dialog()" are interesting as well, as it seems they call the parted_server. On closer inspection, "open_dialog()" writes a
command into a FIFO (destination parted_server) and reads the result from another FIFO (origin parted_server). It then handles any errors that appear.
"close_dialog()" cleans up the FIFOs.

The FIFOs are called /var/lib/partman/infifo and /var/lib/partman/outfifo and are opened on descriptors 6 and 7 respectively.

The "read_line()" function is a wrapper around "read" that reads from the outfifo (the output of parted_server).

Now here is a promising function: "pull_primary()"


pull_primary () {
primary=''
logical=''
foreach_partition '
if
[ -z "$primary" ] \
&& echo $* | grep '\''\$primary{'\'' >/dev/null
then
primary="$*"
else
if [ -z "$logical" ]; then
logical="$*"
else
logical="${logical}${NL}$*"
fi
fi'
}


This function accumulates data in 2 variables: "$primary" and "$logical". Both are empty at the start.
Then it goes over every partition. If it's a primary partition (one whose definition contains "$primary{"), it is put in "$primary". Otherwise it's appended to "$logical".

Now what is interesting is that only the first primary partition is put in "$primary". The rest are appended to "$logical".
Tests confirm it.
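A sketch that reproduces this behaviour without the rest of partman. The scheme is invented (two partitions flagged "$primary{ }"), and a plain "while read" loop stands in for "foreach_partition()".

```shell
NL='
'
# Made-up scheme; single quotes keep "$primary{" literal.
scheme='64 0 64 ext3 $primary{ } mountpoint{ /boot }
2048 0 2048 linux-swap
500 100 1000000000 ext3 $primary{ } mountpoint{ / }'

primary=''
logical=''
while IFS= read -r part; do
    if [ -z "$primary" ] && echo "$part" | grep '\$primary{' >/dev/null; then
        primary="$part"                  # only the first match lands here
    elif [ -z "$logical" ]; then
        logical="$part"
    else
        logical="${logical}${NL}${part}"
    fi
done <<EOF
$scheme
EOF

echo "primary: $primary"
echo "logical:"
echo "$logical"
```

The second "$primary{ }" partition ends up in "$logical", together with the swap partition.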

[At first glance, this seems to be an unfortunate naming scheme for the variables, nothing more.]

Some more functions: "partition_before()" and "partition_after()".
Both functions return the ID of the partition coming before or after a given partition ID.
The sort order is defined by what "open_dialog PARTITIONS" returns.
The parted_server calls "ped_disk_next_partition()" over and over, so I'm guessing it reads the IDs in disk order. Now what it outputs is interesting:

/* Returns informational string about `part' from `disk'. Format:*/
/* Number<tab>id<tab>length<tab>type<tab>fs<tab>path<tab>name */
char *
partition_info(PedDisk *disk, PedPartition *part)
{
char const *type;
char const *fs;
char *path;
char const *name;
char *result;
assert(disk != NULL && part != NULL);
if (PED_PARTITION_FREESPACE & part->type) {
bool possible_primary = possible_primary_partition(disk, part);
bool possible_logical = possible_logical_partition(disk, part);
if (possible_primary)
if (possible_logical)
type = "pri/log";
else
type = "primary";
else if (possible_logical)
type = "logical";
else
type = "unusable";
} else if (PED_PARTITION_LOGICAL & part->type)
type = "logical";
else
type = "primary";

if (PED_PARTITION_FREESPACE & part->type)
fs = "free";
else if (PED_PARTITION_METADATA & part->type)
fs = "label";
else if (PED_PARTITION_EXTENDED & part->type)
fs = "extended";
else if (NULL == (part->fs_type))
fs = "unknown";
else
fs = part->fs_type->name;
if (0 == strcmp(disk->type->name, "loop")) {
path = strdup(disk->dev->path);
/* } else if (0 == strcmp(disk->type->name, "dvh")) { */
/* PedPartition *p; */
/* int count = 1; */
/* int number_offset; */
/* for (p = NULL; */
/* NULL != (p = ped_disk_next_partition(disk, p));) { */
/* if (PED_PARTITION_METADATA & p->type) */
/* continue; */
/* if (PED_PARTITION_FREESPACE & p->type) */
/* continue; */
/* if (PED_PARTITION_LOGICAL & p->type) */
/* continue; */
/* if (part->num > p->num) */
/* count++; */
/* } */
/* path = ped_partition_get_path(part); */
/* number_offset = strlen(path); */
/* while (number_offset > 0 && isdigit(path[number_offset-1])) */
/* number_offset--; */
/* sprintf(path + number_offset, "%i", count); */
} else {
path = ped_partition_get_path(part);
}
if (ped_disk_type_check_feature(part->disk->type,
PED_DISK_TYPE_PARTITION_NAME)
&& ped_partition_is_active(part))
name = ped_partition_get_name(part);
else
name = "";
asprintf(&result, "%i\t%lli-%lli\t%lli\t%s\t%s\t%s\t%s",
part->num,
(part->geom).start * PED_SECTOR_SIZE,
(part->geom).end * PED_SECTOR_SIZE + PED_SECTOR_SIZE - 1,
(part->geom).length * PED_SECTOR_SIZE, type, fs, path, name);
free(path);
return result;
}


Notice the "pri/log" stuff. I suspect it of stirring up trouble...

When I execute the PARTITIONS command myself in vmware by doing:

~ # cd /var/lib/partman
/var/lib/partman # echo PARTITIONS =dev=scsi=host0=bus0=target0=lun0=disc > infifo
/var/lib/partman # cat outfifo
OK
1 32256-2048094719 2048062464 primary linux-swap /dev/scsi/host0/bus0/target0/lun0/part1
2 2048094720-2113896959 65802240 primary ext3 /dev/scsi/host0/bus0/target0/lun0/part2
-1 2113896960-5362882559 3248985600 pri/log free /dev/scsi/host0/bus0/target0/lun0/part-1


(not copy pasted ;)

The ID "-1" on the 3rd line is interesting. Also the "pri/log" and "free".

After some looking through the logs, I believe the "-1" stands for empty disk space.
I found the following text in the logs (partman has logs!!!!!! look in /var/log/partman):

/bin/perform_recipe: IN: NEW_PARTITION =dev=scsi=host0=bus0=target0=lun0=disc primary ext3 2113896960-5362882559 beginning 3256000001
parted_server: Read command: NEW_PARTITION
parted_server: command_new_partition()
parted_server: Note =dev=scsi=host0=bus0=target0=lun0=disc as changed
parted_server: Opening outfifo
parted_server: requested partition with type primary
parted_server: requested partition with file system ext3
parted_server: add_primary_partition(disk(10485760),4128705-10488080)
parted_server: OUT: Error

parted_server: OUT: Can't have a partition outside the disk!

parted_server: OUT:

parted_server: OUT: Cancel
...


I think I may have tripped over the bug here.
According to parted_server, there are 3248985600 bytes free for a new partition, but partman wants to create a partition of 3256000001 bytes.
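The numbers in the "add_primary_partition" log line look like sector counts. A quick sketch to convert them, assuming 512-byte sectors (an assumption, but 10485760 * 512 gives exactly the 5368709120-byte disk size) and a shell with 64-bit arithmetic:

```shell
sector=512        # assumption: 512-byte sectors
disk=10485760     # sectors, from "disk(10485760)"
start=4128705     # first sector of the requested partition
end=10488080      # last sector of the requested partition

echo "disk size:      $(($disk * $sector)) bytes"                   # 5368709120
echo "partition size: $((($end - $start + 1) * $sector)) bytes"     # 3256000512
echo "overshoot:      $((($end - ($disk - 1)) * $sector)) bytes"    # 1188352
```

So the request is rounded up to whole sectors and its last sector lies 2321 sectors past the end of the disk, which matches the error message.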

I tried with the following sizes and results:

3253000000 OK
3255000000 NOK
3254000000 OK
3254500000 OK
3254800000 OK
3254900000 NOK
3254850000 NOK
3254830000 NOK
3254815000 NOK
3254807500 OK
3254810000 OK
3254812500 NOK
3254811250 OK
3254811850 OK
3254812000 OK
3254812250 NOK
3254812125 OK
3254812200 NOK
3254812180 NOK
3254812150 OK
3254812165 NOK
3254812160 NOK
3254812155 OK
3254812157 OK
3254812159 OK
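The manual search above is a bisection, and it could be scripted. In this sketch "try_size" is only a stand-in with the threshold hard-coded from the results above; a real version would have to ask parted_server whether the partition can be created.

```shell
# Stand-in for "try to create a partition of $1 bytes".
# Hypothetical: the limit is simply the value found by hand.
try_size () {
    [ "$1" -le 3254812159 ]
}

lo=3253000000     # known to work
hi=3255000000     # known to fail
while [ $(($hi - $lo)) -gt 1 ]; do
    mid=$((($lo + $hi) / 2))
    if try_size "$mid"; then
        lo=$mid   # still works: move the lower bound up
    else
        hi=$mid   # fails: move the upper bound down
    fi
done
echo "largest size that works: $lo"   # prints 3254812159
```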

I'm not sure where the 3254812159 (max partition size) - 3248985600 (free on disk) = 5826559 bytes come from.
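One observation that follows from the figures alone (plain arithmetic, not something checked against the parted source): the largest size that works is exactly one byte short of the distance from the start of the free region to the end of the disk.

```shell
start=2113896960        # first byte of the free region
free_end=5362882559     # last byte of the free region, as reported
disk=5368709120         # disk size in bytes
max_ok=3254812159       # largest request that worked

echo $(($disk - $start))           # 3254812160: from start to end of disk
echo $(($disk - 1 - $free_end))    # 5826560: between free region and end of disk
```

That would suggest the extra bytes are the stretch between the end of the reported free region and the physical end of the disk, but why that stretch is not reported as free is a separate question.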

In any case, since the command line contains the word "beginning", the points of origin are limited to 2 places in the perform_recipe script.

And since we defined it as primary, it is most likely that the problem is located here:


...
while
[ "$free_type" = pri/log ] \
&& echo $scheme | grep '\$primary{' >/dev/null
do
pull_primary
set -- $primary
open_dialog NEW_PARTITION primary $4 $free_space beginning ${1}000001
read_line num id size type fs path name
...


The "$free_space" variable is calculated from the last pass through the while loop (when sda2 was created).


...
neighbour=$(partition_after $id)
if [ "$neighbour" ]; then
open_dialog PARTITION_INFO $neighbour
read_line x1 new_free_space x2 new_free_type fs x3 x4
close_dialog
fi
if
[ -z "$neighbour" -o "$fs" != free \
-o "$new_free_type" = primary -o "$new_free_type" = unusable ]
then

... [we don't get in here]

fi
shift; shift; shift; shift
setup_partition $id $*
primary=''
scheme="$logical"
free_space=$new_free_space
free_type="$new_free_type"


"partition_after()" returns "2113896960-5362882559" and it is stored in "$neighbour".
Then some PARTITION_INFO is queried for this neighbour. The following variables are set:


$new_free_space

2113896960-5362882559

$new_free_type

pri/log

$fs

free



The condition of the "if" doesn't match, so that can be skipped. What remains is "setup_partition()" and some variable transfers.
Although "setup_partition()" seems to contain some interesting code (!), it does not alter any non-local variables.

We now know where all variables come from in this line:

open_dialog NEW_PARTITION primary $4 $free_space beginning ${1}000001



$4

This is the 4th argument of the partition definition of the partition we are about to create. Its value is "ext3".

$free_space

This is the value stored in $new_free_space during the previous run of the while loop. The value is obtained from the PARTITION_INFO query on the ID returned by "partition_after()". Its value is 2113896960-5362882559.

${1}

This is the first argument of the partition definition. It's the minimum size of the partition, but it went through a number of calculations and choices and by now actually contains the size to create the partition with.



The minimum size is calculated as being 3256 MB:
The entire disk is 5368709120 bytes. This is rounded to 5368MB.
Then we lose 2048MB and 64MB and a potential 500MB, which gives 2756MB unallocated.

We enter the calculation loop with the following data:

sda1: 2048 0 2048
sda2: 64 0 64
sda3: 500 100 1000000000
The second number is the "factor" calculated from the "priority".

"$free_size" is 5368MB and unallocated is 2756MB.


oldscheme=''
while [ "$scheme" != "$oldscheme" ]; do
oldscheme="$scheme"
factsum=$(factor_sum)
unallocated=$(($free_size - $(min_size)))
if [ $unallocated -lt 0 ]; then
unallocated=0
fi
scheme=$(
foreach_partition '
local min fact max newmin
min=$1
fact=$2
max=$3
shift; shift; shift
newmin=$(($min + $unallocated * $fact / $factsum))
if [ $newmin -le $max ]; then
echo $newmin $fact $max $*
else
echo $max 0 $max $*
fi'
)
echo "XXX[$unallocated]"
echo "[$oldscheme][$scheme]";
echo
done


The key formula here is newmin=$(($min + $unallocated * $fact / $factsum))

For sda1 and sda2, this doesn't change anything because fact is 0.
For sda3, unallocated is 2756, fact is 100 and factsum is also 100 (and I suppose factsum will always be 100 in any case, since it's 100%)
So the minimum size for sda3 is calculated to be 500 + 2756*100/100, which is 3256MB
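The same calculation as a runnable sketch, using the simulated values (unallocated 2756, factsum 100, and a factor of 100 for sda3 as mentioned above). "new_min" is a made-up name wrapping the formula from the loop.

```shell
unallocated=2756
factsum=100

# Hypothetical helper: new minimum for one partition (min, fact, max).
new_min () {
    newmin=$(($1 + $unallocated * $2 / $factsum))
    if [ "$newmin" -le "$3" ]; then
        echo "$newmin"
    else
        echo "$3"          # never grow beyond the maximum
    fi
}

new_min 2048 0 2048          # sda1: prints 2048
new_min 64 0 64              # sda2: prints 64
new_min 500 100 1000000000   # sda3: prints 3256
```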

OK, let's look at the real data again, because I've been using simulated data.
At the start of perform_recipe, the "$free_size" is calculated by means of a PARTITION_INFO command. The result is 5368709120 bytes, rounded to 5368MB.

By looking at how partitions are created, I notice an odd thing:
The first partition (2048MB) is created starting from offset 32256 and going to 2048094719, with a reported size of 2048062464 bytes (asked for 2048000001
bytes). The second one is created from 2048094720 to 2113896959, with a reported size of 65802240 bytes (asked for 64000001 bytes).
(The last partition is obviously not created, with a requested size of 3256000001 bytes.)

The reported sizes added together: 2048062464 + 65802240 + 3254812159 = 5368676863
(The 3254812159 is the maximum amount we were allowed to use for the 3rd partition)
The disk is 5368709120 bytes, so what remains (full disk - what's in use) = 5368709120 - 5368676863 = 32257 bytes.

This is, give or take one byte, the 32256 bytes missing from the front of the disk. Probably not a coincidence.
Maybe the partman-auto author overlooked this.

Now that I think of it, 32KB is really not that much... but maybe enough to throw everything out of whack. On the other hand, the partition only has to be 1
byte too big for parted_server to fail.
Considering that there is a lot of going back and forth between bytes and MB and that megabytes are not calculated cleanly (1000000 bytes instead of
1024*1024), I could be dealing with rounding errors.
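To get a feeling for the sizes involved in that conversion, here is a small sketch. "request_bytes" is a made-up name for the "${1}000001" trick seen in perform_recipe; the rest is plain arithmetic on the disk size (64-bit shell arithmetic assumed).

```shell
bytes=5368709120

echo $(($bytes / 1000000))       # 5368: megabytes as the scripts count them
echo $(($bytes / 1024 / 1024))   # 5120: binary megabytes
echo $(($bytes % 1000000))       # 709120: bytes lost by truncating to MB

# Hypothetical helper: turn a size in MB back into a byte request
# by pasting "000001" behind it.
request_bytes () {
    echo "${1}000001"
}

request_bytes 3256               # prints 3256000001
```

The truncation alone already hides up to a megabyte per conversion, which is far more than the single byte needed to make parted_server refuse the request.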