Age | Commit message (Collapse) | Author |
|
|
|
On error, we must free the temporary stack segment
while holding up->seglock.
|
|
postnotepg()
Half NERR stack to 32.
When posing a note to a large group, avoid allocating Notes
for each individual process, but post the reference instread.
factor out process interruption into procinterrupt().
Avoid allocation of notes in alarmkproc, just posting the
same note to everyone.
|
|
de-bloat the proc structure by allocating notes
with on the heap instead of embedding them in
the proc structure.
This saves around 640 bytes per process.
|
|
Our #! line length is very short, and the naïve quoting
makes it difficult to pass more complicated arguments to
the programs being run. This is fine for simple interpreters,
but it's often useful to pass arguments to more complicated
interpreters like auth/box or awk.
This change raises the limit, but also switches to tokenizing
via tokenize(2), rather than hand rolled whitespace splitting.
The limits chosen are arbitrary, but they leave approximately
3 KiB of stack space on 386, and 13k on amd64. This is a lot
of stack used, but it should leave enough for fairly deep
devtab chan stacks.
|
|
|
|
joe7)
devproc allows changing the noteid of another process
which opens a race condition in sysrfork(), when deciding
to inherit the noteid of "up" to the child and calling
pidalloc() later to take the reference, the noteid could
have been changed and the childs noteid could have been
freed already in the process.
this bug can only happen when one writes the /proc/n/noteid
file of a another process than your own that is in the
process of forking.
the noteid changing functionality of devproc seems questinable
and seems to be only used by ape's setpgrid() implementation.
|
|
The old strategy of wait and retry doesnt seem to
work very well as it keeps all the forking parents
stuck waiting in the kernel worsening the situation.
The idea with this change is to have rfork() return
error quickly; and without whining; as most callers
would just react with a sysfatal() which might be
better for surviving this.
|
|
behaviour in a.out(6)
For 64-bit architectures, the a.out header has the HDR_MAGIC flag set
in the magic and is expanded by 8 bytes containing the 64-bit virtual
address of the programs entry point. While Exec.entry contains physical
address for kernel images.
Our sysexec() would always use Exec.entry, even for 64-bit a.out binaries,
which worked because PADDR(entry) == entry for userspace pointers.
This change fixes it, having the kernel use the 64-bit entry point
and document the behaviour in the manpage.
|
|
we might as well handle the per process cycle
counter in the portable part instead of duplicating the code
in every arch and have inconsistent implementations.
we now have a portable kenter() and kexit() function,
that is ment to be used in trap/syscall from user,
which updates the counters.
some kernels missed initializing Mach.cyclefreq.
|
|
progarg[0] can be assigned to elem directly as it is a
copy in kernel memory, so the char proelem[64] buffer
is not neccesary.
do the close-on-exit outside of the segment lock. there
is no reason to keep the segment table locked.
|
|
in case the calling process changes its arguments under us, it could
happen that the final argument string lengths become bigger than
initially calculated. this is fine as we still make sure we wont
overflow the stack segment, but we could overrun into the tos
structure at the end of the stack. so change the limit to the
base of the tos, not the end of the stack segment.
|
|
devproc assumes that when we hold the Proc.debug qlock,
the process will be prevented from exiting. but there is
another race where the process has already exited and
the Proc* slot gets reused. to solve this, on process
creation we also have to acquire the debug qlock while
initializing the fields of the process. this also means
newproc() should only initialize fields *not* protected
by the debug qlock.
always acquire the Proc.debug qlock when changing strings
in the proc structure to avoid doublefree on concurrent
update. for changing the user string, we add a procsetuser()
function that does this for auth.c and devcap.
remove pgrpnote() from pgrp.c and replace by static
postnotepg() in devproc.
avoid the assumption that the Proc* entries returned by
proctab() are continuous.
fixed devproc permission issues:
- make sure only eve can access /proc/trace
- none should only be allowed to read its own /proc/n/text
- move Proc.kp checks into procopen()
pid reuse was not handled correctly, as we where only
checking if a pid had a living process, but there still
could be processes expecting a particular parentpid or
noteid.
this is now addressed with reference counted Pid
structures which are organized in a hash table.
read access to the hash table does not require locks
which will be usefull for dtracy later.
|
|
|
|
replace machine specific userinit() by a portable
implemntation that uses kproc() to create the first
process. the initcode text is mapped using kmap(),
so there is no need for machine specific tmpmap()
functions.
initcode stack preparation should be done in init0()
where the stack is mapped and can be accessed directly.
replacing the machine specific userinit() allows some
big simplifications as sysrfork() and kproc() are now
the only callers of newproc() and we can avoid initializing
fields that we know are being initialized by these
callers.
rename autogenerated init.h and reboot.h headers.
the initcode[] and rebootcode[] blobs are now in *.i
files and hex generation was moved to portmkfile. the
machine specific mkfile only needs to specify how to
build rebootcode.out and initcode.out.
|
|
Proc.parent
for better system diagnostics, we *ALWAYS* want to record the parent
pid of a user process, regardless of if the child will post a wait
record on exit or not.
for that, we reverse the roles of Proc.parent and Proc.parentpid so
Proc.parentpid will always be set on rfork() and the Proc.parent
pointer will point to the parent's Proc structure or is set to nil
when no wait record should be posted on exit (RFNOWAIT flag).
this means that we can get the pid of the original parent process
from /proc, regardless of the the child having rforked with the
RFNOWAIT flag. this improves the output of pstree(1) somewhat if
the parent is still alive. note that theres no guarantee that the
parent pid is still valid.
the conditions are unchanged:
a user process that will post wait record has:
up->kp == 0 && up->parent != nil && up->parent->pid == up->parentpid
the boot process is:
up->kp == 0 && up->parent == nil && up->parentpid == 0
and kproc's have:
up->kp != 0 && up->parent == nil && up->parentpid == 0
|
|
make exec() clear the per process error string
to avoid spurious errors and confusion.
the errstr() syscall used to always swap the
maximum buffer size with memmove(), which is
problematic as this gives access to the garbage
beyond the NUL byte. worse, newproc(), werrstr()
and rerrstr() only clear the first byte of the
input buffer. so random stack rubble could be
leaked across processes.
we change the errstr() syscall to not copy
beyond the NUL byte.
the manpage also documents that errstr() should
truncate on a utf8 boundary so we use utfecpy()
to ensure proper NUL termination.
|
|
segattach(), set SG_RONLY in data2txt()
the user should not be able to change the cache
attributes for a segment in segattach() as this
can cause the same memory to be mapped with
conflicting attributes in the cache.
SG_TEXT should always be mapped with SG_RONLY
attribute. so fix data2txt() to follow the rules.
|
|
|
|
|
|
was only implemented by the pc kernel. does not
account pages used by the mount cache.
|
|
on HZ 100 systems like pc and pc64, the minium sleep time
was 10ms, which is quite high. the cap isnt really needed
as arch specific timerset() enforces its own limit, but on
a higher resolution.
background:
from Charles Forsyth:
I haven't really got an opinion on it. The 10ms interval was first used on
machines that were much slower.
I thought someone did set HZ to a bigger value, partly to support better
in-kernel timing. I haven't done it because I never had a need for it.
If I were doing (say) protocol implementation in user mode, I'd certainly
reconsider. Sleep itself forces at best ms granularity,
and for some applications that's too big.
initial mail from qwx raising the issue:
> Hello,
>
> I found out recently that sleep(2)'s resolution on 386 and 9front's amd64
> kernel is 10 ms rather than 1 ms. The reason is that on those kernels,
> HZ is set to 100 rather than say 1000. In syssleep, we get 1 tich every
> 10 ms.
>
> What is unclear is why.
>
> To paraphrase cinap_lenrek's answer to my question:
>
> In syssleep:
> if(ms < TK2MS(1))
> ms = TK2MS(1);
> tsleep(&up->sleep, return0, 0, ms);
>
> "TK2MS(1)" can be replaced with just "1", and the arch specific
> timerset() routine would do its own capping of the period if it's too
> small for the timer resolution, and make better decisions based on what
> the minimum timer period should be given the latency overhead of the
> given arch's interrupt handling and performance characteristics.
>
> Alternatively, HZ could be raised to 500 or 1000.
>
> It seems it's just trying to prevent excessive context switches and
> interrupts, but it seems somewhat arbitrary. A ton of syscalls can be
> done in 1 ms, and it's the lowest we can go without changing the unit.
>
>
> What do you think?
>
> Thanks in advance,
>
> qwx
|
|
|
|
we might wake up on a different cpu after the sleep so
delta from machX->ticks - machY->ticks can become negative
giving spurious timeouts. to avoid this always use the
same mach 0 tick counter for the delta.
|
|
- return distinct error message when attempting to create Globalseg with physseg name
- copy directory name to up->genbuf so it stays valid after we unlock(&glogalseglock)
- cleanup wstat() handling, allow changing uid
- make sure global segment size is below SEGMAXSIZE
- move isoverlap() check from globalsegattach() into segattach()
- remove Proc* argument from globalsegattach(), segattach() and isoverlap()
- make Physseg.attr and segattach attr parameter an int for consistency
|
|
|
|
we have to validaddr() and vmemchr() all argv[] elements a second
time when we copy to the new stack to deal with the fact that another
process can come in and modify the memory of the process doing the
exec. so the argv[] strings could have changed and increased in
length. we just make sure the data being copied will fit into the
new stack and error when we would overflow.
also make sure to free the ESEG in case the copy pass errors.
|
|
argv[] strings get copied to the new processes stack segment, which
has a maximum size of USTKSIZE, so limit the size of the strings to
that and check early for overflow.
|
|
this moves the name validation out of segattach() to syssegattach()
to make sure the segment name cannot be changed by the user while
segattach looks at it.
|
|
|
|
|
|
when executing a script, we did advance argp0 unconditionally
to replace argv[0] with the script name. this fails when
argv[] is empty, then we'd advance argp0 past the nil terminator.
the alternative would be to *not* advance if *argp0 == nil, but that
would require another validaddr() check for a case that is unlikely
to have been anticipated in most programs being invoked as
libc's ARGBEGIN macro assumes argv[0] being non-nil as it also
unconditionally advances the argv pointer.
to keep us sane, we now reject an empty argv[]. on entry, we
verify that argv[] is valid for at least two elements:
- the program name argv[0], has to be non-nil
- the first potential nil terminator in argv[1]
when argv[0] == nil, we throw Ebadarg "bad arg in system call"
|
|
rebootcmd() handle AOUT_MAGIC macro
|
|
|
|
|
|
- reject files smaller or equal to two bytes, they are bogus
- fix out of bounds access in shargs() when n <= 2
- only copy the bytes read into line buffer
- use nil for pointers instead of 0
|
|
fixed segments are continuous in physical memory but
allocated in user pages. unlike shared segments, they
are not allocated on demand but the pages are allocated
on creation time (devsegment). fixed segments are
never swapped out, segfreed or resized and can only be
destroyed as a whole.
the physical base address can be discovered by userspace
reading the ctl file in devsegment.
|
|
|
|
the "to" address can overflow in syssegfree() causing wrong
number of pages to be passed to mfreeseg(). with the current
implementation of mfreeseg() however, this doesnt cause any
data corruption but was just freeing an unexpected number of
pages.
this change checks for this condition in syssegfree() and
errors out instead. also mfreeseg() was changed to take
ulong argument for number of pages instead of int to keep
it consistent with other routines that work with page counts.
|
|
ignore physical segments in mcountseg() and mfreeseg(). physical
segments are not backed by user pages, and doing putpage() on
physical segment pages in mfreeseg() is an error.
do now allow physical segemnts to be resized. the segment size
is only checked in segattach() to be within the physical segment!
ignore physical segments in portcountpagerefs() as pagenumber()
does not work on the malloced page structures of a physical segment.
get rid of Physseg.pgalloc() and Physseg.pgfree() indirection as
this was never used and if theres a need to do more efficient
allocation, it should be done in a portable way.
|
|
WHAT WHERE THEY *THINKING*??!?!
unlike seek, the (new) nsec syscall (not used in 9front libc)
returns the time value in register (from nix), so do the same
for compatibility.
|
|
noswap flag on exec
|
|
Lock
make the Page stucture less than half its original size by getting rid of
the Lock and the lru.
The Lock was required to coordinate the unchaining of pages that where
both cached and on the lru freelist.
now pages have a single next pointer that is used for palloc.head
freelist xor for page cache hash chains in Image.pghash[].
cached pages are not on the freelist anymore, but will be reclaimed
from images by the pager when the freelist runs out of pages.
each Image has its own 512 hash chains for cached page lookup. That is
2MB worth of pages and there should be no collisions for most text images.
page reclaiming can be done without holding palloc.lock as the Image is
the owner of the page hash chains protected by the Image's lock.
reclaiming Image structures can be done quickly by only reclaiming pages from
inactive images, that is images which are not currently in use by segments.
the Ref structure has no Lock anymore. Only a single long that is atomically
incremented or decremnted using cmpswap().
there are various other changes as a consequence code. and lots of pikeshedding,
sorry.
|
|
the new syscall is added under the symbol _nsec() for
binary compatibility.
nsec() is still a library function reading /dev/bintime.
|
|
|
|
we free the wrong pointer in the waserror() block.
|
|
|
|
|
|
this change is in preparation for amd64. the systab calling
convention was also changed to return uintptr (as segattach
returns a pointer) and the arguments are now passed as
va_list which handles amd64 arguments properly (all arguments
are passed in 64bit quantities on the stack, tho the upper
part will not be initialized when the element is smaller
than 8 bytes).
this is partial. xalloc needs to be converted in the future.
|
|
sysexec()
|