Disclaimer: Any use of information and code found on this page is at your own risk. You may write to me if you have any problems, but I will not promise to help you.

Unofficial comp.os.linux.development.* FAQ

On this page I will try to answer a few of the questions that I often see on comp.os.linux.development.apps and comp.os.linux.development.system. If anybody knows an official FAQ for these newsgroups please tell me about it. This page is in urgent need of cleanup and reorganization, but I don't feel I have the time to do that right now. You may also mail me if you have suggestions for additions or some of the currently missing answers.

Did you find this page on Google? I can see in the webserver log that a lot of people reach this page that way. Some of you will find the answer to your question on this page and some of you will not. If you have read the entire page without finding the answer to your question, the obvious next step would be to come along and ask on the appropriate one of the two newsgroups. If you have additional questions after reading this page, go ahead and ask. I cannot answer questions unless you ask. Don't be like that guy who first found this page by searching for information about the return value of kernel_thread, just to find out that it was the same as sys_fork. Three hours later he was back searching for information about the return value of sys_fork. Did he really waste three hours searching for the answer to a question I could have answered in five minutes?

Unless stated otherwise code snippets on this page are copyrighted by me, and may be used under the GPL version 2.
  1. Where can I download Linux?
    1. http://kernel.org/#newtolinux
    2. http://google.com/linux?q=distributions
    3. http://lwn.net/Distributions/

  2. I want to write a kernel module, where do I start?
  3. Can you recommend me some books or sources of information about kernel hacking?
    These suggestions were supplied by Daniel Versick:
  4. Where do I find online man pages?
    Here are a few locations:
  5. Why do I get unresolved symbols when trying to load a kernel module?

  6. What are modversions?
    http://www.kernelnewbies.org/faq/#compmod.xml

  7. Where do the messages from my printk statements go?
    If you are using a textmode VC they will usually go directly to the screen. If you are using X this will not be the case. In most cases debugging kernel modules is best done without X. If you insist on debugging your kernel modules under X, there are a couple of places to look for the output. Notice that the log filename depends on the distribution:
    Whether a message is logged directly to the screen depends on its severity. There are eight levels. The most important is level 0, which is used only if the kernel has essentially crashed. The least important is level 7, which is used for debugging output. The level is indicated with three chars at the start of each line: "<" level ">". There are defines in <linux/kernel.h> for every level. If no level is specified a default will be used. The kernel has a variable specifying how important a message must be to get logged directly to the screen. Usually the klogd process will change this variable so fewer messages make it to the screen, but all messages will be sent to the syslogd process for logging in files.

  8. What is the difference between a kernel thread and a kernel module?
    I'm glad you asked, too many newbies assume they are the same without even asking.
  9. Can I start kernel threads from a kernel module?
    Yes you can do that. Most modules don't need to create any kernel threads. If you are writing a driver and think you need a kernel thread you could easily be wrong. Anyway if you find that you really need a kernel thread it can easily be created with the kernel_thread function:
    #include <fixme>
    extern int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags);
    
    This will create a new kernel thread. The return value from kernel_thread is interpreted in the same way as the return value from sys_fork: a positive number is the pid of the created thread, and a negative number indicates an error condition. A return value of zero should be impossible, because the child terminates after calling fn and does not return to the caller of kernel_thread. The action performed by the new thread is equivalent to the statement exit(fn(arg));. The arg parameter is often used to pass a pointer to a struct to the newly created thread. Take care to ensure that the struct still exists when the thread needs it. If the struct is a local variable, the function calling kernel_thread could easily have returned before the struct has been read. If arg is not needed it is usually just filled with the value NULL. The first thing you want to do in a new kernel thread usually is this:
    int my_thread(void *arg)
    {
            daemonize("kmyd");
            while(1) {
                    /* Insert your own code here */
            }
    }
    

    Notice that like all other processes kernel threads will become zombies until the parent has seen the status. If a kernel thread is created in the initialization function the parent will be the module loader. When the module loader terminates the process will get init as its new parent. In that case init will take care of waiting for the terminated threads. In all other cases you have to ensure that the zombies are taken care of. Kernel threads cannot ignore their children like userspace programs can, when a userspace program wants to ignore its children the kernel will do the waiting and just not tell the userspace program.

    Removing a module using kernel threads is very difficult to do without creating race conditions. The safest solution is to create a module that cannot be removed. In the initialization call MOD_INC_USE_COUNT to get a usecount of 1. Don't change the usecount anywhere else. For an example that can be removed, look at the kernel usb driver linux/drivers/usb/hub.c. It has also had its race conditions, but they have presumably been fixed. Here is another (untested) example which I believe is race free:

    static DECLARE_COMPLETION(mythread_exited);
    static atomic_t time_to_quit = ATOMIC_INIT(0);
    
    int my_thread(void *arg)
    {
    #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
            daemonize("kmyd");
    #else
            daemonize();
            strcpy(current->comm, "kmyd");
    #endif
            while(!atomic_read(&time_to_quit)) {
                    /* Insert your own code here */
            }
            complete_and_exit(&mythread_exited, 0);
    }
    
    static void __exit exit_mymodule(void)
    {
    	atomic_inc(&time_to_quit);
    	/* Insert code here to wake up the thread if necessary */
    	wait_for_completion(&mythread_exited);
    }
    
  10. Can I read and write files from a kernel module?
    Yes, but it is a little tricky. You should not use the usual user space functions open, read, write, and close. Instead you should use filp_open and the methods in the returned struct. Always remember to clean up after yourself. Here is an example: (FIXME: Put the updated version here) kcp.c. If you need more examples I suggest you take a look at the source code for the khttpd or tux webserver. Always remember that file access is only possible in a process context.

    When doing file access from within the kernel, think carefully about your design. In many cases it is a bad idea. Configuration files should never be read by kernel code; instead pass arguments to the kernel or the module loader. If it is something complicated, write a userspace utility to read and parse the configuration file, and then pass it to the kernel in some appropriate way, possibly by building up a kernel structure through a sequence of calls. One example doing this is the iptables-restore command.

    Firmware which a driver needs to transfer to a device is often stored as an array of chars in a header file. This is usually init data, so the memory is freed right before /sbin/init is started. If you want to load firmware from a file, that can also be done by a user mode utility. If you still want to access a file from within the kernel, at least don't hardcode the path. The filename can be passed as an argument to the kernel or the module loader.

  11. Can I add a systemcall from a module?
  12. Can I use C++ code in the kernel and modules?
    No, read more in the kernel mailing list FAQ.

  13. Does Linux have CPU affinity?
    The answer below is mostly outdated. In kernel version 2.6 user mode processes can use the sched_setaffinity and sched_getaffinity system calls. There is a patch which backports them to 2.4.

    All recent Linux versions will try to keep processes on the same CPU for as long as possible. But processes will be moved if the CPUs are not equally loaded. True CPU affinity was introduced in 2.4.0, but is only available in kernel mode. It can be used from kernel modules, but was not used by the kernel itself. Starting in 2.4.7-pre5 it is used by ksoftirqd to start one instance for each CPU. In 2.5.8-pre3 a systemcall to set CPU affinity was introduced.
    I have written a module implementing a userspace interface for the CPU affinity in 2.4.x kernels. WARNING: this is untested code cpus_allowed.tgz

  14. How do I use files larger than 2GB?
    To use files of 2GB or more on a 32bit architecture, you need to pass the O_LARGEFILE flag to the open system call. If anybody knows which header file to include to get this macro please tell me about it. Meanwhile you can use these few lines in your code:
    #ifndef O_LARGEFILE
    #define O_LARGEFILE 0100000
    #endif
    
    When calling open you can do like in this example:
    fd=open(filename,O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE,0666);
    
    If you need to use large files through stdio you need a version where fopen uses the O_LARGEFILE flag when calling open. Otherwise a possible workaround is to avoid fopen and instead build your own fopen implementation using open followed by fdopen.
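    The open followed by fdopen workaround could look roughly like the sketch below. The function name fopen_large_write is made up for this example. The sketch also assumes that defining _GNU_SOURCE before the includes makes glibc's <fcntl.h> provide O_LARGEFILE, and it falls back to 0 on platforms where the macro is still missing (an assumption that only makes sense where the flag is not needed).

```c
#define _GNU_SOURCE	/* assumption: lets glibc's <fcntl.h> define O_LARGEFILE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#ifndef O_LARGEFILE
#define O_LARGEFILE 0	/* assumption: the flag is not needed on this platform */
#endif

/* Hypothetical helper: like fopen(filename,"w"), but opens with O_LARGEFILE. */
FILE *fopen_large_write(const char *filename)
{
	FILE *f;
	int fd=open(filename,O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE,0666);

	if (fd==-1)
		return NULL;
	f=fdopen(fd,"w");
	if (!f)
		close(fd);	/* fdopen failed, don't leak the descriptor */
	return f;
}
```

    The returned stream is closed with fclose as usual, which also closes the underlying file descriptor.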

    A helpful person told me a little more about using large files and devices:

    Hello.

    I needed to essentially treat an entire hard drive as one large file, and discovered I couldn't get past 2 GB into it until I defined the following macros in my source:
    #define _GNU_SOURCE
    #define _LARGEFILE_SOURCE  
    #define _LARGEFILE64_SOURCE  
    #define _FILE_OFFSET_BITS 64
    
    The _LARGEFILE_SOURCE macro permits the use of fseeko() and ftello(), among other things. _LARGEFILE64_SOURCE actually permits the use of 64-bit functions like fseeko64() and ftello64() and, finally, _FILE_OFFSET_BITS determines which interface will be used by default. If the macro isn't defined or is defined as 32 then the default interface is 32 bits. If it is defined as 64 then the default interface is the 64-bit interface, and a call to fseeko() will use this 64 bit interface.

    I cribbed from the file NOTES that accompanies glibc v2.2.5 on my Slackware 8.1 box to write this E-mail but I know these extensions exist in glibc v2.2.3 as well.

    HTH

  15. How do I use more than 2GB of virtual memory in one process?
    The only real solution to your problem is using a 64bit architecture, but there are workarounds that will help you. Normally the 4GB address space is split into 4 equal sized sections for different purposes. The first is used for the executable and brk/sbrk allocations. The second is used for mmaps and shared libraries, and also for malloc when it uses mmap. The third is used for the stack, and the fourth is used for the kernel itself. Since mmaps grow bottom up and the stack grows top down, the part not used by the one may be used by the other. You may want to change the split between brk and mmap by changing the TASK_UNMAPPED_BASE define in linux/include/asm/processor.h. Or you may try this patch for Linux version 2.4.17. The patch still needs more testing. You can also change the split between userspace and kernelspace, that is the define PAGE_OFFSET_RAW in linux/include/asm/page_offset.h. Notice that incorrect settings could make the kernel unusable. You should always leave at least 8MB plus 3% of your physical memory for the kernel. The size for the kernel probably needs to be divisible by 4MB.

    A completely different approach has been suggested on the kernel mailing list (linux-kernel@vger.kernel.org). You can allocate a number of shared memory segments using SysV or POSIX shared memory. Then map and unmap segments as needed. (This is a little similar to EMM known from DOS.) The segments will be limited in size, but you can have a lot of segments with a total size even going beyond 4GB if you have enough physical ram or swap space. It will also be possible to mmap files or parts of files in your process. Memory mapped files behave in most cases like shared memory.
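    The map and unmap idea can be sketched with plain mmap on a file descriptor. In the sketch below the backing storage is an unlinked temporary file, but the descriptor could just as well come from shm_open. The names WINDOW_SIZE, create_backing, map_window and unmap_window, the 1MB window size, and the /tmp path are all choices made up for this illustration.

```c
#define _FILE_OFFSET_BITS 64	/* offsets beyond 2GB on 32bit architectures */
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define WINDOW_SIZE (1L<<20)	/* hypothetical window size, a multiple of the page size */

/* Create unnamed backing storage of the given size; returns an fd or -1. */
int create_backing(off_t size)
{
	char path[]="/tmp/windowXXXXXX";
	int fd=mkstemp(path);

	if (fd==-1)
		return -1;
	unlink(path);	/* the storage lives until the fd is closed */
	if (ftruncate(fd,size)) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Map window number n of the backing storage; returns NULL on failure. */
void *map_window(int fd, long n)
{
	void *p=mmap(NULL,WINDOW_SIZE,PROT_READ|PROT_WRITE,MAP_SHARED,
	             fd,(off_t)n*WINDOW_SIZE);

	return p==MAP_FAILED ? NULL : p;
}

/* Unmap a window again, so the address space can be reused. */
int unmap_window(void *p)
{
	return munmap(p,WINDOW_SIZE);
}
```

    Only the windows that are mapped at a given moment use address space. Data written through a MAP_SHARED window is still there when the same window is mapped again later.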

  16. Why does Red Hat Linux 7.3, 8.0, and 9 load glibc on address 42000000?
    On Red Hat Linux 7.3, 8.0, and 9 the C library is compiled for a fixed address rather than a dynamic address. The constant Red Hat has chosen is a little above the default TASK_UNMAPPED_BASE, so even though a few memory mappings have been made before libc is mapped, it can get the address it wants. In most cases this fixed address for libc is a good idea, but it does have a single disadvantage. If you change TASK_UNMAPPED_BASE to get more contiguous address space, the choice of 42000000 is very bad. You can install the source rpm and change this address in the spec file to something else like 07000000. You should also add your initials to the release field so it is always clear that this is a version you changed. Then you can build a new rpm file and install it. On Red Hat 7.3 an easier, but not as good, solution is to install the i386 version of glibc. (Does this work on RH8.0 as well?)

  17. How do I use POSIX shared memory in Linux?
    You might find the shm_open(3) man page helpful. What it does not mention is that you also need to include <sys/fcntl.h>. And remember that you must compile with -lrt, which is easy to miss in the man page. Basically the Linux implementation of shm_open just prepends /dev/shm to the name and calls open(2). The /dev/shm directory must exist and should be the mountpoint of a tmpfs filesystem. (This filesystem used to be called shm, but that name is now obsolete.) Notice that with Fedora Core 1 running on an i586, programs compiled with -lrt will not work reliably.
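    A minimal sketch of how shm_open is typically combined with ftruncate and mmap follows. The helper name shm_attach and the 0600 mode are choices made for this example, and the sketch includes the standard <fcntl.h> for the O_* flags. Depending on the glibc version it may have to be linked with -lrt.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or open) a POSIX shared memory object and
 * map size bytes of it. The name should start with a slash.
 * Returns NULL on failure. */
void *shm_attach(const char *name, size_t size)
{
	void *p;
	int fd=shm_open(name,O_RDWR|O_CREAT,0600);

	if (fd==-1)
		return NULL;
	if (ftruncate(fd,size)==-1) {
		close(fd);
		return NULL;
	}
	p=mmap(NULL,size,PROT_READ|PROT_WRITE,MAP_SHARED,fd,0);
	close(fd);	/* the mapping stays valid after the close */
	return p==MAP_FAILED ? NULL : p;
}
```

    Two processes calling shm_attach with the same name see the same memory. The object stays around until somebody calls shm_unlink on the name.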

  18. How do I read single chars from a terminal?
    Floyd Davidson wrote a nice posting on comp.os.linux.development.apps about this.

  19. How do I prevent zombie processes?
    Zombie processes are dead children ignored by their parents. If you want to create a child process and don't want to care about it anymore, you can use the double fork trick. In the simple case, where you just want to know success or failure, it can be done like this (untested code):
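    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    
    /* Returns 0 in the new (grandchild) process. In the original process
     * it returns 1 on success and -1 on failure.
     */
    int doublefork()
    {
    	pid_t pid=fork();
    	int status;
    	switch(pid) {
    		case 0:
    			/* The intermediate child forks again and exits at once. */
    			switch(fork()) {
    				case 0: return 0;
    				case -1: _exit(1);
    				default: _exit(0);
    			}
    		case -1:
    			return -1;
    	}
    	/* Reap the intermediate child, so it never stays a zombie. */
    	if (waitpid(pid,&status,0)!=pid) return -1;
    	if (status) return -1;
    	return 1;
    }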
    
    In the more complicated case, where you need the pid of the newly created process and/or the correct errno value, inter process communication is also needed (untested code):
    pid_t doublefork()
    {
    	struct {
    		pid_t pid2;
    		int errno_;
    	} data;
    	int pipefd[2];
    	pid_t pid1;
    
    	if (pipe(pipefd)) return -1;
    	pid1=fork();
    	switch(pid1) {
    		case 0:
    			close(pipefd[0]);
    			data.pid2=fork();
    			data.errno_=errno;
    
    			if (data.pid2) {
    				write(pipefd[1],&data,sizeof(data));
    				_exit(0);
    			}
    			close(pipefd[1]);
    			return 0;
    		case -1:
    			close(pipefd[0]);
    			close(pipefd[1]);
    			return -1;
    	}
    
    	close(pipefd[1]);
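    	/* Don't hide the read inside assert, it vanishes with NDEBUG. */
    	if (read(pipefd[0],&data,sizeof(data))!=sizeof(data)) {
    		close(pipefd[0]);
    		waitpid(pid1,NULL,0);
    		return -1;
    	}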
    	close(pipefd[0]);
    	waitpid(pid1,NULL,0);
    	errno=data.errno_;
    	return data.pid2;
    }
    
    Some systems have better ways, but the above is probably the most portable. Also read this section in the Unix Programming FAQ

  20. How can I wait for termination of a process not my child?
    It would often be convenient to have a way to get notified about termination of an arbitrary process not necessarily your own child. Unfortunately there is no standard way to do that. There are a few different possible tricks that may help:
    1. If the process you want to wait for is related to your own process, you might be able to get away with a pipe trick. This trick requires that a pipe was created by a common ancestor through the use of the pipe() system call. (In some cases that could be one of the involved processes, because one is an ancestor of the other.) The process to be waited for has to be the writer, and the process that is to wait has to be the reader. The trick is that reading a pipe gives you an EOF when there are no more writers. For this trick to work it is important to close the write side of the pipe in all processes other than the one to be waited for. There are a few nice ways to use the pipe for more than just waiting for a single process to terminate:
      • You can wait for all processes in some group to terminate by having all of them being writers of the pipe.
      • Multiple processes can be waiting for the same process to terminate because multiple readers will work as expected.
      • You can make use of the close on exec flag on the write end of the pipe to indicate if you want to be notified if the process calls execve and not only if it terminates.
      • You can even use select on the read end of the pipe.
    2. You can create a loop where you check once every half second if the process in question still exists. This is not a good solution, but it is very portable and probably the best that can be done if the scenario does not allow you to use the pipe trick.
    3. Finally you can make use of ptrace(), but that is rarely a good idea because:
      • It will affect the process, which might not behave exactly as it is intended to.
      • It can slow down the process being ptraced.
      • A process can have only one tracer.
      • You can only trace your own processes.
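    The pipe trick from item 1 can be sketched as follows. To keep the example short the waiting process is the parent itself, but the read end could be inherited by any other related process. The names spawn_watched, wait_for_writers and demo_child are made up for this sketch.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for the work done by the watched process. */
void demo_child(void)
{
	/* real work would go here */
}

/* Start a process running fn() which holds the write end of a pipe.
 * Stores the read end in *watchfd and returns the pid, or -1 on error. */
pid_t spawn_watched(void (*fn)(void), int *watchfd)
{
	int pipefd[2];
	pid_t pid;

	if (pipe(pipefd))
		return -1;
	pid=fork();
	if (pid==0) {
		close(pipefd[0]);
		fn();	/* the write end stays open while fn runs */
		_exit(0);
	}
	close(pipefd[1]);	/* the waiter must not be a writer itself */
	if (pid==-1) {
		close(pipefd[0]);
		return -1;
	}
	*watchfd=pipefd[0];
	return pid;
}

/* Block until all writers are gone; returns 0 on EOF and -1 on error. */
int wait_for_writers(int watchfd)
{
	char c;
	ssize_t n;

	do {
		n=read(watchfd,&c,1);	/* any data written is discarded */
	} while (n>0 || (n==-1 && errno==EINTR));
	return n==0 ? 0 : -1;
}
```

    Because the read end is an ordinary file descriptor, it can also be handed to select together with other descriptors instead of blocking in read.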

  21. How do I detect if a pid exists?
    You can do that by using the kill() system call. If you use 0 in the signal field no signal will be sent, but error checking is still done. A return value of 0 indicates the process exists, and you could send a signal to it. A return value of -1 indicates that it does not exist or is a process you could not send a signal to, in that case use errno to find out what is the case. A value of ESRCH indicates the process does not exist. A value of EPERM indicates the process does exist, but you are not allowed to send a signal to it. See the kill(2) man page for more information.
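    The rules above translate into a few lines of code. This is only a sketch, and the function name pid_exists is made up for this example.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Returns 1 if the pid exists, 0 if it does not exist,
 * and -1 if kill failed for some other reason. */
int pid_exists(pid_t pid)
{
	if (kill(pid,0)==0)
		return 1;	/* exists, and we could send it a signal */
	if (errno==EPERM)
		return 1;	/* exists, but we may not send it a signal */
	if (errno==ESRCH)
		return 0;	/* no such process */
	return -1;
}
```

    Keep in mind that the answer can be outdated as soon as you have it: the process may terminate, and the pid may later be reused by an unrelated process.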

  22. Why do I get a SIGSEGV but no core dump?
    There are many possible reasons, I will try to list them all, but I could easily have forgotten some:
  23. How do I change the name and location of the core file?
    The two pseudofiles /proc/sys/kernel/core_uses_pid and /proc/sys/kernel/core_pattern control the naming of core dumps in Linux 2.4 and later. If core_uses_pid is set to 1, the pid will be appended to the end of any core dump. By default it is only appended for multithreaded programs.

    You can get more control over the name by changing core_pattern. You can use it to just change the name of the file, which will still be dumped in the current directory. You can also specify a full path to have it dumped in a fixed location rather than the current directory. That can be handy because some file systems are not nice to core dump to. (The root file system, tmpfs, and nfs are examples of places where a core dump can hurt.) You can use certain escape sequences in the name specification:
    %%  A % character
    %p  The pid (if you use this, it will no longer be appended)
    %u  The user id as a number
    %g  The group id as a number
    %s  The signal causing the dump
    %t  The time when the dump started
    %h  The hostname
    %e  The name of the executable
    For example you could set the path as /mnt/bigfs/coredumps/%u/%e which would put core dumps from different users in different directories and name the core dumps according to the program name. If used in this way you would have to create the directories beforehand and chown them to the appropriate user. If you wanted to access them by user name rather than uid, you could create symlinks.

  24. How do I get a core dump from a running program?
    If you just want a program to terminate now and dump core, you can use the SIGABRT signal. This signal can be sent from another process using kill, or it can be sent by the process itself using kill, raise, or abort. If you want a core dump without killing the process, things start getting more tricky. You can create a child process by using the fork system call, and let the child dump core. The init program actually does this in its signal handlers. From the outside, the kernel offers no simple way to get a core dump from a process without killing it. But gdb has a gcore command that will do the hard work. On Fedora Core you can also call gcore from your shell (in which case it is just a script that calls gdb).
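    The fork approach can be sketched like this. The function name dump_core_and_continue is made up for this example. Whether a core file is actually written still depends on the core size limit and the other conditions that apply to any core dump.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a copy of the process and let the copy abort.
 * Returns 0 if the child was terminated by SIGABRT, -1 otherwise. */
int dump_core_and_continue(void)
{
	int status;
	pid_t pid=fork();

	if (pid==-1)
		return -1;
	if (pid==0)
		abort();	/* child: raises SIGABRT and dumps core if allowed */
	if (waitpid(pid,&status,0)!=pid)
		return -1;
	if (WIFSIGNALED(status) && WTERMSIG(status)==SIGABRT)
		return 0;
	return -1;
}
```

    The original process continues after the call, and the core file shows its state at the time of the fork. In a multithreaded program only the thread calling fork exists in the child.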

  25. How can I make a suid executable dump core?
    If you have the source you can easily add a statement early in main to enable the dumpable flag.
    #include <sys/prctl.h>
    ...
    /* The last three arguments are just padding, because the
     * system call requires five arguments.
     */
    prctl(PR_SET_DUMPABLE,1,42,42,42);
    
    If anybody knows a clean and secure way to get core dumps from suid executables without a recompile please tell me about it. Meanwhile I hacked myself a module to use as a last resort when you really need a core dump from a suid executable and don't want to go through the entire compile process. When loaded with the pid symbol set to a number, the dumpable flag of that process will be enabled. The module will always return an error code to get unloaded. (FIXME: Find out what the purpose of /proc/sys/kernel/core_setuid_ok is)

  26. Why can't I delete /proc/kcore?
    Files in /proc are not really files, they are pseudo files. Kcore can be used to debug the running kernel. The format is similar to a usual core file. You don't have to worry about this file, it will always exist and doesn't indicate any problem. If you use ls, the file will appear to have a size close to the amount of kernel memory in the system. But it does not take up any disk space or memory.

  27. How do I find a user's home directory?
    There are different ways depending on the situation. If you just want to find the current user's home directory, you can use the HOME environment variable. Simply use getenv("HOME") in your program. You should verify that the returned pointer is not null and that the string it points to is not empty. If the HOME environment variable is not properly set up, print an error and abort the program. In emacs and most shells, the current user's home is also called "~". Here is a code example:
    char *myhome()
    {
    	char *r=getenv("HOME");
    	if ((!r)||((*r) != '/')) {
    		fprintf(stderr,"Fix the HOME environment variable.\n");
    		return "/";
    	}
    	return r;
    }
    
    Finding a named user's home directory is a different matter. To do that you need to use the getpwnam function. Given a username it returns a struct which amongst others contains a field with the home directory. You have to verify the return code to know if the user exists. But if you get a non NULL pointer, you can safely assume the returned structure is indeed valid. In emacs and most shells, this is called "~username". Here is a code example:
    char *usershome(const char * name)
    {
    	struct passwd *tmp=getpwnam(name);
    	if (!tmp) return NULL;
    	return tmp->pw_dir;
    }
    
    In the rare case of a suid executable needing the current user's home directory, you should not use the HOME environment variable. In general a suid executable should not use anything from the environment. In this case you can use the getpwuid function. Here is a code example:
    char *myhome_suid_version()
    {
    	struct passwd *tmp=getpwuid(getuid());
    	if (!tmp) {
    		fprintf(stderr,"You don't exist! Go away!\n");
    		exit(1);
    	}
    	return tmp->pw_dir;
    }
    
    If you are not writing a suid executable, you should not worry about the user changing the HOME environment variable. In fact you should respect the user's wish to use another home directory. Respecting the changed home is not a bug, it is a feature. Not respecting HOME would be a bug.

  28. How do I use crypt to generate and verify a password?
    Notice this only describes the use of crypt, to really make use of this under Linux you also need to understand how PAM works.

    crypt takes two strings as arguments and outputs a pointer to one string. The same function can be used in two different ways to generate and verify a password. The first argument string is the password typed by the user. The second argument string is the salt. You are allowed to pass a string with something appended to the salt, crypt knows how long the salt is, and will only use the start of the salt string if it is too long. The output is the salt concatenated with the "encoded" password. This output string is what is actually stored in the password file.

    To verify a password give the password typed by the user, and the entry from the password file as arguments to crypt. Compare the output against the entry from the password file. Because the salt is a prefix of the password file entry, crypt will find the salt in the string.
    	if (strcmp(crypt(typed_password,
    	                 passwd_file_entry),
    	           passwd_file_entry))
    	{
    		fprintf(stderr,"Invalid login\n");
    		exit(1);
    	}
    
    To generate an entry for a password file, your program must choose a random salt. There are different formats for the salt depending on the type of encoding. Conventionally UNIX systems have used a variant of DES. That is not very secure, so Linux has another more secure version based on MD5, which is unfortunately less portable. Since crypt looks at the salt to find out which type of password is being used, the above verification code works without modifications for both types of passwords. When generating the password, however, you have to make the decision as you generate the salt. The salt must contain some random chars. Generating the random chars using /dev/random is the recommended approach. The random chars are taken from the set of upper and lower case letters, digits, period, and slash. The DES salt is just two random chars. The MD5 salt is dollar one dollar eight random chars dollar. Newer glibc versions also accept the same layout with the prefix dollar six dollar, which selects SHA-512 and is what the example code uses. Here is some example code.
    #define _XOPEN_SOURCE
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <fcntl.h>
    
    char *encode(char *password)
    {
    	int i=0;
    	uint8_t rand_buf[8];
    	/* Changed to $6$ to use the more secure SHA-512 hash instead of MD5 */
    	char salt[13]="$6$________$";
    	int fd=open("/dev/random",O_RDONLY);
    	if (fd==-1) {
    		perror("/dev/random");
    		exit(1);
    	}
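    	while(i<8) {
    		/* Continue after the bytes already read, and stop on errors. */
    		int n=read(fd,rand_buf+i,8-i);
    		if (n<=0) {
    			perror("/dev/random");
    			exit(1);
    		}
    		i+=n;
    	}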
    	close(fd);
    	for (i=0;i<8;++i)
    		salt[i+3]="./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"[rand_buf[i]&63];
    	return crypt(password,salt);
    }
    
    Notice that crypt returns a pointer to a static buffer. That means you don't need to call free on the pointer returned by crypt. OTOH the string will be overwritten by the next crypt call. For that reason you should not expect crypt to be thread safe. A program using crypt needs to be linked with -lcrypt.

  29. How do I find the amount of memory on the system?
    On Linux you can use sysconf(_SC_PHYS_PAGES) to get the number of physical pages available on the system. This number does exclude some reserved pages, but it basically gives you the right number to use in your applications. You can use sysconf(_SC_PAGE_SIZE) to find out how large each page is. Notice that you cannot just multiply the two numbers to find the number of bytes of physical memory, as that multiplication might cause an overflow. The number of free pages can be found with sysconf(_SC_AVPHYS_PAGES).
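    If you do need the size in bytes, one way to avoid the overflow is to do the multiplication in a 64-bit type. This is a sketch, and the function name physical_memory_bytes is made up for this example.

```c
#include <stdint.h>
#include <unistd.h>

/* Physical memory in bytes, or 0 if sysconf cannot tell. */
uint64_t physical_memory_bytes(void)
{
	long pages=sysconf(_SC_PHYS_PAGES);
	long page_size=sysconf(_SC_PAGE_SIZE);

	if (pages<0 || page_size<0)
		return 0;
	/* Convert before multiplying, so the product is computed in 64 bits. */
	return (uint64_t)pages*(uint64_t)page_size;
}
```

    When printing the result, remember to use a matching format such as PRIu64 from <inttypes.h>.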

    Take care before using these values. Most programs shouldn't care how much memory is in the system. Just allocate as much as you need and use it. The system will take care of the rest. Never forget to verify the return value from malloc.

    Some algorithms need to know how much physical memory is available to perform optimally. In such cases keep in mind that your program is not the only one running. Allocating a little less is usually a good idea, as a little unused memory is better than thrashing the swap partition because you tried to allocate too much. Put an upper limit on the number of pages you are going to allocate, as in some cases, for example 32-bit architectures with more than 4GB of physical memory, a single process cannot allocate all of it because of limited address space. You could use a code snippet similar to this.
       int allocpages=(sysconf(_SC_PHYS_PAGES)*3)>>2;
       if (allocpages>400000) allocpages=400000;
       allocation=malloc(allocpages*sysconf(_SC_PAGE_SIZE));
    
    Rather than trying to use as much physical memory as possible you could try to get the best possible performance from what memory you get. If there is a good cache oblivious algorithm for the problem you are trying to solve, you should use it.

    A lot of kernel versions still overcommit memory. That means they will allow you to allocate more than possible and start killing processes as problems appear. To avoid processes getting killed unexpectedly, you must make sure there is enough available swap space. My recommendation is to make the swap space three times the size of the physical memory.

  30. How do I get a list of network interfaces and their IP and MAC addresses?
    First a nice example from Floyd Davidson. Below follow some smaller examples.
    /* display info about network interfaces */
    
    #define _BSD_SOURCE
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/ioctl.h>
    #include <sys/types.h>
    #include <net/if.h>
    #include <net/if_arp.h>
    #include <arpa/inet.h>
    
    #define inaddrr(x) (*(struct in_addr *) &ifr->x[sizeof sa.sin_port])
    #define IFRSIZE   ((int)(size * sizeof (struct ifreq)))
    
    static int
    get_addr(int sock, char * ifname, struct sockaddr * ifaddr) {
    
      struct ifreq *ifr;
      struct ifreq ifrr;
      struct sockaddr_in sa;
    
      ifr = &ifrr;
      ifrr.ifr_addr.sa_family = AF_INET;
      strncpy(ifrr.ifr_name, ifname, sizeof(ifrr.ifr_name));
    
      if (ioctl(sock, SIOCGIFADDR, ifr) < 0) {
        printf("No %s interface.\n", ifname);
        return -1;
      }
    
      *ifaddr = ifrr.ifr_addr;
      printf("Address for %s: %s\n", ifname, inet_ntoa(inaddrr(ifr_addr.sa_data)));
      return 0;
    }
    
    int
    main(void)
    {
      unsigned char      *u;
      int                sockfd, size  = 1;
      struct ifreq       *ifr;
      struct ifconf      ifc;
      struct sockaddr_in sa;
    
      if (0 > (sockfd = socket(AF_INET, SOCK_DGRAM, IPPROTO_IP))) {
        fprintf(stderr, "Cannot open socket.\n");
        exit(EXIT_FAILURE);
      }
    
      ifc.ifc_len = IFRSIZE;
      ifc.ifc_req = NULL;
    
      do {
        ++size;
        /* realloc buffer size until no overflow occurs  */
        if (NULL == (ifc.ifc_req = realloc(ifc.ifc_req, IFRSIZE))) {
          fprintf(stderr, "Out of memory.\n");
          exit(EXIT_FAILURE);
        }
        ifc.ifc_len = IFRSIZE;
        if (ioctl(sockfd, SIOCGIFCONF, &ifc)) {
          perror("ioctl SIOCGIFCONF");
          exit(EXIT_FAILURE);
        }
      } while  (IFRSIZE <= ifc.ifc_len);
    
      /* this is an alternate way to get info... */
      {
        struct sockaddr ifa;
        get_addr(sockfd, "ppp0", &ifa);
      }
    
      ifr = ifc.ifc_req;
      for (;(char *) ifr < (char *) ifc.ifc_req + ifc.ifc_len; ++ifr) {
    
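        /* compare the address bytes (not the array pointers) with the next
           entry, taking care not to read past the end of the buffer */
        if ((char *) (ifr + 1) < (char *) ifc.ifc_req + ifc.ifc_len &&
            0 == memcmp(ifr->ifr_addr.sa_data, (ifr+1)->ifr_addr.sa_data,
                        sizeof ifr->ifr_addr.sa_data)) {
          continue;  /* duplicate, skip it */
        }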
    
        if (ioctl(sockfd, SIOCGIFFLAGS, ifr)) {
          continue;  /* failed to get flags, skip it */
        }
    
        printf("Interface:  %s\n", ifr->ifr_name);
        printf("IP Address: %s\n", inet_ntoa(inaddrr(ifr_addr.sa_data)));
    
        /*
          This won't work on HP-UX 10.20 as there's no SIOCGIFHWADDR ioctl. You'll
          need to use DLPI or the NETSTAT ioctl on /dev/lan0, etc (and you'll need
          to be root to use the NETSTAT ioctl. Also this is deprecated and doesn't
          work on 11.00).
    
          On Digital Unix you can use the SIOCRPHYSADDR ioctl according to an old
          utility I have. Also on SGI I think you need to use a raw socket, e.g. s
          = socket(PF_RAW, SOCK_RAW, RAWPROTO_SNOOP)
    
          Dave
    
          From: David Peter <dave.peter@eu.citrix.com>
         */
    
        if (0 == ioctl(sockfd, SIOCGIFHWADDR, ifr)) {
    
          /* Select which  hardware types to process.
           *
           *    See list in system include file included from
           *    /usr/include/net/if_arp.h  (For example, on
           *    Linux see file /usr/include/linux/if_arp.h to
           *    get the list.)
           */
          switch (ifr->ifr_hwaddr.sa_family) {
          default:
            printf("\n");
            continue;
          case  ARPHRD_NETROM:  case  ARPHRD_ETHER:  case  ARPHRD_PPP:
          case  ARPHRD_EETHER:  case  ARPHRD_IEEE802: break;
          }
    
          /* SIOCGIFHWADDR stores its result in the ifr_hwaddr union member */
          u = (unsigned char *) ifr->ifr_hwaddr.sa_data;
    
          if (u[0] + u[1] + u[2] + u[3] + u[4] + u[5]) {
            printf("HW Address: %2.2x.%2.2x.%2.2x.%2.2x.%2.2x.%2.2x\n",
                 u[0], u[1], u[2], u[3], u[4], u[5]);
          }
        }
    
        if (0 == ioctl(sockfd, SIOCGIFNETMASK, ifr) &&
            strcmp("255.255.255.255", inet_ntoa(inaddrr(ifr_addr.sa_data)))) {
          printf("Netmask:    %s\n", inet_ntoa(inaddrr(ifr_addr.sa_data)));
        }
    
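        /* the ioctls above have overwritten the union that holds ifr_flags,
           so fetch the flags again before testing them */
        if (0 == ioctl(sockfd, SIOCGIFFLAGS, ifr) &&
            (ifr->ifr_flags & IFF_BROADCAST)) {
          if (0 == ioctl(sockfd, SIOCGIFBRDADDR, ifr) &&
              strcmp("0.0.0.0", inet_ntoa(inaddrr(ifr_addr.sa_data)))) {
            printf("Broadcast:  %s\n", inet_ntoa(inaddrr(ifr_addr.sa_data)));
          }
        }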
    
        if (0 == ioctl(sockfd, SIOCGIFMTU, ifr)) {
          printf("MTU:        %u\n",  ifr->ifr_mtu);
        }
    
        if (0 == ioctl(sockfd, SIOCGIFMETRIC, ifr)) {
          printf("Metric:     %u\n",  ifr->ifr_metric);
        }
        printf("\n");
      }
    
      close(sockfd);
      return EXIT_SUCCESS;
    }
    
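    This program, written by Frank Becker, does part of the job. It prints the IPv4 address of a single interface:
    #include <net/if.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    int main( int argc, char *argv[])
    {
         char *ifname = "eth0";
         struct ifreq ifr;
         int fd;
         int i;
    
         if( argc==2 ) ifname = argv[1];
    
         fd = socket(AF_INET,SOCK_DGRAM, 0);
         if (fd >= 0) {
             /* zero the request and never copy more than ifr_name can hold */
             memset(&ifr, 0, sizeof(ifr));
             strncpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name) - 1);
             /* on success bytes 2..5 of sa_data hold the IPv4 address */
             if (ioctl(fd, SIOCGIFADDR, &ifr) == 0) {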
                 printf( "%s : ", ifname);
                 for( i=2; i<6; i++)
                 {
                     unsigned char val =
                         (unsigned char)ifr.ifr_ifru.ifru_addr.sa_data[i];
                     printf( "%d%s", val, i==5?" ":".");
                 }
                 printf( "\n");
             }
         }
    
         return 0;
    }
    
    Another program written by Gary Desrosiers will print the MAC address of an interface:
    #include <string.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <netinet/in.h>
    #include <net/if.h>
    #include <sys/types.h>
    #include <arpa/inet.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <netdb.h>
    
    int main(int argc,char *argv[])
    {
      int r;
      struct protoent *proto;
      int sock;
      struct ifreq ifr;
    
      if(argc < 2)
      {
        printf("Usage: dumpmac ifname\n");
        exit(1);
      }
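      proto = getprotobyname("tcp");
      /* fall back to protocol 0 if the lookup fails */
      sock = socket(PF_INET,SOCK_STREAM,proto ? proto->p_proto : 0);
      if(sock < 0)
      {
        perror("socket");
        exit(1);
      }
      memset(&ifr,0,sizeof(struct ifreq));
      /* ifr was zeroed above, so the name stays NUL-terminated */
      strncpy(ifr.ifr_name,argv[1],sizeof(ifr.ifr_name)-1);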
      r = ioctl(sock,SIOCGIFHWADDR,&ifr);
      if(r != 0)
      {
        printf("Couldn't get mac address\n");
        exit(1);
      }
      printf("%2.2X:%2.2X:%2.2X:%2.2X:%2.2X:%2.2X\n",
      (unsigned char)ifr.ifr_hwaddr.sa_data[0],
      (unsigned char)ifr.ifr_hwaddr.sa_data[1],
      (unsigned char)ifr.ifr_hwaddr.sa_data[2],
      (unsigned char)ifr.ifr_hwaddr.sa_data[3],
      (unsigned char)ifr.ifr_hwaddr.sa_data[4],
      (unsigned char)ifr.ifr_hwaddr.sa_data[5]);
      return(0);
    }
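
    The programs above all use ioctl. As an alternative sketch, not taken from any of the contributors above, the getifaddrs function found in glibc returns the whole list in one call. The function name print_ipv4_interfaces is made up for this example, and it only handles IPv4 addresses.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <ifaddrs.h>

/* Print the name and IPv4 address of every configured interface.
 * Returns the number of addresses printed, or -1 if getifaddrs() fails. */
int print_ipv4_interfaces(FILE *out)
{
	struct ifaddrs *list, *ifa;
	char text[INET_ADDRSTRLEN];
	int count = 0;

	if (getifaddrs(&list) != 0)
		return -1;

	for (ifa = list; ifa != NULL; ifa = ifa->ifa_next) {
		struct sockaddr_in *sin;

		/* ifa_addr may be NULL for an interface without an address */
		if (ifa->ifa_addr == NULL || ifa->ifa_addr->sa_family != AF_INET)
			continue;
		sin = (struct sockaddr_in *) ifa->ifa_addr;
		if (inet_ntop(AF_INET, &sin->sin_addr, text, sizeof text) == NULL)
			continue;
		fprintf(out, "%s: %s\n", ifa->ifa_name, text);
		++count;
	}
	freeifaddrs(list);
	return count;
}
```

    Calling print_ipv4_interfaces(stdout) from main prints one line per address. The MAC address still has to be fetched with SIOCGIFHWADDR as shown above, or from the AF_PACKET entries in the same list.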
    
  31. How can I send my debugging output to a new xterm?
    This example could be of help: debugxterm.c. A better solution is to use the -S option of xterm; see the xterm man page for more information.

  32. I want to write a filesystem, how do I start?
    To write a filesystem you have to learn two things: the data structures you need to use, and the API used in the Linux kernel.

    First of all, you could take a look at the source code of the simplest of the existing filesystems. The simplest of all Linux filesystems is ramfs. There are two types of filesystems: those using a block device and those not using one. Ramfs belongs to the second type, which also contains all the network filesystems and pseudo filesystems like procfs.

    The filesystems you actually store on disks are those using block devices. They can be divided into two groups. The simplest are those that can be implemented using a get_block/bmap function; this group includes ext2, minix, and fat. The simplest one from this group that is still fully POSIX compliant is minix. If the data on disk is stored in some more compact way, the simple solution will not be possible. At this point it obviously starts getting a little complicated. There does, however, still exist a quite simple read-only filesystem which has its own readpage implementation: romfs.

    Before implementing the filesystem as a kernel module, you should get a feeling for the data structures you will be using. If you are using a block device, you should write user-mode tools to manipulate the data structures on the disk. You are going to need these pieces of code anyway, because you will eventually need three tools for your filesystem: mkfs, fsck, and debugfs. Once you have working user-mode code for the basic tasks, you can start doing it in the kernel. If you are writing another kind of filesystem, there are still a few things to do before writing kernel code. If you are writing a network filesystem, you want to know the protocols you are going to use, and you want to test them in user mode first.
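
    To give an idea of what such user-mode code can look like, here is a minimal sketch of the mkfs and fsck steps for a superblock. Everything in it is hypothetical: the name toyfs, the magic number, and the layout are invented for the example, and the fields are written in host byte order, which a real filesystem would have to pin down.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOYFS_MAGIC 0x746f7966u   /* hypothetical magic number */

/* Hypothetical on-disk superblock, stored at offset 0 of the device. */
struct toyfs_super {
	uint32_t magic;
	uint32_t block_size;    /* bytes per block */
	uint32_t block_count;   /* total number of blocks */
	uint32_t free_blocks;   /* blocks not yet in use */
};

/* mkfs step: write a fresh superblock. Returns 0 on success, -1 on error. */
int toyfs_mkfs(FILE *dev, uint32_t block_size, uint32_t block_count)
{
	struct toyfs_super sb;

	if (block_size == 0 || block_count < 2)
		return -1;
	memset(&sb, 0, sizeof sb);
	sb.magic = TOYFS_MAGIC;
	sb.block_size = block_size;
	sb.block_count = block_count;
	sb.free_blocks = block_count - 1;   /* block 0 holds the superblock */

	if (fseek(dev, 0L, SEEK_SET) != 0 ||
	    fwrite(&sb, sizeof sb, 1, dev) != 1 ||
	    fflush(dev) != 0)
		return -1;
	return 0;
}

/* fsck step: read the superblock back and check it for consistency.
 * Returns 0 if it looks valid, -1 otherwise. */
int toyfs_check(FILE *dev, struct toyfs_super *sb)
{
	if (fseek(dev, 0L, SEEK_SET) != 0 ||
	    fread(sb, sizeof *sb, 1, dev) != 1)
		return -1;
	if (sb->magic != TOYFS_MAGIC || sb->block_size == 0 ||
	    sb->free_blocks >= sb->block_count)
		return -1;
	return 0;
}
```

    The FILE can be an ordinary file opened with fopen(path, "w+b") while you experiment, and the block device itself later on. The same struct and checks can then be carried over to the kernel module.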

  33. Does my kernel leak memory?
    If you didn't modify the kernel, it probably doesn't leak memory. Of course, occasionally a kernel does get released with a bug that could cause a leak. Before you conclude that you have found a bug, you will need to know how to find and interpret information about the memory usage. The kernel will attempt to use most of the otherwise free memory for caching disk contents. This means that there will normally be very little memory that is actually free. Take a look at this output from the free command:
                 total       used       free     shared    buffers     cached
    Mem:        448980     443196       5784          0      36604     303404
    -/+ buffers/cache:     103188     345792
    Swap:      2096440          0    2096440
    
    The first line tells me that only 5784KB of my 448MB of memory are free. But in the second line, where the 332MB of memory used for buffers and cache is subtracted from the used memory, I see that only 100MB of the memory is used for other allocations. Some of those 100MB will be used for slabs. You can use slabtop to see information about this memory usage. Another possibility is to use my script which will list slab allocations (requires kernel version 2.4). Look at a few selected lines from the output:
    inode_cache         4863 pages  19452 KB
    dentry_cache        1231 pages   4924 KB
    buffer_head         1476 pages   5904 KB
    ----------------------------------------
    Total               8554 pages  34216 KB
    ----------------------------------------
    
    Here we see that most of the 33MB used for slabs is actually also cached disk contents and management data for the buffers. This is quite normal, and those slabs will get freed if memory gets tight. The rest of the memory in my system is mostly used by applications. Even if we sum up all the allocations I have talked about, there will still be a few KB of memory that are not accounted for. That is also normal; there are different types of allocations that use get_free_pages() directly. Those pages are only listed as used and do not show up anywhere else, but that doesn't mean they have leaked. Also notice that there is a difference between the 448MB I have and the 438MB that free shows. That is because the mem_map uses 1.7% of my physical memory, and the kernel image itself also uses a few MB. Those are not part of the total memory as reported by the kernel.
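
    The second line of the free output is plain arithmetic on the first one. The sketch below spells it out; the struct and function names are invented for the example.

```c
/* One "Mem:" line as printed by free, all values in KB. */
struct free_mem {
	long total, used, free, buffers, cached;
};

/* Memory used for something other than buffers and cache. */
long used_without_cache(const struct free_mem *m)
{
	return m->used - m->buffers - m->cached;
}

/* Memory that is free, or could be freed by dropping buffers and cache. */
long free_with_cache(const struct free_mem *m)
{
	return m->free + m->buffers + m->cached;
}
```

    With the numbers from the output above (used 443196, free 5784, buffers 36604, cached 303404) the two functions give 103188 and 345792, which are exactly the two values on the "-/+ buffers/cache" line.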

    So now that you know a little about the many different ways memory is allocated under Linux, how do you tell if there is a leak? First of all, if you suspect some action causes a leak, you should repeat it over and over again. If memory is allocated only the first time, it is probably not a leak. If there really is a leak, it will allocate more memory each time. Look at the slabs; if one type of slab keeps growing, you might have found a leak. Ignore the three slabs I told you about earlier. If the leaked memory is not allocated through slabs, it must be allocated at least one page at a time, in which case it will quickly grow large enough to be easily noticed. Eventually the system will die when it really runs out of memory.

    In addition to just monitoring the memory usage, you can also try to force the system to free memory. Let a program allocate a lot of virtual memory and touch each page. That will force the kernel to free anything that can be freed. An easy way to do this is by letting tail read from /dev/zero.
    tail -c400m /dev/zero
    
    This will allocate 400MB and access it over and over again.
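
    If you would rather do this from a program of your own, a sketch could look like the following. The function name touch_pages is made up, and unlike the tail command it touches each page only once. Because of overcommit the malloc call may succeed even when the memory is not available, in which case the problem only shows up while the pages are being touched.

```c
#include <stdlib.h>
#include <unistd.h>

/* Allocate the given number of bytes and write to one byte in every page,
 * so that the kernel has to back each page with real memory. Returns the
 * number of pages touched, or -1 if the allocation fails. */
long touch_pages(size_t bytes)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile char *mem;
	long touched = 0;
	size_t off;

	if (page <= 0)
		return -1;
	mem = malloc(bytes);
	if (mem == NULL)
		return -1;
	for (off = 0; off < bytes; off += (size_t) page) {
		mem[off] = 1;   /* volatile keeps the write from being optimized away */
		++touched;
	}
	free((void *) mem);
	return touched;
}
```

    For example touch_pages((size_t) 400 << 20) would allocate and touch 400MB.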

  34. I get SIGSEGV in malloc, calloc, realloc, or free. Are they buggy?
    No, they are probably not buggy. These four functions are some of the most often used code in the entire system. If there were any major bugs, they would have been found and fixed long ago. When one of these functions produces a SIGSEGV, in more than 99% of the cases it will be a bug in your code. Imagine one of the following situations: In all these cases, you damage the internal data structures used by malloc and friends. This is the real reason for those functions to give a SIGSEGV.

    There are tools for debugging this kind of bug. First of all, you can try setting MALLOC_CHECK_=2 in the environment. That may give you a core dump to debug at the first occurrence of a memory-management related problem. (Notice that with some library versions suid executables will ignore MALLOC_CHECK_ unless a file named /etc/suid-debug exists.) The second simplest tool to use is electric fence; just add -lefence to the command line while compiling. Your program will still give you a SIGSEGV, but it no longer happens in malloc. Now it happens in code closer to the real bug. Notice that there are a few environment variables, and you must try different combinations of them to spot all memory access bugs. Don't use electric fence in your final program; it will make the program slower and increase the memory usage.

    Another quite new tool is valgrind. Other tools worth mentioning are: mpatrol, Insure++, and purify.
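
    As an illustration of the kind of bug that damages malloc's data structures, consider copying a string into a block that is one byte too small. This is an assumed example chosen for this sketch, not one quoted from the answer above; the buggy variant is only described in the comment, the code itself is the corrected version.

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of s, or NULL if out of memory. */
char *copy_string(const char *s)
{
	/* BUG (do not do this): malloc(strlen(s)) leaves no room for the
	 * terminating NUL, so strcpy would write one byte past the block
	 * and may damage the bookkeeping data malloc keeps next to it.
	 * The crash then shows up in some later malloc or free call. */
	char *copy = malloc(strlen(s) + 1);   /* correct: room for the NUL */

	if (copy != NULL)
		strcpy(copy, s);
	return copy;
}
```

    Running the buggy variant under one of the tools mentioned above should report the invalid write at the strcpy call rather than at some unrelated later call.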

  35. My program doesn't work, will you look at my source?
    No, we will not look at your source, and you probably don't want us to see your source anyway. Take a copy of your entire source and start stripping it down; remove all parts unrelated to your problem. When you have an absolutely minimal program still demonstrating your problem, then consider posting it. But before posting it, read it over a few times. You might actually find the bug without our help, or you might find that it can be stripped down even more.

    When you have a small piece of code that doesn't work, and you don't understand why, then you can ask us. Use cut'n'paste when writing your posting. Typing it all in again is a waste of your time, and it is a waste of our time too, because we will just see the typos you made and never find the real problem.

    You should provide us with source that we can compile without warnings and test. (Unless of course your question is about compilation problems.)

  36. Why is do { } while (0) used everywhere?
    http://www.kernelnewbies.org/faq/index.php3#dowhile This is also discussed in question 10.4 in the C FAQ.
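
    In short, the wrapper makes a macro with several statements behave like a single statement. A small sketch, with a macro and function invented for the example:

```c
#include <stdio.h>

/* Without the do { } while (0) wrapper the three statements would not stay
 * together when the macro is used as the body of an if without braces. */
#define SWAP_INT(a, b) do { int tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Put the smaller value in *lo and the larger one in *hi. */
void sort_pair(int *lo, int *hi)
{
	if (*lo > *hi)
		SWAP_INT(*lo, *hi);   /* the semicolon here ends the do-while */
	else
		printf("already sorted\n");
}
```

    If the macro body were only wrapped in braces, the semicolon after SWAP_INT(...) would terminate the if statement and the else would become a syntax error.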

  37. How do I move the cursor in a terminal?
    Floyd Davidson has also written a posting on comp.os.linux.development.apps about how to use ncurses to do this. Later he also sent me a larger example.

  38. How do I use /dev/ptmx?
    Take a look at the Pseudo-Terminals section in the GNU C Library manual.

    One of the simplest pieces of example code you can find is the script program from the util-linux package. script.c is only around 8KB. It chooses between two different approaches depending on the HAVE_openpty define. Using openpty is the best approach and should work on any recent Linux distribution. (If you have problems compiling script.c try removing the localization stuff and add "#define _".)
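
    For the curious, this is roughly what happens underneath on Linux when /dev/ptmx is used directly. It is a Linux-specific sketch, not what script.c does: it assumes a mounted devpts filesystem which takes care of the slave's ownership and permissions, and the function name open_pty_master is made up. Portable programs should stay with openpty, or with posix_openpt, grantpt, unlockpt and ptsname.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Open a new pseudo-terminal master through /dev/ptmx and store the path of
 * the matching slave in name. Returns the master fd, or -1 on failure. */
int open_pty_master(char *name, size_t len)
{
	int unlock = 0;
	unsigned int number;
	int fd = open("/dev/ptmx", O_RDWR | O_NOCTTY);

	if (fd < 0)
		return -1;
	/* TIOCSPTLCK with 0 unlocks the slave, TIOCGPTN returns its number */
	if (ioctl(fd, TIOCSPTLCK, &unlock) != 0 ||
	    ioctl(fd, TIOCGPTN, &number) != 0 ||
	    snprintf(name, len, "/dev/pts/%u", number) >= (int) len) {
		close(fd);
		return -1;
	}
	return fd;
}
```

    After a successful call, a child process can open the slave path stored in name and use it as its terminal, while the parent reads and writes the master fd.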

  39. Where is the itoa function?
    There is no itoa function in ANSI C. If you know a standard specifying itoa please tell me about it. Here are a few suggestions on how to use sprintf when you want to convert an integer to a string:
    1. char *itoa(int i)
      {
      	char *s=malloc(42); /* Enough for a 128 bit integer */
      	if (s) sprintf(s,"%d",i);
      	return s;
      }
      
    2. char * itoa(int n)
      {
      	char *r;
      	int l,t=n/10;
      	if (t<0) t=-t;
      	for (l=4;t;++l) t/=10;
      	r=malloc(l);
      	if (r) sprintf(r,"%d",n);
      	return r;
      }
      
    3. Here follows yet another version written by Lew Pitcher:
       /*
       ** The following two functions together make up an itoa()
       ** implementation. Function i2a() is a 'private' function
       ** called by the public itoa() function.
       **
       ** itoa() takes three arguments:
       ** 1) the integer to be converted,
       ** 2) a pointer to a character conversion buffer,
       ** 3) the radix for the conversion
       ** which can range between 2 and 36 inclusive
       ** range errors on the radix default it to base10
       */
      
      static char *i2a(unsigned i, char *a, unsigned r)
      {
      	if (i/r > 0) a = i2a(i/r,a,r);
      	*a = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"[i%r];
      	return a+1;
      }
      
      char *itoa(int i, char *a, int r)
      {
      	if ((r < 2) || (r > 36)) r = 10;
      	if (i<0) {
      		*a = '-';
      		*i2a(-(unsigned)i,a+1,r) = 0;
      	} else *i2a(i,a,r) = 0;
      	return a;
      }
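
    If you would rather not call malloc, a further variation is to let the caller supply the buffer and use snprintf to stay within it. This is just a sketch along the lines of the first two suggestions; the name int_to_str is made up.

```c
#include <stdio.h>

/* Convert i to decimal text in buf, which has room for len bytes.
 * Returns buf, or NULL if the buffer is too small. */
char *int_to_str(int i, char *buf, size_t len)
{
	int n = snprintf(buf, len, "%d", i);

	if (n < 0 || (size_t) n >= len)
		return NULL;
	return buf;
}
```

    A buffer of 12 bytes is enough for any 32 bit int, including the sign and the terminating NUL.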
      
  40. What is the difference between LinuxThreads and NPTL?
    This answer was provided by Loic Domaigne

    For a starter:

    To tell which one your compiler suite and distro are using, use one of the following two methods:

    1. getconf ( works only on recent glibc )
      [loic@neumann loic]$ getconf GNU_LIBPTHREAD_VERSION
      NPTL 0.61             # that's NPTL v0.61
      
      [loic@neumann loic]$ getconf GNU_LIBPTHREAD_VERSION
      linuxthreads-0.10    # that's LinuxThreads v0.10
      
    2. ldd (always works)

      Assuming that you want to know which libpthread the binary /your_bin/ is using, run the command:

      `ldd your_bin | grep libc.so.6 | cut -d' ' -f 3` | egrep -i 'linuxthreads|nptl'
      

      Example with your_bin=/bin/ls:

      If you have LinuxThreads:
      [loic@neumann loic]$ `ldd /bin/ls | grep libc.so.6 | cut -d' ' -f 3` | egrep -i 'linuxthreads|nptl'
             linuxthreads-0.10 by Xavier Leroy
      
      Or if you have  NPTL:
      [loic@neumann loic]$ `ldd /bin/ls | grep libc.so.6 | cut -d' ' -f 3` | egrep -i 'linuxthreads|nptl'
             NPTL 0.61 by Ulrich Drepper
      

      Note: the version numbers ("0.61" for NPTL and "0.10" for LinuxThreads) might differ depending on your distro/release.
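
    The same string that getconf prints can also be fetched from within a program with confstr. A minimal sketch, assuming a glibc-compatible C library; the function name pthread_library_version is made up.

```c
#include <unistd.h>

/* Store the thread library version string in buf.
 * Returns 0 on success, -1 if the value is unavailable or does not fit. */
int pthread_library_version(char *buf, size_t len)
{
#ifdef _CS_GNU_LIBPTHREAD_VERSION
	/* confstr returns the size needed including the NUL, or 0 on error */
	size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);

	if (n == 0 || n > len)
		return -1;
	return 0;
#else
	(void) buf;
	(void) len;
	return -1;   /* the C library does not know this value */
#endif
}
```

    On success buf holds a string starting with either "NPTL" or "linuxthreads", which a program can test with strncmp.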

  41. How many shared objects can I dlopen simultaneously?
    This answer was provided by Basile Starynkevitch

    Many of them. The only limitation is the process address space (which grows slowly with each dlopen, since it mmaps the object's text and data segments). In practice, on Linux x86 with libc6, a program can dlopen 30,000 (thirty thousand) different shared objects, and probably many more. See http://starynkevitch.net/Basile/manydl.c for an example.

  42. Does dlopen-ing a shared object need a file descriptor?
    This answer was provided by Basile Starynkevitch

    Not permanently (that is, not after the dlopen has succeeded). Internally, dlopen needs to open the shared object, but it closes the file descriptor after having mmapped the object.

  43. What are vdso and linux-gate.so?
    The vdso is a virtual dynamically linked shared object which implements the gate used to perform system calls; linux-gate.so is the name under which it appears on 32-bit x86. Depending on the architecture and kernel version there are different ways to perform a system call. The kernel provides a page with a version suitable for the actual setup, which can then be used by all processes. On this page you can read more about linux-gate.so.

  44. What is a misc device?
    There are two types of device special files representing the two different types of devices: block devices and character devices. A device identifier is split into two numbers, the major and the minor number. The major number is in the range 1-255 and the minor number is in the range 0-255. This allows for a total of 130560 devices, 65280 of each type.

    Though the number of identifiers may seem large, we can still run short of device numbers. This is due to the way they are allocated. Usually a driver will register a major number, and thus this single driver will own all 256 minor numbers under this major.

    Because many drivers only need a single device number, a special character device, the misc device, has allocated one major of which the minors can be allocated individually.

    In /proc/devices you can see the list of character and block drivers currently loaded; the misc device will always be on the list of character devices. In /proc/misc you can see the list of misc drivers currently loaded. The assigned numbers are listed in the file linux/Documentation/devices.txt.

    FIXME: Write the rest of this.

  45. Can I block interrupts from user mode?


  46. Where can I find the source code?
    That depends on what distribution you are using, and what component of the system you want the source for. Most components of the system are available from independent websites. The kernel sources can be found on kernel.org, and other components can be found on other websites.

    Often the sources used in a distribution are not the original sources, but rather a patched version. If you want those sources, you will need to know how to find them for your particular distribution. Availability of sources differs between distributions, but of course all distributors must respect the license. Many components are licensed under GPL, and for those components the modified sources must be available.

    Sometimes you will get the sources with the distribution, sometimes you can order them separately, and sometimes you can download them from the net.

  47. What is the difference between the kernel-source rpm and the kernel source-rpm?
    In Red Hat Linux and Fedora Core there are both a kernel-source rpm file and a kernel source-rpm file. There is a difference, and you need to know which one to use for what. Read this usenet posting.

  48. Why does Linux claim my dual Xeon computer has 4 CPUs?
    It claims so because your computer has 4 CPUs as far as the kernel is concerned. Some Xeon CPUs contain two logical CPUs each; this technique is called hyperthreading.
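
    A program can ask for the same count. A minimal sketch using sysconf with _SC_NPROCESSORS_ONLN, which is available in glibc; the function name logical_cpu_count is made up.

```c
#include <unistd.h>

/* Number of logical CPUs currently online, as the kernel counts them.
 * With hyperthreading each logical CPU is counted separately. */
long logical_cpu_count(void)
{
	return sysconf(_SC_NPROCESSORS_ONLN);
}
```

    On the dual Xeon machine from the question this would be expected to return 4.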

Other stuff
FIXME: Mention /proc/sys/kernel/randomize_va_space (and friends) somewhere
FIXME: Mention /proc/sys/kernel/core_*
FIXME: glibc cannot be statically linked
FIXME: What does kernel thread mean

This page is now valid HTML again. This time HTML 4.01 Transitional. The xmp tags are now converted to pre tags using a small program written for the purpose: xmp2pre.l.
