Created attachment 148707
changes on /sys/arm/broadcom/bcm2835/bcm2835_fb.c

Hi,

I have done some speed optimisations on the Raspberry Pi's console driver. Please pass this on to the Raspberry Pi developers.

Benchmarks (time taken to print and scroll 1 million lines at 1440x900):

--Original FreeBSD driver--
16 bits per pixel: 660 seconds
24 bits per pixel: 817 seconds
32 bits per pixel: 1086 seconds

--My modifications--
16 bits per pixel: 319 seconds
24 bits per pixel: 648 seconds
32 bits per pixel: 336 seconds

I have tested these resolutions (all working), each at every bit depth: 640x480, 800x600, 1024x768, 1440x900

This is my first work on the FreeBSD kernel, and I hope it meets all requirements.

Greetings
Stefan Berndt
This seems to be pretty straightforward. Good job on your first patch, hope it gets approved!
Could you please upload a unified diff?
Created attachment 148738
unified diff on /sys/arm/broadcom/bcm2835/bcm2835_fb.c

Sure. Here comes the diff, in unified form this time.
Just for fun and for learning, I have tried to get more speed with pure assembler code. In the end I was impressed by how well GCC already does the job. The only way to get it faster than GCC was to hardcode every pixel of the character set. It is made for 16 bits per pixel only and printed 1 million lines in 251 seconds. A very poor result for this crazy huge piece of code. Purely for your entertainment, I will show you the resulting file.
Created attachment 149418
maybe the fastest way to draw characters, but crazy huge

Not a real choice, but impressive all the same.
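To illustrate what "hardcoding every pixel" means, here is a minimal C sketch (the attachment itself is assembler, and this is not taken from it). The glyph `glyph_bar`, the function names, and the 8-pixel-wide, one-byte-per-row font layout are all made up for illustration. The generic routine reads the bitmap and tests every bit; the hardcoded routine has the bitmap baked into the code, so it never loads font data or tests bits. The row loop is kept here for brevity, whereas the approach described above removes it as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up 8x4 bitmap for illustration only; not the console font. */
static const uint8_t glyph_bar[4] = { 0x18, 0x18, 0x18, 0x18 };

/* Generic path: read the bitmap and test every bit (MSB = leftmost pixel).
 * stride is the distance between rows, in 16-bit pixels. */
static void
draw_generic(uint16_t *dst, size_t stride, const uint8_t *glyph, int rows,
    uint16_t fg, uint16_t bg)
{
	assert(rows > 0);
	for (int y = 0; y < rows; y++, dst += stride)
		for (int x = 0; x < 8; x++)
			dst[x] = (glyph[y] & (0x80 >> x)) ? fg : bg;
}

/* Hardcoded path: the same glyph with its pixels written out as code.
 * No font loads, no inner loop, no bit tests. */
static void
draw_bar_hardcoded(uint16_t *dst, size_t stride, uint16_t fg, uint16_t bg)
{
	for (int y = 0; y < 4; y++, dst += stride) {
		dst[0] = bg; dst[1] = bg; dst[2] = bg;
		dst[3] = fg; dst[4] = fg;
		dst[5] = bg; dst[6] = bg; dst[7] = bg;
	}
}
```

One such function is needed per character, which is where the code size goes, and changing the font means regenerating all of them.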
Hm, I'm curious about this stuff. It looks like there's various kinds of aligned and unaligned assignments into graphics memory. Maybe the right thing to do is do all 32 bit aligned accesses? That may give the most speed?
I think it is not related to alignment. ARM code cannot be misaligned; this CPU cannot do unaligned accesses. It's simpler than that, as done in my first post:

- one 32-bit access is faster than two 16-bit or four 8-bit accesses
- using pre-calculated color values, so there is no need to shrink 32-bit colors to 24 or 16 bits on every draw of a pixel
- moving calculations of unchanging values out of loops
- sorting the loops in optimal order

The last thing I did, in this assembler file, was to remove every kind of loop: it does not load the font image but uses special code for every character. Unlike my first post, this one should not be used in the official kernel. The resulting code is more than 100 times larger, and you cannot change the font anymore. It's just for fun, education, and to impress.
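The first three points in the list above can be sketched in C roughly as follows. This is not the code from the attached patch: the function names, the RGB565 pixel format, the 8-pixel-wide font with one byte per row, and the little-endian pixel order (left pixel in the low half of each 32-bit word) are all assumptions made for the example. The colors are reduced to 16 bits once, a four-entry table of packed pixel pairs is built once per color change, and each glyph row is then written with four 32-bit stores instead of eight 16-bit ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduce a 0x00RRGGBB color to RGB565 once, instead of per pixel. */
static uint16_t
rgb888_to_rgb565(uint32_t rgb)
{
	return (uint16_t)(((rgb >> 8) & 0xF800) |	/* top 5 bits of red */
	    ((rgb >> 5) & 0x07E0) |			/* top 6 bits of green */
	    ((rgb >> 3) & 0x001F));			/* top 5 bits of blue */
}

/* Build the loop-invariant table: pair[n] holds two packed pixels for the
 * 2-bit font pattern n. Assumes the left pixel lives in the low half. */
static void
make_pair_table(uint32_t pair[4], uint16_t fg, uint16_t bg)
{
	pair[0] = (uint32_t)bg | ((uint32_t)bg << 16);	/* 00: bg, bg */
	pair[1] = (uint32_t)bg | ((uint32_t)fg << 16);	/* 01: bg, fg */
	pair[2] = (uint32_t)fg | ((uint32_t)bg << 16);	/* 10: fg, bg */
	pair[3] = (uint32_t)fg | ((uint32_t)fg << 16);	/* 11: fg, fg */
}

/* Draw an 8-pixel-wide glyph at 16 bpp with four 32-bit stores per row.
 * stride_words is the distance between rows in 32-bit words. */
static void
draw_glyph16(uint32_t *dst, size_t stride_words, const uint8_t *glyph,
    int rows, const uint32_t pair[4])
{
	assert(((uintptr_t)dst & 3) == 0);	/* 32-bit stores need alignment */
	for (int y = 0; y < rows; y++) {
		uint8_t bits = glyph[y];	/* MSB = leftmost pixel */

		dst[0] = pair[(bits >> 6) & 3];
		dst[1] = pair[(bits >> 4) & 3];
		dst[2] = pair[(bits >> 2) & 3];
		dst[3] = pair[bits & 3];
		dst += stride_words;
	}
}
```

A caller would run `make_pair_table()` whenever the foreground or background color changes and then call `draw_glyph16()` per character, so nothing color-related is recomputed inside the drawing loop.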
OK, so I finally got around to this!

FreeBSD-HEAD is using vt now, not syscons. I'll still merge your stuff at some point, but your code is for the syscons console.

For vt, the driver exposes a simple, straight mapped framebuffer to the vt code, which then uses the code in sys/dev/vt/hw/fb/ to draw things. So it also does mostly what you've done, and it's doing it 8, 16, or 32 bits at a time depending upon the bpp depth.

So, I figured I'd write something that just mmap'ed /dev/fb0 into userland and tried 8, 16 and 32 bit stores to see what's faster.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <fcntl.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <err.h>

//fb0: 1184x624(0x0@0,0) 16bpp
#define WIDTH 1184
#define HEIGHT 624
#define BPP 16

// Not true - need to know "stride".
// but treat this as if it's in bytes
#define FB_SIZE (1184*624*2)

/* Return end - start, borrowing a second when the nanoseconds go negative. */
struct timespec
ts_diff(struct timespec start, struct timespec end)
{
	struct timespec temp;

	if ((end.tv_nsec - start.tv_nsec) < 0) {
		temp.tv_sec = end.tv_sec - start.tv_sec - 1;
		temp.tv_nsec = 1000000000 + end.tv_nsec - start.tv_nsec;
	} else {
		temp.tv_sec = end.tv_sec - start.tv_sec;
		temp.tv_nsec = end.tv_nsec - start.tv_nsec;
	}
	return temp;
}

/* Fill the whole mapping with 8-bit stores. */
void
fill_1byte(char *fb, char val)
{
	int i;

	for (i = 0; i < FB_SIZE; i++)
		fb[i] = val;
}

/* Fill the whole mapping with 16-bit stores. */
void
fill_2byte(char *fb, uint16_t val)
{
	uint16_t *f = (void *) fb;
	int i;

	for (i = 0; i < FB_SIZE / 2; i++) {
		f[i] = val;
	}
}

/* Fill the whole mapping with 32-bit stores. */
void
fill_4byte(char *fb, uint32_t val)
{
	uint32_t *f = (void *) fb;
	int i;

	for (i = 0; i < FB_SIZE / 4; i++) {
		f[i] = val;
	}
}

int
main(void)
{
	char *fb = NULL;
	int fd;
	int i;
	struct timespec tv_start, tv_end, tv_diff;

	fd = open("/dev/fb0", O_RDWR);
	if (fd < 0) {
		err(1, "%s: open", __func__);
	}

	fb = mmap(NULL, FB_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (fb == MAP_FAILED) {
		err(1, "%s: mmap", __func__);
	}

	/* Note: %06lld pads the nanosecond field to six digits, not nine. */
	clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_start);
	for (i = 0; i < 100; i++)
		fill_1byte(fb, i);
	clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_end);
	tv_diff = ts_diff(tv_start, tv_end);
	printf("8 bit: 100 runs: %lld.%06lld sec\n", (long long) tv_diff.tv_sec, (long long) tv_diff.tv_nsec);

	clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_start);
	for (i = 0; i < 100; i++)
		fill_2byte(fb, i);
	clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_end);
	tv_diff = ts_diff(tv_start, tv_end);
	printf("16 bit: 100 runs: %lld.%06lld sec\n", (long long) tv_diff.tv_sec, (long long) tv_diff.tv_nsec);

	clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_start);
	for (i = 0; i < 100; i++)
		fill_4byte(fb, i);
	clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_end);
	tv_diff = ts_diff(tv_start, tv_end);
	printf("32 bit: 100 runs: %lld.%06lld sec\n", (long long) tv_diff.tv_sec, (long long) tv_diff.tv_nsec);

	exit(0);
}

.. and the output:

root@raspberry-pi:~ # ./test
8 bit: 100 runs: 4.15364000 sec
16 bit: 100 runs: 2.107316000 sec
32 bit: 100 runs: 1.12614000 sec
root@raspberry-pi:~ #

.. so:

* Your work is good and it's still good for people using syscons, but you should double-check what's in sys/dev/vt/hw/fb/ to see if there's any optimisation there;
* To get really fast speed, we should be doing 32 bit stores, not lots of 8 or 16 bit stores.

The above test filled the same region of memory but with 8, 16 and 32 bit stores. The difference between 8, 16 and 32 bit is quite substantial.
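One thing to watch when reading the timings above: the test prints the nanosecond field with `%06lld`, which pads to six digits rather than nine. A value such as 15364000 ns therefore shows up as `.15364000` and stands for 0.015364 s, so the 8 bit and 32 bit runs took roughly 4.02 and 1.01 seconds. A minimal sketch of a helper that pads to nine digits follows; the name `format_timespec` is made up for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Format a non-negative interval as seconds with nine fractional digits.
 * Returns what snprintf() returns: the number of characters needed. */
static int
format_timespec(char *buf, size_t len, struct timespec ts)
{
	/* tv_nsec must already be normalised to [0, 1e9). */
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);
	return snprintf(buf, len, "%lld.%09ld",
	    (long long)ts.tv_sec, (long)ts.tv_nsec);
}
```

With this, an interval of 4 seconds and 15364000 nanoseconds is formatted as `4.015364000`.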
It's no surprise that one 32-bit operation is faster than four 8-bit operations, since computers have data buses wider than 8 bits. Even the ~35-year-old 8086 already has a 16-bit-wide data bus...

I have got 11.0-CURRENT on my Raspberry Pi, and ran the speed test again (time taken to print and scroll 1 million lines at 1440x900 in VT mode):

16 bits per pixel: 918 seconds
24 bits per pixel: 1133 seconds
32 bits per pixel: 903 seconds

It seems there is some room for optimisation. I will look at the vt code in the near future.