Bug 194635 - Speed optimisation for framebuffer console driver on Raspberry Pi
Status: Closed Overcome By Events
Alias: None
Product: Base System
Classification: Unclassified
Component: arm
Version: 10.0-RELEASE
Hardware: arm Any
Importance: Normal Affects Many People
Assignee: freebsd-arm (Nobody)
 
Reported: 2014-10-27 17:28 UTC by Stefan Berndt
Modified: 2018-09-22 16:30 UTC (History)


Attachments
changes on /sys/arm/broadcom/bcm2835/bcm2835_fb.c (3.38 KB, patch)
2014-10-27 17:28 UTC, Stefan Berndt
unified diff on /sys/arm/broadcom/bcm2835/bcm2835_fb.c (3.91 KB, patch)
2014-10-28 16:07 UTC, Stefan Berndt
maybe the fastest way to draw characters, but crazy huge (262.14 KB, text/plain)
2014-11-14 21:09 UTC, Stefan Berndt

Description Stefan Berndt 2014-10-27 17:28:37 UTC
Created attachment 148707
changes on /sys/arm/broadcom/bcm2835/bcm2835_fb.c

Hi,

I have done some speed optimisations on the Raspberry Pi's console driver.
Please give this to the Raspberry Pi developers.

Benchmarks (time taken to print and scroll 1 million lines at 1440x900):
--Original FreeBSD driver--
16 bits per pixel: 660 seconds
24 bits per pixel: 817 seconds
32 bits per pixel: 1086 seconds

--My modifications--
16 bits per pixel: 319 seconds
24 bits per pixel: 648 seconds
32 bits per pixel: 336 seconds

I have tested the following resolutions, each at all three colour depths, and all of them work:
640x480, 800x600, 1024x768, 1440x900

This is my first work on the FreeBSD kernel, and I hope it meets all requirements.

Greetings
Stefan Berndt
Comment 1 Felix 2014-10-27 20:59:32 UTC
This seems to be pretty straightforward. Good job on your first patch, hope it gets approved!
Comment 2 Rui Paulo freebsd_committer freebsd_triage 2014-10-28 01:06:07 UTC
Could you please upload a unified diff?
Comment 3 Stefan Berndt 2014-10-28 16:07:50 UTC
Created attachment 148738
unified diff on /sys/arm/broadcom/bcm2835/bcm2835_fb.c

Sure, here is the diff, in unified format this time.
Comment 4 Stefan Berndt 2014-11-14 21:05:31 UTC
Just for fun and to learn, I tried to get more speed with pure assembler code. In the end I was impressed by how good a job GCC already does. The only way to beat GCC was to hard-code every pixel of the character set. The code is written for 16 bpp only and prints 1 million lines in 251 seconds. That is a very poor result for such a crazily huge piece of code. Purely for your entertainment, I will attach the resulting file.
Comment 5 Stefan Berndt 2014-11-14 21:09:06 UTC
Created attachment 149418
maybe the fastest way to draw characters, but crazy huge

Not a realistic option, but impressive all the same.
Comment 6 Adrian Chadd freebsd_committer freebsd_triage 2014-11-15 00:02:17 UTC
Hm, I'm curious about this stuff. It looks like there's various kinds of aligned and unaligned assignments into graphics memory.

Maybe the right thing to do is do all 32 bit aligned accesses? That may give the most speed?
Comment 7 Stefan Berndt 2014-11-15 07:47:38 UTC
I don't think it is related to alignment. ARM code cannot be misaligned; this CPU cannot do unaligned accesses.

It is simpler than that, and it is what I did in my first post:
- one 32-bit access is faster than two 16-bit or four 8-bit accesses
- use pre-calculated colour values, so there is no need to shrink 32-bit colours to 24 or 16 bits every time a pixel is drawn
- move calculations of values that do not change out of the loops
- order the loops optimally
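
As a rough illustration of the first two points (this is not the code from the attached patch), the sketch below converts a colour to 16 bpp once, outside the drawing loop, and then writes two pixels per 32-bit store. The function names, the RGB565 layout and the little-endian pixel order are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Reduce a 24-bit RGB colour to RGB565 once, outside any drawing loop. */
static uint16_t
rgb888_to_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
        return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/*
 * Draw one 8-pixel-wide glyph row at 16 bpp using four 32-bit stores.
 * Bit 7 of 'bits' is the leftmost pixel. Assumes a little-endian
 * framebuffer, so the left pixel of each pair sits in the low half-word.
 */
static void
draw_glyph_row16(uint32_t *row, uint8_t bits, uint16_t fg, uint16_t bg)
{
        int i;

        /* The destination must be 4-byte aligned for 32-bit stores. */
        assert(((uintptr_t)row & 3) == 0);

        for (i = 0; i < 4; i++) {
                uint16_t left = (bits & 0x80) ? fg : bg;
                uint16_t right = (bits & 0x40) ? fg : bg;

                row[i] = (uint32_t)left | ((uint32_t)right << 16);
                bits = (uint8_t)(bits << 2);
        }
}
```

A caller would compute fg and bg once per character (or once per colour change) and pass them to every row, so the colour conversion never runs inside the pixel loop.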

The last thing I did, in the assembler file, was to remove every kind of loop: it does not load the font image but uses special code for every character. Unlike the patch in my first post, this one should not be used in an official kernel. The resulting code is more than 100 times larger, and you cannot change the font anymore. It is just for fun, education and to impress.
Comment 8 Adrian Chadd freebsd_committer freebsd_triage 2014-12-29 06:41:25 UTC
Ok, so I finally got around to this!

FreeBSD-HEAD is using vt now, not syscons. I'll still merge your stuff at some point, but your code is for the syscons console. For vt, the driver exposes a plain mapped framebuffer to the vt code, which then uses the code in sys/dev/vt/hw/fb/ to draw things.

So the vt code already does most of what you've done, drawing 8, 16 or 32 bits at a time depending on the bpp.
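
To illustrate what picking the store width from the depth looks like, here is a minimal sketch of a per-pixel store on a linear framebuffer. The struct fb_desc and fb_put_pixel names are hypothetical, and this is not the code in sys/dev/vt/hw/fb/; the byte order used for 24 bpp is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a linear framebuffer (not a vt structure). */
struct fb_desc {
        uint8_t *base;          /* start of the mapped framebuffer */
        size_t   stride;        /* bytes per scanline */
        int      bpp;           /* bits per pixel: 8, 16, 24 or 32 */
};

/* Store one pixel using a single access of the natural width where possible. */
static void
fb_put_pixel(const struct fb_desc *fb, size_t x, size_t y, uint32_t color)
{
        uint8_t *p = fb->base + y * fb->stride + x * (size_t)(fb->bpp / 8);

        switch (fb->bpp) {
        case 8:
                *p = (uint8_t)color;
                break;
        case 16:
                *(uint16_t *)(void *)p = (uint16_t)color;
                break;
        case 24:
                /* No 24-bit store exists: write three bytes, low byte first. */
                p[0] = (uint8_t)(color & 0xff);
                p[1] = (uint8_t)((color >> 8) & 0xff);
                p[2] = (uint8_t)((color >> 16) & 0xff);
                break;
        case 32:
                *(uint32_t *)(void *)p = color;
                break;
        default:
                assert(!"unsupported depth");
        }
}
```

The 24 bpp case is the only one that needs several stores per pixel.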

So, I figured I'd write something that just mmap'ed /dev/fb0 into userland and tried 8, 16 and 32 bit stores to see what's faster.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <fcntl.h>
#include <time.h>
#include <sys/mman.h>
#include <sys/types.h>

#include <err.h>

//fb0: 1184x624(0x0@0,0) 16bpp

#define WIDTH   1184
#define HEIGHT  624
#define BPP     16

// Not strictly correct: the real size depends on the "stride"
// (bytes per scanline). Treat this as the framebuffer size in bytes.
#define FB_SIZE (WIDTH * HEIGHT * (BPP / 8))

struct timespec
ts_diff(struct timespec start, struct timespec end)
{
        struct timespec temp;

        if ((end.tv_nsec-start.tv_nsec)<0) {
                temp.tv_sec = end.tv_sec-start.tv_sec-1;
                temp.tv_nsec = 1000000000+end.tv_nsec-start.tv_nsec;
        } else {
                temp.tv_sec = end.tv_sec-start.tv_sec;
                temp.tv_nsec = end.tv_nsec-start.tv_nsec;
        }
        return temp;
}

void
fill_1byte(char *fb, char val)
{
        int i;
        for (i = 0; i < FB_SIZE; i++)
                fb[i] = val;
}

void
fill_2byte(char *fb, uint16_t val)
{
        uint16_t *f = (void *) fb;
        int i;

        for (i = 0; i < FB_SIZE / 2; i++) {
                f[i] = val;
        }
}

void
fill_4byte(char *fb, uint32_t val)
{
        uint32_t *f = (void *) fb;
        int i;

        for (i = 0; i < FB_SIZE / 4; i++) {
                f[i] = val;
        }
}

int
main(int argc, const char *argv[])
{
        char *fb = NULL;
        int fd;
        int i;
        struct timespec tv_start, tv_end, tv_diff;

        fd = open("/dev/fb0", O_RDWR);
        if (fd < 0) {
                err(1, "%s: open", __func__);
        }

        fb = mmap(NULL, FB_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (fb == MAP_FAILED) {
                err(1, "%s: mmap", __func__);
        }

        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_start);
        for (i = 0; i < 100; i++)
                fill_1byte(fb, i);
        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_end);
        tv_diff = ts_diff(tv_start, tv_end);
        /* tv_nsec is in nanoseconds: zero-pad to 9 digits */
        printf("8 bit: 100 runs: %lld.%09lld sec\n",
            (long long) tv_diff.tv_sec,
            (long long) tv_diff.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_start);
        for (i = 0; i < 100; i++)
                fill_2byte(fb, i);
        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_end);
        tv_diff = ts_diff(tv_start, tv_end);
        /* tv_nsec is in nanoseconds: zero-pad to 9 digits */
        printf("16 bit: 100 runs: %lld.%09lld sec\n",
            (long long) tv_diff.tv_sec,
            (long long) tv_diff.tv_nsec);

        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_start);
        for (i = 0; i < 100; i++)
                fill_4byte(fb, i);
        clock_gettime(CLOCK_MONOTONIC_PRECISE, &tv_end);
        tv_diff = ts_diff(tv_start, tv_end);
        /* tv_nsec is in nanoseconds: zero-pad to 9 digits */
        printf("32 bit: 100 runs: %lld.%09lld sec\n",
            (long long) tv_diff.tv_sec,
            (long long) tv_diff.tv_nsec);

        exit(0);
}

.. and the output:

root@raspberry-pi:~ # ./test 
8 bit: 100 runs: 4.15364000 sec
16 bit: 100 runs: 2.107316000 sec
32 bit: 100 runs: 1.12614000 sec
root@raspberry-pi:~ # 

.. so:

* Your work is good and it's still good for people using syscons, but you should double-check what's in sys/dev/vt/hw/fb/ to see if there's any optimisation there;
* To get really fast speed, we should be doing 32 bit stores, not lots of 8 or 16 bit stores. The above test filled the same region of memory but with 8, 16 and 32 bit stores. The difference is quite substantial: each doubling of the store width roughly halved the run time.
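
A minimal sketch of the second point for a 16 bpp framebuffer: fill a run of pixels with aligned 32-bit stores, falling back to a single 16-bit store at an unaligned start or an odd end. fill16_wide is a hypothetical helper written for this example, not an existing kernel function.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Fill 'count' 16-bit pixels starting at 'dst' with 'color'.
 * The aligned middle part is written two pixels at a time.
 */
static void
fill16_wide(uint16_t *dst, size_t count, uint16_t color)
{
        /* Both halves hold the same colour, so byte order does not matter. */
        uint32_t pair = (uint32_t)color | ((uint32_t)color << 16);
        uint32_t *wide;
        size_t i, pairs;

        /* One narrow store if the run does not start on a 4-byte boundary. */
        if (count > 0 && ((uintptr_t)dst & 3) != 0) {
                *dst++ = color;
                count--;
        }

        /* Aligned 32-bit stores for the bulk of the run. */
        wide = (uint32_t *)(void *)dst;
        pairs = count / 2;
        for (i = 0; i < pairs; i++)
                wide[i] = pair;

        /* One narrow store if an odd pixel is left at the end. */
        if (count & 1)
                dst[count - 1] = color;
}
```

With this shape, a run of N pixels costs about N/2 stores instead of N, which is the effect the userland test above measures.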
Comment 9 Stefan Berndt 2015-01-04 11:53:00 UTC
It's no surprise that one 32-bit operation is faster than four 8-bit operations, since computers have data buses wider than 8 bits. Even the ~35-year-old 8086 already has a 16-bit data bus...

I have installed 11.0-CURRENT on my Raspberry Pi and ran the speed test again
(time taken to print and scroll 1 million lines at 1440x900 in vt mode):

16 bits per pixel: 918 seconds
24 bits per pixel: 1133 seconds
32 bits per pixel: 903 seconds

It seems there is some room for optimisation. I will look at the vt code in the near future.