kerio | and the ludicrous intel DC P3608 4TB gets 3GB/s of sequential writes | 00:00 |
---|---|---|
kerio | (PCIe 3.0 x8) | 00:00 |
DocScrutinizer05 | highly irrelevant | 00:01 |
kerio | it's only 8959.99 USD on amazon.com | 00:01 |
DocScrutinizer05 | the point is that writing zeroes isn't a working replacement for TRIM | 00:03 |
kerio | indeed | 00:03 |
DocScrutinizer05 | the controller needs to receive a hint that the block doesn't contain valid data | 00:04 |
kerio | it should be "easy" to test if that MMC ERASE command works | 00:04 |
kerio | write random data over the whole thing | 00:04 |
kerio | then write random data again, measuring the speed | 00:05 |
kerio | then ERASE everything | 00:05 |
kerio | then write random data again and measure the speed | 00:05 |
DocScrutinizer05 | exactly | 00:05 |
kerio | do we have a debug utility for the eMMC? | 00:05 |
DocScrutinizer05 | also exactly what I had in mind, for swap | 00:05 |
DocScrutinizer05 | no | 00:05 |
DocScrutinizer05 | afaik | 00:05 |
kerio | would it need a more recent kernel? | 00:05 |
DocScrutinizer05 | prolly needs backport of the ERASE (or TRIM) ioctl command to mmc_core | 00:06 |
DocScrutinizer05 | or a more recent kernel, freemangordon coult test it | 00:06 |
DocScrutinizer05 | could* | 00:06 |
DocScrutinizer05 | http://lxr.free-electrons.com/source/drivers/mmc/core/core.c#L2198 | 00:07 |
DocScrutinizer05 | (wildly guessing there, no kernel developer) | 00:09 |
DocScrutinizer05 | modinfo mmc_core | 00:11 |
DocScrutinizer05 | objdump -t /lib/modules/2.6.28-omap1/mmc_core.ko dunno | 00:13 |
DocScrutinizer05 | freemangordon: could you test fstrim on emmc? | 00:37 |
*** DrCode has quit IRC | 00:38 | |
DocScrutinizer05 | (make sure eMMC volume isn't mounted -o discard) | 00:38 |
kerio | why shouldn't it | 00:39 |
kerio | fstrim should still work, right | 00:39 |
DocScrutinizer05 | otherwise we have unsolicited TRIM in between | 00:39 |
*** trx has quit IRC | 00:40 | |
DocScrutinizer05 | so any such test would be rather meaningless with -o discard, no? | 00:40 |
kerio | oh, performance tests | 00:41 |
kerio | yeah, if it worked | 00:41 |
DocScrutinizer05 | <kerio> write random data over the whole thing then write random data again, measuring the speed then ERASE [rm -r *; fstrim] everything then write random data again and measure the speed | 00:44 |
*** trx has joined #maemo-ssu | 00:44 | |
*** DrCode has joined #maemo-ssu | 00:53 | |
*** handaxe has joined #maemo-ssu | 00:57 | |
*** handaxe has quit IRC | 01:01 | |
*** freemangordon has quit IRC | 01:05 | |
*** freemangordon has joined #maemo-ssu | 01:12 | |
ShadowJK | I have the impression that there isn't all that much sophistication that can be squeezed into emmc, that trim is mostly a NOOP unless you give it a full 8MB block properly aligned that it can erase | 01:33 |
ShadowJK | Or however big it gets reported as in /sys/block/.../preferred_erase_size or something like that | 01:33 |
DocScrutinizer05 | ShadowJK: TRIM is not about erase | 01:40 |
DocScrutinizer05 | https://www.youtube.com/watch?v=x6lqYU4j7no | 01:42 |
DocScrutinizer05 | when the controller copies an erase page to change one block in it, it can skip copying the blocks tagged as TRIMed | 01:43 |
DocScrutinizer05 | so those are fresh unused blocks on the new page, ready to take new data | 01:44 |
DocScrutinizer05 | worst case, when all blocks have been used to write some (possibly already obsolete) data to them, each write of one block (to overwrite the obsolete old content) involves a copy of one complete erase page just to replace that one block | 01:46 |
DocScrutinizer05 | if all the blocks of the page have been tagged as TRIMed, the copy would result in just one used and many free blocks in the new erase page | 01:47 |
DocScrutinizer05 | when you fill the complete MMC with one file and then delete that file on fs level, subsequent writes to the device to fill it again completely with data would either cause $number-of-blocks page copies without TRIM, or only $number-of-erasepages copies with TRIM | 01:49 |
DocScrutinizer05 | to accomplish that on controller level, you need just one bit per block in metadata | 01:51 |
kerio | that's how things should go | 01:58 |
kerio | on the other hand, hardware manufacturers will likely do the absolute bare minimum for anything | 01:59 |
kerio | i mean | 01:59 |
kerio | actual SSDs that *do* advertise ATA TRIM support actually fuck it up | 01:59 |
kerio | because of firmware bugs | 01:59 |
DocScrutinizer05 | that's a completely different story | 02:00 |
kerio | do you really expect a MMC firmware to handle a barely used feature correctly and in a way that enhances performance | 02:00 |
kerio | maybe you can ask about it for the neo900 | 02:01 |
DocScrutinizer05 | hardware manufacturers try to create as good a product as possible from a given amount of resources. Two bits per block used to tag free blocks with either 00 or 11 while used blocks are 01 don't cost them anything and will provide a selling point in the datasheet | 02:02 |
DocScrutinizer05 | barely used is nonsense | 02:02 |
kerio | well, is it a point in the n900 emmc datasheet? | 02:02 |
DocScrutinizer05 | obviously all android phones use that | 02:02 |
kerio | i mean, i'd pay more for it | 02:02 |
kerio | but nokia probably didn't | 02:02 |
DocScrutinizer05 | the point is you won't have to pay more for it | 02:03 |
DocScrutinizer05 | it's a mere one-shot effort to implement it in controller firmware | 02:03 |
DocScrutinizer05 | so the cost per chip ~= zilch | 02:04 |
DocScrutinizer05 | and microsoft obviously even specifies a max duration a single block write may take, or something along that line, which is only achievable with proper TRIM support | 02:08 |
DocScrutinizer05 | the datasheet for eMMC in N900 says it's >>Full compliance with JEDEC/MMCA Ver. 4.3<<, so all you have to do is to find the specs JEDEC only publishes for registered users | 02:12 |
Pali | normal trim command cannot be sent in queue for ATA disks | 02:23 |
Pali | so before sending trim, you need to wait until the queue of commands is empty | 02:23 |
Pali | and so using trim can slow down read/write operations of disks | 02:24 |
Pali | yes, there is also a queued trim ATA command, but it is not supported by Microsoft and Apple systems | 02:25 |
Pali | and so if something advertises that it supports it, it is buggy | 02:26 |
Pali | before playing with discard on linux, look at this loooong table: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/ata/libata-core.c#n4270 | 02:26 |
kerio | samsung controller botched queued trim => queued trim is broken for every SSD forever | 02:31 |
kerio | seems good | 02:31 |
DocScrutinizer05 | I'm not interested in discard. TRIM however will be mad useful | 02:31 |
DocScrutinizer05 | -o discard is arguably not the best way to do TRIM anyway | 02:32 |
Pali | discard is just linux API for trim for FS | 02:32 |
kerio | he wants to use trim or something like it for the swap partition | 02:32 |
DocScrutinizer05 | please read about batch trim vs online trim | 02:32 |
kerio | (i'm not entirely sure it's a thing on linux tbh) | 02:32 |
DocScrutinizer05 | refer to fstrim | 02:33 |
DocScrutinizer05 | for example | 02:33 |
Pali | discard is a good idea, but only useful when all layers support queued trim && queued trim is implemented correctly in FW | 02:33 |
kerio | i'm pretty sure that both me and Pali understand the difference, doc | 02:33 |
DocScrutinizer05 | no, queued trim is only needed for -o discard aka online trim | 02:34 |
kerio | Pali: the real best way to "do TRIM" is to aggressively reuse sectors, anyway | 02:34 |
DocScrutinizer05 | huh? | 02:34 |
kerio | so you don't need separate commands except when you absolutely have to | 02:34 |
DocScrutinizer05 | sorry that's absolute nonsense | 02:34 |
kerio | DocScrutinizer05: on a SSD, TRIM means "i don't need this LBA address anymore" | 02:35 |
DocScrutinizer05 | the whole point about trim is that you _cannot_ 'reuse sectors' | 02:35 |
kerio | wut | 02:35 |
kerio | the controller will remap your writes all over the place anyway | 02:35 |
DocScrutinizer05 | each "reuse sector" means you need to do a erase page copy | 02:36 |
kerio | ...no it doesn't | 02:36 |
kerio | unless there's no more space | 02:36 |
DocScrutinizer05 | which is exactly what TRIM accomplishes: free space | 02:36 |
*** dafox has quit IRC | 02:37 | |
kerio | yes, but if you're deleting a file and creating a new one | 02:37 |
kerio | you can just put the new one on the same logical address of the old one | 02:37 |
DocScrutinizer05 | so what? | 02:37 |
kerio | and you'll have the same effect without having to issue a separate command | 02:37 |
DocScrutinizer05 | no you can't use the same physical address | 02:37 |
kerio | yes | 02:37 |
kerio | which is why i said logical address | 02:38 |
DocScrutinizer05 | please, read e.g. http://www.thessdreview.com/daily-news/latest-buzz/garbage-collection-and-trim-in-ssds-explained-an-ssd-primer/ | 02:38 |
DocScrutinizer05 | you using same logical address means the complete physical erase page needs to get copied to change your one sector you write to | 02:39 |
Pali | anyway, isn't it better to have direct access to NAND erase blocks and use e.g. ubifs on NAND directly? | 02:39 |
kerio | DocScrutinizer05: that's for the ssd controller to decide | 02:39 |
DocScrutinizer05 | so "aggressively reuse sectors" is meaningless at best, worst case more likely | 02:39 |
DocScrutinizer05 | Pali: we talk about eMMC | 02:40 |
DocScrutinizer05 | not NAND aka mtd | 02:40 |
kerio | Pali: in theory, sure | 02:40 |
kerio | in practice, separation of concerns has proven to be more successful | 02:40 |
DocScrutinizer05 | actually ubifs implements pretty much exactly the same scheme on application processor which the controller of emmc uses for TRIM and wear leveling | 02:41 |
kerio | i'd trust a SSD plus ZFS over UBIFS on raw flash, if only because the tools are better | 02:41 |
Pali | isn't eMMC some flash or nand memory with its own software on it? | 02:42 |
DocScrutinizer05 | on MMC your only way to have some control over page erases is to use ERASE/TRIM | 02:42 |
kerio | yeah, and in theory more control should yield better results | 02:42 |
kerio | which is the same argument for software raid over hardware raid | 02:42 |
kerio | however that hasn't become the case in modern computer hardware | 02:43 |
kerio | probably because SSD controllers that take the raw flash and turn it into a perfect block device are Good Enough | 02:43 |
DocScrutinizer05 | only as long as they can keep the erase pages for all concurrent write(pointer)s in buffer RAM | 02:49 |
DocScrutinizer05 | as soon as the buffer RAM gets filled they need to write back one erase page sized chunk of data to make space for reading in another page so the next sector/block write can modify it | 02:51 |
DocScrutinizer05 | and depending on several other system parameters you don't want to keep large amounts of dirty buffers all the time, since... powerfail | 02:52 |
DocScrutinizer05 | I guess some SSDs even have their own battery to write back dirty buffers on powerfail (incl regular power down powerfail) | 02:53 |
DocScrutinizer05 | generally speaking you have little to no problems with SSD and TRIM and performance impact therefrom as long as you do a single sequential write, since all controllers can keep a single erasepage in RAM buffer | 02:55 |
DocScrutinizer05 | and they usually won't write it back until another erase page gets accessed or a certain timeout expired between interface write() commands | 02:56 |
DocScrutinizer05 | so the trivial controller can always read an erase page into RAM1, wait until all blocks of that page got modified by sequential write() commands from system, then on next write() after that read in the next erasepage into RAM2 with the sector to modify and write back RAM1_dirty to flash while RAM2 gets modified by further sequential write() cmds | 03:00 |
DocScrutinizer05 | for random access writes stuff soon starts to get cluttered | 03:01 |
DocScrutinizer05 | so to really test TRIM, you'd probably want 500 files size 1/500 of SSD capacity each, then truncate them to 1 byte length each, do trim, and then append to each of them concurrently again | 03:05 |
DocScrutinizer05 | while this is the 100% worst case scenario, in normal operation similar scenarios will happen more often than not, while strictly sequential write is the rather unlikely usecase | 03:06 |
DocScrutinizer05 | a pretty everyday scenario for almost worst case: swap | 03:08 |
DocScrutinizer05 | unless you configure kswapd in a way so it always writes complete aligned erasepage-sized chunks | 03:10 |
DocScrutinizer05 | as soon as you have one byte misalignment, optimum case turns into worst case where every swapped out page involves two erasepage read modify erase write cycles | 03:12 |
DocScrutinizer05 | btw on HDD you see similar effects when your drive always read/modify/writes a complete track instead of on-the-fly insert-writing a single sector | 03:15 |
DocScrutinizer05 | just a page erase on SSD takes much longer than one platter stack spin in a HDD | 03:16 |
DocScrutinizer05 | ((strictly sequential write is the rather unlikely usecase)) also think fragmentation which happens on a FS level and is not visible/understandable by SSD/MMC controller | 03:41 |
DocScrutinizer05 | writing a file sequentially into a fragmented filesystem also is pretty much random access write | 03:42 |
DocScrutinizer05 | note that some SD-card controllers even are known to understand FAT fs and do shadow trim by locating and observing the used blocks table(s) | 03:44 |
DocScrutinizer05 | I don't know how risky that is, I guess they must have implemented quite some heuristics and safeguard monitors to stop this as soon as the slightest doubt about the FS used comes up | 03:46 |
*** merlin1991 has quit IRC | 03:53 | |
*** Pali has quit IRC | 04:21 | |
*** DrCode has quit IRC | 05:32 | |
*** DrCode has joined #maemo-ssu | 05:52 | |
*** DocScrutinizer05 has quit IRC | 07:00 | |
*** DocScrutinizer05 has joined #maemo-ssu | 07:00 | |
*** Sicelo has quit IRC | 07:59 | |
*** NIN101 has quit IRC | 07:59 | |
*** dos1 has quit IRC | 07:59 | |
*** handaxe has joined #maemo-ssu | 11:02 | |
*** handaxe has quit IRC | 11:06 | |
*** LauRoman|Alt has joined #maemo-ssu | 11:13 | |
*** dos1 has joined #maemo-ssu | 11:53 | |
*** Sicelo has joined #maemo-ssu | 11:54 | |
*** NIN101 has joined #maemo-ssu | 11:54 | |
ShadowJK | I've always wanted to have this on my phones: https://lwn.net/Articles/518988/ | 12:23 |
*** handaxe has joined #maemo-ssu | 12:28 | |
*** handaxe has quit IRC | 12:29 | |
*** LauRoman|Alt has quit IRC | 12:45 | |
*** Pali has joined #maemo-ssu | 13:18 | |
*** merlin1991 has joined #maemo-ssu | 14:01 | |
*** dafox has joined #maemo-ssu | 14:44 | |
*** dafox has quit IRC | 15:16 | |
*** dafox has joined #maemo-ssu | 15:52 | |
DocScrutinizer05 | hehe >>It seems that as hardware gets smarter, we need to make even more clever software to manage that "smartness"<< | 16:07 |
DocScrutinizer05 | hmmm >>One area of difficulty is that the shape of an f2fs (such as section and zone size) needs to be tuned to the particular flash device and its FTL; vendors are notoriously secretive about exactly how their FTL works. f2fs also requires that the flash device is comfortable having six or more concurrently "open" write areas.<< | 16:10 |
DocScrutinizer05 | 6 yeafrs old? | 16:31 |
DocScrutinizer05 | years even | 16:31 |
DocScrutinizer05 | oh nope, only 4 | 16:32 |
ShadowJK | I have some cards that are comfortable with 12 open areas | 18:38 |
ShadowJK | In any case, 6 open areas is always better than random | 18:38 |
kerio | hm | 18:40 |
kerio | can we replace the internal eMMC with something based on ram? | 18:40 |
kerio | surely making things non-volatile is harder | 18:40 |
bencoh | ? | 18:57 |
DocScrutinizer05 | kerio: I'm searching for mixed RAM/flash chips with a sincle storage interface (aka ramdisk) since ages, nothing found so far | 20:40 |
DocScrutinizer05 | single* | 20:40 |
DocScrutinizer05 | I mean, how hard could it be to implement one-plus GB RAM buffer in FTL and operate it in a dedicated overlay mode where you define a start address where buffer is used instead of ever writing back stuff to flash? | 20:42 |
kerio | and then you remove the flash | 20:43 |
DocScrutinizer05 | FTLs already implement write protect afaik | 20:43 |
DocScrutinizer05 | I would prefer still keeping flash in same chip behind same interface/bus | 20:44 |
kerio | but who cares about flash | 20:45 |
DocScrutinizer05 | everybody? | 20:45 |
kerio | we have microSDs for that | 20:45 |
*** LauRoman|Alt has joined #maemo-ssu | 22:31 |
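
The ERASE test kerio outlines at 00:04 and 00:05 (fill with random data, rewrite and measure, ERASE, rewrite and measure again) could be sketched roughly as below. This is only an illustration under assumptions: `timed_random_write` and `erase_benchmark` are hypothetical helper names, and the `discard` step is left as a caller-supplied callable because the log never settles on how ERASE/TRIM would actually be issued on the N900.

```python
import os
import time


def timed_random_write(path, total_bytes, chunk_bytes=1 << 20):
    """Overwrite `path` in place with random data; return bytes per second.

    Only the write() and fsync() calls are timed, so the cost of
    generating the random data does not distort the result.
    """
    elapsed = 0.0
    written = 0
    # No O_TRUNC: the same logical range is rewritten on every pass.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            view = memoryview(os.urandom(n))
            t0 = time.perf_counter()
            while len(view) > 0:
                view = view[os.write(fd, view):]
            elapsed += time.perf_counter() - t0
            written += n
        t0 = time.perf_counter()
        os.fsync(fd)  # make sure the data really reached the device
        elapsed += time.perf_counter() - t0
    finally:
        os.close(fd)
    return written / elapsed if elapsed > 0 else float("inf")


def erase_benchmark(path, total_bytes, discard, chunk_bytes=1 << 20):
    """Run the four steps from the log; `discard` is a hypothetical hook."""
    timed_random_write(path, total_bytes, chunk_bytes)           # 1. fill
    before = timed_random_write(path, total_bytes, chunk_bytes)  # 2. rewrite, measure
    discard()                                                    # 3. ERASE / TRIM
    after = timed_random_write(path, total_bytes, chunk_bytes)   # 4. rewrite, measure
    return before, after
```

On real hardware `discard` might delete the test file and run `fstrim` on the mount point, following DocScrutinizer05's `[rm -r *; fstrim]` note, with the volume not mounted `-o discard` so that no TRIM is issued between the measurements. A clearly higher `after` than `before` would suggest the controller makes use of the hint.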
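
DocScrutinizer05's arithmetic from 01:46 to 01:49 ($number-of-blocks page copies without TRIM versus $number-of-erasepages with TRIM) can be reproduced with a toy model. It encodes only the behaviour he describes: writing to a block that is not erased forces a copy of its erase page, and blocks tagged as TRIMed are skipped during that copy and come out free. It ignores the remapping kerio brings up, so it should be read as a sketch of the argument and not as a description of any real controller firmware. The function name and state labels are made up for the example.

```python
VALID, TRIMMED, FREE = "valid", "trimmed", "free"


def refill_copies(num_pages, blocks_per_page, trim_first):
    """Count erase-page copies needed to rewrite every block of a full device.

    Assumed model: the device starts completely filled. If `trim_first`
    is true, every block was tagged as TRIMed before the refill starts.
    """
    initial = TRIMMED if trim_first else VALID
    pages = [[initial] * blocks_per_page for _ in range(num_pages)]
    copies = 0
    for page in pages:
        for b in range(blocks_per_page):
            if page[b] != FREE:
                # Target block is not erased: copy the whole erase page.
                copies += 1
                # TRIMed blocks are not carried over and end up free.
                for i, state in enumerate(page):
                    if state == TRIMMED:
                        page[i] = FREE
            page[b] = VALID  # the new data now lives in this block
    return copies


print(refill_copies(4, 8, trim_first=False))  # 32, one copy per block
print(refill_copies(4, 8, trim_first=True))   # 4, one copy per erase page
```

In this model the gap between the two cases grows linearly with the number of blocks per erase page.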