Wikipedia:Reference desk/Archives/Computing/2012 September 25

= September 25 =

PDF formats
I'm working with a bunch of .pdf files that are scanned documents, and for some reason they seem to "blur" in before displaying, akin to texture pop-in. They're also glacially slow to print. I'm guessing that this is some sort of compression? Is there any way to fix this and turn it into a "normal" .pdf? If so, is there any way to do so in a batch so I can do entire directories at once without manually telling it each time? I realize this is a vague question, but that's why I can't just use "the Google." 150.148.0.65 (talk) 16:20, 25 September 2012 (UTC)


 * It sounds like the images which are embedded in the PDF have been scanned at an unnecessarily high resolution. This answer describes how to use GhostScript to reduce the dots-per-inch of the images in a PDF.  You could use an Optical character recognition program to read the PDFs and (hopefully) turn the text shown in images there into actual PDF text - with the right settings, it can delete the images and emit a comparatively tiny PDF document. But OCR tends to be slow itself, and it's never perfect, so that might not be the solution for you. -- Finlay McWalterჷTalk 16:26, 25 September 2012 (UTC)
 * Could the images be something like progressive JPEG? 69.228.171.70 (talk) 16:33, 25 September 2012 (UTC)


 * It's annoyingly difficult to translate PDFs with one image encoding into PDFs with another (e.g. even just trying to convert from color to grayscale, much less grayscale to black and white — note also that the PDF format itself is more or less the "envelope" that can hold a number of different image formats within it). Acrobat or Ghostscript can downsample the images, but that might not be what you're looking for — the problem may not be that they're too large, but that they're just encoded in a weird way. In the past the only way I've been able to accomplish this has been to extract all of the pages as images, manipulate said images, and then reconvert them into a PDF. You can do this with ImageMagick+GhostScript, but it's all command-line interfaces and a bit of a pain. The other alternative is to try switching to a different PDF reader and see if that improves things; some are faster than others when it comes to rendering different formats. --Mr.98 (talk) 17:02, 25 September 2012 (UTC)


 * It's a work computer and work-related files, so I can't install other programs and I can't move them to another computer. I've had some success by printing the files to file and re-creating the .pdf that way, it's just incredibly slow, gives errors, and results in files of double the original size.  I'll try OCR, it may be slow but I can just set it up and work on something else while it cooks.  150.148.0.65 (talk) 17:11, 25 September 2012 (UTC)


 * You could try opening the PDFs in a text editor (Wordpad or MS Word would probably work, but I recommend against Notepad since it doesn't understand Unix line endings) and searching for /Height. The number after /Height tells you the height of that image in pixels, the number after /Width (within the same matching << >> brackets) is the width, and the symbol after /Filter is the compression method. Scanned documents often use JBIG2, in which case you'll see /Filter /JBIG2Decode, while JPEG is /Filter /DCTDecode. I don't know if this will really help you but it will give some idea of what you're dealing with. -- BenRG (talk) 18:57, 25 September 2012 (UTC)
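For illustration, the same raw-bytes scan can be automated; here is a crude Python sketch (the sample dictionary below is made up, and a plain byte scan will miss entries inside compressed object streams):

```python
import re

def scan_pdf(pdf_bytes):
    """Crudely pull /Width, /Height and /Filter values out of raw PDF
    bytes, as described above. Compressed object streams will hide
    these entries from a plain byte scan."""
    dims = re.findall(rb"/Width\s+(\d+)\s*/Height\s+(\d+)", pdf_bytes)
    filters = re.findall(rb"/Filter\s*/(\w+)", pdf_bytes)
    return [(int(w), int(h)) for w, h in dims], [f.decode() for f in filters]

# A made-up image dictionary of the kind described above:
sample = b"<< /Width 2550 /Height 3300 /Filter /JBIG2Decode >>"
print(scan_pdf(sample))  # ([(2550, 3300)], ['JBIG2Decode'])
```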

DosBox questions
I have now bought the original Star Control and Star Control II on CD-ROM from eBay and have successfully been able to play them on DosBox. But I have some questions:
 * Do I have to type  every single time I start up DosBox, or can I somehow configure DosBox to do this automatically?
 * The old 1990s-era DOS screen seems rather tiny on modern monitors. To enhance my gaming experience, I'd like it to take up more of the screen. Can I run DosBox in full-screen mode, or at least 2×2 magnified? J I P | Talk 18:22, 25 September 2012 (UTC)


 * The dosbox.conf file contains a section at the end headed [autoexec], with the comment "Lines in this section will be run at startup". Fullscreen mode is reached by pressing alt+enter. (This is all in the readme file, I think.) Card Zero  (talk) 19:26, 25 September 2012 (UTC)
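For example, the end of dosbox.conf might read as follows (the mount path and game directory here are hypothetical; adjust them to wherever the game actually lives):

```ini
[autoexec]
# Lines in this section will be run at startup.
mount c C:\OLDGAMES
c:
cd STARCON2
starcon2
```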
 * Yes I'm sure it's in the docs. The man page also says you can basically do  .  You'll almost certainly want to play with the resolution/aspect ratio in the config, however. ¦ Reisio (talk) 19:30, 25 September 2012 (UTC)
 * Thanks, I managed to get both of these working. Star Control II seems to natively use a resolution of about 640×400; I set DosBox to use a window resolution of 1280×800, which fills up most (but not all) of my desktop and shows the game's screen very well, but a bit pixelated. I have so far got as far as collecting radioactives from Mercury and giving them to the starbase orbiting Earth, and now I have to go fight the Spathi and Ilwrath on the Moon. Now I remember reading in an issue of Pelit that this game was going to be released for the Amiga, possibly for the Atari ST as well. It was never released for either. Am I dreaming this, or were there indeed such plans? J I P | Talk 19:15, 26 September 2012 (UTC)

8-level flash
A few years ago Sandisk started marketing consumer flash devices (USB sticks, SD cards, etc.) with 8-level flash cells (3 bits per cell) instead of the existing MLC (4 levels, 2 bits per cell) for increased density at a cost in speed and data retention. The logo was the number "3" in a circle and it was presented as a high-tech accomplishment. They seem to have phased this out; my guess is that they realized it made no sense to come out and advertise that their product was using cheaper, worse-performing parts than the other guy's.
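The density relationship is just log base 2 of the level count; a quick Python check:

```python
import math

# Levels per cell -> bits per cell: SLC (2 levels), MLC (4 levels),
# the "3-bit" parts (8 levels), and 16-level cells (4 bits).
for levels in (2, 4, 8, 16):
    print(f"{levels:2d} levels -> {int(math.log2(levels))} bits per cell")
```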

Meanwhile I've recently seen a drop in consumer flash prices: below 50 cents/GB on Newegg is now common, but there are more complaints than I'm used to about speed and reliability. I also see USB 3.0 sticks that advertise high speed, but cost about 2x as much per GB, comparable to SSDs.

I'm wondering: is the low cost stuff now using 8-level flash, while the higher cost stuff is using 4-level? (2-level (SLC) seems to have retreated into some ultra expensive "enterprise" devices). Are there known reliability problems? Is there a way to tell either electrically or by physical inspection whether a device uses 8-level chips? I even see some mention online of 16-level cells, and am wondering too if those are in widespread use yet.

Thanks.

69.228.171.70 (talk) 18:42, 25 September 2012 (UTC)


 * It's probably more to do with demand and how much cheaper things can be priced when they're more certain to be sold. ¦ Reisio (talk) 19:18, 25 September 2012 (UTC)

Programming and Video Editing
Viewing videos as a sequence of 2D arrays of pixels, there are a lot of things I'd like to experiment with; unfortunately, the computer does not store them as such. I am looking for a language/program/library that would abstract things to this level. A good analogy to what I want is how Excel works with spreadsheets: you see what appears to be a spreadsheet and can edit it however a spreadsheet can be edited, yet the underlying file structure is exceedingly complex. Of course, if no such thing exists, I'd be willing to deal with something more complicated; I just don't want to spend my time worrying over file structure, since this isn't the aspect I'm interested in (though I don't want to sacrifice the power to do what I want). Thank you for any responses:-)Phoenixia1177 (talk) 19:21, 25 September 2012 (UTC)


 * Do you want to start from existing video or create your own videos from scratch ?


 * A) In the first case, I suggest you use your favorite video capture software to output individual frames. These can then be converted into a PPM ASCII file (not PPM binary), using a convert program like the one supplied with the free ImageMagick utility.  PPM ASCII is human readable, but is absolutely huge (several MB per full-screen frame), so you won't want to load many frames at that size.  You could then load such an image file into a spreadsheet.


 * B) In the second case, your program or spreadsheet can output in PPM ASCII format, then you can use the ImageMagick convert function to stitch together multiple frames to create an animated GIF. The animated GIF format is no longer human readable, and is more compact, being binary data, although nowhere near as compact as real video formats, as it only uses lossless LZW compression within each frame, with no inter-frame compression.  There is one big limitation on animated GIF files, though: they are limited to 256 colors per frame.  This can be usable for computer graphics animations, although you do need to consider this in the programming, since, if you use more than 256 colors, it starts getting patches of the wrong color.  If you click on my name to go to my home page, you can open up a number of animated GIFs (at the bottom) I've created in this way.  You could also convert these into another format, if you like.  In my case, I've used a Fortran program I wrote to create the PPM ASCII files.  While you could do so by manually editing a spreadsheet, the time it would take to create even a tiny animation, in this manner, quickly becomes too much.  StuRat (talk) 19:36, 25 September 2012 (UTC)
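As a sketch of option B (the file name and the 2×2 image below are arbitrary examples), a plain-text PPM frame can be written from nested lists of (r, g, b) tuples like so:

```python
def write_ppm_ascii(path, pixels):
    """Write rows of (r, g, b) tuples as a plain-text PPM (P3) image."""
    height, width = len(pixels), len(pixels[0])
    with open(path, "w") as f:
        f.write(f"P3\n{width} {height}\n255\n")  # magic number, size, max value
        for row in pixels:
            f.write(" ".join(f"{r} {g} {b}" for r, g, b in row) + "\n")

# A 2x2 example frame: red, green / blue, white.
write_ppm_ascii("frame0000.ppm", [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
])
```

ImageMagick's convert tool can then stitch a series of such frames together, e.g. into an animated GIF.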


 * Thank you for your response:-) Although, I think I was being unclear about what I was looking for with my analogy. I meant I want something that allows me to treat videos as abstract objects out of the gate, rather than having to spend a whole bunch of time writing a program to construct one. I would be more than happy with a library for C# or Ruby or whatever, I'm fine with learning a new language. For example, I would like to be able to type something like "video = LoadFile(FILEPATH)/colour = video.get_pixel(FRAME, x, y)/for frames in video/frame.set_pixel(colour) if ..." and so on. Sorry for my initial poor description:-)Phoenixia1177 (talk) 21:18, 25 September 2012 (UTC)


 * You really can't have a fully expanded video stored in memory as ARRAY(FRAME, X, Y), due to the size. Consider a 1000×1000 pixel frame, with 24 bit color (3 bytes).  That's 3 MB per frame.  If you have 30 frames per second, that's 90 MB per second.  A minute is 5.4 GB.  An hour is 324 GB.  A long movie might be a terabyte.  So, the code would need to extract the frames as you worked on them, then recompress it all after each change, as you can't store more than a few frames in memory at a time.  Now, whether code exists that can do all that extraction and recompression behind the scenes, quickly; that I don't know. StuRat (talk) 21:46, 25 September 2012 (UTC)
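The arithmetic above, spelled out:

```python
bytes_per_frame = 1000 * 1000 * 3     # 24-bit color = 3 bytes per pixel
per_second = bytes_per_frame * 30     # at 30 frames per second
per_minute = per_second * 60
per_hour = per_minute * 60

print(bytes_per_frame)  # 3000000      (~3 MB per frame)
print(per_second)       # 90000000     (~90 MB per second)
print(per_minute)       # 5400000000   (~5.4 GB per minute)
print(per_hour)         # 324000000000 (~324 GB per hour)
```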


 * In fact, StuRat, management of that massive volume and rate of data is exactly what video software engineering entails. It is not merely possible: it is necessary to enable videos that record and play on programmable computers.  Nimur (talk) 23:56, 25 September 2012 (UTC)


 * Right, but not by having the entire video in active memory at once. This is my point. StuRat (talk) 00:25, 26 September 2012 (UTC)


 * Phoenixia, you want to learn a multimedia API. For example, GStreamer (available online) is a free and open-source, high-level programming library for multimedia, including video and audio.  It works best on Linux-like platforms, but can be made to work on any operating system.  GStreamer is widespread and has good support for abstracted implementations of major file-formats (container format), video codecs, and audio codecs; so if you learn to use it, you'll be equipping yourself with a powerful and versatile skillset.  If you're a Windows or Mac or iOS programmer, you may find the native video libraries and frameworks more suitable and fully-featured on those platforms.  I also recommend FFmpeg, another free and open-source platform, comprising a command-line video processing tool, and a free/open-source-software library for working with video.  It is less abstracted than GStreamer, so you'll have to worry yourself with more subtle details of codecs and container formats; but FFmpeg is more stable, robust, and fully-featured than GStreamer.  Similarly, native libraries tend to be more robust and stable.  On Windows, you will learn DirectX, particularly its video and multimedia libraries; and on Apple platforms, most video programmers use the AV Foundation framework. Here is an overview of the Core Video and Core Media technologies used by Mac OS X: Media Layer technologies.  Nimur (talk) 23:58, 25 September 2012 (UTC)


 * Thank you:-) GStreamer looks like just the thing I was looking for; it's extra awesome that it has a Ruby binding, that makes my week.Phoenixia1177 (talk) 03:21, 26 September 2012 (UTC)

Understanding Unicode UTF-8 and such
This text left me puzzled: In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. ... UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.

I thought that any Unicode text in Python 3 (or in other place) had to be in bytes somewhere, so it must be UTF-8, UTF-16, UTF-32 or whatever somewhere, am I right? So, what is meant by "there is no such a thing as a Python string encoded in ...". Unicode for me is a way of defining characters, and UTF-x is a way of storing them (sometimes more efficient, sometimes less), therefore, if you have text, you'll need both, won't you? You have your Unicode table telling you that U+0058 is X, and you'll have a two byte series of 0s and 1s somewhere. OsmanRF34 (talk) 23:29, 25 September 2012 (UTC)
 * I think it might be helpful to draw a distinction between Python the language and Python the implementation (aka CPython). The text you quote is referring to Python-the-language, not any particular Python implementation. While CPython does indeed need to encode a string as a sequence of bytes in memory, it does not necessarily have to expose what internal encoding it uses to the Python-the-language interface. Indeed, CPython can use one internal encoding, and Jython another, and IronPython a third, without changing one iota the behavior of a given program written in Python-the-language. Python-the-language treats (or does its best to treat) strings as a series of abstract Unicode characters, behaving independently of whatever internal encoding is used by the interpreter to represent them. -- 205.175.124.30 (talk) 23:57, 25 September 2012 (UTC)


 * (ec)It is bytes somewhere, but not in a programmer visible way (just as Python programmers don't get to see the raw bytes that represent classes or longs or whatever). When you're looking at chars, you don't get to see the bytes, only the char abstraction. To see bytes, you have to encode.  So in Python3, if you had a string (a string of chars) like 'αβγ' and you encode it in utf8 ('αβγ'.encode('utf-8')) the thing you get back is not a string, it's a bytes object ("an immutable sequence of integers"). That's different from Python2, where the equivalent code u'αβγ'.encode('utf-8') does return a string. In python3 there's still a collection of data that represents the UTF-8 encoded data, but it's a collection of ints, so it's not called a string. -- Finlay McWalterჷTalk 00:01, 26 September 2012 (UTC)
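Run in a Python 3 interpreter, the distinction looks like this:

```python
s = 'αβγ'
b = s.encode('utf-8')          # str -> bytes

print(type(b).__name__)        # bytes, not str
print(list(b))                 # [206, 177, 206, 178, 206, 179] - two bytes per letter
print(b.decode('utf-8') == s)  # True: decoding recovers the string
```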


 * And this 'αβγ' isn't encoded in utf-8 (or whatever)? Or it simply doesn't matter, it could be anything, since it's a string object, which is composed of characters? OsmanRF34 (talk) 00:24, 26 September 2012 (UTC)
 * It's probably UTF-8, but it doesn't matter. A portable language tries to hide the details of the underlying machine from the user, so that programs run the same on different machines, and so that the implementor has the freedom to optimize. (maybe it's more efficient to detect text that has a lot of wide characters and store it in a fixed-width encoding instead!) Paul (Stansifer) 01:50, 26 September 2012 (UTC)
 * On some implementations it's UTF-16 and it does matter:

>>> len(u'\U0010FFFF')
2
>>> u'\U0010FFFF'[0], u'\U0010FFFF'[1]
(u'\udbff', u'\udfff')
 * That's CPython 2.7 on Windows, and I'm pretty sure 3.2 on Windows is the same. On the other hand 2.7 and 3.2 on Linux both treat '\U0010FFFF' as one character, so I guess they use UTF-32, which surprises me since it's a huge waste of space. I think PEP 393 (mentioned below) takes effect in 3.3. -- BenRG (talk) 19:41, 26 September 2012 (UTC)
 * Unicode strings are sequences of code points, which if you have to think of them as a machine type, are basically integers in the range 0...1114111. 1114111 is 17*2^16-1 because there are seventeen 16-bit "planes" for historical reasons.  Recent Python 3.x versions use multiple internal representations depending on what characters actually occur in the string.  See PEP 393 for details. 69.228.171.70 (talk) 07:31, 26 September 2012 (UTC)
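A few checks of those numbers in Python 3 (the len result assumes a PEP 393 build, i.e. Python 3.3 or later):

```python
print(ord('X'))                    # 88, i.e. U+0058
print(0x10FFFF == 17 * 2**16 - 1)  # True: 17 planes of 2**16 code points
print(len('\U0010FFFF'))           # 1 on Python 3.3+ (PEP 393)
```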