Wikipedia:Reference desk/Archives/Mathematics/2020 September 22

= September 22 =

Multiple set data compression
I have no clue what to call what I have been thinking about. Therefore, I am going to describe it. I know it is set theory, which I know very little about. I want to know what the proper terminology is so I know what books to check out when I go to the library.

In this scenario, I have a population. Make it easy. I have 5,000 people. All of them have some attribute. Let's say it is tested for covid. So, every person is flagged as covid or not covid. The population has other attributes that I can use as a filter. Again, make it easy with binary choices. There is male/female. There is black/white. There is adult/child. These filters are not correlated or evenly distributed. So, I have many mini-sets: male+black+adult, male+black+child, male+white+adult, male+white+child, etc... I want to pick and choose filters at will. I might want to look at all female. I might want to look at male+black. I might want to look at female+child. There is another twist. If I had more options, such as male/female/non-binary, I might want to look at male+non-binary. Finally, the problem: I am only allowed to store the count of covid and count of non-covid for each filter selection. I am not allowed to keep the actual set of people. So, my data is only allowed to contain: "all,2874,2126", "female,1468,1082", "female+black,421,542", etc...

Is it possible to store only the main divisions (male/female, black/white, adult/child) and somehow calculate the multiple selections? Can I somehow overlap the male/female and black/white people counts and calculate male+black? I personally don't think so, but I thought it might be a matrix problem. I know little about set theory and less about matrix stuff. If I have failed miserably to describe this problem, please let me know. I'm trying to cram my overall thoughts into a well-defined example. 97.82.165.112 (talk) 12:29, 22 September 2020 (UTC)
 * In the general case, this is clearly impossible. For instance, if I know the set contains 2 people, 1/2 has covid and 1/2 is an adult, the set composition could be either (covid adult, no-covid child) or (no-covid adult, covid child). I doubt there is much you can do except in very special cases where one of the filter is trivial (i.e. there is no child or no covid in the set, in which case combining the associated filter with another is easy). Tigraan Click here to contact me 13:37, 22 September 2020 (UTC)


 * I personally assume it is not possible, but I have been repeatedly surprised by what it is possible. 97.82.165.112 (talk) 13:48, 22 September 2020 (UTC)


 * If the assignment to classes is truly independent for different divisions, you can estimate the population size of an intersection of classes. If a fraction $p_{m}$ is a male, and a fraction $p_{c}$ is a child, then the expected fraction $p_{m∧c}$ of male children is equal to the product $p_{m}p_{c}$. Using population sizes of classes, where the total population size is $N$ and the size of some class $X$ is denoted by $N_{X}$, the corresponding fraction of individuals assigned to class $X$ equals $N_{X}/N$, so for the expected size of the intersection class $X_{m∧c}$ we find $N_{m∧c} = N_{m}N_{c}/N$.
 * Now assume $N = 2000$, $N_{m} = 1000$, and $N_{c} = 600$. Then the expected value of $N_{m∧c} = 300$. If all males have a fever, then all male children also have a fever. If 900 out of 1000 males have a fever, and there are 300 male children as estimated, then at least 200 also have a fever. So you can compute some bounds for some class intersections. In most cases they will be so weak as to be useless.  --Lambiam 20:36, 22 September 2020 (UTC)


 * I think this is something like an OLAP cube, or certainly OLAP could be used to implement it. This frequently comes up with large databases. Let's say you have 1,000,000,000 records of purchases over the last year and you want to know, for marketing purposes, what percentage of people who bought refried beans were women. If there is enough need for this type of analysis (the A in OLAP) then it's worth while to process the raw transaction data into groups so it can be easily filtered by product type, customer age, gender, location etc. Typically the raw transaction data is optimized for fast entry and retrieval of individual records, aka OLTP, but this method of storage is slow for analytical processing. --RDBury (talk) 02:35, 23 September 2020 (UTC)

How many people and how many attributes do you really have? DBMS and search engines are quite fast at computing intersections of indexed sets. On the other hand, if your data is really enormous and you're willing to accept approximate answers, you could run your query to get an exact answer for a random subset, and scale from that. What is the actual application you're trying to solve, if you're able to say? That can probably get more helpful answers than trying to abstract out what you think are the important parts of the problem. Identifying the important parts often takes considerable experience. 2601:648:8202:96B0:0:0:0:DDAF (talk) 09:12, 24 September 2020 (UTC)
 * There is no project. I'm retired. I just spend a lot of time reading journals at the library. I've had the idea in my head for a while that it should be possible to create a nearly perfect abstract of a large data set. I read multiple textbooks about creating data cubes. But, the examples they give don't cover huge data sets with thousands of degrees of freedom. They do small data sets with 2 or 3 degrees of freedom. If I remember correctly (and I rarely do these days), the idea was triggered by studying eigenvectors and singular value decomposition one day. You can factor a large matrices. That made me think about "factoring" a large data set such that it is trivial to rebuild the original data almost perfectly. The example I gave above is a tiny data set with very few degrees of freedom. I'm really just looking for the next topic. I try to keep studying because when I have to do work and I don't study for a few days, I find that I have no clue what day it is, where I'm at, or what I'm supposed to be doing. 97.82.165.112 (talk) 13:46, 25 September 2020 (UTC)
 * Just after I posted this, I remembered where the idea came from. I was reading about compressing 3D data to 2D. Specifically, the theory that with two dimensional data the surface of a sphere, you can calculate the data inside the sphere. Then, that concept can be translated to other dimensions such as compressing 2 dimensions to 1 or 4 dimensions to 3. That all came from a theory that the entire universe can be created by a projection into two dimensions. That led to studying svd, which required eigenvectors, which then connected to large databases. 97.82.165.112 (talk) 13:51, 25 September 2020 (UTC)
 * Re the idea of calculating data inside for the interior of a sphere from surface data, are you thinking of the holographic principle? It is based on untested ideas and is itself untested. In general this is impossible; try calculating the interior temperature of Earth from surface data. This has the form of a boundary value problem, which could be solved using Laplace's equation, but only for a stationary situation. --Lambiam 16:16, 25 September 2020 (UTC)
 * You're talking about several things which are related at some abstract level, but afaik these abstract connections have yet to be utilized on a practical level. The holographic principle is one that I hadn't though of. BVPs do reduce the number of dimensions of data you need to store, but they require a differential equation and this won't be the case in general.


 * You can do data cubes for small databases as well as large, in fact most spreadsheet programs have a Pivot Table function which is the same basic idea. But preprosessing the data to improve the performance of this kind of analysis is only necessary with large data sets; as pointed out above, modern database management systems are usually fast enough to produce an answer without that step when the size of the data is relatively small. The examples given in textbooks for this kind of thing are often trivially small and can be misleading. But one generally can't include a realistically sized data set in a textbook.


 * In 2 dimensions you can have raster based graphics and vector base graphics. In the first case you store information about each individual pixel while the second is basically a set of instructions for drawing a picture, for example draw circle with a given color with center P and radius r. I think vector based graphics may be an example of what you're talking about with reducing the dimensions of data, an image in this case. But the data in vector based image is still two dimensional, though might say it consists of mapping one-dimensional objects (line segments) into a two dimensional space. Not my area of expertise but I imagine 3D modeling uses a similar process except that it maps triangles to three dimensional space. Raster and vector graphics both have applications, but generally raster graphics are not converted to vector graphics for data compression. Instead, some sort of transform is applied, usually a Discrete cosine transform, data that is unlikely to be perceived by a human is thrown away, and the rest is compressed using more conventional techniques. Internet streaming uses this technique and a small increase in efficiency would result in a great deal of savings, so streaming services are highly motivated to use the most efficient compression methods available.


 * You were also talking about sparse matrices but afaik these use yet different techniques to reduce the amount of stored information.


 * The concept of database normalization seems relevant here as well, specifically Fourth normal form. Tables which are not in Fourth normal form do occasionally appear in complex databases, but it's not necessarily a good idea to fully normalize them. Imo this is mainly of theoretical interest since the standard normalization techniques almost always produce tables which are already in Fourth normal form, if not then it's probably a sign that there are problems with your specifications, not the design itself. --RDBury (talk) 16:41, 25 September 2020 (UTC)
 * Thank you. That is helpful. 97.82.165.112 (talk) 18:43, 25 September 2020 (UTC)