(original bug report)
There’s a bug in Python 2’s C API function PyUnicode_FromFormatV() (and indirectly in its wrapper PyUnicode_FromFormat()) in Objects/unicodeobject.c which can lead to overflowing both a stack-based and a heap-based buffer.
This happens because the code that ought to compute the size of two buffers, fails to execute. This is the size computation routine:
760 /* step 3: figure out how large a buffer we need */ 761 for (f = format; *f; f++) { 762 if (*f == '%') { 763 const char* p = f; 764 width = 0; 765 while (isdigit((unsigned)*f)) 766 width = (width*10) + *f++ - '0'; 767 while (*++f && *f != '%' && !isalpha((unsigned)*f))
The expressions on line 762 and line 765 can never both be true; if *f == ‘%’, isdigit((unsigned)%f) can not evaluate as true. Even if it would execute, there’s still code that computes the precision component of the format specifier. Later on, in the loop that actually processes the individual format specifiers, the code is done right:
917 const char* p = f++; 918 int longflag = 0; 919 int size_tflag = 0; 920 zeropad = (*f == '0'); 921 /* parse the width.precision part */ 922 width = 0; 923 while (isdigit((unsigned)*f)) 924 width = (width*10) + *f++ - '0'; 925 precision = 0; 926 if (*f == '.') { 927 f++; 928 while (isdigit((unsigned)*f)) 929 precision = (precision*10) + *f++ - '0'; 930 }
The actual, current bug comes down to this: both types of padding parameters (width and precision) in the format specifier are ignored when computing the size of the buffers designated to hold sprintf’s output, while in the actual calls to sprintf, these are accounted for.
947 case 'd': 948 makefmt(fmt, longflag, size_tflag, zeropad, width, precision, 'd'); 949 if (longflag) 950 sprintf(realbuffer, fmt, va_arg(vargs, long)); 951 else if (size_tflag) 952 sprintf(realbuffer, fmt, va_arg(vargs, Py_ssize_t)); 953 else 954 sprintf(realbuffer, fmt, va_arg(vargs, int)); 955 appendstring(realbuffer); 956 break; 957 case 'u': 958 makefmt(fmt, longflag, size_tflag, zeropad, width, precision, 'u'); 959 if (longflag) 960 sprintf(realbuffer, fmt, va_arg(vargs, unsigned long)); 961 else if (size_tflag) 962 sprintf(realbuffer, fmt, va_arg(vargs, size_t)); 963 else 964 sprintf(realbuffer, fmt, va_arg(vargs, unsigned int)); 965 appendstring(realbuffer); 966 break; 967 case 'i': 968 makefmt(fmt, 0, 0, zeropad, width, precision, 'i'); 969 sprintf(realbuffer, fmt, va_arg(vargs, int)); 970 appendstring(realbuffer); 971 break; 972 case 'x': 973 makefmt(fmt, 0, 0, zeropad, width, precision, 'x'); 974 sprintf(realbuffer, fmt, va_arg(vargs, int)); 975 appendstring(realbuffer); 976 break;
makefmt constructs a format specifier string based on width, precision and other parameters. Subsequently, this format specifier string is supplied to sprintf which will write a padded string to ‘realbuffer’ as requested. Since realbuffer points to the stack-based ‘char buffer[21]’, this will cause a stack-based overwrite:
894 if (abuffersize > 20) { 895 abuffer = PyObject_Malloc(abuffersize); 896 if (!abuffer) { 897 PyErr_NoMemory(); 898 goto fail; 899 } 900 realbuffer = abuffer; 901 } 902 else 903 realbuffer = buffer;
The ‘abuffersize > 20’ condition can never be true, since ‘abuffersize’ is defined earlier on as:
808 if (width < 20) 809 width = 20; 810 n += width; 811 if (abuffersize < width) 812 abuffersize = width;
which will always cause abuffersize to be 20, since width is always 0, since the code that ought to compute ‘width’ never runs.
After the sprintf, which causes the stack-based buffer overflow, there’s this:
955 appendstring(realbuffer);
693 #define appendstring(string) 694 do { 695 for (copy = string;*copy; copy++) { 696 *s++ = (unsigned char)*copy; 697 } 698 } while (0)
‘s’ is space allocated based on ‘n’:
908 string = PyUnicode_FromUnicode(NULL, n); 909 if (!string) 910 goto fail; 911 912 s = PyUnicode_AS_UNICODE(string); 913 callresult = callresults;
and ‘n’ never accounts for any width or precision parameters either, so ‘s’ is always too small if width and optionally precision parameters are present. Thus, appendstring(realbuffer); results in a heap-based overflow.
Here’s a patch:
diff -r baa5258bef22 Objects/unicodeobject.c --- a/Objects/unicodeobject.c Sat Dec 13 16:06:19 2014 -0500 +++ b/Objects/unicodeobject.c Sun Dec 14 22:14:39 2014 +0100 @@ -760,12 +760,18 @@ /* step 3: figure out how large a buffer we need */ for (f = format; *f; f++) { if (*f == '%') { - const char* p = f; + const char* p = f++; width = 0; + precision = 0; while (isdigit((unsigned)*f)) width = (width*10) + *f++ - '0'; - while (*++f && *f != '%' && !isalpha((unsigned)*f)) - ; + if (*f == '.') { + f++; + while (isdigit((unsigned)*f)) + precision = (precision*10) + *f++ - '0'; + } + while (*f && *f != '%' && !isalpha((unsigned)*f)) + f++; /* skip the 'l' or 'z' in {%ld, %zd, %lu, %zu} since * they don't affect the amount of space we reserve. @@ -805,11 +811,9 @@ This isn't enough for octal. If a width is specified we need more (which we allocate later). */ - if (width < 20) - width = 20; - n += width; - if (abuffersize < width) - abuffersize = width; + n += (width + precision) < 20 ? 20 : (width + precision); + if (abuffersize < (width + precision) ) + abuffersize = width + precision; break; case 's': {
Aside from general memory corruption errors caused by this bug, there’s the additional danger of code execution in instances where a user or entity can control the ‘format’ parameter of the PyUnicode_FromFormat()/PyUnicode_FromFormatV functions, and of sidetracking audits of code that uses these functions because of the unexpected behavior it can entail.